* [RFC PATCH-tip 0/6] locking/rwsem: Enable reader optimistic spinning
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

This patchset adds an option to enable more aggressive optimistic
spinning by both readers and writers waiting on a writer- or
reader-owned lock. Spinning on a writer is done by looking at the
on_cpu flag of the lock owner. Spinning on readers, on the other
hand, is count-based, as there is no easy way to figure out if all
the readers are running. The spinner stops spinning once the count
reaches 0. Because of that, spinning on readers may hurt performance
in some cases. So this feature is optional and has to be explicitly
enabled for those rwsems whose readers are unlikely to go to sleep.
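
In outline, the spinning loop after this series looks roughly like
the following sketch (condensed from patches 2 and 5; the OSQ
handling is omitted, and try_lock_unqueued() stands in for the
separate write/read lock stealing helpers added later):

	loopcnt = sem->rspin_threshold_shift
		? (1 << sem->rspin_threshold_shift) : 0;

	while ((can_spin = rwsem_spin_on_owner(sem)) || loopcnt) {
		if (try_lock_unqueued(sem))	/* stole the lock */
			return true;
		if (!can_spin && loopcnt)	/* reader-owned: count down */
			loopcnt--;
		if (need_resched() || rt_task(current))
			break;
		cpu_relax_lowlatency();
	}
	return false;				/* give up and sleep */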

Patch 1 shortens the window after a failed read-lock attempt during
which writer lock stealing is disabled by the active read count. It
can improve rwsem performance for contended locks and is independent
of the rest of the patchset.

Patches 2-5 lay the groundwork for enabling reader optimistic spinning.

Patch 6 enables reader optimistic spinning in XFS for inodes that
come from a DAX-enabled mount point on top of persistent memory, as
the CPUs can do I/O directly without waiting or sleeping.

Waiman Long (6):
  locking/rwsem: Stop active read lock ASAP
  locking/rwsem: Enable optional count-based spinning on reader
  locking/rwsem: move down rwsem_down_read_failed function
  locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
  locking/rwsem: Enable spinning readers
  xfs: Enable reader optimistic spinning for DAX inodes

 arch/alpha/include/asm/rwsem.h |    7 +-
 arch/ia64/include/asm/rwsem.h  |    6 +-
 arch/s390/include/asm/rwsem.h  |    6 +-
 arch/x86/include/asm/rwsem.h   |   13 ++-
 fs/xfs/xfs_icache.c            |    9 ++
 include/asm-generic/rwsem.h    |    9 +-
 include/linux/rwsem.h          |   21 ++++-
 kernel/locking/rwsem-xadd.c    |  212 ++++++++++++++++++++++++++--------------
 8 files changed, 192 insertions(+), 91 deletions(-)


* [RFC PATCH-tip 1/6] locking/rwsem: Stop active read lock ASAP
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

Currently, when down_read() fails, the active read locking isn't
undone until rwsem_down_read_failed() grabs the wait_lock. If the
wait_lock is contended, it may take a while to get the lock. During
that period, writer lock stealing is disabled because of the active
read lock.

This patch releases the active read lock as soon as possible so that
writer lock stealing can happen sooner. The only downside is when the
reader is the first one in the wait queue: the waiting bias can no
longer be folded into the same atomic operation that removes the read
bias, so a second atomic operation is needed to update the count.

On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
fio tests with multithreaded randrw and randwrite workloads on the
same file on an XFS partition on top of an NVDIMM with DAX were run.
The aggregated bandwidths before and after the patch were as follows:

  Test      BW before patch     BW after patch  % change
  ----      ---------------     --------------  --------
  randrw        1210 MB/s          1352 MB/s      +12%
  randwrite     1622 MB/s          1710 MB/s      +5.4%

The write-only microbenchmark also showed improvement because some
read locking is done by the XFS code.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/rwsem-xadd.c |   19 ++++++++++++++-----
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2031281..29027c6 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -230,11 +230,18 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
 __visible
 struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 {
-	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+	long count, adjustment = 0;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
 	WAKE_Q(wake_q);
 
+	/*
+	 * Undo read bias from down_read operation, stop active locking.
+	 * Doing that after taking the wait_lock may block writer lock
+	 * stealing for too long.
+	 */
+	atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
 	waiter.type = RWSEM_WAITING_FOR_READ;
@@ -244,8 +251,11 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 		adjustment += RWSEM_WAITING_BIAS;
 	list_add_tail(&waiter.list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
+	/* we're now waiting on the lock */
+	if (adjustment)
+		count = atomic_long_add_return(adjustment, &sem->count);
+	else
+		count = atomic_long_read(&sem->count);
 
 	/* If there are no active locks, wake the front queued process(es).
 	 *
@@ -253,8 +263,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	 * wake our own waiter to join the existing active readers !
 	 */
 	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS &&
-	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
+	    (count > RWSEM_WAITING_BIAS && adjustment))
 		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
-- 
1.7.1


* [RFC PATCH-tip 2/6] locking/rwsem: Enable optional count-based spinning on reader
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where the
readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a way for kernel code to designate specific
rwsems to be more aggressive in terms of optimistic spinning:
writers will continue to spin for an additional count-based period
to see if they can get the lock before sleeping. This aggressive
spinning mode should only be used on rwsems where the readers are
unlikely to go to sleep.

One can use the following functions to designate rwsems that can
benefit from more aggressive spinning:

void __rwsem_set_rspin_threshold(struct rw_semaphore *sem, int shift)
void rwsem_set_rspin_threshold(struct rw_semaphore *sem)

The latter uses the default 4K (shift = 12) iteration count.
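
For example, a subsystem embedding an rwsem whose readers are known
not to block could opt in at initialization time (a hypothetical
caller, purely for illustration):

	struct foo {
		struct rw_semaphore lock;
	};

	static void foo_init(struct foo *f)
	{
		init_rwsem(&f->lock);
		/*
		 * Writers may now spin up to 1 << 12 extra iterations
		 * on a reader-owned lock before going to sleep.
		 */
		rwsem_set_rspin_threshold(&f->lock);
	}

Patch 6 does this for the XFS inode locks on DAX mount points.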

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/rwsem.h       |   21 ++++++++++++++++++++-
 kernel/locking/rwsem-xadd.c |   28 +++++++++++++++++++---------
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index dd1d142..1c5f6ff 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -32,6 +32,8 @@ struct rw_semaphore {
 	raw_spinlock_t wait_lock;
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
+	int rspin_threshold_shift;	/* reader spinning threshold shift */
+
 	/*
 	 * Write owner. Used as a speculative check to see
 	 * if the owner is running on the cpu.
@@ -70,9 +72,26 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #endif
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL, \
+		.rspin_threshold_shift = 0
+
+#define	RWSEM_RSPIN_THRESHOLD_SHIFT_DEFAULT	12
+static inline void
+__rwsem_set_rspin_threshold(struct rw_semaphore *sem, int shift)
+{
+	sem->rspin_threshold_shift = shift;
+}
+
+static inline void rwsem_set_rspin_threshold(struct rw_semaphore *sem)
+{
+	__rwsem_set_rspin_threshold(sem, RWSEM_RSPIN_THRESHOLD_SHIFT_DEFAULT);
+}
 #else
 #define __RWSEM_OPT_INIT(lockname)
+
+static inline void
+__rwsem_set_rspin_threshold(struct rw_semaphore *sem, int shift) {}
+static inline void rwsem_set_rspin_threshold(struct rw_semaphore *sem) {}
 #endif
 
 #define __RWSEM_INITIALIZER(name)				\
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 29027c6..9703f4a 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -85,6 +85,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	INIT_LIST_HEAD(&sem->wait_list);
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	sem->owner = NULL;
+	sem->rspin_threshold_shift = 0;
 	osq_lock_init(&sem->osq);
 #endif
 }
@@ -347,9 +348,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	owner = READ_ONCE(sem->owner);
 	if (!rwsem_owner_is_writer(owner)) {
 		/*
-		 * Don't spin if the rwsem is readers owned.
+		 * Don't spin if the rwsem is readers owned and the
+		 * reader spinning threshold isn't set.
 		 */
-		ret = !rwsem_owner_is_reader(owner);
+		ret = !rwsem_owner_is_reader(owner) ||
+		       sem->rspin_threshold_shift;
 		goto done;
 	}
 
@@ -398,7 +401,8 @@ out:
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
-	bool taken = false;
+	bool taken = false, can_spin;
+	int loopcnt;
 
 	preempt_disable();
 
@@ -409,6 +413,9 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	if (!osq_lock(&sem->osq))
 		goto done;
 
+	loopcnt = sem->rspin_threshold_shift
+		? (1 << sem->rspin_threshold_shift) : 0;
+
 	/*
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
@@ -416,7 +423,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while ((can_spin = rwsem_spin_on_owner(sem)) || loopcnt) {
 		/*
 		 * Try to acquire the lock
 		 */
@@ -425,13 +432,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 			break;
 		}
 
+		if (!can_spin && loopcnt)
+			loopcnt--;
+
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * The need_resched() check in rwsem_spin_on_owner() won't
+		 * break the loop anymore. So we need to check this in
+		 * the outer loop. If we're an RT task that will live-lock
+		 * because we won't let the owner complete.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
+		if (need_resched() || rt_task(current))
 			break;
 
 		/*
-- 
1.7.1


* [RFC PATCH-tip 3/6] locking/rwsem: move down rwsem_down_read_failed function
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

Move the rwsem_down_read_failed() function down below the optimistic
spinning section in preparation for enabling optimistic spinning for
readers. There is no functional change.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/rwsem-xadd.c |  116 +++++++++++++++++++++---------------------
 1 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 9703f4a..400e594 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -226,64 +226,6 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
 }
 
 /*
- * Wait for the read lock to be granted
- */
-__visible
-struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	long count, adjustment = 0;
-	struct rwsem_waiter waiter;
-	struct task_struct *tsk = current;
-	WAKE_Q(wake_q);
-
-	/*
-	 * Undo read bias from down_read operation, stop active locking.
-	 * Doing that after taking the wait_lock may block writer lock
-	 * stealing for too long.
-	 */
-	atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
-
-	/* set up my own style of waitqueue */
-	waiter.task = tsk;
-	waiter.type = RWSEM_WAITING_FOR_READ;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-	if (list_empty(&sem->wait_list))
-		adjustment += RWSEM_WAITING_BIAS;
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock */
-	if (adjustment)
-		count = atomic_long_add_return(adjustment, &sem->count);
-	else
-		count = atomic_long_read(&sem->count);
-
-	/* If there are no active locks, wake the front queued process(es).
-	 *
-	 * If there are no writers and we are first in the queue,
-	 * wake our own waiter to join the existing active readers !
-	 */
-	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS && adjustment))
-		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-
-	/* wait to be given the lock */
-	while (true) {
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (!waiter.task)
-			break;
-		schedule();
-	}
-
-	__set_task_state(tsk, TASK_RUNNING);
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-/*
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
@@ -479,6 +421,64 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
 #endif
 
 /*
+ * Wait for the read lock to be granted
+ */
+__visible
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+	long count, adjustment = 0;
+	struct rwsem_waiter waiter;
+	struct task_struct *tsk = current;
+	WAKE_Q(wake_q);
+
+	/*
+	 * Undo read bias from down_read operation, stop active locking.
+	 * Doing that after taking the wait_lock may block writer lock
+	 * stealing for too long.
+	 */
+	atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+
+	/* set up my own style of waitqueue */
+	waiter.task = tsk;
+	waiter.type = RWSEM_WAITING_FOR_READ;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+	if (list_empty(&sem->wait_list))
+		adjustment += RWSEM_WAITING_BIAS;
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock */
+	if (adjustment)
+		count = atomic_long_add_return(adjustment, &sem->count);
+	else
+		count = atomic_long_read(&sem->count);
+
+	/* If there are no active locks, wake the front queued process(es).
+	 *
+	 * If there are no writers and we are first in the queue,
+	 * wake our own waiter to join the existing active readers !
+	 */
+	if (count == RWSEM_WAITING_BIAS ||
+	    (count > RWSEM_WAITING_BIAS && adjustment))
+		sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+
+	/* wait to be given the lock */
+	while (true) {
+		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		if (!waiter.task)
+			break;
+		schedule();
+	}
+
+	__set_task_state(tsk, TASK_RUNNING);
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+/*
  * Wait until we successfully acquire the write lock
  */
 static inline struct rw_semaphore *
-- 
1.7.1


* [RFC PATCH-tip 4/6] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

When the count value is between RWSEM_WAITING_BIAS and 0, there are
two possibilities: either a writer is present and there are no
waiters, or there are waiters and readers but no writer. There is no
easy way to know which is the case unless the wait_lock is taken.

This patch changes RWSEM_WAITING_BIAS from 0xffff0000 (32-bit) or
0xffffffff00000000 (64-bit) to 0xc0000000 (32-bit) or
0xc000000000000000 (64-bit). By doing so, we will be able to
determine if writers are present by looking at the count value alone,
without taking the wait_lock.

This patch has the effect of halving the maximum number of writers
that can attempt to take the write lock simultaneously. However,
even the reduced maximum of about 16k (32-bit) or 1G (64-bit) should
be more than enough for the foreseeable future.

With that change, the following identity no longer holds:

  RWSEM_ACTIVE_WRITE_BIAS = RWSEM_WAITING_BIAS + RWSEM_ACTIVE_READ_BIAS
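
For example, with the 64-bit values:

  RWSEM_WAITING_BIAS                          = 0xc000000000000000
  RWSEM_ACTIVE_WRITE_BIAS                     = 0xffffffff00000001
  RWSEM_WAITING_BIAS + RWSEM_ACTIVE_READ_BIAS = 0xc000000000000001

One place that relied on the old identity is __downgrade_write(),
which used to add -RWSEM_WAITING_BIAS to turn a write lock into a
read lock; that only worked because ACTIVE_READ_BIAS -
ACTIVE_WRITE_BIAS happened to equal -WAITING_BIAS. This is why the
arch-specific __downgrade_write() hunks below switch to adding
(-RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS) explicitly.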

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/alpha/include/asm/rwsem.h |    7 ++++---
 arch/ia64/include/asm/rwsem.h  |    6 +++---
 arch/s390/include/asm/rwsem.h  |    6 +++---
 arch/x86/include/asm/rwsem.h   |   13 ++++++++-----
 include/asm-generic/rwsem.h    |    9 +++++----
 kernel/locking/rwsem-xadd.c    |   32 ++++++++++++++++++++++++--------
 6 files changed, 47 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
index 77873d0..0f08fad 100644
--- a/arch/alpha/include/asm/rwsem.h
+++ b/arch/alpha/include/asm/rwsem.h
@@ -17,9 +17,9 @@
 #define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
 #define RWSEM_ACTIVE_BIAS		0x0000000000000001L
 #define RWSEM_ACTIVE_MASK		0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS		(-0x0000000100000000L)
+#define RWSEM_WAITING_BIAS		0xc000000000000000L
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_ACTIVE_WRITE_BIAS		(-RWSEM_ACTIVE_MASK)
 
 static inline void __down_read(struct rw_semaphore *sem)
 {
@@ -185,7 +185,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	"2:	br	1b\n"
 	".previous"
 	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (-RWSEM_WAITING_BIAS), "m" (sem->count) : "memory");
+	:"Ir" (-RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS),
+	 "m" (sem->count) : "memory");
 #endif
 	if (unlikely(oldcount < 0))
 		rwsem_downgrade_wake(sem);
diff --git a/arch/ia64/include/asm/rwsem.h b/arch/ia64/include/asm/rwsem.h
index 8fa98dd..db3693b 100644
--- a/arch/ia64/include/asm/rwsem.h
+++ b/arch/ia64/include/asm/rwsem.h
@@ -30,9 +30,9 @@
 #define RWSEM_UNLOCKED_VALUE		__IA64_UL_CONST(0x0000000000000000)
 #define RWSEM_ACTIVE_BIAS		(1L)
 #define RWSEM_ACTIVE_MASK		(0xffffffffL)
-#define RWSEM_WAITING_BIAS		(-0x100000000L)
+#define RWSEM_WAITING_BIAS		(-(1L << 62))
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_ACTIVE_WRITE_BIAS		(-RWSEM_ACTIVE_MASK)
 
 /*
  * lock for reading
@@ -144,7 +144,7 @@ __downgrade_write (struct rw_semaphore *sem)
 
 	do {
 		old = atomic_long_read(&sem->count);
-		new = old - RWSEM_WAITING_BIAS;
+		new = old - RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS;
 	} while (atomic_long_cmpxchg_release(&sem->count, old, new) != old);
 
 	if (old < 0)
diff --git a/arch/s390/include/asm/rwsem.h b/arch/s390/include/asm/rwsem.h
index 597e7e9..e7f50f5 100644
--- a/arch/s390/include/asm/rwsem.h
+++ b/arch/s390/include/asm/rwsem.h
@@ -42,9 +42,9 @@
 #define RWSEM_UNLOCKED_VALUE	0x0000000000000000L
 #define RWSEM_ACTIVE_BIAS	0x0000000000000001L
 #define RWSEM_ACTIVE_MASK	0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS	(-0x0000000100000000L)
+#define RWSEM_WAITING_BIAS	0xc000000000000000L
 #define RWSEM_ACTIVE_READ_BIAS	RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_ACTIVE_WRITE_BIAS	(-RWSEM_ACTIVE_MASK)
 
 /*
  * lock for reading
@@ -193,7 +193,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 {
 	signed long old, new, tmp;
 
-	tmp = -RWSEM_WAITING_BIAS;
+	tmp = -RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS;
 	asm volatile(
 		"	lg	%0,%2\n"
 		"0:	lgr	%1,%0\n"
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 089ced4..f1fb09f 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -41,21 +41,23 @@
 
 /*
  * The bias values and the counter type limits the number of
- * potential readers/writers to 32767 for 32 bits and 2147483647
- * for 64 bits.
+ * potential writers to 16383 for 32 bits and 1073741823 for 64 bits.
+ * The combined readers and writers can go up to 65534 for 32-bits and
+ * 4294967294 for 64-bits.
  */
 
 #ifdef CONFIG_X86_64
 # define RWSEM_ACTIVE_MASK		0xffffffffL
+# define RWSEM_WAITING_BIAS		(-(1L << 62))
 #else
 # define RWSEM_ACTIVE_MASK		0x0000ffffL
+# define RWSEM_WAITING_BIAS		(-(1L << 30))
 #endif
 
 #define RWSEM_UNLOCKED_VALUE		0x00000000L
 #define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_ACTIVE_WRITE_BIAS		(-RWSEM_ACTIVE_MASK)
 
 /*
  * lock for reading
@@ -209,7 +211,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 		     "1:\n\t"
 		     "# ending __downgrade_write\n"
 		     : "+m" (sem->count)
-		     : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+		     : "a" (sem), "er" (-RWSEM_ACTIVE_WRITE_BIAS +
+					 RWSEM_ACTIVE_READ_BIAS)
 		     : "memory", "cc");
 }
 
diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index 5be122e..655e405 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -18,15 +18,16 @@
  */
 #ifdef CONFIG_64BIT
 # define RWSEM_ACTIVE_MASK		0xffffffffL
+# define RWSEM_WAITING_BIAS		(-(1L << 62))
 #else
 # define RWSEM_ACTIVE_MASK		0x0000ffffL
+# define RWSEM_WAITING_BIAS		(-(1L << 30))
 #endif
 
 #define RWSEM_UNLOCKED_VALUE		0x00000000L
 #define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_ACTIVE_WRITE_BIAS		(-RWSEM_ACTIVE_MASK)
 
 /*
  * lock for reading
@@ -120,8 +121,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS,
-				     (atomic_long_t *)&sem->count);
+	tmp = atomic_long_add_return_release(-RWSEM_ACTIVE_WRITE_BIAS +
+			RWSEM_ACTIVE_READ_BIAS, (atomic_long_t *)&sem->count);
 	if (tmp < 0)
 		rwsem_downgrade_wake(sem);
 }
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 400e594..689a138 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -29,28 +29,30 @@
  * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
  *		attempting to read lock or write lock.
  *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
+ * 0xc000000X	(1) X readers active or attempting lock, with waiters for lock
  *		    X = #active readers + # readers attempting lock
  *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
+ *
+ * 0xffff000X	(1) 1 writer attempting lock, no waiters for lock
  *		    X-1 = #active readers + #readers attempting lock
  *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
+ *		(2) 1 writer active, no waiters for lock
  *		    X-1 = #active readers + #readers attempting lock
  *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
  *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
+ * 0xc0000001	(1) 1 reader active or attempting lock, waiters for lock
  *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
+ *
+ * 0xffff0001	(1) 1 writer active or attempting lock, no waiters for lock
  *		    (ACTIVE_WRITE_BIAS)
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
+ * 0xc0000000	(1) There are writers or readers queued but none active
  *		    or in the process of attempting lock.
  *		    (WAITING_BIAS)
  *		Note: writer can attempt to steal lock for this count by adding
  *		ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
  *
- * 0xfffe0001	(1) 1 writer active, or attempting lock. Waiters on queue.
+ * 0xbfff0001	(1) 1 writer active, or attempting lock. Waiters on queue.
  *		    (ACTIVE_WRITE_BIAS + WAITING_BIAS)
  *
  * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
@@ -62,9 +64,23 @@
  *	 checking the count becomes ACTIVE_WRITE_BIAS for successful lock
  *	 acquisition (i.e. nobody else has lock or attempts lock).  If
  *	 unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- *	 are only waiters but none active (5th case above), and attempt to
+ *	 are only waiters but none active (7th case above), and attempt to
  *	 steal the lock.
  *
+ *	 We can infer the reader/writer/waiter state of the lock by looking
+ *	 at the count value:
+ *	 (1) count > 0
+ *	     Only readers are present.
+ *	 (2) WAITING_BIAS - ACTIVE_WRITE_BIAS < count < 0
+ *	     Have writers, maybe readers, but no waiter
+ *	 (3) WAITING_BIAS < count <= WAITING_BIAS - ACTIVE_WRITE_BIAS
+ *	     Have readers and waiters, but no writer
+ *	 (4) count < WAITING_BIAS
+ *	     Have writers and waiters, maybe readers
+ *
+ *	 IOW, writers are present when
+ *	 (1) count < WAITING_BIAS, or
+ *	 (2) WAITING_BIAS - ACTIVE_WRITE_BIAS < count < 0
  */
 
 /*
-- 
1.7.1


* [RFC PATCH-tip 5/6] locking/rwsem: Enable spinning readers
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

This patch enables readers to optimistically spin when the
rspin_threshold_shift of the rwsem is non-zero. That threshold should
only be set when the lock owners of the rwsem are unlikely to go to
sleep. Otherwise, enabling reader spinning may make performance worse
in some cases.
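
With the 64-bit bias values from the previous patch, the new writer
test (count_has_writer() in the diff below) classifies counts as
follows, for example:

  0x0000000000000003  no writer  (3 active readers)
  0xffffffff00000003  writer     (1 writer + 2 readers, no waiters)
  0xc000000000000002  no writer  (2 readers, waiters queued)
  0xbfffffff00000001  writer     (1 writer, waiters queued)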

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/rwsem-xadd.c |   45 +++++++++++++++++++++++++++++++++++++-----
 1 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 689a138..be2a327 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -83,6 +83,12 @@
  *	 (2) WAITING_BIAS - ACTIVE_WRITE_BIAS < count < 0
  */
 
+static inline bool count_has_writer(long count)
+{
+	return (count < RWSEM_WAITING_BIAS) || ((count < 0) &&
+	       (count > RWSEM_WAITING_BIAS - RWSEM_ACTIVE_WRITE_BIAS));
+}
+
 /*
  * Initialize an rwsem:
  */
@@ -294,6 +300,25 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 	}
 }
 
+/*
+ * Try to acquire read lock before the reader is put on wait queue
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	if (count_has_writer(count))
+		return false;
+	count = atomic_long_add_return_acquire(RWSEM_ACTIVE_READ_BIAS,
+					       &sem->count);
+	if (!count_has_writer(count))
+		return true;
+
+	/* Back out the change */
+	atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+	return false;
+}
+
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner;
@@ -357,7 +382,8 @@ out:
 	return !rwsem_owner_is_reader(READ_ONCE(sem->owner));
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+				  enum rwsem_waiter_type type)
 {
 	bool taken = false, can_spin;
 	int loopcnt;
@@ -385,10 +411,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		/*
 		 * Try to acquire the lock
 		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
+		taken = (type == RWSEM_WAITING_FOR_WRITE)
+		      ? rwsem_try_write_lock_unqueued(sem)
+		      : rwsem_try_read_lock_unqueued(sem);
+		if (taken)
 			break;
-		}
 
 		if (!can_spin && loopcnt)
 			loopcnt--;
@@ -425,7 +452,8 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
 }
 
 #else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+				  enum rwsem_waiter_type type)
 {
 	return false;
 }
@@ -454,6 +482,11 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	 */
 	atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
 
+	/* do optimistic spinning and steal lock if possible */
+	if (sem->rspin_threshold_shift &&
+	    rwsem_optimistic_spin(sem, RWSEM_WAITING_FOR_READ))
+		return sem;
+
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
 	waiter.type = RWSEM_WAITING_FOR_READ;
@@ -510,7 +543,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
+	if (rwsem_optimistic_spin(sem, RWSEM_WAITING_FOR_WRITE))
 		return sem;
 
 	/*
-- 
1.7.1


* [RFC PATCH-tip 6/6] xfs: Enable reader optimistic spinning for DAX inodes
From: Waiman Long @ 2016-06-14 18:12 UTC
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
	linux-arch, xfs, Davidlohr Bueso, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch, Waiman Long

This patch enables reader optimistic spinning for inodes that are
under a DAX-based mount point.

On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
fio tests with multithreaded randrw and randwrite workloads on the
same file on an XFS partition on top of an NVDIMM with DAX were run.
The aggregated bandwidths before and after the patch were as follows:

  Test      BW before patch     BW after patch  % change
  ----      ---------------     --------------  --------
  randrw        1352 MB/s          2164 MB/s      +60%
  randwrite     1710 MB/s          2550 MB/s      +49%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/xfs/xfs_icache.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 99ee6ee..09f284f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -71,6 +71,15 @@ xfs_inode_alloc(
 
 	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
 
+	/*
+	 * Enable reader spinning for DAX mount point
+	 */
+	if (mp->m_flags & XFS_MOUNT_DAX) {
+		rwsem_set_rspin_threshold(&ip->i_iolock.mr_lock);
+		rwsem_set_rspin_threshold(&ip->i_mmaplock.mr_lock);
+		rwsem_set_rspin_threshold(&ip->i_lock.mr_lock);
+	}
+
 	/* initialise the xfs inode */
 	ip->i_ino = ino;
 	ip->i_mount = mp;
-- 
1.7.1


* Re: [RFC PATCH-tip 6/6] xfs: Enable reader optimistic spinning for DAX inodes
From: Christoph Hellwig @ 2016-06-14 18:24 UTC
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
	linux-ia64, linux-s390, linux-arch, xfs, Davidlohr Bueso,
	Jason Low, Dave Chinner, Scott J Norton, Douglas Hatch

On Tue, Jun 14, 2016 at 02:12:39PM -0400, Waiman Long wrote:
> This patch enables reader optimistic spinning for inodes that are
> under a DAX-based mount point.
> 
> On a 4-socket Haswell machine running on a 4.7-rc1 tip-based kernel,
> the fio test with multithreaded randrw and randwrite tests on the
> same file on a XFS partition on top of a NVDIMM with DAX were run,
> the aggregated bandwidths before and after the patch were as follows:

And why is this specific to DAX?  Many I/O operations already never
go out to disk, and the ilock is mostly held for operations that have
nothing to do with disk I/O.


* Re: [RFC PATCH-tip 2/6] locking/rwsem: Enable optional count-based spinning on reader
From: Davidlohr Bueso @ 2016-06-14 18:27 UTC
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
	linux-ia64, linux-s390, linux-arch, xfs, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Tue, 14 Jun 2016, Waiman Long wrote:

>This patch provides a way for kernel code to designate specific
>rwsems to be more aggressive in terms of optimistic spinning:
>writers will continue to spin for an additional count-based period
>to see if they can get the lock before sleeping. This aggressive
>spinning mode should only be used on rwsems where the readers are
>unlikely to go to sleep.

Yikes, exposing this sort of thing makes me _very_ uneasy, not to
mention the ad-hoc nature and how easy it is to mess up. I'm not
really for this, even if it shows extraordinary performance boosts
on benchmarks.

Thanks,
Davidlohr


* Re: [RFC PATCH-tip 6/6] xfs: Enable reader optimistic spinning for DAX inodes
From: Waiman Long @ 2016-06-14 19:08 UTC
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
	linux-ia64, linux-s390, linux-arch, xfs, Davidlohr Bueso,
	Jason Low, Dave Chinner, Scott J Norton, Douglas Hatch

On 06/14/2016 02:24 PM, Christoph Hellwig wrote:
> On Tue, Jun 14, 2016 at 02:12:39PM -0400, Waiman Long wrote:
>> This patch enables reader optimistic spinning for inodes that are
>> under a DAX-based mount point.
>>
>> On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
>> fio tests with multithreaded randrw and randwrite workloads on the
>> same file on an XFS partition on top of an NVDIMM with DAX were run.
>> The aggregated bandwidths before and after the patch were as follows:
> And why is this specific to DAX?  Many I/O operations already never
> got out to disk, and ilock is mostly held for operations that have
> nothing to do with disk I/O.

It is just a showcase for the rwsem change. We can certainly have more 
use cases.

Cheers,
Longman


* Re: [RFC PATCH-tip 2/6] locking/rwsem: Enable optional count-based spinning on reader
From: Waiman Long @ 2016-06-14 19:11 UTC
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
	linux-ia64, linux-s390, linux-arch, xfs, Jason Low, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 06/14/2016 02:27 PM, Davidlohr Bueso wrote:
> On Tue, 14 Jun 2016, Waiman Long wrote:
>
>> This patch provides a way for kernel code to designate specific
>> rwsems to be more aggressive in terms of optimistic spinning:
>> writers will continue to spin for an additional count-based period
>> to see if they can get the lock before sleeping. This aggressive
>> spinning mode should only be used on rwsems where the readers are
>> unlikely to go to sleep.
>
> Yikes, exposing this sort of thing makes me _very_ uneasy, not to
> mention the ad-hoc nature and how easy it is to mess up. I'm not
> really for this, even if it shows extraordinary performance boosts
> on benchmarks.
>
> Thanks,
> Davidlohr

I understand your concern. I will see if there is a way to autotune 
instead of using explicit enablement.

Cheers,
Longman


* Re: [RFC PATCH-tip 6/6] xfs: Enable reader optimistic spinning for DAX inodes
From: Dave Chinner @ 2016-06-14 23:06 UTC
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
	linux-ia64, linux-s390, linux-arch, xfs, Davidlohr Bueso,
	Jason Low, Scott J Norton, Douglas Hatch

On Tue, Jun 14, 2016 at 02:12:39PM -0400, Waiman Long wrote:
> This patch enables reader optimistic spinning for inodes that are
> under a DAX-based mount point.
> 
> On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
> fio tests with multithreaded randrw and randwrite workloads on the
> same file on an XFS partition on top of an NVDIMM with DAX were run.
> The aggregated bandwidths before and after the patch were as follows:
> 
>   Test      BW before patch     BW after patch  % change
>   ----      ---------------     --------------  --------
>   randrw        1352 MB/s          2164 MB/s      +60%
>   randwrite     1710 MB/s          2550 MB/s      +49%
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  fs/xfs/xfs_icache.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 99ee6ee..09f284f 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -71,6 +71,15 @@ xfs_inode_alloc(
>  
>  	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
>  
> +	/*
> +	 * Enable reader spinning for DAX mount point
> +	 */
> +	if (mp->m_flags & XFS_MOUNT_DAX) {
> +		rwsem_set_rspin_threshold(&ip->i_iolock.mr_lock);
> +		rwsem_set_rspin_threshold(&ip->i_mmaplock.mr_lock);
> +		rwsem_set_rspin_threshold(&ip->i_lock.mr_lock);
> +	}

That's wrong. DAX is a per-inode flag, not a mount-wide flag. This
needs to be done once the inode has been fully initialised and
IS_DAX(inode) can be run.
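
i.e. once the inode is initialised far enough that VFS_I(ip) is
valid, something like this (untested sketch):

	if (IS_DAX(VFS_I(ip))) {
		rwsem_set_rspin_threshold(&ip->i_iolock.mr_lock);
		rwsem_set_rspin_threshold(&ip->i_mmaplock.mr_lock);
		rwsem_set_rspin_threshold(&ip->i_lock.mr_lock);
	}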

Also, the benchmark doesn't show that all these locks are being
exercised. e.g. the i_mmaplock isn't involved in the benchmark's IO
paths at all. It's only taken in page fault and truncate paths....

I'd also like to see how much of the gain comes from the iolock vs
the ilock, as the ilock is nested inside the iolock and so
contention is much rarer....

As it is, I'm *extremely* paranoid when it comes to changes to core
locking like this. Performance is secondary to correctness, and we
need much more than just a few benchmarks to verify there aren't
locking bugs being introduced....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH-tip 6/6] xfs: Enable reader optimistic spinning for DAX inodes
From: Waiman Long @ 2016-06-15 18:55 UTC
  To: Dave Chinner
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
	linux-ia64, linux-s390, linux-arch, xfs, Davidlohr Bueso,
	Jason Low, Scott J Norton, Douglas Hatch

On 06/14/2016 07:06 PM, Dave Chinner wrote:
> On Tue, Jun 14, 2016 at 02:12:39PM -0400, Waiman Long wrote:
>> This patch enables reader optimistic spinning for inodes that are
>> under a DAX-based mount point.
>>
>> On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
>> fio tests with multithreaded randrw and randwrite workloads on the
>> same file on an XFS partition on top of an NVDIMM with DAX were run.
>> The aggregated bandwidths before and after the patch were as follows:
>>
>>    Test      BW before patch     BW after patch  % change
>>    ----      ---------------     --------------  --------
>>    randrw        1352 MB/s          2164 MB/s      +60%
>>    randwrite     1710 MB/s          2550 MB/s      +49%
>>
>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
>> ---
>>   fs/xfs/xfs_icache.c |    9 +++++++++
>>   1 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
>> index 99ee6ee..09f284f 100644
>> --- a/fs/xfs/xfs_icache.c
>> +++ b/fs/xfs/xfs_icache.c
>> @@ -71,6 +71,15 @@ xfs_inode_alloc(
>>
>>   	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
>>
>> +	/*
>> +	 * Enable reader spinning for DAX mount point
>> +	 */
>> +	if (mp->m_flags & XFS_MOUNT_DAX) {
>> +		rwsem_set_rspin_threshold(&ip->i_iolock.mr_lock);
>> +		rwsem_set_rspin_threshold(&ip->i_mmaplock.mr_lock);
>> +		rwsem_set_rspin_threshold(&ip->i_lock.mr_lock);
>> +	}
> That's wrong. DAX is a per-inode flag, not a mount wide flag. This
> needs to be done once the inode has been fully initialised and
> IS_DAX(inode) can be run.
>
> Also, the benchmark doesn't show that all these locks are being
> tested by this benchmark. e.g. the i_mmaplock isn't involved in
> the benchmark's IO paths at all. It's only taken in page faults and
> truncate paths....
>
> I'd also like to see how much of the gain comes from the iolock vs
> the ilock, as the ilock is nested inside the iolock and so
> contention is much rarer....

This patch has now been superseded by a second one where changes to
the xfs code are no longer needed. The new patch will enable reader
spinning for all rwsems and dynamically disable it depending on past
history.

> As it is, I'm *extremely* paranoid when it comes to changes to core
> locking like this. Performance is secondary to correctness, and we
> need much more than just a few benchmarks to verify there aren't
> locking bugs being introduced....

The core rwsem locking logic hasn't been changed. There are some
minor changes, however, in what RWSEM_WAITING_BIAS value to use, and
those require more eyeballs to make sure they haven't introduced any
new bugs.

Cheers,
Longman
