* [PATCH v8 0/9] rwsem performance optimizations
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

For version 8 of the patchset, we included the patch from Waiman
to streamline wakeup operations, and also optimized the MCS lock
used in rwsem and mutex.

In this patchset, we introduce three categories of optimizations to the
read-write semaphore.  The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by pre-reading sem->count and avoiding
the cmpxchg when possible.
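
As a concrete illustration of the pattern, here is a minimal sketch of
check-before-cmpxchg on a generic counter.  The function name and the use
of a GCC builtin are ours for illustration only; the actual code (patch 1/9)
uses cmpxchg() on sem->count directly:

static inline int try_claim_write(long *count, long unlocked, long write_bias)
{
	/*
	 * Plain read first: if the lock is busy, fail without issuing an
	 * exclusive (cacheline-invalidating) access.
	 */
	if (*count != unlocked)
		return 0;

	/* Only now attempt the atomic compare-and-swap. */
	return __sync_val_compare_and_swap(count, unlocked, write_bias)
			== unlocked;
}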

The next four patches from Tim, Davidlohr and Jason
introduce optimistic spinning logic, similar to that in the
mutex code, for writer lock acquisition of the rwsem.  This addresses the
general 'mutexes outperform writer-rwsems' situation that has been
seen in more than one case.  Users now need not worry about performance
issues when choosing between these two locking mechanisms.  We have
also factored the MCS lock originally in the mutex code out into its
own file, performed micro-optimizations, and corrected its memory
barriers so it can be used for general lock/unlock of critical
sections.
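
The core of the writer spinning is simple: as long as the current lock owner
is running on a CPU, keep retrying the lock instead of sleeping.  A rough,
simplified sketch (the real rwsem_optimistic_spin() in patch 8/9 additionally
takes the MCS lock, uses RCU and ACCESS_ONCE() to dereference sem->owner
safely, and bails out on need_resched()):

static int rwsem_spin_sketch(struct rw_semaphore *sem)
{
	while (sem->owner && sem->owner->on_cpu) {
		if (rwsem_try_write_lock_unqueued(sem))
			return 1;	/* acquired without sleeping */
		arch_mutex_cpu_relax();	/* owner still running, spin again */
	}
	return 0;	/* owner blocked or readers hold it: take the sleep path */
}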
 
The last patch from Waiman helps to streamline the wakeup operation
by avoiding multiple threads all doing wakeup operations when only
one wakeup thread is enough.  This significantly reduces the lock
contention caused by multiple wakeup threads.

Tim got the following improvements for the exim mail server
workload on a 40-core system:

Alex+Tim's patchset:    	   +4.8%
Alex+Tim+Waiman's patchset:        +5.3%

Without these optimizations, Davidlohr Bueso saw an 8% regression in
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to an rwsem.  Tests were on an 8-socket, 80-core system.  With Alex's
and Tim's patches, he got significant improvements across the aim7
suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

More aim7 numbers will be posted when Davidlohr has a chance
to test the complete patchset, including Waiman's patch.

Thanks to Ingo Molnar, Peter Hurley, Peter Zijlstra and Paul McKenney
for helping to review this patchset.

Tim

Changelog:

v8:
1. Added Waiman's patch to avoid lock contention from multiple wakeup threads.
2. Micro-optimizations of MCS lock.
3. Correct the barriers of MCS lock to prevent critical sections from
leaking.

v7:
1. Rename mcslock.h to mcs_spinlock.h and also rename mcs-related fields
with an mcs prefix.
2. Properly define the type of the mcs_lock field instead of leaving it as void *.
3. Add a brief explanation of the MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at the wait queue.  This provides a 5% boost to the
exim workload and between a 2% and 8% boost to aim7.
2. Put the MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c.
3. Remove the RWSEM_SPIN_ON_WRITE_OWNER configuration option and make the
behavior the default, per Ingo's suggestion.

v4:
1. Fixed a bug in the task_struct definition in rwsem_can_spin_on_owner.
2. Fixed another typo in the RWSEM_SPIN_ON_WRITE_OWNER config option.

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fixed a typo in the RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig.

v2:
1. Reorganized changes to down_write_trylock and do_wake into 4 patches and fixed
   a bug referencing &sem->count where sem->count was intended.
2. Fixed an unsafe sem->owner de-reference in rwsem_can_spin_on_owner.  Default
   the option to on for more seasoning; it can be turned off should it prove
   detrimental.
3. Various patch comment updates.

Alex Shi (4):
  rwsem: check the lock before cmpxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label in do_wake
  rwsem/wake: check lock before doing atomic update

Jason Low (2):
  MCS Lock: optimizations and extra comments
  MCS Lock: Barrier corrections

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
    own file
  rwsem: do optimistic spinning for writer lock acquisition

Waiman Long (1):
  rwsem: reduce spinlock contention in wakeup code path

 include/asm-generic/rwsem.h  |    8 +-
 include/linux/mcs_spinlock.h |   82 ++++++++++++++
 include/linux/mutex.h        |    5 +-
 include/linux/rwsem.h        |    9 ++-
 kernel/mutex.c               |   60 +---------
 kernel/rwsem.c               |   19 +++-
 lib/rwsem.c                  |  255 +++++++++++++++++++++++++++++++++++++-----
 7 files changed, 349 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

-- 
1.7.4.4



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v8 1/9] rwsem: check the lock before cmpxchg in down_write_trylock
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

Cmpxchg will cause cacheline bouncing when doing the value check, which
causes a scalability issue on a large machine (like an 80-core box).

So a pre-read of the lock can relieve this contention.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/asm-generic/rwsem.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+		return 0;
 
-	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
-	return tmp == RWSEM_UNLOCKED_VALUE;
+	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+		      RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 2/9] rwsem: remove 'out' label in do_wake
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

This makes the code simpler and more readable.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 lib/rwsem.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..42f1b1a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 			 * will block as they will notice the queued writer.
 			 */
 			wake_up_process(waiter->task);
-		goto out;
+		return sem;
 	}
 
 	/* Writers might steal the lock before we grant it to the next reader.
@@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 			/* A writer stole the lock. Undo our reader grant. */
 			if (rwsem_atomic_update(-adjustment, sem) &
 						RWSEM_ACTIVE_MASK)
-				goto out;
+				return sem;
 			/* Last active locker left. Retry waking readers. */
 			goto try_reader_grant;
 		}
@@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	sem->wait_list.next = next;
 	next->prev = &sem->wait_list;
 
- out:
 	return sem;
 }
 
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 3/9] rwsem: remove try_reader_grant label in do_wake
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

This makes the code simpler and more readable.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 lib/rwsem.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	adjustment = 0;
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/* A writer stole the lock. Undo our reader grant. */
+		while (1) {
+			oldcount = rwsem_atomic_update(adjustment, sem)
+								- adjustment;
+			if (likely(oldcount >= RWSEM_WAITING_BIAS))
+				break;
+
+			 /* A writer stole the lock.  Undo our reader grant. */
 			if (rwsem_atomic_update(-adjustment, sem) &
 						RWSEM_ACTIVE_MASK)
 				return sem;
 			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
 		}
 	}
 
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 4/9] rwsem/wake: check lock before doing atomic update
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

Atomically updating the lock and rolling back will cause cache bouncing on a
large machine.  A pre-read of the lock status can relieve this problem.

Suggested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 lib/rwsem.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	long oldcount, woken, loop, adjustment;
+	long woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
 		while (1) {
+			long oldcount;
+
+			/* A writer stole the lock. */
+			if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+				return sem;
+
 			oldcount = rwsem_atomic_update(adjustment, sem)
 								- adjustment;
 			if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

We will need the MCS lock code for doing optimistic spinning for rwsem.
Extracting the MCS code from mutex.c and putting it into its own file allows
us to reuse this code easily for rwsem.
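
The calling convention stays the same as it was in mutex.c: each locker
supplies its own queue node, typically on the stack.  A minimal usage sketch:

	struct mcs_spinlock node;		/* per-locker queue node */

	mcs_spin_lock(&lock->mcs_lock, &node);	/* spins on node.locked */
	/* ... critical section ... */
	mcs_spin_unlock(&lock->mcs_lock, &node);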

Reviewed-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 include/linux/mcs_spinlock.h |   64 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/mutex.h        |    5 ++-
 kernel/mutex.c               |   60 ++++----------------------------------
 3 files changed, 74 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
new file mode 100644
index 0000000..b5de3b0
--- /dev/null
+++ b/include/linux/mcs_spinlock.h
@@ -0,0 +1,64 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bouncings that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+struct mcs_spinlock {
+	struct mcs_spinlock *next;
+	int locked; /* 1 if lock acquired */
+};
+
+/*
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static noinline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *prev;
+
+	/* Init node */
+	node->locked = 0;
+	node->next   = NULL;
+
+	prev = xchg(lock, node);
+	if (likely(prev == NULL)) {
+		/* Lock acquired */
+		node->locked = 1;
+		return;
+	}
+	ACCESS_ONCE(prev->next) = node;
+	smp_wmb();
+	/* Wait until the lock holder passes the lock down */
+	while (!ACCESS_ONCE(node->locked))
+		arch_mutex_cpu_relax();
+}
+
+static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
+
+	if (likely(!next)) {
+		/*
+		 * Release the lock by setting it to NULL
+		 */
+		if (cmpxchg(lock, node, NULL) == node)
+			return;
+		/* Wait until the next pointer is set */
+		while (!(next = ACCESS_ONCE(node->next)))
+			arch_mutex_cpu_relax();
+	}
+	ACCESS_ONCE(next->locked) = 1;
+	smp_wmb();
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index ccd4260..e6eaeea 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,6 +46,7 @@
  * - detects multi-task circular deadlocks and prints out all affected
  *   locks and tasks (and only those tasks)
  */
+struct mcs_spinlock;
 struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
@@ -55,7 +56,7 @@ struct mutex {
 	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	void			*spin_mlock;	/* Spinner MCS lock */
+	struct mcs_spinlock	*mcs_lock;	/* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
@@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 #define arch_mutex_cpu_relax()	cpu_relax()
 #endif
 
-#endif
+#endif /* __LINUX_MUTEX_H */
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 6d647ae..4640731 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/mcs_spinlock.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 	INIT_LIST_HEAD(&lock->wait_list);
 	mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	lock->spin_mlock = NULL;
+	lock->mcs_lock = NULL;
 #endif
 
 	debug_mutex_init(lock, name, key);
@@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
  * more or less simultaneously, the spinners need to acquire a MCS lock
  * first before spinning on the owner field.
  *
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
  */
-struct mspin_node {
-	struct mspin_node *next ;
-	int		  locked;	/* 1 if lock acquired */
-};
-#define	MLOCK(mutex)	((struct mspin_node **)&((mutex)->spin_mlock))
-
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
-	struct mspin_node *prev;
-
-	/* Init node */
-	node->locked = 0;
-	node->next   = NULL;
-
-	prev = xchg(lock, node);
-	if (likely(prev == NULL)) {
-		/* Lock acquired */
-		node->locked = 1;
-		return;
-	}
-	ACCESS_ONCE(prev->next) = node;
-	smp_wmb();
-	/* Wait until the lock holder passes the lock down */
-	while (!ACCESS_ONCE(node->locked))
-		arch_mutex_cpu_relax();
-}
-
-static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
-{
-	struct mspin_node *next = ACCESS_ONCE(node->next);
-
-	if (likely(!next)) {
-		/*
-		 * Release the lock by setting it to NULL
-		 */
-		if (cmpxchg(lock, node, NULL) == node)
-			return;
-		/* Wait until the next pointer is set */
-		while (!(next = ACCESS_ONCE(node->next)))
-			arch_mutex_cpu_relax();
-	}
-	ACCESS_ONCE(next->locked) = 1;
-	smp_wmb();
-}
 
 /*
  * Mutex spinning code migrated from kernel/sched/core.c
@@ -448,7 +402,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 
 	for (;;) {
 		struct task_struct *owner;
-		struct mspin_node  node;
+		struct mcs_spinlock  node;
 
 		if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
 			struct ww_mutex *ww;
@@ -470,10 +424,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		mspin_lock(MLOCK(lock), &node);
+		mcs_spin_lock(&lock->mcs_lock, &node);
 		owner = ACCESS_ONCE(lock->owner);
 		if (owner && !mutex_spin_on_owner(lock, owner)) {
-			mspin_unlock(MLOCK(lock), &node);
+			mcs_spin_unlock(&lock->mcs_lock, &node);
 			goto slowpath;
 		}
 
@@ -488,11 +442,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			}
 
 			mutex_set_owner(lock);
-			mspin_unlock(MLOCK(lock), &node);
+			mcs_spin_unlock(&lock->mcs_lock, &node);
 			preempt_enable();
 			return 0;
 		}
-		mspin_unlock(MLOCK(lock), &node);
+		mcs_spin_unlock(&lock->mcs_lock, &node);
 
 		/*
 		 * When there's no owner, we might have preempted between the
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 6/9] MCS Lock: optimizations and extra comments
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

Remove an unnecessary operation and mark the cmpxchg(lock, node, NULL) == node
check in mcs_spin_unlock() as likely(), since in most cases a race will not
have occurred.

Also add more comments describing how the local node is used in MCS locks.

Signed-off-by: Jason Low <jason.low2@hp.com>
---
 include/linux/mcs_spinlock.h |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index b5de3b0..96f14299 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -18,6 +18,12 @@ struct mcs_spinlock {
 };
 
 /*
+ * In order to acquire the lock, the caller should declare a local node and
+ * pass a reference of the node to this function in addition to the lock.
+ * If the lock has already been acquired, then this will proceed to spin
+ * on this node->locked until the previous lock holder sets the node->locked
+ * in mcs_spin_unlock().
+ *
  * We don't inline mcs_spin_lock() so that perf can correctly account for the
  * time spent in this lock function.
  */
@@ -33,7 +39,6 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	prev = xchg(lock, node);
 	if (likely(prev == NULL)) {
 		/* Lock acquired */
-		node->locked = 1;
 		return;
 	}
 	ACCESS_ONCE(prev->next) = node;
@@ -43,6 +48,10 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 		arch_mutex_cpu_relax();
 }
 
+/*
+ * Releases the lock. The caller should pass in the corresponding node that
+ * was used to acquire the lock.
+ */
 static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 {
 	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
@@ -51,7 +60,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod
 		/*
 		 * Release the lock by setting it to NULL
 		 */
-		if (cmpxchg(lock, node, NULL) == node)
+		if (likely(cmpxchg(lock, node, NULL) == node))
 			return;
 		/* Wait until the next pointer is set */
 		while (!(next = ACCESS_ONCE(node->next)))
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 7/9] MCS Lock: Barrier corrections
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

This patch corrects the way memory barriers are used in the MCS lock
and removes those that are not needed.  The lock side now relies on the
implicit barrier in xchg() plus an smp_rmb() after the spin loop, and the
unlock side on the barrier in cmpxchg() or an smp_wmb() before handing the
lock to the next waiter.  Comments are also added on all barriers.

Signed-off-by: Jason Low <jason.low2@hp.com>
---
 include/linux/mcs_spinlock.h |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index 96f14299..93d445d 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -36,16 +36,19 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	node->locked = 0;
 	node->next   = NULL;
 
+	/* xchg() provides a memory barrier */
 	prev = xchg(lock, node);
 	if (likely(prev == NULL)) {
 		/* Lock acquired */
 		return;
 	}
 	ACCESS_ONCE(prev->next) = node;
-	smp_wmb();
 	/* Wait until the lock holder passes the lock down */
 	while (!ACCESS_ONCE(node->locked))
 		arch_mutex_cpu_relax();
+
+	/* Make sure subsequent operations happen after the lock is acquired */
+	smp_rmb();
 }
 
 /*
@@ -58,6 +61,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod
 
 	if (likely(!next)) {
 		/*
+		 * cmpxchg() provides a memory barrier.
 		 * Release the lock by setting it to NULL
 		 */
 		if (likely(cmpxchg(lock, node, NULL) == node))
@@ -65,9 +69,14 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod
 		/* Wait until the next pointer is set */
 		while (!(next = ACCESS_ONCE(node->next)))
 			arch_mutex_cpu_relax();
+	} else {
+		/*
+		 * Make sure all operations within the critical section
+		 * happen before the lock is released.
+		 */
+		smp_wmb();
 	}
 	ACCESS_ONCE(next->locked) = 1;
-	smp_wmb();
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 8/9] rwsem: do optimistic spinning for writer lock acquisition
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

We want to add optimistic spinning to rwsems because the writer rwsem
does not perform as well as mutexes.  Tim noticed this for exim (mail
server) workloads when reverting commit 4fc3f1d6, and Davidlohr noticed
it when converting the i_mmap_mutex to an rwsem in some aim7 workloads.
We've noticed that the biggest difference is when we fail to acquire a
mutex in the fastpath: optimistic spinning comes into play and we can
avoid a large amount of unnecessary sleeping and the overhead of moving
tasks in and out of the wait queue.

Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem
to be acquired.  With these changes, rwsem is on par with mutex.

Reviewed-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 include/linux/rwsem.h |    7 ++-
 kernel/rwsem.c        |   19 +++++-
 lib/rwsem.c           |  201 ++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 206 insertions(+), 21 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..aba7920 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -22,10 +22,13 @@ struct rw_semaphore;
 #include <linux/rwsem-spinlock.h> /* use a generic implementation */
 #else
 /* All arch specific implementations share the same struct */
+struct mcs_spinlock;
 struct rw_semaphore {
 	long			count;
 	raw_spinlock_t		wait_lock;
 	struct list_head	wait_list;
+	struct task_struct	*owner; /* write owner */
+	struct mcs_spinlock	*mcs_lock;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 #endif
@@ -58,7 +61,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)			\
 	{ RWSEM_UNLOCKED_VALUE,				\
 	  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock),	\
-	  LIST_HEAD_INIT((name).wait_list)		\
+	  LIST_HEAD_INIT((name).wait_list),		\
+	  NULL,						\
+	  NULL						\
 	  __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..d74d1c9 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include <linux/atomic.h>
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
 	int ret = __down_write_trylock(sem);
 
-	if (ret == 1)
+	if (ret == 1) {
 		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+		rwsem_set_owner(sem);
+	}
 	return ret;
 }
 
@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
 	__up_write(sem);
+	rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
 	 * dependency.
 	 */
 	__downgrade_write(sem);
+	rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..cc3b33e 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -10,6 +10,8 @@
 #include <linux/sched.h>
 #include <linux/init.h>
 #include <linux/export.h>
+#include <linux/sched/rt.h>
+#include <linux/mcs_spinlock.h>
 
 /*
  * Initialize an rwsem:
@@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	sem->count = RWSEM_UNLOCKED_VALUE;
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
+	sem->owner = NULL;
+	sem->mcs_lock = NULL;
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -194,14 +198,177 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	return sem;
 }
 
+static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+	if (!(count & RWSEM_ACTIVE_MASK)) {
+		/* Try acquiring the write lock. */
+		if (sem->count == RWSEM_WAITING_BIAS &&
+		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
+			if (!list_is_singular(&sem->wait_list))
+				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline int rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count;
+
+	count = ACCESS_ONCE(sem->count);
+retry:
+	if (count == RWSEM_WAITING_BIAS) {
+		count = cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+			    RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS);
+		/* allow write lock stealing, try acquiring the write lock. */
+		if (count == RWSEM_WAITING_BIAS)
+			goto acquired;
+		else if (count == 0)
+			goto retry;
+	} else if (count == 0) {
+		count = cmpxchg(&sem->count, 0, RWSEM_ACTIVE_WRITE_BIAS);
+		if (count == 0)
+			goto acquired;
+		else if (count == RWSEM_WAITING_BIAS)
+			goto retry;
+	}
+	return 0;
+
+acquired:
+	return 1;
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	int retval;
+	struct task_struct *owner;
+
+	rcu_read_lock();
+	owner = ACCESS_ONCE(sem->owner);
+
+	/* Spin only if active writer running */
+	if (owner)
+		retval = owner->on_cpu;
+	else
+		retval = false;
+
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the sem owner may have just acquired
+	 * it and not set the owner yet, or the sem has been released, or
+	 * reader active.
+	 */
+	return retval;
+}
+
+static inline bool owner_running(struct rw_semaphore *lock,
+				struct task_struct *owner)
+{
+	if (lock->owner != owner)
+		return false;
+
+	/*
+	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
+	 * lock->owner still matches owner, if that fails, owner might
+	 * point to free()d memory, if it still matches, the rcu_read_lock()
+	 * ensures the memory stays valid.
+	 */
+	barrier();
+
+	return owner->on_cpu;
+}
+
+static noinline
+int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
+{
+	rcu_read_lock();
+	while (owner_running(lock, owner)) {
+		if (need_resched())
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * We break out the loop above on need_resched() or when the
+	 * owner changed, which is a sign for heavy contention. Return
+	 * success only when lock->owner is NULL.
+	 */
+	return lock->owner == NULL;
+}
+
+int rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	struct task_struct *owner;
+	int ret = 0;
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		return ret;
+
+	preempt_disable();
+	for (;;) {
+		struct mcs_spinlock node;
+
+		mcs_spin_lock(&sem->mcs_lock, &node);
+		owner = ACCESS_ONCE(sem->owner);
+		if (owner && !rwsem_spin_on_owner(sem, owner)) {
+			mcs_spin_unlock(&sem->mcs_lock, &node);
+			break;
+		}
+
+		/* wait_lock will be acquired if write_lock is obtained */
+		if (rwsem_try_write_lock_unqueued(sem)) {
+			mcs_spin_unlock(&sem->mcs_lock, &node);
+			ret = 1;
+			break;
+		}
+		mcs_spin_unlock(&sem->mcs_lock, &node);
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		arch_mutex_cpu_relax();
+	}
+	preempt_enable();
+
+	return ret;
+}
+
 /*
  * wait until we successfully acquire the write lock
  */
 struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 {
-	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
+	long count;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
+	bool waiting = true;
+
+	/* undo write bias from down_write operation, stop active locking */
+	count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
+
+	/* do optimistic spinning and steal lock if possible */
+	if (rwsem_optimistic_spin(sem))
+		goto done;
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -209,33 +376,28 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list))
-		adjustment += RWSEM_WAITING_BIAS;
+		waiting = false;
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
-	count = rwsem_atomic_update(adjustment, sem);
+	if (waiting)
+		count = ACCESS_ONCE(sem->count);
+	else
+		count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
 
-	/* If there were already threads queued before us and there are no
+	/*
+	 * If there were already threads queued before us and there are no
 	 * active writers, the lock must be read owned; so we try to wake
-	 * any read locks that were queued ahead of us. */
-	if (count > RWSEM_WAITING_BIAS &&
-	    adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
+	 * any read locks that were queued ahead of us.
+	 */
+	if ((count > RWSEM_WAITING_BIAS) && waiting)
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
 
 	/* wait until we successfully acquire the lock */
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	while (true) {
-		if (!(count & RWSEM_ACTIVE_MASK)) {
-			/* Try acquiring the write lock. */
-			count = RWSEM_ACTIVE_WRITE_BIAS;
-			if (!list_is_singular(&sem->wait_list))
-				count += RWSEM_WAITING_BIAS;
-
-			if (sem->count == RWSEM_WAITING_BIAS &&
-			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
-							RWSEM_WAITING_BIAS)
-				break;
-		}
+	for (;;) {
+		if (rwsem_try_write_lock(count, sem))
+			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
 
@@ -250,6 +412,7 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 
 	list_del(&waiter.list);
 	raw_spin_unlock_irq(&sem->wait_lock);
+done:
 	tsk->state = TASK_RUNNING;
 
 	return sem;
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v8 9/9] rwsem: reduce spinlock contention in wakeup code path
       [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
@ 2013-10-02 22:38   ` Tim Chen
  2013-10-02 22:38   ` Tim Chen
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-02 22:38 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
	Paul E.McKenney, Tim Chen, Jason Low, Waiman Long, linux-kernel,
	linux-mm

With the 3.12-rc2 kernel, there is sizable spinlock contention on
the rwsem wakeup code path when running AIM7's high_systime workload
on an 8-socket 80-core DL980 (HT off), as reported by perf:

  7.64%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
             |--41.77%-- rwsem_wake
  1.61%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irq
             |--92.37%-- rwsem_down_write_failed

That was 4.7% of recorded CPU cycles.

On a large NUMA machine, it is entirely possible that a fairly large
number of threads are queuing up in the ticket spinlock queue to do
the wakeup operation, when in fact only one of them is needed.  This
patch reduces spinlock contention by letting the extra threads skip
the wakeup.

A new wakeup field is added to the rwsem structure. This field is
set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
thread is about to perform the wakeup call, and it is cleared on exit
from those functions. There is no size increase on 64-bit systems and
a 4-byte size increase on 32-bit systems.

By checking whether the wakeup flag is set, a thread can exit rwsem_wake()
immediately when another thread is already doing the wakeup, instead of
waiting to get the spinlock only to find out that nothing needs to be done.

On some architectures, the setting of the wakeup flag may not be
immediately visible to all processors. However, this does not affect
program correctness. Clearing the wakeup flag before spin_unlock and
other barrier-type atomic instructions ensures that the cleared value
is visible to all processors.
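
(For readability, the new rwsem_wake() entry check boils down to roughly
the following sketch; the actual change is in the diff below.)

struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
	unsigned long flags;

	if (sem->wakeup)
		return sem;		/* someone is already waking up waiters */

	sem->wakeup = 1;		/* optimistically claim the wakeup work */
	raw_spin_lock_irqsave(&sem->wait_lock, flags);

	if (!list_empty(&sem->wait_list))
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY); /* clears sem->wakeup */
	else
		sem->wakeup = 0;	/* nothing to wake, drop the claim */

	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

	return sem;
}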

With this patch alone, the performance improvement in jobs per minute
(JPM) for a sample run of the high_systime workload (at 1500 users)
on the DL980 was as follows:

HT	JPM w/o patch	JPM with patch	% Change
--	-------------	--------------	--------
off	   148265	   170896	 +15.3%
on	   140078	   159319	 +13.7%

The new perf profile (HT off) was as follows:

  2.96%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
             |--0.94%-- rwsem_wake
  1.00%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irq
             |--88.70%-- rwsem_down_write_failed

Together with the rest of the rwsem patches in the series, the JPM number
(HT off) jumps to 195041, which is 32% better.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/linux/rwsem.h |    2 ++
 lib/rwsem.c           |   29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index aba7920..29314d3 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -26,6 +26,7 @@ struct mcs_spinlock;
 struct rw_semaphore {
 	long			count;
 	raw_spinlock_t		wait_lock;
+	int			wakeup;	/* Waking-up in progress flag */
 	struct list_head	wait_list;
 	struct task_struct	*owner; /* write owner */
 	struct mcs_spinlock	*mcs_lock;
@@ -61,6 +62,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)			\
 	{ RWSEM_UNLOCKED_VALUE,				\
 	  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock),	\
+	  0,						\
 	  LIST_HEAD_INIT((name).wait_list),		\
 	  NULL,						\
 	  NULL						\
diff --git a/lib/rwsem.c b/lib/rwsem.c
index cc3b33e..1adee01 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -27,6 +27,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	lockdep_init_map(&sem->dep_map, name, key, 0);
 #endif
 	sem->count = RWSEM_UNLOCKED_VALUE;
+	sem->wakeup = 0;
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
 	sem->owner = NULL;
@@ -70,6 +71,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	struct list_head *next;
 	long woken, loop, adjustment;
 
+	sem->wakeup = 1;	/* Waking up in progress */
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		if (wake_type == RWSEM_WAKE_ANY)
@@ -79,6 +81,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 			 * will block as they will notice the queued writer.
 			 */
 			wake_up_process(waiter->task);
+		sem->wakeup = 0;	/* Wakeup done */
 		return sem;
 	}
 
@@ -87,6 +90,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	adjustment = 0;
+	sem->wakeup = 0;
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
 		while (1) {
@@ -426,11 +430,36 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 
+	if (sem->wakeup)
+		return sem;	/* Waking up in progress already */
+	/*
+	 * Optimistically set the wakeup flag to indicate that the current
+	 * thread is going to wakeup the sleeping waiters so that the
+	 * following threads don't need to wait for doing the wakeup call.
+	 * It is perfectly fine if another thread clears the flag. It just
+	 * leads to one more thread waiting to call __rwsem_do_wake().
+	 *
+	 * Writer lock stealing is not an issue for writers which are
+	 * unconditionally woken up. The woken writer is synchronized with
+	 * the waker via the spinlock. So the writer can't start doing
+	 * anything before the spinlock is released. For readers, the
+	 * situation is more complicated. The write lock stealer or the
+	 * woken readers are not synchronized with the waker. So they may
+	 * finish before the waker clears the wakeup flag. To prevent this
+	 * situation, the wakeup flag is cleared before the atomic update
+	 * of the count which also acts as a barrier.
+	 *
+	 * The spin_unlock() call at the end will force the just-cleared
+	 * wakeup flag to be visible to all the processors.
+	 */
+	sem->wakeup = 1;
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+	else
+		sem->wakeup = 0;	/* Make sure wakeup flag is cleared */
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 
-- 
1.7.4.4



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-02 22:38   ` Tim Chen
@ 2013-10-03  7:32     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-10-03  7:32 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm


* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> For version 8 of the patchset, we included the patch from Waiman to 
> streamline wakeup operations and also optimize the MCS lock used in 
> rwsem and mutex.

I'd be feeling a lot easier about this patch series if you also had 
performance figures that show how mmap_sem is affected.

These:

> Tim got the following improvement for exim mail server 
> workload on 40 core system:
> 
> Alex+Tim's patchset:    	   +4.8%
> Alex+Tim+Waiman's patchset:        +5.3%

appear to be mostly related to the anon_vma->rwsem. But once that lock is 
changed to an rwlock_t, this measurement falls away.

Peter Zijlstra suggested the following testcase:

===============================>
In fact, try something like this from userspace:

n-threads:

  pthread_mutex_lock(&mutex);
  foo = mmap();
  pthread_mutex_unlock(&mutex);

  /* work */

  pthread_mutex_lock(&mutex);
  munmap(foo);
  pthread_mutex_unlock(&mutex);

vs

n-threads:

  foo = mmap();
  /* work */
  munmap(foo);

I've had reports that the former was significantly faster than the
latter.
<===============================

This could be put into a standalone testcase, or you could add it as a new
subcommand of 'perf bench', which already has some pthread code; see for
example tools/perf/bench/sched-messaging.c. Adding:

   perf bench mm threads

or so would be a natural thing to have.
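
(For reference, a minimal standalone version of that testcase could look
roughly like the sketch below.  It is purely illustrative: the file name,
iteration count, mapping size and argument handling are arbitrary, and the
timing would be done externally with 'time' or 'perf stat'.)

/* mmap_bench.c -- sketch; build with: gcc -O2 -pthread mmap_bench.c -o mmap_bench */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ITERS	100000
#define MAPLEN	(64 * 1024)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int use_mutex;	/* 1: serialize mmap/munmap with the pthread mutex */

static void *worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < ITERS; i++) {
		void *foo;

		if (use_mutex)
			pthread_mutex_lock(&mutex);
		foo = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (use_mutex)
			pthread_mutex_unlock(&mutex);

		if (foo == MAP_FAILED)
			continue;

		memset(foo, 0, MAPLEN);		/* the "work" */

		if (use_mutex)
			pthread_mutex_lock(&mutex);
		munmap(foo, MAPLEN);
		if (use_mutex)
			pthread_mutex_unlock(&mutex);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = (argc > 1) ? atoi(argv[1]) : 4;
	pthread_t *tid = calloc(nthreads, sizeof(*tid));

	use_mutex = (argc > 2) ? atoi(argv[2]) : 0;

	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);

	free(tid);
	return 0;
}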

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-03  7:32     ` Ingo Molnar
@ 2013-10-07 22:57       ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-07 22:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Thu, 2013-10-03 at 09:32 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > For version 8 of the patchset, we included the patch from Waiman to 
> > streamline wakeup operations and also optimize the MCS lock used in 
> > rwsem and mutex.
> 
> I'd be feeling a lot easier about this patch series if you also had 
> performance figures that show how mmap_sem is affected.
> 
> These:
> 
> > Tim got the following improvement for exim mail server 
> > workload on 40 core system:
> > 
> > Alex+Tim's patchset:    	   +4.8%
> > Alex+Tim+Waiman's patchset:        +5.3%
> 
> appear to be mostly related to the anon_vma->rwsem. But once that lock is 
> changed to an rwlock_t, this measurement falls away.
> 
> Peter Zijlstra suggested the following testcase:
> 
> ===============================>
> In fact, try something like this from userspace:
> 
> n-threads:
> 
>   pthread_mutex_lock(&mutex);
>   foo = mmap();
>   pthread_mutex_unlock(&mutex);
> 
>   /* work */
> 
>   pthread_mutex_lock(&mutex);
>   munmap(foo);
>   pthread_mutex_unlock(&mutex);
> 
> vs
> 
> n-threads:
> 
>   foo = mmap();
>   /* work */
>   munmap(foo);


Ingo,

I ran the vanilla kernel, the kernel with all the rwsem patches, and the
kernel with all patches except the optimistic spin one.
The data is presented in two ways below.  Please note that
there is about 5% run-to-run variation.

% change in performance vs vanilla kernel
#threads	all	without optspin
mmap only		
1		1.9%	1.6%
5		43.8%	2.6%
10		22.7%	-3.0%
20		-12.0%	-4.5%
40		-26.9%	-2.0%
mmap with mutex acquisition		
1		-2.1%	-3.0%
5		-1.9%	1.0%
10		4.2%	12.5%
20		-4.1%	0.6%
40		-2.8%	-1.9%

The optimistic spin case does very well at low to moderate contention,
but worse under very heavy contention for the pure mmap case.
For the case with the pthread mutex, there is not much change from the
vanilla kernel.

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads	vanilla	all	without optspin
1		3.0%	-1.0%	-1.7%
5		7.2%	-26.8%	5.5%
10		5.2%	-10.6%	22.1%
20		6.8%	16.4%	12.5%
40		-0.2%	32.7%	0.0%

In general, the vanilla and no-optspin cases perform better with the
pthread mutex.  For the optspin case, mmap with the
pthread mutex is worse at low to moderate contention and better
at high contention.

Tim

> 
> I've had reports that the former was significantly faster than the
> latter.
> <===============================
> 
> this could be put into a standalone testcase, or you could add it as a new 
> subcommand of 'perf bench', which already has some pthread code, see for 
> example in tools/perf/bench/sched-messaging.c. Adding:
> 
>    perf bench mm threads
> 
> or so would be a natural thing to have.
> 
> Thanks,
> 
> 	Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
  2013-10-02 22:38   ` Tim Chen
@ 2013-10-08 19:51     ` Rafael Aquini
  -1 siblings, 0 replies; 54+ messages in thread
From: Rafael Aquini @ 2013-10-08 19:51 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> We will need the MCS lock code for doing optimistic spinning for rwsem.
> Extracting the MCS code from mutex.c and put into its own file allow us
> to reuse this code easily for rwsem.
> 
> Reviewed-by: Ingo Molnar <mingo@elte.hu>
> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  include/linux/mcs_spinlock.h |   64 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mutex.h        |    5 ++-
>  kernel/mutex.c               |   60 ++++----------------------------------
>  3 files changed, 74 insertions(+), 55 deletions(-)
>  create mode 100644 include/linux/mcs_spinlock.h
> 
> diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> new file mode 100644
> index 0000000..b5de3b0
> --- /dev/null
> +++ b/include/linux/mcs_spinlock.h
> @@ -0,0 +1,64 @@
> +/*
> + * MCS lock defines
> + *
> + * This file contains the main data structure and API definitions of MCS lock.
> + *
> + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
> + * with the desirable properties of being fair, and with each cpu trying
> + * to acquire the lock spinning on a local variable.
> + * It avoids expensive cache bouncings that common test-and-set spin-lock
> + * implementations incur.
> + */

nitpick:

I believe you need 

+#include <asm/processor.h>

here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
(arch/s390 is one case)
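
(i.e. near the top of the new header, something like this sketch:)

#ifndef __LINUX_MCS_SPINLOCK_H
#define __LINUX_MCS_SPINLOCK_H

#include <asm/processor.h>	/* for cpu_relax() */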

> +#ifndef __LINUX_MCS_SPINLOCK_H
> +#define __LINUX_MCS_SPINLOCK_H
> +
> +struct mcs_spinlock {
> +	struct mcs_spinlock *next;
> +	int locked; /* 1 if lock acquired */
> +};
> +
> +/*
> + * We don't inline mcs_spin_lock() so that perf can correctly account for the
> + * time spent in this lock function.
> + */
> +static noinline
> +void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *prev;
> +
> +	/* Init node */
> +	node->locked = 0;
> +	node->next   = NULL;
> +
> +	prev = xchg(lock, node);
> +	if (likely(prev == NULL)) {
> +		/* Lock acquired */
> +		node->locked = 1;
> +		return;
> +	}
> +	ACCESS_ONCE(prev->next) = node;
> +	smp_wmb();
> +	/* Wait until the lock holder passes the lock down */
> +	while (!ACCESS_ONCE(node->locked))
> +		arch_mutex_cpu_relax();
> +}
> +
> +static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
> +
> +	if (likely(!next)) {
> +		/*
> +		 * Release the lock by setting it to NULL
> +		 */
> +		if (cmpxchg(lock, node, NULL) == node)
> +			return;
> +		/* Wait until the next pointer is set */
> +		while (!(next = ACCESS_ONCE(node->next)))
> +			arch_mutex_cpu_relax();
> +	}
> +	ACCESS_ONCE(next->locked) = 1;
> +	smp_wmb();
> +}
> +
> +#endif /* __LINUX_MCS_SPINLOCK_H */
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index ccd4260..e6eaeea 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -46,6 +46,7 @@
>   * - detects multi-task circular deadlocks and prints out all affected
>   *   locks and tasks (and only those tasks)
>   */
> +struct mcs_spinlock;
>  struct mutex {
>  	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
>  	atomic_t		count;
> @@ -55,7 +56,7 @@ struct mutex {
>  	struct task_struct	*owner;
>  #endif
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> -	void			*spin_mlock;	/* Spinner MCS lock */
> +	struct mcs_spinlock	*mcs_lock;	/* Spinner MCS lock */
>  #endif
>  #ifdef CONFIG_DEBUG_MUTEXES
>  	const char 		*name;
> @@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
>  #define arch_mutex_cpu_relax()	cpu_relax()
>  #endif
>  
> -#endif
> +#endif /* __LINUX_MUTEX_H */
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index 6d647ae..4640731 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -25,6 +25,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/interrupt.h>
>  #include <linux/debug_locks.h>
> +#include <linux/mcs_spinlock.h>
>  
>  /*
>   * In the DEBUG case we are using the "NULL fastpath" for mutexes,
> @@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
>  	INIT_LIST_HEAD(&lock->wait_list);
>  	mutex_clear_owner(lock);
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> -	lock->spin_mlock = NULL;
> +	lock->mcs_lock = NULL;
>  #endif
>  
>  	debug_mutex_init(lock, name, key);
> @@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
>   * more or less simultaneously, the spinners need to acquire a MCS lock
>   * first before spinning on the owner field.
>   *
> - * We don't inline mspin_lock() so that perf can correctly account for the
> - * time spent in this lock function.
>   */
> -struct mspin_node {
> -	struct mspin_node *next ;
> -	int		  locked;	/* 1 if lock acquired */
> -};
> -#define	MLOCK(mutex)	((struct mspin_node **)&((mutex)->spin_mlock))
> -
> -static noinline
> -void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
> -{
> -	struct mspin_node *prev;
> -
> -	/* Init node */
> -	node->locked = 0;
> -	node->next   = NULL;
> -
> -	prev = xchg(lock, node);
> -	if (likely(prev == NULL)) {
> -		/* Lock acquired */
> -		node->locked = 1;
> -		return;
> -	}
> -	ACCESS_ONCE(prev->next) = node;
> -	smp_wmb();
> -	/* Wait until the lock holder passes the lock down */
> -	while (!ACCESS_ONCE(node->locked))
> -		arch_mutex_cpu_relax();
> -}
> -
> -static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
> -{
> -	struct mspin_node *next = ACCESS_ONCE(node->next);
> -
> -	if (likely(!next)) {
> -		/*
> -		 * Release the lock by setting it to NULL
> -		 */
> -		if (cmpxchg(lock, node, NULL) == node)
> -			return;
> -		/* Wait until the next pointer is set */
> -		while (!(next = ACCESS_ONCE(node->next)))
> -			arch_mutex_cpu_relax();
> -	}
> -	ACCESS_ONCE(next->locked) = 1;
> -	smp_wmb();
> -}
>  
>  /*
>   * Mutex spinning code migrated from kernel/sched/core.c
> @@ -448,7 +402,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  
>  	for (;;) {
>  		struct task_struct *owner;
> -		struct mspin_node  node;
> +		struct mcs_spinlock  node;
>  
>  		if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
>  			struct ww_mutex *ww;
> @@ -470,10 +424,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  		 * If there's an owner, wait for it to either
>  		 * release the lock or go to sleep.
>  		 */
> -		mspin_lock(MLOCK(lock), &node);
> +		mcs_spin_lock(&lock->mcs_lock, &node);
>  		owner = ACCESS_ONCE(lock->owner);
>  		if (owner && !mutex_spin_on_owner(lock, owner)) {
> -			mspin_unlock(MLOCK(lock), &node);
> +			mcs_spin_unlock(&lock->mcs_lock, &node);
>  			goto slowpath;
>  		}
>  
> @@ -488,11 +442,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  			}
>  
>  			mutex_set_owner(lock);
> -			mspin_unlock(MLOCK(lock), &node);
> +			mcs_spin_unlock(&lock->mcs_lock, &node);
>  			preempt_enable();
>  			return 0;
>  		}
> -		mspin_unlock(MLOCK(lock), &node);
> +		mcs_spin_unlock(&lock->mcs_lock, &node);
>  
>  		/*
>  		 * When there's no owner, we might have preempted between the
> -- 
> 1.7.4.4
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
@ 2013-10-08 19:51     ` Rafael Aquini
  0 siblings, 0 replies; 54+ messages in thread
From: Rafael Aquini @ 2013-10-08 19:51 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> We will need the MCS lock code for doing optimistic spinning for rwsem.
> Extracting the MCS code from mutex.c and put into its own file allow us
> to reuse this code easily for rwsem.
> 
> Reviewed-by: Ingo Molnar <mingo@elte.hu>
> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  include/linux/mcs_spinlock.h |   64 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mutex.h        |    5 ++-
>  kernel/mutex.c               |   60 ++++----------------------------------
>  3 files changed, 74 insertions(+), 55 deletions(-)
>  create mode 100644 include/linux/mcs_spinlock.h
> 
> diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> new file mode 100644
> index 0000000..b5de3b0
> --- /dev/null
> +++ b/include/linux/mcs_spinlock.h
> @@ -0,0 +1,64 @@
> +/*
> + * MCS lock defines
> + *
> + * This file contains the main data structure and API definitions of MCS lock.
> + *
> + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
> + * with the desirable properties of being fair, and with each cpu trying
> + * to acquire the lock spinning on a local variable.
> + * It avoids expensive cache bouncings that common test-and-set spin-lock
> + * implementations incur.
> + */

nitpick:

I believe you need 

+#include <asm/processor.h>

here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
(arch/s390 is one case)

> +#ifndef __LINUX_MCS_SPINLOCK_H
> +#define __LINUX_MCS_SPINLOCK_H
> +
> +struct mcs_spinlock {
> +	struct mcs_spinlock *next;
> +	int locked; /* 1 if lock acquired */
> +};
> +
> +/*
> + * We don't inline mcs_spin_lock() so that perf can correctly account for the
> + * time spent in this lock function.
> + */
> +static noinline
> +void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *prev;
> +
> +	/* Init node */
> +	node->locked = 0;
> +	node->next   = NULL;
> +
> +	prev = xchg(lock, node);
> +	if (likely(prev == NULL)) {
> +		/* Lock acquired */
> +		node->locked = 1;
> +		return;
> +	}
> +	ACCESS_ONCE(prev->next) = node;
> +	smp_wmb();
> +	/* Wait until the lock holder passes the lock down */
> +	while (!ACCESS_ONCE(node->locked))
> +		arch_mutex_cpu_relax();
> +}
> +
> +static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *next = ACCESS_ONCE(node->next);
> +
> +	if (likely(!next)) {
> +		/*
> +		 * Release the lock by setting it to NULL
> +		 */
> +		if (cmpxchg(lock, node, NULL) == node)
> +			return;
> +		/* Wait until the next pointer is set */
> +		while (!(next = ACCESS_ONCE(node->next)))
> +			arch_mutex_cpu_relax();
> +	}
> +	ACCESS_ONCE(next->locked) = 1;
> +	smp_wmb();
> +}
> +
> +#endif /* __LINUX_MCS_SPINLOCK_H */
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index ccd4260..e6eaeea 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -46,6 +46,7 @@
>   * - detects multi-task circular deadlocks and prints out all affected
>   *   locks and tasks (and only those tasks)
>   */
> +struct mcs_spinlock;
>  struct mutex {
>  	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
>  	atomic_t		count;
> @@ -55,7 +56,7 @@ struct mutex {
>  	struct task_struct	*owner;
>  #endif
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> -	void			*spin_mlock;	/* Spinner MCS lock */
> +	struct mcs_spinlock	*mcs_lock;	/* Spinner MCS lock */
>  #endif
>  #ifdef CONFIG_DEBUG_MUTEXES
>  	const char 		*name;
> @@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
>  #define arch_mutex_cpu_relax()	cpu_relax()
>  #endif
>  
> -#endif
> +#endif /* __LINUX_MUTEX_H */
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index 6d647ae..4640731 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -25,6 +25,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/interrupt.h>
>  #include <linux/debug_locks.h>
> +#include <linux/mcs_spinlock.h>
>  
>  /*
>   * In the DEBUG case we are using the "NULL fastpath" for mutexes,
> @@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
>  	INIT_LIST_HEAD(&lock->wait_list);
>  	mutex_clear_owner(lock);
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> -	lock->spin_mlock = NULL;
> +	lock->mcs_lock = NULL;
>  #endif
>  
>  	debug_mutex_init(lock, name, key);
> @@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
>   * more or less simultaneously, the spinners need to acquire a MCS lock
>   * first before spinning on the owner field.
>   *
> - * We don't inline mspin_lock() so that perf can correctly account for the
> - * time spent in this lock function.
>   */
> -struct mspin_node {
> -	struct mspin_node *next ;
> -	int		  locked;	/* 1 if lock acquired */
> -};
> -#define	MLOCK(mutex)	((struct mspin_node **)&((mutex)->spin_mlock))
> -
> -static noinline
> -void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
> -{
> -	struct mspin_node *prev;
> -
> -	/* Init node */
> -	node->locked = 0;
> -	node->next   = NULL;
> -
> -	prev = xchg(lock, node);
> -	if (likely(prev == NULL)) {
> -		/* Lock acquired */
> -		node->locked = 1;
> -		return;
> -	}
> -	ACCESS_ONCE(prev->next) = node;
> -	smp_wmb();
> -	/* Wait until the lock holder passes the lock down */
> -	while (!ACCESS_ONCE(node->locked))
> -		arch_mutex_cpu_relax();
> -}
> -
> -static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
> -{
> -	struct mspin_node *next = ACCESS_ONCE(node->next);
> -
> -	if (likely(!next)) {
> -		/*
> -		 * Release the lock by setting it to NULL
> -		 */
> -		if (cmpxchg(lock, node, NULL) == node)
> -			return;
> -		/* Wait until the next pointer is set */
> -		while (!(next = ACCESS_ONCE(node->next)))
> -			arch_mutex_cpu_relax();
> -	}
> -	ACCESS_ONCE(next->locked) = 1;
> -	smp_wmb();
> -}
>  
>  /*
>   * Mutex spinning code migrated from kernel/sched/core.c
> @@ -448,7 +402,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  
>  	for (;;) {
>  		struct task_struct *owner;
> -		struct mspin_node  node;
> +		struct mcs_spinlock  node;
>  
>  		if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) {
>  			struct ww_mutex *ww;
> @@ -470,10 +424,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  		 * If there's an owner, wait for it to either
>  		 * release the lock or go to sleep.
>  		 */
> -		mspin_lock(MLOCK(lock), &node);
> +		mcs_spin_lock(&lock->mcs_lock, &node);
>  		owner = ACCESS_ONCE(lock->owner);
>  		if (owner && !mutex_spin_on_owner(lock, owner)) {
> -			mspin_unlock(MLOCK(lock), &node);
> +			mcs_spin_unlock(&lock->mcs_lock, &node);
>  			goto slowpath;
>  		}
>  
> @@ -488,11 +442,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  			}
>  
>  			mutex_set_owner(lock);
> -			mspin_unlock(MLOCK(lock), &node);
> +			mcs_spin_unlock(&lock->mcs_lock, &node);
>  			preempt_enable();
>  			return 0;
>  		}
> -		mspin_unlock(MLOCK(lock), &node);
> +		mcs_spin_unlock(&lock->mcs_lock, &node);
>  
>  		/*
>  		 * When there's no owner, we might have preempted between the
> -- 
> 1.7.4.4
> 
> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
  2013-10-08 19:51     ` Rafael Aquini
@ 2013-10-08 20:34       ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-08 20:34 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Tue, 2013-10-08 at 16:51 -0300, Rafael Aquini wrote:
> On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > Extracting the MCS code from mutex.c and putting it into its own file allows us
> > to reuse this code easily for rwsem.
> > 
> > Reviewed-by: Ingo Molnar <mingo@elte.hu>
> > Reviewed-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > ---
> >  include/linux/mcs_spinlock.h |   64 ++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/mutex.h        |    5 ++-
> >  kernel/mutex.c               |   60 ++++----------------------------------
> >  3 files changed, 74 insertions(+), 55 deletions(-)
> >  create mode 100644 include/linux/mcs_spinlock.h
> > 
> > diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> > new file mode 100644
> > index 0000000..b5de3b0
> > --- /dev/null
> > +++ b/include/linux/mcs_spinlock.h
> > @@ -0,0 +1,64 @@
> > +/*
> > + * MCS lock defines
> > + *
> > + * This file contains the main data structure and API definitions of MCS lock.
> > + *
> > + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
> > + * with the desirable properties of being fair, and with each cpu trying
> > + * to acquire the lock spinning on a local variable.
> > + * It avoids expensive cache bouncings that common test-and-set spin-lock
> > + * implementations incur.
> > + */
> 
> nitpick:
> 
> I believe you need 
> 
> +#include <asm/processor.h>
> 
> here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
> (arch/s390 is one case)

Probably 

+#include <linux/mutex.h> 

should be added instead?
It defines arch_mutex_cpu_relax() when there's no
architecture-specific version.

Thanks.
Tim



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
  2013-10-08 20:34       ` Tim Chen
@ 2013-10-08 21:31         ` Rafael Aquini
  -1 siblings, 0 replies; 54+ messages in thread
From: Rafael Aquini @ 2013-10-08 21:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Tue, Oct 08, 2013 at 01:34:55PM -0700, Tim Chen wrote:
> On Tue, 2013-10-08 at 16:51 -0300, Rafael Aquini wrote:
> > On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> > > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > > Extracting the MCS code from mutex.c and putting it into its own file allows us
> > > to reuse this code easily for rwsem.
> > > 
> > > Reviewed-by: Ingo Molnar <mingo@elte.hu>
> > > Reviewed-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > ---
> > >  include/linux/mcs_spinlock.h |   64 ++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/mutex.h        |    5 ++-
> > >  kernel/mutex.c               |   60 ++++----------------------------------
> > >  3 files changed, 74 insertions(+), 55 deletions(-)
> > >  create mode 100644 include/linux/mcs_spinlock.h
> > > 
> > > diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> > > new file mode 100644
> > > index 0000000..b5de3b0
> > > --- /dev/null
> > > +++ b/include/linux/mcs_spinlock.h
> > > @@ -0,0 +1,64 @@
> > > +/*
> > > + * MCS lock defines
> > > + *
> > > + * This file contains the main data structure and API definitions of MCS lock.
> > > + *
> > > + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
> > > + * with the desirable properties of being fair, and with each cpu trying
> > > + * to acquire the lock spinning on a local variable.
> > > + * It avoids expensive cache bouncings that common test-and-set spin-lock
> > > + * implementations incur.
> > > + */
> > 
> > nitpick:
> > 
> > I believe you need 
> > 
> > +#include <asm/processor.h>
> > 
> > here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
> > (arch/s390 is one case)
> 

Hmm... sorry for the noise; I was looking at an old tree, before this commit:
commit 083986e8248d978b6c961d3da6beb0c921c68220
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date:   Sat Sep 28 11:23:59 2013 +0200

    mutex: replace CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX with simple ifdef


> Probably 
> 
> +#include <linux/mutex.h> 
>

Yeah, but I guess right now you're OK without it, as the only place this
header is included is kernel/mutex.c, and linux/mutex.h gets pulled in before it.

If the plan is to extend its usage to other places where mutex.h doesn't go, then
perhaps the better thing would be to just copy the same #ifdef here.
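
(For illustration, a minimal sketch of the fallback being discussed - roughly
the form include/linux/mutex.h takes after the commit above, so a copy in
mcs_spinlock.h would look the same:)

#ifndef arch_mutex_cpu_relax
/* architectures without their own relax primitive fall back to cpu_relax() */
# define arch_mutex_cpu_relax()	cpu_relax()
#endif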

Cheers! (and sorry again for the noise)

> should be added instead?
> It defines arch_mutex_cpu_relax when there's no 
> architecture specific version.
> 
> Thanks.
> Tim
> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-07 22:57       ` Tim Chen
@ 2013-10-09  6:15         ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-10-09  6:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm


* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> Ingo,
> 
> I ran the vanilla kernel, the kernel with all rwsem patches and the 
> kernel with all patches except the optimistic spin one.  I am listing 
> two presentations of the data.  Please note that there is about 5% 
> run-run variation.
> 
> % change in performance vs vanilla kernel
> #threads	all	without optspin
> mmap only		
> 1		1.9%	1.6%
> 5		43.8%	2.6%
> 10		22.7%	-3.0%
> 20		-12.0%	-4.5%
> 40		-26.9%	-2.0%
> mmap with mutex acquisition		
> 1		-2.1%	-3.0%
> 5		-1.9%	1.0%
> 10		4.2%	12.5%
> 20		-4.1%	0.6%
> 40		-2.8%	-1.9%

Silly question: how do the two methods of starting N threads compare to 
each other? Do they have identical runtimes? I think PeterZ's point was 
that the pthread_mutex case, despite adding extra serialization, actually 
runs faster in some circumstances.

Also, mind posting the testcase? What 'work' do the threads do - clear 
some memory area? How big is the memory area?

I'd expect this to be about large enough mmap()s showing page fault
processing to be mmap_sem bound, with the serialization via pthread_mutex()
setting up a 'train' of threads in one case, while the parallel startup
runs into mmap_sem in the regular case.

So I'd expect this to be a rather sensitive workload and you'd have to
actively engineer it to hit the effect PeterZ mentioned. I could imagine
MPI workloads running into such patterns - but not deterministically.

Only once you've convinced yourself that you are hitting that kind of 
effect reliably on the vanilla kernel, could/should the effects of an 
improved rwsem implementation be measured.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-09  6:15         ` Ingo Molnar
@ 2013-10-09  7:28           ` Peter Zijlstra
  -1 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2013-10-09  7:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tim Chen, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
	Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Wed, Oct 09, 2013 at 08:15:51AM +0200, Ingo Molnar wrote:
> So I'd expect this to be a rather sensitive workload and you'd have to 
> actively engineer it to hit the effect PeterZ mentioned. I could imagine 
> MPI workloads to run into such patterns - but not deterministically.

The workload that I got the report from was a virus scanner: it would
spawn nr_cpus threads and {mmap file, scan content, munmap} through your
filesystem.
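
(A rough sketch of that access pattern - the file handling below is
hypothetical, the point is just where mmap_sem gets taken:)

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* One scan iteration: mmap the file, read every byte, munmap.  Each
 * iteration takes mmap_sem for write twice (mmap + munmap) and for
 * read on every page fault while scanning. */
static unsigned long scan_file(const char *path)
{
	unsigned long sum = 0;
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		goto out;

	unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p != MAP_FAILED) {
		for (off_t i = 0; i < st.st_size; i++)
			sum += p[i];		/* "scan content" */
		munmap(p, st.st_size);
	}
out:
	if (fd >= 0)
		close(fd);
	return sum;
}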

Now if I only could remember who reported this.. :/

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-09  6:15         ` Ingo Molnar
@ 2013-10-09 16:34           ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-09 16:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > Ingo,
> > 
> > I ran the vanilla kernel, the kernel with all rwsem patches and the 
> > kernel with all patches except the optimistic spin one.  I am listing 
> > two presentations of the data.  Please note that there is about 5% 
> > run-run variation.
> > 
> > % change in performance vs vanilla kernel
> > #threads	all	without optspin
> > mmap only		
> > 1		1.9%	1.6%
> > 5		43.8%	2.6%
> > 10		22.7%	-3.0%
> > 20		-12.0%	-4.5%
> > 40		-26.9%	-2.0%
> > mmap with mutex acquisition		
> > 1		-2.1%	-3.0%
> > 5		-1.9%	1.0%
> > 10		4.2%	12.5%
> > 20		-4.1%	0.6%
> > 40		-2.8%	-1.9%
> 
> Silly question: how do the two methods of starting N threads compare to 
> each other? 

They both started N pthreads and ran for a fixed time.
The throughput of mmap with pthread-mutex vs. pure mmap is below:

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads        vanilla 	all rwsem patches	without optspin
1               3.0%    	-1.0%   	-1.7%
5               7.2%    	-26.8%  	5.5%
10              5.2%    	-10.6%  	22.1%
20              6.8%    	16.4%   	12.5%
40              -0.2%   	32.7%   	0.0%

So with mutex, the vanilla kernel and the one without optspin both
run faster.  This is consistent with what Peter reported.  With
optspin, the picture is more mixed, with lower throughput at low to
moderate numbers of threads and higher throughput at high thread counts.

> Do they have identical runtimes? 

Yes, they both have identical runtimes.  I looked at the number
of mmap and munmap operations I could push through.

> I think PeterZ's point was 
> that the pthread_mutex case, despite adding extra serialization, actually 
> runs faster in some circumstances.

Yes, I also see the pthread mutex run faster for the vanilla kernel
from the data above.

> 
> Also, mind posting the testcase? What 'work' do the threads do - clear 
> some memory area? 

The test case does a simple mmap and munmap of 1MB of memory per iteration.

> How big is the memory area?

1MB

The two cases are created as:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
        while (1) {
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                munmap(c, MEMSIZE);

                (*iterations)++;
        }
}

and adding mutex to serialize:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB with
mutex";

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void testcase(unsigned long long *iterations)
{
        while (1) {
                pthread_mutex_lock(&mutex);
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                munmap(c, MEMSIZE);
                pthread_mutex_unlock(&mutex);

                (*iterations)++;
        }
}

and run as a pthread.
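
(For completeness: the snippets above additionally need <sys/mman.h>,
<assert.h> and, for the mutex variant, <pthread.h>.  A minimal driver
along these lines - the thread count, runtime and reporting are my
assumptions, not the actual harness - would be:)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8			/* assumption: the '#threads' column */
#define RUNTIME  30			/* seconds, assumption */

void testcase(unsigned long long *iterations);	/* one of the versions above */

static unsigned long long counts[NTHREADS];

static void *worker(void *arg)
{
	testcase(&counts[(long)arg]);	/* never returns */
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	unsigned long long total = 0;
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);

	/* counters are read racily; good enough for a throughput estimate */
	sleep(RUNTIME);
	for (i = 0; i < NTHREADS; i++)
		total += counts[i];
	printf("%llu iterations in %d seconds\n", total, RUNTIME);
	return 0;			/* exiting main() ends the workers */
}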
> 
> I'd expect this to be about large enough mmap()s showing page fault 
> processing to be mmap_sem bound and the serialization via pthread_mutex() 
> sets up a 'train' of threads in one case, while the parallel startup would 
> run into the mmap_sem in the regular case.
> 
> So I'd expect this to be a rather sensitive workload and you'd have to 
> actively engineer it to hit the effect PeterZ mentioned. I could imagine 
> MPI workloads to run into such patterns - but not deterministically.
> 
> Only once you've convinced yourself that you are hitting that kind of 
> effect reliably on the vanilla kernel, could/should the effects of an 
> improved rwsem implementation be measured.
> 
> Thanks,
> 
> 	Ingo



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-09  7:28           ` Peter Zijlstra
@ 2013-10-10  3:14             ` Linus Torvalds
  -1 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-10-10  3:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Tim Chen, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
	Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	Linux Kernel Mailing List, linux-mm

On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The workload that I got the report from was a virus scanner, it would
> spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> filesystem.

So I suspect we could make the mmap_sem write area *much* smaller for
the normal cases.

Look at do_mmap_pgoff(), for example: it is run entirely under
mmap_sem, but 99% of what it does doesn't actually need the lock.

The part that really needs the lock is

        addr = get_unmapped_area(file, addr, len, pgoff, flags);
        addr = mmap_region(file, addr, len, vm_flags, pgoff);

but we hold it over all the other stuff too.

In fact, even if we moved the mmap_sem down into do_mmap(), and moved
code around a bit to only hold it over those functions, it would still
cover unnecessarily much. For example, while merging is common, not
merging is pretty common too, and we do that

        vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

allocation under the lock. We could easily do things like preallocate
it outside the lock.
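
(A rough sketch of that shape - hypothetical, since mmap_region() would
need to grow an interface for consuming a preallocated vma, and error
paths are elided:)

	/* cheap checks and the allocation happen before taking the lock */
	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
	if (!vma)
		return -ENOMEM;

	down_write(&mm->mmap_sem);
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	/* mmap_region() would consume the preallocated vma here */
	addr = mmap_region(file, addr, len, vm_flags, pgoff);
	up_write(&mm->mmap_sem);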

Right now mmap_sem covers pretty much the whole system call (we do do
some security checks outside of it).

I think the main issue is that nobody has ever cared deeply enough to
see how far this could be pushed. I suspect there is some low-hanging
fruit for anybody who is willing to handle the pain..

            Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-10  3:14             ` Linus Torvalds
@ 2013-10-10  5:03               ` Davidlohr Bueso
  -1 siblings, 0 replies; 54+ messages in thread
From: Davidlohr Bueso @ 2013-10-10  5:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Tim Chen, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Alex Shi, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Rik van Riel, Peter Hurley, Paul E.McKenney,
	Jason Low, Waiman Long, Linux Kernel Mailing List, linux-mm

On Wed, 2013-10-09 at 20:14 -0700, Linus Torvalds wrote:
> On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > The workload that I got the report from was a virus scanner, it would
> > spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> > filesystem.
> 
> So I suspect we could make the mmap_sem write area *much* smaller for
> the normal cases.
> 
> Look at do_mmap_pgoff(), for example: it is run entirely under
> mmap_sem, but 99% of what it does doesn't actually need the lock.
> 
> The part that really needs the lock is
> 
>         addr = get_unmapped_area(file, addr, len, pgoff, flags);
>         addr = mmap_region(file, addr, len, vm_flags, pgoff);
> 
> but we hold it over all the other stuff too.
> 

True. By looking at the callers, we're always doing:

down_write(&mm->mmap_sem);
do_mmap_pgoff()
...
up_write(&mm->mmap_sem);

That goes for shm, aio, and of course mmap_pgoff().

While I know you hate two-level locking, one way to go about this is to
take the lock inside do_mmap_pgoff() after the initial checks (flags,
page alignment, etc.) and return with the lock held, leaving the caller to
unlock it.
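
(Sketching the caller side of that idea - do_mmap_pgoff_locked() below is
a hypothetical name for such a variant, not an existing function:)

	/* the flag/alignment checks run unlocked inside the callee, which
	 * then takes mmap_sem itself and returns with it held */
	addr = do_mmap_pgoff_locked(file, addr, len, prot, flags, pgoff);
	/* ... */
	up_write(&mm->mmap_sem);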

> In fact, even if we moved the mmap_sem down into do_mmap(), and moved
> code around a bit to only hold it over those functions, it would still
> cover unnecessarily much. For example, while merging is common, not
> merging is pretty common too, and we do that
> 
>         vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> 
> allocation under the lock. We could easily do things like preallocate
> it outside the lock.
> 

AFAICT there are also checks that should be done at the beginning of the
function, such as checking for MAP_LOCKED and VM_LOCKED flags before
calling get_unmapped_area().

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-09 16:34           ` Tim Chen
@ 2013-10-10  7:54             ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-10-10  7:54 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm


* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> The throughput of pure mmap with mutex is below vs pure mmap is below:
> 
> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads        vanilla 	all rwsem    	without optspin
> 				patches
> 1               3.0%    	-1.0%   	-1.7%
> 5               7.2%    	-26.8%  	5.5%
> 10              5.2%    	-10.6%  	22.1%
> 20              6.8%    	16.4%   	12.5%
> 40              -0.2%   	32.7%   	0.0%
> 
> So with mutex, the vanilla kernel and the one without optspin both run 
> faster.  This is consistent with what Peter reported.  With optspin, the 
> picture is more mixed, with lower throughput at low to moderate number 
> of threads and higher throughput with high number of threads.

So, going back to your original table:

> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads        vanilla all     without optspin
> 1               3.0%    -1.0%   -1.7%
> 5               7.2%    -26.8%  5.5%
> 10              5.2%    -10.6%  22.1%
> 20              6.8%    16.4%   12.5%
> 40              -0.2%   32.7%   0.0%
>
> In general, vanilla and no-optspin case perform better with 
> pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> worse at low to moderate contention and better at high contention.

it appears that 'without optspin' is a pretty good choice - if
it wasn't for that '1 thread' number which, if I assume correctly, is the
uncontended case - one of the most common usecases ...

How can the single-threaded case get slower? None of the patches should 
really cause noticeable overhead in the non-contended case. That looks 
weird.

It would also be nice to see the 2, 3, 4 thread numbers - those are the 
most common contention scenarios in practice - where do we see the first 
improvement in performance?

Also, it would be nice to include a noise/stddev figure; it's really hard
to tell whether -1.7% is statistically significant.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-10  7:54             ` Ingo Molnar
@ 2013-10-16  0:09               ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-16  0:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > The throughput of pure mmap with mutex is below vs pure mmap is below:
> > 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla 	all rwsem    	without optspin
> > 				patches
> > 1               3.0%    	-1.0%   	-1.7%
> > 5               7.2%    	-26.8%  	5.5%
> > 10              5.2%    	-10.6%  	22.1%
> > 20              6.8%    	16.4%   	12.5%
> > 40              -0.2%   	32.7%   	0.0%
> > 
> > So with mutex, the vanilla kernel and the one without optspin both run 
> > faster.  This is consistent with what Peter reported.  With optspin, the 
> > picture is more mixed, with lower throughput at low to moderate number 
> > of threads and higher throughput with high number of threads.
> 
> So, going back to your orignal table:
> 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla all     without optspin
> > 1               3.0%    -1.0%   -1.7%
> > 5               7.2%    -26.8%  5.5%
> > 10              5.2%    -10.6%  22.1%
> > 20              6.8%    16.4%   12.5%
> > 40              -0.2%   32.7%   0.0%
> >
> > In general, vanilla and no-optspin case perform better with 
> > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > worse at low to moderate contention and better at high contention.
> 
> it appears that 'without optspin' appears to be a pretty good choice - if 
> it wasn't for that '1 thread' number, which, if I correctly assume is the 
> uncontended case, is one of the most common usecases ...
> 
> How can the single-threaded case get slower? None of the patches should 
> really cause noticeable overhead in the non-contended case. That looks 
> weird.
> 
> It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> most common contention scenarios in practice - where do we see the first 
> improvement in performance?
> 
> Also, it would be nice to include a noise/sttdev figure, it's really hard 
> to tell whether -1.7% is statistically significant.

Ingo,

I think that the optimistic spin changes to rwsem should enhance
performance for real workloads after all.

In my previous tests, I was doing mmap followed immediately by 
munmap without doing anything to the memory.  No real workload
will behave that way, and it is not the scenario that we
should optimize for.  A much better approximation of
real usage is doing mmap, then touching
the memory being mmapped, followed by munmap.

This changes the dynamics of the rwsem as we are now dominated
by read acquisitions of mmap sem due to the page faults, instead
of having only write acquisitions from mmap. In this case, any delay 
in write acquisitions will be costly as we will be
blocking a lot of readers.  This is where optimistic spinning on
write acquisitions of mmap sem can provide a very significant boost
to the throughput.

I changed the test case to the following, with writes to
the mmapped memory:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
        int i;

        while (1) {
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                for (i=0; i<MEMSIZE; i+=8) {
                        c[i] = 0xa;
                }
                munmap(c, MEMSIZE);

                (*iterations)++;
        }
}

I compared the throughput of the kernel with the complete rwsem
patchset against vanilla, and also against the case where I take out
the optimistic spin patch.  I have increased the run time
by 10x from my previous experiments and do 10 runs for
each case.  The standard deviation is ~1.5%, so any change
under 1.5% is not statistically significant.

% change in throughput vs the vanilla kernel.
Threads	all	No-optspin
1	+0.4%	-0.1%
2	+2.0%	+0.2%
3	+1.1%	+1.5%
4	-0.5%	-1.4%
5	-0.1%	-0.1%
10	+2.2%	-1.2%
20	+237.3%	-2.3%
40	+548.1%	+0.3%

For threads 1 to 5, we essentially
have about the same performance as the vanilla case.
We are getting a throughput boost of 237% for 20 threads
and 548% for 40 threads.  When we take out
the optimistic spin, we have mostly the same throughput as
the vanilla kernel for this test.

When I looked at the profile of the vanilla
kernel for the 40-thread case, I saw 80% of
cpu time spent contending for the spin lock of the rwsem
wait queue, in rwsem_down_read_failed during page faults.
When I applied the rwsem patchset with optimistic spin,
this lock contention went down to only 2% of cpu time.

Now when I test the case where we acquire a mutex in
user space before mmap, I get the following data versus the
vanilla kernel.  There's little contention on mmap sem
acquisition in this case.

n	all	No-optspin
1	+0.8%	-1.2%
2	+1.0%	-0.5%
3	+1.8%	+0.2%
4	+1.5%	-0.4%
5	+1.1%	+0.4%
10	+1.5%	-0.3%
20	+1.4%	-0.2%
40	+1.3%	+0.4%

Thanks.

Tim




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-16  0:09               ` Tim Chen
@ 2013-10-16  6:55                 ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-10-16  6:55 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm


* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > 
> > > The throughput of pure mmap with mutex is below vs pure mmap is below:
> > > 
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads        vanilla 	all rwsem    	without optspin
> > > 				patches
> > > 1               3.0%    	-1.0%   	-1.7%
> > > 5               7.2%    	-26.8%  	5.5%
> > > 10              5.2%    	-10.6%  	22.1%
> > > 20              6.8%    	16.4%   	12.5%
> > > 40              -0.2%   	32.7%   	0.0%
> > > 
> > > So with mutex, the vanilla kernel and the one without optspin both run 
> > > faster.  This is consistent with what Peter reported.  With optspin, the 
> > > picture is more mixed, with lower throughput at low to moderate number 
> > > of threads and higher throughput with high number of threads.
> > 
> > So, going back to your original table:
> > 
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads        vanilla all     without optspin
> > > 1               3.0%    -1.0%   -1.7%
> > > 5               7.2%    -26.8%  5.5%
> > > 10              5.2%    -10.6%  22.1%
> > > 20              6.8%    16.4%   12.5%
> > > 40              -0.2%   32.7%   0.0%
> > >
> > > In general, vanilla and no-optspin case perform better with 
> > > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > > worse at low to moderate contention and better at high contention.
> > 
> > it appears that 'without optspin' appears to be a pretty good choice - if 
> > it wasn't for that '1 thread' number, which, if I correctly assume is the 
> > uncontended case, is one of the most common usecases ...
> > 
> > How can the single-threaded case get slower? None of the patches should 
> > really cause noticeable overhead in the non-contended case. That looks 
> > weird.
> > 
> > It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> > most common contention scenarios in practice - where do we see the first 
> > improvement in performance?
> > 
> > Also, it would be nice to include a noise/stddev figure, it's really hard 
> > to tell whether -1.7% is statistically significant.
> 
> Ingo,
> 
> I think that the optimistic spin changes to rwsem should enhance 
> performance to real workloads after all.
> 
> In my previous tests, I was doing mmap followed immediately by 
> munmap without doing anything to the memory.  No real workload
> will behave that way and it is not the scenario that we 
> should optimize for.  A much better approximation of
> real usages will be doing mmap, then touching 
> the memories being mmaped, followed by munmap.  

That's why I asked for a working testcase to be posted ;-) Not just 
pseudocode - send the real .c thing please.

> This changes the dynamics of the rwsem as we are now dominated by read 
> acquisitions of mmap sem due to the page faults, instead of having only 
> write acquisitions from mmap. [...]

Absolutely, the page fault read case is the #1 optimization target of 
rwsems.

> [...] In this case, any delay in write acquisitions will be costly as we 
> will be blocking a lot of readers.  This is where optimistic spinning on 
> write acquisitions of mmap sem can provide a very significant boost to 
> the throughput.
> 
> I change the test case to the following with writes to
> the mmaped memory:
> 
> #define MEMSIZE (1 * 1024 * 1024)
> 
> char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> 
> void testcase(unsigned long long *iterations)
> {
>         int i;
> 
>         while (1) {
>                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
>                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>                 assert(c != MAP_FAILED);
>                 for (i=0; i<MEMSIZE; i+=8) {
>                         c[i] = 0xa;
>                 }
>                 munmap(c, MEMSIZE);
> 
>                 (*iterations)++;
>         }
> }

It would be _really_ nice to stick this into tools/perf/bench/ as:

	perf bench mem pagefaults

or so, with a number of parallelism and workload patterns. See 
tools/perf/bench/numa.c for a couple of workload generators - although 
those are not page fault intense.

So that future generations can run all these tests too and such.

> I compare the throughput where I have the complete rwsem patchset 
> against vanilla and the case where I take out the optimistic spin patch.  
> I have increased the run time by 10x from my previous experiments and do 
> 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> under 1.5% is within the noise and not statistically significant.
> 
> % change in throughput vs the vanilla kernel.
> Threads	all	No-optspin
> 1		+0.4%	-0.1%
> 2		+2.0%	+0.2%
> 3		+1.1%	+1.5%
> 4		-0.5%	-1.4%
> 5		-0.1%	-0.1%
> 10		+2.2%	-1.2%
> 20		+237.3%	-2.3%
> 40		+548.1%	+0.3%

The tail is impressive. The early parts are important as well, but it's 
really hard to tell the significance of the early portion without having 
a stddev column.

( "perf stat --repeat N" will give you sttdev output, in handy percentage 
  form. )

> Now when I test the case where we acquire mutex in the
> user space before mmap, I got the following data versus
> vanilla kernel.  There's little contention on mmap sem 
> acquisition in this case.
> 
> n	all	No-optspin
> 1	+0.8%	-1.2%
> 2	+1.0%	-0.5%
> 3	+1.8%	+0.2%
> 4	+1.5%	-0.4%
> 5	+1.1%	+0.4%
> 10	+1.5%	-0.3%
> 20	+1.4%	-0.2%
> 40	+1.3%	+0.4%
> 
> Thanks.

A bit hard to see as there's no comparison _between_ the pthread_mutex and 
plain-parallel versions. No contention isn't a great result if performance 
suffers because it's all serialized.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-16  6:55                 ` Ingo Molnar
@ 2013-10-16 18:28                   ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-16 18:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm

On Wed, 2013-10-16 at 08:55 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > > * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > > 
> > > > The throughput of pure mmap with mutex is below vs pure mmap is below:
> > > > 
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads        vanilla 	all rwsem    	without optspin
> > > > 				patches
> > > > 1               3.0%    	-1.0%   	-1.7%
> > > > 5               7.2%    	-26.8%  	5.5%
> > > > 10              5.2%    	-10.6%  	22.1%
> > > > 20              6.8%    	16.4%   	12.5%
> > > > 40              -0.2%   	32.7%   	0.0%
> > > > 
> > > > So with mutex, the vanilla kernel and the one without optspin both run 
> > > > faster.  This is consistent with what Peter reported.  With optspin, the 
> > > > picture is more mixed, with lower throughput at low to moderate number 
> > > > of threads and higher throughput with high number of threads.
> > > 
> > > So, going back to your original table:
> > > 
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads        vanilla all     without optspin
> > > > 1               3.0%    -1.0%   -1.7%
> > > > 5               7.2%    -26.8%  5.5%
> > > > 10              5.2%    -10.6%  22.1%
> > > > 20              6.8%    16.4%   12.5%
> > > > 40              -0.2%   32.7%   0.0%
> > > >
> > > > In general, vanilla and no-optspin case perform better with 
> > > > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > > > worse at low to moderate contention and better at high contention.
> > > 
> > > it appears that 'without optspin' appears to be a pretty good choice - if 
> > > it wasn't for that '1 thread' number, which, if I correctly assume is the 
> > > uncontended case, is one of the most common usecases ...
> > > 
> > > How can the single-threaded case get slower? None of the patches should 
> > > really cause noticeable overhead in the non-contended case. That looks 
> > > weird.
> > > 
> > > It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> > > most common contention scenarios in practice - where do we see the first 
> > > improvement in performance?
> > > 
> > > Also, it would be nice to include a noise/stddev figure, it's really hard 
> > > to tell whether -1.7% is statistically significant.
> > 
> > Ingo,
> > 
> > I think that the optimistic spin changes to rwsem should enhance 
> > performance to real workloads after all.
> > 
> > In my previous tests, I was doing mmap followed immediately by 
> > munmap without doing anything to the memory.  No real workload
> > will behave that way and it is not the scenario that we 
> > should optimize for.  A much better approximation of
> > real usages will be doing mmap, then touching 
> > the memories being mmaped, followed by munmap.  
> 
> That's why I asked for a working testcase to be posted ;-) Not just 
> pseudocode - send the real .c thing please.

I was using a modified version of Anton's will-it-scale test.  I'll try
to port the tests to perf bench to make it easier for other people to
run the tests.

> 
> > This changes the dynamics of the rwsem as we are now dominated by read 
> > acquisitions of mmap sem due to the page faults, instead of having only 
> > write acquisitions from mmap. [...]
> 
> Absolutely, the page fault read case is the #1 optimization target of 
> rwsems.
> 
> > [...] In this case, any delay in write acquisitions will be costly as we 
> > will be blocking a lot of readers.  This is where optimistic spinning on 
> > write acquisitions of mmap sem can provide a very significant boost to 
> > the throughput.
> > 
> > I change the test case to the following with writes to
> > the mmaped memory:
> > 
> > #define MEMSIZE (1 * 1024 * 1024)
> > 
> > char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> > 
> > void testcase(unsigned long long *iterations)
> > {
> >         int i;
> > 
> >         while (1) {
> >                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> >                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >                 assert(c != MAP_FAILED);
> >                 for (i=0; i<MEMSIZE; i+=8) {
> >                         c[i] = 0xa;
> >                 }
> >                 munmap(c, MEMSIZE);
> > 
> >                 (*iterations)++;
> >         }
> > }
> 
> It would be _really_ nice to stick this into tools/perf/bench/ as:
> 
> 	perf bench mem pagefaults
> 
> or so, with a number of parallelism and workload patterns. See 
> tools/perf/bench/numa.c for a couple of workload generators - although 
> those are not page fault intense.
> 
> So that future generations can run all these tests too and such.

Okay, will do.

> 
> > I compare the throughput where I have the complete rwsem patchset 
> > against vanilla and the case where I take out the optimistic spin patch.  
> > I have increased the run time by 10x from my previous experiments and do 
> > 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> > under 1.5% is within the noise and not statistically significant.
> > 
> > % change in throughput vs the vanilla kernel.
> > Threads	all	No-optspin
> > 1		+0.4%	-0.1%
> > 2		+2.0%	+0.2%
> > 3		+1.1%	+1.5%
> > 4		-0.5%	-1.4%
> > 5		-0.1%	-0.1%
> > 10		+2.2%	-1.2%
> > 20		+237.3%	-2.3%
> > 40		+548.1%	+0.3%
> 
> The tail is impressive. The early parts are important as well, but it's 
> really hard to tell the significance of the early portion without having 
> a stddev column.

Here's the data with a std-dev (sdv) column:

n	all	sdv	No-optspin	sdv
1	+0.4%	0.9%	-0.1%		0.8%
2	+2.0%	0.8%	+0.2%		1.2%
3	+1.1%	0.8%	+1.5%		0.6%
4	-0.5%	0.9%	-1.4%		1.1%
5	-0.1%	1.1%	-0.1%		1.1%
10	+2.2%	0.8%	-1.2%		1.0%
20	+237.3%	0.7%	-2.3%		1.3%
40	+548.1%	0.8%	+0.3%		1.2%


> ( "perf stat --repeat N" will give you sttdev output, in handy percentage 
>   form. )
> 
> > Now when I test the case where we acquire mutex in the
> > user space before mmap, I got the following data versus
> > vanilla kernel.  There's little contention on mmap sem 
> > acquisition in this case.
> > 
> > n	all	No-optspin
> > 1	+0.8%	-1.2%
> > 2	+1.0%	-0.5%
> > 3	+1.8%	+0.2%
> > 4	+1.5%	-0.4%
> > 5	+1.1%	+0.4%
> > 10	+1.5%	-0.3%
> > 20	+1.4%	-0.2%
> > 40	+1.3%	+0.4%

Adding std-dev to above data:

n	all	sdv	No-optspin	sdv
1	+0.8%	1.0%	-1.2%		1.2%
2	+1.0%	1.0%	-0.5%		1.0%
3	+1.8%	0.7%	+0.2%		0.8%
4	+1.5%	0.8%	-0.4%		0.7%
5	+1.1%	1.1%	+0.4%		0.3%
10	+1.5%	0.7%	-0.3%		0.7%
20	+1.4%	0.8%	-0.2%		1.0%
40	+1.3%	0.7%	+0.4%		0.5%

> > 
> > Thanks.
> 
> A bit hard to see as there's no comparison _between_ the pthread_mutex and 
> plain-parallel versions. No contention isn't a great result if performance 
> suffers because it's all serialized.

Now the data for the pthread-mutex vs plain-parallel testcase, on the
vanilla kernel, the full rwsem patchset and the no-optspin patchset,
with std-dev:
n	vanilla	sdv	Rwsem-all	sdv	No-optspin	sdv
1	+0.5%	0.9%	+1.4%		0.9%	-0.7%		1.0%
2	-39.3%	1.0%	-38.7%		1.1%	-39.6%		1.1%
3	-52.6%	1.2%	-51.8%		0.7%	-52.5%		0.7%
4	-59.8%	0.8%	-59.2%		1.0%	-59.9%		0.9%
5	-63.5%	1.4%	-63.1%		1.4%	-63.4%		1.0%
10	-66.1%	1.3%	-65.6%		1.3%	-66.2%		1.3%
20	+178.3%	0.9%	+182.3%		1.0%	+177.7%		1.1%
40	+604.8%	1.1%	+614.0%		1.0%	+607.9%		0.9%

The version with the full rwsem patchset performs best across the
thread counts.  Serialization actually hurts for smaller numbers of
threads, even on the current vanilla kernel.

I'll rerun the tests once I have ported them to perf bench.  It may
take me a couple of days.

Thanks.

Tim


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-16  6:55                 ` Ingo Molnar
@ 2013-10-16 21:55                   ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-10-16 21:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm


> 
> It would be _really_ nice to stick this into tools/perf/bench/ as:
> 
> 	perf bench mem pagefaults
> 
> or so, with a number of parallelism and workload patterns. See 
> tools/perf/bench/numa.c for a couple of workload generators - although 
> those are not page fault intense.
> 
> So that future generations can run all these tests too and such.
> 
> > I compare the throughput where I have the complete rwsem patchset 
> > against vanilla and the case where I take out the optimistic spin patch.  
> > I have increased the run time by 10x from my previous experiments and do 
> > 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> > under 1.5% is within the noise and not statistically significant.
> > 
> > % change in throughput vs the vanilla kernel.
> > Threads	all	No-optspin
> > 1		+0.4%	-0.1%
> > 2		+2.0%	+0.2%
> > 3		+1.1%	+1.5%
> > 4		-0.5%	-1.4%
> > 5		-0.1%	-0.1%
> > 10		+2.2%	-1.2%
> > 20		+237.3%	-2.3%
> > 40		+548.1%	+0.3%
> 
> The tail is impressive. The early parts are important as well, but it's 
> really hard to tell the significance of the early portion without having 
> > a stddev column.
> 
> ( "perf stat --repeat N" will give you sttdev output, in handy percentage 
>   form. )

Quick naive question, as I haven't hacked perf bench before.
perf stat gives the statistics of performance counters or events.
How do I get it to compute the stats of the throughput reported
by perf bench?

Something like

perf stat -r 10 -- perf bench mem memset --iterations 10

doesn't quite give what I need.

Pointers appreciated.

Tim


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-16 21:55                   ` Tim Chen
@ 2013-10-18  6:52                     ` Ingo Molnar
  -1 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-10-18  6:52 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	linux-kernel, linux-mm


* Tim Chen <tim.c.chen@linux.intel.com> wrote:

> 
> > 
> > It would be _really_ nice to stick this into tools/perf/bench/ as:
> > 
> > 	perf bench mem pagefaults
> > 
> > or so, with a number of parallelism and workload patterns. See 
> > tools/perf/bench/numa.c for a couple of workload generators - although 
> > those are not page fault intense.
> > 
> > So that future generations can run all these tests too and such.
> > 
> > > I compare the throughput where I have the complete rwsem patchset 
> > > against vanilla and the case where I take out the optimistic spin patch.  
> > > I have increased the run time by 10x from my previous experiments and do 
> > > 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> > > under 1.5% is within the noise and not statistically significant.
> > > 
> > > % change in throughput vs the vanilla kernel.
> > > Threads	all	No-optspin
> > > 1		+0.4%	-0.1%
> > > 2		+2.0%	+0.2%
> > > 3		+1.1%	+1.5%
> > > 4		-0.5%	-1.4%
> > > 5		-0.1%	-0.1%
> > > 10		+2.2%	-1.2%
> > > 20		+237.3%	-2.3%
> > > 40		+548.1%	+0.3%
> > 
> > The tail is impressive. The early parts are important as well, but it's 
> > really hard to tell the significance of the early portion without having 
> > a stddev column.
> > 
> > ( "perf stat --repeat N" will give you sttdev output, in handy percentage 
> >   form. )
> 
> Quick naive question as I haven't hacked perf bench before.  

Btw., please use tip:master, I've got a few cleanups in there that should 
make it easier to hack.

> Now perf stat gives the statistics of the performance counter or events.
> How do I get it to compute the stats of 
> the throughput reported by perf bench?

What I do is measure the execution time, via:

  perf stat --null --repeat 10 perf bench ...

instead of relying on benchmark output.
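
Concretely, using e.g. the mem memset suite:

  perf stat --null --repeat 10 -- perf bench mem memset

where the stddev that perf stat prints applies to the elapsed time of
each run.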

> Something like
> 
> perf stat -r 10 -- perf bench mem memset --iterations 10
> 
> doesn't quite give what I need.

Yeah. So, perf bench also has a 'simple' output format:

  comet:~/tip> perf bench -f simple sched pipe
  10.378

We could extend 'perf stat' with an option to not measure time, but to 
take any numeric data output from the executed task and use that as the 
measurement result.

If you'd be interested in such a feature I can give it a try.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v8 0/9] rwsem performance optimizations
  2013-10-16 18:28                   ` Tim Chen
@ 2013-11-04 22:36                     ` Tim Chen
  -1 siblings, 0 replies; 54+ messages in thread
From: Tim Chen @ 2013-11-04 22:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Andrea Arcangeli,
	Alex Shi, Andi Kleen, Michel Lespinasse, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	Peter Hurley, Paul E.McKenney, Jason Low, Waiman Long,
	YuanhanLiu, linux-kernel, linux-mm

Ingo,

Sorry for the late response.  My old 4 socket Westmere
test machine went down and I had to find a new one,
which is a 4 socket Ivybridge machine with 15 cores per socket.

I've updated the workload as a perf benchmark (see the attached
patch).  The workload mmaps, accesses memory in the mmaped area
and then unmaps, doing so repeatedly for a specified time.  Each
thread is pinned to a particular core, with the threads distributed
evenly between the sockets.  The throughput is reported along with
standard deviation info.
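
For reference, the benchmark is invoked along the lines of the
following (the thread count and times here are only examples; see the
usage text in the patch below for the full option list):

  perf bench mem mmap -l 1MB -n 40 -r 10 -w 5 -i 10       (parallel mmap)
  perf bench mem mmap -l 1MB -n 40 -r 10 -w 5 -i 10 -s    (serialized mmap)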

First, some baseline numbers comparing the workload with serialized
mmap vs non-serialized mmap, running under the vanilla kernel.

Threads		Throughput	std dev(%)
		serial vs non-serial
		mmap(%)
1		0.10		0.16
2		0.78		0.09
3		-5.00		0.12
4		-3.27		0.08
5		-0.11		0.09
10		5.32		0.10
20		-2.05		0.05
40		-9.75		0.15
60		11.69		0.05


Here's the data for the complete rwsem patchset vs the plain vanilla
kernel.  Overall there's improvement, except for the 3-thread case.

Threads		Throughput	std dev(%)
		vs vanilla(%)
1		0.62		0.11
2		3.86		0.10
3		-7.02		0.19
4		-0.01		0.13
5		2.74		0.06
10		5.66		0.03
20		1.44		0.09
40		5.54		0.09
60		15.63		0.13

Now testing the patched kernel vs the vanilla kernel, with both
running serialized mmap (mutex acquisition in user space, the -s
option of the benchmark).

Threads		Throughput	std dev(%)
		vs vanilla(%)
1		0.60		0.02
2		6.40		0.11
3		14.13		0.07
4		-2.41		0.07
5		1.05		0.08
10		4.15		0.05
20		-0.26		0.06
40		-3.45		0.13
60		-4.33		0.07

Here's another run with the rwsem patchset without
optimistic spinning

Threads		Throughput	std dev(%)
		vs vanilla(%)
1		0.81		0.04
2		2.85		0.17
3		-4.09		0.05
4		-8.31		0.07
5		-3.19		0.03
10		1.02		0.05
20		-4.77		0.04
40		-3.11		0.10
60		2.06		0.10

No-optspin patchset, comparing the serialized mmap workload under
the patched kernel vs the vanilla kernel:

Threads		Throughput	std dev(%)
		vs vanilla(%)
1		0.57		0.03
2		2.13		0.17
3		14.78		0.33
4		-1.23		0.11
5		2.99		0.08
10		-0.43		0.10
20		0.01		0.03
40		3.03		0.10
60		-1.74		0.09


The data is a bit of a mixed bag.  I'll spin off
the MCS cleanup patch separately so we can merge that first
for Waiman's qrwlock work.

Tim

---
From 6c5916315c1515fb2281d9344b2c4f371ca99879 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 30 Oct 2013 05:18:29 -0700
Subject: [PATCH] perf mmap and memory write test

This patch adds a perf benchmark that mmaps a piece of memory,
writes to the memory and unmaps it, for a given number of
threads.  The threads are distributed and pinned evenly across
the sockets on the machine.  The options for the benchmark are
as follows:

 usage: perf bench mem mmap <options>

    -l, --length <1MB>    Specify length of memory to set. Available units: B, KB, MB, GB and TB (upper and lower)
    -i, --iterations <n>  repeat mmap() invocation this number of times
    -n, --threads <n>     number of threads doing mmap() invocation
    -r, --runtime <n>     runtime per iteration in sec
    -w, --warmup <n>      warmup time in sec
    -s, --serialize       serialize the mmap() operations with mutex
    -v, --verbose         verbose output giving info about each iteration

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 tools/perf/Makefile         |   1 +
 tools/perf/bench/bench.h    |   1 +
 tools/perf/bench/mem-mmap.c | 312 ++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c  |   3 +
 4 files changed, 317 insertions(+)
 create mode 100644 tools/perf/bench/mem-mmap.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 64c043b..80e32d1 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -408,6 +408,7 @@ BUILTIN_OBJS += $(OUTPUT)bench/mem-memset-x86-64-asm.o
 endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-mmap.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 0fdc852..dbd0515 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -31,6 +31,7 @@ extern int bench_sched_pipe(int argc, const char **argv, const char *prefix);
 extern int bench_mem_memcpy(int argc, const char **argv,
 			    const char *prefix __maybe_unused);
 extern int bench_mem_memset(int argc, const char **argv, const char *prefix);
+extern int bench_mem_mmap(int argc, const char **argv, const char *prefix);
 
 #define BENCH_FORMAT_DEFAULT_STR	"default"
 #define BENCH_FORMAT_DEFAULT		0
diff --git a/tools/perf/bench/mem-mmap.c b/tools/perf/bench/mem-mmap.c
new file mode 100644
index 0000000..11d96ad
--- /dev/null
+++ b/tools/perf/bench/mem-mmap.c
@@ -0,0 +1,312 @@
+/*
+ * mem-mmap.c
+ *
+ * mmap: Simple parallel mmap and touch of mapped memory
+ */
+
+#include "../perf.h"
+#include "../util/util.h"
+#include "../util/parse-options.h"
+#include "../util/header.h"
+#include "bench.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <math.h>
+#include <unistd.h>
+#include <errno.h>
+#include <pthread.h>
+#include <numa.h>
+#include <numaif.h>
+#include <sched.h>
+
+#define K 1024
+#define CACHELINE_SIZE 128
+
+static const char	*length_str	= "1MB";
+static int		iterations	= 10;
+static int		threads		= 1;
+static int		stride		= 8;
+static int		warmup		= 5;
+static int		runtime		= 10;
+static bool		serialize_mmap  = false;
+static bool		verbose		= false;
+static bool		do_cnt		= false;
+static size_t		len		= 1024*1024;
+static unsigned long long	**results;
+static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+static int		nr_cpus;
+static int		nr_nodes;
+static struct bitmask	*nodemask	= NULL;
+static int		*cur_cpu;
+static int		cur_node;
+static struct bitmask	*cpumask;
+
+static const struct option options[] = {
+	OPT_STRING('l', "length", &length_str, "1MB",
+		    "Specify length of memory to set. "
+		    "Available units: B, KB, MB, GB and TB (upper and lower)"),
+	OPT_INTEGER('i', "iterations", &iterations,
+		    "repeat mmap() invocation this number of times"),
+	OPT_INTEGER('n', "threads", &threads,
+		    "number of threads doing mmap() invocation"),
+	OPT_INTEGER('r', "runtime", &runtime,
+		    "runtime per iteration in sec"),
+	OPT_INTEGER('w', "warmup", &warmup,
+		    "warmup time in sec"),
+	OPT_BOOLEAN('s', "serialize", &serialize_mmap,
+		    "serialize the mmap() operations with mutex"),
+	OPT_BOOLEAN('v', "verbose", &verbose,
+		    "verbose output giving info about each iteration"),
+	OPT_END()
+};
+
+static const char * const bench_mem_mmap_usage[] = {
+	"perf bench mem mmap <options>",
+	NULL
+};
+
+static double timeval2double(struct timeval *ts)
+{
+	return (double)ts->tv_sec +
+		(double)ts->tv_usec / (double)1000000;
+}
+
+static void alloc_mem(void **dst, size_t length)
+{
+	if (serialize_mmap) {
+		pthread_mutex_lock(&mutex);
+		*dst = mmap(NULL, length, PROT_READ|PROT_WRITE,
+				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+		pthread_mutex_unlock(&mutex);
+	} else
+		*dst = mmap(NULL, length, PROT_READ|PROT_WRITE,
+				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+	if (!*dst)
+		die("memory allocation failed - maybe length is too large?\n");
+}
+static void free_mem(void *dst, size_t length)
+{
+	if (serialize_mmap) {
+		pthread_mutex_lock(&mutex);
+		munmap(dst, length);
+		pthread_mutex_unlock(&mutex);
+	} else
+		munmap(dst, length);
+}
+
+static void *do_mmap(void *itr)
+{
+	void *dst = NULL;
+	char *c;
+	size_t l;
+	unsigned long long *cnt;
+
+	cnt = (unsigned long long *) itr;
+	while (1) {
+		/* alloc memory with mmap */
+		alloc_mem(&dst, len);
+		c = (char *) dst;
+
+		/* touch memory allocated */
+		for (l = 0; l < len; l += stride)
+			c[l] = 0xa;
+		free_mem(dst, len);
+		if (do_cnt)
+			(*cnt)++;
+	}
+	return NULL;
+}
+
+static int get_next_node(int node)
+{
+	int i;
+
+	i = (node+1) % nr_nodes;
+	while (i != node) {
+		if (numa_bitmask_isbitset(nodemask, i))
+			return i;
+		i = (i+1) % nr_nodes;
+	}
+	return (-1);
+}
+
+static int get_next_cpu(int node)
+{
+	int i, prev_cpu;
+
+	prev_cpu = cur_cpu[node];
+	i = (prev_cpu + 1) % nr_cpus;
+	numa_node_to_cpus(node, cpumask);
+	while (i != prev_cpu) {
+		if (numa_bitmask_isbitset(cpumask, i)) {
+			cur_cpu[node] = i;
+			return i;
+		}
+		i = (i+1) % nr_cpus;
+	}
+	return (-1);
+}
+
+static void set_attr_to_cpu(pthread_attr_t *attr, int cpu)
+{
+	cpu_set_t cpuset;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	pthread_attr_setaffinity_np(attr, sizeof(cpuset), &cpuset);
+}
+
+int bench_mem_mmap(int argc, const char **argv,
+		     const char *prefix __maybe_unused)
+{
+	int i, itr = 0;
+	char *m, *r;
+	struct timeval tv_start, tv_end, tv_diff;
+	double sum = 0.0, min = -1.0, max = 0.0, tput_total = 0.0;
+	double mean = 0, sdv = 0, tsq = 0.0;
+	pthread_t tid;
+	pthread_attr_t attr;
+	u64 addr;
+
+	argc = parse_options(argc, argv, options,
+			     bench_mem_mmap_usage, 0);
+
+	nodemask = numa_get_run_node_mask();
+	nr_nodes = numa_max_node() + 1;
+	nr_cpus = numa_num_configured_cpus();
+	cur_node = 0;
+	cur_cpu = (int *) malloc(nr_nodes * sizeof(int));
+	if (cur_cpu == NULL) {
+		fprintf(stderr, "Not enough memory to set up benchmark\n");
+	}
+	for (i = 0; i < nr_nodes; ++i)
+		cur_cpu[i] = 0;
+
+	cpumask = numa_allocate_cpumask();
+	if ((s64)len <= 0) {
+		fprintf(stderr, "Invalid length:%s\n", length_str);
+		return 1;
+	}
+
+	m = (char *) malloc(CACHELINE_SIZE * (threads+1));
+	addr = (u64) m;
+	if (m == NULL) {
+		fprintf(stderr, "Not enough memory to store results\n");
+		return 1;
+	}
+	results = (unsigned long long **) malloc(sizeof(unsigned long long *)
+							* threads);
+	if (results == NULL) {
+		fprintf(stderr, "Not enough memory to store results\n");
+		free (m);
+		return 1;
+	}
+	r = (char *) ((addr + CACHELINE_SIZE - 1) & ~(CACHELINE_SIZE - 1));
+	for (i = 0; i < threads; i++)
+		results[i] = (unsigned long long *) &r[i * CACHELINE_SIZE];
+
+	for (i = 0; i < threads; i++) {
+		int cpu;
+
+		pthread_attr_init(&attr);
+		cur_node = get_next_node(cur_node);
+		cpu = get_next_cpu(cur_node);
+		if (cur_node < 0 || cpu < 0) {
+			fprintf(stderr, "Cannot set thread to cpu. \n");
+			return 1;
+		}
+		set_attr_to_cpu(&attr, cpu);
+		pthread_create(&tid, &attr, do_mmap, results[i]);
+		pthread_attr_destroy(&attr);
+	}
+
+	if (bench_format == BENCH_FORMAT_DEFAULT) {
+		printf("# Repeatedly memory map %s bytes and touch %llu bytes"
+			" for %d sec with %d threads ...\n", length_str,
+			(unsigned long long) len/stride, runtime, threads);
+		printf("# Warming up ");
+		fflush(stdout);
+	}
+
+	while (1) {
+		double elapsed, tput;
+
+		sleep(1);
+		BUG_ON(gettimeofday(&tv_start, NULL));
+		do_cnt = true;
+		sleep(runtime);
+		do_cnt = false;
+		BUG_ON(gettimeofday(&tv_end, NULL));
+		timersub(&tv_end, &tv_start, &tv_diff);
+		elapsed = timeval2double(&tv_diff);
+
+		sum = 0.0;
+		max = 0.0;
+		if (bench_format == BENCH_FORMAT_DEFAULT) {
+			if (itr < warmup)
+				printf(".");
+			else if (itr == warmup)
+				printf("\n\n");
+			fflush(stdout);
+		}
+
+		for (i = 0; i < threads; i++) {
+			unsigned long long val = *(results[i]);
+
+			*(results[i]) = 0;
+			tput = val / elapsed;
+			if (i == 0)
+				min = tput;
+			else if (tput < min)
+				min = tput;
+
+			if (tput > max)
+				max = tput;
+
+			sum += tput;
+		}
+
+		if (itr++ > warmup) {
+			tput_total += sum;
+			tsq += (double) (sum*sum);
+		}
+
+		if (verbose && (itr > warmup))
+			printf("iteration:%d   %10lf (mmap/unmap per sec)"
+				" %10lf (MB accessed per sec)\n",
+				(itr-warmup), sum, sum*len/(stride*1024*1024));
+
+		if (itr > (iterations + warmup)) {
+			mean = tput_total/iterations;
+			sdv = sqrt((tsq - mean*mean*iterations)/iterations);
+			/* convert to percentage */
+			sdv = 100.0*(sdv/mean);
+			break;
+		}
+	}
+
+	switch (bench_format) {
+	case BENCH_FORMAT_DEFAULT:
+		printf(" %d threads %10lf (mmap/munmap per sec),"
+			"  access %10lf (MB/sec) (std dev +/- %-10lf %%)\n",
+			threads, mean, mean*len/(stride*1024*1024), sdv);
+		break;
+	case BENCH_FORMAT_SIMPLE:
+		printf(" %d %-10lf %-10lf %-10lf \n", threads, mean,
+			mean*len/(stride*1024*1024), sdv);
+		break;
+	default:
+		/* reaching this means there's some disaster: */
+		die("unknown format: %d\n", bench_format);
+		break;
+	}
+
+	free (m);
+	free (results);
+	return 0;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index 77298bf..7cd4218 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -64,6 +64,9 @@ static struct bench_suite mem_suites[] = {
 	{ "memcpy",
 	  "Simple memory copy in various ways",
 	  bench_mem_memcpy },
+	{ "mmap",
+	  "Simple memory map + memory set in various ways",
+	  bench_mem_mmap },
 	{ "memset",
 	  "Simple memory set in various ways",
 	  bench_mem_memset },
-- 
1.7.11.7




^ permalink raw reply related	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2013-11-04 22:36 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1380748401.git.tim.c.chen@linux.intel.com>
2013-10-02 22:38 ` [PATCH v8 0/9] rwsem performance optimizations Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-03  7:32   ` Ingo Molnar
2013-10-03  7:32     ` Ingo Molnar
2013-10-07 22:57     ` Tim Chen
2013-10-07 22:57       ` Tim Chen
2013-10-09  6:15       ` Ingo Molnar
2013-10-09  6:15         ` Ingo Molnar
2013-10-09  7:28         ` Peter Zijlstra
2013-10-09  7:28           ` Peter Zijlstra
2013-10-10  3:14           ` Linus Torvalds
2013-10-10  3:14             ` Linus Torvalds
2013-10-10  5:03             ` Davidlohr Bueso
2013-10-10  5:03               ` Davidlohr Bueso
2013-10-09 16:34         ` Tim Chen
2013-10-09 16:34           ` Tim Chen
2013-10-10  7:54           ` Ingo Molnar
2013-10-10  7:54             ` Ingo Molnar
2013-10-16  0:09             ` Tim Chen
2013-10-16  0:09               ` Tim Chen
2013-10-16  6:55               ` Ingo Molnar
2013-10-16  6:55                 ` Ingo Molnar
2013-10-16 18:28                 ` Tim Chen
2013-10-16 18:28                   ` Tim Chen
2013-11-04 22:36                   ` Tim Chen
2013-11-04 22:36                     ` Tim Chen
2013-10-16 21:55                 ` Tim Chen
2013-10-16 21:55                   ` Tim Chen
2013-10-18  6:52                   ` Ingo Molnar
2013-10-18  6:52                     ` Ingo Molnar
2013-10-02 22:38 ` [PATCH v8 1/9] rwsem: check the lock before cpmxchg in down_write_trylock Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 2/9] rwsem: remove 'out' label in do_wake Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 3/9] rwsem: remove try_reader_grant label do_wake Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 4/9] rwsem/wake: check lock before do atomic update Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-08 19:51   ` Rafael Aquini
2013-10-08 19:51     ` Rafael Aquini
2013-10-08 20:34     ` Tim Chen
2013-10-08 20:34       ` Tim Chen
2013-10-08 21:31       ` Rafael Aquini
2013-10-08 21:31         ` Rafael Aquini
2013-10-02 22:38 ` [PATCH v8 6/9] MCS Lock: optimizations and extra comments Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 7/9] MCS Lock: Barrier corrections Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 8/9] rwsem: do optimistic spinning for writer lock acquisition Tim Chen
2013-10-02 22:38   ` Tim Chen
2013-10-02 22:38 ` [PATCH v8 9/9] rwsem: reduce spinlock contention in wakeup code path Tim Chen
2013-10-02 22:38   ` Tim Chen
