* [PATCH RFC 0/3] mutex: Improve mutex performance by doing less atomic-ops & spinning
@ 2013-04-04 14:54 Waiman Long
  2013-04-04 14:54 ` [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Waiman Long @ 2013-04-04 14:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra
  Cc: Davidlohr Bueso, Waiman Long, linux-kernel, Chandramouleeswaran, Aswin

This patch set is a collection of 3 different mutex-related patches
aimed at improving mutex performance, especially for systems with a
large number of CPUs. This is achieved by doing fewer atomic operations
and less mutex spinning (when CONFIG_MUTEX_SPIN_ON_OWNER is enabled).

The first patch reduces the number of atomic operations executed. It
can produce a dramatic performance improvement in the AIM7 benchmark
on systems with a large number of CPUs. For example, there was a more
than 3X improvement in the high_systime workload with a 3.7.10 kernel
on an 8-socket x86-64 system with 80 cores. The 3.8 kernels, on the
other hand, are no longer mutex-limited for that workload, so there
the performance improvement is only about 1%.

Patches 2 and 3 represent two different ways to reduce mutex spinning.
Of the two, the third one is better from both a performance perspective
and because it requires no change to the mutex data structure. See the
individual patch descriptions for more information.

The table below shows the performance impact on the AIM7 benchmark with
a 3.8.5 kernel running on the same 8-socket system mentioned above:

+--------------+------------------------------------------------------+
|   Workload   |              Mean % Change 10-100 users              |
|              +-----------------+-----------------+------------------+
|              |   Patches 1+2   |   Patches 1+3   | Relative %Change |
+--------------+-----------------+-----------------+------------------+
| fserver      |     +1.7%       |      0.0%       |     -1.7%        |
| new_fserver  |     -0.2%       |     -1.5%       |     -1.2%        |
+--------------+-----------------+-----------------+------------------+
|   Workload   |             Mean % Change 100-1000 users             |
|              +-----------------+-----------------+------------------+
|              |   Patches 1+2   |   Patches 1+3   | Relative %Change |
+--------------+-----------------+-----------------+------------------+
| fserver      |    +18.6%       |    +43.4%       |    +21.0%        |
| new_fserver  |    +14.0%       |    +23.4%       |     +8.2%        |
+--------------+-----------------+-----------------+------------------+
|   Workload   |             Mean % Change 1100-2000 users            |
|              +-----------------+-----------------+------------------+
|              |   Patches 1+2   |   Patches 1+3   | Relative %Change |
+--------------+-----------------+-----------------+------------------+
| fserver      |    +11.6%       |     +5.1%       |     -5.8%        |
| new_fserver  |    +13.3%       |     +7.6%       |     -5.0%        |
+--------------+-----------------+-----------------+------------------+

So patch 2 is better at low and high load, while patch 3 is better at
intermediate load. For the other AIM7 workloads, patch 3 is generally
better.

Waiman Long (3):
  mutex: Make more scalable by doing less atomic operations
  mutex: restrict mutex spinning to only one task per mutex
  mutex: dynamically disable mutex spinning at high load

 arch/x86/include/asm/mutex.h |   16 ++++++++++++++++
 include/linux/mutex.h        |    3 +++
 kernel/mutex.c               |   21 ++++++++++++++++++---
 kernel/mutex.h               |    8 ++++++++
 kernel/sched/core.c          |   22 ++++++++++++++++++++++
 5 files changed, 67 insertions(+), 3 deletions(-)



* [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-04 14:54 [PATCH RFC 0/3] mutex: Improve mutex performance by doing less atomic-ops & spinning Waiman Long
@ 2013-04-04 14:54 ` Waiman Long
  2013-04-08 12:42   ` Ingo Molnar
  2013-04-04 14:54 ` [PATCH RFC 2/3] mutex: restrict mutex spinning to only one task per mutex Waiman Long
  2013-04-04 14:54 ` [PATCH RFC 3/3] mutex: dynamically disable mutex spinning at high load Waiman Long
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-04-04 14:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra
  Cc: Davidlohr Bueso, Waiman Long, linux-kernel, Chandramouleeswaran, Aswin

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch reduces the number of atomic_xchg instructions issued by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read, whereas the
atomic_xchg() function can be two orders of magnitude or more costlier.
By using atomic_read() to check the value before calling atomic_xchg(),
we can avoid a lot of unnecessary cache coherency traffic. The only
downside of this change is that a task on the slow path has a slightly
lower chance of getting the mutex when competing with another task
in the fast path.

The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().
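
In isolation, the idea is the classic check-before-RMW pattern sketched
below (illustrative only; mutex_try_to_take() is a made-up helper name,
the real changes are in the diff further down):

static inline bool mutex_try_to_take(atomic_t *count)
{
	/* Cheap shared read first; skip the RMW if it cannot succeed. */
	if (atomic_read(count) != 1)
		return false;

	/* Only now pay for the cache-line-invalidating atomic RMW. */
	return atomic_cmpxchg(count, 1, 0) == 1;
}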

The mutex locking and unlocking code for the x86 architecture allows
any negative number in the mutex count to indicate that some tasks are
waiting for the mutex. I am not sure whether that is the case for the
other architectures, so the default is to skip atomic_xchg() only when
the count has already been set to -1. For x86, the check is extended
to all negative numbers to cover more cases.

The following table shows the scalability data on an 8-node 80-core
Westmere box with a 3.7.10 kernel. The numactl command is used to
restrict the running of the high_systime workloads to 1/2/4/8 nodes
with hyperthreading on and off.

+-----------------+------------------+------------------+----------+
|  Configuration  | Mean Transaction | Mean Transaction | % Change |
|                 |  Rate w/o patch  | Rate with patch  |          |
+-----------------+------------------------------------------------+
|                 |              User Range 1100 - 2000            |
+-----------------+------------------------------------------------+
| 8 nodes, HT on  |      36980       |      148590      | +301.8%  |
| 8 nodes, HT off |      42799       |      145011      | +238.8%  |
| 4 nodes, HT on  |      61318       |      118445      |  +51.1%  |
| 4 nodes, HT off |     158481       |      158592      |   +0.1%  |
| 2 nodes, HT on  |     180602       |      173967      |   -3.7%  |
| 2 nodes, HT off |     198409       |      198073      |   -0.2%  |
| 1 node , HT on  |     149042       |      147671      |   -0.9%  |
| 1 node , HT off |     126036       |      126533      |   +0.4%  |
+-----------------+------------------------------------------------+
|                 |              User Range 200 - 1000             |
+-----------------+------------------------------------------------+
| 8 nodes, HT on  |      41525       |      122349      | +194.6%  |
| 8 nodes, HT off |      49866       |      124032      | +148.7%  |
| 4 nodes, HT on  |      66409       |      106984      |  +61.1%  |
| 4 nodes, HT off |     119880       |      130508      |   +8.9%  |
| 2 nodes, HT on  |     138003       |      133948      |   -2.9%  |
| 2 nodes, HT off |     132792       |      131997      |   -0.6%  |
| 1 node , HT on  |     116593       |      115859      |   -0.6%  |
| 1 node , HT off |     104499       |      104597      |   +0.1%  |
+-----------------+------------------+------------------+----------+

The AIM7 benchmark has a pretty large run-to-run variance due to the
random nature of the subtests executed, so a difference of less than
+-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node counts.  The patch has practically no impact on 1- and
2-node systems.

The table below shows the percentage of time (as reported by perf
record -a -s -g) spent in the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |      65.34%     |      0.69%       |  -99%   |
|    4 nodes    |       8.70%     |      1.02%       |  -88%   |
|    2 nodes    |       0.41%     |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
| five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
| fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
| new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
| shared       |    +13.1%     |   +146.1%      |   +181.5%       |
| short        |     +7.4%     |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 arch/x86/include/asm/mutex.h |   16 ++++++++++++++++
 kernel/mutex.c               |    9 ++++++---
 kernel/mutex.h               |    8 ++++++++
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..aa6a3ec 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,3 +3,19 @@
 #else
 # include <asm/mutex_64.h>
 #endif
+
+#ifndef	__ASM_MUTEX_H
+#define	__ASM_MUTEX_H
+
+#ifdef MUTEX_SHOULD_XCHG_COUNT
+#undef MUTEX_SHOULD_XCHG_COUNT
+#endif
+/*
+ * For the x86 architecture, it allows any negative number (besides -1) in
+ * the mutex counter to indicate that some other threads are waiting on the
+ * mutex. So the atomic_xchg() function should not be called in
+ * __mutex_lock_common() if the value of the counter has already been set
+ * to a negative number.
+ */
+#define MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count) >= 0)
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..5e5ea03 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -171,7 +171,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+		if ((atomic_read(&lock->count) == 1) &&
+		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
 			lock_acquired(&lock->dep_map, ip);
 			mutex_set_owner(lock);
 			preempt_enable();
@@ -205,7 +206,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
 
-	if (atomic_xchg(&lock->count, -1) == 1)
+	if (MUTEX_SHOULD_XCHG_COUNT(lock) &&
+	   (atomic_xchg(&lock->count, -1) == 1))
 		goto done;
 
 	lock_contended(&lock->dep_map, ip);
@@ -220,7 +222,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * that when we release the lock, we properly wake up the
 		 * other waiters:
 		 */
-		if (atomic_xchg(&lock->count, -1) == 1)
+		if (MUTEX_SHOULD_XCHG_COUNT(lock) &&
+		   (atomic_xchg(&lock->count, -1) == 1))
 			break;
 
 		/*
diff --git a/kernel/mutex.h b/kernel/mutex.h
index 4115fbf..b873f8e 100644
--- a/kernel/mutex.h
+++ b/kernel/mutex.h
@@ -46,3 +46,11 @@ static inline void
 debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 }
+
+/*
+ * The atomic_xchg() function should not be called in __mutex_lock_common()
+ * if the value of the counter has already been set to -1.
+ */
+#ifndef MUTEX_SHOULD_XCHG_COUNT
+#define	MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count) != -1)
+#endif
-- 
1.7.1



* [PATCH RFC 2/3] mutex: restrict mutex spinning to only one task per mutex
  2013-04-04 14:54 [PATCH RFC 0/3] mutex: Improve mutex performance by doing less atomic-ops & spinning Waiman Long
  2013-04-04 14:54 ` [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations Waiman Long
@ 2013-04-04 14:54 ` Waiman Long
  2013-04-04 14:54 ` [PATCH RFC 3/3] mutex: dynamically disable mutex spinning at high load Waiman Long
  2 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-04-04 14:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra
  Cc: Davidlohr Bueso, Waiman Long, linux-kernel, Chandramouleeswaran, Aswin

The current mutex spinning code allows multiple tasks to spin on a
single mutex concurrently. There are two major problems with this
approach:

 1. It is not very energy efficient, as the spinning tasks are not
    doing useful work. The spinning tasks may also block other more
    important or useful tasks from running, since preemption is
    disabled. Only one of the spinners will get the mutex at any
    time; the other spinners will have to wait much longer to get it.

 2. The mutex data structure on x86-64 should be 32 bytes. The spinning
    code spins on lock->owner which, in most cases, should be in the same
    64-byte cache line as the lock->wait_lock spinlock. As a result,
    the mutex spinners contend for the same cache line with other
    CPUs trying to acquire the spinlock, leading to increased time spent
    both on the spinlock and on the mutex spinning.

These problems are worse on systems with a large number of CPUs. One
way to reduce their effect is to allow only one task to spin on a
mutex at any time.

This patch adds a new spinner field to struct mutex (in mutex.h) to
limit the number of spinners to one task. That increases the size of
the mutex by 8 bytes in a 64-bit environment (4 bytes in a 32-bit
environment).

The AIM7 benchmarks were run on 3.7.10-derived kernels to show the
performance changes with this patch on an 8-socket 80-core system
with hyperthreading off.  The table below shows the mean % change
in performance over a range of users for some AIM7 workloads with
just the less-atomic-operation patch (patch 1) vs. that patch plus
this one (patches 1+2).

+--------------+-----------------+-----------------+-----------------+
|   Workload   | mean % change   | mean % change   | mean % change   |
|              | 10-100 users    | 200-1000 users  | 1100-2000 users |
+--------------+-----------------+-----------------+-----------------+
| alltests     |     -0.2%       |     -3.8%       |    -4.2%        |
| five_sec     |     -0.6%       |     -2.0%       |    -2.4%        |
| fserver      |     +2.2%       |    +16.2%       |    +2.2%        |
| high_systime |     -0.3%       |     -4.3%       |    -3.0%        |
| new_fserver  |     +3.9%       |    +16.0%       |    +9.5%        |
| shared       |     -1.7%       |     -5.0%       |    -4.0%        |
| short        |     -7.7%       |     +0.2%       |    +1.3%        |
+--------------+-----------------+-----------------+-----------------+

It can be seen that this patch improves performance for the fserver and
new_fserver workloads while suffering a slight drop in performance
for the other workloads.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 include/linux/mutex.h |    3 +++
 kernel/mutex.c        |   12 ++++++++++++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..dd8fdb8 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -50,6 +50,9 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+	struct task_struct	*spinner;
+#endif
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
 	struct task_struct	*owner;
 #endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 5e5ea03..965f59f 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -158,7 +158,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 *
 	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
 	 * to serialize everything.
+	 *
+	 * Only first task is allowed to spin on a given mutex and that
+	 * task will put its task_struct pointer into the spinner field.
 	 */
+	if (lock->spinner || (cmpxchg(&lock->spinner, NULL, current) != NULL))
+		goto slowpath;
 
 	for (;;) {
 		struct task_struct *owner;
@@ -175,6 +180,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
 			lock_acquired(&lock->dep_map, ip);
 			mutex_set_owner(lock);
+			lock->spinner = NULL;
 			preempt_enable();
 			return 0;
 		}
@@ -196,6 +202,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 */
 		arch_mutex_cpu_relax();
 	}
+
+	/*
+	 * Done with spinning
+	 */
+	lock->spinner = NULL;
+slowpath:
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);
 
-- 
1.7.1



* [PATCH RFC 3/3] mutex: dynamically disable mutex spinning at high load
  2013-04-04 14:54 [PATCH RFC 0/3] mutex: Improve mutex performance by doing less atomic-ops & spinning Waiman Long
  2013-04-04 14:54 ` [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations Waiman Long
  2013-04-04 14:54 ` [PATCH RFC 2/3] mutex: restrict mutex spinning to only one task per mutex Waiman Long
@ 2013-04-04 14:54 ` Waiman Long
  2 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-04-04 14:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra
  Cc: Davidlohr Bueso, Waiman Long, linux-kernel, Chandramouleeswaran, Aswin

The Linux mutex code has a MUTEX_SPIN_ON_OWNER configuration
option that is enabled by default in major distributions like Red
Hat. Allowing threads waiting on a mutex to spin while the mutex owner
is running will theoretically reduce the latency of mutex acquisition
at the expense of energy efficiency, as the spinning threads do no
useful work.

This is not a problem on a lightly loaded system where the CPU may
be idle anyway. On a highly loaded system, the spinning tasks may
block other tasks from running, even ones with higher priority,
because the spinning is done with preemption disabled.

This patch disables mutex spinning if the current load is high
enough. The load is considered high if there are 2 or more active tasks
waiting to run on the current CPU. If there is only one task waiting,
the code checks the load averaged over the past minute
(calc_load_tasks). If that is more than double the number of active
CPUs, the load is also considered high.  This is a rather simple
metric that does not incur much additional overhead.

The AIM7 benchmarks were run on 3.7.10-derived kernels to show the
performance changes with this patch on an 8-socket 80-core system
with hyperthreading off.  The table below shows the mean % change
in performance over a range of users for some AIM7 workloads with
just the less-atomic-operation patch (patch 1) vs. that patch plus
this one (patches 1+3).

+--------------+-----------------+-----------------+-----------------+
|   Workload   | mean % change   | mean % change   | mean % change   |
|              | 10-100 users    | 200-1000 users  | 1100-2000 users |
+--------------+-----------------+-----------------+-----------------+
| alltests     |      0.0%       |     -0.1%       |    +5.0%        |
| five_sec     |     +1.5%       |     +1.3%       |    +1.3%        |
| fserver      |     +1.5%       |    +25.4%       |    +9.6%        |
| high_systime |     +0.1%       |      0.0%       |    +0.8%        |
| new_fserver  |     +0.2%       |    +11.9%       |   +14.1%        |
| shared       |     -1.2%       |     +0.3%       |    +1.8%        |
| short        |     +6.4%       |     +2.5%       |    +3.0%        |
+--------------+-----------------+-----------------+-----------------+

It can be seen that this patch provides a big performance improvement
for the fserver and new_fserver workloads while still being generally
positive for the other AIM7 workloads.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 kernel/sched/core.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..f667d63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3021,9 +3021,31 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
  */
 int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 {
+	unsigned int nrun;
+
 	if (!sched_feat(OWNER_SPIN))
 		return 0;
 
+	/*
+	 * Mutex spinning should be temporarily disabled if the load on
+	 * the current CPU is high. The load is considered high if there
+	 * are 2 or more active tasks waiting to run on this CPU. On the
+	 * other hand, if there is another task waiting and the global
+	 * load (calc_load_tasks - including uninterruptible tasks) is
+	 * bigger than 2X the # of CPUs available, it is also considered
+	 * to be high load.
+	 */
+	nrun = this_rq()->nr_running;
+	if (nrun >= 3)
+		return 0;
+	else if (nrun == 2) {
+		long active = atomic_long_read(&calc_load_tasks);
+		int  ncpu   = num_online_cpus();
+
+		if (active > 2*ncpu)
+			return 0;
+	}
+
 	rcu_read_lock();
 	while (owner_running(lock, owner)) {
 		if (need_resched())
-- 
1.7.1



* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-04 14:54 ` [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations Waiman Long
@ 2013-04-08 12:42   ` Ingo Molnar
  2013-04-08 14:38     ` Linus Torvalds
  2013-04-08 17:42     ` Waiman Long
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2013-04-08 12:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra,
	Davidlohr Bueso, linux-kernel, Chandramouleeswaran, Aswin,
	Linus Torvalds, Peter Zijlstra, Andrew Morton


* Waiman Long <Waiman.Long@hp.com> wrote:

> In the __mutex_lock_common() function, an initial entry into
> the lock slow path will cause two atomic_xchg instructions to be
> issued. Together with the atomic decrement in the fast path, a total
> of three atomic read-modify-write instructions will be issued in
> rapid succession. This can cause a lot of cache bouncing when many
> tasks are trying to acquire the mutex at the same time.
> 
> This patch will reduce the number of atomic_xchg instructions used by
> checking the counter value first before issuing the instruction. The
> atomic_read() function is just a simple memory read. The atomic_xchg()
> function, on the other hand, can be up to 2 order of magnitude or even
> more in cost when compared with atomic_read(). By using atomic_read()
> to check the value first before calling atomic_xchg(), we can avoid a
> lot of unnecessary cache coherency traffic. The only downside with this
> change is that a task on the slow path will have a tiny bit
> less chance of getting the mutex when competing with another task
> in the fast path.
> 
> The same is true for the atomic_cmpxchg() function in the
> mutex-spin-on-owner loop. So an atomic_read() is also performed before
> calling atomic_cmpxchg().
> 
> The mutex locking and unlocking code for the x86 architecture can allow
> any negative number to be used in the mutex count to indicate that some
> tasks are waiting for the mutex. I am not so sure if that is the case
> for the other architectures. So the default is to avoid atomic_xchg()
> if the count has already been set to -1. For x86, the check is modified
> to include all negative numbers to cover a larger case.
> 
> The following table shows the scalability data on an 8-node 80-core
> Westmere box with a 3.7.10 kernel. The numactl command is used to
> restrict the running of the high_systime workloads to 1/2/4/8 nodes
> with hyperthreading on and off.
> 
> +-----------------+------------------+------------------+----------+
> |  Configuration  | Mean Transaction | Mean Transaction | % Change |
> |		  |  Rate w/o patch  | Rate with patch  |	   |
> +-----------------+------------------------------------------------+
> |		  |              User Range 1100 - 2000		   |
> +-----------------+------------------------------------------------+
> | 8 nodes, HT on  |      36980       |      148590      | +301.8%  |
> | 8 nodes, HT off |      42799       |      145011      | +238.8%  |
> | 4 nodes, HT on  |      61318       |      118445      |  +51.1%  |
> | 4 nodes, HT off |     158481       |      158592      |   +0.1%  |
> | 2 nodes, HT on  |     180602       |      173967      |   -3.7%  |
> | 2 nodes, HT off |     198409       |      198073      |   -0.2%  |
> | 1 node , HT on  |     149042       |      147671      |   -0.9%  |
> | 1 node , HT off |     126036       |      126533      |   +0.4%  |
> +-----------------+------------------------------------------------+
> |		  |              User Range 200 - 1000		   |
> +-----------------+------------------------------------------------+
> | 8 nodes, HT on  |      41525       |      122349      | +194.6%  |
> | 8 nodes, HT off |      49866       |      124032      | +148.7%  |
> | 4 nodes, HT on  |      66409       |      106984      |  +61.1%  |
> | 4 nodes, HT off |     119880       |      130508      |   +8.9%  |
> | 2 nodes, HT on  |     138003       |      133948      |   -2.9%  |
> | 2 nodes, HT off |     132792       |      131997      |   -0.6%  |
> | 1 node , HT on  |     116593       |      115859      |   -0.6%  |
> | 1 node , HT off |     104499       |      104597      |   +0.1%  |
> +-----------------+------------------+------------------+----------+

Hm, the most interesting/important measurement range is missing: 1-100 users?

> 
> AIM7 benchmark run has a pretty large run-to-run variance due to random
> nature of the subtests executed. So a difference of less than +-5%
> may not be really significant.
> 
> This patch improves high_systime workload performance at 4 nodes
> and up by maintaining transaction rates without significant drop-off
> at high node count.  The patch has practically no impact on 1 and 2
> nodes system.
> 
> The table below shows the percentage time (as reported by perf
> record -a -s -g) spent on the __mutex_lock_slowpath() function by
> the high_systime workload at 1500 users for 2/4/8-node configurations
> with hyperthreading off.
> 
> +---------------+-----------------+------------------+---------+
> | Configuration | %Time w/o patch | %Time with patch | %Change |
> +---------------+-----------------+------------------+---------+
> |    8 nodes    |      65.34%     |      0.69%       |  -99%   |
> |    4 nodes    |       8.70%	    |      1.02%       |  -88%   |
> |    2 nodes    |       0.41%     |      0.32%       |  -22%   |
> +---------------+-----------------+------------------+---------+
> 
> It is obvious that the dramatic performance improvement at 8
> nodes was due to the drastic cut in the time spent within the
> __mutex_lock_slowpath() function.
> 
> The table below show the improvements in other AIM7 workloads (at 8
> nodes, hyperthreading off).
> 
> +--------------+---------------+----------------+-----------------+
> |   Workload   | mean % change | mean % change  | mean % change   |
> |              | 10-100 users  | 200-1000 users | 1100-2000 users |
> +--------------+---------------+----------------+-----------------+
> | alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
> | five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
> | fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
> | new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
> | shared       |    +13.1%     |   +146.1%      |   +181.5%       |
> | short        |     +7.4%     |     +5.0%      |     +4.2%       |
> +--------------+---------------+----------------+-----------------+
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> ---
>  arch/x86/include/asm/mutex.h |   16 ++++++++++++++++
>  kernel/mutex.c               |    9 ++++++---
>  kernel/mutex.h               |    8 ++++++++
>  3 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
> index 7d3a482..aa6a3ec 100644
> --- a/arch/x86/include/asm/mutex.h
> +++ b/arch/x86/include/asm/mutex.h
> @@ -3,3 +3,19 @@
>  #else
>  # include <asm/mutex_64.h>
>  #endif
> +
> +#ifndef	__ASM_MUTEX_H
> +#define	__ASM_MUTEX_H
> +
> +#ifdef MUTEX_SHOULD_XCHG_COUNT
> +#undef MUTEX_SHOULD_XCHG_COUNT
> +#endif
> +/*
> + * For the x86 architecture, it allows any negative number (besides -1) in
> + * the mutex counter to indicate that some other threads are waiting on the
> + * mutex. So the atomic_xchg() function should not be called in
> + * __mutex_lock_common() if the value of the counter has already been set
> + * to a negative number.
> + */
> +#define MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count) >= 0)
> +#endif
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index 52f2301..5e5ea03 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -171,7 +171,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  		if (owner && !mutex_spin_on_owner(lock, owner))
>  			break;
>  
> -		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> +		if ((atomic_read(&lock->count) == 1) &&
> +		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
>  			lock_acquired(&lock->dep_map, ip);
>  			mutex_set_owner(lock);
>  			preempt_enable();
> @@ -205,7 +206,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  	list_add_tail(&waiter.list, &lock->wait_list);
>  	waiter.task = task;
>  
> -	if (atomic_xchg(&lock->count, -1) == 1)
> +	if (MUTEX_SHOULD_XCHG_COUNT(lock) &&
> +	   (atomic_xchg(&lock->count, -1) == 1))
>  		goto done;
>  
>  	lock_contended(&lock->dep_map, ip);
> @@ -220,7 +222,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  		 * that when we release the lock, we properly wake up the
>  		 * other waiters:
>  		 */
> -		if (atomic_xchg(&lock->count, -1) == 1)
> +		if (MUTEX_SHOULD_XCHG_COUNT(lock) &&
> +		   (atomic_xchg(&lock->count, -1) == 1))
>  			break;
>  
>  		/*
> diff --git a/kernel/mutex.h b/kernel/mutex.h
> index 4115fbf..b873f8e 100644
> --- a/kernel/mutex.h
> +++ b/kernel/mutex.h
> @@ -46,3 +46,11 @@ static inline void
>  debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
>  {
>  }
> +
> +/*
> + * The atomic_xchg() function should not be called in __mutex_lock_common()
> + * if the value of the counter has already been set to -1.
> + */
> +#ifndef MUTEX_SHOULD_XCHG_COUNT
> +#define	MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count) != -1)
> +#endif

So, AFAICS, what we are doing here is to replace the rather aggressive 
modification+dirtying 'trylock' attempt loop of the mutex cacheline with softer 
atomic_read() based polling, and thus keep the polling nicer.

This helps you, because with 8 nodes and a lot of inter-node cacheline transfer 
latency, all the cores polling the same cachelines is going to show up big time 
and will reduce general system performance by eating up memory bandwidth.

I think doing a change into this general direction (be less aggressive in 
contention) might be reasonable, but its effect on the small range should be 
measured more accurately I think.

AFAICS the main performance trade-off is the following: when the owner CPU unlocks 
the mutex, we'll poll it via a read first, which turns the cacheline into 
shared-read MESI state. Then we notice that its content signals 'lock is 
available', and we attempt the trylock again.

This increases lock latency in the few-contended-tasks case slightly - and we'd 
like to know by precisely how much, not just for a generic '10-100 users' case 
which does not tell much about the contention level.

Furthermore, since you are seeing this effect so profoundly, have you considered 
using another approach, such as queueing all the poll-waiters in some fashion? 

That would optimize your workload additionally: removing the 'stampede' of trylock 
attempts when an unlock happens - only a single wait-poller would get the lock.

I don't think patch #2 is really meaningful - AFAICS it replaces the polling of a 
field in the mutex cacheline by polling another field in the mutex cacheline. That 
does not truly queue tasks, as mutex->spinner will still be stampeded on when it 
gets set to NULL by an unlock.

AFAICS patch #3 is really trying to work around the limitations of patch #1 I 
think, via arbitrary looking tuning limits. Furthermore, its access to the 
SMP-global calc_load_tasks variable is not very good looking either.

So instead of these 3 patches I think we should try a single, conceptually more 
robust, queueing based mutex-owner slow path spin-loop. If done right then I think 
it should improve performance in pretty much _all_ cases, without weird load 
limits.

Btw., I think the sensitivity of your numbers to HyperThreading is interesting. 
You might want to try to remove the arch_mutex_cpu_relax() call (or make it less 
frequent) and see what happens - does that improve the HT numbers?

Thanks,

	Ingo


* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 12:42   ` Ingo Molnar
@ 2013-04-08 14:38     ` Linus Torvalds
  2013-04-08 15:09       ` Ingo Molnar
                         ` (2 more replies)
  2013-04-08 17:42     ` Waiman Long
  1 sibling, 3 replies; 18+ messages in thread
From: Linus Torvalds @ 2013-04-08 14:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton

On Mon, Apr 8, 2013 at 5:42 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> AFAICS the main performance trade-off is the following: when the owner CPU unlocks
> the mutex, we'll poll it via a read first, which turns the cacheline into
> shared-read MESI state. Then we notice that its content signals 'lock is
> available', and we attempt the trylock again.
>
> This increases lock latency in the few-contended-tasks case slightly - and we'd
> like to know by precisely how much, not just for a generic '10-100 users' case
> which does not tell much about the contention level.

We had this problem for *some* lock where we used a "read + cmpxchg"
in the hotpath and it caused us problems due to two cacheline state
transitions (first to shared, then to exclusive). It was faster to
just assume it was unlocked and try to do an immediate cmpxchg.

But iirc it is a non-issue for this case, because this is only about
the contended slow path.

I forget where we saw the case where we should *not* read the initial
value, though. Anybody remember?

That said, the MUTEX_SHOULD_XCHG_COUNT macro should die. Why shouldn't
all architectures just consider negative counts to be locked? It
doesn't matter that some might only ever see -1.

            Linus


* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 14:38     ` Linus Torvalds
@ 2013-04-08 15:09       ` Ingo Molnar
  2013-04-08 17:53       ` Waiman Long
  2013-04-10 14:09       ` Robin Holt
  2 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2013-04-08 15:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Apr 8, 2013 at 5:42 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > AFAICS the main performance trade-off is the following: when the owner CPU unlocks
> > the mutex, we'll poll it via a read first, which turns the cacheline into
> > shared-read MESI state. Then we notice that its content signals 'lock is
> > available', and we attempt the trylock again.
> >
> > This increases lock latency in the few-contended-tasks case slightly - and we'd
> > like to know by precisely how much, not just for a generic '10-100 users' case
> > which does not tell much about the contention level.
> 
> We had this problem for *some* lock where we used a "read + cmpxchg" in the 
> hotpath and it caused us problems due to two cacheline state transitions (first 
> to shared, then to exclusive). It was faster to just assume it was unlocked and 
> try to do an immediate cmpxchg.
> 
> But iirc it is a non-issue for this case, because this is only about the 
> contended slow path.
> 
> I forget where we saw the case where we should *not* read the initial value, 
> though. Anybody remember?

I had this vague recollection too - and some digging suggests that it might have 
been this discussion on lkml about 3 years ago:

  [RFC][PATCH 6/8] mm: handle_speculative_fault()

These numbers PeterZ ran:

  http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/00170.html

Appear to show such an effect, on a smaller NUMA system.

( But I'm quite sure it came up somewhere else as well, just cannot place it.
  Probabilistic biological search indices are annoying.)

Thanks,

	Ingo


* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 12:42   ` Ingo Molnar
  2013-04-08 14:38     ` Linus Torvalds
@ 2013-04-08 17:42     ` Waiman Long
  2013-04-10 10:28       ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-04-08 17:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra,
	Davidlohr Bueso, linux-kernel, Chandramouleeswaran, Aswin,
	Linus Torvalds, Peter Zijlstra, Andrew Morton, Norton, Scott J,
	Rik van Riel

On 04/08/2013 08:42 AM, Ingo Molnar wrote:
> * Waiman Long <Waiman.Long@hp.com> wrote:
>
>> In the __mutex_lock_common() function, an initial entry into
>> the lock slow path will cause two atomic_xchg instructions to be
>> issued. Together with the atomic decrement in the fast path, a total
>> of three atomic read-modify-write instructions will be issued in
>> rapid succession. This can cause a lot of cache bouncing when many
>> tasks are trying to acquire the mutex at the same time.
>>
>> This patch will reduce the number of atomic_xchg instructions used by
>> checking the counter value first before issuing the instruction. The
>> atomic_read() function is just a simple memory read. The atomic_xchg()
>> function, on the other hand, can be up to 2 order of magnitude or even
>> more in cost when compared with atomic_read(). By using atomic_read()
>> to check the value first before calling atomic_xchg(), we can avoid a
>> lot of unnecessary cache coherency traffic. The only downside with this
>> change is that a task on the slow path will have a tiny bit
>> less chance of getting the mutex when competing with another task
>> in the fast path.
>>
>> The same is true for the atomic_cmpxchg() function in the
>> mutex-spin-on-owner loop. So an atomic_read() is also performed before
>> calling atomic_cmpxchg().
>>
>> The mutex locking and unlocking code for the x86 architecture can allow
>> any negative number to be used in the mutex count to indicate that some
>> tasks are waiting for the mutex. I am not so sure if that is the case
>> for the other architectures. So the default is to avoid atomic_xchg()
>> if the count has already been set to -1. For x86, the check is modified
>> to include all negative numbers to cover a larger case.
>>
>> The following table shows the scalability data on an 8-node 80-core
>> Westmere box with a 3.7.10 kernel. The numactl command is used to
>> restrict the running of the high_systime workloads to 1/2/4/8 nodes
>> with hyperthreading on and off.
>>
>> +-----------------+------------------+------------------+----------+
>> |  Configuration  | Mean Transaction | Mean Transaction | % Change |
>> |		  |  Rate w/o patch  | Rate with patch  |	   |
>> +-----------------+------------------------------------------------+
>> |		  |              User Range 1100 - 2000		   |
>> +-----------------+------------------------------------------------+
>> | 8 nodes, HT on  |      36980       |      148590      | +301.8%  |
>> | 8 nodes, HT off |      42799       |      145011      | +238.8%  |
>> | 4 nodes, HT on  |      61318       |      118445      |  +51.1%  |
>> | 4 nodes, HT off |     158481       |      158592      |   +0.1%  |
>> | 2 nodes, HT on  |     180602       |      173967      |   -3.7%  |
>> | 2 nodes, HT off |     198409       |      198073      |   -0.2%  |
>> | 1 node , HT on  |     149042       |      147671      |   -0.9%  |
>> | 1 node , HT off |     126036       |      126533      |   +0.4%  |
>> +-----------------+------------------------------------------------+
>> |		  |              User Range 200 - 1000		   |
>> +-----------------+------------------------------------------------+
>> | 8 nodes, HT on  |      41525       |      122349      | +194.6%  |
>> | 8 nodes, HT off |      49866       |      124032      | +148.7%  |
>> | 4 nodes, HT on  |      66409       |      106984      |  +61.1%  |
>> | 4 nodes, HT off |     119880       |      130508      |   +8.9%  |
>> | 2 nodes, HT on  |     138003       |      133948      |   -2.9%  |
>> | 2 nodes, HT off |     132792       |      131997      |   -0.6%  |
>> | 1 node , HT on  |     116593       |      115859      |   -0.6%  |
>> | 1 node , HT off |     104499       |      104597      |   +0.1%  |
>> +-----------------+------------------+------------------+----------+
> Hm, the most interesting/important measurement range is missing: 1-100 users?

I didn't include the 1-100 users range because the JPM number increases
linearly there and the difference between kernels with and without
the patch is within 0-2%. So that range just isn't interesting from my
point of view.

>> AIM7 benchmark run has a pretty large run-to-run variance due to random
>> nature of the subtests executed. So a difference of less than +-5%
>> may not be really significant.
>>
>> This patch improves high_systime workload performance at 4 nodes
>> and up by maintaining transaction rates without significant drop-off
>> at high node count.  The patch has practically no impact on 1 and 2
>> nodes system.
>>
>> The table below shows the percentage time (as reported by perf
>> record -a -s -g) spent on the __mutex_lock_slowpath() function by
>> the high_systime workload at 1500 users for 2/4/8-node configurations
>> with hyperthreading off.
>>
>> +---------------+-----------------+------------------+---------+
>> | Configuration | %Time w/o patch | %Time with patch | %Change |
>> +---------------+-----------------+------------------+---------+
>> |    8 nodes    |      65.34%     |      0.69%       |  -99%   |
>> |    4 nodes    |       8.70%	    |      1.02%       |  -88%   |
>> |    2 nodes    |       0.41%     |      0.32%       |  -22%   |
>> +---------------+-----------------+------------------+---------+
>>
>> It is obvious that the dramatic performance improvement at 8
>> nodes was due to the drastic cut in the time spent within the
>> __mutex_lock_slowpath() function.
>>
>> The table below show the improvements in other AIM7 workloads (at 8
>> nodes, hyperthreading off).
>>
>> +--------------+---------------+----------------+-----------------+
>> |   Workload   | mean % change | mean % change  | mean % change   |
>> |              | 10-100 users  | 200-1000 users | 1100-2000 users |
>> +--------------+---------------+----------------+-----------------+
>> | alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
>> | five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
>> | fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
>> | new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
>> | shared       |    +13.1%     |   +146.1%      |   +181.5%       |
>> | short        |     +7.4%     |     +5.0%      |     +4.2%       |
>> +--------------+---------------+----------------+-----------------+
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> Reviewed-by: Davidlohr Bueso<davidlohr.bueso@hp.com>
>> ---
>>   arch/x86/include/asm/mutex.h |   16 ++++++++++++++++
>>   kernel/mutex.c               |    9 ++++++---
>>   kernel/mutex.h               |    8 ++++++++
>>   3 files changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
>> index 7d3a482..aa6a3ec 100644
>> --- a/arch/x86/include/asm/mutex.h
>> +++ b/arch/x86/include/asm/mutex.h
>> @@ -3,3 +3,19 @@
>>   #else
>>   # include<asm/mutex_64.h>
>>   #endif
>> +
>> +#ifndef	__ASM_MUTEX_H
>> +#define	__ASM_MUTEX_H
>> +
>> +#ifdef MUTEX_SHOULD_XCHG_COUNT
>> +#undef MUTEX_SHOULD_XCHG_COUNT
>> +#endif
>> +/*
>> + * For the x86 architecture, it allows any negative number (besides -1) in
>> + * the mutex counter to indicate that some other threads are waiting on the
>> + * mutex. So the atomic_xchg() function should not be called in
>> + * __mutex_lock_common() if the value of the counter has already been set
>> + * to a negative number.
>> + */
>> +#define MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count)>= 0)
>> +#endif
>> diff --git a/kernel/mutex.c b/kernel/mutex.c
>> index 52f2301..5e5ea03 100644
>> --- a/kernel/mutex.c
>> +++ b/kernel/mutex.c
>> @@ -171,7 +171,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>>   		if (owner&&  !mutex_spin_on_owner(lock, owner))
>>   			break;
>>
>> -		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
>> +		if ((atomic_read(&lock->count) == 1)&&
>> +		    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
>>   			lock_acquired(&lock->dep_map, ip);
>>   			mutex_set_owner(lock);
>>   			preempt_enable();
>> @@ -205,7 +206,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>>   	list_add_tail(&waiter.list,&lock->wait_list);
>>   	waiter.task = task;
>>
>> -	if (atomic_xchg(&lock->count, -1) == 1)
>> +	if (MUTEX_SHOULD_XCHG_COUNT(lock)&&
>> +	   (atomic_xchg(&lock->count, -1) == 1))
>>   		goto done;
>>
>>   	lock_contended(&lock->dep_map, ip);
>> @@ -220,7 +222,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>>   		 * that when we release the lock, we properly wake up the
>>   		 * other waiters:
>>   		 */
>> -		if (atomic_xchg(&lock->count, -1) == 1)
>> +		if (MUTEX_SHOULD_XCHG_COUNT(lock)&&
>> +		   (atomic_xchg(&lock->count, -1) == 1))
>>   			break;
>>
>>   		/*
>> diff --git a/kernel/mutex.h b/kernel/mutex.h
>> index 4115fbf..b873f8e 100644
>> --- a/kernel/mutex.h
>> +++ b/kernel/mutex.h
>> @@ -46,3 +46,11 @@ static inline void
>>   debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
>>   {
>>   }
>> +
>> +/*
>> + * The atomic_xchg() function should not be called in __mutex_lock_common()
>> + * if the value of the counter has already been set to -1.
>> + */
>> +#ifndef MUTEX_SHOULD_XCHG_COUNT
>> +#define	MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count) != -1)
>> +#endif
> So, AFAICS, what we are doing here is to replace the rather aggressive
> modification+dirtying 'trylock' attempt loop of the mutex cacheline with softer
> atomic_read() based polling, and thus keep the polling nicer.
>
> This helps you, because with 8 nodes and a lot of inter-node cacheline transfer
> latency, all the cores polling the same cachelines is going to show up big time
> and will reduce general system performance by eating up memory bandwidth.
>
> I think doing a change into this general direction (be less aggressive in
> contention) might be reasonable, but its effect on the small range should be
> measured more accurately I think.

You are right. That is why I included the scalability test results for
the 1/2/4/8-node configurations. At the smallest 1-node setting (10
cores), there is essentially no difference with or without the patch,
as mutex contention isn't an issue there.

> AFAICS the main performance trade-off is the following: when the owner CPU unlocks
> the mutex, we'll poll it via a read first, which turns the cacheline into
> shared-read MESI state. Then we notice that its content signals 'lock is
> available', and we attempt the trylock again.
>
> This increases lock latency in the few-contended-tasks case slightly - and we'd
> like to know by precisely how much, not just for a generic '10-100 users' case
> which does not tell much about the contention level.

In the original code, the xchg instruction essentially does the
following things:
1. Invalidate the cacheline in all the other cores holding it.
2. Lock the cacheline and read it from memory.
3. Write the new content into memory.

In the new code, if the mutex is unlocked:
1. Read the content from memory into the cacheline.
2. Invalidate the cacheline in the other cores.
3. Write the new content into memory.

I don't believe there will be much difference in terms of latency
between these two approaches. There may be a few additional CPU cycles,
but that is almost negligible compared with the hundreds of cycles
needed for the whole atomic xchg instruction. In other words, I don't
think there is much overhead in going through an intermediate
shared-read MESI state on the way to an exclusive state. It would be
nice to know what kind of additional latency testing you are looking
for.

> Furthermore, since you are seeing this effect so profoundly, have you considered
> using another approach, such as queueing all the poll-waiters in some fashion?
>
> That would optimize your workload additionally: removing the 'stampede' of trylock
> attempts when an unlock happens - only a single wait-poller would get the lock.

The mutex code in the slowpath has already put the waiters into a sleep
queue and wakes up only one at a time. However, there are 2 additional
sources of mutex lockers besides those in the sleep queue:
1. New tasks trying to acquire the mutex and currently in the fast path.
2. Mutex spinners (CONFIG_MUTEX_SPIN_ON_OWNER on) who are spinning on
the owner field and ready to acquire the mutex once the owner field
changes.

The 2nd and 3rd patches are my attempts to limit the second type of
mutex lockers.

> I don't think patch #2 is really meaningful - AFAICS it replaces the polling of a
> field in the mutex cacheline by polling another field in the mutex cacheline. That
> does not truly queue tasks, as mutex->spinner will still be stampeded on when it
> gets set to NULL by an unlock.

It is probably not a very good idea to introduce another cmpxchg 
instruction to compete with others on the cacheline. However, this patch 
does show benefit in some workloads.

> AFAICS patch #3 is really trying to work around the limitations of patch #1 I
> think, via arbitrary looking tuning limits. Furthermore, its access to the
> SMP-global calc_load_tasks variable is not very good looking either.

Yes, the tuning limit looks arbitrary and any suggestion to make it
better is welcome. Anyway, my goal is to get the 1st patch into the
mainline. The 2nd and 3rd ones are nice to have, but not essential. Of
the two, I would like to get my 3rd patch in if we can agree on a more
sensible tuning limit.

> So instead of these 3 patches I think we should try a single, conceptually more
> robust, queueing based mutex-owner slow path spin-loop. If done right then I think
> it should improve performance in pretty much _all_ cases, without weird load
> limits.
>
> Btw., I think the sensitivity of your numbers to HyperThreading is interesting.
> You might want to try to remove the arch_mutex_cpu_relax() call (or make it less
> frequent) and see what happens - does that improve the HT numbers?

With hyperthreading on, more threads can be concurrently active, thus
increasing the chance that more of them will be trying to acquire the
mutex, leading to increased contention. I don't think removing the
arch_mutex_cpu_relax() call will help. In fact, the opposite will be
true, as the threads can then query and lock the cacheline more
frequently. Adding more arch_mutex_cpu_relax() calls may actually help,
which is what Rik's ticket spinlock proportional backoff patch is
trying to do.

Regards,
Longman



* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 14:38     ` Linus Torvalds
  2013-04-08 15:09       ` Ingo Molnar
@ 2013-04-08 17:53       ` Waiman Long
  2013-04-10 10:31         ` Ingo Molnar
  2013-04-10 14:09       ` Robin Holt
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-04-08 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton,
	Norton, Scott J, Rik van Riel

On 04/08/2013 10:38 AM, Linus Torvalds wrote:
> On Mon, Apr 8, 2013 at 5:42 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> AFAICS the main performance trade-off is the following: when the owner CPU unlocks
>> the mutex, we'll poll it via a read first, which turns the cacheline into
>> shared-read MESI state. Then we notice that its content signals 'lock is
>> available', and we attempt the trylock again.
>>
>> This increases lock latency in the few-contended-tasks case slightly - and we'd
>> like to know by precisely how much, not just for a generic '10-100 users' case
>> which does not tell much about the contention level.
> We had this problem for *some* lock where we used a "read + cmpxchg"
> in the hotpath and it caused us problems due to two cacheline state
> transitions (first to shared, then to exclusive). It was faster to
> just assume it was unlocked and try to do an immediate cmpxchg.
>
> But iirc it is a non-issue for this case, because this is only about
> the contended slow path.
>
> I forget where we saw the case where we should *not* read the initial
> value, though. Anybody remember?
>
> That said, the MUTEX_SHOULD_XCHG_COUNT macro should die. Why shouldn't
> all architectures just consider negative counts to be locked? It
> doesn't matter that some might only ever see -1.

I think so too. However, I don't have the machines to test out other 
architectures. The MUTEX_SHOULD_XCHG_COUNT is just a safety measure to 
make sure that my code won't screw up the kernel in other architectures. 
Once it is confirmed that a negative count other than -1 is fine for all 
the other architectures, the macro can certainly go.
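
To illustrate the point (this is not the patch itself): if every
architecture agrees that any non-positive count means "locked", the
check can key off the sign alone and no arch-specific guard macro is
needed. A stand-alone approximation in plain C with GCC builtins,
using the usual count convention (1 = unlocked, 0 = locked without
waiters, negative = locked with waiters) and made-up helper names:

#include <stdbool.h>

/* Illustrative only - plain C with GCC builtins, not kernel code. */
static inline bool count_means_locked(int count)
{
	/* Whether the stored value is -1, -2 or -100 makes no
	 * difference: anything non-positive means the mutex is held. */
	return count <= 0;
}

static bool mutex_trylock_sketch(int *count)
{
	int expected = 1;

	/* Peek first so contended CPUs do not pull the cacheline
	 * exclusive just to fail the atomic operation. */
	if (__atomic_load_n(count, __ATOMIC_RELAXED) != 1)
		return false;

	/* cmpxchg 1 -> 0: succeeds only if the mutex really was free,
	 * and never clobbers a negative "has waiters" marker. */
	return __atomic_compare_exchange_n(count, &expected, 0, false,
					   __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}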

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 17:42     ` Waiman Long
@ 2013-04-10 10:28       ` Ingo Molnar
  2013-04-10 15:47         ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2013-04-10 10:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra,
	Davidlohr Bueso, linux-kernel, Chandramouleeswaran, Aswin,
	Linus Torvalds, Peter Zijlstra, Andrew Morton, Norton, Scott J,
	Rik van Riel


* Waiman Long <Waiman.Long@hp.com> wrote:

> > Furthermore, since you are seeing this effect so profoundly, have you 
> > considered using another approach, such as queueing all the poll-waiters in 
> > some fashion?
> >
> > That would optimize your workload additionally: removing the 'stampede' of 
> > trylock attempts when an unlock happens - only a single wait-poller would get 
> > the lock.
>
> The mutex code in the slowpath has already put the waiters into a sleep queue 
> and wakes up only one at a time.

Yes - but I'm talking about spin/poll-waiters.

> [...] However, there are 2 additional sources of mutex lockers besides those in 
> the sleep queue:
>
> 1. New tasks trying to acquire the mutex and currently in the fast path.
> 2. Mutex spinners (CONFIG_MUTEX_SPIN_ON_OWNER on) who are spinning
> on the owner field and ready to acquire the mutex once the owner
> field changes.
> 
> The 2nd and 3rd patches are my attempts to limit the second type of mutex 
> lockers.

Even the 1st patch seems to do that, it limits the impact of spin-loopers, right?

I'm fine with patch #1 [your numbers are proof enough that it helps while the low 
client count effect seems to be in the noise] - the questions that seem open to me 
are:

 - Could the approach in patch #1 be further improved by an additional patch that 
   adds queueing to the _spinners_ in some fashion - like ticket spin locks try to
   do in essence? Not queue the blocked waiters (they are already queued), but the
   active spinners. This would have additional benefits, especially with a high
   CPU count and a high NUMA factor, by removing the stampede effect as owners get 
   switched.

 - Why does patch #2 have an effect? (it shouldn't at first glance) It has a 
   non-trivial cost, it increases the size of 'struct mutex' by 8 bytes, which 
   structure is embedded in numerous kernel data structures. When doing 
   comparisons I'd suggest comparing it not to just vanilla, but to a patch that
   only extends the struct mutex data structure (and changes no code) - this
   allows the isolation of cache layout change effects.

 - Patch #3 is rather ugly - and my hope would be that if spinners are queued in
   some fashion it becomes unnecessary.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 17:53       ` Waiman Long
@ 2013-04-10 10:31         ` Ingo Molnar
  2013-04-10 15:52           ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2013-04-10 10:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton,
	Norton, Scott J, Rik van Riel


* Waiman Long <Waiman.Long@hp.com> wrote:

> > That said, the MUTEX_SHOULD_XCHG_COUNT macro should die. Why shouldn't all 
> > architectures just consider negative counts to be locked? It doesn't matter 
> > that some might only ever see -1.
> 
> I think so too. However, I don't have the machines to test out other 
> architectures. The MUTEX_SHOULD_XCHG_COUNT is just a safety measure to make sure 
> that my code won't screw up the kernel in other architectures. Once it is 
> confirmed that a negative count other than -1 is fine for all the other 
> architectures, the macro can certainly go.

I'd suggest to just remove it in an additional patch, Cc:-ing 
linux-arch@vger.kernel.org. The change is very likely to be fine, if not then it's 
easy to revert it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-08 14:38     ` Linus Torvalds
  2013-04-08 15:09       ` Ingo Molnar
  2013-04-08 17:53       ` Waiman Long
@ 2013-04-10 14:09       ` Robin Holt
  2013-04-10 15:46         ` Linus Torvalds
  2 siblings, 1 reply; 18+ messages in thread
From: Robin Holt @ 2013-04-10 14:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Waiman Long, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Paul E. McKenney, David Howells, Dave Jones,
	Clark Williams, Peter Zijlstra, Davidlohr Bueso,
	Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	Peter Zijlstra, Andrew Morton, tony.luck

On Mon, Apr 08, 2013 at 07:38:39AM -0700, Linus Torvalds wrote:
> On Mon, Apr 8, 2013 at 5:42 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > AFAICS the main performance trade-off is the following: when the owner CPU unlocks
> > the mutex, we'll poll it via a read first, which turns the cacheline into
> > shared-read MESI state. Then we notice that its content signals 'lock is
> > available', and we attempt the trylock again.
> >
> > This increases lock latency in the few-contended-tasks case slightly - and we'd
> > like to know by precisely how much, not just for a generic '10-100 users' case
> > which does not tell much about the contention level.
> 
> We had this problem for *some* lock where we used a "read + cmpxchg"
> in the hotpath and it caused us problems due to two cacheline state
> transitions (first to shared, then to exclusive). It was faster to
> just assume it was unlocked and try to do an immediate cmpxchg.
> 
> But iirc it is a non-issue for this case, because this is only about
> the contended slow path.
> 
> I forget where we saw the case where we should *not* read the initial
> value, though. Anybody remember?

I think you might be remembering ia64.  Fairly early on, I recall there
being a change in the spinlocks where we did not check them before just
trying to acquire.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-10 14:09       ` Robin Holt
@ 2013-04-10 15:46         ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2013-04-10 15:46 UTC (permalink / raw)
  To: Robin Holt
  Cc: Ingo Molnar, Waiman Long, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Paul E. McKenney, David Howells, Dave Jones,
	Clark Williams, Peter Zijlstra, Davidlohr Bueso,
	Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	Peter Zijlstra, Andrew Morton, Tony Luck

On Wed, Apr 10, 2013 at 7:09 AM, Robin Holt <holt@sgi.com> wrote:
> On Mon, Apr 08, 2013 at 07:38:39AM -0700, Linus Torvalds wrote:
>>
>> I forget where we saw the case where we should *not* read the initial
>> value, though. Anybody remember?
>
> I think you might be remembering ia64.  Fairly early on, I recall there
> being a change in the spinlocks where we did not check them before just
> trying to acquire.

No, I think I found the one I was thinking of. It was the x86-32
version of atomic64_xchg() and atomic64_add_return(). We used to
actually read the old value in order to make the cmpxchg succeed on
the first try most of the time, but when it was cold in the cache that
actually hurt us. We were better off just picking a random value as
our first one, even if it resulted in the loop triggering, because for
the cold-cache case that avoids the unnecessary "bring it in shared
for the read, just to write to it later".

Commits 3a8d1788b37435baf6c296f4ea8beb4fa4955f44 and in particular
824975ef190e7dcb77718d1cc2cb53769b16d918.

Of course, a good OoO CPU would see the write happening later to the
same address, and bring things in for exclusive access immediately.
And it's possible that newer CPUs do that, but we did see this as
being an issue.
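
As a stand-alone illustration of that pattern (not the actual old
x86-32 code; plain C with GCC builtins), a cmpxchg loop that starts
from an arbitrary guess instead of loading the old value first looks
roughly like this:

#include <stdint.h>

/*
 * Sketch of an add_return built on cmpxchg without an initial read.
 * If the guess is wrong, the failed compare-exchange hands back the
 * real value and the next iteration succeeds - but the cacheline was
 * requested for exclusive ownership from the start, instead of
 * shared-for-the-read and then exclusive-for-the-write.
 */
static uint64_t add_return_sketch(uint64_t *v, uint64_t delta)
{
	uint64_t old = 0;	/* arbitrary first guess, no initial load */

	while (!__atomic_compare_exchange_n(v, &old, old + delta, false,
					    __ATOMIC_SEQ_CST,
					    __ATOMIC_RELAXED))
		;	/* 'old' now holds the observed value; retry */

	return old + delta;
}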

            Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-10 10:28       ` Ingo Molnar
@ 2013-04-10 15:47         ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-04-10 15:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paul E. McKenney,
	David Howells, Dave Jones, Clark Williams, Peter Zijlstra,
	Davidlohr Bueso, linux-kernel, Chandramouleeswaran, Aswin,
	Linus Torvalds, Peter Zijlstra, Andrew Morton, Norton, Scott J,
	Rik van Riel

On 04/10/2013 06:28 AM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hp.com>  wrote:
>
>>> Furthermore, since you are seeing this effect so profoundly, have you
>>> considered using another approach, such as queueing all the poll-waiters in
>>> some fashion?
>>>
>>> That would optimize your workload additionally: removing the 'stampede' of
>>> trylock attempts when an unlock happens - only a single wait-poller would get
>>> the lock.
>> The mutex code in the slowpath has already put the waiters into a sleep queue
>> and wakes up only one at a time.
> Yes - but I'm talking about spin/poll-waiters.
>
Thanks for the clarification.

>> [...] However, there are 2 additional sources of mutex lockers besides those in
>> the sleep queue:
>>
>> 1. New tasks trying to acquire the mutex and currently in the fast path.
>> 2. Mutex spinners (CONFIG_MUTEX_SPIN_ON_OWNER on) who are spinning
>> on the owner field and ready to acquire the mutex once the owner
>> field changes.
>>
>> The 2nd and 3rd patches are my attempts to limit the second type of mutex
>> lockers.
> Even the 1st patch seems to do that, it limits the impact of spin-loopers, right?

Yes, that is true.

> I'm fine with patch #1 [your numbers are proof enough that it helps while the low
> client count effect seems to be in the noise] - the questions that seem open to me
> are:
>
>   - Could the approach in patch #1 be further improved by an additional patch that
>     adds queueing to the _spinners_ in some fashion - like ticket spin locks try to
>     do in essence? Not queue the blocked waiters (they are already queued), but the
>     active spinners. This would have additional benefits, especially with a high
>     CPU count and a high NUMA factor, by removing the stampede effect as owners get
>     switched.
Yes, I think we can implement some kind of ticketing system for the
spinners. Similar to patch #2, we would have to add a new field to the
mutex structure for the head/tail ticket numbers, which would add a
little more contention on the same mutex cacheline when the ticket
numbers are updated. I can think of an easy way to do that without
increasing the size of the mutex. I will try it out to see what
performance impact it will have.
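
For illustration only - this is just one possible shape, not
necessarily what I will end up doing - the head and tail tickets could
be kept as two 16-bit fields so the whole spinner-queue state fits in a
single 32-bit word:

#include <stdint.h>

#define cpu_relax()	__builtin_ia32_pause()	/* x86 assumption */

struct spin_tickets {
	uint16_t head;	/* ticket currently allowed to go for the lock */
	uint16_t tail;	/* next ticket to hand out */
};

static void spinner_queue_wait(struct spin_tickets *q)
{
	/* Take the next tail ticket; spin until head catches up to us. */
	uint16_t me = __atomic_fetch_add(&q->tail, 1, __ATOMIC_RELAXED);

	while (__atomic_load_n(&q->head, __ATOMIC_ACQUIRE) != me)
		cpu_relax();
}

static void spinner_queue_next(struct spin_tickets *q)
{
	/* Pass the turn to the next queued spinner. */
	__atomic_fetch_add(&q->head, 1, __ATOMIC_RELEASE);
}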

>   - Why does patch #2 have an effect? (it shouldn't at first glance) It has a
>     non-trivial cost, it increases the size of 'struct mutex' by 8 bytes, which
>     structure is embedded in numerous kernel data structures. When doing
>     comparisons I'd suggest comparing it not to just vanilla, but to a patch that
>     only extends the struct mutex data structure (and changes no code) - this
>     allows the isolation of cache layout change effects.
>
>   - Patch #3 is rather ugly - and my hope would be that if spinners are queued in
>     some fashion it becomes unnecessary.
I think these two patches can have some performance impact because they
allow the CPUs to be used for other tasks that are waiting for CPU time,
instead of having the CPUs burn cycles waiting for the mutex to be
acquired. It isn't a big problem if only 1 or 2 threads are spinning,
but it could be if most of the CPUs in the system are wasting time
spinning for the mutex. That raises the question: even if we implement
a ticket queuing system, will it make sense to limit the number of
spinners to just a few, say 3?
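
Capping the spinner count could be as simple as the sketch below (the
limit of 3 and the helper names are arbitrary); a task that fails to
get a slot would skip the optimistic spinning and go straight to the
sleep queue:

#include <stdbool.h>

#define MAX_SPINNERS	3	/* arbitrary cap, as in the question above */

static int nr_spinners;		/* would live in or next to the mutex */

static bool spinner_slot_get(void)
{
	int old = __atomic_load_n(&nr_spinners, __ATOMIC_RELAXED);

	while (old < MAX_SPINNERS) {
		/* On failure the builtin refreshes 'old'; just retry. */
		if (__atomic_compare_exchange_n(&nr_spinners, &old, old + 1,
						false, __ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED))
			return true;	/* we may spin */
	}
	return false;			/* too many spinners - go to sleep */
}

static void spinner_slot_put(void)
{
	__atomic_fetch_sub(&nr_spinners, 1, __ATOMIC_RELEASE);
}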

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-10 10:31         ` Ingo Molnar
@ 2013-04-10 15:52           ` Waiman Long
  2013-04-10 17:16             ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-04-10 15:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton,
	Norton, Scott J, Rik van Riel

On 04/10/2013 06:31 AM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hp.com>  wrote:
>
>>> That said, the MUTEX_SHOULD_XCHG_COUNT macro should die. Why shouldn't all
>>> architectures just consider negative counts to be locked? It doesn't matter
>>> that some might only ever see -1.
>> I think so too. However, I don't have the machines to test out other
>> architectures. The MUTEX_SHOULD_XCHG_COUNT is just a safety measure to make sure
>> that my code won't screw up the kernel in other architectures. Once it is
>> confirmed that a negative count other than -1 is fine for all the other
>> architectures, the macro can certainly go.
> I'd suggest to just remove it in an additional patch, Cc:-ing
> linux-arch@vger.kernel.org. The change is very likely to be fine, if not then it's
> easy to revert it.
>
> Thanks,
>
> 	Ingo
Yes, I can do that. So can I put your name down as reviewer or ack'er 
for the 1st patch?

BTW, I am planning to change the code to mimic the
__mutex_slowpath_needs_to_unlock() macro so that I only need to update
asm/mutex.h instead of modifying kernel/mutex.h as well. Hopefully that
will make the change more acceptable to others in case we have to keep
it.
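
Roughly, the pattern would look like the following sketch (the
definitions shown are placeholders, not the real ones): an architecture
that needs special handling defines the macro in its <asm/mutex.h>,
and the generic code only supplies a fallback:

/* in an architecture's <asm/mutex.h>, if it needs something special:
 *	#define MUTEX_SHOULD_XCHG_COUNT(mutex)	<arch-specific test>
 *
 * in the generic code, after <asm/mutex.h> has been included:
 */
#ifndef MUTEX_SHOULD_XCHG_COUNT
#define MUTEX_SHOULD_XCHG_COUNT(mutex)	(1)	/* placeholder default */
#endif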

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-10 15:52           ` Waiman Long
@ 2013-04-10 17:16             ` Ingo Molnar
  2013-04-10 21:26               ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2013-04-10 17:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton,
	Norton, Scott J, Rik van Riel


* Waiman Long <Waiman.Long@hp.com> wrote:

> On 04/10/2013 06:31 AM, Ingo Molnar wrote:
> >* Waiman Long<Waiman.Long@hp.com>  wrote:
> >
> >>>That said, the MUTEX_SHOULD_XCHG_COUNT macro should die. Why shouldn't all
> >>>architectures just consider negative counts to be locked? It doesn't matter
> >>>that some might only ever see -1.
> >>I think so too. However, I don't have the machines to test out other
> >>architectures. The MUTEX_SHOULD_XCHG_COUNT is just a safety measure to make sure
> >>that my code won't screw up the kernel in other architectures. Once it is
> >>confirmed that a negative count other than -1 is fine for all the other
> >>architectures, the macro can certainly go.
> >I'd suggest to just remove it in an additional patch, Cc:-ing
> >linux-arch@vger.kernel.org. The change is very likely to be fine, if not then it's
> >easy to revert it.
> >
> >Thanks,
> >
> >	Ingo
>
> Yes, I can do that. So can I put your name down as reviewer or ack'er for the 
> 1st patch?

Since I'll typically be the maintainer applying & pushing kernel/mutex.c changes to 
Linus via the locking tree, the commit will get a Signed-off-by from me once you 
resend the latest state of things - no need to add my Acked-by or Reviewed-by 
right now.

I'm still hoping for another patch from you that adds queueing to the spinners ... 
That approach could offer better performance than current patches 1,2,3. In 
theory.

I'd prefer that approach because you have a testcase that shows the problem and 
you are willing to maximize performance with it - so we could make sure we have 
reached maximum performance instead of dropping patches #2, #3, reaching partial 
performance with patch #1, without having a real full resolution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-10 17:16             ` Ingo Molnar
@ 2013-04-10 21:26               ` Waiman Long
  2013-04-11  9:07                 ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-04-10 21:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton,
	Norton, Scott J, Rik van Riel

On 04/10/2013 01:16 PM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hp.com>  wrote:
>
>> On 04/10/2013 06:31 AM, Ingo Molnar wrote:
>>> * Waiman Long<Waiman.Long@hp.com>   wrote:
>>>
>>>>> That said, the MUTEX_SHOULD_XCHG_COUNT macro should die. Why shouldn't all
>>>>> architectures just consider negative counts to be locked? It doesn't matter
>>>>> that some might only ever see -1.
>>>> I think so too. However, I don't have the machines to test out other
>>>> architectures. The MUTEX_SHOULD_XCHG_COUNT is just a safety measure to make sure
>>>> that my code won't screw up the kernel in other architectures. Once it is
>>>> confirmed that a negative count other than -1 is fine for all the other
>>>> architectures, the macro can certainly go.
>>> I'd suggest to just remove it in an additional patch, Cc:-ing
>>> linux-arch@vger.kernel.org. The change is very likely to be fine, if not then it's
>>> easy to revert it.
>>>
>>> Thanks,
>>>
>>> 	Ingo
>> Yes, I can do that. So can I put your name down as reviewer or ack'er for the
>> 1st patch?
> Since I'll typically be the maintainer applying & pushing kernel/mutex.c changes to
> Linus via the locking tree, the commit will get a Signed-off-by from me once you
> resend the latest state of things - no need to add my Acked-by or Reviewed-by
> right now.
Thanks for the explanation. I am still pretty new to this process of 
upstream kernel development.

> I'm still hoping for another patch from you that adds queueing to the spinners ...
> That approach could offer better performance than current patches 1,2,3. In
> theory.
>
> I'd prefer that approach because you have a testcase that shows the problem and
> you are willing to maximize performance with it - so we could make sure we have
> reached maximum performance instead of dropping patches #2, #3, reaching partial
> performance with patch #1, without having a real full resolution.
>
That is what I hope too. I am going to work on another patch to add 
spinner queuing to see how much performance impact it will have.

BTW, I have also been thinking about extracting the spinlock out of
the mutex structure for some busy mutexes by adding a pointer to an
external auxiliary structure (separately allocated at init time). The
idea is to use the external spinlock if available; otherwise, the
internal one will be used. That should reduce cacheline contention for
some of the busiest mutexes. The spinner queuing tickets can go in the
external structure too. However, it requires a one-line change at each
mutex initialization site. I haven't actually made the code change and
tried it yet, but that is something I am thinking of doing when I have
time.
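
A rough sketch of what that could look like (field and function names
are invented for illustration; I have not actually written this code):

/*
 * Sketch of the "external auxiliary structure" idea: a busy mutex can
 * be initialized with a separately allocated aux block so that the
 * heavily written spin/queue state lives on its own cacheline, away
 * from the data structure embedding the mutex.
 */
struct mutex_aux {
	int		spin_lock;	/* stand-in for the extracted spinlock */
	unsigned short	ticket_head;	/* spinner queue tickets could live */
	unsigned short	ticket_tail;	/* here as well                     */
};

struct mutex_sketch {
	int		  count;	/* usual fast-path counter */
	struct mutex_aux *aux;		/* NULL -> use the internal state */
	int		  internal_spin_lock;
};

static void mutex_sketch_init(struct mutex_sketch *m, struct mutex_aux *aux)
{
	m->count = 1;			/* unlocked */
	m->aux = aux;			/* the one-line change per init site */
	m->internal_spin_lock = 0;
}

static int *mutex_spinlock_word(struct mutex_sketch *m)
{
	/* Use the external spinlock if one was supplied, otherwise the
	 * embedded one - exactly the fallback rule described above. */
	return m->aux ? &m->aux->spin_lock : &m->internal_spin_lock;
}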

Thanks,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations
  2013-04-10 21:26               ` Waiman Long
@ 2013-04-11  9:07                 ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2013-04-11  9:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paul E. McKenney, David Howells, Dave Jones, Clark Williams,
	Peter Zijlstra, Davidlohr Bueso, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Peter Zijlstra, Andrew Morton,
	Norton, Scott J, Rik van Riel


* Waiman Long <Waiman.Long@hp.com> wrote:

> BTW, I have also been thinking about extracting the spinlock out of the mutex 
> structure for some busy mutexes by adding a pointer to an external auxiliary 
> structure (separately allocated at init time). The idea is to use the external 
> spinlock if available; otherwise, the internal one will be used. That should 
> reduce cacheline contention for some of the busiest mutexes. The spinner queuing 
> tickets can go in the external structure too. However, it requires a one-line 
> change at each mutex initialization site. I haven't actually made the code 
> change and tried it yet, but that is something I am thinking of doing when I 
> have time.

I'm not sure per mutex allocations are a really good idea - we like our locking 
primitives to be simple, embeddable into data structures and allocatable together 
with the data structure with no other separate memory footprint.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-04-11  9:07 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-04 14:54 [PATCH RFC 0/3] mutex: Improve mutex performance by doing less atomic-ops & spinning Waiman Long
2013-04-04 14:54 ` [PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations Waiman Long
2013-04-08 12:42   ` Ingo Molnar
2013-04-08 14:38     ` Linus Torvalds
2013-04-08 15:09       ` Ingo Molnar
2013-04-08 17:53       ` Waiman Long
2013-04-10 10:31         ` Ingo Molnar
2013-04-10 15:52           ` Waiman Long
2013-04-10 17:16             ` Ingo Molnar
2013-04-10 21:26               ` Waiman Long
2013-04-11  9:07                 ` Ingo Molnar
2013-04-10 14:09       ` Robin Holt
2013-04-10 15:46         ` Linus Torvalds
2013-04-08 17:42     ` Waiman Long
2013-04-10 10:28       ` Ingo Molnar
2013-04-10 15:47         ` Waiman Long
2013-04-04 14:54 ` [PATCH RFC 2/3] mutex: restrict mutex spinning to only one task per mutex Waiman Long
2013-04-04 14:54 ` [PATCH RFC 3/3] mutex: dynamically disable mutex spinning at high load Waiman Long
