* [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
@ 2013-02-06 20:03 Rik van Riel
  2013-02-06 20:04 ` [PATCH -v5 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
                   ` (5 more replies)
  0 siblings, 6 replies; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo


Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

	http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the
uncontended case. After all, if there is contention, the data
structure can be improved to reduce or eliminate the lock
contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket by an amount based on a delay factor and on the
number of CPUs ahead of us in the queue for this lock. Checking
the lock less often allows the lock holder to continue running,
resulting in better throughput and preventing performance from
dropping off a cliff.
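
To illustrate the idea, here is a rough user-space sketch of a
ticket lock with proportional backoff (C11 atomics and a made-up,
fixed BACKOFF_FACTOR; this is not the kernel code from the patches
below):

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint head;	/* ticket currently being served */
	atomic_uint tail;	/* next ticket to hand out */
};

#define BACKOFF_FACTOR	50	/* placeholder; the patches auto-tune this */

static void ticket_lock(struct ticket_lock *lock)
{
	unsigned int ticket = atomic_fetch_add(&lock->tail, 1);

	for (;;) {
		unsigned int head = atomic_load(&lock->head);
		unsigned int waiters_ahead, loops;

		if (head == ticket)
			return;		/* our turn */

		/* Wait longer the more CPUs are queued ahead of us. */
		waiters_ahead = ticket - head - 1;
		for (loops = BACKOFF_FACTOR * waiters_ahead + 1; loops; loops--)
			__builtin_ia32_pause();	/* rep; nop, like cpu_relax() */
	}
}

static void ticket_unlock(struct ticket_lock *lock)
{
	atomic_fetch_add(&lock->head, 1);
}

The patches below implement the same idea on the kernel's ticket
spinlocks, with the delay factor auto-tuned at runtime instead of
hard-coded.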

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.
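
The actual test program is not part of this posting; a minimal
sketch of that kind of benchmark might look like the following
(assuming a SysV semaphore driven via semop(), so that contention
lands on an in-kernel lock; thread count and run time are
arbitrary):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>

static int semid;
static long ops[1024];			/* per-thread operation counts */

static void *worker(void *arg)
{
	long id = (long)arg;
	struct sembuf down = { 0, -1, 0 };
	struct sembuf up   = { 0, +1, 0 };

	for (;;) {
		semop(semid, &down, 1);	/* lock */
		semop(semid, &up, 1);	/* unlock */
		ops[id]++;
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int i, seconds = 10, nthreads = argc > 1 ? atoi(argv[1]) : 1;
	union semun { int val; } arg = { .val = 1 };
	pthread_t tid;
	long total = 0;

	semid = semget(IPC_PRIVATE, 1, 0600);
	semctl(semid, 0, SETVAL, arg);	/* start out unlocked */

	for (i = 0; i < nthreads; i++)
		pthread_create(&tid, NULL, worker, (void *)(long)i);

	sleep(seconds);
	for (i = 0; i < nthreads; i++)
		total += ops[i];
	printf("%d threads: %ld ops/second\n", nthreads, total / seconds);
	return 0;
}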

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
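
For those who do not want to dig through the patches, the tuning
rule from patch 3 boils down to roughly the following (condensed
into a single helper here; in the actual code the increase happens
before each wait, the decay happens when we find we overslept, and
patch 5 lowers the maximum when running in a guest):

#define DELAY_SHIFT		8
#define DELAY_FIXED_1		(1u << DELAY_SHIFT)	/* fixed point, 8 fractional bits */
#define MIN_SPINLOCK_DELAY	(1 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY	(16000 * DELAY_FIXED_1)

static unsigned int tune_delay(unsigned int delay, int overslept)
{
	if (!overslept) {
		/* The lock was still held when we checked: back off more. */
		if (delay < MAX_SPINLOCK_DELAY)
			delay += DELAY_FIXED_1 / 7;
	} else if (delay >= 2 * DELAY_FIXED_1) {
		/* The lock was already free: we waited too long, decay fast. */
		unsigned int step = delay / 32;

		delay -= step > DELAY_FIXED_1 ? step : DELAY_FIXED_1;
	}
	return delay;
}

The wait loop then spins for (delay * waiters_ahead) >> DELAY_SHIFT
iterations of cpu_relax() before checking the lock again.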

The attached graph shows the performance of the multithreaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

For the v4 series, I added code to limit the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.

The v5 series has cleanups suggested by Ingo Molnar and Borislav
Petkov.

Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...

-- 
All rights reversed.


[-- Attachment #2: spinlock-backoff-v2.png --]
[-- Type: image/png, Size: 21964 bytes --]


* [PATCH -v5 1/5] x86,smp: move waiting on contended ticket lock out of line
  2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
@ 2013-02-06 20:04 ` Rik van Riel
  2013-02-13 12:06   ` [tip:core/locking] x86/smp: Move " tip-bot for Rik van Riel
  2013-02-06 20:04 ` [PATCH -v5 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Moving the wait loop for congested ticket locks to its own function
allows us to add things to that wait loop, without growing the size
of the kernel text appreciably.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmiss.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rafael Aquini <aquini@redhat.com>
---
v2: clean up the code a little, after Michel's suggestion 

 arch/x86/include/asm/spinlock.h |   11 +++++------
 arch/x86/kernel/smp.c           |   14 ++++++++++++++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 
 	inc = xadd(&lock->tickets, inc);
 
-	for (;;) {
-		if (inc.head == inc.tail)
-			break;
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
-	}
+	if (inc.head != inc.tail)
+		ticket_spin_lock_wait(lock, inc);
+
 	barrier();		/* make sure nothing creeps before the lock is taken */
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+	for (;;) {
+		cpu_relax();
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+
+		if (inc.head == inc.tail)
+			break;
+	}
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...



* [PATCH -v5 2/5] x86,smp: proportional backoff for ticket spinlocks
  2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
  2013-02-06 20:04 ` [PATCH -v5 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
@ 2013-02-06 20:04 ` Rik van Riel
  2013-02-13 12:07   ` [tip:core/locking] x86/smp: Implement " tip-bot for Rik van Riel
  2013-02-06 20:05 ` [PATCH -v5 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Simple fixed-value proportional backoff for ticket spinlocks.
By pounding on the cache line with the spinlock less often,
bus traffic is reduced. In the case of a data structure with an
embedded spinlock, the lock holder has a better chance of
making progress.

If we are next in line behind the current holder of the
lock, we do a fast spin, so as not to waste any time when
the lock is released.

The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
---
 arch/x86/kernel/smp.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..aa743e9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
  */
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
+	__ticket_t head = inc.head, ticket = inc.tail;
+	__ticket_t waiters_ahead;
+	unsigned loops;
+
 	for (;;) {
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
+		waiters_ahead = ticket - head - 1;
+		/*
+		 * We are next after the current lock holder. Check often
+		 * to avoid wasting time when the lock is released.
+		 */
+		if (!waiters_ahead) {
+			do {
+				cpu_relax();
+			} while (ACCESS_ONCE(lock->tickets.head) != ticket);
+			break;
+		}
+		loops = 50 * waiters_ahead;
+		while (loops--)
+			cpu_relax();
 
-		if (inc.head == inc.tail)
+		head = ACCESS_ONCE(lock->tickets.head);
+		if (head == ticket)
 			break;
 	}
 }



* [PATCH -v5 3/5] x86,smp: auto tune spinlock backoff delay factor
  2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
  2013-02-06 20:04 ` [PATCH -v5 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
  2013-02-06 20:04 ` [PATCH -v5 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2013-02-06 20:05 ` Rik van Riel
  2013-02-13 12:08   ` [tip:core/locking] x86/smp: Auto " tip-bot for Rik van Riel
  2013-02-06 20:06 ` [PATCH -v5 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

	http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the
uncontended case. After all, if there is contention, the data
structure can be improved to reduce or eliminate the lock
contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket by an amount based on a delay factor and on the
number of CPUs ahead of us in the queue for this lock. Checking
the lock less often allows the lock holder to continue running,
resulting in better throughput and preventing performance from
dropping off a cliff.

Proportional spinlock delay with a high delay factor works well
when there is a lot of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/kernel/smp.c |   41 ++++++++++++++++++++++++++++++++++++++---
 1 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index aa743e9..b19648a 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -112,13 +112,34 @@
 static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
+#define DELAY_SHIFT			8
+#define DELAY_FIXED_1			(1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
 /*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. The solution is to go for a fast spin if we are at the head of
+ * the queue, to slowly increase the delay if we sleep for too short a
+ * time, and to decrease the delay if we slept for too long.
  */
+DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
 	__ticket_t waiters_ahead;
+	unsigned delay = __this_cpu_read(spinlock_delay);
 	unsigned loops;
 
 	for (;;) {
@@ -133,14 +154,28 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 			} while (ACCESS_ONCE(lock->tickets.head) != ticket);
 			break;
 		}
-		loops = 50 * waiters_ahead;
+
+		/* Aggressively increase delay, to minimize lock accesses. */
+		if (delay < MAX_SPINLOCK_DELAY)
+			delay += DELAY_FIXED_1 / 7;
+
+		loops = (delay * waiters_ahead) >> DELAY_SHIFT;
 		while (loops--)
 			cpu_relax();
 
 		head = ACCESS_ONCE(lock->tickets.head);
-		if (head == ticket)
+		if (head == ticket) {
+			/*
+			 * We overslept, and do not know by how much.
+			 * Exponentially decay the value of delay,
+			 * to get it back to a good value quickly.
+			 */
+			if (delay >= 2 * DELAY_FIXED_1)
+				delay -= max(delay/32, DELAY_FIXED_1);
 			break;
+		}
 	}
+	__this_cpu_write(spinlock_delay, delay);
 }
 
 /*



* [PATCH -v5 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
  2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (2 preceding siblings ...)
  2013-02-06 20:05 ` [PATCH -v5 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-02-06 20:06 ` Rik van Riel
  2013-02-13 12:09   ` [tip:core/locking] x86/smp: Keep " tip-bot for Eric Dumazet
  2013-02-06 20:07 ` [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
  2013-02-06 20:08 ` [PATCH -v5 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
  5 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

From: Eric Dumazet <eric.dumazet@gmail.com>

Eric Dumazet found a regression with the first version of the spinlock
backoff code, in a workload where multiple spinlocks were contended,
each having a different wait time.

This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.

Eric Dumazet wrote:

I did some tests with your patches with following configuration :

tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)

and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"

Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)

Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)

It clearly defeats your assumption of a single per cpu delay being OK :
Some cpus are spinning too long while the lock was released.

We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.

With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>

---
 arch/x86/kernel/smp.c |   24 ++++++++++++++++++++----
 1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index a7255a7..64e33ef 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -116,6 +117,18 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
 #define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
+#define DELAY_HASH_SHIFT		6
+struct delay_entry {
+	u32 hash;
+	u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+	[0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+		.hash = 0,
+		.delay = MIN_SPINLOCK_DELAY,
+	},
+};
+
 /*
  * Wait on a congested ticket spinlock. Many spinlocks are embedded in
  * data structures; having many CPUs pounce on the cache line with the
@@ -134,12 +147,14 @@ static bool smp_no_nmi_ipi = false;
  * the queue, to slowly increase the delay if we sleep for too short a
  * time, and to decrease the delay if we slept for too long.
  */
-DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
 	__ticket_t waiters_ahead;
-	unsigned delay = __this_cpu_read(spinlock_delay);
+	u32 hash = hash32_ptr(lock);
+	u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+	struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+	u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
 	unsigned loops;
 
 	for (;;) {
@@ -171,11 +186,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 			 * to get it back to a good value quickly.
 			 */
 			if (delay >= 2 * DELAY_FIXED_1)
-				delay -= max(delay/32, DELAY_FIXED_1);
+				delay -= max_t(u32, delay/32, DELAY_FIXED_1);
 			break;
 		}
 	}
-	__this_cpu_write(spinlock_delay, delay);
+	ent->hash = hash;
+	ent->delay = delay;
 }
 
 /*


* [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (3 preceding siblings ...)
  2013-02-06 20:06 ` [PATCH -v5 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-02-06 20:07 ` Rik van Riel
  2013-02-07 11:11   ` Ingo Molnar
  2013-02-07 11:25   ` [PATCH -v5 5/5] x86,smp: limit " Stefano Stabellini
  2013-02-06 20:08 ` [PATCH -v5 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
  5 siblings, 2 replies; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Modern Intel and AMD CPUs will trap to the host when the guest
is spinning on a spinlock, allowing the host to schedule in
something else.

This effectively means the host is taking care of spinlock
backoff for virtual machines. It also means that doing the
spinlock backoff in the guest anyway can lead to totally
unpredictable results, extremely large backoffs, and
performance regressions.

To prevent those problems, we limit the spinlock backoff
delay, when running in a virtual machine, to a small value.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/processor.h |    2 ++
 arch/x86/kernel/cpu/hypervisor.c |    2 ++
 arch/x86/kernel/smp.c            |   21 +++++++++++++++++++--
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 888184b..4118fd8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
 extern unsigned long arch_align_stack(unsigned long sp);
 extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
 
+extern void init_guest_spinlock_delay(void);
+
 void default_idle(void);
 bool set_pm_idle_to_default(void);
 
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index a8f8fa9..4a53724 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
 
 	init_hypervisor(&boot_cpu_data);
 
+	init_guest_spinlock_delay();
+
 	if (x86_hyper->init_platform)
 		x86_hyper->init_platform();
 }
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 64e33ef..fbc5ff3 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_SHIFT			8
 #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
-#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
 #define DELAY_HASH_SHIFT		6
+
+/*
+ * Modern Intel and AMD CPUs tell the hypervisor when a guest is
+ * spinning excessively on a spinlock. The hypervisor will then
+ * schedule something else, effectively taking care of the backoff
+ * for us. Doing our own backoff on top of the hypervisor's pause
+ * loop exit handling can lead to excessively long delays, and
+ * performance degradations. Limit the spinlock delay in virtual
+ * machines to a smaller value. Called from init_hypervisor_platform 
+ */
+static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
+void __init init_guest_spinlock_delay(void)
+{
+	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
+}
+
 struct delay_entry {
 	u32 hash;
 	u32 delay;
@@ -171,7 +188,7 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 		}
 
 		/* Aggressively increase delay, to minimize lock accesses. */
-		if (delay < MAX_SPINLOCK_DELAY)
+		if (delay < max_spinlock_delay)
 			delay += DELAY_FIXED_1 / 7;
 
 		loops = (delay * waiters_ahead) >> DELAY_SHIFT;



* [PATCH -v5 6/5] x86,smp: add debugging code to track spinlock delay value
  2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (4 preceding siblings ...)
  2013-02-06 20:07 ` [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
@ 2013-02-06 20:08 ` Rik van Riel
  5 siblings, 0 replies; 54+ messages in thread
From: Rik van Riel @ 2013-02-06 20:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Subject: x86,smp: add debugging code to track spinlock delay value

From: Eric Dumazet <eric.dumazet@gmail.com>

This code prints out the maximum spinlock delay value and the
backtrace that pushed it that far.

On systems with serial consoles, the act of printing can cause
the spinlock delay value to explode. It can still be useful as
a debugging tool, but is probably too verbose to merge upstream
in this form.

Not-signed-off-by: Rik van Riel <riel@redhat.com>
Not-signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 arch/x86/kernel/smp.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index fbc5ff3..660f0ec 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -146,6 +146,8 @@ static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay
 	},
 };
 
+static DEFINE_PER_CPU(u32, maxdelay);
+
 /*
  * Wait on a congested ticket spinlock. Many spinlocks are embedded in
  * data structures; having many CPUs pounce on the cache line with the
@@ -209,6 +211,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 	}
 	ent->hash = hash;
 	ent->delay = delay;
+
+	if (__this_cpu_read(maxdelay) * 4 < delay * 3) {
+		pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay>>DELAY_SHIFT);
+		__this_cpu_write(maxdelay, delay);
+		WARN_ON(1);
+	}
 }
 
 /*


* Re: [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-06 20:07 ` [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
@ 2013-02-07 11:11   ` Ingo Molnar
  2013-02-07 21:24     ` [PATCH fix " Rik van Riel
  2013-02-07 11:25   ` [PATCH -v5 5/5] x86,smp: limit " Stefano Stabellini
  1 sibling, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2013-02-07 11:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo


* Rik van Riel <riel@redhat.com> wrote:

> --- a/arch/x86/kernel/cpu/hypervisor.c
> +++ b/arch/x86/kernel/cpu/hypervisor.c
> @@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
>  
>  	init_hypervisor(&boot_cpu_data);
>  
> +	init_guest_spinlock_delay();
> +
>  	if (x86_hyper->init_platform)
>  		x86_hyper->init_platform();
>  }
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 64e33ef..fbc5ff3 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
>  #define DELAY_SHIFT			8
>  #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
>  #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
> -#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
>  #define DELAY_HASH_SHIFT		6
> +
> +/*
> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
> + * spinning excessively on a spinlock. The hypervisor will then
> + * schedule something else, effectively taking care of the backoff
> + * for us. Doing our own backoff on top of the hypervisor's pause
> + * loop exit handling can lead to excessively long delays, and
> + * performance degradations. Limit the spinlock delay in virtual
> + * machines to a smaller value. Called from init_hypervisor_platform 
> + */
> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
> +void __init init_guest_spinlock_delay(void)
> +{
> +	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
> +}

The kernel build will be sad on !SMP configs.

Thanks,

	Ingo


* Re: [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-06 20:07 ` [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
  2013-02-07 11:11   ` Ingo Molnar
@ 2013-02-07 11:25   ` Stefano Stabellini
  2013-02-07 11:59     ` Raghavendra K T
  2013-02-07 13:28     ` Rik van Riel
  1 sibling, 2 replies; 54+ messages in thread
From: Stefano Stabellini @ 2013-02-07 11:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

On Wed, 6 Feb 2013, Rik van Riel wrote:
> Modern Intel and AMD CPUs will trap to the host when the guest
> is spinning on a spinlock, allowing the host to schedule in
> something else.
> 
> This effectively means the host is taking care of spinlock
> backoff for virtual machines. It also means that doing the
> spinlock backoff in the guest anyway can lead to totally
> unpredictable results, extremely large backoffs, and
> performance regressions.
> 
> To prevent those problems, we limit the spinlock backoff
> delay, when running in a virtual machine, to a small value.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  arch/x86/include/asm/processor.h |    2 ++
>  arch/x86/kernel/cpu/hypervisor.c |    2 ++
>  arch/x86/kernel/smp.c            |   21 +++++++++++++++++++--
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 888184b..4118fd8 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
>  extern unsigned long arch_align_stack(unsigned long sp);
>  extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
>  
> +extern void init_guest_spinlock_delay(void);
> +
>  void default_idle(void);
>  bool set_pm_idle_to_default(void);
>  
> diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
> index a8f8fa9..4a53724 100644
> --- a/arch/x86/kernel/cpu/hypervisor.c
> +++ b/arch/x86/kernel/cpu/hypervisor.c
> @@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
>  
>  	init_hypervisor(&boot_cpu_data);
>  
> +	init_guest_spinlock_delay();
> +
>  	if (x86_hyper->init_platform)
>  		x86_hyper->init_platform();
>  }
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 64e33ef..fbc5ff3 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
>  #define DELAY_SHIFT			8
>  #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
>  #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
> -#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
>  #define DELAY_HASH_SHIFT		6
> +
> +/*
> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
> + * spinning excessively on a spinlock. The hypervisor will then
> + * schedule something else, effectively taking care of the backoff
> + * for us. Doing our own backoff on top of the hypervisor's pause
> + * loop exit handling can lead to excessively long delays, and
> + * performance degradations. Limit the spinlock delay in virtual
> + * machines to a smaller value. Called from init_hypervisor_platform 
> + */
> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
> +void __init init_guest_spinlock_delay(void)
> +{
> +	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
> +}
> +

Same comment as last time:

"""
Before reducing max_spinlock_delay, shouldn't we check that PAUSE-loop
exiting is available? What if we are running on an older x86 machine
that doesn't support it?
"""


* Re: [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-07 11:25   ` [PATCH -v5 5/5] x86,smp: limit " Stefano Stabellini
@ 2013-02-07 11:59     ` Raghavendra K T
  2013-02-07 13:28     ` Rik van Riel
  1 sibling, 0 replies; 54+ messages in thread
From: Raghavendra K T @ 2013-02-07 11:59 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Rik van Riel, linux-kernel, aquini, eric.dumazet, lwoodman,
	knoel, chegu_vinod, mingo

On 02/07/2013 04:55 PM, Stefano Stabellini wrote:
> On Wed, 6 Feb 2013, Rik van Riel wrote:
>> Modern Intel and AMD CPUs will trap to the host when the guest
>> is spinning on a spinlock, allowing the host to schedule in
>> something else.
>>
>> This effectively means the host is taking care of spinlock
>> backoff for virtual machines. It also means that doing the
>> spinlock backoff in the guest anyway can lead to totally
>> unpredictable results, extremely large backoffs, and
>> performance regressions.
>>
>> To prevent those problems, we limit the spinlock backoff
>> delay, when running in a virtual machine, to a small value.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>>   arch/x86/include/asm/processor.h |    2 ++
>>   arch/x86/kernel/cpu/hypervisor.c |    2 ++
>>   arch/x86/kernel/smp.c            |   21 +++++++++++++++++++--
>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index 888184b..4118fd8 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
>>   extern unsigned long arch_align_stack(unsigned long sp);
>>   extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
>>
>> +extern void init_guest_spinlock_delay(void);
>> +
>>   void default_idle(void);
>>   bool set_pm_idle_to_default(void);
>>
>> diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
>> index a8f8fa9..4a53724 100644
>> --- a/arch/x86/kernel/cpu/hypervisor.c
>> +++ b/arch/x86/kernel/cpu/hypervisor.c
>> @@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
>>
>>   	init_hypervisor(&boot_cpu_data);
>>
>> +	init_guest_spinlock_delay();
>> +
>>   	if (x86_hyper->init_platform)
>>   		x86_hyper->init_platform();
>>   }
>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>> index 64e33ef..fbc5ff3 100644
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
>>   #define DELAY_SHIFT			8
>>   #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
>>   #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
>> -#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
>> +#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
>> +#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
>>   #define DELAY_HASH_SHIFT		6
>> +
>> +/*
>> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
>> + * spinning excessively on a spinlock. The hypervisor will then
>> + * schedule something else, effectively taking care of the backoff
>> + * for us. Doing our own backoff on top of the hypervisor's pause
>> + * loop exit handling can lead to excessively long delays, and
>> + * performance degradations. Limit the spinlock delay in virtual
>> + * machines to a smaller value. Called from init_hypervisor_platform
>> + */
>> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
>> +void __init init_guest_spinlock_delay(void)
>> +{
>> +	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
>> +}
>> +
>
> Same comment as last time:
>
> """
> Before reducing max_spinlock_delay, shouldn't we check that PAUSE-loop
> exiting is available? What if we are running on an older x86 machine
> that doesn't support it?
> """

I believe that, irrespective of whether the machine has PLE or not,
limiting the number of loops to just 16 (against 16000 for bare
metal) does not affect the guest much; it may even help some
overcommit cases a little. (I have not tested on a non-PLE guest
yet.) Otherwise it is still difficult to tune the 'loop' parameter,
because of lock holder preemption (LHP) and so on.




* Re: [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-07 11:25   ` [PATCH -v5 5/5] x86,smp: limit " Stefano Stabellini
  2013-02-07 11:59     ` Raghavendra K T
@ 2013-02-07 13:28     ` Rik van Riel
  1 sibling, 0 replies; 54+ messages in thread
From: Rik van Riel @ 2013-02-07 13:28 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

On 02/07/2013 06:25 AM, Stefano Stabellini wrote:
> On Wed, 6 Feb 2013, Rik van Riel wrote:
>> Modern Intel and AMD CPUs will trap to the host when the guest
>> is spinning on a spinlock, allowing the host to schedule in
>> something else.
>>
>> This effectively means the host is taking care of spinlock
>> backoff for virtual machines. It also means that doing the
>> spinlock backoff in the guest anyway can lead to totally
>> unpredictable results, extremely large backoffs, and
>> performance regressions.
>>
>> To prevent those problems, we limit the spinlock backoff
>> delay, when running in a virtual machine, to a small value.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>>   arch/x86/include/asm/processor.h |    2 ++
>>   arch/x86/kernel/cpu/hypervisor.c |    2 ++
>>   arch/x86/kernel/smp.c            |   21 +++++++++++++++++++--
>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index 888184b..4118fd8 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
>>   extern unsigned long arch_align_stack(unsigned long sp);
>>   extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
>>
>> +extern void init_guest_spinlock_delay(void);
>> +
>>   void default_idle(void);
>>   bool set_pm_idle_to_default(void);
>>
>> diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
>> index a8f8fa9..4a53724 100644
>> --- a/arch/x86/kernel/cpu/hypervisor.c
>> +++ b/arch/x86/kernel/cpu/hypervisor.c
>> @@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
>>
>>   	init_hypervisor(&boot_cpu_data);
>>
>> +	init_guest_spinlock_delay();
>> +
>>   	if (x86_hyper->init_platform)
>>   		x86_hyper->init_platform();
>>   }
>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>> index 64e33ef..fbc5ff3 100644
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
>>   #define DELAY_SHIFT			8
>>   #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
>>   #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
>> -#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
>> +#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
>> +#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
>>   #define DELAY_HASH_SHIFT		6
>> +
>> +/*
>> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
>> + * spinning excessively on a spinlock. The hypervisor will then
>> + * schedule something else, effectively taking care of the backoff
>> + * for us. Doing our own backoff on top of the hypervisor's pause
>> + * loop exit handling can lead to excessively long delays, and
>> + * performance degradations. Limit the spinlock delay in virtual
>> + * machines to a smaller value. Called from init_hypervisor_platform
>> + */
>> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
>> +void __init init_guest_spinlock_delay(void)
>> +{
>> +	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
>> +}
>> +
>
> Same comment as last time:
>
> """
> Before reducing max_spinlock_delay, shouldn't we check that PAUSE-loop
> exiting is available? What if we are running on an older x86 machine
> that doesn't support it?
> """

I don't think this will be much of an issue.  If we are
in a CPU overcommit scenario, the spinlock overhead will
be dominated by host scheduling latencies, not by the
normal spinlock hold time or cache line bouncing.

If we are not in an overcommit scenario, limiting the
spinlock delay will result in a guest only seeing a
small performance boost from these patches, instead of
a larger one.

-- 
All rights reversed


* [PATCH fix -v5 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-07 11:11   ` Ingo Molnar
@ 2013-02-07 21:24     ` Rik van Riel
  2013-02-13 12:10       ` [tip:core/locking] x86/smp: Limit " tip-bot for Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-07 21:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo


> The kernel build will be sad on !SMP configs.

Good catch, thank you. Here is a new version.

---8<---

Subject: x86,smp: limit spinlock delay on virtual machines

Modern Intel and AMD CPUs will trap to the host when the guest
is spinning on a spinlock, allowing the host to schedule in
something else.

This effectively means the host is taking care of spinlock
backoff for virtual machines. It also means that doing the
spinlock backoff in the guest anyway can lead to totally
unpredictable results, extremely large backoffs, and
performance regressions.

To prevent those problems, we limit the spinlock backoff
delay, when running in a virtual machine, to a small value.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> 
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 arch/x86/include/asm/processor.h |    2 ++
 arch/x86/kernel/cpu/hypervisor.c |    2 ++
 arch/x86/kernel/smp.c            |   21 +++++++++++++++++++--
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 888184b..2856972 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -158,9 +158,11 @@ extern __u32			cpu_caps_set[NCAPINTS];
 #ifdef CONFIG_SMP
 DECLARE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
 #define cpu_data(cpu)		per_cpu(cpu_info, cpu)
+extern void init_guest_spinlock_delay(void);
 #else
 #define cpu_info		boot_cpu_data
 #define cpu_data(cpu)		boot_cpu_data
+static inline void init_guest_spinlock_delay(void) {}
 #endif
 
 extern const struct seq_operations cpuinfo_op;
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index a8f8fa9..4a53724 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
 
 	init_hypervisor(&boot_cpu_data);
 
+	init_guest_spinlock_delay();
+
 	if (x86_hyper->init_platform)
 		x86_hyper->init_platform();
 }
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 8e94469..4965399 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_SHIFT			8
 #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
-#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
 #define DELAY_HASH_SHIFT		6
+
+/*
+ * Modern Intel and AMD CPUs tell the hypervisor when a guest is
+ * spinning excessively on a spinlock. The hypervisor will then
+ * schedule something else, effectively taking care of the backoff
+ * for us. Doing our own backoff on top of the hypervisor's pause
+ * loop exit handling can lead to excessively long delays, and
+ * performance degradations. Limit the spinlock delay in virtual
+ * machines to a smaller value. Called from init_hypervisor_platform 
+ */
+static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
+void __init init_guest_spinlock_delay(void)
+{
+	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
+}
+
 struct delay_entry {
 	u32 hash;
 	u32 delay;
@@ -171,7 +188,7 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 		}
 
 		/* Aggressively increase delay, to minimize lock accesses. */
-		if (delay < MAX_SPINLOCK_DELAY)
+		if (delay < max_spinlock_delay)
 			delay += DELAY_FIXED_1 / 7;
 
 		loops = (delay * waiters_ahead) >> DELAY_SHIFT;


* [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-06 20:04 ` [PATCH -v5 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
@ 2013-02-13 12:06   ` tip-bot for Rik van Riel
  2013-02-13 16:20     ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: tip-bot for Rik van Riel @ 2013-02-13 12:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, riel, rostedt,
	akpm, aquini, tglx, walken

Commit-ID:  4aef331850b637169ff036ed231f0d236874f310
Gitweb:     http://git.kernel.org/tip/4aef331850b637169ff036ed231f0d236874f310
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 6 Feb 2013 15:04:03 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Feb 2013 09:06:28 +0100

x86/smp: Move waiting on contended ticket lock out of line

Moving the wait loop for congested ticket locks to its own
function allows us to add things to that wait loop, without
growing the size of the kernel text appreciably.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmiss.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rafael Aquini <aquini@redhat.com>
Cc: eric.dumazet@gmail.com
Cc: lwoodman@redhat.com
Cc: knoel@redhat.com
Cc: chegu_vinod@hp.com
Cc: raghavendra.kt@linux.vnet.ibm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130206150403.006e5294@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/spinlock.h | 11 +++++------
 arch/x86/kernel/smp.c           | 14 ++++++++++++++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 
 	inc = xadd(&lock->tickets, inc);
 
-	for (;;) {
-		if (inc.head == inc.tail)
-			break;
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
-	}
+	if (inc.head != inc.tail)
+		ticket_spin_lock_wait(lock, inc);
+
 	barrier();		/* make sure nothing creeps before the lock is taken */
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+	for (;;) {
+		cpu_relax();
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+
+		if (inc.head == inc.tail)
+			break;
+	}
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...


* [tip:core/locking] x86/smp: Implement proportional backoff for ticket spinlocks
  2013-02-06 20:04 ` [PATCH -v5 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2013-02-13 12:07   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Rik van Riel @ 2013-02-13 12:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, riel, aquini,
	akpm, tglx, walken

Commit-ID:  4c13d231339a8dc7ca056476fa10d95369080c8c
Gitweb:     http://git.kernel.org/tip/4c13d231339a8dc7ca056476fa10d95369080c8c
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 6 Feb 2013 15:04:42 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Feb 2013 09:06:29 +0100

x86/smp: Implement proportional backoff for ticket spinlocks

Add simple fixed-value proportional backoff for ticket spinlocks.

By pounding on the cache line with the spinlock less often,
bus traffic is reduced. In the case of a data structure with an
embedded spinlock, the lock holder has a better chance of
making progress.

If we are next in line behind the current holder of the
lock, we do a fast spin, so as not to waste any time when
the lock is released.

The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: eric.dumazet@gmail.com
Cc: lwoodman@redhat.com
Cc: knoel@redhat.com
Cc: chegu_vinod@hp.com
Cc: raghavendra.kt@linux.vnet.ibm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130206150442.1ffd017c@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..aa743e9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
  */
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
+	__ticket_t head = inc.head, ticket = inc.tail;
+	__ticket_t waiters_ahead;
+	unsigned loops;
+
 	for (;;) {
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
+		waiters_ahead = ticket - head - 1;
+		/*
+		 * We are next after the current lock holder. Check often
+		 * to avoid wasting time when the lock is released.
+		 */
+		if (!waiters_ahead) {
+			do {
+				cpu_relax();
+			} while (ACCESS_ONCE(lock->tickets.head) != ticket);
+			break;
+		}
+		loops = 50 * waiters_ahead;
+		while (loops--)
+			cpu_relax();
 
-		if (inc.head == inc.tail)
+		head = ACCESS_ONCE(lock->tickets.head);
+		if (head == ticket)
 			break;
 	}
 }


* [tip:core/locking] x86/smp: Auto tune spinlock backoff delay factor
  2013-02-06 20:05 ` [PATCH -v5 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-02-13 12:08   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Rik van Riel @ 2013-02-13 12:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, riel, aquini,
	akpm, tglx, walken

Commit-ID:  4cf780c77b5ef7e3170f68820bc239a689c16d5b
Gitweb:     http://git.kernel.org/tip/4cf780c77b5ef7e3170f68820bc239a689c16d5b
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 6 Feb 2013 15:05:29 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Feb 2013 09:06:30 +0100

x86/smp: Auto tune spinlock backoff delay factor

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good
reference:

	http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the
uncontended case. After all, if there is contention, the data
structure can be improved to reduce or eliminate the lock
contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket by an amount based on a delay factor and on the
number of CPUs ahead of us in the queue for this lock. Checking
the lock less often allows the lock holder to continue running,
resulting in better throughput and preventing performance from
dropping off a cliff.

Proportional spinlock delay with a high delay factor works well
when there is a lot of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: eric.dumazet@gmail.com
Cc: lwoodman@redhat.com
Cc: knoel@redhat.com
Cc: chegu_vinod@hp.com
Cc: raghavendra.kt@linux.vnet.ibm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130206150529.43150afa@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/smp.c | 41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index aa743e9..c1fce41 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -112,13 +112,34 @@
 static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
+#define DELAY_SHIFT			8
+#define DELAY_FIXED_1			(1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
 /*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. The solution is to go for a fast spin if we are at the head of
+ * the queue, to slowly increase the delay if we sleep for too short a
+ * time, and to decrease the delay if we slept for too long.
  */
+DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
 	__ticket_t waiters_ahead;
+	unsigned delay = __this_cpu_read(spinlock_delay);
 	unsigned loops;
 
 	for (;;) {
@@ -133,14 +154,28 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 			} while (ACCESS_ONCE(lock->tickets.head) != ticket);
 			break;
 		}
-		loops = 50 * waiters_ahead;
+
+		/* Aggressively increase delay, to minimize lock accesses. */
+		if (delay < MAX_SPINLOCK_DELAY)
+			delay += DELAY_FIXED_1 / 7;
+
+		loops = (delay * waiters_ahead) >> DELAY_SHIFT;
 		while (loops--)
 			cpu_relax();
 
 		head = ACCESS_ONCE(lock->tickets.head);
-		if (head == ticket)
+		if (head == ticket) {
+			/*
+			 * We overslept, and do not know by how much.
+			 * Exponentially decay the value of delay,
+			 * to get it back to a good value quickly.
+			 */
+			if (delay >= 2 * DELAY_FIXED_1)
+				delay -= max(delay/32, DELAY_FIXED_1);
 			break;
+		}
 	}
+	__this_cpu_write(spinlock_delay, delay);
 }
 
 /*


* [tip:core/locking] x86/smp: Keep spinlock delay values per hashed spinlock address
  2013-02-06 20:06 ` [PATCH -v5 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-02-13 12:09   ` tip-bot for Eric Dumazet
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Eric Dumazet @ 2013-02-13 12:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, eric.dumazet, a.p.zijlstra, torvalds,
	riel, akpm, aquini, tglx, walken

Commit-ID:  e3c94ba97ac742a4c03713629da131fef53b2237
Gitweb:     http://git.kernel.org/tip/e3c94ba97ac742a4c03713629da131fef53b2237
Author:     Eric Dumazet <eric.dumazet@gmail.com>
AuthorDate: Wed, 6 Feb 2013 15:06:47 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Feb 2013 09:06:30 +0100

x86/smp: Keep spinlock delay values per hashed spinlock address

Eric Dumazet found a regression with the first version of the
spinlock backoff code, in a workload where multiple spinlocks
were contended, each having a different wait time.

This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.

Eric Dumazet wrote:

  I did some tests with your patches with the following configuration:

  tc qdisc add dev eth0 root htb r2q 1000 default 3
  (to force a contention on qdisc lock, even with a multi queue
  net device)

  and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m
  128"

  Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
  (24 threads), and a fast NIC (10Gbps)

  Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

  In this workload we have at least two contended spinlocks, with
  different delays. (spinlocks are not held for the same duration)

  It clearly defeats your assumption of a single per-cpu delay
  being OK: some cpus keep spinning long after the lock has been
  released.

  We might try to use a hash on lock address, and an array of 16
  different delays so that different spinlocks have a chance of
  not sharing the same delay.

  With the following patch, I get 982 Mbits/s with the same bench, so
  an increase of 45 % instead of a 13 % regression.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: lwoodman@redhat.com
Cc: knoel@redhat.com
Cc: chegu_vinod@hp.com
Cc: raghavendra.kt@linux.vnet.ibm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130206150647.633f7b1c@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/smp.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index c1fce41..8e94469 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -116,6 +117,18 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
 #define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
+#define DELAY_HASH_SHIFT		6
+struct delay_entry {
+	u32 hash;
+	u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+	[0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+		.hash = 0,
+		.delay = MIN_SPINLOCK_DELAY,
+	},
+};
+
 /*
  * Wait on a congested ticket spinlock. Many spinlocks are embedded in
  * data structures; having many CPUs pounce on the cache line with the
@@ -134,12 +147,14 @@ static bool smp_no_nmi_ipi = false;
  * the queue, to slowly increase the delay if we sleep for too short a
  * time, and to decrease the delay if we slept for too long.
  */
-DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
 	__ticket_t waiters_ahead;
-	unsigned delay = __this_cpu_read(spinlock_delay);
+	u32 hash = hash32_ptr(lock);
+	u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+	struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+	u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
 	unsigned loops;
 
 	for (;;) {
@@ -171,11 +186,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 			 * to get it back to a good value quickly.
 			 */
 			if (delay >= 2 * DELAY_FIXED_1)
-				delay -= max(delay/32, DELAY_FIXED_1);
+				delay -= max_t(u32, delay/32, DELAY_FIXED_1);
 			break;
 		}
 	}
-	__this_cpu_write(spinlock_delay, delay);
+	ent->hash = hash;
+	ent->delay = delay;
 }
 
 /*

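The hashed lookup above can also be sketched in user space. In the sketch
below the per-cpu table is reduced to a single array, and hash32_ptr()/
hash_32() are reimplemented with a golden-ratio multiply in the spirit of
the kernel's include/linux/hash.h; the constant and names are illustrative,
not the kernel's exact code:

#include <stdint.h>
#include <stdio.h>

#define DELAY_HASH_SHIFT	6
#define MIN_SPINLOCK_DELAY	(1 << 8)

struct delay_entry {
	uint32_t hash;
	uint32_t delay;
};

/* One CPU's table; the kernel keeps one of these per CPU. */
static struct delay_entry delay_table[1 << DELAY_HASH_SHIFT];

/* Golden-ratio multiplicative hash, in the spirit of the kernel's hash_32(). */
static uint32_t my_hash_32(uint32_t val, unsigned int bits)
{
	return (val * 0x9e370001u) >> (32 - bits);
}

/* Fold a pointer down to 32 bits, like the kernel's hash32_ptr(). */
static uint32_t my_hash32_ptr(const void *ptr)
{
	uint64_t val = (uintptr_t)ptr;

	val ^= val >> 32;
	return (uint32_t)val;
}

/*
 * Find the delay slot for a lock: reuse the cached delay only if the slot
 * was last used for this same lock, otherwise start from the minimum again
 * (mirroring the ent->hash == hash test in the patch).
 */
static uint32_t *delay_for_lock(const void *lock)
{
	uint32_t hash = my_hash32_ptr(lock);
	uint32_t slot = my_hash_32(hash, DELAY_HASH_SHIFT);
	struct delay_entry *ent = &delay_table[slot];

	if (ent->hash != hash) {
		ent->hash = hash;
		ent->delay = MIN_SPINLOCK_DELAY;
	}
	return &ent->delay;
}

int main(void)
{
	int lock_a, lock_b;

	*delay_for_lock(&lock_a) += 100;	/* tune lock_a's slot up a bit */
	printf("lock_a delay: %u\n", *delay_for_lock(&lock_a));
	printf("lock_b delay: %u\n", *delay_for_lock(&lock_b));
	return 0;
}

Two locks that happen to collide in the 64-entry table simply keep resetting
each other's delay to the minimum, which is the fallback behaviour the patch
intends.
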
^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [tip:core/locking] x86/smp: Limit spinlock delay on virtual machines
  2013-02-07 21:24     ` [PATCH fix " Rik van Riel
@ 2013-02-13 12:10       ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Rik van Riel @ 2013-02-13 12:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, riel,
	raghavendra.kt, akpm, chegu_vinod, tglx

Commit-ID:  12b682864a336d72bfd507244649bd1066d90e43
Gitweb:     http://git.kernel.org/tip/12b682864a336d72bfd507244649bd1066d90e43
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Thu, 7 Feb 2013 16:24:49 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Feb 2013 09:07:21 +0100

x86/smp: Limit spinlock delay on virtual machines

Modern Intel and AMD CPUs will trap to the host when the guest
is spinning on a spinlock, allowing the host to schedule in
something else.

This effectively means the host is taking care of spinlock
backoff for virtual machines. It also means that doing the
spinlock backoff in the guest anyway can lead to totally
unpredictable results, extremely large backoffs, and
performance regressions.

To prevent those problems, we limit the spinlock backoff
delay, when running in a virtual machine, to a small value.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: aquini@redhat.com
Cc: eric.dumazet@gmail.com
Cc: lwoodman@redhat.com
Cc: knoel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130207162449.0292685a@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 arch/x86/include/asm/processor.h |  2 ++
 arch/x86/kernel/cpu/hypervisor.c |  2 ++
 arch/x86/kernel/smp.c            | 21 +++++++++++++++++++--
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 888184b..2856972 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -158,9 +158,11 @@ extern __u32			cpu_caps_set[NCAPINTS];
 #ifdef CONFIG_SMP
 DECLARE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
 #define cpu_data(cpu)		per_cpu(cpu_info, cpu)
+extern void init_guest_spinlock_delay(void);
 #else
 #define cpu_info		boot_cpu_data
 #define cpu_data(cpu)		boot_cpu_data
+static inline void init_guest_spinlock_delay(void) {}
 #endif
 
 extern const struct seq_operations cpuinfo_op;
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index a8f8fa9..4a53724 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -76,6 +76,8 @@ void __init init_hypervisor_platform(void)
 
 	init_hypervisor(&boot_cpu_data);
 
+	init_guest_spinlock_delay();
+
 	if (x86_hyper->init_platform)
 		x86_hyper->init_platform();
 }
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 8e94469..73be656 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -116,8 +116,25 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_SHIFT			8
 #define DELAY_FIXED_1			(1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
-#define MAX_SPINLOCK_DELAY		(16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)
 #define DELAY_HASH_SHIFT		6
+
+/*
+ * Modern Intel and AMD CPUs tell the hypervisor when a guest is
+ * spinning excessively on a spinlock. The hypervisor will then
+ * schedule something else, effectively taking care of the backoff
+ * for us. Doing our own backoff on top of the hypervisor's pause
+ * loop exit handling can lead to excessively long delays, and
+ * performance degradations. Limit the spinlock delay in virtual
+ * machines to a smaller value. Called from init_hypervisor_platform
+ */
+static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
+void __init init_guest_spinlock_delay(void)
+{
+	max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
+}
+
 struct delay_entry {
 	u32 hash;
 	u32 delay;
@@ -171,7 +188,7 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 		}
 
 		/* Aggressively increase delay, to minimize lock accesses. */
-		if (delay < MAX_SPINLOCK_DELAY)
+		if (delay < max_spinlock_delay)
 			delay += DELAY_FIXED_1 / 7;
 
 		loops = (delay * waiters_ahead) >> DELAY_SHIFT;

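A rough sense of the scale involved (these numbers follow directly from the
defines above; they are not separate measurements):

	/* loops = (delay * waiters_ahead) >> DELAY_SHIFT, with delay capped, so:
	 *   bare metal: delay <= 16000 * DELAY_FIXED_1 -> at most 16000 * waiters_ahead cpu_relax() calls per poll
	 *   guest:      delay <=    16 * DELAY_FIXED_1 -> at most    16 * waiters_ahead cpu_relax() calls per poll
	 */
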
^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 12:06   ` [tip:core/locking] x86/smp: Move " tip-bot for Rik van Riel
@ 2013-02-13 16:20     ` Linus Torvalds
  2013-02-13 18:30       ` Linus Torvalds
  2013-02-13 19:08       ` Rik van Riel
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-02-13 16:20 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Peter Zijlstra, Rik van Riel, rostedt, aquini,
	Andrew Morton, Thomas Gleixner, Michel Lespinasse
  Cc: linux-tip-commits

On Wed, Feb 13, 2013 at 4:06 AM, tip-bot for Rik van Riel
<riel@redhat.com> wrote:
>
> x86/smp: Move waiting on contended ticket lock out of line
>
> Moving the wait loop for congested locks to its own function
> allows us to add things to that wait loop, without growing the
> size of the kernel text appreciably.

Did anybody actually look at the code generation of this?

Adding an external function call is *horrible*, and you might almost
as well just uninline the spinlock entirely if you do this. It means
that all the small callers now have their registers trashed, whether
the unlikely function call is taken or not, and now leaf functions
aren't leaves any more.

Guys, the biggest cost of a function call is not the "call"
instruction, it's all the crap around it - trashed registers causing
us to have to have a stack frame in the caller when none was required
otherwise, fixed register allocation by the arguments/return value etc
etc. So having inline functions that then call another function means
that the inline is almost entirely pointless (unless it was just to
set up default arguments or something).

The call is likely also about the same size as the code in the wait
loop. And the excuse is "so that we can add stuff to the wait loop".
What the f*ck? There's nothing to do in the wait-loop, and if you want
debugging or statistics or something, the spinlock should have been
out-of-line.

This is apparently for the auto-tuning, which came with absolutely no
performance numbers (except for the *regressions* it caused), and
which is something we have actively *avoided* in the past, because
back-off is a f*cking idiotic thing, and the only real fix for
contended spinlocks is to try to avoid the contention and fix the
caller to do something smarter to begin with.

In other words, the whole f*cking thing looks incredibly broken. At
least give some good explanations for why crap like this is needed,
instead of just implementing backoff without even numbers for real
loads. And no, don't bother to give numbers for pointless benchmarks.
It's easy to get contention on a benchmark, but spinlock backoff is
only remotely interesting on real loads.

                      Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 16:20     ` Linus Torvalds
@ 2013-02-13 18:30       ` Linus Torvalds
  2013-02-14  0:54         ` H. Peter Anvin
                           ` (2 more replies)
  2013-02-13 19:08       ` Rik van Riel
  1 sibling, 3 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-02-13 18:30 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Peter Zijlstra, Rik van Riel, rostedt, aquini,
	Andrew Morton, Thomas Gleixner, Michel Lespinasse
  Cc: linux-tip-commits

On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Adding an external function call is *horrible*, and you might almost
> as well just uninline the spinlock entirely if you do this. It means
> that all the small callers now have their registers trashed, whether
> the unlikely function call is taken or not, and now leaf functions
> aren't leaves any more.

Btw, we've had things like this before, and I wonder if we could
perhaps introduce the notion of a "light-weight call" for fastpath
code that calls unlikely slow-path code..

In particular, see the out-of-line code used by the rwlocks etc (see
"arch_read_lock()" for an example in arch/x86/include/asm/spinlock.h
and arch/x86/lib/rwlock.S), where we end up calling things from inline
asm, with one big reason being exactly the fact that a "normal" C call
has such horribly detrimental effects on the caller.

Sadly, gcc doesn't seem to allow specifying which registers are
clobbered any easy way, which means that both the caller and the
callee *both* tend to need to have some asm interface. So we bothered
to do this for __read_lock_failed, but we have *not* bothered to do
the same for the otherwise very similar __mutex_fastpath_lock() case,
for example.

So for rwlocks, we actually get very nice code generation with small
leaf functions not necessarily needing stack frames, but for mutexes
we mark a lot of registers "unnecessarily" clobbered in the caller,
exactly because we do *not* do that asm interface for the callee. So
we have to clobber all the standard callee-clobbered registers, which
is really sad, and callers almost always need a stack frame, because
if they have any data live at all across the mutex, they have to save
it in some register that is callee-saved - which basically means that
the function has to have that stack frame in order to save its *own*
callee-saved registers.

So it means that we penalize the fastpath because the slow-path can't
be bothered to do the extra register saving, unless we go to the
lengths we went to for the rwlocks, and build a wrapper in asm to save
the extra registers in the cold path.

Maybe we could introduce some helpers to create these kinds of asm
wrappers to do this? Something that would allow us to say: "this
function only clobbers a minimal set of registers and you can call it
from asm and only mark %rax/rcx/rdx clobbered" and that allows leaf
functions to look like leaf functions for the fastpath?

Hmm? That would make my dislike of uninlining the slow case largely go
away. I still think that back-off tends to be a mistake (and is often
horrible for virtualization etc), but as long as the fastpath stays
close to optimal, I don't care *too* much.

                     Linus

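To make the register-clobber point concrete, here is a stand-alone user-space
sketch (x86-64, gcc) of the pattern described above: the slow path is written
in asm so it preserves everything the caller cares about, and the call site
reaches it through inline asm with a minimal clobber list. The names
slowpath_inc and fastpath are invented for the example, and this is not the
kernel's rwlock.S code; build with -mno-red-zone (as the kernel does) so that
calling out of inline asm is safe.

#include <stdio.h>

/*
 * Slow path implemented in asm: it only reads %rdi and touches memory and
 * the flags, so from the caller's point of view nothing else is clobbered.
 */
__asm__(
	".text\n"
	".globl slowpath_inc\n"
	".type slowpath_inc, @function\n"
	"slowpath_inc:\n"
	"	lock incl (%rdi)\n"
	"	ret\n"
);

static inline void fastpath(int *counter)
{
	/*
	 * The call site only names the fixed input register plus memory and
	 * flags, so the compiler may keep its other live values in registers
	 * across this (unlikely) call.
	 */
	asm volatile("call slowpath_inc"
		     : /* no outputs */
		     : "D" (counter)
		     : "memory", "cc");
}

int main(void)
{
	int x = 41;

	fastpath(&x);
	printf("%d\n", x);	/* prints 42 */
	return 0;
}

This is the same idea as calling __read_lock_failed from inline asm in
arch/x86/include/asm/spinlock.h: the only difference from a plain C call is
what the compiler is told about clobbers.
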
^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 16:20     ` Linus Torvalds
  2013-02-13 18:30       ` Linus Torvalds
@ 2013-02-13 19:08       ` Rik van Riel
  2013-02-13 19:36         ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-13 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On 02/13/2013 11:20 AM, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 4:06 AM, tip-bot for Rik van Riel
> <riel@redhat.com> wrote:
>>
>> x86/smp: Move waiting on contended ticket lock out of line
>>
>> Moving the wait loop for congested locks to its own function
>> allows us to add things to that wait loop, without growing the
>> size of the kernel text appreciably.
>
> Did anybody actually look at the code generation of this?

Good catch.

This looks like something that may be fixable, though I
do not know whether it actually matters. Adding an unlikely
to the if condition where we call the contention path does
seem to clean up the code a little bit...

> This is apparently for the auto-tuning, which came with absolutely no
> performance numbers (except for the *regressions* it caused), and
> which is something we have actively *avoided* in the past, because
> back-off is a f*cking idiotic thing, and the only real fix for
> contended spinlocks is to try to avoid the contention and fix the
> caller to do something smarter to begin with.
>
> In other words, the whole f*cking thing looks incredibly broken. At
> least give some good explanations for why crap like this is needed,
> instead of just implementing backoff without even numbers for real
> loads. And no, don't bother to give numbers for pointless benchmarks.
> It's easy to get contention on a benchmark, but spinlock backoff is
> only remotely interesting on real loads.

Lock contention falls into two categories. One is contention
on resources that are used inside the kernel, which may be
fixable by changing the data and the code.

The second is lock contention driven by external factors,
like userspace processes all trying to access the same file,
or grab the same semaphore. Not all of these cases may be
fixable on the kernel side.

A further complication is that these kinds of performance
issues often get discovered on production systems, which
are stuck on a particular kernel and cannot introduce
drastic changes.

The spinlock backoff code prevents these last cases from
experiencing large performance regressions when the hardware
is upgraded.

None of the scalable locking systems magically make things
scale. All they do is prevent catastrophic performance drops
when moving from N to N+x CPUs, allowing user systems to
continue working while kernel developers address the actual
underlying scalability issues.

As a car analogy, think of this not as an accelerator, but
as an airbag. Spinlock backoff (or other scalable locking
code) exists to keep things from going horribly wrong when
we hit a scalability wall.

Does that make more sense?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 19:08       ` Rik van Riel
@ 2013-02-13 19:36         ` Linus Torvalds
  2013-02-13 22:21           ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-13 19:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel <riel@redhat.com> wrote:
>
> The spinlock backoff code prevents these last cases from
> experiencing large performance regressions when the hardware
> is upgraded.

I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays is often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning
code.

And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.

> As a car analogy, think of this not as an accelerator, but
> as an airbag. Spinlock backoff (or other scalable locking
> code) exists to keep things from going horribly wrong when
> we hit a scalability wall.
>
> Does that make more sense?

Not without tons of numbers from many different platforms, it doesn't.
And not without explaining which spinlock it is that is so contended
in the first place.

We've been very good at fixing spinlock contention. Now, that does
mean that what is likely left isn't exactly low-hanging fruit, but it
also means that the circumstances where it triggers are probably quite
uncommon.

So I claim:

 - it's *really* hard to trigger in real loads on common hardware.

 - if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.

 - we absolutely should *not* pessimize the common case for this

So I suspect contention is something that you *may* need on some
particular platforms ("Oh, I have 64 sockets and 1024 threads, I can
trigger contention easily"), but that tends to be unusual, and any
back-off code should be damn aware of the fact that it only helps the
0.01%.

Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because "back-off" is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off? How do you account for hardware
that notices busy loops and turns them into effectively just mwait?

Btw, the "notice busy loops and turn it into mwait" is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just the value loaded from memory suddenly means that hardware
has a lot harder time doing things like that.

And "notice busy loops and turn it into mwait" is actually a big deal
for power use of a CPU. Back-off with busy-looping timing waits can be
an absolutely *horrible* thing for power use.  So we have bigger
issues than just performance, there's complex CPU power behavior too.
Being "smart" can often be really really hard.

I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself, but the patch I did see looked like
it might be absolutely horrible. How long does a "cpu_relax()" wait?
Do you know? How does "cpu_relax()" interface with the rest of the
CPU? Do you know? Because I've heard noises about cpu_relax() actually
affecting the memory pipe behavior of cache accesses of the CPU, and
thus the "cpu_relax()" in a busy loop that does *not* look at the
value (your "backoff delay loop") may actually work very very
differently from the cpu_relax() in the actual "wait for the value to
change" loop.

And how does that account for two different microarchitectures doing
totally different things? Maybe one uarch makes cpu_relax() just shut
down the front-end for a while, while another does something much
fancier and gives hints to in-flight memory accesses etc?

When do you start doing mwait vs just busy-looping with cpu_relax? How
do you tune it to do the right thing for different architectures?

So I think this is complex. At many different levels. And it's *all*
about the details. No handwaving about how "back-off is like a air
bag". Because the big picture is entirely and totally irrelevant, when
the details are almost all that actually matter.

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 19:36         ` Linus Torvalds
@ 2013-02-13 22:21           ` Rik van Riel
  2013-02-13 22:40             ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-13 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On 02/13/2013 02:36 PM, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel <riel@redhat.com> wrote:
>> The spinlock backoff code prevents these last cases from
>> experiencing large performance regressions when the hardware
>> is upgraded.
>
> I still want *numbers*.
>
> There are real cases where backoff does exactly the reverse, and makes
> things much much worse. The tuning of the backoff delays is often
> *very* hardware sensitive, and upgrading hardware can turn out to do
> exactly what you say - but for the backoff, not the regular spinning
> code.

What kind of numbers would you like?

Numbers showing that the common case is not affected by this
code?

Or numbers showing that performance of something is improved
with this code?

Of course, the latter would point out a scalability issue,
that may be fixable by changing the data structures and code
involved :)

> And we have hardware that actually autodetects some cacheline bouncing
> patterns and may actually do a better job than software. It's *hard*
> for software to know whether it's bouncing within the L1 cache between
> threads, or across fabric in a large machine.

If we have just a few CPUs bouncing the cache line around, we
have no real performance issues and the loop with cpu_relax
seems to be doing fine.

Real issues arise when we have a larger number of CPUs contending
on the same lock. At that point we know we are no longer bouncing
between sibling cores.

>> As a car analogy, think of this not as an accelerator, but
>> as an airbag. Spinlock backoff (or other scalable locking
>> code) exists to keep things from going horribly wrong when
>> we hit a scalability wall.
>>
>> Does that make more sense?
>
> Not without tons of numbers from many different platforms, it doesn't.

That makes sense, especially if you are looking to make sure these
patches do not introduce regressions.

> And not without explaining which spinlock it is that is so contended
> in the first place.

This patch series is not as much for the spinlocks we know about
(we can probably fix those), but to prevent catastrophic
performance degradation when users run into contention on spinlocks
we do NOT know about.

> So I claim:
>
>   - it's *really* hard to trigger in real loads on common hardware.

This is definitely true.  However, with many millions of Linux
users out there, there will always be users who encounter a
performance problem, upgrade their hardware, and then run into
an even worse performance problem because they run into a
scalability issue.

The object of these patches is to go from:

"We doubled the number of CPUs, and now things go even slower.", to

"We doubled the number of CPUs, but things aren't going any faster."

The first is a real disaster for Linux users. Especially when the
workload consists of multiple things, some of which run faster on
the upgraded hardware, and others which run slower. What makes it
worse is that these kinds of issues always seem to happen on the
users' most expensive servers, running their most important
workloads...

The second case is something users can temporarily tolerate, while
we fix the underlying issue.

>   - if it does trigger in any half-way reasonably common setup
> (hardware/software), we most likely should work really hard at fixing
> the underlying problem, not the symptoms.

Agreed.

>   - we absolutely should *not* pessimize the common case for this

Agreed. Are there any preferred benchmarks you would absolutely like
to see, to show that these patches do not introduce regressions?

> Hurting the 99.99% even a tiny amount should be something we should
> never ever do. This is why I think the fast case is so important (and
> I had another email about possibly making it acceptable), but I think
> the *slow* case should be looked at a lot too. Because "back-off" is
> absolutely *not* necessarily hugely better than plain spinning, and it
> needs numbers. How many times do you spin before even looking at
> back-off? How much do you back off?

Whether or not we go into back-off at all depends on a number
of factors, including how many waiters there are ahead of us
waiting for the same ticket spinlock, and whether we have spent
a lot of time spinning on this lock in the past.

> Btw, the "notice busy loops and turn it into mwait" is not some
> theoretical magic thing. And it's exactly the kind of thing that
> back-off *breaks* by making the loop too complex for hardware to
> understand. Even just adding counters with conditionals that are *not*
> about just he value loaded from memory suddently means that hardware
> has a lot harder time doing things like that.

> I don't know if you perhaps had some future plans of looking at using
> mwait in the backoff code itself,

I looked at mwait, but the documentation suggests it only ever
works at cache line granularity (or larger). That means every
time another CPU adds itself to the queue for the ticket lock,
or every time the lock holder writes to something else in the
cache line the lock is in, mwait'ing cpus get woken up.

It looks like mwait is designed for the case of a lock being
on its own cache line, with nothing else. However, Linux does
not seem to have many of those locks left, and the contention
issues I see the most seem to involve locks embedded in data
structures, sharing the cache line with data that the lock
holder modifies.

In fact, it looks like going to mwait has the potential to
increase, rather than decrease, memory traffic.

> but the patch I did see looked like
> it might be absolutely horrible. How long does a "cpu_relax()" wait?
> Do you know? How does "cpu_relax()" interface with the rest of the
> CPU? Do you know? Because I've heard noises about cpu_relax() actually
> affecting the memory pipe behavior of cache accesses of the CPU, and
> thus the "cpu_relax()" in a busy loop that does *not* look at the
> value (your "backoff delay loop") may actually work very very
> differently from the cpu_relax() in the actual "wait for the value to
> change" loop.
>
> And how does that account for two different microarchitectures doing
> totally different things? Maybe one uarch makes cpu_relax() just shut
> down the front-end for a while, while another does something much
> fancier and gives hints to in-flight memory accesses etc?

You summed up all the reasons for doing auto-tuning, aggressively
collapsing the back-off if we overslept, and starting out from 1
if we do not find a cached value.

> When do you start doing mwait vs just busy-looping with cpu_relax? How
> do you tune it to do the right thing for different architectures?

Mwait seems like the wrong thing to do, because the spinlocks where
we tend to see contention, tend to be the ones embedded in data
structures, with other things sharing the same cache line.

> So I think this is complex. At many different levels. And it's *all*
> about the details. No handwaving about how "back-off is like a air
> bag". Because the big picture is entirely and totally irrelevant, when
> the details are almost all that actually matter.

The details can vary a lot from CPU to CPU. I would argue that we
need code that does not break down, regardless of those details.

I agree that the code could use more regression testing, and will
get you test results. Please let me know if there are any particular
scenarios you would like to see tested.

That will also let us determine whether or not the call overhead (in
case we are waiting for a contended lock anyway) is an issue that
needs fixing.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 22:21           ` Rik van Riel
@ 2013-02-13 22:40             ` Linus Torvalds
  2013-02-13 23:41               ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-13 22:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On Wed, Feb 13, 2013 at 2:21 PM, Rik van Riel <riel@redhat.com> wrote:
>
> What kind of numbers would you like?
>
> Numbers showing that the common case is not affected by this
> code?
>
> Or numbers showing that performance of something is improved
> with this code?
>
> Of course, the latter would point out a scalability issue,
> that may be fixable by changing the data structures and code
> involved :)

Yes.

To all three.

> If we have just a few CPUs bouncing the cache line around, we
> have no real performance issues and the loop with cpu_relax
> seems to be doing fine.
>
> Real issues arise when we have a larger number of CPUs contending
> on the same lock. At that point we know we are no longer bouncing
> between sibling cores.

Again. Exactly my point.

So you're potentially making things worse for just about everybody, in
order to fix a problem that almost nobody can actually see. And
apparently you don't even know the problem.

> This patch series is not as much for the spinlocks we know about
> (we can probably fix those), but to prevent catastrophic
> performance degradation when users run into contention on spinlocks
> we do NOT know about.

.. and as I already explained, THAT IS PURE AND UTTER BULLSHIT.

It may make things WORSE. On all the things you haven't run to check
that it does better.

You just stated something that is not at all necessarily true. You
changed the spinlock behavior, and then you blindly state that it will
"fix" things on loads you haven't even tested.

That's pure bullshit.

For all we know, it goes exactly the other way around, and makes some
unknown load perform much much worse, because it turns something that
didn't *use* to be contended into a contended case, because the
spinlock is slower on some piece of hardware (or under
virtualization).

We already saw one virtual load regress quite noticeably, didn't we?

And yet you go on to say that it will fix performance problems THAT WE
DON'T EVEN KNOW ABOUT! After seeing *proof* to the exact contrary
behavior! What f*cking planet are you from, again?

Christ, that's hubris.

Besides, out-of-line spinlock loops are *horrible* for performance
monitoring. One of the *best* things about inlining the spinlock
looping code is that it's so very obvious where a hot lock is used. It
shows up as a shining beacon in profiles, without any need for
callchains etc. So not only don't we know what loads it would improve,
but those unknown loads would also be much harder to figure out.

(And yes, this is a real problem - look at every time we have a
problem with scaling sleeping locks, and how all the time just ends up
being random scheduler time, with the actual lock being very hard to
see. Spinlocks are *easy* partly because they can be inlined)

Seriously. I'm going to NAK these kinds of changes unless you can
actually show real improvement on real loads. No more of this "in
theory this will fix all our problems" crap.

             Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 22:40             ` Linus Torvalds
@ 2013-02-13 23:41               ` Rik van Riel
  2013-02-14  1:21                 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-13 23:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

[-- Attachment #1: Type: text/plain, Size: 5602 bytes --]

On 02/13/2013 05:40 PM, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 2:21 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> What kind of numbers would you like?
>>
>> Numbers showing that the common case is not affected by this
>> code?
>>
>> Or numbers showing that performance of something is improved
>> with this code?
>>
>> Of course, the latter would point out a scalability issue,
>> that may be fixable by changing the data structures and code
>> involved :)
>
> Yes.
>
> To all three.

I will run a bunch of tests to demonstrate there are no
regressions with this code. You'll have to wait a few days
before I have a good set of numbers for that.

I have an example of the second case. It is a test case
from a customer issue, where an application is contending on
semaphores, doing semaphore lock and unlock operations. The
test case simply has N threads, trying to lock and unlock the
same semaphore.

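For reference, that test boils down to something like the sketch below: N
threads doing lock/unlock on one SysV semaphore. The actual customer test
case is not public, so everything here (iteration count, defaults) is
illustrative; build with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define ITERS 1000000			/* lock/unlock pairs per thread */

static int semid;

static void *worker(void *arg)
{
	struct sembuf lock   = { 0, -1, 0 };
	struct sembuf unlock = { 0, +1, 0 };
	long i;

	(void)arg;
	for (i = 0; i < ITERS; i++) {
		semop(semid, &lock, 1);		/* P */
		semop(semid, &unlock, 1);	/* V */
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 24;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));
	int i;

	semid = semget(IPC_PRIVATE, 1, 0600);
	semctl(semid, 0, SETVAL, 1);		/* start unlocked */

	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	semctl(semid, 0, IPC_RMID);
	printf("%d threads, %d lock/unlock pairs each\n", nthreads, ITERS);
	return 0;
}

Timing runs at 1, 2, 4, ... 24 threads gives the throughput-versus-threads
shape being discussed here.
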
The attached graph (which I sent out with the intro email to
my patches) shows how reducing the memory accesses from the
spinlock wait path prevents the large performance degradation
seen with the vanilla kernel. This is on a 24 CPU system with
4 6-core AMD CPUs.

The "prop-N" series are with a fixed delay proportional back-off.
You can see that a small value of N does not help much for large
numbers of cpus, and a large value hurts with a small number of
CPUs. The automatic tuning appears to be quite robust.

>> If we have just a few CPUs bouncing the cache line around, we
>> have no real performance issues and the loop with cpu_relax
>> seems to be doing fine.
>>
>> Real issues arise when we have a larger number of CPUs contending
>> on the same lock. At that point we know we are no longer bouncing
>> between sibling cores.
>
> Again. Exactly my point.
>
> So you're potentially making things worse for just about everybody, in
> order to fix a problem that almost nobody can actually see. And
> apparently you don't even know the problem.

If we have only a few CPUs contending on the lock, the delays
will be short. Furthermore, the CPU at the head of the queue
will run the old spinlock code with just cpu_relax() and checking
the lock each iteration.

>> This patch series is not as much for the spinlocks we know about
>> (we can probably fix those), but to prevent catastrophic
>> performance degradation when users run into contention on spinlocks
>> we do NOT know about.
>
> .. and as I already explained, THAT IS PURE AND UTTER BULLSHIT.
>
> It may make things WORSE. On all the things you haven't run to check
> that it does better.
>
> You just stated something that is not at all necessarily true. You
> changed the spinlock behavior, and then you blindly state that it will
> "fix" things on loads you haven't even tested.

I did not claim it will fix things.  I claim that it helps reduce
the excessive cross-cpu memory accesses (from the spinlock acquisition
path) that can cause catastrophic performance degradation.

This happens to be what I, Michel and Eric have observed.

Eric got a 45% increase in network throughput, and I saw a factor 4x
or so improvement with the semaphore test. I realize these are not
"real workloads", and I will give you numbers with those once I have
gathered some, on different systems.

> For all we know, it goes exactly the other way around, and makes some
> unknown load perform much much worse, because it turns something that
> didn't *use* to be contended into a contended case, because the
> spinlock is slower on some piece of hardware (or under
> virtualization).
>
> We already saw one virtual load regress quite noticeably, didn't we?

The cause of that was identified (with pause loop exiting, the  host
effectively does the back-off for us), and the problem is avoided
by limiting the maximum back-off value to something small on
virtual guests.

> And yet you go on to say that it will fix performance problems THAT WE
> DON'T EVEN KNOW ABOUT! After seeing *proof* to the exact contrary
> behavior! What f*cking planet are you from, again?
>
> Christ, that's hubris.

I realize you do not have time to read all the email discussions
around these patches, but please do not assume that everybody else
are drooling idiots who are unable to log into their computers in
the morning without their guide dogs typing their passwords for them.

I will try to get you some performance test numbers on various kinds
of hardware over the next few days (probably early next week, depending
on hardware availability in the lab), running some mix of benchmarks
and workloads.

> Besides, out-of-line spinlock loops are *horrible* for performance
> monitoring. One of the *best* things about inlining the spinlock
> looping code is that it's so very obvious where a hot lock is used. It
> shows up as a shining beacon in profiles, without any need for
> callchains etc. So not only don't we know what loads it would improve,
> but those unknown loads would also be much harder to figure out.

Are there significant cases where "perf -g" is not easily available,
or harmful to tracking down the performance issue?

Even with inlined spinlocks, you have the issue that they tend to
be taken from functions that can be called from multiple places.
At least, in the parts of the kernel I tend to look at...

> Seriously. I'm going to NAK these kinds of changes unless you can
> actually show real improvement on real loads. No more of this "in
> theory this will fix all our problems" crap.

I will get you numbers. Please hang on while I try to reserve
various lab systems, run programs on them, etc...

-- 
All rights reversed

[-- Attachment #2: spinlock-backoff-v2.png --]
[-- Type: image/png, Size: 21964 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 18:30       ` Linus Torvalds
@ 2013-02-14  0:54         ` H. Peter Anvin
  2013-02-14  1:31           ` Linus Torvalds
  2013-02-14 10:50         ` Ingo Molnar
  2013-02-15  6:48         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 54+ messages in thread
From: H. Peter Anvin @ 2013-02-14  0:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Rik van Riel, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On 02/13/2013 10:30 AM, Linus Torvalds wrote:
> 
> Sadly, gcc doesn't seem to allow specifying which registers are
> clobbered any easy way, which means that both the caller and the
> callee *both* tend to need to have some asm interface. So we bothered
> to do this for __read_lock_failed, but we have *not* bothered to do
> the same for the otherwise very similar __mutex_fastpath_lock() case,
> for example.
> 

It does for the callee, but only on a whole-file basis.  It would be a
lot nicer if we could do it with function attributes.

	-hpa



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 23:41               ` Rik van Riel
@ 2013-02-14  1:21                 ` Linus Torvalds
  2013-02-14  1:46                   ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-02-14  1:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt

On Wed, Feb 13, 2013 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
>
> I have an example of the second case. It is a test case
> from a customer issue, where an application is contending on
> semaphores, doing semaphore lock and unlock operations. The
> test case simply has N threads, trying to lock and unlock the
> same semaphore.
>
> The attached graph (which I sent out with the intro email to
> my patches) shows how reducing the memory accesses from the
> spinlock wait path prevents the large performance degradation
> seen with the vanilla kernel. This is on a 24 CPU system with
> 4 6-core AMD CPUs.
>
> The "prop-N" series are with a fixed delay proportional back-off.
> You can see that a small value of N does not help much for large
> numbers of cpus, and a large value hurts with a small number of
> CPUs. The automatic tuning appears to be quite robust.

Ok, good, so there are some numbers. I didn't see any in the commit
messages anywhere, and since the threads I've looked at are from
tip-bot, I never saw the intro email.

That said, it's interesting that this happens with the semaphore path.
We've had other cases where the spinlock in the *sleeping* locks has
caused problems, and I wonder if we should look at that path in
particular.

> If we have only a few CPUs contending on the lock, the delays
> will be short.

Yes. I'm more worried about the overhead, especially on I$ (and to a
lesser degree on D$ when loading hashed delay values etc). I don't
believe it would ever loop very long, it's the other overhead I'd be
worried about.

From looking at profiles of the kernel loads I've cared about (ie
largely VFS code), the I$ footprint seems to be a big deal, and
function entry (and the instruction *after* a call instruction)
actually tend to be hotspots. Which is why I care about things like
function prologues for leaf functions etc.

> Furthermore, the CPU at the head of the queue
> will run the old spinlock code with just cpu_relax() and checking
> the lock each iteration.

That's not AT ALL TRUE.

Look at the code you wrote. It does all the spinlock delay etc crap
unconditionally. Only the loop itself is conditional.

IOW, exactly all the overhead that I worry about. The function call,
the pointless turning of leaf functions into non-leaf functions, the
loading (and storing) of delay information etc etc.

The non-leaf-function thing is done even if you never hit the
slow-path, and affects the non-contention path. And the delay
information thing is done even if there is only one waiter on the
spinlock.

Did I miss anything?

> Eric got a 45% increase in network throughput, and I saw a factor 4x
> or so improvement with the semaphore test. I realize these are not
> "real workloads", and I will give you numbers with those once I have
> gathered some, on different systems.

Good. This is what I want to see.

> Are there significant cases where "perf -g" is not easily available,
> or harmful to tracking down the performance issue?

Yes. There are lots of machines where you cannot get call chain
information with CPU event buffers (pebs). And without the CPU event
buffers, you cannot get good profile data at all.

Now, on other machines you get the call chain even with pebs because
you can get the whole

> The cause of that was identified (with pause loop exiting, the  host
> effectively does the back-off for us), and the problem is avoided
> by limiting the maximum back-off value to something small on
> virtual guests.

And what if the hardware does something equivalent even when not
virtualized (ie power optimizations I already mentioned)?  That whole
maximum back-off limit seems to be just for known virtualization
issues. This is the kind of thing that makes me worry..

                   Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14  0:54         ` H. Peter Anvin
@ 2013-02-14  1:31           ` Linus Torvalds
  2013-02-14  1:56             ` H. Peter Anvin
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-14  1:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Rik van Riel, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On Wed, Feb 13, 2013 at 4:54 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> It does for the callee, but only on a whole-file basis.  It would be a
> lot nicer if we could do it with function attributes.

A way to just set the callee-clobbered list on a per-function basis
would be lovely. Gcc has limited support for this on some
architectures, where you can specify "save every register for this
function" in order to do things like interrupt handlers etc without
even resorting to asm. But there is no generic (or even just x86)
support for anything like it :-(

There are other calling-convention attributes that make me suspect gcc
could easily do this (it already supports per-function ABI
specification, so presumably it already has some concept of
callee-saved registers being different for different attributes), but
from my reading you currently have to generate asm wrappers by hand
(and call them by hand with inline asm) if you want to do something
like this.

And the straightforward "just wrap it in asm" approach sadly then
causes double "call" instructions etc. So that slows down the slow
case unnecessarily much ;(

                 Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14  1:21                 ` Linus Torvalds
@ 2013-02-14  1:46                   ` Linus Torvalds
  2013-02-14 10:43                   ` Ingo Molnar
  2013-02-27 16:42                   ` Rik van Riel
  2 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-02-14  1:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt

On Wed, Feb 13, 2013 at 5:21 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, on other machines you get the call chain even with pebs because
> you can get the whole

Oops, that got cut short early, because I started looking up when PEBS
and the last-branch-buffer work together, and couldn't find it, and
then came back to the email and forgot to finish the sentence.

Anyway: sometimes you get call chains with precise events, sometimes
you don't. And it turns out that I can't find out which cores do both,
and which ones don't.

                Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14  1:31           ` Linus Torvalds
@ 2013-02-14  1:56             ` H. Peter Anvin
  0 siblings, 0 replies; 54+ messages in thread
From: H. Peter Anvin @ 2013-02-14  1:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Rik van Riel, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On 02/13/2013 05:31 PM, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 4:54 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> It does for the callee, but only on a whole-file basis.  It would be a
>> lot nicer if we could do it with function attributes.
> 
> A way to just set the callee-clobbered list on a per-function basis
> would be lovely. Gcc has limited support for this on some
> architectures, where you can specify "save every register for this
> function" in order to do things like interrupt handlers etc without
> even resorting to asm. But there is no generic (or even just x86)
> support for anything like it :-(
> 
> There are other calling-convention attributes that make me suspect gcc
> could easily do this (it already supports per-function ABI
> specification, so presumably it already has some concept of
> callee-saved registers being different for different attributes), but
> from my reading you currently have to generate asm wrappers by hand
> (and call them by hand with inline asm) if you want to do something
> like this.
> 

I just filed a gcc bugzilla on this:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56314

	-hpa


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14  1:21                 ` Linus Torvalds
  2013-02-14  1:46                   ` Linus Torvalds
@ 2013-02-14 10:43                   ` Ingo Molnar
  2013-02-27 16:42                   ` Rik van Riel
  2 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-02-14 10:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Eric got a 45% increase in network throughput, and I saw a 
> > factor 4x or so improvement with the semaphore test. I 
> > realize these are not "real workloads", and I will give you 
> > numbers with those once I have gathered some, on different 
> > systems.
> 
> Good. This is what I want to see.

I've put these commits aside meanwhile, until we have the 
numbers and decide one way or another.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 18:30       ` Linus Torvalds
  2013-02-14  0:54         ` H. Peter Anvin
@ 2013-02-14 10:50         ` Ingo Molnar
  2013-02-14 16:10           ` Linus Torvalds
  2013-02-15  6:48         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2013-02-14 10:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra,
	Rik van Riel, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Adding an external function call is *horrible*, and you 
> > might almost as well just uninline the spinlock entirely if 
> > you do this. It means that all the small callers now have 
> > their registers trashed, whether the unlikely function call 
> > is taken or not, and now leaf functions aren't leaves any 
> > more.
> 
> Btw, we've had things like this before, and I wonder if we 
> could perhaps introduce the notion of a "light-weight call" 
> for fastpath code that calls unlikely slow-path code..
> 
> In particular, see the out-of-line code used by the rwlocks 
> etc (see "arch_read_lock()" for an example in 
> arch/x86/include/asm/spinlock.h and arch/x86/lib/rwlock.S), 
> where we end up calling things from inline asm, with one big 
> reason being exactly the fact that a "normal" C call has such 
> horribly detrimental effects on the caller.
> 
> Sadly, gcc doesn't seem to allow specifying which registers 
> are clobbered any easy way, which means that both the caller 
> and the callee *both* tend to need to have some asm interface. 
> So we bothered to do this for __read_lock_failed, but we have 
> *not* bothered to do the same for the otherwise very similar 
> __mutex_fastpath_lock() case, for example.

At least on x86, how about saving *all* volatile registers in 
the slow out of line code path (to stack)?

That means we wouldn't have to do anything fancy with the called 
functions, and the caller would see minimal register impact. It 
would also be reasonably robust and straightforward assembly 
code.

It blows up the slow path somewhat, but it would allow us to 
keep the fast-path register impact even smaller - as the slow 
path would only have memory content side effects.

Am I missing something?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14 10:50         ` Ingo Molnar
@ 2013-02-14 16:10           ` Linus Torvalds
  2013-02-15 15:57             ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-14 16:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra,
	Rik van Riel, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits

On Thu, Feb 14, 2013 at 2:50 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> At least on x86, how about saving *all* volatile registers in
> the slow out of line code path (to stack)?

Sure. The reason I suggested perhaps not saving %rax/%rdx is simply
that if it's a function that returns a value, %rax obviously cannot be
saved anyway. And then I was confused about the calling convention,
and said %rdx because it's the second argument register, but that's
only true on x86-32 (with our regparm calling convention). On x86-64,
it's %rdi/%rsi that have arguments.

Anyway, argument registers *can* be useful to save, but are often
computed, so it can go either way.

> It blows up the slow path somewhat, but it would allow us to
> keep the fast-path register impact even smaller - as the slow
> path would only have memory content side effects.
>
> Am I missing something?

The best option would be to just let the user say which registers it
wants saved for any particular function. We'd probably want some
default set so that people who call "unlikely()" functions can just
pick that one, but for architecture-specific helper functions for
locking etc, we would probably want to optimize it for usage (ie "is
the argument likely to be useful to the caller after the call").

Sometimes saving everything is the right thing to do, sometimes it
might just be extra work.

And even *if* the callee saves everything, the caller still ends up
being constrained by the calling convention (ie fixed registers for
arguments), so it could mess up code generation in the caller for that
reason. But at least it would be much less painful.

Btw, it may end up that almost nobody cares. Modern CPU's are really
good at handling the straightforward "save/restore to stack"
instructions. One of the reasons I care is not performance per se,
but the fact that I still look at asm code every time I do any
performance profiling, and it's absolutely horrible to see the code
when you see "ugh, that's pointless". I'm sensitive to the spinlocks
in particular, because we've gone back and forth on inlining them
before, so I've seen this. But right now I don't think we inline the
lock under normal configs *anyway*, so I guess it doesn't much matter.

                            Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-13 18:30       ` Linus Torvalds
  2013-02-14  0:54         ` H. Peter Anvin
  2013-02-14 10:50         ` Ingo Molnar
@ 2013-02-15  6:48         ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 54+ messages in thread
From: Benjamin Herrenschmidt @ 2013-02-15  6:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, Rik van Riel, rostedt, aquini, Andrew Morton,
	Thomas Gleixner, Michel Lespinasse, linux-tip-commits

On Wed, 2013-02-13 at 10:30 -0800, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Adding an external function call is *horrible*, and you might almost
> > as well just uninline the spinlock entirely if you do this. It means
> > that all the small callers now have their registers trashed, whether
> > the unlikely function call is taken or not, and now leaf functions
> > aren't leaves any more.
> 
> Btw, we've had things like this before, and I wonder if we could
> perhaps introduce the notion of a "light-weight call" for fastpath
> code that calls unlikely slow-path code..
> 
> In particular, see the out-of-line code used by the rwlocks etc (see
> "arch_read_lock()" for an example in arch/x86/include/asm/spinlock.h
> and arch/x86/lib/rwlock.S), where we end up calling things from inline
> asm, with one big reason being exactly the fact that a "normal" C call
> has such horribly detrimental effects on the caller.

This would be nice. I've been wanting to do something like that for a
while in fact... On archs like powerpc, we lose 11 GPRs on a function
call, that ends up being a lot of stupid stack spills for cases that are
often corner cases (error cases etc... in inlines).

> Sadly, gcc doesn't seem to allow specifying which registers are
> clobbered any easy way, which means that both the caller and the
> callee *both* tend to need to have some asm interface. So we bothered
> to do this for __read_lock_failed, but we have *not* bothered to do
> the same for the otherwise very similar __mutex_fastpath_lock() case,
> for example.
>
> So for rwlocks, we actually get very nice code generation with small
> leaf functions not necessarily needing stack frames, but for mutexes
> we mark a lot of registers "unnecessarily" clobbered in the caller,
> exactly because we do *not* do that asm interface for the callee. So
> we have to clobber all the standard callee-clobbered registers, which
> is really sad, and callers almost always need a stack frame, because
> if they have any data live at all across the mutex, they have to save
> it in some register that is callee-saved - which basically means that
> the function has to have that stack frame in order to save its *own*
> callee-saved registers.
> 
> So it means that we penalize the fastpath because the slow-path can't
> be bothered to do the extra register saving, unless we go to the
> lengths we went to for the rwlocks, and build a wrapper in asm to save
> the extra registers in the cold path.
> 
> Maybe we could introduce some helpers to create these kinds of asm
> wrappers to do this? Something that would allow us to say: "this
> function only clobbers a minimal set of registers and you can call it
> from asm and only mark %rax/rcx/rdx clobbered" and that allows leaf
> functions to look like leaf functions for the fastpath?

We could so something like:

	define_fastcall(func [.. figure out how to deal with args ... ])

Which spits out both a trampoline for saving the nasty stuff and calling
the real func() and a call_func() inline asm for the call site.

At least on archs with register-passing conventions, especially if we
make it mandatory to stick to register args only and forbid stack spills
(ie, only a handful of args), it's fairly easy to do.

For stack based archs, it gets nastier as you have to dig out the args,
save stuff, and pile them again.

But since we also don't want to lose strong typing, we probably want to
express the args in that macro, maybe like we do for the syscall
defines. A bit ugly, but that would allow us to have a strongly typed
call_func() *and* allow the trampoline to know what to do about the args
for stack based calling conventions.

About to go & travel so I don't have time to actually write something,
at least not for a couple of weeks though...

Ben.

> Hmm? That would make my dislike of uninlining the slow case largely go
> away. I still think that back-off tends to be a mistake (and is often
> horrible for virtualization etc), but as long as the fastpath stays
> close to optimal, I don't care *too* much.
> 
>                      Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14 16:10           ` Linus Torvalds
@ 2013-02-15 15:57             ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2013-02-15 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra,
	Rik van Riel, rostedt, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Btw, it may end up that almost nobody cares. Modern CPU's are 
> really good at handling the straightforward "save/restore to 
> stack" instructions. One of the reasons I care is not 
> performance per se, butu the fact that I still look at asm 
> code every time I do any performance profiling, and it's 
> absolutely horrible to see the code when you see "ugh, that's 
> pointless". I'm sensitive to the spinlocks in particular, 
> because we've gone back and forth on inlining them before, so 
> I've seen this. But right now I don't think we inline the lock 
> under normal configs *anyway*, so I guess it doesn't much 
> matter.

Yes, right now we only inline non-debug spin_unlock() and 
spin_unlock_irq() [on !PREEMPT] - because that was an 
unconditional win.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-14  1:21                 ` Linus Torvalds
  2013-02-14  1:46                   ` Linus Torvalds
  2013-02-14 10:43                   ` Ingo Molnar
@ 2013-02-27 16:42                   ` Rik van Riel
  2013-02-27 17:10                     ` Linus Torvalds
  2 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-27 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt, Vinod,
	Chegu, Low, Jason, Michel Lespinasse

On 02/13/2013 08:21 PM, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> I have an example of the second case. It is a test case
>> from a customer issue, where an application is contending on
>> semaphores, doing semaphore lock and unlock operations. The
>> test case simply has N threads, trying to lock and unlock the
>> same semaphore.
>>
>> The attached graph (which I sent out with the intro email to
>> my patches) shows how reducing the memory accesses from the
>> spinlock wait path prevents the large performance degradation
>> seen with the vanilla kernel. This is on a 24 CPU system with
>> 4 6-core AMD CPUs.
>>
>> The "prop-N" series are with a fixed delay proportional back-off.
>> You can see that a small value of N does not help much for large
>> numbers of cpus, and a large value hurts with a small number of
>> CPUs. The automatic tuning appears to be quite robust.
>
> Ok, good, so there are some numbers. I didn't see any in the commit
> messages anywhere, and since the threads I've looked at are from
> tip-bot, I never saw the intro email.

Some people at HP have collected an extensive list of AIM 7 results,
all the different AIM 7 workloads, on an 80-core DL-980, with HT
disabled.

The AIM7 workloads all work by slowly increasing the number of
worker processes, all of which have some duty cycle (busy & sleep).
Adding more processes tends to increase the number of jobs/minute
completed, up to a certain point. For some workloads, the system
has a performance peak and performance levels off at or near that
peak, for other workloads performance drops when more processes
are added beyond the peak, and performance drops to a lower plateau.

To keep the results readable and relevant, I am reporting the
plateau performance numbers. Comments are given where required.

		3.7.6 vanilla	3.7.6 w/ backoff

all_utime		333000		333000
alltests	300000-470000	180000-440000	large variability
compute			528000		528000
custom		290000-320000	250000-330000	4 fast runs, 1 slow
dbase			920000		925000
disk			100000	 90000-120000	similar plateau, wild
						swings with patches
five_sec		140000		140000
fserver		160000-300000	250000-430000	w/ patch drops off at
						higher number of users
high_systime	 80000-110000	 30000-125000	w/ patch mostly 40k-70k,
						wild swings
long		no performance plateau, equal performance for both
new_dbase		960000		96000
new_fserver	150000-300000	210000-420000	vanilla drops off,
						w/ patches wild swings
shared		270000-440000	120000-440000	all runs ~equal to
						vanilla up to 1000
						users, one out of 5
						runs slows down past
						1100 users
short			120000		190000

In conclusion, the spinlock backoff patches seem to significantly
improve performance in workloads where there is simple contention
on just one or two spinlocks. However, in more complex workloads,
high variability is seen, including performance regression in some
test runs.

One hypothesis is that before the spinlock backoff patches, the
workloads get contention (and bottleneck) on multiple locks. With
the patches, the contention on some of the locks is relieved, and
more tasks bunch up on the remaining bottlenecks, leading to worse
performance.

> That said, it's interesting that this happens with the semaphore path.
> We've had other cases where the spinlock in the *sleeping* locks have
> caused problems, and I wonder if we should look at that path in
> particular.

If we want to get reliable improved performance without unpredictable
performance swings, we should probably change some of the kernel's
spinlocks, especially the ones embedded in sleeping locks, into
scalable locks like Michel's implementation of MCS locks.

We may be hitting the limit of what can be done with the current
ticket lock data structure. It simply may not scale as far as the
hardware on which Linux is being run.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-27 16:42                   ` Rik van Riel
@ 2013-02-27 17:10                     ` Linus Torvalds
  2013-02-27 19:53                       ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-27 17:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt, Vinod,
	Chegu, Low, Jason

On Wed, Feb 27, 2013 at 8:42 AM, Rik van Riel <riel@redhat.com> wrote:
>
> To keep the results readable and relevant, I am reporting the
> plateau performance numbers. Comments are given where required.
>
>                 3.7.6 vanilla   3.7.6 w/ backoff
>
> all_utime               333000          333000
> alltests        300000-470000   180000-440000   large variability
> compute                 528000          528000
> custom          290000-320000   250000-330000   4 fast runs, 1 slow
> dbase                   920000          925000
> disk                    100000   90000-120000   similar plateau, wild
>                                                 swings with patches
> five_sec                140000          140000
> fserver         160000-300000   250000-430000   w/ patch drops off at
>                                                 higher number of users
> high_systime     80000-110000    30000-125000   w/ patch mostly 40k-70k,
>                                                 wild swings
> long            no performance plateau, equal performance for both
> new_dbase               960000          96000
> new_fserver     150000-300000   210000-420000   vanilla drops off,
>                                                 w/ patches wild swings
> shared          270000-440000   120000-440000   all runs ~equal to
>                                                 vanilla up to 1000
>                                                 users, one out of 5
>                                                 runs slows down past
>                                                 1100 users
> short                   120000          190000

Ugh. That really is rather random. "short" and fserver seem to
improve a lot (including the "new" version), while the others look
like they are either unchanged or huge regressions.

Is there any way to get profiles for the improved versions vs the
regressed ones? It might well be that we have two different classes of
spinlocks. Maybe we could make the back-off version be *explicit* (ie
not part of the normal "spin_lock()", but you'd use a special
"spin_lock_backoff()" function for it) because it works well for some
cases but not for others?

Hmm? At the very least, it would give us an idea of *which* spinlock
it is that causes the most pain. I think your earlier indications was
that it's the mutex->wait_lock or something?

                   Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-27 17:10                     ` Linus Torvalds
@ 2013-02-27 19:53                       ` Rik van Riel
  2013-02-27 20:18                         ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt, Vinod,
	Chegu, Low, Jason

On 02/27/2013 12:10 PM, Linus Torvalds wrote:

> Ugh. That really is rather random. "short" and fserver seems to
> improve a lot (including the "new" version), the others look like they
> are either unchanged or huge regressions.
>
> Is there any way to get profiles for the improved versions vs the
> regressed ones? It might well be that we have two different classes of
> spinlocks. Maybe we could make the back-off version be *explicit* (ie
> not part of the normal "spin_lock()", but you'd use a special
> "spin_lock_backoff()" function for it) because it works well for some
> cases but not for others?

If we have two classes of spinlocks, I suspect we would be better
off making those high-demand spinlocks MCS or LCH locks, which have
the property that having N+1 CPUs contend on the lock will never
result in slower aggregate throughput than having N CPUs contend.

Both the current spinlock code and the spinlock code with backoff
has the annoying property that adding more waiters to the lock can
slow down aggregate throughput.

The backoff code seems to push that point a little further out.
It appears that that may result in "lesser bottlenecks" disappearing,
which in turn triggers amplification of the worst bottlenecks.
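
For illustration, a minimal userspace sketch of an MCS-style queue lock
(C11 atomics, made-up names - this is not Michel's kernel code, just the
textbook shape of the algorithm). Each waiter spins on its own node, so
extra waiters add no traffic on the shared lock word; the per-waiter
node, usually on the caller's stack, is also what makes the calling
convention different from a plain spin_lock():

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;	/* NULL when the lock is free */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *prev;

	atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&self->locked, true, memory_order_relaxed);

	/* Join the tail of the waiter queue. */
	prev = atomic_exchange_explicit(&lock->tail, self, memory_order_acq_rel);
	if (!prev)
		return;		/* queue was empty: we own the lock */

	/* Tell our predecessor who to hand the lock to... */
	atomic_store_explicit(&prev->next, self, memory_order_release);

	/* ...and spin on our *own* cache line, not on the lock word. */
	while (atomic_load_explicit(&self->locked, memory_order_acquire))
		;
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *next =
		atomic_load_explicit(&self->next, memory_order_acquire);

	if (!next) {
		struct mcs_node *expected = self;

		/* No visible successor: try to mark the lock free. */
		if (atomic_compare_exchange_strong_explicit(&lock->tail,
				&expected, NULL,
				memory_order_acq_rel, memory_order_acquire))
			return;

		/* A successor is queueing right now; wait for its pointer. */
		while (!(next = atomic_load_explicit(&self->next,
						     memory_order_acquire)))
			;
	}

	/* Hand the lock straight to the next waiter. */
	atomic_store_explicit(&next->locked, false, memory_order_release);
}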

> Hmm? At the very least, it would give us an idea of *which* spinlock
> it is that causes the most pain. I think your earlier indications was
> that it's the mutex->wait_lock or something?

I can certainly take profiles of various workloads, but there is
absolutely no guarantee that I will see the same bottlenecks that
eg. the people at HP have seen. The largest test system I currently
have access to has 40 cores, vs. the 80 cores in the (much more
interesting) HP results I pasted.

Would you also be interested in performance numbers (and profiles)
of a kernel that has bottleneck spinlocks replaced with MCS locks?

That could make for an interesting comparison...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-27 19:53                       ` Rik van Riel
@ 2013-02-27 20:18                         ` Linus Torvalds
  2013-02-27 21:55                           ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-27 20:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt, Vinod,
	Chegu, Low, Jason

On Wed, Feb 27, 2013 at 11:53 AM, Rik van Riel <riel@redhat.com> wrote:
>
> If we have two classes of spinlocks, I suspect we would be better
> off making those high-demand spinlocks MCS or LCH locks, which have
> the property that having N+1 CPUs contend on the lock will never
> result in slower aggregate throughput than having N CPUs contend.

I doubt that.

The fancy "no slowdown" locks almost never work in practice. They
scale well by performing really badly for the normal case, either
needing separate allocations or having memory ordering problems
requiring multiple locked cycles.

A spinlock basically needs to have a fast-case that is a single locked
instruction, and all the clever ones tend to fail that simple test.

> I can certainly take profiles of various workloads, but there is
> absolutely no guarantee that I will see the same bottlenecks that
> eg. the people at HP have seen. The largest test system I currently
> have access to has 40 cores, vs. the 80 cores in the (much more
> interesting) HP results I pasted.
>
> Would you also be interested in performance numbers (and profiles)
> of a kernel that has bottleneck spinlocks replaced with MCS locks?

MCS locks don't even work, last time I saw. They need that extra lock
holder allocation, which forces people to have different calling
conventions, and is just a pain. Or am I confusing them with something
else?

They might work for the special cases like the sleeping locks, which
have one or two places that take and release the lock, but not for the
generic spinlock.

Also, it might be worth trying current git - if it's a rwsem that is
implicated, the new lock stealing might be a win.

So before even trying anything fancy, just basic profiles would be
good to see which lock it is. Many of the really bad slowdowns are
actually about the timing details of the sleeping locks (do *not*
enable lock debugging etc for profiling, you want the mutex spinning
code to be active, for example).

                   Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-27 20:18                         ` Linus Torvalds
@ 2013-02-27 21:55                           ` Rik van Riel
       [not found]                             ` <CA+55aFwa0EjGG2NUDYVLVBmXJa2k81YiuNO2yggk=GLRQxhhUQ@mail.gmail.com>
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-27 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Peter Zijlstra, aquini, Andrew Morton, Thomas Gleixner,
	Michel Lespinasse, linux-tip-commits, Steven Rostedt, Vinod,
	Chegu, Low, Jason

On 02/27/2013 03:18 PM, Linus Torvalds wrote:
> On Wed, Feb 27, 2013 at 11:53 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> If we have two classes of spinlocks, I suspect we would be better
>> off making those high-demand spinlocks MCS or LCH locks, which have
>> the property that having N+1 CPUs contend on the lock will never
>> result in slower aggregate throughput than having N CPUs contend.
>
> I doubt that.
>
> The fancy "no slowdown" locks almost never work in practice. They
> scale well by performing really badly for the normal case, either
> needing separate allocations or having memory ordering problems
> requiring multiple locked cycles.

The relative costs of atomic operations, cache line acquisition, and
other things has shifted over time. On the very latest systems, the
cost of cache line acquisition appears to dominate over the cost of
atomic operations.

Michel's results from last month show that MCS has essentially the same
performance for the single thread (uncontended) situation, on some
systems a degradation for 2 and 3 thread performance, in a tight micro
test, and improved performance when the contention involves 3 or more
CPUs.

http://thread.gmane.org/gmane.linux.kernel/1427417

> A spinlock basically needs to have a fast-case that is a single locked
> instruction, and all the clever ones tend to fail that simple test.

With the cost of a cache line acquisition outweighing the cost of an
atomic operation, for how much longer will this remain true?

Michel's results suggest that on Sandybridge this no longer seems
to hold.  The cost of the atomic operation on unlock appears to have
more than paid for itself by avoiding extraneous cache line bouncing,
and the cost of cache line acquisition. Even with only two CPUs
contending on the lock...

>> I can certainly take profiles of various workloads, but there is
>> absolutely no guarantee that I will see the same bottlenecks that
>> eg. the people at HP have seen. The largest test system I currently
>> have access to has 40 cores, vs. the 80 cores in the (much more
>> interesting) HP results I pasted.
>>
>> Would you also be interested in performance numbers (and profiles)
>> of a kernel that has bottleneck spinlocks replaced with MCS locks?
>
> MCS locks don't even work, last time I saw. They need that extra lock
> holder allocation, which forces people to have different calling
> conventions, and is just a pain. Or am I confusing them with something
> else?

Nope, those are the MCS locks alright.

> They might work for the special cases like the sleeping locks, which
> have one or two places that take and release the lock, but not for the
> generic spinlock.

I am certainly not advocating that all spinlocks be replaced with
harder to use locks.  On the other hand, we have to realize that
Linux users do not have the luxury of upgrading their kernel to the
latest upstream whenever they run into a performance issue, so
it would be good to make Linux more robust against scalability
issues.

> So before even trying anything fancy, just basic profiles would be
> good to see which lock it is. Many of the really bad slowdowns are
> actually about the timing details of the sleeping locks (do *not*
> enable lock debugging etc for profiling, you want the mutex spinning
> code to be active, for example).

No argument there, but that does in no way negate the need for
some performance robustness.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
       [not found]                             ` <CA+55aFwa0EjGG2NUDYVLVBmXJa2k81YiuNO2yggk=GLRQxhhUQ@mail.gmail.com>
@ 2013-02-28  2:58                               ` Rik van Riel
  2013-02-28  3:19                                 ` Linus Torvalds
  2013-02-28  4:06                                 ` Davidlohr Bueso
  0 siblings, 2 replies; 54+ messages in thread
From: Rik van Riel @ 2013-02-28  2:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	Vinod, Chegu, Low, Jason, linux-tip-commits, Peter Zijlstra,
	H. Peter Anvin, Andrew Morton, aquini, Michel Lespinasse,
	Ingo Molnar

On 02/27/2013 05:13 PM, Linus Torvalds wrote:
>
> On Feb 27, 2013 1:56 PM, "Rik van Riel" <riel@redhat.com
> <mailto:riel@redhat.com>> wrote:
>>
>> No argument there, but that does in no way negate the need for some
>> performance robustness.
>
> The very numbers you posted showed that the backoff was *not* more
> robust. Quite the reverse, there was arguably more variability.

On the other hand, both MCS and the fast queue locks
implemented by Michel showed low variability and high
performance.

http://thread.gmane.org/gmane.linux.kernel/1427417

> So I really don't like how you make these sweeping statements
> *again*. Numbers talk, bullshit walks.

If you read all the text in my last mail, you will see the
link to Michel's performance results. The numbers speak for
themselves.

> The fact is, life is complicated. The simple spinlocks tend to work
> really well. People have tried fancy things before, and it turns out
> it's not as simple as they think.

The numbers for both the simple spinlocks and the
spinlock backoff kind of suck. Both of these have
high variability, and both eventually fall down
under heavy load.

The numbers for Michel's MCS and fast queue lock
implementations appear to be both fast and stable.

I agree that we need numbers.

I do not agree that other locks should be dismissed
out of hand without looking at the numbers.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28  2:58                               ` Rik van Riel
@ 2013-02-28  3:19                                 ` Linus Torvalds
  2013-02-28  4:06                                 ` Davidlohr Bueso
  1 sibling, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-02-28  3:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	Vinod, Chegu, Low, Jason, linux-tip-commits, Peter Zijlstra,
	H. Peter Anvin, Andrew Morton, aquini, Michel Lespinasse,
	Ingo Molnar

On Wed, Feb 27, 2013 at 6:58 PM, Rik van Riel <riel@redhat.com> wrote:
>
> On the other hand, both MCS and the fast queue locks
> implemented by Michel showed low variability and high
> performance.

On microbenchmarks, and when implemented for only one single subsystem, yes.

> The numbers for Michel's MCS and fast queue lock
> implementations appear to be both fast and stable.

I do think that doing specialized spinlocks for special areas may be a
reasonable approach, and it's quite possible that the SysV IPC thing is
one such area.

But no, I don't think the numbers I've seen for Michel's MCS are AT
ALL comparable to the generic spinlocks, and the interface makes them
incompatible as a replacement to even test in general.

Don't get me wrong: I think the targeted approach is *better*. I just
also happen to think that you spout big words about things that aren't
all that big, and try to make this a bigger deal than it is. The
benchmark numbers you point to are micro-benchmarks, and not all
comparable.

              Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28  2:58                               ` Rik van Riel
  2013-02-28  3:19                                 ` Linus Torvalds
@ 2013-02-28  4:06                                 ` Davidlohr Bueso
  2013-02-28  4:49                                   ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Davidlohr Bueso @ 2013-02-28  4:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 1760 bytes --]

On Wed, 2013-02-27 at 21:58 -0500, Rik van Riel wrote:
> On 02/27/2013 05:13 PM, Linus Torvalds wrote:
> >
> > On Feb 27, 2013 1:56 PM, "Rik van Riel" <riel@redhat.com
> > <mailto:riel@redhat.com>> wrote:
> >>
> >> No argument there, but that does in no way negate the need for some
> >> performance robustness.
> >
> > The very numbers you posted showed that the backoff was *not* more
> > robust. Quite the reverse, there was arguably more variability.
> 
> On the other hand, both MCS and the fast queue locks
> implemented by Michel showed low variability and high
> performance.
> 
> http://thread.gmane.org/gmane.linux.kernel/1427417
> 
> > So I really don't like how you make these sweeping statements
> > *again*. Numbers talk, bullshit walks.
> 
> If you read all the text in my last mail, you will see the
> link to Michel's performance results. The numbers speak for
> themselves.
> 
> > The fact is, life is complicated. The simple spinlocks tend to work
> > really well. People have tried fancy things before, and it turns out
> > it's not as simple as they think.
> 
> The numbers for both the simple spinlocks and the
> spinlock backoff kind of suck. Both of these have
> high variability, and both eventually fall down
> under heavy load.
> 
> The numbers for Michel's MCS and fast queue lock
> implementations appear to be both fast and stable.
> 
> I agree that we need numbers.

FWIW I've been doing some benchmarking for Swingbench DSS workloads
(Oracle data mining) comparing Rik and Michel's patches. With lower
amounts of contention, Rik's ticket spinlock is better, but once
contention gets high enough the queued locks perform better.

The attached file shows the amount of sys time used by the ipc lock
on a 4 and an 8 socket box.

[-- Attachment #2: dss-ipclock.png --]
[-- Type: image/png, Size: 27346 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28  4:06                                 ` Davidlohr Bueso
@ 2013-02-28  4:49                                   ` Linus Torvalds
  2013-02-28 15:13                                     ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-28  4:49 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Rik van Riel, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar

On Wed, Feb 27, 2013 at 8:06 PM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
>
> The attached file shows how the amount of sys time used by the ipc lock
> for a 4 and 8 socket box.

I have to say, even with the improvements, that looks pretty
disgusting. It really makes me wonder if that thing couldn't be done
better some way. Is it the SysV semaphores that this all ends up
using, or what?

That said, I think the IPC layer is just about the perfect candidate
for things like this, because I'm afraid that nobody is ever going to
fix it.

          Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28  4:49                                   ` Linus Torvalds
@ 2013-02-28 15:13                                     ` Rik van Riel
  2013-02-28 18:22                                       ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-28 15:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar

On 02/27/2013 11:49 PM, Linus Torvalds wrote:
> On Wed, Feb 27, 2013 at 8:06 PM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
>>
>> The attached file shows how the amount of sys time used by the ipc lock
>> for a 4 and 8 socket box.
>
> I have to say, even with the improvements, that looks pretty
> disgusting. It really makes me wonder if that thing couldn't be done
> better some way. Is it the SysV semaphores that this all ends up
> using, or what?
>
> That said, I think the IPC layer is just about the perfect candidate
> for things like this, because I'm afraid that nobody is ever going to
> fix it.

There's more to it than that. Userspace expects the IPC layer
to provide exclusion and/or serialization. That makes it a
rather poor candidate for parallelism.

Btw, the IPC lock is already fairly fine grained. One ipc
lock is allocated for each set of semaphores allocated through
sys_semget. Looking up those semaphores in the namespace, when
they are used later, is done under RCU.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 15:13                                     ` Rik van Riel
@ 2013-02-28 18:22                                       ` Linus Torvalds
  2013-02-28 20:26                                         ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-28 18:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar

On Thu, Feb 28, 2013 at 7:13 AM, Rik van Riel <riel@redhat.com> wrote:
>
> Btw, the IPC lock is already fairly fine grained. One ipc
> lock is allocated for each set of semaphores allocated through
> sys_semget. Looking up those semaphores in the namespace, when
> they are used later, is done under RCU.

Bullshit. There is no fine-grained locking, and the code has never
ever been optimized for concurrency at all.

Yes, the lookup is done with RCU, and that's good. And I can tell you
*why* it's done with RCU:  all the trivial sysv semaphore benchmarks
I've seen tend to have just one lock per semget and then they test the
performance of that semaphore.

But I'm suspecting that since there is clearly contention going on
(and hopefully the real world people aren't just all pounding on one
single mutex), maybe these loads actually allocate multiple semaphores
with semget (that 'nsems' argument). And then we scale really badly,
because we use a single spinlock for all of them, even if we didn't
have to.

THERE IS ABSOLUTELY ZERO SCALING. None. Nada. The scaling that exists
(your RCU lookup example) is for *unrelated* locks, which is exactly
what you'd expect if all the simple benchmarks around did a single
sem-lock and then the "scaling" was tested by running a thousand
copies of that program - all using their own private single ipc
semaphore.

Plus the code in ipc/sem.c really is not very good. It does that
"sem_lock()" (which gets the ipc spinlock) way too eagerly, because
the semaphore lock also protects some trivial ipc stuff, so it gets
the lock *first*, then it does all the random checks it wants to do
(security checks, you name it), and then it does all the queue undo
crap etc. All while holding that lock.

So no. There's no scaling, and the SINGLE lock that is held is held
way too f*cking long.

And that's what I'm talking about. We've never had people actually
bother about the IPC locking, because we've never had people who
cared. Everybody hates the crazy SysV IPC code, often for good reason.
People are told not to use it (use shared memory and futexes instead,
which actually *do* scale, and people *do* care about). But of course,
lots of projects do use the horrid SysV IPC code because it's
portable, and it does have those undo features etc.

And I'm sure we *could* fix it, but judging by past performance nobody
really cares enough. Which is why I do think that fixing it the wrong
way (by making the contention on the ipc_lock() less painful) may well
be the right thing here for once.

Of course, if the real-world applications really only use a single
SysV semaphore, then we can't scale any better. But even then we could
still try to lower the hold-times of that ipc_lock().

And hold-time really tends to be one big killer for contention. A lot
of hardware tends to have ping-pong avoidance logic, which means that
if you unlock quickly after getting the lock, the cache line will still
remain in the local node caches. Hardware features like that make it
much harder to introduce the really bad contention cases even if there
are tons of CPU's trying to access the same lock. But if the lock hold
times are long, it's *much* easier to get bad contention.

Looking at the code, if we could just get the security hook outside of
the locked region, that would probably be a big improvement. Maybe do
that part in the RCU lookup part? Because selinux really is very
expensive, with the whole ipc_has_perm -> avc_has_perm_flags thing etc
etc. This is exactly the kind of thing we did with pathnames.

I'm sure there are other things we could do to improve ipc lock times
even if we don't actually split the lock, but the security one might
be a good first step.

                      Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 18:22                                       ` Linus Torvalds
@ 2013-02-28 20:26                                         ` Linus Torvalds
  2013-02-28 21:14                                           ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-28 20:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar

On Thu, Feb 28, 2013 at 10:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm sure there are other things we could do to improve ipc lock times
> even if we don't actually split the lock, but the security one might
> be a good first step.

Btw, if somebody has a benchmark for threads using multiple ipc
semaphores (from the same semget() allocation) concurrently, and we
could have a simple way to see the contention without having to run
some big DB thing, that would also be nice. Maybe there is something
out there already? Google didn't find any, and the normal benchmarks
I'm aware of all just do one single (private) ipc semaphore per
process.

Nothing gets some people going like just having a nice benchmark to
show the effect.

Because maybe I'm wrong, and there really are people who would love to
muck around with the SysV IPC code, and they just didn't have the
right reason or benchmark for it...

                Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 20:26                                         ` Linus Torvalds
@ 2013-02-28 21:14                                           ` Rik van Riel
  2013-02-28 21:58                                             ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-28 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

On 02/28/2013 03:26 PM, Linus Torvalds wrote:
> On Thu, Feb 28, 2013 at 10:22 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I'm sure there are other things we could do to improve ipc lock times
>> even if we don't actually split the lock, but the security one might
>> be a good first step.
>
> Btw, if somebody has a benchmark for threads using multiple ipc
> semaphores (from the same semget() allocation) concurrently, and we
> could have a simple way to see the contention without having to run
> some big DB thing, that would also be nice. Maybe there is something
> out there already? Google didn't find any, and the normal benchmarks
> I'm aware of all just do one single (private) ipc semaphore per
> process.
>
> Nothing gets some people going like just having a nice benchmark to
> show the effect.

I have modified one of the semop tests to use multiple semaphores.

To run the test, specify the number of threads.  If you want the
number of semaphores to be different from the number of threads,
specify a second commandline argument.

$ ./semop-multi
usage: ./semop-multi <threads> [nsems]


[-- Attachment #2: semop-multi.c --]
[-- Type: text/x-csrc, Size: 4046 bytes --]

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <malloc.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/sem.h>

#define TEST_TIME	30
#define SEMMNI		128

int		semid;
int		state = 1;
unsigned long	*results_array;
int		threads_starting;
pthread_cond_t	thread_parent;
pthread_cond_t	thread_worker;
pthread_mutex_t	thread_lock;
int		nsems;

union semun {
	int val;
	struct semid_ds *buf;
	unsigned short int *array;
	struct seminfo *__buf;
	void *__pad;
};

void *
worker(void *arg)
{
	unsigned long	count = 0;
	int		id = (int)(unsigned long)arg;
	struct sembuf	sembuff;

	sembuff.sem_num = 0;
	sembuff.sem_flg = 0;

	pthread_mutex_lock(&thread_lock);
	threads_starting--;
	if (!threads_starting)
		pthread_cond_signal(&thread_parent);
	pthread_cond_wait(&thread_worker, &thread_lock);
	pthread_mutex_unlock(&thread_lock);

	for (;state;) {

		/* Move "id" ahead through the semaphores */
		sembuff.sem_num = (sembuff.sem_num + id) % nsems;

		/* Increment the semaphore */
		sembuff.sem_op = 1;
		if (semop(semid, &sembuff, 1) < 0) {
			perror("semop");
			exit(1);
		}

		/* Decrement the semaphore (values start at 200, so this never blocks) */
		sembuff.sem_op = -1;
		if (semop(semid, &sembuff, 1) < 0) {
			perror("semop");
			exit(1);
		}

		count += 2;
	}

	results_array[id] = count;

	return NULL;
}

int
main(int argc, char **argv)
{
	pthread_t	*thread_array;
	pthread_attr_t	thread_attr;
	int		thread_count;
	unsigned short	seminit[SEMMNI];
	union semun	sem_un;
	cpu_set_t	cpu;
	unsigned long	total = 0;
	int		i, ret;
	long		cpus;

	cpus = sysconf(_SC_NPROCESSORS_ONLN);

	if (argc < 2) {
		printf("usage: %s <threads> [nsems]\n", argv[0]);
		exit(1);
	}

	thread_count = atoi(argv[1]);

	if (thread_count < 0) {
		printf("threads must be >= 0\n");
		exit(1);
	}

	if (thread_count == 0)
		thread_count = cpus;

	if (argc > 2)
		nsems = atoi(argv[2]);
	else
		nsems = thread_count;
	if (nsems > SEMMNI)
		nsems = SEMMNI;

	printf("cpus %ld, threads: %d, semaphores: %d, test duration: %d secs\n", cpus, thread_count, nsems, TEST_TIME);

	thread_array = malloc(thread_count * sizeof(pthread_t));

	if (!thread_array) {
		perror("malloc(thread_array)");
		exit(1);
	}

	results_array = malloc(thread_count * sizeof(unsigned long));

	if (!results_array) {
		perror("malloc(results_array)");
		exit(1);
	}

	semid = semget(0x12345, nsems, 0777|IPC_CREAT );

	if (semid < 0) {
		perror("semget");
		exit(1);
	}

	for (i = 0; i < SEMMNI; i++)
		seminit[i] = 200;
	sem_un.array = seminit;
  
	if (semctl(semid, 1, SETALL, sem_un) < 0) {
		perror("semctl(setall)");
		exit(1);
	}

	pthread_mutex_init(&thread_lock, NULL);
	pthread_cond_init(&thread_parent, NULL);
	pthread_cond_init(&thread_worker, NULL);
	pthread_attr_init(&thread_attr);

	threads_starting = thread_count;

	for (i = 0; i < thread_count; i++) {

		CPU_ZERO(&cpu);
		CPU_SET(i % cpus, &cpu);

		ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpu);

		if (ret) {
			printf("pthread_attr_setaffinity_np: %s\n", strerror(ret));
			exit(1);
		}

		ret = pthread_create(&thread_array[i], &thread_attr, worker, (void *)(unsigned long)i);

		if (ret) {
			printf("pthread_create: %s\n", strerror(ret));
			exit(1);
		}
	}

	pthread_attr_destroy(&thread_attr);

	pthread_mutex_lock(&thread_lock);
	while (threads_starting)
		pthread_cond_wait(&thread_parent, &thread_lock);
	pthread_cond_broadcast(&thread_worker);
	pthread_mutex_unlock(&thread_lock);

	sleep(TEST_TIME);
	state = 0;

	for (i = 0; i < thread_count; i++)
		pthread_join(thread_array[i], NULL);

	pthread_cond_destroy(&thread_parent);
	pthread_cond_destroy(&thread_worker);
	pthread_mutex_destroy(&thread_lock);

	if (semctl(semid, 1, IPC_RMID) < 0)
		perror("semctl(rmid)");

	for (i = 0; i < thread_count; i++)
		total += results_array[i];

	printf("total operations: %ld, ops/sec %ld\n", total, total / TEST_TIME);

	free(thread_array);
	free(results_array);

	return 0;
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 21:14                                           ` Rik van Riel
@ 2013-02-28 21:58                                             ` Linus Torvalds
  2013-02-28 22:38                                               ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-28 21:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On Thu, Feb 28, 2013 at 1:14 PM, Rik van Riel <riel@redhat.com> wrote:
>
> I have modified one of the semop tests to use multiple semaphores.

Ooh yeah. This shows contention quite nicely. And it's all from
ipc_lock, and looking at the top-10 loffenders of the profile:

 43.01%  semop-multi  [kernel.kallsyms]   [k] _raw_spin_lock
...
  4.73%  semop-multi  [kernel.kallsyms]   [k] avc_has_perm_flags
  4.52%  semop-multi  [kernel.kallsyms]   [k] ipc_has_perm.isra.21
...
  2.43%  semop-multi  [kernel.kallsyms]   [k] ipcperms

The 43% isn't actually all that interesting, it just shows that there
is contention and we're waiting for other users. Yes, we waste almost
half the CPU time on locking, but ignore that for a moment.

The "more than 10% of the total time is spent in ipc permission code"
*is* the interesting part. Because that 10%+ is actually more like 20%
if you ignore the "wait for lock" part. And it's all done *inside* the
lock.

In other words, I can pretty much guarantee that the contention will
go down a lot if we just move the security check outside the spinlock.
According to the above numbers, we're currently spending basically
1/5th of our remaining CPU resources serialized for absolutely no good
reason. THAT is the kind of thing we shouldn't do.

The rest of the big offenders seem to be mostly done outside the
spinlock, although it's hard to tell how much of the 10% of
sys_semtimedop() is also under the lock. There's probably other
things there than just the permission checking.

I'm not seeing any real reason the permission checking couldn't be
done just under the RCU lock, before we get the spinlock. Except for
the fact that the "helper" routines in ipc/util.c are written the way
they are, so it's a layering violation. But I really think that would
be a *reasonably* low-hanging fruit thing to do.

Changing the locking itself to be more fine-grained, and doing it
across many different ipc semaphores would be a major pain. So I do
suspect that the work Michel Lespinasse did is probably worth doing
anyway in addition to at least trying to fix the horrible lack of
scalability of the code a bit.

                      Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 21:58                                             ` Linus Torvalds
@ 2013-02-28 22:38                                               ` Rik van Riel
  2013-02-28 23:09                                                 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-02-28 22:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On 02/28/2013 04:58 PM, Linus Torvalds wrote:

> I'm not seeing any real reason the permission checking couldn't be
> done just under the RCU lock, before we get the spinlock. Except for
> the fact that the "helper" routines in ipc/util.c are written the way
> they are, so it's a layering violation. But I really think that would
> be a *reasonably* low-hanging fruit thing to do.

I could see doing the permission checks under a seq lock.

If the permissions, or any other aspect of the semaphore
array changed while we were doing our permission check,
we can simply jump back to the top of the function and
try again.

Protection against IPC_RMID (removal of the semaphore
block) can probably be done with RCU.

The ugliness of using two kinds of protection may be
necessary, since permissions can be modified in-place,
and RCU does not seem to do in-place modification...

I have been staring at the code for a bit this afternoon,
and have yet to come up with a nicer idea :(

I am always open to suggestions, though :)
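
To make the retry idea concrete, here is a self-contained toy of the
scheme in C11 atomics (invented names; the kernel version would of
course use the existing seqcount/RCU primitives): readers snapshot the
permission fields without taking any lock, and simply redo the check
if a writer slipped in between:

#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>

struct toy_ipc_perm {
	atomic_uint	seq;		/* even: stable, odd: update in flight */
	pthread_mutex_t	writer_lock;	/* serializes writers only */
	_Atomic uid_t	uid;
	_Atomic mode_t	mode;
};

static int toy_perm_ok(struct toy_ipc_perm *p, uid_t uid)
{
	unsigned int seq;
	uid_t owner;
	mode_t mode;

	do {
		/* Wait out any in-flight writer, then snapshot the count. */
		while ((seq = atomic_load_explicit(&p->seq,
						   memory_order_acquire)) & 1)
			;

		owner = atomic_load_explicit(&p->uid, memory_order_relaxed);
		mode  = atomic_load_explicit(&p->mode, memory_order_relaxed);

		atomic_thread_fence(memory_order_acquire);

		/* If the count moved, a writer raced with us: try again. */
	} while (atomic_load_explicit(&p->seq, memory_order_relaxed) != seq);

	return owner == uid ? (mode & 0600) != 0 : (mode & 0006) != 0;
}

static void toy_perm_chmod(struct toy_ipc_perm *p, uid_t uid, mode_t mode)
{
	pthread_mutex_lock(&p->writer_lock);

	atomic_fetch_add_explicit(&p->seq, 1, memory_order_relaxed); /* odd */
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&p->uid, uid, memory_order_relaxed);
	atomic_store_explicit(&p->mode, mode, memory_order_relaxed);

	atomic_fetch_add_explicit(&p->seq, 1, memory_order_release); /* even */

	pthread_mutex_unlock(&p->writer_lock);
}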

> Changing the locking itself to be more fine-grained, and doing it
> across many different ipc semaphores would be a major pain. So I do
> suspect that the work Michel Lespinasse did is probably worth doing
> anyway in addition to at least trying to fix the horrible lack of
> scalability of the code a bit.

That would certainly be the easy fix.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 22:38                                               ` Rik van Riel
@ 2013-02-28 23:09                                                 ` Linus Torvalds
  2013-03-01  6:42                                                   ` Rik van Riel
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2013-02-28 23:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On Thu, Feb 28, 2013 at 2:38 PM, Rik van Riel <riel@redhat.com> wrote:
>
> I could see doing the permission checks under a seq lock.
>
> If the permissions, or any other aspect of the semaphore
> array changed while we were doing our permission check,
> we can simply jump back to the top of the function and
> try again.

We do the normal file permissions under just the RCU read lock. If
it's good enough for files, then ipc semaphores certainly don't need
anything more.

> The ugliness of using two kinds of protection may be
> necessary, since permissions can be modified in-place,
> and RCU does not seem to do in-place modification...

The selinux AVC code does its stuff by rcu, and for files we don't
even care if there is some uid/gid change going on - we'll  get one or
the other, and live with it. There is no atomic way to change the uid
and gid anyway, so there are no new races. You get old or you get new,
or a mix of the two, any of it is "correct".

So I almost think that *everything* there in the semaphore code could
be done under RCU. The actual spinlock doesn't seem to much matter, at
least for semaphores. The semaphore values themselves seem to be
protected by the atomic operations, but I might be wrong about that, I
didn't even check. I just looked at the ipcperm() part, which
certainly doesn't seem to merit spinlocks.

            Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-02-28 23:09                                                 ` Linus Torvalds
@ 2013-03-01  6:42                                                   ` Rik van Riel
  2013-03-01 18:18                                                     ` Davidlohr Bueso
  0 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2013-03-01  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On 02/28/2013 06:09 PM, Linus Torvalds wrote:

> So I almost think that *everything* there in the semaphore code could
> be done under RCU. The actual spinlock doesn't seem to much matter, at
> least for semaphores. The semaphore values themselves seem to be
> protected by the atomic operations, but I might be wrong about that, I
> didn't even check.

Checking try_atomic_semop and do_smart_update, it looks like neither
is using atomic operations. That part of the semaphore code would
still benefit from spinlocks.

The way the code handles a whole batch of semops all at once,
potentially to multiple semaphores at once, and with the ability
to undo all of the operations, it looks like the spinlock will
still need to be per block of semaphores.

I guess the code may still benefit from Michel's locking code,
after the permission stuff has been moved from under the spinlock.

Two remaining worries are the security_sem_free call and the
non-RCU list_del calls from freeary, called from the IPC_RMID
code. They are probably fine, but worth another pair of eyes...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-03-01  6:42                                                   ` Rik van Riel
@ 2013-03-01 18:18                                                     ` Davidlohr Bueso
  2013-03-01 18:50                                                       ` Rik van Riel
  2013-03-01 18:52                                                       ` Linus Torvalds
  0 siblings, 2 replies; 54+ messages in thread
From: Davidlohr Bueso @ 2013-03-01 18:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:
> On 02/28/2013 06:09 PM, Linus Torvalds wrote:
> 
> > So I almost think that *everything* there in the semaphore code could
> > be done under RCU. The actual spinlock doesn't seem to much matter, at
> > least for semaphores. The semaphore values themselves seem to be
> > protected by the atomic operations, but I might be wrong about that, I
> > didn't even check.
> 
> Checking try_atomic_semop and do_smart_update, it looks like neither
> is using atomic operations. That part of the semaphore code would
> still benefit from spinlocks.

Agreed.

> 
> Given the way the code handles a whole batch of semops all at
> once, potentially on multiple semaphores, with the ability to
> undo all of the operations, it looks like the spinlock will
> still need to be per block of semaphores.
> 
> I guess the code may still benefit from Michel's locking code,
> after the permission stuff has been moved from under the spinlock.

How about splitting ipc_lock()/ipc_lock_control() into two calls: one to
obtain the ipc object (rcu_read_lock + idr_find), which can be called
when performing the permissions and security checks, and another to
obtain the ipcp->lock [q_]spinlock when necessary.
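
Something like this, as a rough userspace sketch? (All names below are
hypothetical, with pthread primitives standing in for rcu_read_lock()
plus idr_find() and for the ipc spinlock; it only shows the shape, not
a proposed patch.)

        #include <pthread.h>
        #include <stddef.h>

        struct ipc_obj {
                pthread_spinlock_t lock;        /* per-object lock */
                unsigned int uid, gid, mode;    /* checked without it */
        };

        static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
        static struct ipc_obj *table[1024];

        /* Phase 1: find and pin the object; no per-object lock yet.
         * The caller drops table_lock when it is done with the object. */
        struct ipc_obj *ipc_obtain_object(int id)
        {
                pthread_rwlock_rdlock(&table_lock);     /* ~ rcu_read_lock() */
                return (id >= 0 && id < 1024) ? table[id] : NULL;
        }

        /* Phase 2: only for callers that actually modify the object. */
        void ipc_object_lock(struct ipc_obj *obj)
        {
                pthread_spin_lock(&obj->lock);
        }

The permission and security checks would then run between the two
phases, and only the paths that really update the object would pay
for the spinlock.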



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-03-01 18:18                                                     ` Davidlohr Bueso
@ 2013-03-01 18:50                                                       ` Rik van Riel
  2013-03-01 18:52                                                       ` Linus Torvalds
  1 sibling, 0 replies; 54+ messages in thread
From: Rik van Riel @ 2013-03-01 18:50 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On 03/01/2013 01:18 PM, Davidlohr Bueso wrote:
> On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:
>> On 02/28/2013 06:09 PM, Linus Torvalds wrote:
>>
>>> So I almost think that *everything* there in the semaphore code could
>>> be done under RCU. The actual spinlock doesn't seem to much matter, at
>>> least for semaphores. The semaphore values themselves seem to be
>>> protected by the atomic operations, but I might be wrong about that, I
>>> didn't even check.
>>
>> Checking try_atomic_semop and do_smart_update, it looks like neither
>> is using atomic operations. That part of the semaphore code would
>> still benefit from spinlocks.
>
> Agreed.

If we assume that calls to semop() with more than one semaphore
operation are rare, we could do something smarter here.

We could turn the outer spinlock into an rwlock. If we are
doing a call that modifies the outer structure, or multiple
semops at once, we take the lock exclusively.

If we want to do just one semop, we can take the lock in
shared mode. Then each semaphore inside would have its own
spinlock, and we lock just that one.
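
Roughly this shape, as a userspace sketch (hypothetical names, pthread
locks standing in for the kernel ones; it only illustrates the split,
and it ignores the "operation would block" handling entirely):

        #include <pthread.h>

        struct one_sem {
                pthread_spinlock_t lock;
                int val;
        };

        struct sem_array {
                pthread_rwlock_t outer; /* shared for 1 op, exclusive else */
                struct one_sem sem[64];
        };

        /* Fast path: a single semaphore operation. */
        void single_semop(struct sem_array *sma, int num, int delta)
        {
                pthread_rwlock_rdlock(&sma->outer);     /* shared */
                pthread_spin_lock(&sma->sem[num].lock); /* just this one */
                sma->sem[num].val += delta;
                pthread_spin_unlock(&sma->sem[num].lock);
                pthread_rwlock_unlock(&sma->outer);
        }

        /* Slow path: several semaphores at once, or structural changes. */
        void multi_semop(struct sem_array *sma)
        {
                pthread_rwlock_wrlock(&sma->outer);     /* exclusive */
                /* ... operate on multiple semaphores, with undo ... */
                pthread_rwlock_unlock(&sma->outer);
        }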

Of course, that would just add overhead to the case where
a semaphore block has just one semaphore in it, so I'm not
sure this would be worthwhile at all...

>> Given the way the code handles a whole batch of semops all at
>> once, potentially on multiple semaphores, with the ability to
>> undo all of the operations, it looks like the spinlock will
>> still need to be per block of semaphores.
>>
>> I guess the code may still benefit from Michel's locking code,
>> after the permission stuff has been moved from under the spinlock.
>
> How about splitting ipc_lock()/ipc_lock_control() into two calls: one to
> obtain the ipc object (rcu_read_lock + idr_find), which can be called
> when performing the permissions and security checks, and another to
> obtain the ipcp->lock [q_]spinlock when necessary.

That is what I am working on now.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line
  2013-03-01 18:18                                                     ` Davidlohr Bueso
  2013-03-01 18:50                                                       ` Rik van Riel
@ 2013-03-01 18:52                                                       ` Linus Torvalds
  1 sibling, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2013-03-01 18:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Rik van Riel, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, Vinod, Chegu, Low, Jason, linux-tip-commits,
	Peter Zijlstra, H. Peter Anvin, Andrew Morton, aquini,
	Michel Lespinasse, Ingo Molnar, Larry Woodman

On Fri, Mar 1, 2013 at 10:18 AM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
> On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:
>>
>> Checking try_atomic_semop and do_smart_update, it looks like neither
>> is using atomic operations. That part of the semaphore code would
>> still benefit from spinlocks.
>
> Agreed.

Yup. As mentioned, I hadn't even looked at that part of the code, but
yes, it definitely wants the spinlock.

> How about splitting ipc_lock()/ipc_lock_control() into two calls: one to
> obtain the ipc object (rcu_read_lock + idr_find), which can be called
> when performing the permissions and security checks, and another to
> obtain the ipcp->lock [q_]spinlock when necessary.

That sounds like absolutely the right thing to do. And we can leave
the old helper functions that do both of them around, and only use
the split versions in the few places that need them.
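
To keep the combined behaviour around, reusing the hypothetical names
from the split sketch earlier in the thread (so this is illustrative
only, not self-contained), the old helper becomes a trivial wrapper:

        /* old-style "look up and lock" on top of the split helpers */
        struct ipc_obj *ipc_obtain_object_lock(int id)
        {
                struct ipc_obj *obj = ipc_obtain_object(id);

                if (obj)
                        ipc_object_lock(obj);
                return obj;
        }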

And if we make the RCU read-lock be explicit too, we could get rid of
some of the ugliness. Right now we have semtimedop() do things like a
conditional "find_alloc_undo()", which will get the RCU lock. It would
be much nicer if we just cleaned up the interfaces a tiny bit, said
that the *caller* has to get the RCU lock, and just do this
unconditionally before calling any of it. Because right now the RCU
details are quite messy, and we have code like

                if (un)
                        rcu_read_unlock();
                error = PTR_ERR(sma);
                goto out_free;

etc, when it would actually be much simpler to just do the RCU read
locking unconditionally (we'll need it for the semaphore lookup
anyway) and just have the exit paths unlock unconditionally like we
usually do (ie a few different exit goto's that just nest the
unlocking properly).
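
As a small userspace sketch of that shape (every name here is made up,
and a pthread rwlock read side stands in for rcu_read_lock(); it only
shows the control flow): the lock is taken unconditionally at the top,
and every exit funnels through the same unlock label, so the error
paths no longer need to know what was or was not locked.

        #include <errno.h>
        #include <pthread.h>
        #include <stddef.h>

        struct sem_obj { pthread_spinlock_t lock; };

        static pthread_rwlock_t rcu_stand_in = PTHREAD_RWLOCK_INITIALIZER;

        /* stubs, just so the sketch compiles */
        static struct sem_obj *lookup_object(int id) { (void)id; return NULL; }
        static int perform_ops(struct sem_obj *o)    { (void)o;  return 0; }

        int semtimedop_sketch(int semid)
        {
                struct sem_obj *sma;
                int error;

                pthread_rwlock_rdlock(&rcu_stand_in);   /* always taken */

                sma = lookup_object(semid);
                if (!sma) {
                        error = -EINVAL;
                        goto out_rcu;           /* nothing else to undo */
                }

                pthread_spin_lock(&sma->lock);
                error = perform_ops(sma);
                pthread_spin_unlock(&sma->lock);
        out_rcu:
                pthread_rwlock_unlock(&rcu_stand_in);   /* one unlock path */
                return error;
        }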

It would simplify the odd locking both for humans and for code
generation. Right now we actually nest those RCU read locks two deep,
as far as I can see.

And it looks like this could be done in fairly small independent steps
("add explicit RCU locking", "split up helper functions", "start using
the simplified helper functions in selected paths that care").

It won't *solve* the locking issues, and I'm sure we'll still see
contention, but it should clean the code up and probably help the
contention numbers visibly. Even if it's only 10% (which, judging by
my profiles, would be a safe lower bound from just moving the security
callback out of the spinlock - and it could easily be 20% or more
because contention often begets more contention) that would be in the
same kind of ballpark as the spinlock improvements. And using Michel's
work would be a largely independent scalability improvement on top.

Anybody willing to give it a try? I suspect Rik's benchmark, some
"perf record", and concentrating on just semtimedop() to begin with
would be a fairly good starting point.

                 Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2013-03-01 18:52 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-06 20:03 [PATCH -v5 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-02-06 20:04 ` [PATCH -v5 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
2013-02-13 12:06   ` [tip:core/locking] x86/smp: Move " tip-bot for Rik van Riel
2013-02-13 16:20     ` Linus Torvalds
2013-02-13 18:30       ` Linus Torvalds
2013-02-14  0:54         ` H. Peter Anvin
2013-02-14  1:31           ` Linus Torvalds
2013-02-14  1:56             ` H. Peter Anvin
2013-02-14 10:50         ` Ingo Molnar
2013-02-14 16:10           ` Linus Torvalds
2013-02-15 15:57             ` Ingo Molnar
2013-02-15  6:48         ` Benjamin Herrenschmidt
2013-02-13 19:08       ` Rik van Riel
2013-02-13 19:36         ` Linus Torvalds
2013-02-13 22:21           ` Rik van Riel
2013-02-13 22:40             ` Linus Torvalds
2013-02-13 23:41               ` Rik van Riel
2013-02-14  1:21                 ` Linus Torvalds
2013-02-14  1:46                   ` Linus Torvalds
2013-02-14 10:43                   ` Ingo Molnar
2013-02-27 16:42                   ` Rik van Riel
2013-02-27 17:10                     ` Linus Torvalds
2013-02-27 19:53                       ` Rik van Riel
2013-02-27 20:18                         ` Linus Torvalds
2013-02-27 21:55                           ` Rik van Riel
     [not found]                             ` <CA+55aFwa0EjGG2NUDYVLVBmXJa2k81YiuNO2yggk=GLRQxhhUQ@mail.gmail.com>
2013-02-28  2:58                               ` Rik van Riel
2013-02-28  3:19                                 ` Linus Torvalds
2013-02-28  4:06                                 ` Davidlohr Bueso
2013-02-28  4:49                                   ` Linus Torvalds
2013-02-28 15:13                                     ` Rik van Riel
2013-02-28 18:22                                       ` Linus Torvalds
2013-02-28 20:26                                         ` Linus Torvalds
2013-02-28 21:14                                           ` Rik van Riel
2013-02-28 21:58                                             ` Linus Torvalds
2013-02-28 22:38                                               ` Rik van Riel
2013-02-28 23:09                                                 ` Linus Torvalds
2013-03-01  6:42                                                   ` Rik van Riel
2013-03-01 18:18                                                     ` Davidlohr Bueso
2013-03-01 18:50                                                       ` Rik van Riel
2013-03-01 18:52                                                       ` Linus Torvalds
2013-02-06 20:04 ` [PATCH -v5 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2013-02-13 12:07   ` [tip:core/locking] x86/smp: Implement " tip-bot for Rik van Riel
2013-02-06 20:05 ` [PATCH -v5 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-02-13 12:08   ` [tip:core/locking] x86/smp: Auto " tip-bot for Rik van Riel
2013-02-06 20:06 ` [PATCH -v5 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
2013-02-13 12:09   ` [tip:core/locking] x86/smp: Keep " tip-bot for Eric Dumazet
2013-02-06 20:07 ` [PATCH -v5 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
2013-02-07 11:11   ` Ingo Molnar
2013-02-07 21:24     ` [PATCH fix " Rik van Riel
2013-02-13 12:10       ` [tip:core/locking] x86/smp: Limit " tip-bot for Rik van Riel
2013-02-07 11:25   ` [PATCH -v5 5/5] x86,smp: limit " Stefano Stabellini
2013-02-07 11:59     ` Raghavendra K T
2013-02-07 13:28     ` Rik van Riel
2013-02-06 20:08 ` [PATCH -v5 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
