linux-kernel.vger.kernel.org archive mirror
* [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
@ 2013-01-25 19:05 Rik van Riel
  2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

[-- Attachment #1: Type: text/plain, Size: 3543 bytes --]

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

	http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket by an amount proportional to a delay factor times the
number of CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
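
In code form, the idea is roughly the following simplified sketch
(illustration only; the actual kernel implementation is in patches
2-4 of this series, and the type names and the cpu_relax() stand-in
below are placeholders):

/* Simplified illustration of proportional ticket lock backoff. */
struct ticket_lock {
	volatile unsigned short head;	/* ticket currently being served */
	volatile unsigned short tail;	/* next ticket to be handed out */
};

static inline void cpu_relax_stub(void) { }	/* stands in for cpu_relax() */

static void wait_for_ticket(struct ticket_lock *lock,
			    unsigned short my_ticket, unsigned delay_factor)
{
	while (lock->head != my_ticket) {
		unsigned waiters_ahead = (unsigned short)(my_ticket - lock->head) - 1;
		unsigned loops = delay_factor * waiters_ahead;

		/* The further back in the queue we are, the longer we back off. */
		while (loops--)
			cpu_relax_stub();
	}
}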

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.
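
For reference, the test program is essentially a micro-benchmark
along the lines of the sketch below. This is an illustration only,
not the actual test used for the numbers above; it uses SysV
semaphores so that every lock/unlock operation enters the kernel.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; };		/* callers must define this themselves */

static int semid;
static atomic_ulong ops;
static volatile int stop;

static void *worker(void *arg)
{
	struct sembuf down = { 0, -1, 0 };	/* "lock" */
	struct sembuf up   = { 0, +1, 0 };	/* "unlock" */

	while (!stop) {
		semop(semid, &down, 1);
		semop(semid, &up, 1);
		atomic_fetch_add(&ops, 1);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t tid[nthreads];
	union semun arg = { .val = 1 };

	semid = semget(IPC_PRIVATE, 1, 0600);
	semctl(semid, 0, SETVAL, arg);

	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);

	sleep(10);			/* measure for ten seconds */
	stop = 1;
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);

	printf("%lu operations/second with %d threads\n",
	       (unsigned long)atomic_load(&ops) / 10, nthreads);
	semctl(semid, 0, IPC_RMID);
	return 0;
}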

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.

The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

For the v4 series, I added code to keep the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.

The performance issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...

-- 
All rights reversed.


[-- Attachment #2: spinlock-backoff-v2.png --]
[-- Type: image/png, Size: 21964 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
@ 2013-01-25 19:16 ` Rik van Riel
  2013-01-25 19:17 ` [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Moving the wait loop for congested ticket locks to its own function allows
us to add things to that wait loop, without growing the size of the
kernel text appreciably.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rafael Aquini <aquini@redhat.com>
---
v2: clean up the code a little, after Michel's suggestion 

 arch/x86/include/asm/spinlock.h |   11 +++++------
 arch/x86/kernel/smp.c           |   14 ++++++++++++++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 
 	inc = xadd(&lock->tickets, inc);
 
-	for (;;) {
-		if (inc.head == inc.tail)
-			break;
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
-	}
+	if (inc.head != inc.tail)
+		ticket_spin_lock_wait(lock, inc);
+
 	barrier();		/* make sure nothing creeps before the lock is taken */
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+	for (;;) {
+		cpu_relax();
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+
+		if (inc.head == inc.tail)
+			break;
+	}
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
  2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
@ 2013-01-25 19:17 ` Rik van Riel
  2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.

If we are next in line behind the current holder of the
lock, we do a fast spin, so as not to waste any time when
the lock is released.

The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
---
 arch/x86/kernel/smp.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..aa743e9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
  */
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
+	__ticket_t head = inc.head, ticket = inc.tail;
+	__ticket_t waiters_ahead;
+	unsigned loops;
+
 	for (;;) {
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
+		waiters_ahead = ticket - head - 1;
+		/*
+		 * We are next after the current lock holder. Check often
+		 * to avoid wasting time when the lock is released.
+		 */
+		if (!waiters_ahead) {
+			do {
+				cpu_relax();
+			} while (ACCESS_ONCE(lock->tickets.head) != ticket);
+			break;
+		}
+		loops = 50 * waiters_ahead;
+		while (loops--)
+			cpu_relax();
 
-		if (inc.head == inc.tail)
+		head = ACCESS_ONCE(lock->tickets.head);
+		if (head == ticket)
 			break;
 	}
 }

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
  2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
  2013-01-25 19:17 ` [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2013-01-25 19:17 ` Rik van Riel
  2013-01-26 12:03   ` Ingo Molnar
  2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

	http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket by an amount proportional to a delay factor times the
number of CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.

Proportional spinlock delay with a high delay factor works well
when there is lots of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/kernel/smp.c |   43 +++++++++++++++++++++++++++++++++++++++----
 1 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index aa743e9..05f828b 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,13 +113,34 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. The solution is to go for a fast spin if we are at the head of
+ * the queue, to slowly increase the delay if we sleep for too short a
+ * time, and to decrease the delay if we slept for too long.
  */
+#define DELAY_SHIFT 8
+#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
+DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
 	__ticket_t waiters_ahead;
-	unsigned loops;
+	unsigned delay = __this_cpu_read(spinlock_delay);
+	unsigned loops = 1;
 
 	for (;;) {
 		waiters_ahead = ticket - head - 1;
@@ -133,14 +154,28 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 			} while (ACCESS_ONCE(lock->tickets.head) != ticket);
 			break;
 		}
-		loops = 50 * waiters_ahead;
+
+		/* Aggressively increase delay, to minimize lock accesses. */
+		if (delay < MAX_SPINLOCK_DELAY)
+			delay += DELAY_FIXED_1 / 7;
+
+		loops = (delay * waiters_ahead) >> DELAY_SHIFT;
 		while (loops--)
 			cpu_relax();
 
 		head = ACCESS_ONCE(lock->tickets.head);
-		if (head == ticket)
+		if (head == ticket) {
+			/*
+			 * We overslept, and do not know by how.
+			 * Exponentially decay the value of delay,
+			 * to get it back to a good value quickly.
+			 */
+			if (delay >= 2 * DELAY_FIXED_1)
+				delay -= max(delay/32, DELAY_FIXED_1);
 			break;
+		}
 	}
+	__this_cpu_write(spinlock_delay, delay);
 }
 
 /*
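
For reference, the delay value above is kept in 24.8 fixed point:
DELAY_FIXED_1 represents 1.0, the "delay += DELAY_FIXED_1 / 7"
increment adds roughly 0.14 to the per-waiter delay factor on each
failed poll, and the shift converts the product back into an integer
loop count. A tiny stand-alone illustration of that arithmetic (not
part of the patch; example values made up):

#include <stdio.h>

#define DELAY_SHIFT	8
#define DELAY_FIXED_1	(1 << DELAY_SHIFT)

int main(void)
{
	unsigned delay = 50 * DELAY_FIXED_1;	/* delay factor of 50.0 */
	unsigned waiters_ahead = 3;

	/* Same computation as in ticket_spin_lock_wait(). */
	unsigned loops = (delay * waiters_ahead) >> DELAY_SHIFT;
	printf("delay 50.0, 3 waiters ahead -> %u cpu_relax() loops\n", loops);

	/* One failed poll bumps the per-waiter delay factor by ~1/7. */
	delay += DELAY_FIXED_1 / 7;
	printf("new delay factor: %u.%02u\n", delay >> DELAY_SHIFT,
	       (delay & (DELAY_FIXED_1 - 1)) * 100 / DELAY_FIXED_1);
	return 0;
}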


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (2 preceding siblings ...)
  2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-25 19:18 ` Rik van Riel
  2013-01-27 13:04   ` Michel Lespinasse
  2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

From: Eric Dumazet <eric.dumazet@gmail.com>

Eric Dumazet found a regression with the first version of the spinlock
backoff code, in a workload where multiple spinlocks were contended,
each having a different wait time.

This patch keeps multiple delay values per cpu, indexed by a hash
of the lock address, to avoid that problem.

Eric Dumazet wrote:

I did some tests with your patches with following configuration :

tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)

and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"

Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)

Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)

It clearly defeats your assumption of a single per cpu delay being OK :
Some cpus are spinning too long while the lock was released.

We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.

With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 05f828b..3d30f6e 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -134,12 +135,26 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_FIXED_1 (1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
 #define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
-DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+	u32 hash;
+	u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+	[0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+		.hash = 0,
+		.delay = MIN_SPINLOCK_DELAY,
+	},
+};
+
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
 	__ticket_t waiters_ahead;
-	unsigned delay = __this_cpu_read(spinlock_delay);
+	u32 hash = hash32_ptr(lock);
+	u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+	struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+	u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
 	unsigned loops = 1;
 
 	for (;;) {
@@ -171,11 +186,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 			 * to get it back to a good value quickly.
 			 */
 			if (delay >= 2 * DELAY_FIXED_1)
-				delay -= max(delay/32, DELAY_FIXED_1);
+				delay -= max_t(u32, delay/32, DELAY_FIXED_1);
 			break;
 		}
 	}
-	__this_cpu_write(spinlock_delay, delay);
+	ent->hash = hash;
+	ent->delay = delay;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (3 preceding siblings ...)
  2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-01-25 19:19 ` Rik van Riel
  2013-01-26 12:00   ` Ingo Molnar
  2013-01-28 18:18   ` Stefano Stabellini
  2013-01-25 19:20 ` [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Modern Intel and AMD CPUs will trap to the host when the guest
is spinning on a spinlock, allowing the host to schedule in
something else.

This effectively means the host is taking care of spinlock
backoff for virtual machines. It also means that doing the
spinlock backoff in the guest anyway can lead to totally
unpredictable results, extremely large backoffs, and
performance regressions.

To prevent those problems, we limit the spinlock backoff
delay, when running in a virtual machine, to a small value.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/processor.h |    2 ++
 arch/x86/kernel/setup.c          |    2 ++
 arch/x86/kernel/smp.c            |   30 ++++++++++++++++++++++++------
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 888184b..a365f97 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
 extern unsigned long arch_align_stack(unsigned long sp);
 extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
 
+extern void init_spinlock_delay(void);
+
 void default_idle(void);
 bool set_pm_idle_to_default(void);
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 23ddd55..b834eae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
 
 	arch_init_ideal_nops();
 
+	init_spinlock_delay();
+
 	register_refined_jiffies(CLOCK_TICK_RATE);
 
 #ifdef CONFIG_EFI
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 1877890..b1a65f0 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -31,6 +31,7 @@
 #include <asm/proto.h>
 #include <asm/apic.h>
 #include <asm/nmi.h>
+#include <asm/hypervisor.h>
 /*
  *	Some notes on x86 processor bugs affecting SMP operation:
  *
@@ -114,6 +115,27 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Modern Intel and AMD CPUs tell the hypervisor when a guest is
+ * spinning excessively on a spinlock. The hypervisor will then
+ * schedule something else, effectively taking care of the backoff
+ * for us. Doing our own backoff on top of the hypervisor's pause
+ * loop exit handling can lead to excessively long delays, and
+ * performance degradations. Limit the spinlock delay in virtual
+ * machines to a smaller value.
+ */
+#define DELAY_SHIFT 8
+#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_GUEST (16 * DELAY_FIXED_1)
+static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
+void __init init_spinlock_delay(void)
+{
+	if (x86_hyper)
+		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
+}
+
+/*
  * Wait on a congested ticket spinlock. Many spinlocks are embedded in
  * data structures; having many CPUs pounce on the cache line with the
  * spinlock simultaneously can slow down the lock holder, and the system
@@ -131,10 +153,6 @@ static bool smp_no_nmi_ipi = false;
  * the queue, to slowly increase the delay if we sleep for too short a
  * time, and to decrease the delay if we slept for too long.
  */
-#define DELAY_SHIFT 8
-#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
-#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
-#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
 #define DELAY_HASH_SHIFT 6
 struct delay_entry {
 	u32 hash;
@@ -171,7 +189,7 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 		}
 
 		/* Aggressively increase delay, to minimize lock accesses. */
-		if (delay < MAX_SPINLOCK_DELAY)
+		if (delay < max_spinlock_delay)
 			delay += DELAY_FIXED_1 / 7;
 
 		loops = (delay * waiters_ahead) >> DELAY_SHIFT;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (4 preceding siblings ...)
  2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
@ 2013-01-25 19:20 ` Rik van Riel
  2013-01-26  7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

Subject: x86,smp: add debugging code to track spinlock delay value

From: Eric Dumazet <eric.dumazet@gmail.com>

This code prints out the maximum spinlock delay value and the
backtrace that pushed it that far.

On systems with serial consoles, the act of printing can cause
the spinlock delay value to explode. It can still be useful as
a debugging tool, but is probably too verbose to merge upstream
in this form.

Not-signed-off-by: Rik van Riel <riel@redhat.com>
Not-signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>


diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 8908dfc..f973fc9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -165,6 +165,8 @@ static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay
 	},
 };
 
+static DEFINE_PER_CPU(u32, maxdelay);
+
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
 	__ticket_t head = inc.head, ticket = inc.tail;
@@ -210,6 +212,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 	}
 	ent->hash = hash;
 	ent->delay = delay;
+
+	if (__this_cpu_read(maxdelay) * 4 < delay * 3) {
+		pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay>>DELAY_SHIFT);
+		__this_cpu_write(maxdelay, delay);
+		WARN_ON(1);
+	}
 }
 
 /*


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (5 preceding siblings ...)
  2013-01-25 19:20 ` [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
@ 2013-01-26  7:21 ` Mike Galbraith
  2013-01-26 12:05   ` Ingo Molnar
  2013-01-28  6:49 ` Raghavendra K T
  2013-02-03 15:28 ` Chegu Vinod
  8 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2013-01-26  7:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
	chegu_vinod, raghavendra.kt, mingo

On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:

> The performance issue observed with AIM7 is still a mystery.

Hm.  AIM7 mystery _may_ be the same crud I see on a 4 node 40 core box.
Stock scheduler knobs are too preempt happy, produce unstable results.
I twiddle them as below to stabilize results.

I'm testing a load balancing series from Alex Shi with AIM7 and whatnot,
added your series on top of it and retested.  What I see is
improvement. 

Oodles of numbers follow.  Sorry that your numbers are mixed in with my
numbers, but this is just an excerpt from my test log, and I'm too lazy
to reformat and filter.  You can save wear and tear on your eyeballs by
just poking 'D'.  There does appear to be evidence that your patch set
improved this load though, so in case you want to see numbers, here come
a bunch, a quick scroll-by may be worth it.

The very heavy load end did not improve, which seems odd, but whatever.
Numbers...

sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

aim7 compute
         3.8.0-performance                                       3.8.0-balance                                           3.8.0-powersaving
Tasks    jobs/min  jti  jobs/min/task      real       cpu        jobs/min  jti  jobs/min/task      real       cpu        jobs/min  jti  jobs/min/task      real       cpu
    1      432.86  100       432.8571     14.00      3.99          433.48  100       433.4764     13.98      3.97          433.17  100       433.1665     13.99      3.98
    1      437.23  100       437.2294     13.86      3.85          436.60  100       436.5994     13.88      3.86          435.66  100       435.6578     13.91      3.90
    1      434.10  100       434.0974     13.96      3.95          436.29  100       436.2851     13.89      3.89          436.29  100       436.2851     13.89      3.87
    5     2400.95   99       480.1902     12.62     12.49         2554.81   98       510.9612     11.86      7.55         2487.68   98       497.5369     12.18      8.22
    5     2341.58   99       468.3153     12.94     13.95         2578.72   99       515.7447     11.75      7.25         2527.11   99       505.4212     11.99      7.90
    5     2350.66   99       470.1319     12.89     13.66         2600.86   99       520.1717     11.65      7.09         2508.28   98       501.6556     12.08      8.24
   10     4291.78   99       429.1785     14.12     40.14         5334.51   99       533.4507     11.36     11.13         5183.92   98       518.3918     11.69     12.15
   10     4334.76   99       433.4764     13.98     38.70         5311.13   99       531.1131     11.41     11.23         5215.15   99       521.5146     11.62     12.53
   10     4273.62   99       427.3625     14.18     40.29         5287.96   99       528.7958     11.46     11.46         5144.31   98       514.4312     11.78     12.32
   20     8487.39   94       424.3697     14.28     63.14        10594.41   99       529.7203     11.44     23.72        10575.92   99       528.7958     11.46     22.08
   20     8387.54   97       419.3772     14.45     77.01        10575.92   98       528.7958     11.46     23.41        10520.83   99       526.0417     11.52     21.88
   20     8713.16   95       435.6578     13.91     55.10        10659.63   99       532.9815     11.37     24.17        10539.13   99       526.9565     11.50     22.13
   40    16786.70   99       419.6676     14.44    170.08        19469.88   98       486.7470     12.45     60.78        19967.05   98       499.1763     12.14     51.40
   40    16728.78   99       418.2195     14.49    172.96        19627.53   98       490.6883     12.35     65.26        20386.88   98       509.6720     11.89     46.91
   40    16763.49   99       419.0871     14.46    171.42        20033.06   98       500.8264     12.10     51.44        20682.59   98       517.0648     11.72     42.45
   80    33024.52   98       412.8065     14.68    355.10        33205.48   98       415.0685     14.60    336.90        33690.06   97       421.1258     14.39    248.91
   80    33002.04   99       412.5255     14.69    356.27        33949.58   96       424.3697     14.28    283.87        33160.05   97       414.5007     14.62    264.85
   80    33047.03   99       413.0879     14.67    355.22        33137.39   98       414.2174     14.63    338.92        33526.97   97       419.0871     14.46    257.31
  160    64254.47   98       401.5905     15.09    391.30        64000.00   98       400.0000     15.15    396.87        65073.83   97       406.7114     14.90    371.09
  160    64468.09   98       402.9255     15.04    390.28        64553.93   98       403.4621     15.02    389.49        64640.00   98       404.0000     15.00    379.82
  160    64297.08   98       401.8568     15.08    389.45        64856.19   98       405.3512     14.95    383.64        64683.12   98       404.2695     14.99    379.43
  320   121579.94   98       379.9373     15.95    466.15       122039.02   98       381.3719     15.89    456.17       122811.91   97       383.7872     15.79    445.16
  320   121427.68   98       379.4615     15.97    466.68       121885.61   98       380.8925     15.91    462.30       122734.18   97       383.5443     15.80    447.03
  320   121200.00   98       378.7500     16.00    465.58       121048.69   97       378.2772     16.02    470.41       122501.58   97       382.8174     15.83    447.74
  640   215466.67   95       336.6667     18.00    624.64       211703.06   96       330.7860     18.32    654.19       216066.85   96       337.6045     17.95    620.45
  640   216066.85   95       337.6045     17.95    623.32       210097.51   96       328.2774     18.46    664.05       214394.69   96       334.9917     18.09    634.61
  640   213568.28   96       333.7004     18.16    641.45       214039.74   96       334.4371     18.12    638.48       215227.52   96       336.2930     18.02    630.19
 1280   320528.93   92       250.4132     24.20   1120.46       327015.18   91       255.4806     23.72   1084.29       326464.65   91       255.0505     23.76   1087.39
 1280   324552.30   92       253.5565     23.90   1098.43       323739.57   92       252.9215     23.96   1104.09       325232.70   92       254.0881     23.85   1095.50
 1280   327153.10   91       255.5884     23.71   1084.75       326190.08   91       254.8360     23.78   1086.72       325232.70   92       254.0881     23.85   1093.32
 2560   418946.80   86       163.6511     37.03   2140.41       418720.65   88       163.5628     37.05   2146.30       419626.72   87       163.9167     36.97   2139.74
 2560   418607.66   88       163.5186     37.06   2144.50       419059.97   89       163.6953     37.02   2143.47       419059.97   87       163.6953     37.02   2140.70
 2560   418607.66   84       163.5186     37.06   2138.81       419513.25   84       163.8724     36.98   2139.09       419286.49   87       163.7838     37.00   2140.60

/me adds spinlock series to take it for a spin, repeats test series...

         3.8.0-performance                                       3.8.0-balance                                           3.8.0-powersaving
Tasks    jobs/min  jti  jobs/min/task      real       cpu        jobs/min  jti  jobs/min/task      real       cpu        jobs/min  jti  jobs/min/task      real       cpu
    1      436.91  100       436.9142     13.87      3.69          433.17  100       433.1665     13.99      3.98          434.72  100       434.7202     13.94      3.93
    1      440.41  100       440.4070     13.76      3.74          437.86  100       437.8613     13.84      3.83          435.97  100       435.9712     13.90      3.89
    1      441.69  100       441.6910     13.72      3.71          435.66  100       435.6578     13.91      3.88          438.49  100       438.4949     13.82      3.81
    5     2267.96   98       453.5928     13.36     15.72         2580.92   99       516.1840     11.74      7.22         2535.56   99       507.1130     11.95      8.04
    5     2295.45   99       459.0909     13.20     15.32         2578.72   98       515.7447     11.75      7.22         2516.61   98       503.3223     12.04      8.24
    5     2279.91   98       455.9819     13.29     15.49         2600.86   99       520.1717     11.65      7.19         2514.52   98       502.9046     12.05      8.49
   10     4391.30   99       439.1304     13.80     36.83         5315.79   99       531.5789     11.40     11.21         5210.66   98       521.0662     11.63     12.41
   10     4375.45   99       437.5451     13.85     37.28         5201.72   99       520.1717     11.65     12.92         5242.21   99       524.2215     11.56     12.29
   10     4413.69   99       441.3693     13.73     36.15         5311.13   99       531.1131     11.41     11.11         5219.64   98       521.9638     11.61     12.25
   20     8801.74   97       440.0871     13.77     64.53        10725.66   99       536.2832     11.30     22.84        10511.71   99       525.5854     11.53     21.72
   20     8931.47   97       446.5733     13.57     58.68        10640.91   99       532.0457     11.39     23.95        10575.92   98       528.7958     11.46     21.53
   20     9004.46   96       450.2229     13.46     50.98        10669.01   99       533.4507     11.36     24.06        10557.49   99       527.8746     11.48     22.00
   40    17758.24   99       443.9560     13.65    135.44        20682.59   98       517.0648     11.72     43.25        20369.75   98       509.2437     11.90     47.04
   40    17667.64   99       441.6910     13.72    142.15        20860.59   98       521.5146     11.62     41.30        20250.63   98       506.2657     11.97     46.28
   40    17745.24   98       443.6310     13.66    137.20        20507.61   98       512.6904     11.82     44.98        20421.23   98       510.5307     11.87     44.25
   80    35725.87   99       446.5733     13.57    266.54        35438.60   98       442.9825     13.68    262.32        35028.90   97       437.8613     13.84    216.09
   80    35778.60   99       447.2325     13.55    266.69        35464.52   98       443.3065     13.67    256.49        35386.86   97       442.3358     13.70    210.93
   80    35805.02   99       447.5628     13.54    263.42        35542.52   98       444.2815     13.64    254.81        36098.29   97       451.2286     13.43    181.01
  160    68378.00   98       427.3625     14.18    316.02        68522.97   98       428.2686     14.15    321.43        69059.83   98       431.6239     14.04    298.16
  160    68522.97   98       428.2686     14.15    318.01        69257.14   97       432.8571     14.00    293.77        68765.96   98       429.7872     14.10    303.93
  160    68474.58   98       427.9661     14.16    320.64        68185.65   98       426.1603     14.22    324.11        68474.58   98       427.9661     14.16    311.32
  320   126910.99   97       396.5969     15.28    409.53       127244.09   98       397.6378     15.24    408.56       127831.25   98       399.4726     15.17    399.66
  320   127747.04   98       399.2095     15.18    403.06       127244.09   98       397.6378     15.24    404.61       127327.64   98       397.8989     15.23    406.47
  320   127662.94   98       398.9467     15.19    402.52       127411.30   98       398.1603     15.22    406.13       128423.84   98       401.3245     15.10    390.71
  640   220865.60   95       345.1025     17.56    591.17       222130.58   95       347.0790     17.46    586.26       216307.86   96       337.9810     17.93    615.63
  640   219118.64   96       342.3729     17.70    603.53       222768.52   95       348.0758     17.41    581.76       216549.41   95       338.3585     17.91    616.81
  640   217765.30   96       340.2583     17.81    611.16       218624.58   96       341.6009     17.74    607.05       220739.90   96       344.9061     17.57    590.80
 1280   323065.39   92       252.3948     24.01   1104.10       326602.11   92       255.1579     23.75   1088.87       324416.56   91       253.4504     23.91   1100.05
 1280   326877.37   93       255.3729     23.73   1082.77       326327.30   93       254.9432     23.77   1087.10       322662.23   92       252.0799     24.04   1103.16
 1280   327291.14   92       255.6962     23.70   1081.55       324280.94   93       253.3445     23.92   1098.44       326877.37   92       255.3729     23.73   1084.82
 2560   419173.20   87       163.7395     37.01   2139.19       419513.25   85       163.8724     36.98   2136.87       418946.80   86       163.6511     37.03   2138.73
 2560   418720.65   87       163.5628     37.05   2142.92       419399.84   86       163.8281     36.99   2143.80       418607.66   86       163.5186     37.06   2141.70
 2560   419399.84   87       163.8281     36.99   2140.22       419853.86   85       164.0054     36.95   2137.59       419967.51   86       164.0498     36.94   2139.05


Virgin source reference runs.

         3.8.0-virgin                                            3.8.0-virgin select_idle_sibling() disabled
Tasks    jobs/min  jti  jobs/min/task      real       cpu        jobs/min  jti  jobs/min/task      real       cpu
    1      435.97  100       435.9712     13.90      3.74          435.66  100       435.6578     13.91      3.90
    1      440.41  100       440.4070     13.76      3.73          434.72  100       434.7202     13.94      3.93
    1      440.73  100       440.7273     13.75      3.73          435.66  100       435.6578     13.91      3.91
    5     2376.47   99       475.2941     12.75     13.07         2267.96   99       453.5928     13.36     16.12
    5     2374.61   98       474.9216     12.76     12.84         2363.49   98       472.6989     12.82     13.02
    5     2271.36   99       454.2729     13.34     15.91         2305.94   98       461.1872     13.14     14.63
   10     4577.04   99       457.7039     13.24     31.53         4337.87   98       433.7867     13.97     37.78
   10     4246.67   99       424.6671     14.27     41.33         4331.67   98       433.1665     13.99     37.29
   10     4741.78   99       474.1784     12.78     25.74         4264.60   99       426.4602     14.21     38.81
   20     8991.10   93       449.5549     13.48     49.67         8381.74   97       419.0871     14.46     77.22
   20     9209.73   95       460.4863     13.16     45.81         8553.28   95       427.6641     14.17     62.19
   20     8808.14   96       440.4070     13.76     57.42         8547.25   95       427.3625     14.18     61.43
   40    16625.51   98       415.6379     14.58    169.95        16717.24   98       417.9310     14.50    166.00
   40    16694.21   98       417.3554     14.52    171.03        16682.73   98       417.0681     14.53    161.04
   40    16367.32   98       409.1830     14.81    170.28        16763.49   97       419.0871     14.46    159.94
   80    33114.75   99       413.9344     14.64    350.04        33228.24   97       415.3530     14.59    325.12
   80    33092.15   99       413.6519     14.65    352.77        33228.24   97       415.3530     14.59    326.98
   80    33092.15   99       413.6519     14.65    352.02        33228.24   97       415.3530     14.59    332.65
  160    63789.47   98       398.6842     15.20    394.70        64683.12   97       404.2695     14.99    373.11
  160    64169.42   98       401.0589     15.11    389.92        64856.19   97       405.3512     14.95    367.61
  160    64382.47   98       402.3904     15.06    392.40        64510.98   97       403.1936     15.03    378.98
  320   121200.00   97       378.7500     16.00    466.96       122115.87   97       381.6121     15.88    450.06
  320   120447.20   98       376.3975     16.10    469.83       121732.58   97       380.4143     15.93    464.31
  320   121809.05   97       380.6533     15.92    464.26       121200.00   98       378.7500     16.00    469.79
  640   210897.23   96       329.5269     18.39    658.28       211703.06   95       330.7860     18.32    645.51
  640   213450.74   95       333.5168     18.17    643.07       215108.15   95       336.1065     18.03    618.47
  640   215946.55   96       337.4165     17.96    626.89       217521.03   95       339.8766     17.83    606.25
 1280   323200.00   92       252.5000     24.00   1105.69       323469.56   91       252.7106     23.98   1092.70
 1280   325096.40   91       253.9816     23.86   1095.47       324145.42   92       253.2386     23.93   1091.97
 1280   322930.89   93       252.2898     24.02   1105.80       321858.92   93       251.4523     24.10   1102.40
 2560   418269.08   88       163.3864     37.09   2142.88       418156.33   87       163.3423     37.10   2139.12
 2560   418946.80   88       163.6511     37.03   2141.52       417931.03   86       163.2543     37.12   2142.56
 2560   418494.74   88       163.4745     37.07   2147.66       418720.65   88       163.5628     37.05   2137.14


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
@ 2013-01-26 12:00   ` Ingo Molnar
  2013-01-26 12:47     ` Borislav Petkov
  2013-01-28 18:18   ` Stefano Stabellini
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2013-01-26 12:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
	chegu_vinod, raghavendra.kt, mingo


* Rik van Riel <riel@redhat.com> wrote:

>  static bool smp_no_nmi_ipi = false;
>  
>  /*
> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
> + * spinning excessively on a spinlock. The hypervisor will then
> + * schedule something else, effectively taking care of the backoff
> + * for us. Doing our own backoff on top of the hypervisor's pause
> + * loop exit handling can lead to excessively long delays, and
> + * performance degradations. Limit the spinlock delay in virtual
> + * machines to a smaller value.
> + */
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_GUEST (16 * DELAY_FIXED_1)
> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
> +void __init init_spinlock_delay(void)
> +{
> +	if (x86_hyper)
> +		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;

I realize that you took existing code and extended it, but that 
chunk of code looks pretty disgusting visually now - at minimum 
it should be vertically aligned as most other kernel code does.

The comment should also tell that the unit of these values is 
'spinlock-op loops' or so.
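
For illustration, something along these lines would address both
points (a sketch of one possible layout only):

/* Spinlock backoff delay values, in units of spinlock-op loops (24.8 fixed point). */
#define DELAY_SHIFT			8
#define DELAY_FIXED_1			(1 << DELAY_SHIFT)
#define MIN_SPINLOCK_DELAY		(1 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY_NATIVE	(16000 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY_GUEST	(16 * DELAY_FIXED_1)

static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;

void __init init_spinlock_delay(void)
{
	if (x86_hyper)
		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
}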

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor
  2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-26 12:03   ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2013-01-26 12:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
	chegu_vinod, raghavendra.kt, mingo


* Rik van Riel <riel@redhat.com> wrote:

> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
> 
> The paper "Non-scalable locks are dangerous" is a good reference:
> 
> 	http://pdos.csail.mit.edu/papers/linux:lock.pdf
> 
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
> 
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
> 
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
> 
> Proportional delay in ticket locks means delaying between checks
> of the ticket by an amount proportional to a delay factor times the
> number of CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
> 
> Proportional spinlock delay with a high delay factor works well
> when there is lots of contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
> 
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Reviewed-by: Michel Lespinasse <walken@google.com>
> ---
>  arch/x86/kernel/smp.c |   43 +++++++++++++++++++++++++++++++++++++++----
>  1 files changed, 39 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index aa743e9..05f828b 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,13 +113,34 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>  static bool smp_no_nmi_ipi = false;
>  
>  /*
> - * Wait on a congested ticket spinlock.
> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
> + * data structures; having many CPUs pounce on the cache line with the
> + * spinlock simultaneously can slow down the lock holder, and the system
> + * as a whole.
> + *
> + * To prevent total performance collapse in case of bad spinlock contention,
> + * perform proportional backoff. The per-cpu value of delay is automatically
> + * tuned to limit the number of times spinning CPUs poll the lock before
> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
> + * a spinlock, and keeps system performance from dropping off a cliff.
> + *
> + * There is a tradeoff. If we poll too often, the whole system is slowed
> + * down. If we sleep too long, the lock will go unused for a period of
> + * time. The solution is to go for a fast spin if we are at the head of
> + * the queue, to slowly increase the delay if we sleep for too short a
> + * time, and to decrease the delay if we slept for too long.
>   */
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
> +DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };

This one's ugly too, in several ways, please improve it.
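
Guessing at the objections, a cleaned-up variant might look something
like the sketch below (static, no brace initializer around a scalar,
units documented); illustrative only:

/* Per-cpu backoff delay, in spinlock-op loops (24.8 fixed point). */
#define DELAY_SHIFT		8
#define DELAY_FIXED_1		(1 << DELAY_SHIFT)
#define MIN_SPINLOCK_DELAY	(1 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY	(16000 * DELAY_FIXED_1)

static DEFINE_PER_CPU(unsigned int, spinlock_delay) = MIN_SPINLOCK_DELAY;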

> +		if (head == ticket) {
> +			/*
> +			 * We overslept, and do not know by how.

s/by how./by how much.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
  2013-01-26  7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
@ 2013-01-26 12:05   ` Ingo Molnar
  2013-01-26 13:10     ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2013-01-26 12:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
	lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo


* Mike Galbraith <bitbucket@online.de> wrote:

> On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> 
> > The performance issue observed with AIM7 is still a mystery.
> 
> Hm.  AIM7 mystery _may_ be the same crud I see on a 4 node 40 
> core box. Stock scheduler knobs are too preempt happy, produce 
> unstable results. I twiddle them as below to stabilize 
> results.
> 
> I'm testing a load balancing series from Alex Shi with AIM7 
> and whatnot, added your series on top of it and retested.  
> What I see is improvement.

Since we want both series that's good news, agreed?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-01-26 12:00   ` Ingo Molnar
@ 2013-01-26 12:47     ` Borislav Petkov
  2013-02-04 13:50       ` Rik van Riel
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2013-01-26 12:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
	lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo

On Sat, Jan 26, 2013 at 01:00:43PM +0100, Ingo Molnar wrote:
> > +void __init init_spinlock_delay(void)
> > +{
> > +	if (x86_hyper)
> > +		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
> 
> I realize that you took existing code and extended it, but that 
> chunk of code looks pretty disgusting visually now - at minimum 
> it should be vertically aligned as most other kernel code does.
> 
> The comment should also tell that the unit of these values is 
> 'spinlock-op loops' or so.

Also, with PARAVIRT_GUEST currently being made optional, x86_hyper is
maybe a bad choice of a variable to test.

Maybe instead to this:

	if (IS_ENABLED(CONFIG_PARAVIRT_GUEST))
		...
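
Spelled out, that would be something like (illustrative sketch):

void __init init_spinlock_delay(void)
{
	if (IS_ENABLED(CONFIG_PARAVIRT_GUEST))
		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
}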

Btw, this CONFIG_PARAVIRT_GUEST will change into
CONFIG_HYPERVISOR_GUEST.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
  2013-01-26 12:05   ` Ingo Molnar
@ 2013-01-26 13:10     ` Mike Galbraith
  0 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2013-01-26 13:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
	lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo

On Sat, 2013-01-26 at 13:05 +0100, Ingo Molnar wrote: 
> * Mike Galbraith <bitbucket@online.de> wrote:
> 
> > On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> > 
> > > The performance issue observed with AIM7 is still a mystery.
> > 
> > Hm.  AIM7 mystery _may_ be the same crud I see on a 4 node 40 
> > core box. Stock scheduler knobs are too preempt happy, produce 
> > unstable results. I twiddle them as below to stabilize 
> > results.
> > 
> > I'm testing a load balancing series from Alex Shi with AIM7 
> > and whatnot, added your series on top of it and retested.  
> > What I see is improvement.
> 
> Since we want both series that's good news, agreed?

Yes, both look good.

BTW, the balance and powersaving low end improvements in the Alex set
are not due to any select_idle_sibling() misbehavior as one might
suspect when looking at the set, it's innocent (this time), it's the
balancing policies themselves.  I don't see where it comes from, gotta
be some turbo thingy methinks, but who cares, better is better.

-Mike


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
  2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-01-27 13:04   ` Michel Lespinasse
  2013-02-06 20:10     ` Rik van Riel
  0 siblings, 1 reply; 22+ messages in thread
From: Michel Lespinasse @ 2013-01-27 13:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

On Fri, Jan 25, 2013 at 11:18 AM, Rik van Riel <riel@redhat.com> wrote:
> +       u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;

I still don't like the resetting of delay to MIN_SPINLOCK_DELAY when
there is a hash collision.
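
One conceivable alternative (purely illustrative, and not necessarily
what Michel has in mind) would be to inherit the colliding slot's
delay instead of dropping back to the minimum:

	/* Sketch: on a hash collision, reuse whatever delay the colliding lock tuned. */
	u32 delay = ent->delay;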

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (6 preceding siblings ...)
  2013-01-26  7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
@ 2013-01-28  6:49 ` Raghavendra K T
  2013-02-03 15:28 ` Chegu Vinod
  8 siblings, 0 replies; 22+ messages in thread
From: Raghavendra K T @ 2013-01-28  6:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
	chegu_vinod, mingo

On 01/26/2013 12:35 AM, Rik van Riel wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> 	http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks means delaying between checks
> of the ticket by an amount proportional to a delay factor times the
> number of CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> The test case has a number of threads locking and unlocking a
> semaphore. With just one thread, everything sits in the CPU
> cache and throughput is around 2.6 million operations per
> second, with a 5-10% variation.
>
> Once a second thread gets involved, data structures bounce
> from CPU to CPU, and performance deteriorates to about 1.25
> million operations per second, with a 5-10% variation.
>
> However, as more and more threads get added to the mix,
> performance with the vanilla kernel continues to deteriorate.
> Once I hit 24 threads, on a 24 CPU, 4 node test system,
> performance is down to about 290k operations/second.
>
> With a proportional backoff delay added to the spinlock
> code, performance with 24 threads goes up to about 400k
> operations/second with a 50x delay, and about 900k operations/second
> with a 250x delay. However, with a 250x delay, performance with
> 2-5 threads is worse than with a 50x delay.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention,
> and should also protect against the (likely) case of the fixed
> delay factor being wrong for other hardware.
>
> The attached graph shows the performance of the multi threaded
> semaphore lock/unlock test case, with 1-24 threads, on the
> vanilla kernel, with 10x, 50x, and 250x proportional delay,
> as well as the v1 patch series with autotuning for 2x and 2.7x
> spinning before the lock is obtained, and with the v2 series.
>
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> For the v3 series, I tried out all the ideas suggested by
> Michel. They made perfect sense, but in the end it turned
> out they did not work as well as the simple, aggressive
> "try to make the delay longer" policy I have now. Several
> small bug fixes and cleanups have been integrated.
>
> For the v4 series, I added code to keep the maximum spinlock
> delay to a small value when running in a virtual machine. That
> should solve the performance regression seen in virtual machines.
>
> The performance issue observed with AIM7 is still a mystery.
>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>

After the introduction of patch 5, which limits the delay loops to 16,
I am no longer seeing the degradation in virtual guests reported
earlier, but improvements instead.

For the whole series:
Reviewed-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
  2013-01-26 12:00   ` Ingo Molnar
@ 2013-01-28 18:18   ` Stefano Stabellini
  1 sibling, 0 replies; 22+ messages in thread
From: Stefano Stabellini @ 2013-01-28 18:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
	chegu_vinod, raghavendra.kt, mingo, Konrad Rzeszutek Wilk,
	xen-devel, Jan Beulich, Stefano Stabellini

On Fri, 25 Jan 2013, Rik van Riel wrote:
> Modern Intel and AMD CPUs will trap to the host when the guest
> is spinning on a spinlock, allowing the host to schedule in
> something else.
> 
> This effectively means the host is taking care of spinlock
> backoff for virtual machines. It also means that doing the
> spinlock backoff in the guest anyway can lead to totally
> unpredictable results, extremely large backoffs, and
> performance regressions.
> 
> To prevent those problems, we limit the spinlock backoff
> delay, when running in a virtual machine, to a small value.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  arch/x86/include/asm/processor.h |    2 ++
>  arch/x86/kernel/setup.c          |    2 ++
>  arch/x86/kernel/smp.c            |   30 ++++++++++++++++++++++++------
>  3 files changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 888184b..a365f97 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
>  extern unsigned long arch_align_stack(unsigned long sp);
>  extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
>  
> +extern void init_spinlock_delay(void);
> +
>  void default_idle(void);
>  bool set_pm_idle_to_default(void);
>  
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 23ddd55..b834eae 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
>  
>  	arch_init_ideal_nops();
>  
> +	init_spinlock_delay();
> +
>  	register_refined_jiffies(CLOCK_TICK_RATE);
>  
>  #ifdef CONFIG_EFI
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 1877890..b1a65f0 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -31,6 +31,7 @@
>  #include <asm/proto.h>
>  #include <asm/apic.h>
>  #include <asm/nmi.h>
> +#include <asm/hypervisor.h>
>  /*
>   *	Some notes on x86 processor bugs affecting SMP operation:
>   *
> @@ -114,6 +115,27 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>  static bool smp_no_nmi_ipi = false;
>  
>  /*
> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
> + * spinning excessively on a spinlock.

I take it that you are talking about PAUSE-loop exiting?


> The hypervisor will then
> + * schedule something else, effectively taking care of the backoff
> + * for us. Doing our own backoff on top of the hypervisor's pause
> + * loop exit handling can lead to excessively long delays, and
> + * performance degradations. Limit the spinlock delay in virtual
> + * machines to a smaller value.
> + */
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_GUEST (16 * DELAY_FIXED_1)
> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
> +void __init init_spinlock_delay(void)
> +{
> +	if (x86_hyper)
> +		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
> +}

Before reducing max_spinlock_delay, shouldn't we check that PAUSE-loop
exiting is available? What if we are running on an older x86 machine
that doesn't support it?

It is probably worth mentioning in the comment that Xen PV guests cannot
take advantage of PAUSE-loop exiting (they don't run inside a VMX
environment), but that's OK because Xen PV guests don't set x86_hyper.

On the other hand Xen PV on HVM guests can take advantage of it (they
run in a VMX environment), and in fact they set x86_hyper to
x86_hyper_xen_hvm.
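
For reference, the DELAY_* constants in the hunk above are fixed point
with DELAY_SHIFT = 8 fractional bits: DELAY_FIXED_1 is 256 and stands
for a delay factor of 1.0, so MAX_SPINLOCK_DELAY_GUEST corresponds to a
factor of 16 and MAX_SPINLOCK_DELAY_NATIVE to 16000.  A minimal
standalone sketch of the arithmetic follows; the proportional formula is
inferred from the cover letter's description of proportional delay
quoted earlier in the thread, not copied from the patches.

#include <stdio.h>

#define DELAY_SHIFT   8
#define DELAY_FIXED_1 (1 << DELAY_SHIFT)               /* 1.0 in fixed point */
#define MAX_SPINLOCK_DELAY_GUEST  (16 * DELAY_FIXED_1)

/*
 * Assumed proportional-backoff step: spin roughly delay * waiters_ahead
 * iterations before rechecking the ticket, with the fixed-point shift
 * taken out at the end.
 */
static unsigned int backoff_loops(unsigned int delay, unsigned int waiters_ahead)
{
        return (delay * waiters_ahead) >> DELAY_SHIFT;
}

int main(void)
{
        /* Guest cap (factor 16) with 3 CPUs ahead of us: 48 loops. */
        printf("%u\n", backoff_loops(MAX_SPINLOCK_DELAY_GUEST, 3));
        return 0;
}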

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
  2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
                   ` (7 preceding siblings ...)
  2013-01-28  6:49 ` Raghavendra K T
@ 2013-02-03 15:28 ` Chegu Vinod
  8 siblings, 0 replies; 22+ messages in thread
From: Chegu Vinod @ 2013-02-03 15:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
	raghavendra.kt, mingo

On 1/25/2013 11:05 AM, Rik van Riel wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> 	http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> The test case has a number of threads locking and unlocking a
> semaphore. With just one thread, everything sits in the CPU
> cache and throughput is around 2.6 million operations per
> second, with a 5-10% variation.
>
> Once a second thread gets involved, data structures bounce
> from CPU to CPU, and performance deteriorates to about 1.25
> million operations per second, with a 5-10% variation.
>
> However, as more and more threads get added to the mix,
> performance with the vanilla kernel continues to deteriorate.
> Once I hit 24 threads, on a 24 CPU, 4 node test system,
> performance is down to about 290k operations/second.
>
> With a proportional backoff delay added to the spinlock
> code, performance with 24 threads goes up to about 400k
> operations/second with a 50x delay, and about 900k operations/second
> with a 250x delay. However, with a 250x delay, performance with
> 2-5 threads is worse than with a 50x delay.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention,
> and should also protect against the (likely) case of the fixed
> delay factor being wrong for other hardware.
>
> The attached graph shows the performance of the multi threaded
> semaphore lock/unlock test case, with 1-24 threads, on the
> vanilla kernel, with 10x, 50x, and 250x proportional delay,
> as well as the v1 patch series with autotuning for 2x and 2.7x
> spinning before the lock is obtained, and with the v2 series.
>
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> For the v3 series, I tried out all the ideas suggested by
> Michel. They made perfect sense, but in the end it turned
> out they did not work as well as the simple, aggressive
> "try to make the delay longer" policy I have now. Several
> small bug fixes and cleanups have been integrated.
>
> For the v4 series, I added code to keep the maximum spinlock
> delay to a small value when running in a virtual machine. That
> should solve the performance regression seen in virtual machines.
>
> The performance issue observed with AIM7 is still a mystery.
>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>
I got back on the machine and re-ran the AIM7-highsystime microbenchmark
with 2000 users and 100 jobs per user on 20, 40, and 80 vcpu guests,
using the 3.7.5 kernel with and without Rik's latest patches.

Host Platform: 8-socket (80-core) Westmere with 1TB RAM.

Config 1: 3.7.5 base running on the host and in the guest

Config 2: 3.7.5 + Rik's patches running on the host and in the guest

(Note: I didn't change the PLE settings on the host... The severe drop
at 40-way and 80-way is due to the un-optimized PLE handler. Raghu's
PLE fixes should address that.)

          Config 1   Config 2
20 vcpu       170K       168K
40 vcpu        37K        37K
80 vcpu      10.5K      11.5K


Not much difference between the two configs... (need to test it along
with Raghu's fixes).

BTW, I noticed that there were results posted earlier using the
AIM7-compute workload.  The AIM7-highsystime workload is a lot more
kernel-intensive.

FYI
Vinod

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-01-26 12:47     ` Borislav Petkov
@ 2013-02-04 13:50       ` Rik van Riel
  2013-02-04 14:02         ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-02-04 13:50 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, linux-kernel, aquini, walken,
	eric.dumazet, lwoodman, knoel, chegu_vinod, raghavendra.kt,
	mingo

On 01/26/2013 07:47 AM, Borislav Petkov wrote:
> On Sat, Jan 26, 2013 at 01:00:43PM +0100, Ingo Molnar wrote:
>>> +void __init init_spinlock_delay(void)
>>> +{
>>> +	if (x86_hyper)
>>> +		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
>>
>> I realize that you took existing code and extended it, but that
>> chunk of code looks pretty disgusting visually now - at minimum
>> it should be vertically aligned as most other kernel code does.
>>
>> The comment should also tell that the unit of these values is
>> 'spinlock-op loops' or so.

Will do.

> Also, with PARAVIRT_GUEST currently being made optional, x86_hyper is
> maybe a bad choice of variable to test.
>
> Maybe instead to this:
>
> 	if (IS_ENABLED(CONFIG_PARAVIRT_GUEST))
> 		...

We need to know whether we are actually running on top of a
hypervisor, not whether we have the code compiled in to do
so.

After all, the majority of distribution kernels will have
CONFIG_PARAVIRT_GUEST set, but the majority of those kernels
will be running bare metal...
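
To spell out the distinction, here is a tiny standalone illustration;
the two booleans are stand-ins for IS_ENABLED(CONFIG_PARAVIRT_GUEST) and
for "x86_hyper != NULL", not the kernel symbols themselves.

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        /* Stand-in for IS_ENABLED(CONFIG_PARAVIRT_GUEST): true on any
         * kernel built with guest support, even when it ends up booting
         * on bare metal (which is most distribution kernels). */
        bool paravirt_guest_built_in = true;

        /* Stand-in for "x86_hyper != NULL": true only when a hypervisor
         * was actually detected at boot. */
        bool hypervisor_detected = false;

        if (paravirt_guest_built_in)
                printf("compile-time check: would cap the delay on bare metal too\n");

        if (hypervisor_detected)
                printf("runtime check: caps the delay only in a guest\n");

        return 0;
}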

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-04 13:50       ` Rik van Riel
@ 2013-02-04 14:02         ` Borislav Petkov
  2013-02-06 17:05           ` Rik van Riel
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2013-02-04 14:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, linux-kernel, aquini, walken, eric.dumazet,
	lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo

On Mon, Feb 04, 2013 at 08:50:33AM -0500, Rik van Riel wrote:
> We need to know whether we are actually running on top of a
> hypervisor, not whether we have the code compiled in to do so.

Oh ok, I see.

The thing is, if CONFIG_PARAVIRT_GUEST is disabled, x86_hyper won't
exist, see: http://marc.info/?l=linux-kernel&m=135936817627848&w=2

So maybe the hypervisor guest code should take care of this itself and
set max_spinlock_delay from init_hypervisor() at init time instead?
Seems only fair to me...

> After all, the majority of distribution kernels will have
> CONFIG_PARAVIRT_GUEST set, but the majority of those kernels
> will be running bare metal...

Yeah :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
  2013-02-04 14:02         ` Borislav Petkov
@ 2013-02-06 17:05           ` Rik van Riel
  0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-02-06 17:05 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, linux-kernel, aquini, walken,
	eric.dumazet, lwoodman, knoel, chegu_vinod, raghavendra.kt,
	mingo

On 02/04/2013 09:02 AM, Borislav Petkov wrote:
> On Mon, Feb 04, 2013 at 08:50:33AM -0500, Rik van Riel wrote:
>> We need to know whether we are actually running on top of a
>> hypervisor, not whether we have the code compiled in to do so.
>
> Oh ok, I see.
>
> The thing is, if CONFIG_PARAVIRT_GUEST is disabled, x86_hyper won't
> exist, see: http://marc.info/?l=linux-kernel&m=135936817627848&w=2
>
> So maybe the hypervisor guest should itself take care of this and upon
> init it should set the max_spinlock_delay in init_hypervisor() instead?
> Seems only fair to me...

I ended up making the call from init_hypervisor_platform(), which
already does the check for x86_hyper itself, so smp.c no longer needs
to check x86_hyper at all.
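
A rough standalone mock-up of the restructuring described above; the
helpers below are stand-ins for the kernel's detection path and are not
the actual -v5 code.

#include <stdio.h>

#define DELAY_FIXED_1             (1 << 8)
#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY_GUEST  (16 * DELAY_FIXED_1)

static int max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;

/* Stand-in for hypervisor detection; in the kernel this corresponds to
 * the point in init_hypervisor_platform() where x86_hyper has just been
 * set. */
static int hypervisor_detected(void)
{
        return 0;       /* pretend we are on bare metal */
}

/* Called only from the detection path, so the setter itself no longer
 * needs to test x86_hyper. */
static void init_spinlock_delay(void)
{
        max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
}

int main(void)
{
        if (hypervisor_detected())
                init_spinlock_delay();

        printf("max_spinlock_delay = %d\n", max_spinlock_delay);
        return 0;
}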


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
  2013-01-27 13:04   ` Michel Lespinasse
@ 2013-02-06 20:10     ` Rik van Riel
  2013-02-09 23:50       ` Michel Lespinasse
  0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-02-06 20:10 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

On 01/27/2013 08:04 AM, Michel Lespinasse wrote:
> On Fri, Jan 25, 2013 at 11:18 AM, Rik van Riel <riel@redhat.com> wrote:
>> +       u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
>
> I still don't like the resetting of delay to MIN_SPINLOCK_DELAY when
> there is a hash collision.

I've been spending some time looking at this, because I am
not a fan either.

However, it seems to work and I failed to come up with
anything better. Therefore, I have left it as is in the
-v5 patch series.
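
For readers who have not looked at patch 4, here is a standalone sketch
of the per-hashed-address delay table under discussion; the table size
and hash function are illustrative choices, not the patch's, and the
real code keeps one table per CPU.

#include <stdint.h>
#include <stdio.h>

#define DELAY_SHIFT        8
#define DELAY_FIXED_1      (1u << DELAY_SHIFT)
#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
#define DELAY_TABLE_SIZE   32                  /* illustrative size */

struct delay_entry {
        uint32_t hash;          /* which lock this delay was tuned for */
        uint32_t delay;         /* auto-tuned delay, fixed point */
};

static struct delay_entry delay_table[DELAY_TABLE_SIZE];

static uint32_t hash_lock(uintptr_t addr)
{
        /* Any cheap pointer hash will do for the sketch. */
        return (uint32_t)((addr >> 4) * 2654435761u);
}

static uint32_t lookup_delay(uintptr_t addr)
{
        uint32_t hash = hash_lock(addr);
        struct delay_entry *ent = &delay_table[hash % DELAY_TABLE_SIZE];

        /* The line quoted above: on a collision (the slot holds another
         * lock's entry) fall back to MIN_SPINLOCK_DELAY instead of
         * reusing the other lock's tuned value. */
        return (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
}

int main(void)
{
        int dummy_lock;

        printf("%u\n", lookup_delay((uintptr_t)&dummy_lock));
        return 0;
}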

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
  2013-02-06 20:10     ` Rik van Riel
@ 2013-02-09 23:50       ` Michel Lespinasse
  0 siblings, 0 replies; 22+ messages in thread
From: Michel Lespinasse @ 2013-02-09 23:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
	raghavendra.kt, mingo

On Wed, Feb 6, 2013 at 12:10 PM, Rik van Riel <riel@redhat.com> wrote:
> On 01/27/2013 08:04 AM, Michel Lespinasse wrote:
>>
>> On Fri, Jan 25, 2013 at 11:18 AM, Rik van Riel <riel@redhat.com> wrote:
>>>
>>> +       u32 delay = (ent->hash == hash) ? ent->delay :
>>> MIN_SPINLOCK_DELAY;
>>
>> I still don't like the resetting of delay to MIN_SPINLOCK_DELAY when
>> there is a hash collision.
>
> I've been spending some time looking at this, because I am
> not a fan either.
>
> However, it seems to work and I failed to come up with
> anything better. Therefore, I have left it as is in the
> -v5 patch series.

Does that mean you know of workloads that would regress if you didn't
reset the delay to MIN_SPINLOCK_DELAY when detecting a hash collision?

I have not seen any public reports of that...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2013-02-09 23:50 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-01-26 12:03   ` Ingo Molnar
2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
2013-01-27 13:04   ` Michel Lespinasse
2013-02-06 20:10     ` Rik van Riel
2013-02-09 23:50       ` Michel Lespinasse
2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
2013-01-26 12:00   ` Ingo Molnar
2013-01-26 12:47     ` Borislav Petkov
2013-02-04 13:50       ` Rik van Riel
2013-02-04 14:02         ` Borislav Petkov
2013-02-06 17:05           ` Rik van Riel
2013-01-28 18:18   ` Stefano Stabellini
2013-01-25 19:20 ` [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
2013-01-26  7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
2013-01-26 12:05   ` Ingo Molnar
2013-01-26 13:10     ` Mike Galbraith
2013-01-28  6:49 ` Raghavendra K T
2013-02-03 15:28 ` Chegu Vinod
