* [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
@ 2013-01-25 19:05 Rik van Riel
2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:05 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the uncontended
case. After all, if a lock is frequently contended, the data
structure it protects can usually be improved to reduce or
eliminate the contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means spacing out our polls of
the ticket, based on a delay factor multiplied by the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
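The idea, as a minimal sketch (delay_factor and my_ticket are
illustrative names, not the actual patch; the real implementation
follows in the patches below):

        __ticket_t head;

        while ((head = ACCESS_ONCE(lock->tickets.head)) != my_ticket) {
                /* scale the wait by how many CPUs are queued before us */
                unsigned loops = delay_factor * (my_ticket - head - 1);

                do {
                        cpu_relax();    /* x86 "pause" instruction */
                } while (loops--);
        }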
The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.
Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.
However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.
With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.
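For a sense of scale: with the 50x factor, a waiter with 10 CPUs
ahead of it executes roughly 50 * 10 = 500 cpu_relax() loops between
polls of the lock; the 250x factor stretches that to about 2500,
which keeps the lock's cache line quiet under heavy contention but
oversleeps when only a few threads are competing.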
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
The attached graph shows the performance of the multi-threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.
The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.
For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.
For the v4 series, I added code to limit the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.
The performance issue observed with AIM7 is still a mystery.
Performance is within the margin of error of v2, so the graph
has not been updated.
Please let me know if you manage to break this code in any way,
so I can fix it...
--
All rights reversed.
[-- Attachment #2: spinlock-backoff-v2.png --]
[-- Type: image/png, Size: 21964 bytes --]
* [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
@ 2013-01-25 19:16 ` Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
` (7 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:16 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
Moving the wait loop for contended ticket locks to its own function
allows us to add things to that wait loop, without growing the size
of the kernel text appreciably.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rafael Aquini <aquini@redhat.com>
---
v2: clean up the code a little, after Michel's suggestion
arch/x86/include/asm/spinlock.h | 11 +++++------
arch/x86/kernel/smp.c | 14 ++++++++++++++
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
# define UNLOCK_LOCK_PREFIX
#endif
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
/*
* Ticket locks are conceptually two parts, one indicating the current head of
* the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
inc = xadd(&lock->tickets, inc);
- for (;;) {
- if (inc.head == inc.tail)
- break;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
+ if (inc.head != inc.tail)
+ ticket_spin_lock_wait(lock, inc);
+
barrier(); /* make sure nothing creeps before the lock is taken */
}
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ for (;;) {
+ cpu_relax();
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail)
+ break;
+ }
+}
+
+/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
* [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
@ 2013-01-25 19:17 ` Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:17 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.
If we are next in line behind the current holder of the
lock, we do a fast spin, so as not to waste any time when
the lock is released.
The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.
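As a concrete example of the fixed factor: with 4 CPUs queued ahead
of us we spin 50 * 4 = 200 cpu_relax() iterations between polls of
the ticket, while the CPU that is next in line polls continuously.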
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
---
arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..aa743e9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
*/
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
+ __ticket_t head = inc.head, ticket = inc.tail;
+ __ticket_t waiters_ahead;
+ unsigned loops;
+
for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ waiters_ahead = ticket - head - 1;
+ /*
+ * We are next after the current lock holder. Check often
+ * to avoid wasting time when the lock is released.
+ */
+ if (!waiters_ahead) {
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(lock->tickets.head) != ticket);
+ break;
+ }
+ loops = 50 * waiters_ahead;
+ while (loops--)
+ cpu_relax();
- if (inc.head == inc.tail)
+ head = ACCESS_ONCE(lock->tickets.head);
+ if (head == ticket)
break;
}
}
* [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2013-01-25 19:17 ` Rik van Riel
2013-01-26 12:03 ` Ingo Molnar
2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
` (5 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:17 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the uncontended
case. After all, if a lock is frequently contended, the data
structure it protects can usually be improved to reduce or
eliminate the contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means spacing out our polls of
the ticket, based on a delay factor multiplied by the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
Proportional spinlock delay with a high delay factor works well
when there is lots of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.
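For reference, the delay value is kept in 8.8 fixed point: DELAY_SHIFT
is 8, so DELAY_FIXED_1 is 256 and a delay factor of 50.0 is stored as
12800. With 3 waiters ahead, loops = (12800 * 3) >> 8 = 150, the same
rate as the fixed 50-per-waiter value from the previous patch. Each
round spent waiting nudges the factor up by DELAY_FIXED_1 / 7 (about
0.14), while oversleeping decays it by delay/32, but by at least 1.0.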
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
---
arch/x86/kernel/smp.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index aa743e9..05f828b 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,13 +113,34 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. The solution is to go for a fast spin if we are at the head of
+ * the queue, to slowly increase the delay if we sleep for too short a
+ * time, and to decrease the delay if we slept for too long.
*/
+#define DELAY_SHIFT 8
+#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
+DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
- unsigned loops;
+ unsigned delay = __this_cpu_read(spinlock_delay);
+ unsigned loops = 1;
for (;;) {
waiters_ahead = ticket - head - 1;
@@ -133,14 +154,28 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
} while (ACCESS_ONCE(lock->tickets.head) != ticket);
break;
}
- loops = 50 * waiters_ahead;
+
+ /* Aggressively increase delay, to minimize lock accesses. */
+ if (delay < MAX_SPINLOCK_DELAY)
+ delay += DELAY_FIXED_1 / 7;
+
+ loops = (delay * waiters_ahead) >> DELAY_SHIFT;
while (loops--)
cpu_relax();
head = ACCESS_ONCE(lock->tickets.head);
- if (head == ticket)
+ if (head == ticket) {
+ /*
+ * We overslept, and do not know by how.
+ * Exponentially decay the value of delay,
+ * to get it back to a good value quickly.
+ */
+ if (delay >= 2 * DELAY_FIXED_1)
+ delay -= max(delay/32, DELAY_FIXED_1);
break;
+ }
}
+ __this_cpu_write(spinlock_delay, delay);
}
/*
* [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (2 preceding siblings ...)
2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-25 19:18 ` Rik van Riel
2013-01-27 13:04 ` Michel Lespinasse
2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:18 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
From: Eric Dumazet <eric.dumazet@gmail.com>
Eric Dumazet found a regression with the first version of the spinlock
backoff code, in a workload where multiple spinlocks were contended,
each having a different wait time.
This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.
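With DELAY_HASH_SHIFT set to 6, each cpu keeps 64 delay entries.
Every entry also remembers the 32-bit hash of the lock that last
used it, so when two locks collide in the same slot the mismatch is
detected and the delay restarts from MIN_SPINLOCK_DELAY, rather than
inheriting tuning that was done for a different lock.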
Eric Dumazet wrote:
I did some tests with your patches, with the following configuration:
tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force contention on the qdisc lock, even with a multiqueue net
device)
and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
Machine: 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
(24 threads), and a fast NIC (10Gbps).
This resulted in a 13% regression (676 Mbit/s -> 595 Mbit/s).
In this workload we have at least two contended spinlocks, with
different delays (the spinlocks are not held for the same duration).
It clearly defeats your assumption that a single per-cpu delay is OK:
some cpus keep spinning long after the lock has been released.
We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.
With the following patch, I get 982 Mbit/s with the same benchmark,
an increase of 45% instead of a 13% regression.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 05f828b..3d30f6e 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/hash.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -134,12 +135,26 @@ static bool smp_no_nmi_ipi = false;
#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
-DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+ u32 hash;
+ u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+ [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+ .hash = 0,
+ .delay = MIN_SPINLOCK_DELAY,
+ },
+};
+
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
- unsigned delay = __this_cpu_read(spinlock_delay);
+ u32 hash = hash32_ptr(lock);
+ u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+ struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+ u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
unsigned loops = 1;
for (;;) {
@@ -171,11 +186,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
* to get it back to a good value quickly.
*/
if (delay >= 2 * DELAY_FIXED_1)
- delay -= max(delay/32, DELAY_FIXED_1);
+ delay -= max_t(u32, delay/32, DELAY_FIXED_1);
break;
}
}
- __this_cpu_write(spinlock_delay, delay);
+ ent->hash = hash;
+ ent->delay = delay;
}
/*
* [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (3 preceding siblings ...)
2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-01-25 19:19 ` Rik van Riel
2013-01-26 12:00 ` Ingo Molnar
2013-01-28 18:18 ` Stefano Stabellini
2013-01-25 19:20 ` [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
` (3 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:19 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
Modern Intel and AMD CPUs will trap to the host when the guest
is spinning on a spinlock, allowing the host to schedule in
something else.
This effectively means the host is taking care of spinlock
backoff for virtual machines. It also means that doing the
spinlock backoff in the guest anyway can lead to totally
unpredictable results, extremely large backoffs, and
performance regressions.
To prevent those problems, we limit the spinlock backoff
delay, when running in a virtual machine, to a small value.
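Concretely, the cap drops from 16000 * DELAY_FIXED_1 on bare metal
to 16 * DELAY_FIXED_1 in a guest, so a virtualized waiter never
inserts more than 16 loops per waiter ahead of it between polls.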
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/setup.c | 2 ++
arch/x86/kernel/smp.c | 30 ++++++++++++++++++++++++------
3 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 888184b..a365f97 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
extern unsigned long arch_align_stack(unsigned long sp);
extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
+extern void init_spinlock_delay(void);
+
void default_idle(void);
bool set_pm_idle_to_default(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 23ddd55..b834eae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
arch_init_ideal_nops();
+ init_spinlock_delay();
+
register_refined_jiffies(CLOCK_TICK_RATE);
#ifdef CONFIG_EFI
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 1877890..b1a65f0 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -31,6 +31,7 @@
#include <asm/proto.h>
#include <asm/apic.h>
#include <asm/nmi.h>
+#include <asm/hypervisor.h>
/*
* Some notes on x86 processor bugs affecting SMP operation:
*
@@ -114,6 +115,27 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
+ * Modern Intel and AMD CPUs tell the hypervisor when a guest is
+ * spinning excessively on a spinlock. The hypervisor will then
+ * schedule something else, effectively taking care of the backoff
+ * for us. Doing our own backoff on top of the hypervisor's pause
+ * loop exit handling can lead to excessively long delays, and
+ * performance degradations. Limit the spinlock delay in virtual
+ * machines to a smaller value.
+ */
+#define DELAY_SHIFT 8
+#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY_GUEST (16 * DELAY_FIXED_1)
+static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
+void __init init_spinlock_delay(void)
+{
+ if (x86_hyper)
+ max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
+}
+
+/*
* Wait on a congested ticket spinlock. Many spinlocks are embedded in
* data structures; having many CPUs pounce on the cache line with the
* spinlock simultaneously can slow down the lock holder, and the system
@@ -131,10 +153,6 @@ static bool smp_no_nmi_ipi = false;
* the queue, to slowly increase the delay if we sleep for too short a
* time, and to decrease the delay if we slept for too long.
*/
-#define DELAY_SHIFT 8
-#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
-#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
-#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
#define DELAY_HASH_SHIFT 6
struct delay_entry {
u32 hash;
@@ -171,7 +189,7 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
}
/* Aggressively increase delay, to minimize lock accesses. */
- if (delay < MAX_SPINLOCK_DELAY)
+ if (delay < max_spinlock_delay)
delay += DELAY_FIXED_1 / 7;
loops = (delay * waiters_ahead) >> DELAY_SHIFT;
* [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (4 preceding siblings ...)
2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
@ 2013-01-25 19:20 ` Rik van Riel
2013-01-26 7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
` (2 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-01-25 19:20 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
Subject: x86,smp: add debugging code to track spinlock delay value
From: Eric Dumazet <eric.dumazet@gmail.com>
This code prints out the maximum spinlock delay value and the
backtrace that pushed it that far.
On systems with serial consoles, the act of printing can cause
the spinlock delay value to explode. It can still be useful as
a debugging tool, but is probably too verbose to merge upstream
in this form.
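The report fires whenever the new delay exceeds the previously
recorded per-cpu maximum by more than a third (maxdelay * 4 <
delay * 3), which keeps the output reasonably sparse as the value
ratchets up.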
Not-signed-off-by: Rik van Riel <riel@redhat.com>
Not-signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 8908dfc..f973fc9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -165,6 +165,8 @@ static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay
},
};
+static DEFINE_PER_CPU(u32, maxdelay);
+
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
@@ -210,6 +212,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
}
ent->hash = hash;
ent->delay = delay;
+
+ if (__this_cpu_read(maxdelay) * 4 < delay * 3) {
+ pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay>>DELAY_SHIFT);
+ __this_cpu_write(maxdelay, delay);
+ WARN_ON(1);
+ }
}
/*
* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (5 preceding siblings ...)
2013-01-25 19:20 ` [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
@ 2013-01-26 7:21 ` Mike Galbraith
2013-01-26 12:05 ` Ingo Molnar
2013-01-28 6:49 ` Raghavendra K T
2013-02-03 15:28 ` Chegu Vinod
8 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2013-01-26 7:21 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
chegu_vinod, raghavendra.kt, mingo
On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> The performance issue observed with AIM7 is still a mystery.
Hm. AIM7 mystery _may_ be the same crud I see on a 4 node 40 core box.
Stock scheduler knobs are too preempt happy, produce unstable results.
I twiddle them as below to stabilize results.
I'm testing a load balancing series from Alex Shi with AIM7 and whatnot,
added your series on top of it and retested. What I see is
improvement.
Oodles of numbers follow. Sorry that your numbers are mixed in with my
numbers, but this is just an excerpt from my test log, and I'm too lazy
to reformat and filter. You can save wear and tear on your eyeballs by
just poking 'D'. There does appear to be evidence that your patch set
improved this load though, so in case you want to see numbers, here come
a bunch, a quick scroll-by may be worth it.
The very heavy load end did not improve, which seems odd, but whatever.
Numbers...
sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms
aim7 compute
3.8.0-performance 3.8.0-balance 3.8.0-powersaving
Tasks jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu
1 432.86 100 432.8571 14.00 3.99 433.48 100 433.4764 13.98 3.97 433.17 100 433.1665 13.99 3.98
1 437.23 100 437.2294 13.86 3.85 436.60 100 436.5994 13.88 3.86 435.66 100 435.6578 13.91 3.90
1 434.10 100 434.0974 13.96 3.95 436.29 100 436.2851 13.89 3.89 436.29 100 436.2851 13.89 3.87
5 2400.95 99 480.1902 12.62 12.49 2554.81 98 510.9612 11.86 7.55 2487.68 98 497.5369 12.18 8.22
5 2341.58 99 468.3153 12.94 13.95 2578.72 99 515.7447 11.75 7.25 2527.11 99 505.4212 11.99 7.90
5 2350.66 99 470.1319 12.89 13.66 2600.86 99 520.1717 11.65 7.09 2508.28 98 501.6556 12.08 8.24
10 4291.78 99 429.1785 14.12 40.14 5334.51 99 533.4507 11.36 11.13 5183.92 98 518.3918 11.69 12.15
10 4334.76 99 433.4764 13.98 38.70 5311.13 99 531.1131 11.41 11.23 5215.15 99 521.5146 11.62 12.53
10 4273.62 99 427.3625 14.18 40.29 5287.96 99 528.7958 11.46 11.46 5144.31 98 514.4312 11.78 12.32
20 8487.39 94 424.3697 14.28 63.14 10594.41 99 529.7203 11.44 23.72 10575.92 99 528.7958 11.46 22.08
20 8387.54 97 419.3772 14.45 77.01 10575.92 98 528.7958 11.46 23.41 10520.83 99 526.0417 11.52 21.88
20 8713.16 95 435.6578 13.91 55.10 10659.63 99 532.9815 11.37 24.17 10539.13 99 526.9565 11.50 22.13
40 16786.70 99 419.6676 14.44 170.08 19469.88 98 486.7470 12.45 60.78 19967.05 98 499.1763 12.14 51.40
40 16728.78 99 418.2195 14.49 172.96 19627.53 98 490.6883 12.35 65.26 20386.88 98 509.6720 11.89 46.91
40 16763.49 99 419.0871 14.46 171.42 20033.06 98 500.8264 12.10 51.44 20682.59 98 517.0648 11.72 42.45
80 33024.52 98 412.8065 14.68 355.10 33205.48 98 415.0685 14.60 336.90 33690.06 97 421.1258 14.39 248.91
80 33002.04 99 412.5255 14.69 356.27 33949.58 96 424.3697 14.28 283.87 33160.05 97 414.5007 14.62 264.85
80 33047.03 99 413.0879 14.67 355.22 33137.39 98 414.2174 14.63 338.92 33526.97 97 419.0871 14.46 257.31
160 64254.47 98 401.5905 15.09 391.30 64000.00 98 400.0000 15.15 396.87 65073.83 97 406.7114 14.90 371.09
160 64468.09 98 402.9255 15.04 390.28 64553.93 98 403.4621 15.02 389.49 64640.00 98 404.0000 15.00 379.82
160 64297.08 98 401.8568 15.08 389.45 64856.19 98 405.3512 14.95 383.64 64683.12 98 404.2695 14.99 379.43
320 121579.94 98 379.9373 15.95 466.15 122039.02 98 381.3719 15.89 456.17 122811.91 97 383.7872 15.79 445.16
320 121427.68 98 379.4615 15.97 466.68 121885.61 98 380.8925 15.91 462.30 122734.18 97 383.5443 15.80 447.03
320 121200.00 98 378.7500 16.00 465.58 121048.69 97 378.2772 16.02 470.41 122501.58 97 382.8174 15.83 447.74
640 215466.67 95 336.6667 18.00 624.64 211703.06 96 330.7860 18.32 654.19 216066.85 96 337.6045 17.95 620.45
640 216066.85 95 337.6045 17.95 623.32 210097.51 96 328.2774 18.46 664.05 214394.69 96 334.9917 18.09 634.61
640 213568.28 96 333.7004 18.16 641.45 214039.74 96 334.4371 18.12 638.48 215227.52 96 336.2930 18.02 630.19
1280 320528.93 92 250.4132 24.20 1120.46 327015.18 91 255.4806 23.72 1084.29 326464.65 91 255.0505 23.76 1087.39
1280 324552.30 92 253.5565 23.90 1098.43 323739.57 92 252.9215 23.96 1104.09 325232.70 92 254.0881 23.85 1095.50
1280 327153.10 91 255.5884 23.71 1084.75 326190.08 91 254.8360 23.78 1086.72 325232.70 92 254.0881 23.85 1093.32
2560 418946.80 86 163.6511 37.03 2140.41 418720.65 88 163.5628 37.05 2146.30 419626.72 87 163.9167 36.97 2139.74
2560 418607.66 88 163.5186 37.06 2144.50 419059.97 89 163.6953 37.02 2143.47 419059.97 87 163.6953 37.02 2140.70
2560 418607.66 84 163.5186 37.06 2138.81 419513.25 84 163.8724 36.98 2139.09 419286.49 87 163.7838 37.00 2140.60
/me adds spinlock series to take it for a spin, repeats test series...
3.8.0-performance 3.8.0-balance 3.8.0-powersaving
Tasks jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu
1 436.91 100 436.9142 13.87 3.69 433.17 100 433.1665 13.99 3.98 434.72 100 434.7202 13.94 3.93
1 440.41 100 440.4070 13.76 3.74 437.86 100 437.8613 13.84 3.83 435.97 100 435.9712 13.90 3.89
1 441.69 100 441.6910 13.72 3.71 435.66 100 435.6578 13.91 3.88 438.49 100 438.4949 13.82 3.81
5 2267.96 98 453.5928 13.36 15.72 2580.92 99 516.1840 11.74 7.22 2535.56 99 507.1130 11.95 8.04
5 2295.45 99 459.0909 13.20 15.32 2578.72 98 515.7447 11.75 7.22 2516.61 98 503.3223 12.04 8.24
5 2279.91 98 455.9819 13.29 15.49 2600.86 99 520.1717 11.65 7.19 2514.52 98 502.9046 12.05 8.49
10 4391.30 99 439.1304 13.80 36.83 5315.79 99 531.5789 11.40 11.21 5210.66 98 521.0662 11.63 12.41
10 4375.45 99 437.5451 13.85 37.28 5201.72 99 520.1717 11.65 12.92 5242.21 99 524.2215 11.56 12.29
10 4413.69 99 441.3693 13.73 36.15 5311.13 99 531.1131 11.41 11.11 5219.64 98 521.9638 11.61 12.25
20 8801.74 97 440.0871 13.77 64.53 10725.66 99 536.2832 11.30 22.84 10511.71 99 525.5854 11.53 21.72
20 8931.47 97 446.5733 13.57 58.68 10640.91 99 532.0457 11.39 23.95 10575.92 98 528.7958 11.46 21.53
20 9004.46 96 450.2229 13.46 50.98 10669.01 99 533.4507 11.36 24.06 10557.49 99 527.8746 11.48 22.00
40 17758.24 99 443.9560 13.65 135.44 20682.59 98 517.0648 11.72 43.25 20369.75 98 509.2437 11.90 47.04
40 17667.64 99 441.6910 13.72 142.15 20860.59 98 521.5146 11.62 41.30 20250.63 98 506.2657 11.97 46.28
40 17745.24 98 443.6310 13.66 137.20 20507.61 98 512.6904 11.82 44.98 20421.23 98 510.5307 11.87 44.25
80 35725.87 99 446.5733 13.57 266.54 35438.60 98 442.9825 13.68 262.32 35028.90 97 437.8613 13.84 216.09
80 35778.60 99 447.2325 13.55 266.69 35464.52 98 443.3065 13.67 256.49 35386.86 97 442.3358 13.70 210.93
80 35805.02 99 447.5628 13.54 263.42 35542.52 98 444.2815 13.64 254.81 36098.29 97 451.2286 13.43 181.01
160 68378.00 98 427.3625 14.18 316.02 68522.97 98 428.2686 14.15 321.43 69059.83 98 431.6239 14.04 298.16
160 68522.97 98 428.2686 14.15 318.01 69257.14 97 432.8571 14.00 293.77 68765.96 98 429.7872 14.10 303.93
160 68474.58 98 427.9661 14.16 320.64 68185.65 98 426.1603 14.22 324.11 68474.58 98 427.9661 14.16 311.32
320 126910.99 97 396.5969 15.28 409.53 127244.09 98 397.6378 15.24 408.56 127831.25 98 399.4726 15.17 399.66
320 127747.04 98 399.2095 15.18 403.06 127244.09 98 397.6378 15.24 404.61 127327.64 98 397.8989 15.23 406.47
320 127662.94 98 398.9467 15.19 402.52 127411.30 98 398.1603 15.22 406.13 128423.84 98 401.3245 15.10 390.71
640 220865.60 95 345.1025 17.56 591.17 222130.58 95 347.0790 17.46 586.26 216307.86 96 337.9810 17.93 615.63
640 219118.64 96 342.3729 17.70 603.53 222768.52 95 348.0758 17.41 581.76 216549.41 95 338.3585 17.91 616.81
640 217765.30 96 340.2583 17.81 611.16 218624.58 96 341.6009 17.74 607.05 220739.90 96 344.9061 17.57 590.80
1280 323065.39 92 252.3948 24.01 1104.10 326602.11 92 255.1579 23.75 1088.87 324416.56 91 253.4504 23.91 1100.05
1280 326877.37 93 255.3729 23.73 1082.77 326327.30 93 254.9432 23.77 1087.10 322662.23 92 252.0799 24.04 1103.16
1280 327291.14 92 255.6962 23.70 1081.55 324280.94 93 253.3445 23.92 1098.44 326877.37 92 255.3729 23.73 1084.82
2560 419173.20 87 163.7395 37.01 2139.19 419513.25 85 163.8724 36.98 2136.87 418946.80 86 163.6511 37.03 2138.73
2560 418720.65 87 163.5628 37.05 2142.92 419399.84 86 163.8281 36.99 2143.80 418607.66 86 163.5186 37.06 2141.70
2560 419399.84 87 163.8281 36.99 2140.22 419853.86 85 164.0054 36.95 2137.59 419967.51 86 164.0498 36.94 2139.05
Virgin source reference runs.
3.8.0-virgin 3.8.0-virgin select_idle_sibling() disabled
Tasks jobs/min jti jobs/min/task real cpu jobs/min jti jobs/min/task real cpu
1 435.97 100 435.9712 13.90 3.74 435.66 100 435.6578 13.91 3.90
1 440.41 100 440.4070 13.76 3.73 434.72 100 434.7202 13.94 3.93
1 440.73 100 440.7273 13.75 3.73 435.66 100 435.6578 13.91 3.91
5 2376.47 99 475.2941 12.75 13.07 2267.96 99 453.5928 13.36 16.12
5 2374.61 98 474.9216 12.76 12.84 2363.49 98 472.6989 12.82 13.02
5 2271.36 99 454.2729 13.34 15.91 2305.94 98 461.1872 13.14 14.63
10 4577.04 99 457.7039 13.24 31.53 4337.87 98 433.7867 13.97 37.78
10 4246.67 99 424.6671 14.27 41.33 4331.67 98 433.1665 13.99 37.29
10 4741.78 99 474.1784 12.78 25.74 4264.60 99 426.4602 14.21 38.81
20 8991.10 93 449.5549 13.48 49.67 8381.74 97 419.0871 14.46 77.22
20 9209.73 95 460.4863 13.16 45.81 8553.28 95 427.6641 14.17 62.19
20 8808.14 96 440.4070 13.76 57.42 8547.25 95 427.3625 14.18 61.43
40 16625.51 98 415.6379 14.58 169.95 16717.24 98 417.9310 14.50 166.00
40 16694.21 98 417.3554 14.52 171.03 16682.73 98 417.0681 14.53 161.04
40 16367.32 98 409.1830 14.81 170.28 16763.49 97 419.0871 14.46 159.94
80 33114.75 99 413.9344 14.64 350.04 33228.24 97 415.3530 14.59 325.12
80 33092.15 99 413.6519 14.65 352.77 33228.24 97 415.3530 14.59 326.98
80 33092.15 99 413.6519 14.65 352.02 33228.24 97 415.3530 14.59 332.65
160 63789.47 98 398.6842 15.20 394.70 64683.12 97 404.2695 14.99 373.11
160 64169.42 98 401.0589 15.11 389.92 64856.19 97 405.3512 14.95 367.61
160 64382.47 98 402.3904 15.06 392.40 64510.98 97 403.1936 15.03 378.98
320 121200.00 97 378.7500 16.00 466.96 122115.87 97 381.6121 15.88 450.06
320 120447.20 98 376.3975 16.10 469.83 121732.58 97 380.4143 15.93 464.31
320 121809.05 97 380.6533 15.92 464.26 121200.00 98 378.7500 16.00 469.79
640 210897.23 96 329.5269 18.39 658.28 211703.06 95 330.7860 18.32 645.51
640 213450.74 95 333.5168 18.17 643.07 215108.15 95 336.1065 18.03 618.47
640 215946.55 96 337.4165 17.96 626.89 217521.03 95 339.8766 17.83 606.25
1280 323200.00 92 252.5000 24.00 1105.69 323469.56 91 252.7106 23.98 1092.70
1280 325096.40 91 253.9816 23.86 1095.47 324145.42 92 253.2386 23.93 1091.97
1280 322930.89 93 252.2898 24.02 1105.80 321858.92 93 251.4523 24.10 1102.40
2560 418269.08 88 163.3864 37.09 2142.88 418156.33 87 163.3423 37.10 2139.12
2560 418946.80 88 163.6511 37.03 2141.52 417931.03 86 163.2543 37.12 2142.56
2560 418494.74 88 163.4745 37.07 2147.66 418720.65 88 163.5628 37.05 2137.14
* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
@ 2013-01-26 12:00 ` Ingo Molnar
2013-01-26 12:47 ` Borislav Petkov
2013-01-28 18:18 ` Stefano Stabellini
1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2013-01-26 12:00 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
chegu_vinod, raghavendra.kt, mingo
* Rik van Riel <riel@redhat.com> wrote:
> static bool smp_no_nmi_ipi = false;
>
> /*
> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
> + * spinning excessively on a spinlock. The hypervisor will then
> + * schedule something else, effectively taking care of the backoff
> + * for us. Doing our own backoff on top of the hypervisor's pause
> + * loop exit handling can lead to excessively long delays, and
> + * performance degradations. Limit the spinlock delay in virtual
> + * machines to a smaller value.
> + */
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_GUEST (16 * DELAY_FIXED_1)
> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
> +void __init init_spinlock_delay(void)
> +{
> + if (x86_hyper)
> + max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
I realize that you took existing code and extended it, but that
chunk of code looks pretty disgusting visually now - at minimum
it should be vertically aligned as most other kernel code does.
The comment should also tell that the unit of these values is
'spinlock-op loops' or so.
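Something along the lines of:

        #define DELAY_SHIFT                     8
        #define DELAY_FIXED_1                   (1 << DELAY_SHIFT)
        #define MIN_SPINLOCK_DELAY              (1 * DELAY_FIXED_1)     /* spinlock-poll loops */
        #define MAX_SPINLOCK_DELAY_NATIVE       (16000 * DELAY_FIXED_1) /* spinlock-poll loops */
        #define MAX_SPINLOCK_DELAY_GUEST        (16 * DELAY_FIXED_1)    /* spinlock-poll loops */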
Thanks,
Ingo
* Re: [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor
2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-26 12:03 ` Ingo Molnar
0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2013-01-26 12:03 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
chegu_vinod, raghavendra.kt, mingo
* Rik van Riel <riel@redhat.com> wrote:
> [ ... ]
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
> +DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
This one's ugly too, in several ways, please improve it.
> + if (head == ticket) {
> + /*
> + * We overslept, and do not know by how.
s/by how./by how much.
Thanks,
Ingo
* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-26 7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
@ 2013-01-26 12:05 ` Ingo Molnar
2013-01-26 13:10 ` Mike Galbraith
0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2013-01-26 12:05 UTC (permalink / raw)
To: Mike Galbraith
Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo
* Mike Galbraith <bitbucket@online.de> wrote:
> On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
>
> > The performance issue observed with AIM7 is still a mystery.
>
> Hm. AIM7 mystery _may_ be the same crud I see on a 4 node 40
> core box. Stock scheduler knobs are too preempt happy, produce
> unstable results. I twiddle them as below to stabilize
> results.
>
> I'm testing a load balancing series from Alex Shi with AIM7
> and whatnot, added your series on top of it and retested.
> What I see is improvement.
Since we want both series that's good news, agreed?
Thanks,
Ingo
* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-01-26 12:00 ` Ingo Molnar
@ 2013-01-26 12:47 ` Borislav Petkov
2013-02-04 13:50 ` Rik van Riel
0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2013-01-26 12:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo
On Sat, Jan 26, 2013 at 01:00:43PM +0100, Ingo Molnar wrote:
> > +void __init init_spinlock_delay(void)
> > +{
> > + if (x86_hyper)
> > + max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
>
> I realize that you took existing code and extended it, but that
> chunk of code looks pretty disgusting visually now - at minimum
> it should be vertically aligned as most other kernel code does.
>
> The comment should also tell that the unit of these values is
> 'spinlock-op loops' or so.
Also, with PARAVIRT_GUEST currently being made optional, x86_hyper
is maybe a bad choice of a variable to test.
Maybe do this instead:
if (IS_ENABLED(CONFIG_PARAVIRT_GUEST))
...
Btw, this CONFIG_PARAVIRT_GUEST will change into
CONFIG_HYPERVISOR_GUEST.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-26 12:05 ` Ingo Molnar
@ 2013-01-26 13:10 ` Mike Galbraith
0 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2013-01-26 13:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo
On Sat, 2013-01-26 at 13:05 +0100, Ingo Molnar wrote:
> * Mike Galbraith <bitbucket@online.de> wrote:
>
> > On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> >
> > > The performance issue observed with AIM7 is still a mystery.
> >
> > Hm. AIM7 mystery _may_ be the same crud I see on a 4 node 40
> > core box. Stock scheduler knobs are too preempt happy, produce
> > unstable results. I twiddle them as below to stabilize
> > results.
> >
> > I'm testing a load balancing series from Alex Shi with AIM7
> > and whatnot, added your series on top of it and retested.
> > What I see is improvement.
>
> Since we want both series that's good news, agreed?
Yes, both look good.
BTW, the balance and powersaving low end improvements in the Alex set
are not due to any select_idle_sibling() misbehavior as one might
suspect when looking at the set, it's innocent (this time), it's the
balancing policies themselves. I don't see where it comes from, gotta
be some turbo thingy methinks, but who cares, better is better.
-Mike
* Re: [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-01-27 13:04 ` Michel Lespinasse
2013-02-06 20:10 ` Rik van Riel
0 siblings, 1 reply; 22+ messages in thread
From: Michel Lespinasse @ 2013-01-27 13:04 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
On Fri, Jan 25, 2013 at 11:18 AM, Rik van Riel <riel@redhat.com> wrote:
> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
I still don't like the reseting of delay to MIN_SPINLOCK_DELAY when
there is a hash collision.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (6 preceding siblings ...)
2013-01-26 7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
@ 2013-01-28 6:49 ` Raghavendra K T
2013-02-03 15:28 ` Chegu Vinod
8 siblings, 0 replies; 22+ messages in thread
From: Raghavendra K T @ 2013-01-28 6:49 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
chegu_vinod, mingo
On 01/26/2013 12:35 AM, Rik van Riel wrote:
> [ full quote of the cover letter snipped ]
After the introduction of patch 5, which limits the delay loops to 16,
I am no longer seeing the degradation in virtual guests that was
reported earlier; instead I see improvements.
For the whole series:
Reviewed-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
2013-01-26 12:00 ` Ingo Molnar
@ 2013-01-28 18:18 ` Stefano Stabellini
1 sibling, 0 replies; 22+ messages in thread
From: Stefano Stabellini @ 2013-01-28 18:18 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
chegu_vinod, raghavendra.kt, mingo, Konrad Rzeszutek Wilk,
xen-devel, Jan Beulich, Stefano Stabellini
On Fri, 25 Jan 2013, Rik van Riel wrote:
> Modern Intel and AMD CPUs will trap to the host when the guest
> is spinning on a spinlock, allowing the host to schedule in
> something else.
>
> This effectively means the host is taking care of spinlock
> backoff for virtual machines. It also means that doing the
> spinlock backoff in the guest anyway can lead to totally
> unpredictable results, extremely large backoffs, and
> performance regressions.
>
> To prevent those problems, we limit the spinlock backoff
> delay, when running in a virtual machine, to a small value.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> arch/x86/include/asm/processor.h | 2 ++
> arch/x86/kernel/setup.c | 2 ++
> arch/x86/kernel/smp.c | 30 ++++++++++++++++++++++++------
> 3 files changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 888184b..a365f97 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -997,6 +997,8 @@ extern bool cpu_has_amd_erratum(const int *);
> extern unsigned long arch_align_stack(unsigned long sp);
> extern void free_init_pages(char *what, unsigned long begin, unsigned long end);
>
> +extern void init_spinlock_delay(void);
> +
> void default_idle(void);
> bool set_pm_idle_to_default(void);
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 23ddd55..b834eae 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
>
> arch_init_ideal_nops();
>
> + init_spinlock_delay();
> +
> register_refined_jiffies(CLOCK_TICK_RATE);
>
> #ifdef CONFIG_EFI
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 1877890..b1a65f0 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -31,6 +31,7 @@
> #include <asm/proto.h>
> #include <asm/apic.h>
> #include <asm/nmi.h>
> +#include <asm/hypervisor.h>
> /*
> * Some notes on x86 processor bugs affecting SMP operation:
> *
> @@ -114,6 +115,27 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> /*
> + * Modern Intel and AMD CPUs tell the hypervisor when a guest is
> + * spinning excessively on a spinlock.
I take it that you are talking about PAUSE-loop exiting?
> The hypervisor will then
> + * schedule something else, effectively taking care of the backoff
> + * for us. Doing our own backoff on top of the hypervisor's pause
> + * loop exit handling can lead to excessively long delays, and
> + * performance degradations. Limit the spinlock delay in virtual
> + * machines to a smaller value.
> + */
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_NATIVE (16000 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY_GUEST (16 * DELAY_FIXED_1)
> +static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY_NATIVE;
> +void __init init_spinlock_delay(void)
> +{
> + if (x86_hyper)
> + max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
> +}
Before reducing max_spinlock_delay, shouldn't we check that PAUSE-loop
exiting is available? What if we are running on an older x86 machine
that doesn't support it?
It is probably worth mentioning in the comment that Xen PV guests cannot
take advantage of PAUSE-loop exiting (they don't run inside a VMX
environment), but that's OK because Xen PV guests don't set x86_hyper.
On the other hand Xen PV on HVM guests can take advantage of it (they
run in a VMX environment), and in fact they set x86_hyper to
x86_hyper_xen_hvm.
* Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (7 preceding siblings ...)
2013-01-28 6:49 ` Raghavendra K T
@ 2013-02-03 15:28 ` Chegu Vinod
8 siblings, 0 replies; 22+ messages in thread
From: Chegu Vinod @ 2013-02-03 15:28 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, knoel,
raghavendra.kt, mingo
On 1/25/2013 11:05 AM, Rik van Riel wrote:
> [ v1-v3 cover letter text snipped ]
> For the v4 series, I added code to keep the maximum spinlock
> delay to a small value when running in a virtual machine. That
> should solve the performance regression seen in virtual machines.
>
> The performance issue observed with AIM7 is still a mystery.
>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>
I got back on the machine and re-ran the AIM7-highsystime microbenchmark
with 2000 users and 100 jobs per user on 20, 40, and 80 vcpu guests,
using a 3.7.5 kernel with and without Rik's latest patches.
Host Platform : 8 socket (80 Core) Westmere with 1TB RAM.
Config 1 : 3.7.5 base running on host and in the guest
Config 2 : 3.7.5 + Rik's patches running on host and in the guest
(Note: I didn't change the PLE settings on the host... The severe drop
at 40-way and 80-way is due to the un-optimized PLE handler. Raghu's
PLE fixes should address those.)
           Config 1    Config 2
20 vcpu    170K        168K
40 vcpu     37K         37K
80 vcpu    10.5K       11.5K
Not much difference between the two configs... (I still need to test
along with Raghu's fixes).
BTW, I noticed that there were results posted earlier using the
AIM7-compute workload. AIM7-highsystime is a lot more kernel-intensive.
FYI
Vinod
* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-01-26 12:47 ` Borislav Petkov
@ 2013-02-04 13:50 ` Rik van Riel
2013-02-04 14:02 ` Borislav Petkov
0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-02-04 13:50 UTC (permalink / raw)
To: Borislav Petkov, Ingo Molnar, linux-kernel, aquini, walken,
eric.dumazet, lwoodman, knoel, chegu_vinod, raghavendra.kt,
mingo
On 01/26/2013 07:47 AM, Borislav Petkov wrote:
> On Sat, Jan 26, 2013 at 01:00:43PM +0100, Ingo Molnar wrote:
>>> +void __init init_spinlock_delay(void)
>>> +{
>>> + if (x86_hyper)
>>> + max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
>>
>> I realize that you took existing code and extended it, but that
>> chunk of code looks pretty disgusting visually now - at minimum
>> it should be vertically aligned as most other kernel code does.
>>
>> The comment should also tell that the unit of these values is
>> 'spinlock-op loops' or so.
Will do.
> Also, with PARAVIRT_GUEST currently being made optional, x86_hyper is
> maybe a bad choice of a variable to test.
>
> Maybe instead do this:
>
> if (IS_ENABLED(CONFIG_PARAVIRT_GUEST))
> ...
We need to know whether we are actually running on top of a
hypervisor, not whether we have the code compiled in to do
so.
After all, the majority of distribution kernels will have
CONFIG_PARAVIRT_GUEST set, but the majority of those kernels
will be running bare metal...
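To make the distinction concrete, here is a small userspace model; the
CONFIG macro and the stub struct are stand-ins, not the kernel's
definitions. The compile-time test fires on any kernel built with guest
support, while the x86_hyper test only fires when a hypervisor was
actually detected at boot:

#include <stdio.h>

#define CONFIG_PARAVIRT_GUEST 1		/* set on most distro kernels */

struct hypervisor_x86 { const char *name; };

/* NULL on bare metal, set during boot when a hypervisor is detected. */
static const struct hypervisor_x86 *x86_hyper;

int main(void)
{
#if CONFIG_PARAVIRT_GUEST
	puts("guest support compiled in (true even on bare metal)");
#endif
	if (x86_hyper)
		puts("hypervisor detected: clamp max_spinlock_delay");
	else
		puts("bare metal: keep the large native delay");
	return 0;
}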
--
All rights reversed
* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-02-04 13:50 ` Rik van Riel
@ 2013-02-04 14:02 ` Borislav Petkov
2013-02-06 17:05 ` Rik van Riel
0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2013-02-04 14:02 UTC (permalink / raw)
To: Rik van Riel
Cc: Ingo Molnar, linux-kernel, aquini, walken, eric.dumazet,
lwoodman, knoel, chegu_vinod, raghavendra.kt, mingo
On Mon, Feb 04, 2013 at 08:50:33AM -0500, Rik van Riel wrote:
> We need to know whether we are actually running on top of a
> hypervisor, not whether we have the code compiled in to do so.
Oh ok, I see.
The thing is, if CONFIG_PARAVIRT_GUEST is disabled, x86_hyper won't
exist, see: http://marc.info/?l=linux-kernel&m=135936817627848&w=2
So maybe the hypervisor guest should itself take care of this and upon
init it should set the max_spinlock_delay in init_hypervisor() instead?
Seems only fair to me...
> After all, the majority of distribution kernels will have
> CONFIG_PARAVIRT_GUEST set, but the majority of those kernels
> will be running bare metal...
Yeah :-)
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Re: [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines
2013-02-04 14:02 ` Borislav Petkov
@ 2013-02-06 17:05 ` Rik van Riel
0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2013-02-06 17:05 UTC (permalink / raw)
To: Borislav Petkov, Ingo Molnar, linux-kernel, aquini, walken,
eric.dumazet, lwoodman, knoel, chegu_vinod, raghavendra.kt,
mingo
On 02/04/2013 09:02 AM, Borislav Petkov wrote:
> On Mon, Feb 04, 2013 at 08:50:33AM -0500, Rik van Riel wrote:
>> We need to know whether we are actually running on top of a
>> hypervisor, not whether we have the code compiled in to do so.
>
> Oh ok, I see.
>
> The thing is, if CONFIG_PARAVIRT_GUEST is disabled, x86_hyper won't
> exist, see: http://marc.info/?l=linux-kernel&m=135936817627848&w=2
>
> So maybe the hypervisor guest should itself take care of this and upon
> init it should set the max_spinlock_delay in init_hypervisor() instead?
> Seems only fair to me...
I ended up making the call from init_hypervisor_platform(), which
already checks x86_hyper itself, allowing me to avoid checking
x86_hyper at all from smp.c.
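A compilable stub of what that arrangement could look like; the
function bodies and the init_guest_spinlock_delay name are guesses at
the -v5 code, not the actual kernel functions:

#include <stdio.h>

struct hypervisor_x86 { const char *name; };
static const struct hypervisor_x86 *x86_hyper;	/* detection result */

/* smp.c side: unconditionally caps the delay when called... */
static void init_guest_spinlock_delay(void)
{
	puts("guest detected: capping max_spinlock_delay");
}

/* ...because the platform init code, which already checks x86_hyper,
 * only makes the call when a hypervisor was found. */
static void init_hypervisor_platform(void)
{
	/* (the real function detects the hypervisor and sets x86_hyper) */
	if (!x86_hyper)
		return;

	init_guest_spinlock_delay();
}

int main(void)
{
	init_hypervisor_platform();	/* no-op on bare metal */
	return 0;
}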
* Re: [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-27 13:04 ` Michel Lespinasse
@ 2013-02-06 20:10 ` Rik van Riel
2013-02-09 23:50 ` Michel Lespinasse
0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2013-02-06 20:10 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
On 01/27/2013 08:04 AM, Michel Lespinasse wrote:
> On Fri, Jan 25, 2013 at 11:18 AM, Rik van Riel <riel@redhat.com> wrote:
>> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
>
> I still don't like the resetting of the delay to MIN_SPINLOCK_DELAY when
> there is a hash collision.
I've been spending some time looking at this, because I am
not a fan either.
However, it seems to work and I failed to come up with
anything better. Therefore, I have left it as is in the
-v5 patch series.
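For readers following along, here is a userspace model of the
per-hashed-lock delay cache under discussion. The table size, hash
function, and surrounding structure are invented for illustration;
only the reset-on-collision ternary is from the patch:

#include <stdint.h>
#include <stdio.h>

#define DELAY_SHIFT	   8
#define DELAY_FIXED_1	   (1u << DELAY_SHIFT)
#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
#define DELAY_HASH_SIZE	   32	/* assumed, power of two */

struct delay_entry {
	uint32_t hash;		/* which lock this slot was tuned for */
	uint32_t delay;		/* its auto-tuned fixed-point delay   */
};

static struct delay_entry delay_table[DELAY_HASH_SIZE];

static uint32_t lookup_delay(const void *lock)
{
	uint32_t hash = (uint32_t)(uintptr_t)lock * 0x9e3779b1u; /* assumed hash */
	struct delay_entry *ent = &delay_table[hash % DELAY_HASH_SIZE];

	/* The contested line: if another lock owns this slot (a hash
	 * collision), start over from the minimum delay instead of
	 * inheriting that lock's tuning. */
	uint32_t delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;

	ent->hash = hash;	/* claim the slot for this lock */
	return delay;
}

int main(void)
{
	int a, b;

	printf("lock a: %u\n", lookup_delay(&a));	/* 256 (minimum) */
	printf("lock b: %u\n", lookup_delay(&b));
	return 0;
}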
* Re: [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-02-06 20:10 ` Rik van Riel
@ 2013-02-09 23:50 ` Michel Lespinasse
0 siblings, 0 replies; 22+ messages in thread
From: Michel Lespinasse @ 2013-02-09 23:50 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, knoel, chegu_vinod,
raghavendra.kt, mingo
On Wed, Feb 6, 2013 at 12:10 PM, Rik van Riel <riel@redhat.com> wrote:
> On 01/27/2013 08:04 AM, Michel Lespinasse wrote:
>>
>> On Fri, Jan 25, 2013 at 11:18 AM, Rik van Riel <riel@redhat.com> wrote:
>>>
>>> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
>>
>> I still don't like the resetting of the delay to MIN_SPINLOCK_DELAY when
>> there is a hash collision.
>
> I've been spending some time looking at this, because I am
> not a fan either.
>
> However, it seems to work and I failed to come up with
> anything better. Therefore, I have left it as is in the
> -v5 patch series.
Does that mean you know of workloads that would regress if you didn't
reset the delay to MIN_SPINLOCK_DELAY when detecting a hash collision?
I have not seen any public reports of that...
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
Thread overview: 22+ messages
2013-01-25 19:05 [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-25 19:16 ` [PATCH -v4 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2013-01-25 19:17 ` [PATCH -v4 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-01-26 12:03 ` Ingo Molnar
2013-01-25 19:18 ` [PATCH -v4 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
2013-01-27 13:04 ` Michel Lespinasse
2013-02-06 20:10 ` Rik van Riel
2013-02-09 23:50 ` Michel Lespinasse
2013-01-25 19:19 ` [PATCH -v4 5/5] x86,smp: limit spinlock delay on virtual machines Rik van Riel
2013-01-26 12:00 ` Ingo Molnar
2013-01-26 12:47 ` Borislav Petkov
2013-02-04 13:50 ` Rik van Riel
2013-02-04 14:02 ` Borislav Petkov
2013-02-06 17:05 ` Rik van Riel
2013-01-28 18:18 ` Stefano Stabellini
2013-01-25 19:20 ` [DEBUG PATCH -v4 6/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
2013-01-26 7:21 ` [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Mike Galbraith
2013-01-26 12:05 ` Ingo Molnar
2013-01-26 13:10 ` Mike Galbraith
2013-01-28 6:49 ` Raghavendra K T
2013-02-03 15:28 ` Chegu Vinod