* [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
@ 2012-07-18 13:37 Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
` (4 more replies)
0 siblings, 5 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
Avi Kivity, Rik van Riel
Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
Currently, the Pause Loop Exit (PLE) handler does a directed yield to a
random vcpu on PL-exit. We already have some filtering while choosing
the candidate to yield_to; this change adds more checks while choosing
a candidate to yield_to.
On large vcpu guests, there is a high probability of
yielding to the same vcpu that recently did a pause-loop exit.
Such a yield can lead to that vcpu spinning again.
The patchset keeps track of pause loop exits and gives a chance to a
vcpu which has:
(a) not done a pause loop exit at all (it is probably a preempted lock holder)
(b) been skipped in the last iteration because it did a pause loop exit, and
has probably become eligible now (the next eligible lock holder)
This concept also helps in the cpu relax interception cases, which use the same handler.
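The (a)/(b) selection above boils down to a small per-vcpu state machine. As a standalone sketch (a model of the heuristic in patch 3, not the kernel code itself; the helper name is mine):

```c
#include <stdbool.h>

/* Simplified model of the per-vcpu state this series adds. */
struct spin_loop {
	bool in_spin_loop;	/* set while the vcpu sits in the PLE/cpu-relax handler */
	bool dy_eligible;	/* flipped on every eligibility check */
};

/*
 * Models the eligibility heuristic:
 *  - a vcpu not in a spin loop is always eligible (case (a));
 *  - a vcpu currently in a spin loop is eligible only on every other
 *    check (case (b)), because dy_eligible toggles each time.
 */
static bool eligible_for_directed_yield(struct spin_loop *s)
{
	bool eligible = !s->in_spin_loop || s->dy_eligible;

	if (s->in_spin_loop)
		s->dy_eligible = !s->dy_eligible;

	return eligible;
}
```

So a vcpu that just pl-exited is skipped on the first pass over it and considered again on the next, which is exactly case (b).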
Changes since V4:
- Naming Change (Avi):
struct ple ==> struct spin_loop
cpu_relax_intercepted ==> in_spin_loop
vcpu_check_and_update_eligible ==> vcpu_eligible_for_directed_yield
- mark vcpu in spinloop as not eligible to avoid influence of previous exit
Changes since V3:
- arch specific fix/changes (Christian)
Changes since v2:
- Move ple structure to common code (Avi)
- rename pause_loop_exited to cpu_relax_intercepted (Avi)
- add config HAVE_KVM_CPU_RELAX_INTERCEPT (Avi)
- Drop superfluous curly braces (Ingo)
Changes since v1:
- Add more documentation for structure and algorithm and Rename
plo ==> ple (Rik).
- change dy_eligible initial value to false (otherwise the very first directed
yield will not be skipped). (Nikunj)
- fixup signoff/from issue
Future enhancements:
(1) Currently we have a boolean to decide the eligibility of a vcpu. It
would be nice to get feedback on large guests (>32 vcpus) on whether we can
do better with an integer counter (with counter = say f(log n)).
(2) We have not considered system load during the iteration over vcpus. With
that information we could limit the scan and also decide whether schedule()
is better. [I am able to use the number of kicked vcpus to decide on this, but
there may be better ideas, such as information from the global loadavg.]
(3) We can exploit this further with the PV patches, since the guest also
knows the next eligible lock holder.
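For enhancement (1), one hypothetical shape of the integer-counter variant (purely illustrative; neither the names nor the threshold policy come from this series): a spinning vcpu becomes eligible only after being skipped some number of times, rather than every other check.

```c
#include <stdbool.h>

/*
 * Hypothetical counter-based variant of enhancement (1): instead of a
 * boolean toggle, a spinning vcpu becomes eligible once it has been
 * passed over 'threshold' times (e.g. threshold ~ f(log n) for an
 * n-vcpu guest).
 */
struct spin_loop_ctr {
	bool in_spin_loop;
	unsigned int skipped;	/* times passed over since last yield_to */
};

static bool eligible_ctr(struct spin_loop_ctr *s, unsigned int threshold)
{
	if (!s->in_spin_loop)
		return true;		/* case (a): potential preempted lock holder */
	if (s->skipped >= threshold) {
		s->skipped = 0;		/* case (b): waited long enough */
		return true;
	}
	s->skipped++;
	return false;
}
```

With threshold = 1 this degenerates to the boolean toggle in the series; larger thresholds bias the scan more strongly toward vcpus that never pl-exited.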
Summary: There is a very good improvement for kvm based guests on PLE machines.
V5 shows a huge improvement for kernbench.
+-----------+-----------+-----------+------------+-----------+
               base_rik     stdev      patched      stdev      %improve
+-----------+-----------+-----------+------------+-----------+
            kernbench (time in sec, lesser is better)
+-----------+-----------+-----------+------------+-----------+
1x     49.2300     1.0171    22.6842     0.3073    117.0233 %
2x     91.9358     1.7768    53.9608     1.0154     70.37516 %
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
            ebizzy (records/sec, more is better)
+-----------+-----------+-----------+------------+-----------+
1x   1129.2500    28.6793  2125.6250    32.8239     88.23334 %
2x   1892.3750    75.1112  2377.1250   181.6822     25.61596 %
+-----------+-----------+-----------+------------+-----------+
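For reference, the %improve columns are consistent with measuring time-based gains relative to the patched run and throughput gains relative to the base run (this is an inference from the numbers, not stated in the posting):

```c
/*
 * How the %improve columns appear to be computed (inferred, not stated):
 * "lesser is better" metrics relative to the patched value,
 * "more is better" metrics relative to the base value.
 */
static double improve_time(double base, double patched)
{
	return (base - patched) / patched * 100.0;	/* kernbench 1x: ~117.02 */
}

static double improve_rate(double base, double patched)
{
	return (patched - base) / base * 100.0;		/* ebizzy 1x: ~88.23 */
}
```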
Note: The patches are tested on x86.
Links
V4: https://lkml.org/lkml/2012/7/16/80
V3: https://lkml.org/lkml/2012/7/12/437
V2: https://lkml.org/lkml/2012/7/10/392
V1: https://lkml.org/lkml/2012/7/9/32
Raghavendra K T (3):
config: Add config to support ple or cpu relax optimzation
kvm : Note down when cpu relax intercepted or pause loop exited
kvm : Choose a better candidate for directed yield
---
arch/s390/kvm/Kconfig | 1 +
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 39 +++++++++++++++++++++++++++++++++++++++
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 41 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 85 insertions(+), 0 deletions(-)
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-18 13:37 ` Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
` (3 subsequent siblings)
4 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
Marcelo Tosatti, Rik van Riel
Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
arch/s390/kvm/Kconfig | 1 +
arch/x86/kvm/Kconfig | 1 +
virt/kvm/Kconfig | 3 +++
3 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 78eb984..a6e2677 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
depends on HAVE_KVM && EXPERIMENTAL
select PREEMPT_NOTIFIERS
select ANON_INODES
+ select HAVE_KVM_CPU_RELAX_INTERCEPT
---help---
Support hosting paravirtualized guest machines using the SIE
virtualization capability on the mainframe. This should work
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a28f338..45c044f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -37,6 +37,7 @@ config KVM
select TASK_DELAY_ACCT
select PERF_EVENTS
select HAVE_KVM_MSI
+ select HAVE_KVM_CPU_RELAX_INTERCEPT
---help---
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 28694f4..d01b24b 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -21,3 +21,6 @@ config KVM_ASYNC_PF
config HAVE_KVM_MSI
bool
+
+config HAVE_KVM_CPU_RELAX_INTERCEPT
+ bool
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
@ 2012-07-18 13:37 ` Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
` (2 subsequent siblings)
4 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
Avi Kivity, Rik van Riel
Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Noting when a vcpu has pause-loop exited or had cpu relax intercepted helps in
filtering the right candidate to yield to. A wrong selection of vcpu,
i.e., a vcpu that just did a PL-exit or had cpu relax intercepted, may
contribute to performance degradation.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
V2 was:
Reviewed-by: Rik van Riel <riel@redhat.com>
include/linux/kvm_host.h | 34 ++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 5 +++++
2 files changed, 39 insertions(+), 0 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c446435..34ce296 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -183,6 +183,18 @@ struct kvm_vcpu {
} async_pf;
#endif
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+ /*
+ * Cpu relax intercept or pause loop exit optimization
+ * in_spin_loop: set when a vcpu does a pause loop exit
+ * or cpu relax intercepted.
+ * dy_eligible: indicates whether vcpu is eligible for directed yield.
+ */
+ struct {
+ bool in_spin_loop;
+ bool dy_eligible;
+ } spin_loop;
+#endif
struct kvm_vcpu_arch arch;
};
@@ -890,5 +902,27 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
}
}
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+
+static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
+{
+ vcpu->spin_loop.in_spin_loop = val;
+}
+static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
+{
+ vcpu->spin_loop.dy_eligible = val;
+}
+
+#else /* !CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
+
+static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
+static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
+#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..3d6ffc8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -236,6 +236,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
}
vcpu->run = page_address(page);
+ kvm_vcpu_set_in_spin_loop(vcpu, false);
+ kvm_vcpu_set_dy_eligible(vcpu, false);
+
r = kvm_arch_vcpu_init(vcpu);
if (r < 0)
goto fail_free_run;
@@ -1577,6 +1580,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
int pass;
int i;
+ kvm_vcpu_set_in_spin_loop(me, true);
/*
* We boost the priority of a VCPU that is runnable but not
* currently running, because it got preempted by something
@@ -1602,6 +1606,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
}
}
}
+ kvm_vcpu_set_in_spin_loop(me, false);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
@ 2012-07-18 13:38 ` Raghavendra K T
2012-07-18 14:39 ` Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-23 10:03 ` Avi Kivity
4 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:38 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
Marcelo Tosatti, Rik van Riel
Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Currently, on large vcpu guests, there is a high probability of
yielding to the same vcpu that recently did a pause-loop exit or
had cpu relax intercepted. Such a yield can lead to the vcpu spinning
again and hence degrade performance.
The patchset keeps track of pause loop exits/cpu relax interceptions
and gives a chance to a vcpu which:
(a) has not done a pause loop exit or had cpu relax intercepted at all
(it is probably a preempted lock holder)
(b) was skipped in the last iteration because it did a pause loop exit or
had cpu relax intercepted, and has probably become eligible now
(the next eligible lock holder)
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
V2 was:
Reviewed-by: Rik van Riel <riel@redhat.com>
include/linux/kvm_host.h | 5 +++++
virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34ce296..952427d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -923,6 +923,11 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
{
}
+static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+ return true;
+}
+
#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3d6ffc8..bf9fb97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,6 +1571,39 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
}
EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+/*
+ * Helper that checks whether a VCPU is eligible for directed yield.
+ * Most eligible candidate to yield is decided by following heuristics:
+ *
+ * (a) VCPU which has not done pl-exit or cpu relax intercepted recently
+ * (preempted lock holder), indicated by @in_spin_loop.
+ * Set at the beginning and cleared at the end of interception/PLE handler.
+ *
+ * (b) VCPU which has done pl-exit/ cpu relax intercepted but did not get
+ * chance last time (mostly it has become eligible now since we have probably
+ * yielded to lockholder in last iteration. This is done by toggling
+ * @dy_eligible each time a VCPU checked for eligibility.)
+ *
+ * Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
+ * to preempted lock-holder could result in wrong VCPU selection and CPU
+ * burning. Giving priority for a potential lock-holder increases lock
+ * progress.
+ */
+bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+ bool eligible;
+
+ eligible = !vcpu->spin_loop.in_spin_loop ||
+ (vcpu->spin_loop.in_spin_loop &&
+ vcpu->spin_loop.dy_eligible);
+
+ if (vcpu->spin_loop.in_spin_loop)
+ vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;
+
+ return eligible;
+}
+#endif
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
@@ -1599,6 +1632,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (waitqueue_active(&vcpu->wq))
continue;
+ if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+ continue;
if (kvm_vcpu_yield_to(vcpu)) {
kvm->last_boosted_vcpu = i;
yielded = 1;
@@ -1607,6 +1642,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
}
}
kvm_vcpu_set_in_spin_loop(me, false);
+ kvm_vcpu_set_dy_eligible(me, false);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-18 14:39 ` Raghavendra K T
2012-07-19 9:47 ` [RESEND PATCH " Raghavendra K T
0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 14:39 UTC (permalink / raw)
To: Avi Kivity
Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
Marcelo Tosatti, Rik van Riel, Srikar, S390, Carsten Otte,
Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML,
X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
On 07/18/2012 07:08 PM, Raghavendra K T wrote:
> From: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
> +bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
> +{
> + bool eligible;
> +
> + eligible = !vcpu->spin_loop.in_spin_loop ||
> + (vcpu->spin_loop.in_spin_loop &&
> + vcpu->spin_loop.dy_eligible);
> +
> + if (vcpu->spin_loop.in_spin_loop)
> + vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;
> +
> + return eligible;
> +}
I should have added a comment like:
Since algorithm is based on heuristics, accessing another vcpu data
without locking does not harm. It may result in trying to yield to same
VCPU, fail and continue with next and so on.
^ permalink raw reply [flat|nested] 41+ messages in thread
* [RESEND PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
2012-07-18 14:39 ` Raghavendra K T
@ 2012-07-19 9:47 ` Raghavendra K T
0 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-19 9:47 UTC (permalink / raw)
To: Raghavendra K T, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar
Cc: S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
Srivatsa Vaddagiri, Joerg Roedel
Currently, on large vcpu guests, there is a high probability of
yielding to the same vcpu that recently did a pause-loop exit or
had cpu relax intercepted. Such a yield can lead to the vcpu spinning
again and hence degrade performance.
The patchset keeps track of pause loop exits/cpu relax interceptions
and gives a chance to a vcpu which:
(a) has not done a pause loop exit or had cpu relax intercepted at all
(it is probably a preempted lock holder)
(b) was skipped in the last iteration because it did a pause loop exit or
had cpu relax intercepted, and has probably become eligible now
(the next eligible lock holder)
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
V2 was:
Reviewed-by: Rik van Riel <riel@redhat.com>
Changelog: Added comment on locking as suggested by Avi
include/linux/kvm_host.h | 5 +++++
virt/kvm/kvm_main.c | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 0 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34ce296..952427d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -923,6 +923,11 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
{
}
+static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+ return true;
+}
+
#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3d6ffc8..8fda756 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,6 +1571,43 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
}
EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+/*
+ * Helper that checks whether a VCPU is eligible for directed yield.
+ * Most eligible candidate to yield is decided by following heuristics:
+ *
+ * (a) VCPU which has not done pl-exit or cpu relax intercepted recently
+ * (preempted lock holder), indicated by @in_spin_loop.
+ * Set at the beginning and cleared at the end of interception/PLE handler.
+ *
+ * (b) VCPU which has done pl-exit/ cpu relax intercepted but did not get
+ * chance last time (mostly it has become eligible now since we have probably
+ * yielded to lockholder in last iteration. This is done by toggling
+ * @dy_eligible each time a VCPU checked for eligibility.)
+ *
+ * Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
+ * to preempted lock-holder could result in wrong VCPU selection and CPU
+ * burning. Giving priority for a potential lock-holder increases lock
+ * progress.
+ *
+ * Since algorithm is based on heuristics, accessing another VCPU data without
+ * locking does not harm. It may result in trying to yield to same VCPU, fail
+ * and continue with next VCPU and so on.
+ */
+bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+ bool eligible;
+
+ eligible = !vcpu->spin_loop.in_spin_loop ||
+ (vcpu->spin_loop.in_spin_loop &&
+ vcpu->spin_loop.dy_eligible);
+
+ if (vcpu->spin_loop.in_spin_loop)
+ kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);
+
+ return eligible;
+}
+#endif
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
@@ -1599,6 +1636,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (waitqueue_active(&vcpu->wq))
continue;
+ if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+ continue;
if (kvm_vcpu_yield_to(vcpu)) {
kvm->last_boosted_vcpu = i;
yielded = 1;
@@ -1607,6 +1646,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
}
}
kvm_vcpu_set_in_spin_loop(me, false);
+
+ /* Ensure vcpu is not eligible during next spinloop */
+ kvm_vcpu_set_dy_eligible(me, false);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
` (2 preceding siblings ...)
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-20 17:36 ` Marcelo Tosatti
2012-07-22 12:34 ` Raghavendra K T
2012-07-23 10:03 ` Avi Kivity
4 siblings, 1 reply; 41+ messages in thread
From: Marcelo Tosatti @ 2012-07-20 17:36 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Avi Kivity,
Rik van Riel, Srikar, S390, Carsten Otte, Christian Borntraeger,
KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
linux390, Srivatsa Vaddagiri, Joerg Roedel
On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
> [full cover letter snipped]
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
@ 2012-07-22 12:34 ` Raghavendra K T
2012-07-22 12:43 ` Avi Kivity
2012-07-22 17:58 ` Rik van Riel
0 siblings, 2 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-22 12:34 UTC (permalink / raw)
To: Marcelo Tosatti, Avi Kivity, Rik van Riel, Christian Borntraeger
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:
> On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
>> [full cover letter snipped]
> Reviewed-by: Marcelo Tosatti<mtosatti@redhat.com>
>
Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
if this series looks good now.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
2012-07-22 12:34 ` Raghavendra K T
@ 2012-07-22 12:43 ` Avi Kivity
2012-07-23 7:35 ` Christian Borntraeger
2012-07-22 17:58 ` Rik van Riel
1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-07-22 12:43 UTC (permalink / raw)
To: Raghavendra K T
Cc: Marcelo Tosatti, Rik van Riel, Christian Borntraeger,
H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
On 07/22/2012 03:34 PM, Raghavendra K T wrote:
>
> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
> if this series looks good now.
>
It looks fine to me. Christian, is this okay for s390?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
2012-07-22 12:34 ` Raghavendra K T
2012-07-22 12:43 ` Avi Kivity
@ 2012-07-22 17:58 ` Rik van Riel
1 sibling, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2012-07-22 17:58 UTC (permalink / raw)
To: Raghavendra K T
Cc: Marcelo Tosatti, Avi Kivity, Christian Borntraeger,
H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel
On 07/22/2012 08:34 AM, Raghavendra K T wrote:
> On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:
>> [quoted cover letter snipped]
>>> +-----------+-----------+-----------+------------+-----------+
>>> kernbench (time in sec lesser is better)
>>> +-----------+-----------+-----------+------------+-----------+
>>> 1x 49.2300 1.0171 22.6842 0.3073 117.0233 %
>>> 2x 91.9358 1.7768 53.9608 1.0154 70.37516 %
>>> +-----------+-----------+-----------+------------+-----------+
>>>
>>> +-----------+-----------+-----------+------------+-----------+
>>> ebizzy (records/sec more is better)
>>> +-----------+-----------+-----------+------------+-----------+
>>> 1x 1129.2500 28.6793 2125.6250 32.8239 88.23334 %
>>> 2x 1892.3750 75.1112 2377.1250 181.6822 25.61596 %
>>> +-----------+-----------+-----------+------------+-----------+
>>>
>>> Note: The patches are tested on x86.
>>>
>>> Links
>>> V4: https://lkml.org/lkml/2012/7/16/80
>>> V3: https://lkml.org/lkml/2012/7/12/437
>>> V2: https://lkml.org/lkml/2012/7/10/392
>>> V1: https://lkml.org/lkml/2012/7/9/32
>>>
>>> Raghavendra K T (3):
>>> config: Add config to support ple or cpu relax optimzation
>>> kvm : Note down when cpu relax intercepted or pause loop exited
>>> kvm : Choose a better candidate for directed yield
>>> ---
>>> arch/s390/kvm/Kconfig | 1 +
>>> arch/x86/kvm/Kconfig | 1 +
>>> include/linux/kvm_host.h | 39 +++++++++++++++++++++++++++++++++++++++
>>> virt/kvm/Kconfig | 3 +++
>>> virt/kvm/kvm_main.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>> 5 files changed, 85 insertions(+), 0 deletions(-)
>>
>> Reviewed-by: Marcelo Tosatti<mtosatti@redhat.com>
>>
>
> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
> if this series looks good now.
The series looks good to me.
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
2012-07-22 12:43 ` Avi Kivity
@ 2012-07-23 7:35 ` Christian Borntraeger
0 siblings, 0 replies; 41+ messages in thread
From: Christian Borntraeger @ 2012-07-23 7:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Raghavendra K T, Marcelo Tosatti, Rik van Riel, H. Peter Anvin,
Thomas Gleixner, Ingo Molnar, Srikar, S390, Carsten Otte, KVM,
chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
linux390, Srivatsa Vaddagiri, Joerg Roedel
On 22/07/12 14:43, Avi Kivity wrote:
> On 07/22/2012 03:34 PM, Raghavendra K T wrote:
>>
>> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
>> if this series looks good now.
>>
>
> It looks fine to me. Christian, is this okay for s390?
>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
` (3 preceding siblings ...)
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
@ 2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11 ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
4 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-07-23 10:03 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
Rik van Riel, Srikar, S390, Carsten Otte, Christian Borntraeger,
KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
linux390, Srivatsa Vaddagiri, Joerg Roedel
On 07/18/2012 04:37 PM, Raghavendra K T wrote:
> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
> random vcpu on pl-exit. We already have filtering while choosing
> the candidate to yield_to. This change adds more checks while choosing
> a candidate to yield_to.
>
> On a large vcpu guests, there is a high probability of
> yielding to the same vcpu who had recently done a pause-loop exit.
> Such a yield can lead to the vcpu spinning again.
>
> The patchset keeps track of the pause loop exit and gives chance to a
> vcpu which has:
>
> (a) Not done pause loop exit at all (probably he is preempted lock-holder)
>
> (b) vcpu skipped in last iteration because it did pause loop exit, and
> probably has become eligible now (next eligible lock holder)
>
> This concept also helps in cpu relax interception cases which use same handler.
>
Thanks, applied to 'queue'.
--
error compiling committee.c: too many arguments to function
* [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-07-23 10:03 ` Avi Kivity
@ 2012-09-07 13:11 ` Andrew Theurer
2012-09-07 18:06 ` Raghavendra K T
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Theurer @ 2012-09-07 13:11 UTC (permalink / raw)
To: Avi Kivity
Cc: Raghavendra K T, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
Srikar, KVM, KVM, chegu vinod, LKML, X86, Gleb Natapov,
Srivatsa Vaddagiri
I have noticed recently that PLE/yield_to() is still not that scalable
for really large guests, sometimes even with no CPU over-commit. I have
a small change that makes a very big difference.
First, let me explain what I saw:
Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
thread Westmere-EX system: 645 seconds!
Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
runqueue lock for yield_to()
So, I added some schedstats to yield_to(), one to count when we failed
this test in yield_to()
if (task_running(p_rq, p) || p->state)
and one when we pass all the conditions and get to actually yield:
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
And during boot up of this guest, I saw:
failed yield_to() because task is running: 8368810426
successful yield_to(): 13077658
0.156022% of yield_to calls
1 out of 640 yield_to calls
Obviously, we have a problem. Every exit causes a loop over 80 vcpus,
each one trying to get two locks. This is happening on all [but one]
vcpus at around the same time. Not going to work well.
So, since the check for a running task is nearly always true, I moved
that -before- the double runqueue lock, so 99.84% of the attempts do not
take the locks. Now, I do not know if this [not getting the locks] is a
problem. However, I'd rather have a slightly inaccurate test for a
running vcpu than burn 98% of CPU in the host kernel. With the change,
the VM boot time went to 100 seconds, an 85% reduction in time.
I also wanted to check that this did not affect truly over-committed
situations, so I first started with smaller VMs at 2x cpu over-commit:
16 VMs, 8-way each, all running dbench (2x cpu over-commmit)
throughput +/- stddev
----- -----
ple off: 2281 +/- 7.32% (really bad as expected)
ple on: 19796 +/- 1.36%
ple on: w/fix: 19796 +/- 1.37% (no degrade at all)
In this case the VMs are small enough that we do not loop through
enough vcpus to trigger the problem. Host CPU is very low (3-4% range)
for both default ple and with yield_to() fix.
So I went on to a bigger VM:
10 VMs, 16-way each, all running dbench (2x cpu over-commit)
throughput +/- stddev
----- -----
ple on: 2552 +/- .70%
ple on: w/fix: 4621 +/- 2.12% (81% improvement!)
This is where we start seeing a major difference. Without the fix, host
cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and
guest went from 30 to 40%). I believe this is on the right track to
reduce the spin lock contention, still get proper directed yield, and
therefore improve the CPU time available to the guest and its performance.
However, we still have lock contention, and I think we can reduce it
even more. We have eliminated some attempts at double runqueue lock
acquisition because the check for whether the target vcpu is running is
now before the lock. However, even if the target-to-yield-to vcpu [for
the same guest upon which we PLE exited] is not running, the physical
processor/runqueue that the target-to-yield-to vcpu is located on could
be running a different VM's vcpu -and- going through a directed yield,
so that runqueue lock may already be acquired. We do not want to
just spin and wait, we want to move to the next candidate vcpu. We need
a check to see if the smp processor/runqueue is already in a directed
yield. Or, perhaps we just check if that cpu is not in guest mode, and
if so, we skip that yield attempt for that vcpu and move to the next
candidate vcpu. So, my question is: given a runqueue, what's the best
way to check if that corresponding phys cpu is not in guest mode?
Here's the changes so far (schedstat changes not included here):
Signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..f8eff8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
again:
p_rq = task_rq(p);
+ if (task_running(p_rq, p) || p->state) {
+ goto out_no_unlock;
+ }
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
if (curr->sched_class != p->sched_class)
goto out;
- if (task_running(p_rq, p) || p->state)
- goto out;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4879,6 +4880,7 @@ again:
out:
double_rq_unlock(rq, p_rq);
+out_no_unlock:
local_irq_restore(flags);
if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-07 13:11 ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
@ 2012-09-07 18:06 ` Raghavendra K T
2012-09-07 19:42 ` Andrew Theurer
0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-09-07 18:06 UTC (permalink / raw)
To: habanero
Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
Peter Zijlstra
CCing PeterZ also.
On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> I have noticed recently that PLE/yield_to() is still not that scalable
> for really large guests, sometimes even with no CPU over-commit. I have
> a small change that make a very big difference.
>
> First, let me explain what I saw:
>
> Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
> thread Westmere-EX system: 645 seconds!
>
> Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
> runqueue lock for yield_to()
>
> So, I added some schedstats to yield_to(), one to count when we failed
> this test in yield_to()
>
> if (task_running(p_rq, p) || p->state)
>
> and one when we pass all the conditions and get to actually yield:
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>
>
> And during boot up of this guest, I saw:
>
>
> failed yield_to() because task is running: 8368810426
> successful yield_to(): 13077658
> 0.156022% of yield_to calls
> 1 out of 640 yield_to calls
>
> Obviously, we have a problem. Every exit causes a loop over 80 vcpus,
> each one trying to get two locks. This is happening on all [but one]
> vcpus at around the same time. Not going to work well.
>
True and interesting. I had once thought of reducing the overall O(n^2)
iteration to O(n log(n)) by reducing the number of candidates
to search to O(log(n)) instead of the current O(n). Maybe I have to get
back to my experiments.
> So, since the check for a running task is nearly always true, I moved
> that -before- the double runqueue lock, so 99.84% of the attempts do not
> take the locks. Now, I do not know is this [not getting the locks] is a
> problem. However, I'd rather have a little inaccurate test for a
> running vcpu than burning 98% of CPU in host kernel. With the change
> the VM boot time went to: 100 seconds, an 85% reduction in time.
>
> I also wanted to check to see this did not affect truly over-committed
> situations, so I first started with smaller VMs at 2x cpu over-commit:
>
> 16 VMs, 8-way each, all running dbench (2x cpu over-commmit)
> throughput +/- stddev
> ----- -----
> ple off: 2281 +/- 7.32% (really bad as expected)
> ple on: 19796 +/- 1.36%
> ple on: w/fix: 19796 +/- 1.37% (no degrade at all)
>
> In this case the VMs are small enough, that we do not loop through
> enough vcpus to trigger the problem. host CPU is very low (3-4% range)
> for both default ple and with yield_to() fix.
>
> So I went on to a bigger VM:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
> throughput +/- stddev
> ----- -----
> ple on: 2552 +/- .70%
> ple on: w/fix: 4621 +/- 2.12% (81% improvement!)
>
> This is where we start seeing a major difference. Without the fix, host
> cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and
> guest went from 30 to 40%). I believe this is on the right track to
> reduce the spin lock contention, still get proper directed yield, and
> therefore improve the guest CPU available and its performance.
>
> However, we still have lock contention, and I think we can reduce it
> even more. We have eliminated some attempts at double runqueue lock
> acquire because the check for the target vcpu is running is now before
> the lock. However, even if the target-to-yield-to vcpu [for the same
> guest upon we PLE exited] is not running, the physical
> processor/runqueue that target-to-yield-to vcpu is located on could be
> running a different VM's vcpu -and- going through a directed yield,
> therefore that run queue lock may already acquired. We do not want to
> just spin and wait, we want to move to the next candidate vcpu. We need
> a check to see if the smp processor/runqueue is already in a directed
> yield. Or, perhaps we just check if that cpu is not in guest mode, and
> if so, we skip that yield attempt for that vcpu and move to the next
> candidate vcpu. So, my question is: given a runqueue, what's the best
> way to check if that corresponding phys cpu is not in guest mode?
>
We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin path. Doesn't that suffice?
> Here's the changes so far (schedstat changes not included here):
>
> signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..f8eff8c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
> again:
> p_rq = task_rq(p);
> + if (task_running(p_rq, p) || p->state) {
> + goto out_no_unlock;
> + }
> double_rq_lock(rq, p_rq);
> while (task_rq(p) != p_rq) {
> double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
> if (curr->sched_class != p->sched_class)
> goto out;
>
> - if (task_running(p_rq, p) || p->state)
> - goto out;
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
>
> out:
> double_rq_unlock(rq, p_rq);
> +out_no_unlock:
> local_irq_restore(flags);
>
> if (yielded)
>
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-07 18:06 ` Raghavendra K T
@ 2012-09-07 19:42 ` Andrew Theurer
2012-09-08 8:43 ` Srikar Dronamraju
2012-09-10 14:43 ` Raghavendra K T
0 siblings, 2 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-07 19:42 UTC (permalink / raw)
To: Raghavendra K T
Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
Peter Zijlstra
On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:
> CCing PeterZ also.
>
> On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> > I have noticed recently that PLE/yield_to() is still not that scalable
> > for really large guests, sometimes even with no CPU over-commit. I have
> > a small change that make a very big difference.
> >
> > First, let me explain what I saw:
> >
> > Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
> > thread Westmere-EX system: 645 seconds!
> >
> > Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
> > runqueue lock for yield_to()
> >
> > So, I added some schedstats to yield_to(), one to count when we failed
> > this test in yield_to()
> >
> > if (task_running(p_rq, p) || p->state)
> >
> > and one when we pass all the conditions and get to actually yield:
> >
> > yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >
> >
> > And during boot up of this guest, I saw:
> >
> >
> > failed yield_to() because task is running: 8368810426
> > successful yield_to(): 13077658
> > 0.156022% of yield_to calls
> > 1 out of 640 yield_to calls
> >
> > Obviously, we have a problem. Every exit causes a loop over 80 vcpus,
> > each one trying to get two locks. This is happening on all [but one]
> > vcpus at around the same time. Not going to work well.
> >
>
> True and interesting. I had once thought of reducing overall O(n^2)
> iteration to O(n log(n)) iterations by reducing number of candidates
> to search to O(log(n)) instead of current O(n). May be I have to get
> back to my experiment modes.
>
> > So, since the check for a running task is nearly always true, I moved
> > that -before- the double runqueue lock, so 99.84% of the attempts do not
> > take the locks. Now, I do not know is this [not getting the locks] is a
> > problem. However, I'd rather have a little inaccurate test for a
> > running vcpu than burning 98% of CPU in host kernel. With the change
> > the VM boot time went to: 100 seconds, an 85% reduction in time.
> >
> > I also wanted to check to see this did not affect truly over-committed
> > situations, so I first started with smaller VMs at 2x cpu over-commit:
> >
> > 16 VMs, 8-way each, all running dbench (2x cpu over-commmit)
> > throughput +/- stddev
> > ----- -----
> > ple off: 2281 +/- 7.32% (really bad as expected)
> > ple on: 19796 +/- 1.36%
> > ple on: w/fix: 19796 +/- 1.37% (no degrade at all)
> >
> > In this case the VMs are small enough, that we do not loop through
> > enough vcpus to trigger the problem. host CPU is very low (3-4% range)
> > for both default ple and with yield_to() fix.
> >
> > So I went on to a bigger VM:
> >
> > 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
> > throughput +/- stddev
> > ----- -----
> > ple on: 2552 +/- .70%
> > ple on: w/fix: 4621 +/- 2.12% (81% improvement!)
> >
> > This is where we start seeing a major difference. Without the fix, host
> > cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and
> > guest went from 30 to 40%). I believe this is on the right track to
> > reduce the spin lock contention, still get proper directed yield, and
> > therefore improve the guest CPU available and its performance.
> >
> > However, we still have lock contention, and I think we can reduce it
> > even more. We have eliminated some attempts at double runqueue lock
> > acquire because the check for the target vcpu is running is now before
> > the lock. However, even if the target-to-yield-to vcpu [for the same
> > guest upon we PLE exited] is not running, the physical
> > processor/runqueue that target-to-yield-to vcpu is located on could be
> > running a different VM's vcpu -and- going through a directed yield,
> > therefore that run queue lock may already acquired. We do not want to
> > just spin and wait, we want to move to the next candidate vcpu. We need
> > a check to see if the smp processor/runqueue is already in a directed
> > yield. Or, perhaps we just check if that cpu is not in guest mode, and
> > if so, we skip that yield attempt for that vcpu and move to the next
> > candidate vcpu. So, my question is: given a runqueue, what's the best
> > way to check if that corresponding phys cpu is not in guest mode?
> >
>
> We are indeed avoiding CPUS in guest mode when we check
> task->flags & PF_VCPU in vcpu_on_spin path. Doesn't that suffice?
My understanding is that it checks if the candidate vcpu task is in
guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a
target to yield to if it is already in guest mode. I am concerned about
a different vcpu, possibly from a different VM (let's call it g2vcpuN),
but also located on the same runqueue as g1vcpuN -and- running. That
vcpu, g2vcpuN, may also be doing a directed yield, and it may already be
holding the rq lock. Or it could be in guest mode. If it is in guest
mode, then let's still target this rq, and try to yield to g1vcpuN.
However, if g2vcpuN is not in guest mode, then don't bother trying.
Patch include below.
Here's the new, v2 result with the previous two:
10 VMs, 16-way each, all running dbench (2x cpu over-commit)
throughput +/- stddev
----- -----
ple on: 2552 +/- .70%
ple on: w/fixv1: 4621 +/- 2.12% (81% improvement)
ple on: w/fixv2: 6115* (139% improvement)
[*] I do not have stdev yet because all 10 runs are not complete
From v1 to v2, host CPU dropped from 60% to 50%. Time in spin_lock() is
also dropping:
v1:
> 28.60% 264266 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
> 9.21% 85109 qemu-system-x86 [kernel.kallsyms] [k] __schedule
> 4.98% 46035 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
> 4.12% 38066 qemu-system-x86 [kernel.kallsyms] [k] yield_to
> 4.01% 37114 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin
> 3.52% 32507 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
> 3.37% 31141 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
> 3.26% 30121 qemu-system-x86 [kvm] [k] __vcpu_run
> 3.01% 27844 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
> 2.98% 27592 qemu-system-x86 [kernel.kallsyms] [k] resched_task
> 2.72% 25108 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
> 2.25% 20758 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
> 2.01% 18522 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to
> 1.89% 17508 qemu-system-x86 [kvm] [k] vcpu_enter_guest
v2:
> 10.12% 83285 qemu-system-x86 [kernel.kallsyms] [k] __schedule
> 9.74% 80106 qemu-system-x86 [kernel.kallsyms] [k] yield_to
> 7.38% 60686 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
> 6.28% 51695 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
> 5.24% 43128 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin
> 5.04% 41517 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
> 4.11% 33823 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
> 3.62% 29827 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
> 3.57% 29413 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to
> 3.44% 28297 qemu-system-x86 [kvm] [k] __vcpu_run
> 2.93% 24116 qemu-system-x86 [kvm] [k] vcpu_enter_guest
> 2.60% 21379 qemu-system-x86 [kernel.kallsyms] [k] resched_task
So this seems to be working. However I wonder just how far we can take
this. Ideally we need to be at <3-4% host CPU for PLE work, like I
observe for the 8-way VMs. We are still way off.
-Andrew
Signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..c767915 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
again:
p_rq = task_rq(p);
+ if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
+ goto out_no_unlock;
+ }
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
if (curr->sched_class != p->sched_class)
goto out;
- if (task_running(p_rq, p) || p->state)
- goto out;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4879,6 +4880,7 @@ again:
out:
double_rq_unlock(rq, p_rq);
+out_no_unlock:
local_irq_restore(flags);
if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-07 19:42 ` Andrew Theurer
@ 2012-09-08 8:43 ` Srikar Dronamraju
2012-09-10 13:16 ` Andrew Theurer
2012-09-10 14:43 ` Raghavendra K T
1 sibling, 1 reply; 41+ messages in thread
From: Srikar Dronamraju @ 2012-09-08 8:43 UTC (permalink / raw)
To: Andrew Theurer
Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov,
Srivatsa Vaddagiri, Peter Zijlstra
>
> signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..c767915 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
> again:
> p_rq = task_rq(p);
> + if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
> + goto out_no_unlock;
> + }
> double_rq_lock(rq, p_rq);
> while (task_rq(p) != p_rq) {
> double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
> if (curr->sched_class != p->sched_class)
> goto out;
>
> - if (task_running(p_rq, p) || p->state)
> - goto out;
Is it possible that by the time the current thread takes the double rq
lock, thread p could actually be running? I.e., is there merit to keeping
this check around even with your similar check above?
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
>
> out:
> double_rq_unlock(rq, p_rq);
> +out_no_unlock:
> local_irq_restore(flags);
>
> if (yielded)
>
>
--
thanks and regards
Srikar
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-08 8:43 ` Srikar Dronamraju
@ 2012-09-10 13:16 ` Andrew Theurer
2012-09-10 16:03 ` Peter Zijlstra
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Theurer @ 2012-09-10 13:16 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov,
Srivatsa Vaddagiri, Peter Zijlstra
On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote:
> >
> > signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fbf1fd0..c767915 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >
> > again:
> > p_rq = task_rq(p);
> > + if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
> > + goto out_no_unlock;
> > + }
> > double_rq_lock(rq, p_rq);
> > while (task_rq(p) != p_rq) {
> > double_rq_unlock(rq, p_rq);
> > @@ -4856,8 +4859,6 @@ again:
> > if (curr->sched_class != p->sched_class)
> > goto out;
> >
> > - if (task_running(p_rq, p) || p->state)
> > - goto out;
>
> Is it possible that by this time the current thread takes double rq
> lock, thread p could actually be running? i.e is there merit to keep
> this check around even with your similar check above?
I think that's a good idea. I'll add that back in.
>
> >
> > yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> > if (yielded) {
> > @@ -4879,6 +4880,7 @@ again:
> >
> > out:
> > double_rq_unlock(rq, p_rq);
> > +out_no_unlock:
> > local_irq_restore(flags);
> >
> > if (yielded)
> >
> >
>
-Andrew Theurer
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-07 19:42 ` Andrew Theurer
2012-09-08 8:43 ` Srikar Dronamraju
@ 2012-09-10 14:43 ` Raghavendra K T
1 sibling, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-09-10 14:43 UTC (permalink / raw)
To: habanero
Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
Peter Zijlstra
On 09/08/2012 01:12 AM, Andrew Theurer wrote:
> On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:
>> CCing PeterZ also.
>>
>> On 09/07/2012 06:41 PM, Andrew Theurer wrote:
>>> I have noticed recently that PLE/yield_to() is still not that scalable
>>> for really large guests, sometimes even with no CPU over-commit. I have
>>> a small change that make a very big difference.
[...]
>> We are indeed avoiding CPUS in guest mode when we check
>> task->flags & PF_VCPU in vcpu_on_spin path. Doesn't that suffice?
> My understanding is that it checks if the candidate vcpu task is in
> guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a
> target to yield to if it is already in guest mode. I am concerned about
> a different vcpu, possibly from a different VM (let's call it g2vcpuN),
> but it also located on the same runqueue as g1vcpuN -and- running. That
> vcpu, g2vcpuN, may also be doing a directed yield, and it may already be
> holding the rq lock. Or it could be in guest mode. If it is in guest
> mode, then let's still target this rq, and try to yield to g1vcpuN.
> However, if g2vcpuN is not in guest mode, then don't bother trying.
- If a non-vcpu task was currently running, this change can ignore a
request to yield to a target vcpu. The target vcpu could be the most
eligible vcpu, causing other vcpus to do PLE exits.
Is it possible to modify the check to deal with only vcpu tasks?
- Should we use p_rq->cfs_rq->skip instead to let us know that some
yield was active at this time?
- Consider this scenario:
Cpu 1    Cpu 2                      Cpu 3
a1       a2                         a3
b1       b2                         b3
         c2 (yield target of a1)    c3 (yield target of a2)
If vcpu a1 is doing a directed yield to vcpu c2, and the current vcpu a2
on the target cpu is also doing a directed yield (to some vcpu c3), then
this change will only allow vcpu a2 to do a schedule() to b2 (if the
a2 -> c3 yield is successful). Do we miss yielding to vcpu c2?
a1 might not find a suitable vcpu to yield to and might go back to
spinning. Is my understanding correct?
> Patch include below.
>
> Here's the new, v2 result with the previous two:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
> throughput +/- stddev
> ----- -----
> ple on: 2552 +/- .70%
> ple on: w/fixv1: 4621 +/- 2.12% (81% improvement)
> ple on: w/fixv2: 6115* (139% improvement)
>
The numbers look great.
> [*] I do not have stdev yet because all 10 runs are not complete
>
> for v1 to v2, host CPU dropped from 60% to 50%. Time in spin_lock() is
> also dropping:
>
[...]
>
> So this seems to be working. However I wonder just how far we can take
> this. Ideally we need to be in the <3-4% range in the host for PLE
> work, like I observe for the 8-way VMs. We are still way off.
>
> -Andrew
>
>
> Signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..c767915 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
> again:
> p_rq = task_rq(p);
> + if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
While we are checking the flags of the p_rq->curr task, task p can
migrate to some other runqueue. In that case, will we miss yielding to
the most eligible vcpu?
> + goto out_no_unlock;
> + }
Nit:
We don't need the curly braces above.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 13:16 ` Andrew Theurer
@ 2012-09-10 16:03 ` Peter Zijlstra
2012-09-10 16:56 ` Srikar Dronamraju
0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2012-09-10 16:03 UTC (permalink / raw)
To: habanero
Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
> > > @@ -4856,8 +4859,6 @@ again:
> > > if (curr->sched_class != p->sched_class)
> > > goto out;
> > >
> > > - if (task_running(p_rq, p) || p->state)
> > > - goto out;
> >
> > Is it possible that by this time the current thread takes double rq
> > lock, thread p could actually be running? i.e is there merit to keep
> > this check around even with your similar check above?
>
> I think that's a good idea. I'll add that back in.
Right, it needs to still be there, the test before acquiring p_rq is an
optimistic test to avoid work, but you have to still test it once you
acquire p_rq since the rest of the code relies on this not being so.
How about something like this instead.. ?
---
kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c46a011..c9ecab2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4300,6 +4300,23 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;
+
+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}
+
/**
* yield_to - yield the current processor to another thread in
* your thread group, or accelerate that thread toward the
@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
rq = this_rq();
again:
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p))
+ goto out_irq;
+
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
@@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
goto again;
}
- if (!curr->sched_class->yield_to_task)
- goto out;
-
- if (curr->sched_class != p->sched_class)
- goto out;
-
- if (task_running(p_rq, p) || p->state)
- goto out;
+ /* validate state, holding p_rq ensures p's state cannot change */
+ if (!__yield_to_candidate(curr, p))
+ goto out_unlock;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
resched_task(p_rq->curr);
}
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 16:03 ` Peter Zijlstra
@ 2012-09-10 16:56 ` Srikar Dronamraju
2012-09-10 17:12 ` Peter Zijlstra
0 siblings, 1 reply; 41+ messages in thread
From: Srikar Dronamraju @ 2012-09-10 16:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
* Peter Zijlstra <peterz@infradead.org> [2012-09-10 18:03:55]:
> On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
> > > > @@ -4856,8 +4859,6 @@ again:
> > > > if (curr->sched_class != p->sched_class)
> > > > goto out;
> > > >
> > > > - if (task_running(p_rq, p) || p->state)
> > > > - goto out;
> > >
> > > Is it possible that by this time the current thread takes double rq
> > > lock, thread p could actually be running? i.e is there merit to keep
> > > this check around even with your similar check above?
> >
> > I think that's a good idea. I'll add that back in.
>
> Right, it needs to still be there, the test before acquiring p_rq is an
> optimistic test to avoid work, but you have to still test it once you
> acquire p_rq since the rest of the code relies on this not being so.
>
> How about something like this instead.. ?
>
> ---
> kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
> 1 file changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c46a011..c9ecab2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4300,6 +4300,23 @@ void __sched yield(void)
> }
> EXPORT_SYMBOL(yield);
>
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> +{
> + if (!curr->sched_class->yield_to_task)
> + return false;
> +
> + if (curr->sched_class != p->sched_class)
> + return false;
Peter,
Should we also add a check if the runq has a skip buddy (as pointed out
by Raghu) and return if the skip buddy is already set. Something akin
to
if (p_rq->cfs_rq->skip)
return false;
So if somebody has already acquired the double runqueue lock and has
almost set the next buddy, we don't need to take the runqueue lock, and
we also avoid overwriting the already-set skip buddy.
> +
> + if (task_running(p_rq, p) || p->state)
> + return false;
> +
> + return true;
> +}
> +
> /**
> * yield_to - yield the current processor to another thread in
> * your thread group, or accelerate that thread toward the
> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> rq = this_rq();
>
> again:
> + /* optimistic test to avoid taking locks */
> + if (!__yield_to_candidate(curr, p))
> + goto out_irq;
> +
> p_rq = task_rq(p);
> double_rq_lock(rq, p_rq);
> while (task_rq(p) != p_rq) {
> @@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> goto again;
> }
>
> - if (!curr->sched_class->yield_to_task)
> - goto out;
> -
> - if (curr->sched_class != p->sched_class)
> - goto out;
> -
> - if (task_running(p_rq, p) || p->state)
> - goto out;
> + /* validate state, holding p_rq ensures p's state cannot change */
> + if (!__yield_to_candidate(curr, p))
> + goto out_unlock;
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> if (yielded) {
> @@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> resched_task(p_rq->curr);
> }
>
> -out:
> +out_unlock:
> double_rq_unlock(rq, p_rq);
> +out_irq:
> local_irq_restore(flags);
>
> if (yielded)
>
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 16:56 ` Srikar Dronamraju
@ 2012-09-10 17:12 ` Peter Zijlstra
2012-09-10 19:10 ` Raghavendra K T
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Peter Zijlstra @ 2012-09-10 17:12 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > +{
> > + if (!curr->sched_class->yield_to_task)
> > + return false;
> > +
> > + if (curr->sched_class != p->sched_class)
> > + return false;
>
>
> Peter,
>
> Should we also add a check if the runq has a skip buddy (as pointed out
> by Raghu) and return if the skip buddy is already set.
Oh right, I missed that suggestion.. the performance improvement went
from 81% to 139% using this, right?
It might make more sense to keep that separate, outside of this
function, since its not a strict prerequisite.
> >
> > + if (task_running(p_rq, p) || p->state)
> > + return false;
> > +
> > + return true;
> > +}
> > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> bool preempt)
> > rq = this_rq();
> >
> > again:
> > + /* optimistic test to avoid taking locks */
> > + if (!__yield_to_candidate(curr, p))
> > + goto out_irq;
> > +
So add something like:
/* Optimistic, if we 'raced' with another yield_to(), don't bother */
if (p_rq->cfs_rq->skip)
goto out_irq;
>
>
> > p_rq = task_rq(p);
> > double_rq_lock(rq, p_rq);
>
>
But I do have a question on this optimization though,.. Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
That is, I'd like to see this thing explained a little better.
Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 17:12 ` Peter Zijlstra
@ 2012-09-10 19:10 ` Raghavendra K T
2012-09-10 20:12 ` Andrew Theurer
2012-09-11 7:04 ` Srikar Dronamraju
2 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-09-10 19:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Srikar Dronamraju, habanero, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On 09/10/2012 10:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
>>> +{
>>> + if (!curr->sched_class->yield_to_task)
>>> + return false;
>>> +
>>> + if (curr->sched_class != p->sched_class)
>>> + return false;
>>
>>
>> Peter,
>>
>> Should we also add a check if the runq has a skip buddy (as pointed out
>> by Raghu) and return if the skip buddy is already set.
>
> Oh right, I missed that suggestion.. the performance improvement went
> from 81% to 139% using this, right?
>
> It might make more sense to keep that separate, outside of this
> function, since its not a strict prerequisite.
>
>>>
>>> + if (task_running(p_rq, p) || p->state)
>>> + return false;
>>> +
>>> + return true;
>>> +}
>
>
>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>> rq = this_rq();
>>>
>>> again:
>>> + /* optimistic test to avoid taking locks */
>>> + if (!__yield_to_candidate(curr, p))
>>> + goto out_irq;
>>> +
>
> So add something like:
>
> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> if (p_rq->cfs_rq->skip)
> goto out_irq;
>>
>>
>>> p_rq = task_rq(p);
>>> double_rq_lock(rq, p_rq);
>>
>>
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>
> That is, I'd like to see this thing explained a little better.
>
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?
>
Yes, that is the intention (I mean checking p_rq->cfs_rq->skip). Though
we may be overdoing this check in the scenario where multiple vcpus of
the same VM are pinned to the same CPU.
I am testing the above patch. Hope to be able to get back with the
results tomorrow.
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 17:12 ` Peter Zijlstra
2012-09-10 19:10 ` Raghavendra K T
@ 2012-09-10 20:12 ` Andrew Theurer
2012-09-10 20:19 ` Peter Zijlstra
2012-09-11 6:08 ` Raghavendra K T
2012-09-11 7:04 ` Srikar Dronamraju
2 siblings, 2 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-10 20:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > +{
> > > + if (!curr->sched_class->yield_to_task)
> > > + return false;
> > > +
> > > + if (curr->sched_class != p->sched_class)
> > > + return false;
> >
> >
> > Peter,
> >
> > Should we also add a check if the runq has a skip buddy (as pointed out
> > by Raghu) and return if the skip buddy is already set.
>
> Oh right, I missed that suggestion.. the performance improvement went
> from 81% to 139% using this, right?
>
> It might make more sense to keep that separate, outside of this
> function, since its not a strict prerequisite.
>
> > >
> > > + if (task_running(p_rq, p) || p->state)
> > > + return false;
> > > +
> > > + return true;
> > > +}
>
>
> > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > > rq = this_rq();
> > >
> > > again:
> > > + /* optimistic test to avoid taking locks */
> > > + if (!__yield_to_candidate(curr, p))
> > > + goto out_irq;
> > > +
>
> So add something like:
>
> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> if (p_rq->cfs_rq->skip)
> goto out_irq;
> >
> >
> > > p_rq = task_rq(p);
> > > double_rq_lock(rq, p_rq);
> >
> >
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>
> That is, I'd like to see this thing explained a little better.
>
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?
Here's two new versions, both include a __yield_to_candidate(): "v3"
uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
skip check. Raghu, I am not sure if this is exactly what you want
implemented in v4.
Results:
> ple on: 2552 +/- .70%
> ple on: w/fixv1: 4621 +/- 2.12% (81% improvement)
> ple on: w/fixv2: 6115 (139% improvement)
v3: 5735 (124% improvement)
v4: 4524 ( 3% regression)
Both patches included below
-Andrew
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..0d98a67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;
+
+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}
+
/**
* yield_to - yield the current processor to another thread in
* your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
again:
p_rq = task_rq(p);
+
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_irq;
+
+ /*
+ * if the target task is not running, then only yield if the
+ * current task is in guest mode
+ */
+ if (!(p_rq->curr->flags & PF_VCPU))
+ goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
- if (!curr->sched_class->yield_to_task)
- goto out;
-
- if (curr->sched_class != p->sched_class)
- goto out;
-
- if (task_running(p_rq, p) || p->state)
- goto out;
+ /* validate state, holding p_rq ensures p's state cannot change */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_unlock;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4901,9 @@ again:
rq->skip_clock_update = 0;
}
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
if (yielded)
****************
v4:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2bec2ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;
+
+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}
+
/**
* yield_to - yield the current processor to another thread in
* your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
again:
p_rq = task_rq(p);
+
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_irq;
+
+ /* if a yield is in progress, skip */
+ if (p_rq->cfs.skip)
+ goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
- if (!curr->sched_class->yield_to_task)
- goto out;
-
- if (curr->sched_class != p->sched_class)
- goto out;
-
- if (task_running(p_rq, p) || p->state)
- goto out;
+ /* validate state, holding p_rq ensures p's state cannot change */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_unlock;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4898,9 @@ again:
rq->skip_clock_update = 0;
}
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 20:12 ` Andrew Theurer
@ 2012-09-10 20:19 ` Peter Zijlstra
2012-09-10 20:31 ` Rik van Riel
2012-09-11 6:08 ` Raghavendra K T
1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2012-09-10 20:19 UTC (permalink / raw)
To: habanero
Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
> + /*
> + * if the target task is not running, then only yield if the
> + * current task is in guest mode
> + */
> + if (!(p_rq->curr->flags & PF_VCPU))
> + goto out_irq;
This would make yield_to() only ever work on KVM. Not that I mind this
too much; it's a horrid thing, and making it less useful for (ab)use is
a good thing. Still, this probably wants a mention somewhere :-)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 20:19 ` Peter Zijlstra
@ 2012-09-10 20:31 ` Rik van Riel
0 siblings, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2012-09-10 20:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: habanero, Srikar Dronamraju, Raghavendra K T, Avi Kivity,
Marcelo Tosatti, Ingo Molnar, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On 09/10/2012 04:19 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
>> + /*
>> + * if the target task is not running, then only yield if the
>> + * current task is in guest mode
>> + */
>> + if (!(p_rq->curr->flags & PF_VCPU))
>> + goto out_irq;
>
> This would make yield_to() only ever work on KVM, not that I mind this
> too much, its a horrid thing and making it less useful for (ab)use is a
> good thing, still this probably wants mention somewhere :-)
Also, it would not preempt a non-kvm task, even if we need
to do that to boost a VCPU. I think the lines above should
be dropped.
--
All rights reversed
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 20:12 ` Andrew Theurer
2012-09-10 20:19 ` Peter Zijlstra
@ 2012-09-11 6:08 ` Raghavendra K T
2012-09-11 12:48 ` Andrew Theurer
2012-09-11 18:27 ` Andrew Theurer
1 sibling, 2 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-09-11 6:08 UTC (permalink / raw)
To: habanero
Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
>> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
>>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
>>>> +{
>>>> + if (!curr->sched_class->yield_to_task)
>>>> + return false;
>>>> +
>>>> + if (curr->sched_class != p->sched_class)
>>>> + return false;
>>>
>>>
>>> Peter,
>>>
>>> Should we also add a check if the runq has a skip buddy (as pointed out
>>> by Raghu) and return if the skip buddy is already set.
>>
>> Oh right, I missed that suggestion.. the performance improvement went
>> from 81% to 139% using this, right?
>>
>> It might make more sense to keep that separate, outside of this
>> function, since its not a strict prerequisite.
>>
>>>>
>>>> + if (task_running(p_rq, p) || p->state)
>>>> + return false;
>>>> +
>>>> + return true;
>>>> +}
>>
>>
>>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>> rq = this_rq();
>>>>
>>>> again:
>>>> + /* optimistic test to avoid taking locks */
>>>> + if (!__yield_to_candidate(curr, p))
>>>> + goto out_irq;
>>>> +
>>
>> So add something like:
>>
>> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
>> if (p_rq->cfs_rq->skip)
>> goto out_irq;
>>>
>>>
>>>> p_rq = task_rq(p);
>>>> double_rq_lock(rq, p_rq);
>>>
>>>
>> But I do have a question on this optimization though,.. Why do we check
>> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>>
>> That is, I'd like to see this thing explained a little better.
>>
>> Does it go something like: p_rq is the runqueue of the task we'd like to
>> yield to, rq is our own, they might be the same. If we have a ->skip,
>> there's nothing we can do about it, OTOH p_rq having a ->skip and
>> failing the yield_to() simply means us picking the next VCPU thread,
>> which might be running on an entirely different cpu (rq) and could
>> succeed?
>
> Here's two new versions, both include a __yield_to_candidate(): "v3"
> uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> skip check. Raghu, I am not sure if this is exactly what you want
> implemented in v4.
>
Andrew, yes, that is what I had. I think there was a misunderstanding.
My intention was: if a directed yield has already happened on a runqueue
(say rqA), do not bother doing a directed yield to it. But unfortunately,
as PeterZ pointed out, that would have resulted in setting the next
buddy of a run queue other than rqA.
So we can drop this "skip" idea. Pondering more over what to do; can we
use the next buddy itself... thinking.
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-10 17:12 ` Peter Zijlstra
2012-09-10 19:10 ` Raghavendra K T
2012-09-10 20:12 ` Andrew Theurer
@ 2012-09-11 7:04 ` Srikar Dronamraju
2 siblings, 0 replies; 41+ messages in thread
From: Srikar Dronamraju @ 2012-09-11 7:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
> > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > > rq = this_rq();
> > >
> > > again:
> > > + /* optimistic test to avoid taking locks */
> > > + if (!__yield_to_candidate(curr, p))
> > > + goto out_irq;
> > > +
>
> So add something like:
>
> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> if (p_rq->cfs_rq->skip)
> goto out_irq;
> >
> >
> > > p_rq = task_rq(p);
> > > double_rq_lock(rq, p_rq);
> >
> >
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>
> That is, I'd like to see this thing explained a little better.
>
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?
>
Oh, this made me look back at yield_to() again. I had misread the
yield_to_task_fair() code. I had wrongly thought that both the ->skip
and ->next buddies for p_rq would be set. But it looks like only ->next
for p_rq is set, while ->skip is set for rq.
This should also explain why Andrew saw a regression when checking for
the ->skip flag instead of PF_VCPU.
Can we check for p_rq->cfs.next and bail out if it is set? Something like:
@@ -4820,6 +4820,23 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;
+
+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}
+
/**
* yield_to - yield the current processor to another thread in
* your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
again:
p_rq = task_rq(p);
+
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_irq;
+
+ /* if next buddy is set, assume yield is in progress */
+ if (p_rq->cfs.next)
+ goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
- if (!curr->sched_class->yield_to_task)
- goto out;
-
- if (curr->sched_class != p->sched_class)
- goto out;
-
- if (task_running(p_rq, p) || p->state)
- goto out;
+ /* validate state, holding p_rq ensures p's state cannot change */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_unlock;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4898,9 @@ again:
rq->skip_clock_update = 0;
}
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-11 6:08 ` Raghavendra K T
@ 2012-09-11 12:48 ` Andrew Theurer
2012-09-11 18:27 ` Andrew Theurer
1 sibling, 0 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-11 12:48 UTC (permalink / raw)
To: Raghavendra K T
Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> >>>> +{
> >>>> + if (!curr->sched_class->yield_to_task)
> >>>> + return false;
> >>>> +
> >>>> + if (curr->sched_class != p->sched_class)
> >>>> + return false;
> >>>
> >>>
> >>> Peter,
> >>>
> >>> Should we also add a check if the runq has a skip buddy (as pointed out
> >>> by Raghu) and return if the skip buddy is already set.
> >>
> >> Oh right, I missed that suggestion.. the performance improvement went
> >> from 81% to 139% using this, right?
> >>
> >> It might make more sense to keep that separate, outside of this
> >> function, since its not a strict prerequisite.
> >>
> >>>>
> >>>> + if (task_running(p_rq, p) || p->state)
> >>>> + return false;
> >>>> +
> >>>> + return true;
> >>>> +}
> >>
> >>
> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >>>> rq = this_rq();
> >>>>
> >>>> again:
> >>>> + /* optimistic test to avoid taking locks */
> >>>> + if (!__yield_to_candidate(curr, p))
> >>>> + goto out_irq;
> >>>> +
> >>
> >> So add something like:
> >>
> >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> >> if (p_rq->cfs_rq->skip)
> >> goto out_irq;
> >>>
> >>>
> >>>> p_rq = task_rq(p);
> >>>> double_rq_lock(rq, p_rq);
> >>>
> >>>
> >> But I do have a question on this optimization though,.. Why do we check
> >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> >>
> >> That is, I'd like to see this thing explained a little better.
> >>
> >> Does it go something like: p_rq is the runqueue of the task we'd like to
> >> yield to, rq is our own, they might be the same. If we have a ->skip,
> >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> >> failing the yield_to() simply means us picking the next VCPU thread,
> >> which might be running on an entirely different cpu (rq) and could
> >> succeed?
> >
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check. Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
> >
>
> Andrew, yes, that is what I had. I think there was a misunderstanding.
> My intention was: if a directed yield has already happened on a runqueue
> (say rqA), do not bother doing a directed yield to it. But unfortunately,
> as PeterZ pointed out, that would have resulted in setting the next
> buddy of a run queue other than rqA.
> So we can drop this "skip" idea. Pondering more over what to do; can we
> use the next buddy itself... thinking.
FYI, I regretfully forgot to include your recent changes to
kvm_vcpu_on_spin in my tests (found in the kvm.git next branch), so I am
going to get some results for that before I experiment any more on
3.6-rc.
-Andrew
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-11 6:08 ` Raghavendra K T
2012-09-11 12:48 ` Andrew Theurer
@ 2012-09-11 18:27 ` Andrew Theurer
2012-09-13 11:48 ` Raghavendra K T
2012-09-13 12:13 ` Avi Kivity
1 sibling, 2 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-11 18:27 UTC (permalink / raw)
To: Raghavendra K T
Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> >>>> +{
> >>>> + if (!curr->sched_class->yield_to_task)
> >>>> + return false;
> >>>> +
> >>>> + if (curr->sched_class != p->sched_class)
> >>>> + return false;
> >>>
> >>>
> >>> Peter,
> >>>
> >>> Should we also add a check if the runq has a skip buddy (as pointed out
> >>> by Raghu) and return if the skip buddy is already set.
> >>
> >> Oh right, I missed that suggestion.. the performance improvement went
> >> from 81% to 139% using this, right?
> >>
> >> It might make more sense to keep that separate, outside of this
> >> function, since its not a strict prerequisite.
> >>
> >>>>
> >>>> + if (task_running(p_rq, p) || p->state)
> >>>> + return false;
> >>>> +
> >>>> + return true;
> >>>> +}
> >>
> >>
> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> >>> bool preempt)
> >>>> rq = this_rq();
> >>>>
> >>>> again:
> >>>> + /* optimistic test to avoid taking locks */
> >>>> + if (!__yield_to_candidate(curr, p))
> >>>> + goto out_irq;
> >>>> +
> >>
> >> So add something like:
> >>
> >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> >> if (p_rq->cfs_rq->skip)
> >> goto out_irq;
> >>>
> >>>
> >>>> p_rq = task_rq(p);
> >>>> double_rq_lock(rq, p_rq);
> >>>
> >>>
> >> But I do have a question on this optimization though,.. Why do we check
> >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> >>
> >> That is, I'd like to see this thing explained a little better.
> >>
> >> Does it go something like: p_rq is the runqueue of the task we'd like to
> >> yield to, rq is our own, they might be the same. If we have a ->skip,
> >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> >> failing the yield_to() simply means us picking the next VCPU thread,
> >> which might be running on an entirely different cpu (rq) and could
> >> succeed?
> >
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check. Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
> >
>
> Andrew, yes, that is what I had. I think there was a misunderstanding.
> My intention was: if a directed_yield has already happened in a runqueue
> (say rqA), do not bother to directed-yield to it. But unfortunately, as
> PeterZ pointed out, that would have resulted in setting the next buddy of a
> run queue different from rqA.
> So we can drop this "skip" idea. Pondering more over what to do... can we
> use the next buddy itself? Thinking..
As I mentioned earlier today, I did not have your changes from kvm.git
tree when I tested my changes. Here are your changes and my changes
compared:
throughput in MB/sec
kvm_vcpu_on_spin changes: 4636 +/- 15.74%
yield_to changes: 4515 +/- 12.73%
I would be inclined to stick with your changes which are kept in kvm
code. I did try both combined, and did not get good results:
both changes: 4074 +/- 19.12%
So, having both is probably not a good idea. However, I feel like
there's more work to be done. With no over-commit (10 VMs), total
throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
overhead, but a reduction to ~4500 is still terrible. By contrast,
8-way VMs with 2x over-commit have a total throughput roughly 10% less
than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
host). We still have what appear to be scalability problems, but now
they are not so much in the runqueue locks for yield_to() as in
get_pid_task():
perf on host:
32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to
9.17% 91507 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin
7.74% 77257 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to
3.56% 35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
3.00% 29951 qemu-system-x86 [kvm] [k] __vcpu_run
2.93% 29268 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
2.88% 28783 qemu-system-x86 [kvm] [k] vcpu_enter_guest
2.59% 25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule
1.40% 13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
1.28% 12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task
1.14% 11376 qemu-system-x86 [kvm_intel] [k] vmcs_writel
0.85% 8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair
0.53% 5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
0.46% 4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
get_pid_task() uses some rcu functions, and I am wondering how scalable
this is.... I tend to think of rcu as -not- having issues like this... is
there an rcu stat/tracing tool which would help identify potential
problems?
-Andrew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-11 18:27 ` Andrew Theurer
@ 2012-09-13 11:48 ` Raghavendra K T
2012-09-13 21:30 ` Andrew Theurer
2012-09-13 12:13 ` Avi Kivity
1 sibling, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-09-13 11:48 UTC (permalink / raw)
To: Andrew Theurer
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
* Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
> On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > >>>> +{
> > >>>> + if (!curr->sched_class->yield_to_task)
> > >>>> + return false;
> > >>>> +
> > >>>> + if (curr->sched_class != p->sched_class)
> > >>>> + return false;
> > >>>
> > >>>
> > >>> Peter,
> > >>>
> > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > >>> by Raghu) and return if the skip buddy is already set.
> > >>
> > >> Oh right, I missed that suggestion.. the performance improvement went
> > >> from 81% to 139% using this, right?
> > >>
> > >> It might make more sense to keep that separate, outside of this
> > >> function, since its not a strict prerequisite.
> > >>
> > >>>>
> > >>>> + if (task_running(p_rq, p) || p->state)
> > >>>> + return false;
> > >>>> +
> > >>>> + return true;
> > >>>> +}
> > >>
> > >>
> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > >>> bool preempt)
> > >>>> rq = this_rq();
> > >>>>
> > >>>> again:
> > >>>> + /* optimistic test to avoid taking locks */
> > >>>> + if (!__yield_to_candidate(curr, p))
> > >>>> + goto out_irq;
> > >>>> +
> > >>
> > >> So add something like:
> > >>
> > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > >> if (p_rq->cfs_rq->skip)
> > >> goto out_irq;
> > >>>
> > >>>
> > >>>> p_rq = task_rq(p);
> > >>>> double_rq_lock(rq, p_rq);
> > >>>
> > >>>
> > >> But I do have a question on this optimization though,.. Why do we check
> > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > >>
> > >> That is, I'd like to see this thing explained a little better.
> > >>
> > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > >> failing the yield_to() simply means us picking the next VCPU thread,
> > >> which might be running on an entirely different cpu (rq) and could
> > >> succeed?
> > >
> > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > skip check. Raghu, I am not sure if this is exactly what you want
> > > implemented in v4.
> > >
> >
> > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > My intention was: if a directed_yield has already happened in a runqueue
> > (say rqA), do not bother to directed-yield to it. But unfortunately, as
> > PeterZ pointed out, that would have resulted in setting the next buddy of a
> > run queue different from rqA.
> > So we can drop this "skip" idea. Pondering more over what to do... can we
> > use the next buddy itself? Thinking..
>
> As I mentioned earlier today, I did not have your changes from kvm.git
> tree when I tested my changes. Here are your changes and my changes
> compared:
>
> throughput in MB/sec
>
> kvm_vcpu_on_spin changes: 4636 +/- 15.74%
> yield_to changes: 4515 +/- 12.73%
>
> I would be inclined to stick with your changes which are kept in kvm
> code. I did try both combined, and did not get good results:
>
> both changes: 4074 +/- 19.12%
>
> So, having both is probably not a good idea. However, I feel like
> there's more work to be done. With no over-commit (10 VMs), total
> throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
> overhead, but a reduction to ~4500 is still terrible. By contrast,
> 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> host). We still have what appears to be scalability problems, but now
> it's not so much in runqueue locks for yield_to(), but now
> get_pid_task():
>
Hi Andrew,
IMHO, reducing the double runqueue lock overhead is a good idea,
and maybe we will see the benefit when we increase the overcommit further.
The explanation for not seeing much benefit on top of the PLE handler
optimization patch is that we filter the yield_to candidates,
which results in less contention for the double runqueue lock,
and the extra code overhead during a genuine yield_to might have caused
some degradation in the case you tested.
However, did you also use cfs.next? I hope it helps when we combine them.
Here is the result that is showing positive benefit.
I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
+-----------+-----------+-----------+------------+-----------+
kernbench time in sec, lower is better
+-----------+-----------+-----------+------------+-----------+
base stddev patched stddev %improve
+-----------+-----------+-----------+------------+-----------+
1x 44.3880 1.8699 40.8180 1.9173 8.04271
2x 96.7580 4.2787 93.4188 3.5150 3.45108
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy record/sec higher is better
+-----------+-----------+-----------+------------+-----------+
base stddev patched stddev %improve
+-----------+-----------+-----------+------------+-----------+
1x 2374.1250 50.9718 3816.2500 54.0681 60.74343
2x 2536.2500 93.0403 2789.3750 204.7897 9.98029
+-----------+-----------+-----------+------------+-----------+
Below is the patch which combines PeterZ's suggestions on your
original approach with cfs.next (already posted by Srikar in the other
thread).
----8<----
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..8551f57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,24 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
+ struct rq *p_rq)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;
+
+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}
+
/**
* yield_to - yield the current processor to another thread in
* your thread group, or accelerate that thread toward the
@@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
again:
p_rq = task_rq(p);
+
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_irq;
+
+ /* if next buddy is set, assume yield is in progress */
+ if (p_rq->cfs.next)
+ goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
- if (!curr->sched_class->yield_to_task)
- goto out;
-
- if (curr->sched_class != p->sched_class)
- goto out;
-
- if (task_running(p_rq, p) || p->state)
- goto out;
+ /* validate state, holding p_rq ensures p's state cannot change */
+ if (!__yield_to_candidate(curr, p, p_rq))
+ goto out_unlock;
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4899,9 @@ again:
rq->skip_clock_update = 0;
}
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
if (yielded)
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-11 18:27 ` Andrew Theurer
2012-09-13 11:48 ` Raghavendra K T
@ 2012-09-13 12:13 ` Avi Kivity
1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-09-13 12:13 UTC (permalink / raw)
To: habanero
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On 09/11/2012 09:27 PM, Andrew Theurer wrote:
>
> So, having both is probably not a good idea. However, I feel like
> there's more work to be done. With no over-commit (10 VMs), total
> throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
> overhead, but a reduction to ~4500 is still terrible. By contrast,
> 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> host). We still have what appears to be scalability problems, but now
> it's not so much in runqueue locks for yield_to(), but now
> get_pid_task():
>
> perf on host:
>
> 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
> 11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
> 10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to
> 9.17% 91507 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin
> 7.74% 77257 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to
> 3.56% 35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
> 3.00% 29951 qemu-system-x86 [kvm] [k] __vcpu_run
> 2.93% 29268 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
> 2.88% 28783 qemu-system-x86 [kvm] [k] vcpu_enter_guest
> 2.59% 25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule
> 1.40% 13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
> 1.28% 12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task
> 1.14% 11376 qemu-system-x86 [kvm_intel] [k] vmcs_writel
> 0.85% 8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair
> 0.53% 5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
> 0.46% 4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
>
> get_pid_task() uses some rcu functions, and I am wondering how scalable
> this is.... I tend to think of rcu as -not- having issues like this... is
> there an rcu stat/tracing tool which would help identify potential
> problems?
It's not, it's the atomics + cache line bouncing. We're basically
guaranteed to bounce here.
Here we're finally paying for the ioctl() based interface. A syscall
based interface would have a 1:1 correspondence between vcpus and tasks,
so these games would be unnecessary.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-13 11:48 ` Raghavendra K T
@ 2012-09-13 21:30 ` Andrew Theurer
2012-09-14 17:10 ` Andrew Jones
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-13 21:30 UTC (permalink / raw)
To: Raghavendra K T
Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
Gleb Natapov, Srivatsa Vaddagiri
On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
>
> > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > >>>> +{
> > > >>>> + if (!curr->sched_class->yield_to_task)
> > > >>>> + return false;
> > > >>>> +
> > > >>>> + if (curr->sched_class != p->sched_class)
> > > >>>> + return false;
> > > >>>
> > > >>>
> > > >>> Peter,
> > > >>>
> > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > >>> by Raghu) and return if the skip buddy is already set.
> > > >>
> > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > >> from 81% to 139% using this, right?
> > > >>
> > > >> It might make more sense to keep that separate, outside of this
> > > >> function, since its not a strict prerequisite.
> > > >>
> > > >>>>
> > > >>>> + if (task_running(p_rq, p) || p->state)
> > > >>>> + return false;
> > > >>>> +
> > > >>>> + return true;
> > > >>>> +}
> > > >>
> > > >>
> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > >>> bool preempt)
> > > >>>> rq = this_rq();
> > > >>>>
> > > >>>> again:
> > > >>>> + /* optimistic test to avoid taking locks */
> > > >>>> + if (!__yield_to_candidate(curr, p))
> > > >>>> + goto out_irq;
> > > >>>> +
> > > >>
> > > >> So add something like:
> > > >>
> > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > >> if (p_rq->cfs_rq->skip)
> > > >> goto out_irq;
> > > >>>
> > > >>>
> > > >>>> p_rq = task_rq(p);
> > > >>>> double_rq_lock(rq, p_rq);
> > > >>>
> > > >>>
> > > >> But I do have a question on this optimization though,.. Why do we check
> > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > >>
> > > >> That is, I'd like to see this thing explained a little better.
> > > >>
> > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > >> which might be running on an entirely different cpu (rq) and could
> > > >> succeed?
> > > >
> > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > skip check. Raghu, I am not sure if this is exactly what you want
> > > > implemented in v4.
> > > >
> > >
> > > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > > My intention was: if a directed_yield has already happened in a runqueue
> > > (say rqA), do not bother to directed-yield to it. But unfortunately, as
> > > PeterZ pointed out, that would have resulted in setting the next buddy of a
> > > run queue different from rqA.
> > > So we can drop this "skip" idea. Pondering more over what to do... can we
> > > use the next buddy itself? Thinking..
> >
> > As I mentioned earlier today, I did not have your changes from kvm.git
> > tree when I tested my changes. Here are your changes and my changes
> > compared:
> >
> > throughput in MB/sec
> >
> > kvm_vcpu_on_spin changes: 4636 +/- 15.74%
> > yield_to changes: 4515 +/- 12.73%
> >
> > I would be inclined to stick with your changes which are kept in kvm
> > code. I did try both combined, and did not get good results:
> >
> > both changes: 4074 +/- 19.12%
> >
> > So, having both is probably not a good idea. However, I feel like
> > there's more work to be done. With no over-commit (10 VMs), total
> > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
> > overhead, but a reduction to ~4500 is still terrible. By contrast,
> > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > host). We still have what appears to be scalability problems, but now
> > it's not so much in runqueue locks for yield_to(), but now
> > get_pid_task():
> >
>
> Hi Andrew,
> IMHO, reducing the double runqueue lock overhead is a good idea,
> and maybe we will see the benefit when we increase the overcommit further.
>
> The explanation for not seeing much benefit on top of the PLE handler
> optimization patch is that we filter the yield_to candidates,
> which results in less contention for the double runqueue lock,
> and the extra code overhead during a genuine yield_to might have caused
> some degradation in the case you tested.
>
> However, did you also use cfs.next? I hope it helps when we combine them.
>
> Here is the result that is showing positive benefit.
> I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
>
> +-----------+-----------+-----------+------------+-----------+
> kernbench time in sec, lower is better
> +-----------+-----------+-----------+------------+-----------+
> base stddev patched stddev %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x 44.3880 1.8699 40.8180 1.9173 8.04271
> 2x 96.7580 4.2787 93.4188 3.5150 3.45108
> +-----------+-----------+-----------+------------+-----------+
>
>
> +-----------+-----------+-----------+------------+-----------+
> ebizzy record/sec higher is better
> +-----------+-----------+-----------+------------+-----------+
> base stddev patched stddev %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x 2374.1250 50.9718 3816.2500 54.0681 60.74343
> 2x 2536.2500 93.0403 2789.3750 204.7897 9.98029
> +-----------+-----------+-----------+------------+-----------+
>
>
> Below is the patch which combines PeterZ's suggestions on your
> original approach with cfs.next (already posted by Srikar in the other
> thread).
I did get a chance to run with the below patch and your changes in
kvm.git, but the results were not too different:
Dbench, 10 x 16-way VMs on 80-way host:
kvm_vcpu_on_spin changes: 4636 +/- 15.74%
yield_to changes: 4515 +/- 12.73%
both changes from above: 4074 +/- 19.12%
...plus cfs.next check: 4418 +/- 16.97%
Still hovering around 4500 MB/sec
The concern I have is that even though we have gone through changes to
help reduce the candidate vcpus we yield to, we still have a very poor
idea of which vcpu really needs to run. The result is high cpu usage in
get_pid_task() and still some contention on the double runqueue lock.
To make this scalable, we either need to significantly reduce the
occurrence of the lock-holder preemption, or do a much better job of
knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
which do not need to run).
On reducing the occurrence: The worst case for lock-holder preemption
is having vcpus of the same VM on the same runqueue. This guarantees a
situation where one vcpu is running while another [of the same VM] is not. To
prove the point, I ran the same test, but with vcpus restricted to a
range of host cpus, such that any single VM's vcpus can never be on the
same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
vcpu-1's are on host cpus 5-9, and so on. Here is the result:
kvm_vcpu_on_spin, and all
yield_to changes, plus
restricted vcpu placement: 8823 +/- 3.20% much, much better
On picking a better vcpu to yield to: I really hesitate to rely on a
paravirt hint [telling us which vcpu is holding a lock], but I am not
sure how else to reduce the candidate vcpus to yield to. I suspect we
are yielding to way more vcpus than are preempted lock-holders, and that
IMO is just work accomplishing nothing. Trying to think of a way to
further reduce candidate vcpus....
-Andrew
>
> ----8<----
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..8551f57 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4820,6 +4820,24 @@ void __sched yield(void)
> }
> EXPORT_SYMBOL(yield);
>
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> + struct rq *p_rq)
> +{
> + if (!curr->sched_class->yield_to_task)
> + return false;
> +
> + if (curr->sched_class != p->sched_class)
> + return false;
> +
> + if (task_running(p_rq, p) || p->state)
> + return false;
> +
> + return true;
> +}
> +
> /**
> * yield_to - yield the current processor to another thread in
> * your thread group, or accelerate that thread toward the
> @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
> again:
> p_rq = task_rq(p);
> +
> + /* optimistic test to avoid taking locks */
> + if (!__yield_to_candidate(curr, p, p_rq))
> + goto out_irq;
> +
> + /* if next buddy is set, assume yield is in progress */
> + if (p_rq->cfs.next)
> + goto out_irq;
> +
> double_rq_lock(rq, p_rq);
> while (task_rq(p) != p_rq) {
> double_rq_unlock(rq, p_rq);
> goto again;
> }
>
> - if (!curr->sched_class->yield_to_task)
> - goto out;
> -
> - if (curr->sched_class != p->sched_class)
> - goto out;
> -
> - if (task_running(p_rq, p) || p->state)
> - goto out;
> + /* validate state, holding p_rq ensures p's state cannot change */
> + if (!__yield_to_candidate(curr, p, p_rq))
> + goto out_unlock;
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> if (yielded) {
> @@ -4877,8 +4899,9 @@ again:
> rq->skip_clock_update = 0;
> }
>
> -out:
> +out_unlock:
> double_rq_unlock(rq, p_rq);
> +out_irq:
> local_irq_restore(flags);
>
> if (yielded)
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-13 21:30 ` Andrew Theurer
@ 2012-09-14 17:10 ` Andrew Jones
2012-09-15 16:08 ` Raghavendra K T
2012-09-14 20:34 ` Konrad Rzeszutek Wilk
2012-09-16 8:55 ` Avi Kivity
2 siblings, 1 reply; 41+ messages in thread
From: Andrew Jones @ 2012-09-14 17:10 UTC (permalink / raw)
To: Andrew Theurer
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
> On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> > * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
> >
> > > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > > >>>> +{
> > > > >>>> + if (!curr->sched_class->yield_to_task)
> > > > >>>> + return false;
> > > > >>>> +
> > > > >>>> + if (curr->sched_class != p->sched_class)
> > > > >>>> + return false;
> > > > >>>
> > > > >>>
> > > > >>> Peter,
> > > > >>>
> > > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > > >>> by Raghu) and return if the skip buddy is already set.
> > > > >>
> > > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > > >> from 81% to 139% using this, right?
> > > > >>
> > > > >> It might make more sense to keep that separate, outside of this
> > > > >> function, since its not a strict prerequisite.
> > > > >>
> > > > >>>>
> > > > >>>> + if (task_running(p_rq, p) || p->state)
> > > > >>>> + return false;
> > > > >>>> +
> > > > >>>> + return true;
> > > > >>>> +}
> > > > >>
> > > > >>
> > > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > > >>> bool preempt)
> > > > >>>> rq = this_rq();
> > > > >>>>
> > > > >>>> again:
> > > > >>>> + /* optimistic test to avoid taking locks */
> > > > >>>> + if (!__yield_to_candidate(curr, p))
> > > > >>>> + goto out_irq;
> > > > >>>> +
> > > > >>
> > > > >> So add something like:
> > > > >>
> > > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > > >> if (p_rq->cfs_rq->skip)
> > > > >> goto out_irq;
> > > > >>>
> > > > >>>
> > > > >>>> p_rq = task_rq(p);
> > > > >>>> double_rq_lock(rq, p_rq);
> > > > >>>
> > > > >>>
> > > > >> But I do have a question on this optimization though,.. Why do we check
> > > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > > >>
> > > > >> That is, I'd like to see this thing explained a little better.
> > > > >>
> > > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > > >> which might be running on an entirely different cpu (rq) and could
> > > > >> succeed?
> > > > >
> > > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > > skip check. Raghu, I am not sure if this is exactly what you want
> > > > > implemented in v4.
> > > > >
> > > >
> > > > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > > > My intention was: if a directed_yield has already happened in a runqueue
> > > > (say rqA), do not bother to directed-yield to it. But unfortunately, as
> > > > PeterZ pointed out, that would have resulted in setting the next buddy of a
> > > > run queue different from rqA.
> > > > So we can drop this "skip" idea. Pondering more over what to do... can we
> > > > use the next buddy itself? Thinking..
> > >
> > > As I mentioned earlier today, I did not have your changes from kvm.git
> > > tree when I tested my changes. Here are your changes and my changes
> > > compared:
> > >
> > > throughput in MB/sec
> > >
> > > kvm_vcpu_on_spin changes: 4636 +/- 15.74%
> > > yield_to changes: 4515 +/- 12.73%
> > >
> > > I would be inclined to stick with your changes which are kept in kvm
> > > code. I did try both combined, and did not get good results:
> > >
> > > both changes: 4074 +/- 19.12%
> > >
> > > So, having both is probably not a good idea. However, I feel like
> > > there's more work to be done. With no over-commit (10 VMs), total
> > > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
> > > overhead, but a reduction to ~4500 is still terrible. By contrast,
> > > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > > host). We still have what appears to be scalability problems, but now
> > > it's not so much in runqueue locks for yield_to(), but now
> > > get_pid_task():
> > >
> >
> > Hi Andrew,
> > IMHO, reducing the double runqueue lock overhead is a good idea,
> > and maybe we will see the benefit when we increase the overcommit further.
> >
> > The explanation for not seeing much benefit on top of the PLE handler
> > optimization patch is that we filter the yield_to candidates,
> > which results in less contention for the double runqueue lock,
> > and the extra code overhead during a genuine yield_to might have caused
> > some degradation in the case you tested.
> >
> > However, did you also use cfs.next? I hope it helps when we combine them.
> >
> > Here is the result that is showing positive benefit.
> > I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
> >
> > +-----------+-----------+-----------+------------+-----------+
> > kernbench time in sec, lower is better
> > +-----------+-----------+-----------+------------+-----------+
> > base stddev patched stddev %improve
> > +-----------+-----------+-----------+------------+-----------+
> > 1x 44.3880 1.8699 40.8180 1.9173 8.04271
> > 2x 96.7580 4.2787 93.4188 3.5150 3.45108
> > +-----------+-----------+-----------+------------+-----------+
> >
> >
> > +-----------+-----------+-----------+------------+-----------+
> > ebizzy record/sec higher is better
> > +-----------+-----------+-----------+------------+-----------+
> > base stddev patched stddev %improve
> > +-----------+-----------+-----------+------------+-----------+
> > 1x 2374.1250 50.9718 3816.2500 54.0681 60.74343
> > 2x 2536.2500 93.0403 2789.3750 204.7897 9.98029
> > +-----------+-----------+-----------+------------+-----------+
> >
> >
> > Below is the patch which combines suggestions of PeterZ on your
> > original approach with cfs.next (already posted by Srikar in the other
> > thread)
>
> I did get a chance to run with the below patch and your changes in
> kvm.git, but the results were not too different:
>
> Dbench, 10 x 16-way VMs on 80-way host:
>
> kvm_vcpu_on_spin changes: 4636 +/- 15.74%
> yield_to changes: 4515 +/- 12.73%
> both changes from above: 4074 +/- 19.12%
> ...plus cfs.next check: 4418 +/- 16.97%
>
> Still hovering around 4500 MB/sec
>
> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run. The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).
>
> On reducing the occurrence: The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue. This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not. To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on. Here is the result:
>
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement: 8823 +/- 3.20% much, much better
>
> On picking a better vcpu to yield to: I really hesitate to rely on
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to. I suspect we
> are yielding to way more vcpus than are preempted lock-holders, and that
> IMO is just work accomplishing nothing. Trying to think of ways to
> further reduce candidate vcpus....
>
wrt yielding to vcpus on the same cpu, I recently noticed that
there's a bug in yield_to_task_fair(). yield_task_fair() calls
clear_buddies(), so if we're yielding to a task that has been running on
the same cpu that we're currently running on, and thus is also on the
current cfs runqueue, then our 'who to pick next' hint is getting cleared
right after we set it.
I had hoped that the patch below would show a general improvement in
vcpu overcommit performance, however the results were variable - no worse,
no better. Based on your results above showing good improvement from
interleaving vcpus across the cpus, there was evidently a decent
percentage of these types of yields going on. So, since the patch didn't
change much, that indicates that the next hinting isn't generally taken
too seriously by the scheduler. Anyway, the patch should correct the
code per its design, and testing shows that it didn't make anything worse,
so I'll post it soon. Also, in order to try and improve how far set-next
can jump ahead in the queue, I tested a kernel with group scheduling
compiled out (libvirt uses cgroups, and I'm not sure whether autogroups
may affect things). I did get a slight improvement with that, but nothing
to write home to mom about.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c219bf8..7d8a21d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
return false;
+ /* We're yielding, so tell the scheduler we don't want to be picked */
+ yield_task_fair(rq);
+
/* Tell the scheduler that we'd really like pse to run next. */
set_next_buddy(se);
- yield_task_fair(rq);
-
return true;
}
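The ordering bug described above can be modeled in plain userspace C. The struct and function names below are illustrative stand-ins for the kernel's, not the real code:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel structures (illustrative only). */
struct sched_entity { int id; };
struct cfs_rq { struct sched_entity *next; };	/* 'who to pick next' hint */

static void clear_buddies(struct cfs_rq *cfs_rq) { cfs_rq->next = NULL; }
static void set_next_buddy(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->next = se;
}

/* yield_task_fair() clears the buddy hints as part of yielding. */
static void yield_task_fair(struct cfs_rq *cfs_rq) { clear_buddies(cfs_rq); }

/* Original ordering: the hint for the yield target is set, then
 * immediately wiped by the yield. */
static struct sched_entity *old_order(struct cfs_rq *cfs_rq,
				      struct sched_entity *target)
{
	set_next_buddy(cfs_rq, target);
	yield_task_fair(cfs_rq);
	return cfs_rq->next;		/* NULL: hint lost */
}

/* Patched ordering: yield first, then record the target. */
static struct sched_entity *new_order(struct cfs_rq *cfs_rq,
				      struct sched_entity *target)
{
	yield_task_fair(cfs_rq);
	set_next_buddy(cfs_rq, target);
	return cfs_rq->next;		/* target: hint survives */
}
```

The model only matters when yielder and target share a cfs runqueue, which is exactly the case described in the message.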
>
> >
> > ----8<----
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fbf1fd0..8551f57 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4820,6 +4820,24 @@ void __sched yield(void)
> > }
> > EXPORT_SYMBOL(yield);
> >
> > +/*
> > + * Tests preconditions required for sched_class::yield_to().
> > + */
> > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> > + struct rq *p_rq)
> > +{
> > + if (!curr->sched_class->yield_to_task)
> > + return false;
> > +
> > + if (curr->sched_class != p->sched_class)
> > + return false;
> > +
> > + if (task_running(p_rq, p) || p->state)
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > /**
> > * yield_to - yield the current processor to another thread in
> > * your thread group, or accelerate that thread toward the
> > @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >
> > again:
> > p_rq = task_rq(p);
> > +
> > + /* optimistic test to avoid taking locks */
> > + if (!__yield_to_candidate(curr, p, p_rq))
> > + goto out_irq;
> > +
> > + /* if next buddy is set, assume yield is in progress */
> > + if (p_rq->cfs.next)
> > + goto out_irq;
> > +
> > double_rq_lock(rq, p_rq);
> > while (task_rq(p) != p_rq) {
> > double_rq_unlock(rq, p_rq);
> > goto again;
> > }
> >
> > - if (!curr->sched_class->yield_to_task)
> > - goto out;
> > -
> > - if (curr->sched_class != p->sched_class)
> > - goto out;
> > -
> > - if (task_running(p_rq, p) || p->state)
> > - goto out;
> > + /* validate state, holding p_rq ensures p's state cannot change */
> > + if (!__yield_to_candidate(curr, p, p_rq))
> > + goto out_unlock;
> >
> > yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> > if (yielded) {
> > @@ -4877,8 +4899,9 @@ again:
> > rq->skip_clock_update = 0;
> > }
> >
> > -out:
> > +out_unlock:
> > double_rq_unlock(rq, p_rq);
> > +out_irq:
> > local_irq_restore(flags);
> >
> > if (yielded)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
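The quoted yield_to() change follows a common double-checked pattern: reject candidates cheaply without the lock, then revalidate under the lock because the state may have changed in between. A minimal userspace sketch, with hypothetical task fields rather than the scheduler's:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical task fields; this models only the pattern. */
struct task { bool running; int state; };

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Analogue of __yield_to_candidate(): cheap precondition checks. */
static bool candidate_ok(const struct task *p)
{
	return !p->running && p->state == 0;	/* not on a cpu, runnable */
}

static bool try_yield_to(struct task *p)
{
	/* Optimistic, lock-free test: most rejects never take the lock. */
	if (!candidate_ok(p))
		return false;

	/* Revalidate under the lock, since the state may have changed
	 * between the unlocked test and acquiring the lock. */
	pthread_mutex_lock(&rq_lock);
	bool ok = candidate_ok(p);
	pthread_mutex_unlock(&rq_lock);
	return ok;
}
```

The win comes from how often the fast path rejects: every rejected candidate that skips double_rq_lock() is contention avoided.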
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-13 21:30 ` Andrew Theurer
2012-09-14 17:10 ` Andrew Jones
@ 2012-09-14 20:34 ` Konrad Rzeszutek Wilk
2012-09-17 8:02 ` Andrew Jones
2012-09-16 8:55 ` Avi Kivity
2 siblings, 1 reply; 41+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-09-14 20:34 UTC (permalink / raw)
To: habanero
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run. The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).
The patches that Raghavendra has been posting do accomplish that.
>
> On reducing the occurrence: The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue. This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not. To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on. Here is the result:
>
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement: 8823 +/- 3.20% much, much better
>
> On picking a better vcpu to yield to: I really hesitate to rely on
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to. I suspect we
> are yielding to way more vcpus than are preempted lock-holders, and that
> IMO is just work accomplishing nothing. Trying to think of ways to
> further reduce candidate vcpus....
... the patches are posted - you could try them out?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-14 17:10 ` Andrew Jones
@ 2012-09-15 16:08 ` Raghavendra K T
2012-09-17 13:48 ` Andrew Jones
0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-09-15 16:08 UTC (permalink / raw)
To: Andrew Jones
Cc: Andrew Theurer, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On 09/14/2012 10:40 PM, Andrew Jones wrote:
> On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
>> On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
>>> * Andrew Theurer<habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
>>>
[...]
>>
>> On picking a better vcpu to yield to: I really hesitate to rely on
>> paravirt hint [telling us which vcpu is holding a lock], but I am not
>> sure how else to reduce the candidate vcpus to yield to. I suspect we
>> are yielding to way more vcpus than are preempted lock-holders, and that
>> IMO is just work accomplishing nothing. Trying to think of ways to
>> further reduce candidate vcpus....
>>
>
> wrt to yielding to vcpus for the same cpu, I recently noticed that
> there's a bug in yield_to_task_fair. yield_task_fair() calls
> clear_buddies(), so if we're yielding to a task that has been running on
> the same cpu that we're currently running on, and thus is also on the
> current cfs runqueue, then our 'who to pick next' hint is getting cleared
> right after we set it.
>
> I had hoped that the patch below would show a general improvement in the
> vcpu overcommit performance, however the results were variable - no worse,
> no better. Based on your results above showing good improvement from
> interleaving vcpus across the cpus, then that means there was a decent
> percent of these types of yields going on. So since the patch didn't
> change much that indicates that the next hinting isn't generally taken
> too seriously by the scheduler. Anyway, the patch should correct the
> code per its design, and testing shows that it didn't make anything worse,
> so I'll post it soon. Also, in order to try and improve how far set-next
> can jump ahead in the queue, I tested a kernel with group scheduling
> compiled out (libvirt uses cgroups and I'm not sure autogroups may affect
> things). I did get slight improvement with that, but nothing to write home
> to mom about.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c219bf8..7d8a21d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
> if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
> return false;
>
> + /* We're yielding, so tell the scheduler we don't want to be picked */
> + yield_task_fair(rq);
> +
> /* Tell the scheduler that we'd really like pse to run next. */
> set_next_buddy(se);
>
> - yield_task_fair(rq);
> -
> return true;
> }
>
Hi Drew, I agree with your fix and have tested the patch too. The results
are pretty much the same; puzzled as to why.
Thinking about it... maybe we hit this when #vcpu (of a VM) > #pcpu?
(pigeonhole principle ;)).
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-13 21:30 ` Andrew Theurer
2012-09-14 17:10 ` Andrew Jones
2012-09-14 20:34 ` Konrad Rzeszutek Wilk
@ 2012-09-16 8:55 ` Avi Kivity
2012-09-17 8:10 ` Andrew Jones
2012-09-18 3:03 ` Andrew Theurer
2 siblings, 2 replies; 41+ messages in thread
From: Avi Kivity @ 2012-09-16 8:55 UTC (permalink / raw)
To: habanero
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run. The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).
>
> On reducing the occurrence: The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue. This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not. To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on. Here is the result:
>
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement: 8823 +/- 3.20% much, much better
>
> On picking a better vcpu to yield to: I really hesitate to rely on
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to. I suspect we
> are yielding to way more vcpus than are prempted lock-holders, and that
> IMO is just work accomplishing nothing. Trying to think of way to
> further reduce candidate vcpus....
I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
That other vcpu gets work done (unless it is in a pause loop itself) and
the yielding vcpu gets put to sleep for a while, so it doesn't spend
cycles spinning. While we haven't fixed the problem, at least the guest
is accomplishing work, and meanwhile the real lock holder may get
naturally scheduled and clear the lock.
The main problem with this theory is that the experiments don't seem to
bear it out. So maybe one of the assumptions is wrong - the yielding
vcpu gets scheduled early. That could be the case if the two vcpus are
on different runqueues - you could be changing the relative priority of
vcpus on the target runqueue, but still remain on top yourself. Is this
possible with the current code?
Maybe we should prefer vcpus on the same runqueue as yield_to targets,
and only fall back to remote vcpus when we see it didn't help.
Let's examine a few cases:
1. spinner on cpu 0, lock holder on cpu 0
win!
2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
Spinner gets put to sleep, random vcpus get to work, low lock contention
(no double_rq_lock), by the time spinner gets scheduled we might have won
3. spinner on cpu 0, another spinner on cpu 0
Worst case, we'll just spin some more. Need to detect this case and
migrate something in.
4. spinner on cpu 0, alone
Similar
It seems we need to tie in to the load balancer.
Would changing the priority of the task while it is spinning help the
load balancer?
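The "prefer vcpus on the same runqueue" heuristic suggested above could be prototyped with a simple selection loop. The structures and field names here are purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative vcpu descriptor: 'cpu' is the host runqueue it sits on. */
struct vcpu { int id; int cpu; int runnable; };

/*
 * Prefer a runnable vcpu on the spinner's own runqueue (case 1 above);
 * only fall back to the first runnable remote vcpu when no local
 * candidate exists.
 */
static struct vcpu *pick_target(struct vcpu *v, size_t n, int spinner_cpu)
{
	struct vcpu *remote = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!v[i].runnable)
			continue;
		if (v[i].cpu == spinner_cpu)
			return &v[i];	/* local candidate wins outright */
		if (!remote)
			remote = &v[i];	/* remember a remote fallback */
	}
	return remote;
}
```

Cases 3 and 4 (spinner alone, or paired with another spinner) return no local candidate here, which is where the tie-in to the load balancer would have to take over.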
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-14 20:34 ` Konrad Rzeszutek Wilk
@ 2012-09-17 8:02 ` Andrew Jones
0 siblings, 0 replies; 41+ messages in thread
From: Andrew Jones @ 2012-09-17 8:02 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: habanero, Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM,
chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
rkrcmar
On Fri, Sep 14, 2012 at 04:34:24PM -0400, Konrad Rzeszutek Wilk wrote:
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run. The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
>
> The patches that Raghavendra has been posting do accomplish that.
> >
> > On reducing the occurrence: The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue. This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not. To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on. Here is the result:
> >
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement: 8823 +/- 3.20% much, much better
> >
> > On picking a better vcpu to yield to: I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to. I suspect we
> > are yielding to way more vcpus than are preempted lock-holders, and that
> > IMO is just work accomplishing nothing. Trying to think of ways to
> > further reduce candidate vcpus....
>
> ... the patches are posted - you could try them out?
Radim and I have done some testing with the pvticketlock series. While we
saw a gain over PLE alone, it wasn't huge, and without PLE also enabled it
could hardly support 2.0x overcommit. Spinlocks aren't the only place
where cpu_relax() is called within a relatively tight loop, so it's likely
that PLE yielding just generally helps by getting schedule() called more
frequently.
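The kind of tight cpu_relax() loop being described looks roughly like this generic userspace sketch (not guest or kernel code; the PAUSE hint is x86-only):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int flag;

/* Generic relax hint: PAUSE on x86, plain no-op elsewhere. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/* A tight wait loop of the kind described: when run in a guest vcpu,
 * long runs of PAUSE in such a loop are what trigger a PLE exit. */
static void spin_wait(void)
{
	while (!atomic_load_explicit(&flag, memory_order_acquire))
		cpu_relax();
}
```

Any busy-wait of this shape, not just a spinlock slow path, benefits when the PLE handler yields and gets schedule() called sooner.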
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-16 8:55 ` Avi Kivity
@ 2012-09-17 8:10 ` Andrew Jones
2012-09-18 3:03 ` Andrew Theurer
1 sibling, 0 replies; 41+ messages in thread
From: Andrew Jones @ 2012-09-17 8:10 UTC (permalink / raw)
To: Avi Kivity
Cc: habanero, Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On Sun, Sep 16, 2012 at 11:55:28AM +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run. The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> >
> > On reducing the occurrence: The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue. This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not. To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on. Here is the result:
> >
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement: 8823 +/- 3.20% much, much better
> >
> > On picking a better vcpu to yield to: I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to. I suspect we
> > are yielding to way more vcpus than are preempted lock-holders, and that
> > IMO is just work accomplishing nothing. Trying to think of ways to
> > further reduce candidate vcpus....
>
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning. While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.
>
> The main problem with this theory is that the experiments don't seem to
> bear it out. So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early. That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself. Is this
> possible with the current code?
>
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.
I thought about this a bit recently too, but didn't pursue it, because I
figured it would actually increase the get_pid_task and double_rq_lock
contention time if we have to hunt too long for a vcpu that matches a more
strict criteria. But I guess if we can implement a special "reschedule"
to run on the current cpu which prioritizes runnable/non-running vcpus,
then it should be just as fast or faster to look through the runqueue
first as it is to look through all the vcpus first.
Drew
>
> Let's examine a few cases:
>
> 1. spinner on cpu 0, lock holder on cpu 0
>
> win!
>
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
>
> 3. spinner on cpu 0, another spinner on cpu 0
>
> Worst case, we'll just spin some more. Need to detect this case and
> migrate something in.
>
> 4. spinner on cpu 0, alone
>
> Similar
>
>
> It seems we need to tie in to the load balancer.
>
> Would changing the priority of the task while it is spinning help the
> load balancer?
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-15 16:08 ` Raghavendra K T
@ 2012-09-17 13:48 ` Andrew Jones
0 siblings, 0 replies; 41+ messages in thread
From: Andrew Jones @ 2012-09-17 13:48 UTC (permalink / raw)
To: Raghavendra K T
Cc: Andrew Theurer, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On Sat, Sep 15, 2012 at 09:38:54PM +0530, Raghavendra K T wrote:
> On 09/14/2012 10:40 PM, Andrew Jones wrote:
> >On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
> >>On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> >>>* Andrew Theurer<habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
> >>>
> [...]
> >>
> >>On picking a better vcpu to yield to: I really hesitate to rely on
> >>paravirt hint [telling us which vcpu is holding a lock], but I am not
> >>sure how else to reduce the candidate vcpus to yield to. I suspect we
> >>are yielding to way more vcpus than are preempted lock-holders, and that
> >>IMO is just work accomplishing nothing. Trying to think of ways to
> >>further reduce candidate vcpus....
> >>
> >
> >wrt to yielding to vcpus for the same cpu, I recently noticed that
> >there's a bug in yield_to_task_fair. yield_task_fair() calls
> >clear_buddies(), so if we're yielding to a task that has been running on
> >the same cpu that we're currently running on, and thus is also on the
> >current cfs runqueue, then our 'who to pick next' hint is getting cleared
> >right after we set it.
> >
> >I had hoped that the patch below would show a general improvement in the
> >vcpu overcommit performance, however the results were variable - no worse,
> >no better. Based on your results above showing good improvement from
> >interleaving vcpus across the cpus, then that means there was a decent
> >percent of these types of yields going on. So since the patch didn't
> >change much that indicates that the next hinting isn't generally taken
> >too seriously by the scheduler. Anyway, the patch should correct the
> >code per its design, and testing shows that it didn't make anything worse,
> >so I'll post it soon. Also, in order to try and improve how far set-next
> >can jump ahead in the queue, I tested a kernel with group scheduling
> >compiled out (libvirt uses cgroups and I'm not sure autogroups may affect
> >things). I did get slight improvement with that, but nothing to write home
> >to mom about.
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index c219bf8..7d8a21d 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
> > if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
> > return false;
> >
> >+ /* We're yielding, so tell the scheduler we don't want to be picked */
> >+ yield_task_fair(rq);
> >+
> > /* Tell the scheduler that we'd really like pse to run next. */
> > set_next_buddy(se);
> >
> >- yield_task_fair(rq);
> >-
> > return true;
> > }
> >
>
> Hi Drew, I agree with your fix and have tested the patch too. The results
> are pretty much the same; puzzled as to why.
Looking at the code I see that the next hint might be used more frequently
if we bump up sysctl/kernel.sched_wakeup_granularity_ns. I also just found
out that some virt tuned profiles do that, so maybe I should try running
with one of those profiles.
>
> thinking ... maybe we hit this when #vcpu (of a VM) > #pcpu?
> (pigeonhole principle ;)).
Not sure, but I haven't done any experiments where a single VM has more
vcpus than the system has pcpus. For my vcpu overcommit tests I increase
the VM count, where each VM has #vcpus <= #pcpus.
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-16 8:55 ` Avi Kivity
2012-09-17 8:10 ` Andrew Jones
@ 2012-09-18 3:03 ` Andrew Theurer
2012-09-19 13:39 ` Avi Kivity
1 sibling, 1 reply; 41+ messages in thread
From: Andrew Theurer @ 2012-09-18 3:03 UTC (permalink / raw)
To: Avi Kivity
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run. The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> >
> > On reducing the occurrence: The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue. This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not. To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on. Here is the result:
> >
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement: 8823 +/- 3.20% much, much better
> >
> > On picking a better vcpu to yield to: I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to. I suspect we
> > are yielding to way more vcpus than are preempted lock-holders, and that
> > IMO is just work accomplishing nothing. Trying to think of ways to
> > further reduce candidate vcpus....
>
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning. While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.
OK, yes, if the other thread gets useful work done, then it is not
wasteful. I was thinking of the worst case scenario, where any other
vcpu would likely spin as well, and the host side cpu-time for switching
vcpu threads was not all that productive. Well, I suppose it does help
eliminate potential lock holding vcpus; it just seems to be not that
efficient or fast enough.
> The main problem with this theory is that the experiments don't seem to
> bear it out.
Granted, my test case is quite brutal. It's nothing but over-committed
VMs which always have some spin lock activity. However, we really
should try to fix the worst case scenario.
> So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early. That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself. Is this
> possible with the current code?
>
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.
>
> Let's examine a few cases:
>
> 1. spinner on cpu 0, lock holder on cpu 0
>
> win!
>
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
>
> 3. spinner on cpu 0, another spinner on cpu 0
>
> Worst case, we'll just spin some more. Need to detect this case and
> migrate something in.
Well, we can certainly experiment and see what we get.
IMO, the key to getting this working really well on the large VMs is
finding the lock-holding cpu -quickly-. What I think is happening is
that we go through a relatively long process to get to that one right
vcpu. I guess I need to find a faster way to get there.
> 4. spinner on cpu 0, alone
>
> Similar
>
>
> It seems we need to tie in to the load balancer.
>
> Would changing the priority of the task while it is spinning help the
> load balancer?
Not sure.
-Andrew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
2012-09-18 3:03 ` Andrew Theurer
@ 2012-09-19 13:39 ` Avi Kivity
0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-09-19 13:39 UTC (permalink / raw)
To: habanero
Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
LKML, X86, Gleb Natapov, Srivatsa Vaddagiri
On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>>
>> > The concern I have is that even though we have gone through changes to
>> > help reduce the candidate vcpus we yield to, we still have a very poor
>> > idea of which vcpu really needs to run. The result is high cpu usage in
>> > the get_pid_task and still some contention in the double runqueue lock.
>> > To make this scalable, we either need to significantly reduce the
>> > occurrence of the lock-holder preemption, or do a much better job of
>> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
>> > which do not need to run).
>> >
>> > On reducing the occurrence: The worst case for lock-holder preemption
>> > is having vcpus of same VM on the same runqueue. This guarantees the
>> > situation of 1 vcpu running while another [of the same VM] is not. To
>> > prove the point, I ran the same test, but with vcpus restricted to a
>> > range of host cpus, such that any single VM's vcpus can never be on the
>> > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
>> > vcpu-1's are on host cpus 5-9, and so on. Here is the result:
>> >
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement: 8823 +/- 3.20% much, much better
>> >
>> > On picking a better vcpu to yield to: I really hesitate to rely on
>> > paravirt hint [telling us which vcpu is holding a lock], but I am not
>> > sure how else to reduce the candidate vcpus to yield to. I suspect we
>> > are yielding to way more vcpus than are preempted lock-holders, and that
>> > IMO is just work accomplishing nothing. Trying to think of a way to
>> > further reduce candidate vcpus....
>>
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in a pause loop itself) and
>> the yielding vcpu gets put to sleep for a while, so it doesn't spend
>> cycles spinning. While we haven't fixed the problem at least the guest
>> is accomplishing work, and meanwhile the real lock holder may get
>> naturally scheduled and clear the lock.
>
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful. I was thinking of the worst case scenario, where any other
> vcpu would likely spin as well, and the host side cpu-time for switching
> vcpu threads was not all that productive. Well, I suppose it does help
> eliminate potential lock holding vcpus; it just seems to be not that
> efficient or fast enough.
If we have N-1 vcpus spinwaiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right. Eventually
its not-a-timeslice will expire and we go through this again. If
N*t_yield is comparable to the timeslice, we start losing efficiency.
Because of lock contention, t_yield can scale with the number of host
cpus. So in this worst case, we get quadratic behaviour.
One way out is to increase the not-a-timeslice. Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?
That's not guaranteed to help, if we boost a running vcpu too much it
will skew how vcpu runtime is distributed even after the lock is released.
>
>> The main problem with this theory is that the experiments don't seem to
>> bear it out.
>
> Granted, my test case is quite brutal. It's nothing but over-committed
> VMs which always have some spin lock activity. However, we really
> should try to fix the worst case scenario.
Yes. And other guests may not scale as well as Linux, so they may show
this behaviour more often.
>
>> So maybe one of the assumptions is wrong - the yielding
>> vcpu gets scheduled early. That could be the case if the two vcpus are
>> on different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself. Is this
>> possible with the current code?
>>
>> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
>> and only fall back to remote vcpus when we see it didn't help.
>>
>> Let's examine a few cases:
>>
>> 1. spinner on cpu 0, lock holder on cpu 0
>>
>> win!
>>
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>>
>> Spinner gets put to sleep, random vcpus get to work, low lock contention
>> (no double_rq_lock), by the time spinner gets scheduled we might have won
>>
>> 3. spinner on cpu 0, another spinner on cpu 0
>>
>> Worst case, we'll just spin some more. Need to detect this case and
>> migrate something in.
>
> Well, we can certainly experiment and see what we get.
>
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding cpu -quickly-. What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu. I guess I need to find a faster way to get there.
pvspinlocks will find the right one, every time. Otherwise I see no way
to do this.
--
error compiling committee.c: too many arguments to function
end of thread, other threads:[~2012-09-19 13:40 UTC | newest]
Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
2012-07-18 14:39 ` Raghavendra K T
2012-07-19 9:47 ` [RESEND PATCH " Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-22 12:34 ` Raghavendra K T
2012-07-22 12:43 ` Avi Kivity
2012-07-23 7:35 ` Christian Borntraeger
2012-07-22 17:58 ` Rik van Riel
2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11 ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
2012-09-07 18:06 ` Raghavendra K T
2012-09-07 19:42 ` Andrew Theurer
2012-09-08 8:43 ` Srikar Dronamraju
2012-09-10 13:16 ` Andrew Theurer
2012-09-10 16:03 ` Peter Zijlstra
2012-09-10 16:56 ` Srikar Dronamraju
2012-09-10 17:12 ` Peter Zijlstra
2012-09-10 19:10 ` Raghavendra K T
2012-09-10 20:12 ` Andrew Theurer
2012-09-10 20:19 ` Peter Zijlstra
2012-09-10 20:31 ` Rik van Riel
2012-09-11 6:08 ` Raghavendra K T
2012-09-11 12:48 ` Andrew Theurer
2012-09-11 18:27 ` Andrew Theurer
2012-09-13 11:48 ` Raghavendra K T
2012-09-13 21:30 ` Andrew Theurer
2012-09-14 17:10 ` Andrew Jones
2012-09-15 16:08 ` Raghavendra K T
2012-09-17 13:48 ` Andrew Jones
2012-09-14 20:34 ` Konrad Rzeszutek Wilk
2012-09-17 8:02 ` Andrew Jones
2012-09-16 8:55 ` Avi Kivity
2012-09-17 8:10 ` Andrew Jones
2012-09-18 3:03 ` Andrew Theurer
2012-09-19 13:39 ` Avi Kivity
2012-09-13 12:13 ` Avi Kivity
2012-09-11 7:04 ` Srikar Dronamraju
2012-09-10 14:43 ` Raghavendra K T