* [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
@ 2012-07-18 13:37 Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
                    ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

Currently the Pause Loop Exit (PLE) handler does a directed yield to a
random vcpu on PL-exit. We already have filtering while choosing the
candidate to yield_to. This change adds more checks while choosing a
candidate to yield_to.

On large vcpu guests, there is a high probability of yielding to the
same vcpu that had recently done a pause-loop exit. Such a yield can
lead to the vcpu spinning again.

The patchset keeps track of the pause loop exit and gives a chance to a
vcpu which has:

(a) Not done a pause loop exit at all (probably it is a preempted
lock-holder)

(b) Been skipped in the last iteration because it did a pause loop exit,
and has probably become eligible now (next eligible lock holder)

This concept also helps in the cpu relax interception cases which use
the same handler.
Changes since V4:
 - Naming Change (Avi):
	struct ple ==> struct spin_loop
	cpu_relax_intercepted ==> in_spin_loop
	vcpu_check_and_update_eligible ==> vcpu_eligible_for_directed_yield
 - mark a vcpu in a spinloop as not eligible, to avoid influence of the
   previous exit

Changes since V3:
 - arch specific fix/changes (Christian)

Changes since v2:
 - Move the ple structure to common code (Avi)
 - rename pause_loop_exited to cpu_relax_intercepted (Avi)
 - add config HAVE_KVM_CPU_RELAX_INTERCEPT (Avi)
 - Drop superfluous curly braces (Ingo)

Changes since v1:
 - Add more documentation for the structure and algorithm, and rename
   plo ==> ple (Rik).
 - change the dy_eligible initial value to false (otherwise the very
   first directed yield will not be skipped) (Nikunj)
 - fixup signoff/from issue

Future enhancements:
(1) Currently we have a boolean to decide on the eligibility of a vcpu.
It would be nice to get feedback on large guests (>32 vcpu) on whether
we can do better with an integer counter (with counter = say f(log n)).

(2) We have not considered system load during iteration of the vcpus.
With that information we can limit the scan and also decide whether
schedule() is better. [I am able to use the number of kicked vcpus to
decide on this, but maybe there are better ideas, like information from
the global loadavg.]

(3) We can exploit this further with the PV patches, since it also knows
about the next eligible lock-holder.

Summary: There is a very good improvement for kvm based guests on PLE
machines. V5 has a huge improvement for kernbench.
+-----------+-----------+-----------+------------+-----------+
               base_rik    stdev      patched      stdev      %improve
+-----------+-----------+-----------+------------+-----------+
              kernbench (time in sec, lesser is better)
+-----------+-----------+-----------+------------+-----------+
 1x          49.2300     1.0171     22.6842      0.3073     117.0233 %
 2x          91.9358     1.7768     53.9608      1.0154      70.37516 %
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
              ebizzy (records/sec, more is better)
+-----------+-----------+-----------+------------+-----------+
 1x        1129.2500    28.6793   2125.6250     32.8239     88.23334 %
 2x        1892.3750    75.1112   2377.1250    181.6822     25.61596 %
+-----------+-----------+-----------+------------+-----------+

Note: The patches are tested on x86.

Links
 V4: https://lkml.org/lkml/2012/7/16/80
 V3: https://lkml.org/lkml/2012/7/12/437
 V2: https://lkml.org/lkml/2012/7/10/392
 V1: https://lkml.org/lkml/2012/7/9/32

Raghavendra K T (3):
  config: Add config to support ple or cpu relax optimzation
  kvm : Note down when cpu relax intercepted or pause loop exited
  kvm : Choose a better candidate for directed yield
---
 arch/s390/kvm/Kconfig    |    1 +
 arch/x86/kvm/Kconfig     |    1 +
 include/linux/kvm_host.h |   39 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/Kconfig         |    3 +++
 virt/kvm/kvm_main.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 85 insertions(+), 0 deletions(-)

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-18 13:37 ` Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
                    ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/s390/kvm/Kconfig |    1 +
 arch/x86/kvm/Kconfig  |    1 +
 virt/kvm/Kconfig      |    3 +++
 3 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 78eb984..a6e2677 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	depends on HAVE_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
 	select ANON_INODES
+	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
 	  virtualization capability on the mainframe. This should work
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a28f338..45c044f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -37,6 +37,7 @@ config KVM
 	select TASK_DELAY_ACCT
 	select PERF_EVENTS
 	select HAVE_KVM_MSI
+	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	---help---
 	  Support hosting fully virtualized guest machines using
 	  hardware virtualization extensions.  You will need a fairly recent
 	  processor equipped with virtualization extensions.
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 28694f4..d01b24b 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -21,3 +21,6 @@ config KVM_ASYNC_PF
 
 config HAVE_KVM_MSI
 	bool
+
+config HAVE_KVM_CPU_RELAX_INTERCEPT
+	bool

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
@ 2012-07-18 13:37 ` Raghavendra K T
  2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
                    ` (2 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Noting down when a vcpu has pause-loop exited or had its cpu relax
intercepted helps in filtering the right candidate to yield to. A wrong
selection of vcpu, i.e. a vcpu that just did a PL-exit or had its cpu
relax intercepted, may contribute to performance degradation.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 V2 was: Reviewed-by: Rik van Riel <riel@redhat.com>

 include/linux/kvm_host.h |   34 ++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      |    5 +++++
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c446435..34ce296 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -183,6 +183,18 @@ struct kvm_vcpu {
 	} async_pf;
 #endif
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+	/*
+	 * Cpu relax intercept or pause loop exit optimization
+	 * in_spin_loop: set when a vcpu does a pause loop exit
+	 *  or cpu relax intercepted.
+	 * dy_eligible: indicates whether vcpu is eligible for directed yield.
+	 */
+	struct {
+		bool in_spin_loop;
+		bool dy_eligible;
+	} spin_loop;
+#endif
 	struct kvm_vcpu_arch arch;
 };
 
@@ -890,5 +902,27 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 	}
 }
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+
+static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
+{
+	vcpu->spin_loop.in_spin_loop = val;
+}
+
+static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
+{
+	vcpu->spin_loop.dy_eligible = val;
+}
+
+#else /* !CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
+
+static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
+static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
+#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..3d6ffc8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -236,6 +236,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	}
 	vcpu->run = page_address(page);
 
+	kvm_vcpu_set_in_spin_loop(vcpu, false);
+	kvm_vcpu_set_dy_eligible(vcpu, false);
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
 		goto fail_free_run;
@@ -1577,6 +1580,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;
 
+	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
@@ -1602,6 +1606,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 		}
 	}
+	kvm_vcpu_set_in_spin_loop(me, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
@ 2012-07-18 13:38 ` Raghavendra K T
  2012-07-18 14:39   ` Raghavendra K T
  2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
  2012-07-23 10:03 ` Avi Kivity
  4 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:38 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Currently, on large vcpu guests, there is a high probability of yielding
to the same vcpu that had recently done a pause-loop exit or had its cpu
relax intercepted. Such a yield can lead to the vcpu spinning again and
hence degrade the performance.

The patchset keeps track of the pause loop exit/cpu relax interception
and gives a chance to a vcpu which:

(a) Has not done a pause loop exit or had its cpu relax intercepted at
all (probably it is a preempted lock-holder)

(b) Was skipped in the last iteration because it did a pause loop exit
or had its cpu relax intercepted, and has probably become eligible now
(next eligible lock holder)

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 V2 was: Reviewed-by: Rik van Riel <riel@redhat.com>

 include/linux/kvm_host.h |    5 +++++
 virt/kvm/kvm_main.c      |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34ce296..952427d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -923,6 +923,11 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
 }
 
+static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	return true;
+}
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3d6ffc8..bf9fb97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,6 +1571,39 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+/*
+ * Helper that checks whether a VCPU is eligible for directed yield.
+ * The most eligible candidate to yield to is decided by the following
+ * heuristics:
+ *
+ * (a) VCPU which has not done pl-exit or cpu relax intercepted recently
+ * (preempted lock holder), indicated by @in_spin_loop.
+ * Set at the beginning and cleared at the end of interception/PLE handler.
+ *
+ * (b) VCPU which has done pl-exit/cpu relax intercepted but did not get
+ * a chance last time (mostly it has become eligible now since we have
+ * probably yielded to the lockholder in the last iteration. This is done
+ * by toggling @dy_eligible each time a VCPU is checked for eligibility.)
+ *
+ * Yielding to a recently pl-exited/cpu relax intercepted VCPU before
+ * yielding to a preempted lock-holder could result in wrong VCPU
+ * selection and CPU burning. Giving priority to a potential lock-holder
+ * increases lock progress.
+ */
+bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->spin_loop.in_spin_loop ||
+		   (vcpu->spin_loop.in_spin_loop &&
+		    vcpu->spin_loop.dy_eligible);
+
+	if (vcpu->spin_loop.in_spin_loop)
+		vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;
+
+	return eligible;
+}
+#endif
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
@@ -1599,6 +1632,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
 				continue;
+			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+				continue;
 			if (kvm_vcpu_yield_to(vcpu)) {
 				kvm->last_boosted_vcpu = i;
 				yielded = 1;
@@ -1607,6 +1642,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 		}
 	}
 	kvm_vcpu_set_in_spin_loop(me, false);
+	kvm_vcpu_set_dy_eligible(me, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
  2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-18 14:39   ` Raghavendra K T
  2012-07-19  9:47     ` [RESEND PATCH " Raghavendra K T
  0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 14:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel, Srikar, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/18/2012 07:08 PM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

> +bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
> +{
> +	bool eligible;
> +
> +	eligible = !vcpu->spin_loop.in_spin_loop ||
> +		   (vcpu->spin_loop.in_spin_loop &&
> +		    vcpu->spin_loop.dy_eligible);
> +
> +	if (vcpu->spin_loop.in_spin_loop)
> +		vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;
> +
> +	return eligible;
> +}

I should have added a comment like:

Since the algorithm is based on heuristics, accessing another vcpu's
data without locking does not harm. It may result in trying to yield to
the same VCPU, failing, and continuing with the next, and so on.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [RESEND PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
  2012-07-18 14:39 ` Raghavendra K T
@ 2012-07-19  9:47   ` Raghavendra K T
  0 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-19 9:47 UTC (permalink / raw)
  To: Raghavendra K T, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar
  Cc: S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel

Currently, on large vcpu guests, there is a high probability of yielding
to the same vcpu that had recently done a pause-loop exit or had its cpu
relax intercepted. Such a yield can lead to the vcpu spinning again and
hence degrade the performance.

The patchset keeps track of the pause loop exit/cpu relax interception
and gives a chance to a vcpu which:

(a) Has not done a pause loop exit or had its cpu relax intercepted at
all (probably it is a preempted lock-holder)

(b) Was skipped in the last iteration because it did a pause loop exit
or had its cpu relax intercepted, and has probably become eligible now
(next eligible lock holder)

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 V2 was: Reviewed-by: Rik van Riel <riel@redhat.com>

 Changelog: Added comment on locking as suggested by Avi

 include/linux/kvm_host.h |    5 +++++
 virt/kvm/kvm_main.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34ce296..952427d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -923,6 +923,11 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
 }
 
+static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	return true;
+}
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3d6ffc8..8fda756 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,6 +1571,43 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+/*
+ * Helper that checks whether a VCPU is eligible for directed yield.
+ * The most eligible candidate to yield to is decided by the following
+ * heuristics:
+ *
+ * (a) VCPU which has not done pl-exit or cpu relax intercepted recently
+ * (preempted lock holder), indicated by @in_spin_loop.
+ * Set at the beginning and cleared at the end of interception/PLE handler.
+ *
+ * (b) VCPU which has done pl-exit/cpu relax intercepted but did not get
+ * a chance last time (mostly it has become eligible now since we have
+ * probably yielded to the lockholder in the last iteration. This is done
+ * by toggling @dy_eligible each time a VCPU is checked for eligibility.)
+ *
+ * Yielding to a recently pl-exited/cpu relax intercepted VCPU before
+ * yielding to a preempted lock-holder could result in wrong VCPU
+ * selection and CPU burning. Giving priority to a potential lock-holder
+ * increases lock progress.
+ *
+ * Since the algorithm is based on heuristics, accessing another VCPU's
+ * data without locking does not harm. It may result in trying to yield
+ * to the same VCPU, failing, and continuing with the next VCPU, and so on.
+ */
+bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->spin_loop.in_spin_loop ||
+		   (vcpu->spin_loop.in_spin_loop &&
+		    vcpu->spin_loop.dy_eligible);
+
+	if (vcpu->spin_loop.in_spin_loop)
+		kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);
+
+	return eligible;
+}
+#endif
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
@@ -1599,6 +1636,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
 				continue;
+			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+				continue;
 			if (kvm_vcpu_yield_to(vcpu)) {
 				kvm->last_boosted_vcpu = i;
 				yielded = 1;
@@ -1607,6 +1646,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 		}
 	}
 	kvm_vcpu_set_in_spin_loop(me, false);
+
+	/* Ensure vcpu is not eligible during next spinloop */
+	kvm_vcpu_set_dy_eligible(me, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
                   ` (2 preceding siblings ...)
  2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-20 17:36 ` Marcelo Tosatti
  2012-07-22 12:34   ` Raghavendra K T
  2012-07-23 10:03 ` Avi Kivity
  4 siblings, 1 reply; 41+ messages in thread
From: Marcelo Tosatti @ 2012-07-20 17:36 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, S390, Carsten Otte, Christian Borntraeger,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
>
> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
> random vcpu on pl-exit. We already have filtering while choosing
> the candidate to yield_to. This change adds more checks while choosing
> a candidate to yield_to.
>
> [...]

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
@ 2012-07-22 12:34   ` Raghavendra K T
  2012-07-22 12:43     ` Avi Kivity
  2012-07-22 17:58     ` Rik van Riel
  0 siblings, 2 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-22 12:34 UTC (permalink / raw)
  To: Marcelo Tosatti, Avi Kivity, Rik van Riel, Christian Borntraeger
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:
> On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
>> [...]
>
> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
>

Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
if this series looks good now.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-22 12:34 ` Raghavendra K T
@ 2012-07-22 12:43   ` Avi Kivity
  2012-07-23  7:35     ` Christian Borntraeger
  2012-07-22 17:58   ` Rik van Riel
  1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-07-22 12:43 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Marcelo Tosatti, Rik van Riel, Christian Borntraeger,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/22/2012 03:34 PM, Raghavendra K T wrote:
>
> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
> if this series looks good now.
>

It looks fine to me. Christian, is this okay for s390?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-22 12:43 ` Avi Kivity
@ 2012-07-23  7:35   ` Christian Borntraeger
  0 siblings, 0 replies; 41+ messages in thread
From: Christian Borntraeger @ 2012-07-23 7:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Marcelo Tosatti, Rik van Riel, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Srikar, S390, Carsten Otte, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 22/07/12 14:43, Avi Kivity wrote:
> On 07/22/2012 03:34 PM, Raghavendra K T wrote:
>>
>> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
>> if this series looks good now.
>>
>
> It looks fine to me. Christian, is this okay for s390?
>

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-22 12:34 ` Raghavendra K T
  2012-07-22 12:43   ` Avi Kivity
@ 2012-07-22 17:58   ` Rik van Riel
  1 sibling, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2012-07-22 17:58 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Marcelo Tosatti, Avi Kivity, Christian Borntraeger,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/22/2012 08:34 AM, Raghavendra K T wrote:
> On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:
>> On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
>>> [...]
>>
>> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
>>
>
> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
> if this series looks good now.

The series looks good to me.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler 2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T ` (3 preceding siblings ...) 2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti @ 2012-07-23 10:03 ` Avi Kivity 2012-09-07 13:11 ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer 4 siblings, 1 reply; 41+ messages in thread From: Avi Kivity @ 2012-07-23 10:03 UTC (permalink / raw) To: Raghavendra K T Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/18/2012 04:37 PM, Raghavendra K T wrote: > Currently Pause Loop Exit (PLE) handler is doing directed yield to a > random vcpu on pl-exit. We already have filtering while choosing > the candidate to yield_to. This change adds more checks while choosing > a candidate to yield_to. > > On a large vcpu guests, there is a high probability of > yielding to the same vcpu who had recently done a pause-loop exit. > Such a yield can lead to the vcpu spinning again. > > The patchset keeps track of the pause loop exit and gives chance to a > vcpu which has: > > (a) Not done pause loop exit at all (probably he is preempted lock-holder) > > (b) vcpu skipped in last iteration because it did pause loop exit, and > probably has become eligible now (next eligible lock holder) > > This concept also helps in cpu relax interception cases which use same handler. > Thanks, applied to 'queue'. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 41+ messages in thread
* [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-07-23 10:03 ` Avi Kivity @ 2012-09-07 13:11 ` Andrew Theurer 2012-09-07 18:06 ` Raghavendra K T 0 siblings, 1 reply; 41+ messages in thread From: Andrew Theurer @ 2012-09-07 13:11 UTC (permalink / raw) To: Avi Kivity Cc: Raghavendra K T, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

I have noticed recently that PLE/yield_to() is still not that scalable for really large guests, sometimes even with no CPU over-commit. I have a small change that makes a very big difference.

First, let me explain what I saw:

Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80 thread Westmere-EX system: 645 seconds!

Host cpu: ~98% in kernel, nearly all of it in spin_lock from double runqueue lock for yield_to()

So, I added some schedstats to yield_to(), one to count when we failed this test in yield_to():

	if (task_running(p_rq, p) || p->state)

and one when we pass all the conditions and get to actually yield:

	yielded = curr->sched_class->yield_to_task(rq, p, preempt);

And during boot up of this guest, I saw:

	failed yield_to() because task is running: 8368810426
	successful yield_to(): 13077658
	  0.156022% of yield_to calls
	  1 out of 640 yield_to calls

Obviously, we have a problem. Every exit causes a loop over 80 vcpus, each one trying to get two locks. This is happening on all [but one] vcpus at around the same time. Not going to work well.

So, since the check for a running task is nearly always true, I moved that -before- the double runqueue lock, so 99.84% of the attempts do not take the locks. Now, I do not know if this [not getting the locks] is a problem. However, I'd rather have a slightly inaccurate test for a running vcpu than burn 98% of CPU in the host kernel. With the change the VM boot time went to 100 seconds, an 85% reduction in time.
I also wanted to check to see this did not affect truly over-committed situations, so I first started with smaller VMs at 2x cpu over-commit:

	16 VMs, 8-way each, all running dbench (2x cpu over-commit)
	                 throughput +/- stddev
	                 ---------------------
	ple off:               2281 +/- 7.32%  (really bad as expected)
	ple on:               19796 +/- 1.36%
	ple on: w/fix:        19796 +/- 1.37%  (no degrade at all)

In this case the VMs are small enough that we do not loop through enough vcpus to trigger the problem. Host CPU is very low (3-4% range) for both default ple and with the yield_to() fix.

So I went on to a bigger VM:

	10 VMs, 16-way each, all running dbench (2x cpu over-commit)
	                 throughput +/- stddev
	                 ---------------------
	ple on:                2552 +/- .70%
	ple on: w/fix:         4621 +/- 2.12%  (81% improvement!)

This is where we start seeing a major difference. Without the fix, host cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and guest went from 30 to 40%). I believe this is on the right track to reduce the spin lock contention, still get proper directed yield, and therefore improve the guest CPU available and its performance.

However, we still have lock contention, and I think we can reduce it even more. We have eliminated some attempts at double runqueue lock acquire because the check for whether the target vcpu is running is now before the lock. However, even if the target-to-yield-to vcpu [for the same guest upon which we PLE exited] is not running, the physical processor/runqueue that the target-to-yield-to vcpu is located on could be running a different VM's vcpu -and- going through a directed yield; therefore that run queue lock may already be acquired. We do not want to just spin and wait, we want to move to the next candidate vcpu. We need a check to see if the smp processor/runqueue is already in a directed yield. Or, perhaps we just check if that cpu is not in guest mode, and if so, we skip that yield attempt for that vcpu and move to the next candidate vcpu.
So, my question is: given a runqueue, what's the best way to check if that corresponding phys cpu is not in guest mode?

Here are the changes so far (schedstat changes not included here):

signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..f8eff8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	if (task_running(p_rq, p) || p->state) {
+		goto out_no_unlock;
+	}
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
 	if (curr->sched_class != p->sched_class)
 		goto out;
 
-	if (task_running(p_rq, p) || p->state)
-		goto out;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4879,6 +4880,7 @@ again:
 
 out:
 	double_rq_unlock(rq, p_rq);
+out_no_unlock:
 	local_irq_restore(flags);
 
 	if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-07 13:11 ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer @ 2012-09-07 18:06 ` Raghavendra K T 2012-09-07 19:42 ` Andrew Theurer 0 siblings, 1 reply; 41+ messages in thread From: Raghavendra K T @ 2012-09-07 18:06 UTC (permalink / raw) To: habanero Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri, Peter Zijlstra CCing PeterZ also. On 09/07/2012 06:41 PM, Andrew Theurer wrote: > I have noticed recently that PLE/yield_to() is still not that scalable > for really large guests, sometimes even with no CPU over-commit. I have > a small change that make a very big difference. > > First, let me explain what I saw: > > Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80 > thread Westmere-EX system: 645 seconds! > > Host cpu: ~98% in kernel, nearly all of it in spin_lock from double > runqueue lock for yield_to() > > So, I added some schedstats to yield_to(), one to count when we failed > this test in yield_to() > > if (task_running(p_rq, p) || p->state) > > and one when we pass all the conditions and get to actually yield: > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > > > And during boot up of this guest, I saw: > > > failed yield_to() because task is running: 8368810426 > successful yield_to(): 13077658 > 0.156022% of yield_to calls > 1 out of 640 yield_to calls > > Obviously, we have a problem. Every exit causes a loop over 80 vcpus, > each one trying to get two locks. This is happening on all [but one] > vcpus at around the same time. Not going to work well. > True and interesting. I had once thought of reducing overall O(n^2) iteration to O(n log(n)) iterations by reducing number of candidates to search to O(log(n)) instead of current O(n). May be I have to get back to my experiment modes. 
> So, since the check for a running task is nearly always true, I moved > that -before- the double runqueue lock, so 99.84% of the attempts do not > take the locks. Now, I do not know is this [not getting the locks] is a > problem. However, I'd rather have a little inaccurate test for a > running vcpu than burning 98% of CPU in host kernel. With the change > the VM boot time went to: 100 seconds, an 85% reduction in time. > > I also wanted to check to see this did not affect truly over-committed > situations, so I first started with smaller VMs at 2x cpu over-commit: > > 16 VMs, 8-way each, all running dbench (2x cpu over-commmit) > throughput +/- stddev > ----- ----- > ple off: 2281 +/- 7.32% (really bad as expected) > ple on: 19796 +/- 1.36% > ple on: w/fix: 19796 +/- 1.37% (no degrade at all) > > In this case the VMs are small enough, that we do not loop through > enough vcpus to trigger the problem. host CPU is very low (3-4% range) > for both default ple and with yield_to() fix. > > So I went on to a bigger VM: > > 10 VMs, 16-way each, all running dbench (2x cpu over-commit) > throughput +/- stddev > ----- ----- > ple on: 2552 +/- .70% > ple on: w/fix: 4621 +/- 2.12% (81% improvement!) > > This is where we start seeing a major difference. Without the fix, host > cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and > guest went from 30 to 40%). I believe this is on the right track to > reduce the spin lock contention, still get proper directed yield, and > therefore improve the guest CPU available and its performance. > > However, we still have lock contention, and I think we can reduce it > even more. We have eliminated some attempts at double runqueue lock > acquire because the check for the target vcpu is running is now before > the lock. 
However, even if the target-to-yield-to vcpu [for the same > guest upon we PLE exited] is not running, the physical > processor/runqueue that target-to-yield-to vcpu is located on could be > running a different VM's vcpu -and- going through a directed yield, > therefore that run queue lock may already acquired. We do not want to > just spin and wait, we want to move to the next candidate vcpu. We need > a check to see if the smp processor/runqueue is already in a directed > yield. Or, perhaps we just check if that cpu is not in guest mode, and > if so, we skip that yield attempt for that vcpu and move to the next > candidate vcpu. So, my question is: given a runqueue, what's the best > way to check if that corresponding phys cpu is not in guest mode? > We are indeed avoiding CPUS in guest mode when we check task->flags & PF_VCPU in vcpu_on_spin path. Doesn't that suffice? > Here's the changes so far (schedstat changes not included here): > > signed-off-by: Andrew Theurer<habanero@linux.vnet.ibm.com> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fbf1fd0..f8eff8c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool > preempt) > > again: > p_rq = task_rq(p); > + if (task_running(p_rq, p) || p->state) { > + goto out_no_unlock; > + } > double_rq_lock(rq, p_rq); > while (task_rq(p) != p_rq) { > double_rq_unlock(rq, p_rq); > @@ -4856,8 +4859,6 @@ again: > if (curr->sched_class != p->sched_class) > goto out; > > - if (task_running(p_rq, p) || p->state) > - goto out; > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > if (yielded) { > @@ -4879,6 +4880,7 @@ again: > > out: > double_rq_unlock(rq, p_rq); > +out_no_unlock: > local_irq_restore(flags); > > if (yielded) > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-07 18:06 ` Raghavendra K T @ 2012-09-07 19:42 ` Andrew Theurer 2012-09-08 8:43 ` Srikar Dronamraju 2012-09-10 14:43 ` Raghavendra K T 0 siblings, 2 replies; 41+ messages in thread From: Andrew Theurer @ 2012-09-07 19:42 UTC (permalink / raw) To: Raghavendra K T Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri, Peter Zijlstra On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote: > CCing PeterZ also. > > On 09/07/2012 06:41 PM, Andrew Theurer wrote: > > I have noticed recently that PLE/yield_to() is still not that scalable > > for really large guests, sometimes even with no CPU over-commit. I have > > a small change that make a very big difference. > > > > First, let me explain what I saw: > > > > Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80 > > thread Westmere-EX system: 645 seconds! > > > > Host cpu: ~98% in kernel, nearly all of it in spin_lock from double > > runqueue lock for yield_to() > > > > So, I added some schedstats to yield_to(), one to count when we failed > > this test in yield_to() > > > > if (task_running(p_rq, p) || p->state) > > > > and one when we pass all the conditions and get to actually yield: > > > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > > > > > > And during boot up of this guest, I saw: > > > > > > failed yield_to() because task is running: 8368810426 > > successful yield_to(): 13077658 > > 0.156022% of yield_to calls > > 1 out of 640 yield_to calls > > > > Obviously, we have a problem. Every exit causes a loop over 80 vcpus, > > each one trying to get two locks. This is happening on all [but one] > > vcpus at around the same time. Not going to work well. > > > > True and interesting. 
I had once thought of reducing overall O(n^2) > iteration to O(n log(n)) iterations by reducing number of candidates > to search to O(log(n)) instead of current O(n). May be I have to get > back to my experiment modes. > > > So, since the check for a running task is nearly always true, I moved > > that -before- the double runqueue lock, so 99.84% of the attempts do not > > take the locks. Now, I do not know is this [not getting the locks] is a > > problem. However, I'd rather have a little inaccurate test for a > > running vcpu than burning 98% of CPU in host kernel. With the change > > the VM boot time went to: 100 seconds, an 85% reduction in time. > > > > I also wanted to check to see this did not affect truly over-committed > > situations, so I first started with smaller VMs at 2x cpu over-commit: > > > > 16 VMs, 8-way each, all running dbench (2x cpu over-commmit) > > throughput +/- stddev > > ----- ----- > > ple off: 2281 +/- 7.32% (really bad as expected) > > ple on: 19796 +/- 1.36% > > ple on: w/fix: 19796 +/- 1.37% (no degrade at all) > > > > In this case the VMs are small enough, that we do not loop through > > enough vcpus to trigger the problem. host CPU is very low (3-4% range) > > for both default ple and with yield_to() fix. > > > > So I went on to a bigger VM: > > > > 10 VMs, 16-way each, all running dbench (2x cpu over-commit) > > throughput +/- stddev > > ----- ----- > > ple on: 2552 +/- .70% > > ple on: w/fix: 4621 +/- 2.12% (81% improvement!) > > > > This is where we start seeing a major difference. Without the fix, host > > cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and > > guest went from 30 to 40%). I believe this is on the right track to > > reduce the spin lock contention, still get proper directed yield, and > > therefore improve the guest CPU available and its performance. > > > > However, we still have lock contention, and I think we can reduce it > > even more. 
We have eliminated some attempts at double runqueue lock > > acquire because the check for the target vcpu is running is now before > > the lock. However, even if the target-to-yield-to vcpu [for the same > > guest upon we PLE exited] is not running, the physical > > processor/runqueue that target-to-yield-to vcpu is located on could be > > running a different VM's vcpu -and- going through a directed yield, > > therefore that run queue lock may already acquired. We do not want to > > just spin and wait, we want to move to the next candidate vcpu. We need > > a check to see if the smp processor/runqueue is already in a directed > > yield. Or, perhaps we just check if that cpu is not in guest mode, and > > if so, we skip that yield attempt for that vcpu and move to the next > > candidate vcpu. So, my question is: given a runqueue, what's the best > > way to check if that corresponding phys cpu is not in guest mode? > > > > We are indeed avoiding CPUS in guest mode when we check > task->flags & PF_VCPU in vcpu_on_spin path. Doesn't that suffice? My understanding is that it checks if the candidate vcpu task is in guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a target to yield to if it is already in guest mode. I am concerned about a different vcpu, possibly from a different VM (let's call it g2vcpuN), but it is also located on the same runqueue as g1vcpuN -and- running. That vcpu, g2vcpuN, may also be doing a directed yield, and it may already be holding the rq lock. Or it could be in guest mode. If it is in guest mode, then let's still target this rq, and try to yield to g1vcpuN. However, if g2vcpuN is not in guest mode, then don't bother trying. Patch included below. 
Here's the new, v2 result with the previous two: 10 VMs, 16-way each, all running dbench (2x cpu over-commit) throughput +/- stddev ----- ----- ple on: 2552 +/- .70% ple on: w/fixv1: 4621 +/- 2.12% (81% improvement) ple on: w/fixv2: 6115* (139% improvement) [*] I do not have stdev yet because all 10 runs are not complete for v1 to v2, host CPU dropped from 60% to 50%. Time in spin_lock() is also dropping: v1: > 28.60% 264266 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock > 9.21% 85109 qemu-system-x86 [kernel.kallsyms] [k] __schedule > 4.98% 46035 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq > 4.12% 38066 qemu-system-x86 [kernel.kallsyms] [k] yield_to > 4.01% 37114 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin > 3.52% 32507 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task > 3.37% 31141 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc > 3.26% 30121 qemu-system-x86 [kvm] [k] __vcpu_run > 3.01% 27844 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run > 2.98% 27592 qemu-system-x86 [kernel.kallsyms] [k] resched_task > 2.72% 25108 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock > 2.25% 20758 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe > 2.01% 18522 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to > 1.89% 17508 qemu-system-x86 [kvm] [k] vcpu_enter_guest v2: > 10.12% 83285 qemu-system-x86 [kernel.kallsyms] [k] __schedule > 9.74% 80106 qemu-system-x86 [kernel.kallsyms] [k] yield_to > 7.38% 60686 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task > 6.28% 51695 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock > 5.24% 43128 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin > 5.04% 41517 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock > 4.11% 33823 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run > 3.62% 29827 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc > 3.57% 29413 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to > 3.44% 28297 qemu-system-x86 [kvm] [k] __vcpu_run > 2.93% 24116 qemu-system-x86 [kvm] [k] vcpu_enter_guest > 2.60% 
21379 qemu-system-x86 [kernel.kallsyms] [k] resched_task

So this seems to be working. However, I wonder just how far we can take this. Ideally we need to be in the <3-4% host CPU range for PLE work, like I observe for the 8-way VMs. We are still way off.

-Andrew

signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..c767915 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
+		goto out_no_unlock;
+	}
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
 	if (curr->sched_class != p->sched_class)
 		goto out;
 
-	if (task_running(p_rq, p) || p->state)
-		goto out;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4879,6 +4880,7 @@ again:
 
 out:
 	double_rq_unlock(rq, p_rq);
+out_no_unlock:
 	local_irq_restore(flags);
 
 	if (yielded)
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-07 19:42 ` Andrew Theurer @ 2012-09-08 8:43 ` Srikar Dronamraju 2012-09-10 13:16 ` Andrew Theurer 2012-09-10 14:43 ` Raghavendra K T 1 sibling, 1 reply; 41+ messages in thread From: Srikar Dronamraju @ 2012-09-08 8:43 UTC (permalink / raw) To: Andrew Theurer Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri, Peter Zijlstra > > signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fbf1fd0..c767915 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool > preempt) > > again: > p_rq = task_rq(p); > + if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & > PF_VCPU)) { > + goto out_no_unlock; > + } > double_rq_lock(rq, p_rq); > while (task_rq(p) != p_rq) { > double_rq_unlock(rq, p_rq); > @@ -4856,8 +4859,6 @@ again: > if (curr->sched_class != p->sched_class) > goto out; > > - if (task_running(p_rq, p) || p->state) > - goto out; Is it possible that by this time the current thread takes double rq lock, thread p could actually be running? i.e is there merit to keep this check around even with your similar check above? > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > if (yielded) { > @@ -4879,6 +4880,7 @@ again: > > out: > double_rq_unlock(rq, p_rq); > +out_no_unlock: > local_irq_restore(flags); > > if (yielded) > > -- thanks and regards Srikar ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-08 8:43 ` Srikar Dronamraju @ 2012-09-10 13:16 ` Andrew Theurer 2012-09-10 16:03 ` Peter Zijlstra 0 siblings, 1 reply; 41+ messages in thread From: Andrew Theurer @ 2012-09-10 13:16 UTC (permalink / raw) To: Srikar Dronamraju Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri, Peter Zijlstra On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote: > > > > signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com> > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index fbf1fd0..c767915 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool > > preempt) > > > > again: > > p_rq = task_rq(p); > > + if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & > > PF_VCPU)) { > > + goto out_no_unlock; > > + } > > double_rq_lock(rq, p_rq); > > while (task_rq(p) != p_rq) { > > double_rq_unlock(rq, p_rq); > > @@ -4856,8 +4859,6 @@ again: > > if (curr->sched_class != p->sched_class) > > goto out; > > > > - if (task_running(p_rq, p) || p->state) > > - goto out; > > Is it possible that by this time the current thread takes double rq > lock, thread p could actually be running? i.e is there merit to keep > this check around even with your similar check above? I think that's a good idea. I'll add that back in. > > > > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > > if (yielded) { > > @@ -4879,6 +4880,7 @@ again: > > > > out: > > double_rq_unlock(rq, p_rq); > > +out_no_unlock: > > local_irq_restore(flags); > > > > if (yielded) > > > > > -Andrew Theurer ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 13:16 ` Andrew Theurer @ 2012-09-10 16:03 ` Peter Zijlstra 2012-09-10 16:56 ` Srikar Dronamraju 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2012-09-10 16:03 UTC (permalink / raw) To: habanero Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote: > > > @@ -4856,8 +4859,6 @@ again: > > > if (curr->sched_class != p->sched_class) > > > goto out; > > > > > > - if (task_running(p_rq, p) || p->state) > > > - goto out; > > > > Is it possible that by this time the current thread takes double rq > > lock, thread p could actually be running? i.e is there merit to keep > > this check around even with your similar check above? > > I think that's a good idea. I'll add that back in. Right, it needs to still be there, the test before acquiring p_rq is an optimistic test to avoid work, but you have to still test it once you acquire p_rq since the rest of the code relies on this not being so. How about something like this instead.. ? --- kernel/sched/core.c | 35 ++++++++++++++++++++++++++--------- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c46a011..c9ecab2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4300,6 +4300,23 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/* + * Tests preconditions required for sched_class::yield_to(). 
+ */ +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) +{ + if (!curr->sched_class->yield_to_task) + return false; + + if (curr->sched_class != p->sched_class) + return false; + + if (task_running(p_rq, p) || p->state) + return false; + + return true; +} + /** * yield_to - yield the current processor to another thread in * your thread group, or accelerate that thread toward the @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt) rq = this_rq(); again: + /* optimistic test to avoid taking locks */ + if (!__yield_to_candidate(curr, p)) + goto out_irq; + p_rq = task_rq(p); double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { @@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt) goto again; } - if (!curr->sched_class->yield_to_task) - goto out; - - if (curr->sched_class != p->sched_class) - goto out; - - if (task_running(p_rq, p) || p->state) - goto out; + /* validate state, holding p_rq ensures p's state cannot change */ + if (!__yield_to_candidate(curr, p)) + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt) resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); if (yielded) ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 16:03 ` Peter Zijlstra @ 2012-09-10 16:56 ` Srikar Dronamraju 2012-09-10 17:12 ` Peter Zijlstra 0 siblings, 1 reply; 41+ messages in thread From: Srikar Dronamraju @ 2012-09-10 16:56 UTC (permalink / raw) To: Peter Zijlstra Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri * Peter Zijlstra <peterz@infradead.org> [2012-09-10 18:03:55]: > On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote: > > > > @@ -4856,8 +4859,6 @@ again: > > > > if (curr->sched_class != p->sched_class) > > > > goto out; > > > > > > > > - if (task_running(p_rq, p) || p->state) > > > > - goto out; > > > > > > Is it possible that by this time the current thread takes double rq > > > lock, thread p could actually be running? i.e is there merit to keep > > > this check around even with your similar check above? > > > > I think that's a good idea. I'll add that back in. > > Right, it needs to still be there, the test before acquiring p_rq is an > optimistic test to avoid work, but you have to still test it once you > acquire p_rq since the rest of the code relies on this not being so. > > How about something like this instead.. ? > > --- > kernel/sched/core.c | 35 ++++++++++++++++++++++++++--------- > 1 file changed, 26 insertions(+), 9 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index c46a011..c9ecab2 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4300,6 +4300,23 @@ void __sched yield(void) > } > EXPORT_SYMBOL(yield); > > +/* > + * Tests preconditions required for sched_class::yield_to(). 
> + */ > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > +{ > + if (!curr->sched_class->yield_to_task) > + return false; > + > + if (curr->sched_class != p->sched_class) > + return false; Peter, Should we also add a check if the runq has a skip buddy (as pointed out by Raghu) and return if the skip buddy is already set. Something akin to if (p_rq->cfs_rq->skip) return false; So if somebody has already acquired a double run queue lock and almost set the next buddy, we dont need to take run queue lock and also avoid overwriting the already set skip buddy. > + > + if (task_running(p_rq, p) || p->state) > + return false; > + > + return true; > +} > + > /** > * yield_to - yield the current processor to another thread in > * your thread group, or accelerate that thread toward the > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt) > rq = this_rq(); > > again: > + /* optimistic test to avoid taking locks */ > + if (!__yield_to_candidate(curr, p)) > + goto out_irq; > + > p_rq = task_rq(p); > double_rq_lock(rq, p_rq); > while (task_rq(p) != p_rq) { > @@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt) > goto again; > } > > - if (!curr->sched_class->yield_to_task) > - goto out; > - > - if (curr->sched_class != p->sched_class) > - goto out; > - > - if (task_running(p_rq, p) || p->state) > - goto out; > + /* validate state, holding p_rq ensures p's state cannot change */ > + if (!__yield_to_candidate(curr, p)) > + goto out_unlock; > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > if (yielded) { > @@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt) > resched_task(p_rq->curr); > } > > -out: > +out_unlock: > double_rq_unlock(rq, p_rq); > +out_irq: > local_irq_restore(flags); > > if (yielded) > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 16:56 ` Srikar Dronamraju @ 2012-09-10 17:12 ` Peter Zijlstra 2012-09-10 19:10 ` Raghavendra K T ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Peter Zijlstra @ 2012-09-10 17:12 UTC (permalink / raw) To: Srikar Dronamraju Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > > +{ > > + if (!curr->sched_class->yield_to_task) > > + return false; > > + > > + if (curr->sched_class != p->sched_class) > > + return false; > > > Peter, > > Should we also add a check if the runq has a skip buddy (as pointed out > by Raghu) and return if the skip buddy is already set. Oh right, I missed that suggestion.. the performance improvement went from 81% to 139% using this, right? It might make more sense to keep that separate, outside of this function, since its not a strict prerequisite. > > > > + if (task_running(p_rq, p) || p->state) > > + return false; > > + > > + return true; > > +} > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > bool preempt) > > rq = this_rq(); > > > > again: > > + /* optimistic test to avoid taking locks */ > > + if (!__yield_to_candidate(curr, p)) > > + goto out_irq; > > + So add something like: /* Optimistic, if we 'raced' with another yield_to(), don't bother */ if (p_rq->cfs_rq->skip) goto out_irq; > > > > p_rq = task_rq(p); > > double_rq_lock(rq, p_rq); > > But I do have a question on this optimization though,.. Why do we check p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? That is, I'd like to see this thing explained a little better. Does it go something like: p_rq is the runqueue of the task we'd like to yield to, rq is our own, they might be the same. 
If we have a ->skip, there's nothing we can do about it, OTOH p_rq having a ->skip and failing the yield_to() simply means us picking the next VCPU thread, which might be running on an entirely different cpu (rq) and could succeed? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 17:12 ` Peter Zijlstra @ 2012-09-10 19:10 ` Raghavendra K T 2012-09-10 20:12 ` Andrew Theurer 2012-09-11 7:04 ` Srikar Dronamraju 2 siblings, 0 replies; 41+ messages in thread From: Raghavendra K T @ 2012-09-10 19:10 UTC (permalink / raw) To: Peter Zijlstra Cc: Srikar Dronamraju, habanero, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/10/2012 10:42 PM, Peter Zijlstra wrote: > On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: >>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) >>> +{ >>> + if (!curr->sched_class->yield_to_task) >>> + return false; >>> + >>> + if (curr->sched_class != p->sched_class) >>> + return false; >> >> >> Peter, >> >> Should we also add a check if the runq has a skip buddy (as pointed out >> by Raghu) and return if the skip buddy is already set. > > Oh right, I missed that suggestion.. the performance improvement went > from 81% to 139% using this, right? > > It might make more sense to keep that separate, outside of this > function, since its not a strict prerequisite. > >>> >>> + if (task_running(p_rq, p) || p->state) >>> + return false; >>> + >>> + return true; >>> +} > > >>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, >> bool preempt) >>> rq = this_rq(); >>> >>> again: >>> + /* optimistic test to avoid taking locks */ >>> + if (!__yield_to_candidate(curr, p)) >>> + goto out_irq; >>> + > > So add something like: > > /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > if (p_rq->cfs_rq->skip) > goto out_irq; >> >> >>> p_rq = task_rq(p); >>> double_rq_lock(rq, p_rq); >> >> > But I do have a question on this optimization though,.. Why do we check > p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > That is, I'd like to see this thing explained a little better. 
> > Does it go something like: p_rq is the runqueue of the task we'd like to > yield to, rq is our own, they might be the same. If we have a ->skip, > there's nothing we can do about it, OTOH p_rq having a ->skip and > failing the yield_to() simply means us picking the next VCPU thread, > which might be running on an entirely different cpu (rq) and could > succeed? > Yes, that is the intention (I mean checking p_rq->cfs->skip). Though we may be overdoing this check in the scenario where multiple vcpus of the same VM are pinned to the same CPU. I am testing the above patch and hope to get back with the results tomorrow. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 17:12 ` Peter Zijlstra 2012-09-10 19:10 ` Raghavendra K T @ 2012-09-10 20:12 ` Andrew Theurer 2012-09-10 20:19 ` Peter Zijlstra 2012-09-11 6:08 ` Raghavendra K T 2012-09-11 7:04 ` Srikar Dronamraju 2 siblings, 2 replies; 41+ messages in thread From: Andrew Theurer @ 2012-09-10 20:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > > > +{ > > > + if (!curr->sched_class->yield_to_task) > > > + return false; > > > + > > > + if (curr->sched_class != p->sched_class) > > > + return false; > > > > > > Peter, > > > > Should we also add a check if the runq has a skip buddy (as pointed out > > by Raghu) and return if the skip buddy is already set. > > Oh right, I missed that suggestion.. the performance improvement went > from 81% to 139% using this, right? > > It might make more sense to keep that separate, outside of this > function, since its not a strict prerequisite. > > > > > > > + if (task_running(p_rq, p) || p->state) > > > + return false; > > > + > > > + return true; > > > +} > > > > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > bool preempt) > > > rq = this_rq(); > > > > > > again: > > > + /* optimistic test to avoid taking locks */ > > > + if (!__yield_to_candidate(curr, p)) > > > + goto out_irq; > > > + > > So add something like: > > /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > if (p_rq->cfs_rq->skip) > goto out_irq; > > > > > > > p_rq = task_rq(p); > > > double_rq_lock(rq, p_rq); > > > > > But I do have a question on this optimization though,.. 
Why do we check > p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > That is, I'd like to see this thing explained a little better. > > Does it go something like: p_rq is the runqueue of the task we'd like to > yield to, rq is our own, they might be the same. If we have a ->skip, > there's nothing we can do about it, OTOH p_rq having a ->skip and > failing the yield_to() simply means us picking the next VCPU thread, > which might be running on an entirely different cpu (rq) and could > succeed? Here's two new versions, both include a __yield_to_candidate(): "v3" uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq skip check. Raghu, I am not sure if this is exactly what you want implemented in v4. Results: > ple on: 2552 +/- .70% > ple on: w/fixv1: 4621 +/- 2.12% (81% improvement) > ple on: w/fixv2: 6115 (139% improvement) v3: 5735 (124% improvement) v4: 4524 ( 3% regression) Both patches included below -Andrew diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..0d98a67 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4820,6 +4820,23 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/* + * Tests preconditions required for sched_class::yield_to(). 
+ */ +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq) +{ + if (!curr->sched_class->yield_to_task) + return false; + + if (curr->sched_class != p->sched_class) + return false; + + if (task_running(p_rq, p) || p->state) + return false; + + return true; +} + /** * yield_to - yield the current processor to another thread in * your thread group, or accelerate that thread toward the @@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + + /* optimistic test to avoid taking locks */ + if (!__yield_to_candidate(curr, p, p_rq)) + goto out_irq; + + /* + * if the target task is not running, then only yield if the + * current task is in guest mode + */ + if (!(p_rq->curr->flags & PF_VCPU)) + goto out_irq; + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); goto again; } - if (!curr->sched_class->yield_to_task) - goto out; - - if (curr->sched_class != p->sched_class) - goto out; - - if (task_running(p_rq, p) || p->state) - goto out; + /* validate state, holding p_rq ensures p's state cannot change */ + if (!__yield_to_candidate(curr, p, p_rq)) + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4877,8 +4901,9 @@ again: rq->skip_clock_update = 0; } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); if (yielded) **************** v4: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..2bec2ed 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4820,6 +4820,23 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/* + * Tests preconditions required for sched_class::yield_to(). 
+ */ +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq) +{ + if (!curr->sched_class->yield_to_task) + return false; + + if (curr->sched_class != p->sched_class) + return false; + + if (task_running(p_rq, p) || p->state) + return false; + + return true; +} + /** * yield_to - yield the current processor to another thread in * your thread group, or accelerate that thread toward the @@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + + /* optimistic test to avoid taking locks */ + if (!__yield_to_candidate(curr, p, p_rq)) + goto out_irq; + + /* if a yield is in progress, skip */ + if (p_rq->cfs.skip) + goto out_irq; + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); goto again; } - if (!curr->sched_class->yield_to_task) - goto out; - - if (curr->sched_class != p->sched_class) - goto out; - - if (task_running(p_rq, p) || p->state) - goto out; + /* validate state, holding p_rq ensures p's state cannot change */ + if (!__yield_to_candidate(curr, p, p_rq)) + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4877,8 +4898,9 @@ again: rq->skip_clock_update = 0; } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); if (yielded) ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 20:12 ` Andrew Theurer @ 2012-09-10 20:19 ` Peter Zijlstra 2012-09-10 20:31 ` Rik van Riel 2012-09-11 6:08 ` Raghavendra K T 1 sibling, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2012-09-10 20:19 UTC (permalink / raw) To: habanero Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote: > + /* > + * if the target task is not running, then only yield if the > + * current task is in guest mode > + */ > + if (!(p_rq->curr->flags & PF_VCPU)) > + goto out_irq; This would make yield_to() only ever work on KVM. Not that I mind this too much; it's a horrid thing, and making it less useful for (ab)use is a good thing. Still, this probably wants a mention somewhere :-) ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 20:19 ` Peter Zijlstra @ 2012-09-10 20:31 ` Rik van Riel 0 siblings, 0 replies; 41+ messages in thread From: Rik van Riel @ 2012-09-10 20:31 UTC (permalink / raw) To: Peter Zijlstra Cc: habanero, Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/10/2012 04:19 PM, Peter Zijlstra wrote: > On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote: >> + /* >> + * if the target task is not running, then only yield if the >> + * current task is in guest mode >> + */ >> + if (!(p_rq->curr->flags & PF_VCPU)) >> + goto out_irq; > > This would make yield_to() only ever work on KVM, not that I mind this > too much, its a horrid thing and making it less useful for (ab)use is a > good thing, still this probably wants mention somewhere :-) Also, it would not preempt a non-kvm task, even if we need to do that to boost a VCPU. I think the lines above should be dropped. -- All rights reversed ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 20:12 ` Andrew Theurer 2012-09-10 20:19 ` Peter Zijlstra @ 2012-09-11 6:08 ` Raghavendra K T 2012-09-11 12:48 ` Andrew Theurer 2012-09-11 18:27 ` Andrew Theurer 1 sibling, 2 replies; 41+ messages in thread From: Raghavendra K T @ 2012-09-11 6:08 UTC (permalink / raw) To: habanero Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/11/2012 01:42 AM, Andrew Theurer wrote: > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) >>>> +{ >>>> + if (!curr->sched_class->yield_to_task) >>>> + return false; >>>> + >>>> + if (curr->sched_class != p->sched_class) >>>> + return false; >>> >>> >>> Peter, >>> >>> Should we also add a check if the runq has a skip buddy (as pointed out >>> by Raghu) and return if the skip buddy is already set. >> >> Oh right, I missed that suggestion.. the performance improvement went >> from 81% to 139% using this, right? >> >> It might make more sense to keep that separate, outside of this >> function, since its not a strict prerequisite. >> >>>> >>>> + if (task_running(p_rq, p) || p->state) >>>> + return false; >>>> + >>>> + return true; >>>> +} >> >> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, >>> bool preempt) >>>> rq = this_rq(); >>>> >>>> again: >>>> + /* optimistic test to avoid taking locks */ >>>> + if (!__yield_to_candidate(curr, p)) >>>> + goto out_irq; >>>> + >> >> So add something like: >> >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ >> if (p_rq->cfs_rq->skip) >> goto out_irq; >>> >>> >>>> p_rq = task_rq(p); >>>> double_rq_lock(rq, p_rq); >>> >>> >> But I do have a question on this optimization though,.. 
Why do we check >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? >> >> That is, I'd like to see this thing explained a little better. >> >> Does it go something like: p_rq is the runqueue of the task we'd like to >> yield to, rq is our own, they might be the same. If we have a ->skip, >> there's nothing we can do about it, OTOH p_rq having a ->skip and >> failing the yield_to() simply means us picking the next VCPU thread, >> which might be running on an entirely different cpu (rq) and could >> succeed? > > Here's two new versions, both include a __yield_to_candidate(): "v3" > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > skip check. Raghu, I am not sure if this is exactly what you want > implemented in v4. > Andrew, yes, that is what I had; I think there was a misunderstanding. My intention was that if a directed yield has already happened in a runqueue (say rqA), we should not bother to do a directed yield to it again. But unfortunately, as PeterZ pointed out, that would have resulted in setting the next buddy of a different run queue than rqA. So we can drop this "skip" idea. Pondering more over what to do... can we use the next buddy itself? Thinking.. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-11 6:08 ` Raghavendra K T @ 2012-09-11 12:48 ` Andrew Theurer 2012-09-11 18:27 ` Andrew Theurer 1 sibling, 0 replies; 41+ messages in thread From: Andrew Theurer @ 2012-09-11 12:48 UTC (permalink / raw) To: Raghavendra K T Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > >>>> +{ > >>>> + if (!curr->sched_class->yield_to_task) > >>>> + return false; > >>>> + > >>>> + if (curr->sched_class != p->sched_class) > >>>> + return false; > >>> > >>> > >>> Peter, > >>> > >>> Should we also add a check if the runq has a skip buddy (as pointed out > >>> by Raghu) and return if the skip buddy is already set. > >> > >> Oh right, I missed that suggestion.. the performance improvement went > >> from 81% to 139% using this, right? > >> > >> It might make more sense to keep that separate, outside of this > >> function, since its not a strict prerequisite. 
> >> > >>>> > >>>> + if (task_running(p_rq, p) || p->state) > >>>> + return false; > >>>> + > >>>> + return true; > >>>> +} > >> > >> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > >>> bool preempt) > >>>> rq = this_rq(); > >>>> > >>>> again: > >>>> + /* optimistic test to avoid taking locks */ > >>>> + if (!__yield_to_candidate(curr, p)) > >>>> + goto out_irq; > >>>> + > >> > >> So add something like: > >> > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > >> if (p_rq->cfs_rq->skip) > >> goto out_irq; > >>> > >>> > >>>> p_rq = task_rq(p); > >>>> double_rq_lock(rq, p_rq); > >>> > >>> > >> But I do have a question on this optimization though,.. Why do we check > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > >> > >> That is, I'd like to see this thing explained a little better. > >> > >> Does it go something like: p_rq is the runqueue of the task we'd like to > >> yield to, rq is our own, they might be the same. If we have a ->skip, > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > >> failing the yield_to() simply means us picking the next VCPU thread, > >> which might be running on an entirely different cpu (rq) and could > >> succeed? > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > skip check. Raghu, I am not sure if this is exactly what you want > > implemented in v4. > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > My intention was to if there is a directed_yield happened in runqueue > (say rqA), do not bother to directed yield to that. But unfortunately as > PeterZ pointed that would have resulted in setting next buddy of a > different run queue than rqA. > So we can drop this "skip" idea. Pondering more over what to do? can we > use next buddy itself ... thinking.. 
FYI, I regretfully forgot include your recent changes to kvm_vcpu_on_spin in my tests (found in kvm.git/next branch), so I am going to get some results for that before I experiment any more on 3.6-rc. -Andrew ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-11 6:08 ` Raghavendra K T 2012-09-11 12:48 ` Andrew Theurer @ 2012-09-11 18:27 ` Andrew Theurer 2012-09-13 11:48 ` Raghavendra K T 2012-09-13 12:13 ` Avi Kivity 1 sibling, 2 replies; 41+ messages in thread From: Andrew Theurer @ 2012-09-11 18:27 UTC (permalink / raw) To: Raghavendra K T Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > >>>> +{ > >>>> + if (!curr->sched_class->yield_to_task) > >>>> + return false; > >>>> + > >>>> + if (curr->sched_class != p->sched_class) > >>>> + return false; > >>> > >>> > >>> Peter, > >>> > >>> Should we also add a check if the runq has a skip buddy (as pointed out > >>> by Raghu) and return if the skip buddy is already set. > >> > >> Oh right, I missed that suggestion.. the performance improvement went > >> from 81% to 139% using this, right? > >> > >> It might make more sense to keep that separate, outside of this > >> function, since its not a strict prerequisite. 
> >> > >>>> > >>>> + if (task_running(p_rq, p) || p->state) > >>>> + return false; > >>>> + > >>>> + return true; > >>>> +} > >> > >> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > >>> bool preempt) > >>>> rq = this_rq(); > >>>> > >>>> again: > >>>> + /* optimistic test to avoid taking locks */ > >>>> + if (!__yield_to_candidate(curr, p)) > >>>> + goto out_irq; > >>>> + > >> > >> So add something like: > >> > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > >> if (p_rq->cfs_rq->skip) > >> goto out_irq; > >>> > >>> > >>>> p_rq = task_rq(p); > >>>> double_rq_lock(rq, p_rq); > >>> > >>> > >> But I do have a question on this optimization though,.. Why do we check > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > >> > >> That is, I'd like to see this thing explained a little better. > >> > >> Does it go something like: p_rq is the runqueue of the task we'd like to > >> yield to, rq is our own, they might be the same. If we have a ->skip, > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > >> failing the yield_to() simply means us picking the next VCPU thread, > >> which might be running on an entirely different cpu (rq) and could > >> succeed? > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > skip check. Raghu, I am not sure if this is exactly what you want > > implemented in v4. > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > My intention was to if there is a directed_yield happened in runqueue > (say rqA), do not bother to directed yield to that. But unfortunately as > PeterZ pointed that would have resulted in setting next buddy of a > different run queue than rqA. > So we can drop this "skip" idea. Pondering more over what to do? can we > use next buddy itself ... thinking.. 
As I mentioned earlier today, I did not have your changes from kvm.git tree when I tested my changes. Here are your changes and my changes compared: throughput in MB/sec kvm_vcpu_on_spin changes: 4636 +/- 15.74% yield_to changes: 4515 +/- 12.73% I would be inclined to stick with your changes which are kept in kvm code. I did try both combined, and did not get good results: both changes: 4074 +/- 19.12% So, having both is probably not a good idea. However, I feel like there's more work to be done. With no over-commit (10 VMs), total throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x over-commit have a total throughput roughly 10% less than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread host). We still have what appears to be scalability problems, but now it's not so much in runqueue locks for yield_to(), but now get_pid_task(): perf on host: 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task 11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock 10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to 9.17% 91507 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin 7.74% 77257 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to 3.56% 35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock 3.00% 29951 qemu-system-x86 [kvm] [k] __vcpu_run 2.93% 29268 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run 2.88% 28783 qemu-system-x86 [kvm] [k] vcpu_enter_guest 2.59% 25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule 1.40% 13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq 1.28% 12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task 1.14% 11376 qemu-system-x86 [kvm_intel] [k] vmcs_writel 0.85% 8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair 0.53% 5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe 0.46% 4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc get_pid_task() uses some rcu 
functions, wondering how scalable this is.... I tend to think of rcu as -not- having issues like this... is there an rcu stat/tracing tool which would help identify potential problems? -Andrew ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-11 18:27 ` Andrew Theurer @ 2012-09-13 11:48 ` Raghavendra K T 2012-09-13 21:30 ` Andrew Theurer 2012-09-13 12:13 ` Avi Kivity 1 sibling, 1 reply; 41+ messages in thread From: Raghavendra K T @ 2012-09-13 11:48 UTC (permalink / raw) To: Andrew Theurer Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]: > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > > >>>> +{ > > >>>> + if (!curr->sched_class->yield_to_task) > > >>>> + return false; > > >>>> + > > >>>> + if (curr->sched_class != p->sched_class) > > >>>> + return false; > > >>> > > >>> > > >>> Peter, > > >>> > > >>> Should we also add a check if the runq has a skip buddy (as pointed out > > >>> by Raghu) and return if the skip buddy is already set. > > >> > > >> Oh right, I missed that suggestion.. the performance improvement went > > >> from 81% to 139% using this, right? > > >> > > >> It might make more sense to keep that separate, outside of this > > >> function, since its not a strict prerequisite. 
> > >> > > >>>> > > >>>> + if (task_running(p_rq, p) || p->state) > > >>>> + return false; > > >>>> + > > >>>> + return true; > > >>>> +} > > >> > > >> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > >>> bool preempt) > > >>>> rq = this_rq(); > > >>>> > > >>>> again: > > >>>> + /* optimistic test to avoid taking locks */ > > >>>> + if (!__yield_to_candidate(curr, p)) > > >>>> + goto out_irq; > > >>>> + > > >> > > >> So add something like: > > >> > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > > >> if (p_rq->cfs_rq->skip) > > >> goto out_irq; > > >>> > > >>> > > >>>> p_rq = task_rq(p); > > >>>> double_rq_lock(rq, p_rq); > > >>> > > >>> > > >> But I do have a question on this optimization though,.. Why do we check > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > >> > > >> That is, I'd like to see this thing explained a little better. > > >> > > >> Does it go something like: p_rq is the runqueue of the task we'd like to > > >> yield to, rq is our own, they might be the same. If we have a ->skip, > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > > >> failing the yield_to() simply means us picking the next VCPU thread, > > >> which might be running on an entirely different cpu (rq) and could > > >> succeed? > > > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > > skip check. Raghu, I am not sure if this is exactly what you want > > > implemented in v4. > > > > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > > My intention was to if there is a directed_yield happened in runqueue > > (say rqA), do not bother to directed yield to that. But unfortunately as > > PeterZ pointed that would have resulted in setting next buddy of a > > different run queue than rqA. > > So we can drop this "skip" idea. Pondering more over what to do? 
can we > > use next buddy itself ... thinking.. > > As I mentioned earlier today, I did not have your changes from kvm.git > tree when I tested my changes. Here are your changes and my changes > compared: > > throughput in MB/sec > > kvm_vcpu_on_spin changes: 4636 +/- 15.74% > yield_to changes: 4515 +/- 12.73% > > I would be inclined to stick with your changes which are kept in kvm > code. I did try both combined, and did not get good results: > > both changes: 4074 +/- 19.12% > > So, having both is probably not a good idea. However, I feel like > there's more work to be done. With no over-commit (10 VMs), total > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some > overhead, but a reduction to ~4500 is still terrible. By contrast, > 8-way VMs with 2x over-commit have a total throughput roughly 10% less > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread > host). We still have what appears to be scalability problems, but now > it's not so much in runqueue locks for yield_to(), but now > get_pid_task(): > Hi Andrew, IMHO, reducing the double runqueue lock overhead is a good idea, and maybe we will see the benefits when we increase the overcommit further. The explanation for not seeing much benefit on top of the PLE handler optimization patch is that we filter the yield_to candidates, which results in less contention for the double runqueue lock; the extra code overhead during a genuine yield_to might then have caused some degradation in the case you tested. However, did you use cfs.next also? I hope it helps when we combine them. Here is the result showing a positive benefit. I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
+-----------+-----------+-----------+------------+-----------+
              kernbench time in sec, lower is better
+-----------+-----------+-----------+------------+-----------+
       base      stddev     patched      stddev    %improve
+-----------+-----------+-----------+------------+-----------+
1x    44.3880     1.8699     40.8180      1.9173     8.04271
2x    96.7580     4.2787     93.4188      3.5150     3.45108
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
               ebizzy record/sec, higher is better
+-----------+-----------+-----------+------------+-----------+
       base      stddev     patched      stddev    %improve
+-----------+-----------+-----------+------------+-----------+
1x  2374.1250    50.9718   3816.2500     54.0681    60.74343
2x  2536.2500    93.0403   2789.3750    204.7897     9.98029
+-----------+-----------+-----------+------------+-----------+

Below is the patch, which combines PeterZ's suggestions on your original approach with cfs.next (already posted by Srikar in the other thread).

----8<----
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..8551f57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,24 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
+				 struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if next buddy is set, assume yield is in progress */
+	if (p_rq->cfs.next)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4899,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)

^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-13 11:48 ` Raghavendra K T @ 2012-09-13 21:30 ` Andrew Theurer 2012-09-14 17:10 ` Andrew Jones ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Andrew Theurer @ 2012-09-13 21:30 UTC (permalink / raw) To: Raghavendra K T Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: > * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]: > > > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > > > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > > > >>>> +{ > > > >>>> + if (!curr->sched_class->yield_to_task) > > > >>>> + return false; > > > >>>> + > > > >>>> + if (curr->sched_class != p->sched_class) > > > >>>> + return false; > > > >>> > > > >>> > > > >>> Peter, > > > >>> > > > >>> Should we also add a check if the runq has a skip buddy (as pointed out > > > >>> by Raghu) and return if the skip buddy is already set. > > > >> > > > >> Oh right, I missed that suggestion.. the performance improvement went > > > >> from 81% to 139% using this, right? > > > >> > > > >> It might make more sense to keep that separate, outside of this > > > >> function, since its not a strict prerequisite. 
> > > >> > > > >>>> > > > >>>> + if (task_running(p_rq, p) || p->state) > > > >>>> + return false; > > > >>>> + > > > >>>> + return true; > > > >>>> +} > > > >> > > > >> > > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > > >>> bool preempt) > > > >>>> rq = this_rq(); > > > >>>> > > > >>>> again: > > > >>>> + /* optimistic test to avoid taking locks */ > > > >>>> + if (!__yield_to_candidate(curr, p)) > > > >>>> + goto out_irq; > > > >>>> + > > > >> > > > >> So add something like: > > > >> > > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > > > >> if (p_rq->cfs_rq->skip) > > > >> goto out_irq; > > > >>> > > > >>> > > > >>>> p_rq = task_rq(p); > > > >>>> double_rq_lock(rq, p_rq); > > > >>> > > > >>> > > > >> But I do have a question on this optimization though,.. Why do we check > > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > > >> > > > >> That is, I'd like to see this thing explained a little better. > > > >> > > > >> Does it go something like: p_rq is the runqueue of the task we'd like to > > > >> yield to, rq is our own, they might be the same. If we have a ->skip, > > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > > > >> failing the yield_to() simply means us picking the next VCPU thread, > > > >> which might be running on an entirely different cpu (rq) and could > > > >> succeed? > > > > > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > > > skip check. Raghu, I am not sure if this is exactly what you want > > > > implemented in v4. > > > > > > > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > > > My intention was to if there is a directed_yield happened in runqueue > > > (say rqA), do not bother to directed yield to that. 
But unfortunately as > > > PeterZ pointed that would have resulted in setting next buddy of a > > > different run queue than rqA. > > > So we can drop this "skip" idea. Pondering more over what to do? can we > > > use next buddy itself ... thinking.. > > > > As I mentioned earlier today, I did not have your changes from kvm.git > > tree when I tested my changes. Here are your changes and my changes > > compared: > > > > throughput in MB/sec > > > > kvm_vcpu_on_spin changes: 4636 +/- 15.74% > > yield_to changes: 4515 +/- 12.73% > > > > I would be inclined to stick with your changes which are kept in kvm > > code. I did try both combined, and did not get good results: > > > > both changes: 4074 +/- 19.12% > > > > So, having both is probably not a good idea. However, I feel like > > there's more work to be done. With no over-commit (10 VMs), total > > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some > > overhead, but a reduction to ~4500 is still terrible. By contrast, > > 8-way VMs with 2x over-commit have a total throughput roughly 10% less > > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread > > host). We still have what appears to be scalability problems, but now > > it's not so much in runqueue locks for yield_to(), but now > > get_pid_task(): > > > > Hi Andrew, > IMHO, reducing the double runqueue lock overhead is a good idea, > and may be we see the benefits when we increase the overcommit further. > > The explaination for not seeing good benefit on top of PLE handler > optimization patch is because we filter the yield_to candidates, > and hence resulting in less contention for double runqueue lock. > and extra code overhead during genuine yield_to might have resulted in > some degradation in the case you tested. > > However, did you use cfs.next also?. I hope it helps, when we combine. > > Here is the result that is showing positive benefit. > I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s). 
> > +-----------+-----------+-----------+------------+-----------+ > kernbench time in sec, lower is better > +-----------+-----------+-----------+------------+-----------+ > base stddev patched stddev %improve > +-----------+-----------+-----------+------------+-----------+ > 1x 44.3880 1.8699 40.8180 1.9173 8.04271 > 2x 96.7580 4.2787 93.4188 3.5150 3.45108 > +-----------+-----------+-----------+------------+-----------+ > > > +-----------+-----------+-----------+------------+-----------+ > ebizzy record/sec higher is better > +-----------+-----------+-----------+------------+-----------+ > base stddev patched stddev %improve > +-----------+-----------+-----------+------------+-----------+ > 1x 2374.1250 50.9718 3816.2500 54.0681 60.74343 > 2x 2536.2500 93.0403 2789.3750 204.7897 9.98029 > +-----------+-----------+-----------+------------+-----------+ > > > Below is the patch which combine suggestions of peterZ on your > original approach with cfs.next (already posted by Srikar in the other > thread) I did get a chance to run with the below patch and your changes in kvm.git, but the results were not too different: Dbench, 10 x 16-way VMs on 80-way host: kvm_vcpu_on_spin changes: 4636 +/- 15.74% yield_to changes: 4515 +/- 12.73% both changes from above: 4074 +/- 19.12% ...plus cfs.next check: 4418 +/- 16.97% Still hovering around 4500 MB/sec The concern I have is that even though we have gone through changes to help reduce the candidate vcpus we yield to, we still have a very poor idea of which vcpu really needs to run. The result is high cpu usage in the get_pid_task and still some contention in the double runqueue lock. To make this scalable, we either need to significantly reduce the occurrence of the lock-holder preemption, or do a much better job of knowing which vcpu needs to run (and not unnecessarily yielding to vcpus which do not need to run). 
On reducing the occurrence: The worst case for lock-holder preemption is having vcpus of the same VM on the same runqueue. This guarantees the situation of 1 vcpu running while another [of the same VM] is not. To prove the point, I ran the same test, but with vcpus restricted to a range of host cpus, such that any single VM's vcpus can never be on the same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, vcpu-1's are on host cpus 5-9, and so on. Here is the result:

kvm_cpu_spin, and all
yield_to changes, plus
restricted vcpu placement:  8823 +/- 3.20%   much, much better

On picking a better vcpu to yield to: I really hesitate to rely on a paravirt hint [telling us which vcpu is holding a lock], but I am not sure how else to reduce the candidate vcpus to yield to. I suspect we are yielding to way more vcpus than are preempted lock-holders, and that IMO is just work accomplishing nothing. Trying to think of a way to further reduce candidate vcpus....

-Andrew

> > ----8<---- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fbf1fd0..8551f57 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4820,6 +4820,24 @@ void __sched yield(void) > } > EXPORT_SYMBOL(yield); > > +/* > + * Tests preconditions required for sched_class::yield_to(). 
> + */ > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, > + struct rq *p_rq) > +{ > + if (!curr->sched_class->yield_to_task) > + return false; > + > + if (curr->sched_class != p->sched_class) > + return false; > + > + if (task_running(p_rq, p) || p->state) > + return false; > + > + return true; > +} > + > /** > * yield_to - yield the current processor to another thread in > * your thread group, or accelerate that thread toward the > @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt) > > again: > p_rq = task_rq(p); > + > + /* optimistic test to avoid taking locks */ > + if (!__yield_to_candidate(curr, p, p_rq)) > + goto out_irq; > + > + /* if next buddy is set, assume yield is in progress */ > + if (p_rq->cfs.next) > + goto out_irq; > + > double_rq_lock(rq, p_rq); > while (task_rq(p) != p_rq) { > double_rq_unlock(rq, p_rq); > goto again; > } > > - if (!curr->sched_class->yield_to_task) > - goto out; > - > - if (curr->sched_class != p->sched_class) > - goto out; > - > - if (task_running(p_rq, p) || p->state) > - goto out; > + /* validate state, holding p_rq ensures p's state cannot change */ > + if (!__yield_to_candidate(curr, p, p_rq)) > + goto out_unlock; > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > if (yielded) { > @@ -4877,8 +4899,9 @@ again: > rq->skip_clock_update = 0; > } > > -out: > +out_unlock: > double_rq_unlock(rq, p_rq); > +out_irq: > local_irq_restore(flags); > > if (yielded) ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-13 21:30 ` Andrew Theurer @ 2012-09-14 17:10 ` Andrew Jones 2012-09-15 16:08 ` Raghavendra K T 2012-09-14 20:34 ` Konrad Rzeszutek Wilk 2012-09-16 8:55 ` Avi Kivity 2 siblings, 1 reply; 41+ messages in thread From: Andrew Jones @ 2012-09-14 17:10 UTC (permalink / raw) To: Andrew Theurer Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote: > On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: > > * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]: > > > > > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > > > > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > > > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > > > > >>>> +{ > > > > >>>> + if (!curr->sched_class->yield_to_task) > > > > >>>> + return false; > > > > >>>> + > > > > >>>> + if (curr->sched_class != p->sched_class) > > > > >>>> + return false; > > > > >>> > > > > >>> > > > > >>> Peter, > > > > >>> > > > > >>> Should we also add a check if the runq has a skip buddy (as pointed out > > > > >>> by Raghu) and return if the skip buddy is already set. > > > > >> > > > > >> Oh right, I missed that suggestion.. the performance improvement went > > > > >> from 81% to 139% using this, right? > > > > >> > > > > >> It might make more sense to keep that separate, outside of this > > > > >> function, since its not a strict prerequisite. 
> > > > >> > > > > >>>> > > > > >>>> + if (task_running(p_rq, p) || p->state) > > > > >>>> + return false; > > > > >>>> + > > > > >>>> + return true; > > > > >>>> +} > > > > >> > > > > >> > > > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > > > >>> bool preempt) > > > > >>>> rq = this_rq(); > > > > >>>> > > > > >>>> again: > > > > >>>> + /* optimistic test to avoid taking locks */ > > > > >>>> + if (!__yield_to_candidate(curr, p)) > > > > >>>> + goto out_irq; > > > > >>>> + > > > > >> > > > > >> So add something like: > > > > >> > > > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > > > > >> if (p_rq->cfs_rq->skip) > > > > >> goto out_irq; > > > > >>> > > > > >>> > > > > >>>> p_rq = task_rq(p); > > > > >>>> double_rq_lock(rq, p_rq); > > > > >>> > > > > >>> > > > > >> But I do have a question on this optimization though,.. Why do we check > > > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > > > >> > > > > >> That is, I'd like to see this thing explained a little better. > > > > >> > > > > >> Does it go something like: p_rq is the runqueue of the task we'd like to > > > > >> yield to, rq is our own, they might be the same. If we have a ->skip, > > > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > > > > >> failing the yield_to() simply means us picking the next VCPU thread, > > > > >> which might be running on an entirely different cpu (rq) and could > > > > >> succeed? > > > > > > > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > > > > skip check. Raghu, I am not sure if this is exactly what you want > > > > > implemented in v4. > > > > > > > > > > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > > > > My intention was to if there is a directed_yield happened in runqueue > > > > (say rqA), do not bother to directed yield to that. 
But unfortunately as > > > > PeterZ pointed that would have resulted in setting next buddy of a > > > > different run queue than rqA. > > > > So we can drop this "skip" idea. Pondering more over what to do? can we > > > > use next buddy itself ... thinking.. > > > > > > As I mentioned earlier today, I did not have your changes from kvm.git > > > tree when I tested my changes. Here are your changes and my changes > > > compared: > > > > > > throughput in MB/sec > > > > > > kvm_vcpu_on_spin changes: 4636 +/- 15.74% > > > yield_to changes: 4515 +/- 12.73% > > > > > > I would be inclined to stick with your changes which are kept in kvm > > > code. I did try both combined, and did not get good results: > > > > > > both changes: 4074 +/- 19.12% > > > > > > So, having both is probably not a good idea. However, I feel like > > > there's more work to be done. With no over-commit (10 VMs), total > > > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some > > > overhead, but a reduction to ~4500 is still terrible. By contrast, > > > 8-way VMs with 2x over-commit have a total throughput roughly 10% less > > > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread > > > host). We still have what appears to be scalability problems, but now > > > it's not so much in runqueue locks for yield_to(), but now > > > get_pid_task(): > > > > > > > Hi Andrew, > > IMHO, reducing the double runqueue lock overhead is a good idea, > > and may be we see the benefits when we increase the overcommit further. > > > > The explaination for not seeing good benefit on top of PLE handler > > optimization patch is because we filter the yield_to candidates, > > and hence resulting in less contention for double runqueue lock. > > and extra code overhead during genuine yield_to might have resulted in > > some degradation in the case you tested. > > > > However, did you use cfs.next also?. I hope it helps, when we combine. 
> > > > Here is the result that is showing positive benefit. > > I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s). > > > > +-----------+-----------+-----------+------------+-----------+ > > kernbench time in sec, lower is better > > +-----------+-----------+-----------+------------+-----------+ > > base stddev patched stddev %improve > > +-----------+-----------+-----------+------------+-----------+ > > 1x 44.3880 1.8699 40.8180 1.9173 8.04271 > > 2x 96.7580 4.2787 93.4188 3.5150 3.45108 > > +-----------+-----------+-----------+------------+-----------+ > > > > > > +-----------+-----------+-----------+------------+-----------+ > > ebizzy record/sec higher is better > > +-----------+-----------+-----------+------------+-----------+ > > base stddev patched stddev %improve > > +-----------+-----------+-----------+------------+-----------+ > > 1x 2374.1250 50.9718 3816.2500 54.0681 60.74343 > > 2x 2536.2500 93.0403 2789.3750 204.7897 9.98029 > > +-----------+-----------+-----------+------------+-----------+ > > > > > > Below is the patch which combine suggestions of peterZ on your > > original approach with cfs.next (already posted by Srikar in the other > > thread) > > I did get a chance to run with the below patch and your changes in > kvm.git, but the results were not too different: > > Dbench, 10 x 16-way VMs on 80-way host: > > kvm_vcpu_on_spin changes: 4636 +/- 15.74% > yield_to changes: 4515 +/- 12.73% > both changes from above: 4074 +/- 19.12% > ...plus cfs.next check: 4418 +/- 16.97% > > Still hovering around 4500 MB/sec > > The concern I have is that even though we have gone through changes to > help reduce the candidate vcpus we yield to, we still have a very poor > idea of which vcpu really needs to run. The result is high cpu usage in > the get_pid_task and still some contention in the double runqueue lock. 
> To make this scalable, we either need to significantly reduce the > occurrence of the lock-holder preemption, or do a much better job of > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > which do not need to run). > > On reducing the occurrence: The worst case for lock-holder preemption > is having vcpus of same VM on the same runqueue. This guarantees the > situation of 1 vcpu running while another [of the same VM] is not. To > prove the point, I ran the same test, but with vcpus restricted to a > range of host cpus, such that any single VM's vcpus can never be on the > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > vcpu-1's are on host cpus 5-9, and so on. Here is the result: > > kvm_cpu_spin, and all > yield_to changes, plus > restricted vcpu placement: 8823 +/- 3.20% much, much better > > On picking a better vcpu to yield to: I really hesitate to rely on > paravirt hint [telling us which vcpu is holding a lock], but I am not > sure how else to reduce the candidate vcpus to yield to. I suspect we > are yielding to way more vcpus than are prempted lock-holders, and that > IMO is just work accomplishing nothing. Trying to think of way to > further reduce candidate vcpus.... >

wrt yielding to vcpus on the same cpu: I recently noticed that there's a bug in yield_to_task_fair. yield_task_fair() calls clear_buddies(), so if we're yielding to a task that has been running on the same cpu that we're currently running on, and thus is also on the current cfs runqueue, then our 'who to pick next' hint is getting cleared right after we set it.

I had hoped that the patch below would show a general improvement in the vcpu overcommit performance, however the results were variable - no worse, no better. Based on your results above showing good improvement from interleaving vcpus across the cpus, that means there was a decent percent of these types of yields going on. 
So since the patch didn't change much, that indicates that the next hinting isn't generally taken too seriously by the scheduler. Anyway, the patch should correct the code per its design, and testing shows that it didn't make anything worse, so I'll post it soon.

Also, in order to try and improve how far set-next can jump ahead in the queue, I tested a kernel with group scheduling compiled out (libvirt uses cgroups, and I'm not sure whether autogroups affect things). I did get a slight improvement with that, but nothing to write home to mom about.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c219bf8..7d8a21d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
+	/* We're yielding, so tell the scheduler we don't want to be picked */
+	yield_task_fair(rq);
+
 	/* Tell the scheduler that we'd really like pse to run next. */
 	set_next_buddy(se);
 
-	yield_task_fair(rq);
-
 	return true;
 }

> > > > > ----8<---- > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index fbf1fd0..8551f57 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4820,6 +4820,24 @@ void __sched yield(void) > > } > > EXPORT_SYMBOL(yield); > > > > +/* > > + * Tests preconditions required for sched_class::yield_to(). 
> > + */ > > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, > > + struct rq *p_rq) > > +{ > > + if (!curr->sched_class->yield_to_task) > > + return false; > > + > > + if (curr->sched_class != p->sched_class) > > + return false; > > + > > + if (task_running(p_rq, p) || p->state) > > + return false; > > + > > + return true; > > +} > > + > > /** > > * yield_to - yield the current processor to another thread in > > * your thread group, or accelerate that thread toward the > > @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt) > > > > again: > > p_rq = task_rq(p); > > + > > + /* optimistic test to avoid taking locks */ > > + if (!__yield_to_candidate(curr, p, p_rq)) > > + goto out_irq; > > + > > + /* if next buddy is set, assume yield is in progress */ > > + if (p_rq->cfs.next) > > + goto out_irq; > > + > > double_rq_lock(rq, p_rq); > > while (task_rq(p) != p_rq) { > > double_rq_unlock(rq, p_rq); > > goto again; > > } > > > > - if (!curr->sched_class->yield_to_task) > > - goto out; > > - > > - if (curr->sched_class != p->sched_class) > > - goto out; > > - > > - if (task_running(p_rq, p) || p->state) > > - goto out; > > + /* validate state, holding p_rq ensures p's state cannot change */ > > + if (!__yield_to_candidate(curr, p, p_rq)) > > + goto out_unlock; > > > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > > if (yielded) { > > @@ -4877,8 +4899,9 @@ again: > > rq->skip_clock_update = 0; > > } > > > > -out: > > +out_unlock: > > double_rq_unlock(rq, p_rq); > > +out_irq: > > local_irq_restore(flags); > > > > if (yielded) > > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-14 17:10 ` Andrew Jones @ 2012-09-15 16:08 ` Raghavendra K T 2012-09-17 13:48 ` Andrew Jones 0 siblings, 1 reply; 41+ messages in thread From: Raghavendra K T @ 2012-09-15 16:08 UTC (permalink / raw) To: Andrew Jones Cc: Andrew Theurer, Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/14/2012 10:40 PM, Andrew Jones wrote: > On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote: >> On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: >>> * Andrew Theurer<habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]: >>> [...] >> >> On picking a better vcpu to yield to: I really hesitate to rely on >> paravirt hint [telling us which vcpu is holding a lock], but I am not >> sure how else to reduce the candidate vcpus to yield to. I suspect we >> are yielding to way more vcpus than are prempted lock-holders, and that >> IMO is just work accomplishing nothing. Trying to think of way to >> further reduce candidate vcpus.... >> > > wrt to yielding to vcpus for the same cpu, I recently noticed that > there's a bug in yield_to_task_fair. yield_task_fair() calls > clear_buddies(), so if we're yielding to a task that has been running on > the same cpu that we're currently running on, and thus is also on the > current cfs runqueue, then our 'who to pick next' hint is getting cleared > right after we set it. > > I had hoped that the patch below would show a general improvement in the > vpu overcommit performance, however the results were variable - no worse, > no better. Based on your results above showing good improvement from > interleaving vcpus across the cpus, then that means there was a decent > percent of these types of yields going on. So since the patch didn't > change much that indicates that the next hinting isn't generally taken > too seriously by the scheduler. 
Anyway, the patch should correct the > code per its design, and testing shows that it didn't make anything worse, > so I'll post it soon. Also, in order to try and improve how far set-next > can jump ahead in the queue, I tested a kernel with group scheduling > compiled out (libvirt uses cgroups and I'm not sure autogroups may affect > things). I did get slight improvement with that, but nothing to write home > to mom about. > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c219bf8..7d8a21d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp > if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)) > return false; > > + /* We're yielding, so tell the scheduler we don't want to be picked */ > + yield_task_fair(rq); > + > /* Tell the scheduler that we'd really like pse to run next. */ > set_next_buddy(se); > > - yield_task_fair(rq); > - > return true; > }

Hi Drew,

I agree with your fix and tested the patch too; the results are pretty much the same. Puzzled why. Thinking ... maybe we hit this when #vcpu (of a VM) > #pcpu? (pigeonhole principle ;)).

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-15 16:08 ` Raghavendra K T @ 2012-09-17 13:48 ` Andrew Jones 0 siblings, 0 replies; 41+ messages in thread From: Andrew Jones @ 2012-09-17 13:48 UTC (permalink / raw) To: Raghavendra K T Cc: Andrew Theurer, Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Sat, Sep 15, 2012 at 09:38:54PM +0530, Raghavendra K T wrote: > On 09/14/2012 10:40 PM, Andrew Jones wrote: > >On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote: > >>On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: > >>>* Andrew Theurer<habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]: > >>> > [...] > >> > >>On picking a better vcpu to yield to: I really hesitate to rely on > >>paravirt hint [telling us which vcpu is holding a lock], but I am not > >>sure how else to reduce the candidate vcpus to yield to. I suspect we > >>are yielding to way more vcpus than are prempted lock-holders, and that > >>IMO is just work accomplishing nothing. Trying to think of way to > >>further reduce candidate vcpus.... > >> > > > >wrt to yielding to vcpus for the same cpu, I recently noticed that > >there's a bug in yield_to_task_fair. yield_task_fair() calls > >clear_buddies(), so if we're yielding to a task that has been running on > >the same cpu that we're currently running on, and thus is also on the > >current cfs runqueue, then our 'who to pick next' hint is getting cleared > >right after we set it. > > > >I had hoped that the patch below would show a general improvement in the > >vpu overcommit performance, however the results were variable - no worse, > >no better. Based on your results above showing good improvement from > >interleaving vcpus across the cpus, then that means there was a decent > >percent of these types of yields going on. 
So since the patch didn't > >change much that indicates that the next hinting isn't generally taken > >too seriously by the scheduler. Anyway, the patch should correct the > >code per its design, and testing shows that it didn't make anything worse, > >so I'll post it soon. Also, in order to try and improve how far set-next > >can jump ahead in the queue, I tested a kernel with group scheduling > >compiled out (libvirt uses cgroups and I'm not sure autogroups may affect > >things). I did get slight improvement with that, but nothing to write home > >to mom about. > > > >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >index c219bf8..7d8a21d 100644 > >--- a/kernel/sched/fair.c > >+++ b/kernel/sched/fair.c > >@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp > > if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)) > > return false; > > > >+ /* We're yielding, so tell the scheduler we don't want to be picked */ > >+ yield_task_fair(rq); > >+ > > /* Tell the scheduler that we'd really like pse to run next. */ > > set_next_buddy(se); > > > >- yield_task_fair(rq); > >- > > return true; > > } > > > > Hi Drew, Agree with your fix and tested the patch too.. results are > pretty much same. puzzled why so.

Looking at the code, I see that the next hint might be used more frequently if we bump up sysctl/kernel.sched_wakeup_granularity_ns. I also just found out that some virt tuned profiles do that, so maybe I should try running with one of those profiles.

> > thinking ... may be we hit this when #vcpu (of a VM) > #pcpu? > (pigeonhole principle ;)).

Not sure, but I haven't done any experiments where a single VM has more vcpus than the system has pcpus. For my vcpu overcommit I increase the VM count, where each VM has #vcpus <= #pcpus.

Drew

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-13 21:30 ` Andrew Theurer 2012-09-14 17:10 ` Andrew Jones @ 2012-09-14 20:34 ` Konrad Rzeszutek Wilk 2012-09-17 8:02 ` Andrew Jones 2012-09-16 8:55 ` Avi Kivity 2 siblings, 1 reply; 41+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-14 20:34 UTC (permalink / raw) To: habanero Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri > The concern I have is that even though we have gone through changes to > help reduce the candidate vcpus we yield to, we still have a very poor > idea of which vcpu really needs to run. The result is high cpu usage in > the get_pid_task and still some contention in the double runqueue lock. > To make this scalable, we either need to significantly reduce the > occurrence of the lock-holder preemption, or do a much better job of > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > which do not need to run). The patches that Raghavendra has been posting do accomplish that. > > On reducing the occurrence: The worst case for lock-holder preemption > is having vcpus of same VM on the same runqueue. This guarantees the > situation of 1 vcpu running while another [of the same VM] is not. To > prove the point, I ran the same test, but with vcpus restricted to a > range of host cpus, such that any single VM's vcpus can never be on the > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > vcpu-1's are on host cpus 5-9, and so on. Here is the result: > > kvm_cpu_spin, and all > yield_to changes, plus > restricted vcpu placement: 8823 +/- 3.20% much, much better > > On picking a better vcpu to yield to: I really hesitate to rely on > paravirt hint [telling us which vcpu is holding a lock], but I am not > sure how else to reduce the candidate vcpus to yield to. 
I suspect we > are yielding to way more vcpus than are preempted lock-holders, and that > IMO is just work accomplishing nothing. Trying to think of a way to > further reduce candidate vcpus.... ... the patches are posted - you could try them out? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-14 20:34 ` Konrad Rzeszutek Wilk @ 2012-09-17 8:02 ` Andrew Jones 0 siblings, 0 replies; 41+ messages in thread From: Andrew Jones @ 2012-09-17 8:02 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: habanero, Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri, rkrcmar On Fri, Sep 14, 2012 at 04:34:24PM -0400, Konrad Rzeszutek Wilk wrote: > > The concern I have is that even though we have gone through changes to > > help reduce the candidate vcpus we yield to, we still have a very poor > > idea of which vcpu really needs to run. The result is high cpu usage in > > the get_pid_task and still some contention in the double runqueue lock. > > To make this scalable, we either need to significantly reduce the > > occurrence of the lock-holder preemption, or do a much better job of > > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > > which do not need to run). > > The patches that Raghavendra has been posting do accomplish that. > > > > On reducing the occurrence: The worst case for lock-holder preemption > > is having vcpus of same VM on the same runqueue. This guarantees the > > situation of 1 vcpu running while another [of the same VM] is not. To > > prove the point, I ran the same test, but with vcpus restricted to a > > range of host cpus, such that any single VM's vcpus can never be on the > > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > > vcpu-1's are on host cpus 5-9, and so on. 
Here is the result: > > > > kvm_cpu_spin, and all > > yield_to changes, plus > > restricted vcpu placement: 8823 +/- 3.20% much, much better > > > > On picking a better vcpu to yield to: I really hesitate to rely on > > paravirt hint [telling us which vcpu is holding a lock], but I am not > > sure how else to reduce the candidate vcpus to yield to. I suspect we > > are yielding to way more vcpus than are prempted lock-holders, and that > > IMO is just work accomplishing nothing. Trying to think of way to > > further reduce candidate vcpus.... > > ... the patches are posted - you could try them out? Radim and I have done some testing with the pvticketlock series. While we saw a gain over PLE alone, it wasn't huge, and without PLE also enabled it could hardly support 2.0x overcommit. spinlocks aren't the only place where cpu_relax() is called within a relatively tight loop, so it's likely that PLE yielding just generally helps by getting schedule() called more frequently. Drew ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-13 21:30 ` Andrew Theurer 2012-09-14 17:10 ` Andrew Jones 2012-09-14 20:34 ` Konrad Rzeszutek Wilk @ 2012-09-16 8:55 ` Avi Kivity 2012-09-17 8:10 ` Andrew Jones 2012-09-18 3:03 ` Andrew Theurer 2 siblings, 2 replies; 41+ messages in thread From: Avi Kivity @ 2012-09-16 8:55 UTC (permalink / raw) To: habanero Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/14/2012 12:30 AM, Andrew Theurer wrote: > The concern I have is that even though we have gone through changes to > help reduce the candidate vcpus we yield to, we still have a very poor > idea of which vcpu really needs to run. The result is high cpu usage in > the get_pid_task and still some contention in the double runqueue lock. > To make this scalable, we either need to significantly reduce the > occurrence of the lock-holder preemption, or do a much better job of > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > which do not need to run). > > On reducing the occurrence: The worst case for lock-holder preemption > is having vcpus of same VM on the same runqueue. This guarantees the > situation of 1 vcpu running while another [of the same VM] is not. To > prove the point, I ran the same test, but with vcpus restricted to a > range of host cpus, such that any single VM's vcpus can never be on the > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > vcpu-1's are on host cpus 5-9, and so on. Here is the result: > > kvm_cpu_spin, and all > yield_to changes, plus > restricted vcpu placement: 8823 +/- 3.20% much, much better > > On picking a better vcpu to yield to: I really hesitate to rely on > paravirt hint [telling us which vcpu is holding a lock], but I am not > sure how else to reduce the candidate vcpus to yield to. 
I suspect we > are yielding to way more vcpus than are prempted lock-holders, and that > IMO is just work accomplishing nothing. Trying to think of way to > further reduce candidate vcpus.... I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing. That other vcpu gets work done (unless it is in pause loop itself) and the yielding vcpu gets put to sleep for a while, so it doesn't spend cycles spinning. While we haven't fixed the problem at least the guest is accomplishing work, and meanwhile the real lock holder may get naturally scheduled and clear the lock. The main problem with this theory is that the experiments don't seem to bear it out. So maybe one of the assumptions is wrong - the yielding vcpu gets scheduled early. That could be the case if the two vcpus are on different runqueues - you could be changing the relative priority of vcpus on the target runqueue, but still remain on top yourself. Is this possible with the current code? Maybe we should prefer vcpus on the same runqueue as yield_to targets, and only fall back to remote vcpus when we see it didn't help. Let's examine a few cases: 1. spinner on cpu 0, lock holder on cpu 0 win! 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0 Spinner gets put to sleep, random vcpus get to work, low lock contention (no double_rq_lock), by the time spinner gets scheduled we might have won 3. spinner on cpu 0, another spinner on cpu 0 Worst case, we'll just spin some more. Need to detect this case and migrate something in. 4. spinner on cpu 0, alone Similar It seems we need to tie in to the load balancer. Would changing the priority of the task while it is spinning help the load balancer? -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 41+ messages in thread
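Avi's "prefer vcpus on the same runqueue as yield_to targets, and only fall back to remote vcpus" idea can be sketched as a two-pass candidate scan. This is an illustrative userspace model under stated assumptions - `struct vcpu_model` and `pick_yield_target` are invented names, not KVM code - meant only to show the selection policy:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a vcpu task: which host cpu (runqueue) it sits on,
 * and whether it is currently running. */
struct vcpu_model {
    int  cpu;
    bool running;
};

/*
 * Single pass that prefers a runnable-but-not-running vcpu on the *same*
 * runqueue as the spinner (no cross-cpu double_rq_lock needed, and yielding
 * actually hands our own cpu to the target), remembering the first remote
 * candidate as a fallback.
 */
static int pick_yield_target(const struct vcpu_model *v, int n, int this_cpu)
{
    int fallback = -1;
    for (int i = 0; i < n; i++) {
        if (v[i].running)
            continue;               /* already on a cpu: nothing to boost */
        if (v[i].cpu == this_cpu)
            return i;               /* best case: local candidate */
        if (fallback < 0)
            fallback = i;           /* first remote candidate, just in case */
    }
    return fallback;                /* -1 if nobody is eligible */
}
```

A real implementation would of course also need Avi's "didn't help" feedback to decide when the local preference should be abandoned, which this sketch does not model.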
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-16 8:55 ` Avi Kivity @ 2012-09-17 8:10 ` Andrew Jones 2012-09-18 3:03 ` Andrew Theurer 1 sibling, 0 replies; 41+ messages in thread From: Andrew Jones @ 2012-09-17 8:10 UTC (permalink / raw) To: Avi Kivity Cc: habanero, Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Sun, Sep 16, 2012 at 11:55:28AM +0300, Avi Kivity wrote: > On 09/14/2012 12:30 AM, Andrew Theurer wrote: > > > The concern I have is that even though we have gone through changes to > > help reduce the candidate vcpus we yield to, we still have a very poor > > idea of which vcpu really needs to run. The result is high cpu usage in > > the get_pid_task and still some contention in the double runqueue lock. > > To make this scalable, we either need to significantly reduce the > > occurrence of the lock-holder preemption, or do a much better job of > > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > > which do not need to run). > > > > On reducing the occurrence: The worst case for lock-holder preemption > > is having vcpus of same VM on the same runqueue. This guarantees the > > situation of 1 vcpu running while another [of the same VM] is not. To > > prove the point, I ran the same test, but with vcpus restricted to a > > range of host cpus, such that any single VM's vcpus can never be on the > > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > > vcpu-1's are on host cpus 5-9, and so on. Here is the result: > > > > kvm_cpu_spin, and all > > yield_to changes, plus > > restricted vcpu placement: 8823 +/- 3.20% much, much better > > > > On picking a better vcpu to yield to: I really hesitate to rely on > > paravirt hint [telling us which vcpu is holding a lock], but I am not > > sure how else to reduce the candidate vcpus to yield to. 
I suspect we > > are yielding to way more vcpus than are prempted lock-holders, and that > > IMO is just work accomplishing nothing. Trying to think of way to > > further reduce candidate vcpus.... > > I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing. > That other vcpu gets work done (unless it is in pause loop itself) and > the yielding vcpu gets put to sleep for a while, so it doesn't spend > cycles spinning. While we haven't fixed the problem at least the guest > is accomplishing work, and meanwhile the real lock holder may get > naturally scheduled and clear the lock. > > The main problem with this theory is that the experiments don't seem to > bear it out. So maybe one of the assumptions is wrong - the yielding > vcpu gets scheduled early. That could be the case if the two vcpus are > on different runqueues - you could be changing the relative priority of > vcpus on the target runqueue, but still remain on top yourself. Is this > possible with the current code? > > Maybe we should prefer vcpus on the same runqueue as yield_to targets, > and only fall back to remote vcpus when we see it didn't help. I thought about this a bit recently too, but didn't pursue it, because I figured it would actually increase the get_pid_task and double_rq_lock contention time if we have to hunt too long for a vcpu that matches a more strict criteria. But, I guess if we can implement a special "reschedule" to run on the current cpu which prioritizes runnable/non-running vcpus, then it should be just as fast or faster for it to look through the runqueue first, than it is to look through all the vcpus first. Drew > > Let's examine a few cases: > > 1. spinner on cpu 0, lock holder on cpu 0 > > win! > > 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0 > > Spinner gets put to sleep, random vcpus get to work, low lock contention > (no double_rq_lock), by the time spinner gets scheduled we might have won > > 3. 
spinner on cpu 0, another spinner on cpu 0 > > Worst case, we'll just spin some more. Need to detect this case and > migrate something in. > > 4. spinner on cpu 0, alone > > Similar > > > It seems we need to tie in to the load balancer. > > Would changing the priority of the task while it is spinning help the > load balancer? > > -- > error compiling committee.c: too many arguments to function > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-16 8:55 ` Avi Kivity 2012-09-17 8:10 ` Andrew Jones @ 2012-09-18 3:03 ` Andrew Theurer 2012-09-19 13:39 ` Avi Kivity 1 sibling, 1 reply; 41+ messages in thread From: Andrew Theurer @ 2012-09-18 3:03 UTC (permalink / raw) To: Avi Kivity Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote: > On 09/14/2012 12:30 AM, Andrew Theurer wrote: > > > The concern I have is that even though we have gone through changes to > > help reduce the candidate vcpus we yield to, we still have a very poor > > idea of which vcpu really needs to run. The result is high cpu usage in > > the get_pid_task and still some contention in the double runqueue lock. > > To make this scalable, we either need to significantly reduce the > > occurrence of the lock-holder preemption, or do a much better job of > > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > > which do not need to run). > > > > On reducing the occurrence: The worst case for lock-holder preemption > > is having vcpus of same VM on the same runqueue. This guarantees the > > situation of 1 vcpu running while another [of the same VM] is not. To > > prove the point, I ran the same test, but with vcpus restricted to a > > range of host cpus, such that any single VM's vcpus can never be on the > > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > > vcpu-1's are on host cpus 5-9, and so on. Here is the result: > > > > kvm_cpu_spin, and all > > yield_to changes, plus > > restricted vcpu placement: 8823 +/- 3.20% much, much better > > > > On picking a better vcpu to yield to: I really hesitate to rely on > > paravirt hint [telling us which vcpu is holding a lock], but I am not > > sure how else to reduce the candidate vcpus to yield to. 
I suspect we > > are yielding to way more vcpus than are prempted lock-holders, and that > > IMO is just work accomplishing nothing. Trying to think of way to > > further reduce candidate vcpus.... > > I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing. > That other vcpu gets work done (unless it is in pause loop itself) and > the yielding vcpu gets put to sleep for a while, so it doesn't spend > cycles spinning. While we haven't fixed the problem at least the guest > is accomplishing work, and meanwhile the real lock holder may get > naturally scheduled and clear the lock. OK, yes, if the other thread gets useful work done, then it is not wasteful. I was thinking of the worst case scenario, where any other vcpu would likely spin as well, and the host side cpu-time for switching vcpu threads was not all that productive. Well, I suppose it does help eliminate potential lock holding vcpus; it just seems to be not that efficient or fast enough. > The main problem with this theory is that the experiments don't seem to > bear it out. Granted, my test case is quite brutal. It's nothing but over-committed VMs which always have some spin lock activity. However, we really should try to fix the worst case scenario. > So maybe one of the assumptions is wrong - the yielding > vcpu gets scheduled early. That could be the case if the two vcpus are > on different runqueues - you could be changing the relative priority of > vcpus on the target runqueue, but still remain on top yourself. Is this > possible with the current code? > > Maybe we should prefer vcpus on the same runqueue as yield_to targets, > and only fall back to remote vcpus when we see it didn't help. > > Let's examine a few cases: > > 1. spinner on cpu 0, lock holder on cpu 0 > > win! > > 2. 
spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0 > > Spinner gets put to sleep, random vcpus get to work, low lock contention > (no double_rq_lock), by the time spinner gets scheduled we might have won > > 3. spinner on cpu 0, another spinner on cpu 0 > > Worst case, we'll just spin some more. Need to detect this case and > migrate something in. Well, we can certainly experiment and see what we get. IMO, the key to getting this working really well on the large VMs is finding the lock-holding cpu -quickly-. What I think is happening is that we go through a relatively long process to get to that one right vcpu. I guess I need to find a faster way to get there. > 4. spinner on cpu 0, alone > > Similar > > > It seems we need to tie in to the load balancer. > > Would changing the priority of the task while it is spinning help the > load balancer? Not sure. -Andrew ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-18 3:03 ` Andrew Theurer @ 2012-09-19 13:39 ` Avi Kivity 0 siblings, 0 replies; 41+ messages in thread From: Avi Kivity @ 2012-09-19 13:39 UTC (permalink / raw) To: habanero Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/18/2012 06:03 AM, Andrew Theurer wrote: > On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote: >> On 09/14/2012 12:30 AM, Andrew Theurer wrote: >> >> > The concern I have is that even though we have gone through changes to >> > help reduce the candidate vcpus we yield to, we still have a very poor >> > idea of which vcpu really needs to run. The result is high cpu usage in >> > the get_pid_task and still some contention in the double runqueue lock. >> > To make this scalable, we either need to significantly reduce the >> > occurrence of the lock-holder preemption, or do a much better job of >> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus >> > which do not need to run). >> > >> > On reducing the occurrence: The worst case for lock-holder preemption >> > is having vcpus of same VM on the same runqueue. This guarantees the >> > situation of 1 vcpu running while another [of the same VM] is not. To >> > prove the point, I ran the same test, but with vcpus restricted to a >> > range of host cpus, such that any single VM's vcpus can never be on the >> > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, >> > vcpu-1's are on host cpus 5-9, and so on. Here is the result: >> > >> > kvm_cpu_spin, and all >> > yield_to changes, plus >> > restricted vcpu placement: 8823 +/- 3.20% much, much better >> > >> > On picking a better vcpu to yield to: I really hesitate to rely on >> > paravirt hint [telling us which vcpu is holding a lock], but I am not >> > sure how else to reduce the candidate vcpus to yield to. 
I suspect we >> > are yielding to way more vcpus than are prempted lock-holders, and that >> > IMO is just work accomplishing nothing. Trying to think of way to >> > further reduce candidate vcpus.... >> >> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing. >> That other vcpu gets work done (unless it is in pause loop itself) and >> the yielding vcpu gets put to sleep for a while, so it doesn't spend >> cycles spinning. While we haven't fixed the problem at least the guest >> is accomplishing work, and meanwhile the real lock holder may get >> naturally scheduled and clear the lock. > > OK, yes, if the other thread gets useful work done, then it is not > wasteful. I was thinking of the worst case scenario, where any other > vcpu would likely spin as well, and the host side cpu-time for switching > vcpu threads was not all that productive. Well, I suppose it does help > eliminate potential lock holding vcpus; it just seems to be not that > efficient or fast enough. If we have N-1 vcpus spinwaiting on 1 vcpu, with N:1 overcommit then yes, we must iterate over N-1 vcpus until we find Mr. Right. Eventually it's not-a-timeslice will expire and we go through this again. If N*y_yield is comparable to the timeslice, we start losing efficiency. Because of lock contention, t_yield can scale with the number of host cpus. So in this worst case, we get quadratic behaviour. One way out is to increase the not-a-timeslice. Can we get spinning vcpus to do that for running vcpus, if they cannot find a runnable-but-not-running vcpu? That's not guaranteed to help, if we boost a running vcpu too much it will skew how vcpu runtime is distributed even after the lock is released. > >> The main problem with this theory is that the experiments don't seem to >> bear it out. > > Granted, my test case is quite brutal. It's nothing but over-committed > VMs which always have some spin lock activity. However, we really > should try to fix the worst case scenario. Yes. 
And other guests may not scale as well as Linux, so they may show this behaviour more often. > >> So maybe one of the assumptions is wrong - the yielding >> vcpu gets scheduled early. That could be the case if the two vcpus are >> on different runqueues - you could be changing the relative priority of >> vcpus on the target runqueue, but still remain on top yourself. Is this >> possible with the current code? >> >> Maybe we should prefer vcpus on the same runqueue as yield_to targets, >> and only fall back to remote vcpus when we see it didn't help. >> >> Let's examine a few cases: >> >> 1. spinner on cpu 0, lock holder on cpu 0 >> >> win! >> >> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0 >> >> Spinner gets put to sleep, random vcpus get to work, low lock contention >> (no double_rq_lock), by the time spinner gets scheduled we might have won >> >> 3. spinner on cpu 0, another spinner on cpu 0 >> >> Worst case, we'll just spin some more. Need to detect this case and >> migrate something in. > > Well, we can certainly experiment and see what we get. > > IMO, the key to getting this working really well on the large VMs is > finding the lock-holding cpu -quickly-. What I think is happening is > that we go through a relatively long process to get to that one right > vcpu. I guess I need to find a faster way to get there. pvspinlocks will find the right one, every time. Otherwise I see no way to do this. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 41+ messages in thread
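Avi's quadratic worst case can be put into a back-of-envelope model. The numbers and function below are illustrative assumptions, not measurements: N-1 spinners each scan up to N-1 candidates, each yield_to() attempt costs t_yield, and we let t_yield grow linearly with the host cpu count to stand in for runqueue-lock contention.

```c
#include <assert.h>

/*
 * Toy cost model: (n_vcpus - 1) spinners each iterating over (n_vcpus - 1)
 * candidates, with a per-attempt cost that scales with the host cpu count.
 * Total yield work per not-a-timeslice is therefore roughly quadratic in N.
 */
static long long yield_work_ns(long long n_vcpus, long long n_host_cpus,
                               long long t_yield_base_ns)
{
    long long t_yield_ns = t_yield_base_ns * n_host_cpus;  /* contention term */
    return (n_vcpus - 1) * (n_vcpus - 1) * t_yield_ns;
}
```

With these assumptions, doubling the vcpu count roughly quadruples the yield work, and doubling the host cpu count doubles it again, which is consistent with the "we get quadratic behaviour" observation above.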
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-11 18:27 ` Andrew Theurer 2012-09-13 11:48 ` Raghavendra K T @ 2012-09-13 12:13 ` Avi Kivity 1 sibling, 0 replies; 41+ messages in thread From: Avi Kivity @ 2012-09-13 12:13 UTC (permalink / raw) To: habanero Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri On 09/11/2012 09:27 PM, Andrew Theurer wrote: > > So, having both is probably not a good idea. However, I feel like > there's more work to be done. With no over-commit (10 VMs), total > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some > overhead, but a reduction to ~4500 is still terrible. By contrast, > 8-way VMs with 2x over-commit have a total throughput roughly 10% less > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread > host). We still have what appear to be scalability problems, but now > it's not so much in runqueue locks for yield_to(), but now > get_pid_task(): > > perf on host:
>
>  32.10%  320131  qemu-system-x86  [kernel.kallsyms]  [k] get_pid_task
>  11.60%  115686  qemu-system-x86  [kernel.kallsyms]  [k] _raw_spin_lock
>  10.28%  102522  qemu-system-x86  [kernel.kallsyms]  [k] yield_to
>   9.17%   91507  qemu-system-x86  [kvm]              [k] kvm_vcpu_on_spin
>   7.74%   77257  qemu-system-x86  [kvm]              [k] kvm_vcpu_yield_to
>   3.56%   35476  qemu-system-x86  [kernel.kallsyms]  [k] __srcu_read_lock
>   3.00%   29951  qemu-system-x86  [kvm]              [k] __vcpu_run
>   2.93%   29268  qemu-system-x86  [kvm_intel]        [k] vmx_vcpu_run
>   2.88%   28783  qemu-system-x86  [kvm]              [k] vcpu_enter_guest
>   2.59%   25827  qemu-system-x86  [kernel.kallsyms]  [k] __schedule
>   1.40%   13976  qemu-system-x86  [kernel.kallsyms]  [k] _raw_spin_lock_irq
>   1.28%   12823  qemu-system-x86  [kernel.kallsyms]  [k] resched_task
>   1.14%   11376  qemu-system-x86  [kvm_intel]        [k] vmcs_writel
>   0.85%    8502  qemu-system-x86  [kernel.kallsyms]  [k] pick_next_task_fair
>   0.53%    5315  qemu-system-x86  [kernel.kallsyms]  [k] native_write_msr_safe
>   0.46%    4553  qemu-system-x86  [kernel.kallsyms]  [k] native_load_tr_desc
>
> get_pid_task() uses some rcu functions, wondering how scalable this > is.... I tend to think of rcu as -not- having issues like this... is > there an rcu stat/tracing tool which would help identify potential > problems? It's not, it's the atomics + cache line bouncing. We're basically guaranteed to bounce here. Here we're finally paying for the ioctl() based interface. A syscall based interface would have a 1:1 correspondence between vcpus and tasks, so these games would be unnecessary. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 41+ messages in thread
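Avi's point - that the cost is the atomic reference counting, not RCU itself - can be illustrated in userspace. The skeleton below is a hypothetical microbenchmark, not get_pid_task(): the assumption it encodes is that every lookup bumps a shared counter with an atomic read-modify-write, so under many concurrent callers the counter's cache line must ping-pong between cpus (the functional result is still correct; the cost shows up only as time).

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared refcount, analogous to the task usage count taken on lookup. */
static atomic_long refcount;

#define NTHREADS 4
#define NITERS   100000

/* Each thread hammers the same counter: every atomic RMW needs exclusive
 * ownership of the cache line, so the line bounces between cpus. */
static void *bouncer(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        atomic_fetch_add(&refcount, 1);   /* "get" */
        atomic_fetch_sub(&refcount, 1);   /* "put" */
    }
    return NULL;
}

static long run_bounce_demo(void)
{
    pthread_t t[NTHREADS];
    atomic_store(&refcount, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bouncer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&refcount);        /* balanced gets/puts: 0 */
}
```

Timing this loop with one thread versus many (e.g. under `perf stat`) is what would expose the bouncing; the counter's final value alone only shows correctness.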
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-10 17:12 ` Peter Zijlstra 2012-09-10 19:10 ` Raghavendra K T 2012-09-10 20:12 ` Andrew Theurer @ 2012-09-11 7:04 ` Srikar Dronamraju 2 siblings, 0 replies; 41+ messages in thread From: Srikar Dronamraju @ 2012-09-11 7:04 UTC (permalink / raw) To: Peter Zijlstra Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri > > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > bool preempt) > > > rq = this_rq(); > > > > > > again: > > > + /* optimistic test to avoid taking locks */ > > > + if (!__yield_to_candidate(curr, p)) > > > + goto out_irq; > > > + > > So add something like: > > /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > if (p_rq->cfs_rq->skip) > goto out_irq; > > > > > > > p_rq = task_rq(p); > > > double_rq_lock(rq, p_rq); > > > > > But I do have a question on this optimization though,.. Why do we check > p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > That is, I'd like to see this thing explained a little better. > > Does it go something like: p_rq is the runqueue of the task we'd like to > yield to, rq is our own, they might be the same. If we have a ->skip, > there's nothing we can do about it, OTOH p_rq having a ->skip and > failing the yield_to() simply means us picking the next VCPU thread, > which might be running on an entirely different cpu (rq) and could > succeed? > Oh this made me look back at yield_to() again. I had misread the yield_to_task_fair() code. I had wrongly thought that both ->skip and ->next buddies for the p_rq would be set. But it looks like only ->next for the p_rq is set and ->skip is set for rq. This should also explains why Andrew saw a regression when checking for ->skip flag instead of PF_VCPU. 
Can we check for p_rq->cfs.next and bail out if it is set?

@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if next buddy is set, assume yield is in progress */
+	if (p_rq->cfs.next)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4898,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)

^ permalink raw reply [flat|nested] 41+ messages in thread
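The double-check pattern in the patch above - test the preconditions cheaply without the lock, bail out early, and re-validate under the lock because the state may have changed in between - is a general one. Here is a minimal userspace sketch with pthreads; the names (`struct target`, `try_yield_to`) are illustrative stand-ins, not the scheduler code:

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy target: a task-like object whose eligibility can change concurrently. */
struct target {
    pthread_mutex_t lock;
    bool running;                 /* analogous to task_running(p_rq, p) */
    int  yields;
};

/* Cheap precondition test; used both optimistically and under the lock. */
static bool candidate(const struct target *t)
{
    return !t->running;
}

/*
 * yield_to-style flow: the unlocked check lets contended callers bail out
 * without ever touching the lock (the common failure path stays cheap);
 * the locked re-check makes the final decision on stable state.
 */
static bool try_yield_to(struct target *t)
{
    if (!candidate(t))            /* optimistic test, no lock taken */
        return false;

    pthread_mutex_lock(&t->lock);
    bool ok = candidate(t);       /* validate again, now race-free */
    if (ok)
        t->yields++;              /* the "yield" itself, under the lock */
    pthread_mutex_unlock(&t->lock);
    return ok;
}
```

The optimistic check may race and observe stale state, but that is harmless by construction: a false positive is caught by the locked re-check, and a false negative only costs one missed yield attempt, which is exactly the trade-off the patch makes.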
* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler 2012-09-07 19:42 ` Andrew Theurer 2012-09-08 8:43 ` Srikar Dronamraju @ 2012-09-10 14:43 ` Raghavendra K T 1 sibling, 0 replies; 41+ messages in thread From: Raghavendra K T @ 2012-09-10 14:43 UTC (permalink / raw) To: habanero Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar, KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri, Peter Zijlstra On 09/08/2012 01:12 AM, Andrew Theurer wrote: > On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote: >> CCing PeterZ also. >> >> On 09/07/2012 06:41 PM, Andrew Theurer wrote: >>> I have noticed recently that PLE/yield_to() is still not that scalable >>> for really large guests, sometimes even with no CPU over-commit. I have >>> a small change that make a very big difference. [...] >> We are indeed avoiding CPUS in guest mode when we check >> task->flags& PF_VCPU in vcpu_on_spin path. Doesn't that suffice? > My understanding is that it checks if the candidate vcpu task is in > guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a > target to yield to if it is already in guest mode. I am concerned about > a different vcpu, possibly from a different VM (let's call it g2vcpuN), > but it also located on the same runqueue as g1vcpuN -and- running. That > vcpu, g2vcpuN, may also be doing a directed yield, and it may already be > holding the rq lock. Or it could be in guest mode. If it is in guest > mode, then let's still target this rq, and try to yield to g1vcpuN. > However, if g2vcpuN is not in guest mode, then don't bother trying. - If a non vcpu task was currently running, this change can ignore request to yield to a target vcpu. The target vcpu could be the most eligible vcpu causing other vcpus to do ple exits. Is it possible to modify the check to deal with only vcpu tasks? - Should we use p_rq->cfs_rq->skip instead to let us know that some yield was active at this time? 
- Cpu 1    cpu2                       cpu3
  a1       a2                         a3
  b1       b2                         b3
           c2 (yield target of a1)    c3 (yield target of a2)

If vcpu a1 is doing a directed yield to vcpu c2 while the current vcpu a2 on the target cpu is also doing a directed yield (to some vcpu c3), then this change will only allow vcpu a2 to do a schedule() to b2 (if the a2 -> c3 yield is successful). Do we miss yielding to vcpu c2? a1 might not find a suitable vcpu to yield to and might go back to spinning. Is my understanding correct? > Patch included below. > > Here's the new, v2 result with the previous two: > > 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
>
>                       throughput  +/- stddev
>                       ----------  ----------
> ple on:               2552        +/- .70%
> ple on: w/fixv1:      4621        +/- 2.12%   (81% improvement)
> ple on: w/fixv2:      6115*                   (139% improvement)
>
The numbers look great. > [*] I do not have stdev yet because all 10 runs are not complete > > for v1 to v2, host CPU dropped from 60% to 50%. Time in spin_lock() is > also dropping: > [...] > > So this seems to be working. However I wonder just how far we can take > this. Ideally we need to be in <3-4% in host for PLE work, like I > observe for the 8-way VMs. We are still way off. > > -Andrew
>
> signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..c767915 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
> again:
> 	p_rq = task_rq(p);
> +	if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {

While we are checking the flags of the p_rq->curr task, the task p can migrate to some other runqueue. In this case will we miss yielding to the most eligible vcpu?

> +		goto out_no_unlock;
> +	}

Nit: We don't need the parentheses above. ^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads: [~2012-09-19 13:40 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
2012-07-18 14:39   ` Raghavendra K T
2012-07-19  9:47   ` [RESEND PATCH " Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-22 12:34   ` Raghavendra K T
2012-07-22 12:43     ` Avi Kivity
2012-07-23  7:35       ` Christian Borntraeger
2012-07-22 17:58 ` Rik van Riel
2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11 ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
2012-09-07 18:06   ` Raghavendra K T
2012-09-07 19:42     ` Andrew Theurer
2012-09-08  8:43       ` Srikar Dronamraju
2012-09-10 13:16         ` Andrew Theurer
2012-09-10 16:03           ` Peter Zijlstra
2012-09-10 16:56             ` Srikar Dronamraju
2012-09-10 17:12               ` Peter Zijlstra
2012-09-10 19:10                 ` Raghavendra K T
2012-09-10 20:12                 ` Andrew Theurer
2012-09-10 20:19                   ` Peter Zijlstra
2012-09-10 20:31                     ` Rik van Riel
2012-09-11  6:08                     ` Raghavendra K T
2012-09-11 12:48                   ` Andrew Theurer
2012-09-11 18:27                     ` Andrew Theurer
2012-09-13 11:48                       ` Raghavendra K T
2012-09-13 21:30                         ` Andrew Theurer
2012-09-14 17:10                           ` Andrew Jones
2012-09-15 16:08                             ` Raghavendra K T
2012-09-17 13:48                               ` Andrew Jones
2012-09-14 20:34                           ` Konrad Rzeszutek Wilk
2012-09-17  8:02                             ` Andrew Jones
2012-09-16  8:55                           ` Avi Kivity
2012-09-17  8:10                             ` Andrew Jones
2012-09-18  3:03                             ` Andrew Theurer
2012-09-19 13:39                               ` Avi Kivity
2012-09-13 12:13                     ` Avi Kivity
2012-09-11  7:04               ` Srikar Dronamraju
2012-09-10 14:43     ` Raghavendra K T