linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
@ 2012-07-18 13:37 Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimization Raghavendra K T
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel


Currently the Pause Loop Exit (PLE) handler does a directed yield to a
random vcpu on a PL-exit. We already have some filtering while choosing
the candidate to yield_to; this change adds more checks while choosing
a candidate to yield_to.

On large vcpu guests, there is a high probability of yielding to the
same vcpu that recently did a pause-loop exit. Such a yield can lead to
that vcpu spinning again.

The patchset keeps track of the pause loop exit and gives a chance to a
vcpu which has:

 (a) not done a pause loop exit at all (it is probably a preempted
     lock-holder)

 (b) been skipped in the last iteration because it did a pause loop exit,
     and has probably become eligible now (next eligible lock holder)

This concept also helps in the cpu relax interception cases, which use the
same handler.
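
In short, the check added in patch 3/3 below reduces to the following
helper (condensed here from the patch; the patch writes the first
expression as !in_spin_loop || (in_spin_loop && dy_eligible), which is
logically the same):

bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
{
	/* eligible if it never spun, or if it spun but was skipped last time */
	bool eligible = !vcpu->spin_loop.in_spin_loop ||
			vcpu->spin_loop.dy_eligible;

	/* toggle, so a vcpu skipped this time gets a chance next time */
	if (vcpu->spin_loop.in_spin_loop)
		vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;

	return eligible;
}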

Changes since V4:
 - Naming Change (Avi):
  struct ple ==> struct spin_loop
  cpu_relax_intercepted ==> in_spin_loop
  vcpu_check_and_update_eligible ==> vcpu_eligible_for_directed_yield
 - mark vcpu in spinloop as not eligible to avoid influence of previous exit

Changes since V3:
 - arch specific fix/changes (Christian)

Changes since v2:
 - Move ple structure to common code (Avi)
 - rename pause_loop_exited to cpu_relax_intercepted (Avi)
 - add config HAVE_KVM_CPU_RELAX_INTERCEPT (Avi)
 - Drop superfluous curly braces (Ingo)

Changes since v1:
 - Add more documentation for structure and algorithm, and rename
   plo ==> ple (Rik).
 - change dy_eligible initial value to false (otherwise the very first
   directed yield will not be skipped) (Nikunj)
 - fixup signoff/from issue

Future enhancements:
  (1) Currently we have a boolean to decide on eligibility of a vcpu. It
    would be nice to get feedback on large guests (>32 vcpus) on whether we
    can improve further with an integer counter (with counter = say f(log n)).
    A purely illustrative sketch of this idea follows after this list.

  (2) We have not considered system load during the iteration over vcpus. With
   that information we can limit the scan and also decide whether schedule()
   is better. [ I am able to use the number of kicked vcpus to decide on this,
   but maybe there are better ideas, like information from the global loadavg. ]

  (3) We can exploit this further with the PV patches, since they also know
   about the next eligible lock-holder.
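
A purely illustrative sketch of the counter idea in (1) -- the dy_credit
field and the ilog2()-based threshold are invented here and are not part
of this series; it assumes dy_eligible is replaced by an int counter in
struct spin_loop:

bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
{
	/* threshold ~ f(log n): log2 of the number of online vcpus */
	int threshold = ilog2(atomic_read(&vcpu->kvm->online_vcpus) + 1);

	if (!vcpu->spin_loop.in_spin_loop)
		return true;

	/* let a spinning vcpu through only every 'threshold' attempts */
	if (++vcpu->spin_loop.dy_credit >= threshold) {
		vcpu->spin_loop.dy_credit = 0;
		return true;
	}
	return false;
}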

Summary: There is a very good improvement for kvm-based guests on PLE machines.
V5 shows a huge improvement for kernbench.

+-----------+-----------+-----------+------------+-----------+
   base_rik    stdev       patched      stdev       %improve
+-----------+-----------+-----------+------------+-----------+
              kernbench (time in sec lesser is better)
+-----------+-----------+-----------+------------+-----------+
 1x    49.2300     1.0171    22.6842     0.3073    117.0233 %
 2x    91.9358     1.7768    53.9608     1.0154    70.37516 %
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
              ebizzy (records/sec more is better)
+-----------+-----------+-----------+------------+-----------+
 1x  1129.2500    28.6793    2125.6250    32.8239    88.23334 %
 2x  1892.3750    75.1112    2377.1250   181.6822    25.61596 %
+-----------+-----------+-----------+------------+-----------+

Note: The patches are tested on x86.

 Links
  V4: https://lkml.org/lkml/2012/7/16/80
  V3: https://lkml.org/lkml/2012/7/12/437
  V2: https://lkml.org/lkml/2012/7/10/392
  V1: https://lkml.org/lkml/2012/7/9/32

 Raghavendra K T (3):
   config: Add config to support ple or cpu relax optimization
   kvm: Note down when cpu relax intercepted or pause loop exited
   kvm: Choose a better candidate for directed yield
---
 arch/s390/kvm/Kconfig    |    1 +
 arch/x86/kvm/Kconfig     |    1 +
 include/linux/kvm_host.h |   39 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/Kconfig         |    3 +++
 virt/kvm/kvm_main.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 85 insertions(+), 0 deletions(-)



* [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimization
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-18 13:37 ` Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/s390/kvm/Kconfig |    1 +
 arch/x86/kvm/Kconfig  |    1 +
 virt/kvm/Kconfig      |    3 +++
 3 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 78eb984..a6e2677 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	depends on HAVE_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
 	select ANON_INODES
+	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
 	  virtualization capability on the mainframe. This should work
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a28f338..45c044f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -37,6 +37,7 @@ config KVM
 	select TASK_DELAY_ACCT
 	select PERF_EVENTS
 	select HAVE_KVM_MSI
+	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 28694f4..d01b24b 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -21,3 +21,6 @@ config KVM_ASYNC_PF
 
 config HAVE_KVM_MSI
        bool
+
+config HAVE_KVM_CPU_RELAX_INTERCEPT
+       bool



* [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimization Raghavendra K T
@ 2012-07-18 13:37 ` Raghavendra K T
  2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:37 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Noting whether a vcpu did a pause loop exit or had its cpu relax
intercepted helps in filtering the right candidate to yield to. A wrong
selection, i.e., a vcpu that just did a PL-exit or had its cpu relax
intercepted, may contribute to performance degradation.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
V2 was:
Reviewed-by: Rik van Riel <riel@redhat.com>

 include/linux/kvm_host.h |   34 ++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      |    5 +++++
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c446435..34ce296 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -183,6 +183,18 @@ struct kvm_vcpu {
 	} async_pf;
 #endif
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+	/*
+	 * Cpu relax intercept or pause loop exit optimization
+	 * in_spin_loop: set when a vcpu does a pause loop exit
+	 *  or cpu relax intercepted.
+	 * dy_eligible: indicates whether vcpu is eligible for directed yield.
+	 */
+	struct {
+		bool in_spin_loop;
+		bool dy_eligible;
+	} spin_loop;
+#endif
 	struct kvm_vcpu_arch arch;
 };
 
@@ -890,5 +902,27 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 	}
 }
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+
+static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
+{
+	vcpu->spin_loop.in_spin_loop = val;
+}
+static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
+{
+	vcpu->spin_loop.dy_eligible = val;
+}
+
+#else /* !CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
+
+static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
+static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
+{
+}
+
+#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..3d6ffc8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -236,6 +236,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	}
 	vcpu->run = page_address(page);
 
+	kvm_vcpu_set_in_spin_loop(vcpu, false);
+	kvm_vcpu_set_dy_eligible(vcpu, false);
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
 		goto fail_free_run;
@@ -1577,6 +1580,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;
 
+	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
@@ -1602,6 +1606,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 		}
 	}
+	kvm_vcpu_set_in_spin_loop(me, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 



* [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimization Raghavendra K T
  2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
@ 2012-07-18 13:38 ` Raghavendra K T
  2012-07-18 14:39   ` Raghavendra K T
  2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
  2012-07-23 10:03 ` Avi Kivity
  4 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 13:38 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel
  Cc: Srikar, S390, Carsten Otte, Christian Borntraeger, KVM,
	Raghavendra K T, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Currently, on large vcpu guests, there is a high probability of
yielding to the same vcpu that recently did a pause-loop exit or
had its cpu relax intercepted. Such a yield can lead to that vcpu
spinning again and hence degrade performance.

The patchset keeps track of the pause loop exit/cpu relax interception
and gives a chance to a vcpu which:
 (a) has not done a pause loop exit or had cpu relax intercepted at all
     (it is probably a preempted lock-holder)
 (b) was skipped in the last iteration because it did a pause loop exit
     or had cpu relax intercepted, and has probably become eligible now
     (next eligible lock holder)

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
V2 was:
Reviewed-by: Rik van Riel <riel@redhat.com>

 include/linux/kvm_host.h |    5 +++++
 virt/kvm/kvm_main.c      |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34ce296..952427d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -923,6 +923,11 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
 }
 
+static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	return true;
+}
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3d6ffc8..bf9fb97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,6 +1571,39 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+/*
+ * Helper that checks whether a VCPU is eligible for directed yield.
+ * The most eligible candidate to yield to is decided by the following heuristics:
+ *
+ *  (a) VCPU which has not done a pl-exit or had cpu relax intercepted recently
+ *  (preempted lock holder), indicated by @in_spin_loop.
+ *  Set at the beginning and cleared at the end of the interception/PLE handler.
+ *
+ *  (b) VCPU which has done a pl-exit/cpu relax intercepted but did not get a
+ *  chance last time (mostly it has become eligible now since we have probably
+ *  yielded to the lockholder in the last iteration. This is done by toggling
+ *  @dy_eligible each time a VCPU is checked for eligibility.)
+ *
+ *  Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
+ *  to a preempted lock-holder could result in wrong VCPU selection and CPU
+ *  burning. Giving priority to a potential lock-holder increases lock
+ *  progress.
+ */
+bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->spin_loop.in_spin_loop ||
+			(vcpu->spin_loop.in_spin_loop &&
+			 vcpu->spin_loop.dy_eligible);
+
+	if (vcpu->spin_loop.in_spin_loop)
+		vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;
+
+	return eligible;
+}
+#endif
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
@@ -1599,6 +1632,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
 				continue;
+			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+				continue;
 			if (kvm_vcpu_yield_to(vcpu)) {
 				kvm->last_boosted_vcpu = i;
 				yielded = 1;
@@ -1607,6 +1642,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 		}
 	}
 	kvm_vcpu_set_in_spin_loop(me, false);
+	kvm_vcpu_set_dy_eligible(me, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 



* Re: [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
  2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-18 14:39   ` Raghavendra K T
  2012-07-19  9:47     ` [RESEND PATCH " Raghavendra K T
  0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-07-18 14:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel, Srikar, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/18/2012 07:08 PM, Raghavendra K T wrote:
> From: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
> +bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
> +{
> +	bool eligible;
> +
> +	eligible = !vcpu->spin_loop.in_spin_loop ||
> +			(vcpu->spin_loop.in_spin_loop&&
> +			 vcpu->spin_loop.dy_eligible);
> +
> +	if (vcpu->spin_loop.in_spin_loop)
> +		vcpu->spin_loop.dy_eligible = !vcpu->spin_loop.dy_eligible;
> +
> +	return eligible;
> +}

I should have added a comment like:
Since the algorithm is based on heuristics, accessing another vcpu's data
without locking does no harm. It may result in trying to yield to the same
VCPU, failing, and continuing with the next one, and so on.



* [RESEND PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield
  2012-07-18 14:39   ` Raghavendra K T
@ 2012-07-19  9:47     ` Raghavendra K T
  0 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-19  9:47 UTC (permalink / raw)
  To: Raghavendra K T, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar
  Cc: S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel

Currently, on large vcpu guests, there is a high probability of
yielding to the same vcpu that recently did a pause-loop exit or
had its cpu relax intercepted. Such a yield can lead to that vcpu
spinning again and hence degrade performance.

The patchset keeps track of the pause loop exit/cpu relax interception
and gives a chance to a vcpu which:
 (a) has not done a pause loop exit or had cpu relax intercepted at all
     (it is probably a preempted lock-holder)
 (b) was skipped in the last iteration because it did a pause loop exit
     or had cpu relax intercepted, and has probably become eligible now
     (next eligible lock holder)

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
V2 was:
Reviewed-by: Rik van Riel <riel@redhat.com>

 Changelog: Added comment on locking as suggested by Avi

 include/linux/kvm_host.h |    5 +++++
 virt/kvm/kvm_main.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34ce296..952427d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -923,6 +923,11 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 {
 }
 
+static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	return true;
+}
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3d6ffc8..8fda756 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,6 +1571,43 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
+#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
+/*
+ * Helper that checks whether a VCPU is eligible for directed yield.
+ * The most eligible candidate to yield to is decided by the following heuristics:
+ *
+ *  (a) VCPU which has not done a pl-exit or had cpu relax intercepted recently
+ *  (preempted lock holder), indicated by @in_spin_loop.
+ *  Set at the beginning and cleared at the end of the interception/PLE handler.
+ *
+ *  (b) VCPU which has done a pl-exit/cpu relax intercepted but did not get a
+ *  chance last time (mostly it has become eligible now since we have probably
+ *  yielded to the lockholder in the last iteration. This is done by toggling
+ *  @dy_eligible each time a VCPU is checked for eligibility.)
+ *
+ *  Yielding to a recently pl-exited/cpu relax intercepted VCPU before yielding
+ *  to a preempted lock-holder could result in wrong VCPU selection and CPU
+ *  burning. Giving priority to a potential lock-holder increases lock
+ *  progress.
+ *
+ *  Since the algorithm is based on heuristics, accessing another VCPU's data
+ *  without locking does no harm. It may result in trying to yield to the same
+ *  VCPU, failing, and continuing with the next VCPU, and so on.
+ */
+bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->spin_loop.in_spin_loop ||
+			(vcpu->spin_loop.in_spin_loop &&
+			 vcpu->spin_loop.dy_eligible);
+
+	if (vcpu->spin_loop.in_spin_loop)
+		kvm_vcpu_set_dy_eligible(vcpu, !vcpu->spin_loop.dy_eligible);
+
+	return eligible;
+}
+#endif
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
@@ -1599,6 +1636,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
 				continue;
+			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+				continue;
 			if (kvm_vcpu_yield_to(vcpu)) {
 				kvm->last_boosted_vcpu = i;
 				yielded = 1;
@@ -1607,6 +1646,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 		}
 	}
 	kvm_vcpu_set_in_spin_loop(me, false);
+
+	/* Ensure vcpu is not eligible during next spinloop */
+	kvm_vcpu_set_dy_eligible(me, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 



* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
                   ` (2 preceding siblings ...)
  2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-20 17:36 ` Marcelo Tosatti
  2012-07-22 12:34   ` Raghavendra K T
  2012-07-23 10:03 ` Avi Kivity
  4 siblings, 1 reply; 41+ messages in thread
From: Marcelo Tosatti @ 2012-07-20 17:36 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, S390, Carsten Otte, Christian Borntraeger,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
> 
> [...]

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>



* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
@ 2012-07-22 12:34   ` Raghavendra K T
  2012-07-22 12:43     ` Avi Kivity
  2012-07-22 17:58     ` Rik van Riel
  0 siblings, 2 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-07-22 12:34 UTC (permalink / raw)
  To: Marcelo Tosatti, Avi Kivity, Rik van Riel, Christian Borntraeger
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:
> On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:
>> [...]
>
> Reviewed-by: Marcelo Tosatti<mtosatti@redhat.com>
>

Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
if this series looks good now.



* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-22 12:34   ` Raghavendra K T
@ 2012-07-22 12:43     ` Avi Kivity
  2012-07-23  7:35       ` Christian Borntraeger
  2012-07-22 17:58     ` Rik van Riel
  1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-07-22 12:43 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Marcelo Tosatti, Rik van Riel, Christian Borntraeger,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/22/2012 03:34 PM, Raghavendra K T wrote:
> 
> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
> if this series looks good now.
> 

It looks fine to me.  Christian, is this okay for s390?

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-22 12:34   ` Raghavendra K T
  2012-07-22 12:43     ` Avi Kivity
@ 2012-07-22 17:58     ` Rik van Riel
  1 sibling, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2012-07-22 17:58 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Marcelo Tosatti, Avi Kivity, Christian Borntraeger,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Srikar, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/22/2012 08:34 AM, Raghavendra K T wrote:
> On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:
>> [...]
>
> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
> if this series looks good now.

The series looks good to me.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-22 12:43     ` Avi Kivity
@ 2012-07-23  7:35       ` Christian Borntraeger
  0 siblings, 0 replies; 41+ messages in thread
From: Christian Borntraeger @ 2012-07-23  7:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Marcelo Tosatti, Rik van Riel, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Srikar, S390, Carsten Otte, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 22/07/12 14:43, Avi Kivity wrote:
> On 07/22/2012 03:34 PM, Raghavendra K T wrote:
>>
>> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
>> if this series looks good now.
>>
> 
> It looks fine to me.  Christian, is this okay for s390?
> 
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x



* Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
  2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
                   ` (3 preceding siblings ...)
  2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
@ 2012-07-23 10:03 ` Avi Kivity
  2012-09-07 13:11   ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
  4 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-07-23 10:03 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, S390, Carsten Otte, Christian Borntraeger,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/18/2012 04:37 PM, Raghavendra K T wrote:
> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
> random vcpu on pl-exit. We already have filtering while choosing
> the candidate to yield_to. This change adds more checks while choosing
> a candidate to yield_to.
> 
> On a large vcpu guests, there is a high probability of
> yielding to the same vcpu who had recently done a pause-loop exit. 
> Such a yield can lead to the vcpu spinning again.
> 
> The patchset keeps track of the pause loop exit and gives chance to a
> vcpu which has:
> 
>  (a) Not done pause loop exit at all (probably he is preempted lock-holder)
> 
>  (b) vcpu skipped in last iteration because it did pause loop exit, and
>  probably has become eligible now (next eligible lock holder)
> 
> This concept also helps in cpu relax interception cases which use same handler.
> 

Thanks, applied to 'queue'.


-- 
error compiling committee.c: too many arguments to function




* [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-07-23 10:03 ` Avi Kivity
@ 2012-09-07 13:11   ` Andrew Theurer
  2012-09-07 18:06     ` Raghavendra K T
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Theurer @ 2012-09-07 13:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	Srikar, KVM, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	Srivatsa Vaddagiri

I have noticed recently that PLE/yield_to() is still not that scalable
for really large guests, sometimes even with no CPU over-commit.  I have
a small change that makes a very big difference.

First, let me explain what I saw:

Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
thread Westmere-EX system:  645 seconds!

Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
runqueue lock for yield_to()

So, I added some schedstats to yield_to(), one to count when we failed
this test in yield_to()

    if (task_running(p_rq, p) || p->state)

and one when we pass all the conditions and get to actually yield:

     yielded = curr->sched_class->yield_to_task(rq, p, preempt);
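
(The schedstat changes themselves are not in the patch at the end of this
mail; purely as an illustration, two such counters could be wired up
roughly as below -- the yld_to_* field names are invented here and would
also need to be added to struct rq and the /proc/schedstat output:)

again:
	p_rq = task_rq(p);
	double_rq_lock(rq, p_rq);
	...
	if (task_running(p_rq, p) || p->state) {
		schedstat_inc(rq, yld_to_failed_running);	/* hypothetical counter */
		goto out;
	}
	...
	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
	if (yielded)
		schedstat_inc(rq, yld_to_success);		/* hypothetical counter */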


And during boot up of this guest, I saw:


failed yield_to() because task is running: 8368810426
successful yield_to(): 13077658
                      0.156022% of yield_to calls
                      1 out of 640 yield_to calls

Obviously, we have a problem.  Every exit causes a loop over 80 vcpus,
each one trying to get two locks.  This is happening on all [but one]
vcpus at around the same time.  Not going to work well.

So, since the check for a running task is nearly always true, I moved
that -before- the double runqueue lock, so 99.84% of the attempts do not
take the locks.  Now, I do not know if this [not taking the locks] is a
problem.  However, I'd rather have a slightly inaccurate test for a
running vcpu than burn 98% of CPU in the host kernel.  With the change
the VM boot time went to 100 seconds, an 85% reduction in time.

I also wanted to check to see this did not affect truly over-committed
situations, so I first started with smaller VMs at 2x cpu over-commit:

16 VMs, 8-way each, all running dbench (2x cpu over-commit)
           throughput +/- stddev
               -----     -----
ple off:        2281 +/- 7.32%  (really bad as expected)
ple on:        19796 +/- 1.36%
ple on: w/fix: 19796 +/- 1.37%  (no degrade at all)

In this case the VMs are small enough that we do not loop through
enough vcpus to trigger the problem.  Host CPU is very low (3-4% range)
for both default ple and with the yield_to() fix.

So I went on to a bigger VM:

10 VMs, 16-way each, all running dbench (2x cpu over-commit)
           throughput +/- stddev
               -----     -----
ple on:         2552 +/- .70%
ple on: w/fix:  4621 +/- 2.12%  (81% improvement!)

This is where we start seeing a major difference.  Without the fix, host
cpu was around 70%, mostly in spin_lock.  That was reduced to 60% (and
guest went from 30 to 40%).  I believe this is on the right track to
reduce the spin lock contention, still get proper directed yield, and
therefore improve the guest CPU available and its performance.

However, we still have lock contention, and I think we can reduce it
even more.  We have eliminated some attempts at the double runqueue lock
acquire because the check for whether the target vcpu is running is now
before the lock.  However, even if the target-to-yield-to vcpu [for the
same guest upon which we PLE exited] is not running, the physical
processor/runqueue that the target-to-yield-to vcpu is located on could
be running a different VM's vcpu -and- going through a directed yield,
therefore that run queue lock may already be acquired.  We do not want to
just spin and wait, we want to move to the next candidate vcpu.  We need
a check to see if the smp processor/runqueue is already in a directed
yield.  Or, perhaps we just check if that cpu is not in guest mode, and
if so, we skip that yield attempt for that vcpu and move to the next
candidate vcpu.  So, my question is:  given a runqueue, what's the best
way to check if that corresponding phys cpu is not in guest mode?

Here's the changes so far (schedstat changes not included here):

Signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..f8eff8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	if (task_running(p_rq, p) || p->state) {
+		goto out_no_unlock;
+	}
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
 	if (curr->sched_class != p->sched_class)
 		goto out;
 
-	if (task_running(p_rq, p) || p->state)
-		goto out;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4879,6 +4880,7 @@ again:
 
 out:
 	double_rq_unlock(rq, p_rq);
+out_no_unlock:
 	local_irq_restore(flags);
 
 	if (yielded)

  






* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-07 13:11   ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
@ 2012-09-07 18:06     ` Raghavendra K T
  2012-09-07 19:42       ` Andrew Theurer
  0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-09-07 18:06 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
	KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
	Peter Zijlstra

CCing PeterZ also.

On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> [...]
> failed yield_to() because task is running: 8368810426
> successful yield_to(): 13077658
>                        0.156022% of yield_to calls
>                        1 out of 640 yield_to calls
>
> Obviously, we have a problem.  Every exit causes a loop over 80 vcpus,
> each one trying to get two locks.  This is happening on all [but one]
> vcpus at around the same time.  Not going to work well.
>

True and interesting. I had once thought of reducing the overall O(n^2)
iteration to O(n log(n)) by reducing the number of candidates to search
to O(log(n)) instead of the current O(n). Maybe I have to get back to my
experiment mode.
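
(Purely as an illustration of that idea -- nothing like this is in the
series -- capping the scan in kvm_vcpu_on_spin() at about log2(#vcpus)
candidates could look like:)

	int tries = ilog2(atomic_read(&kvm->online_vcpus) + 1);

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (tries-- <= 0)
			break;
		/* ... existing filtering and kvm_vcpu_yield_to() ... */
	}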

> [...]
> However, we still have lock contention, and I think we can reduce it
> even more.  We have eliminated some attempts at double runqueue lock
> acquire because the check for the target vcpu is running is now before
> the lock.  However, even if the target-to-yield-to vcpu [for the same
> guest upon we PLE exited] is not running, the physical
> processor/runqueue that target-to-yield-to vcpu is located on could be
> running a different VM's vcpu -and- going through a directed yield,
> therefore that run queue lock may already acquired.  We do not want to
> just spin and wait, we want to move to the next candidate vcpu.  We need
> a check to see if the smp processor/runqueue is already in a directed
> yield.  Or, perhaps we just check if that cpu is not in guest mode, and
> if so, we skip that yield attempt for that vcpu and move to the next
> candidate vcpu.  So, my question is:  given a runqueue, what's the best
> way to check if that corresponding phys cpu is not in guest mode?
>

We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin path.  Doesn't that suffice?
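
(For reference, that check sits in kvm_vcpu_yield_to(); roughly, and
paraphrased from the kvm_main.c of this era, so treat it as approximate:)

	task = get_pid_task(target->pid, PIDTYPE_PID);
	...
	if (task->flags & PF_VCPU) {
		/* candidate's task is currently running guest code: skip it */
		put_task_struct(task);
		return false;
	}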

> [...]


* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-07 18:06     ` Raghavendra K T
@ 2012-09-07 19:42       ` Andrew Theurer
  2012-09-08  8:43         ` Srikar Dronamraju
  2012-09-10 14:43         ` Raghavendra K T
  0 siblings, 2 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-07 19:42 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
	KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
	Peter Zijlstra

On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:
> CCing PeterZ also.
> 
> On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> > [...]
> 
> We are indeed avoiding CPUs in guest mode when we check
> task->flags & PF_VCPU in the vcpu_on_spin path.  Doesn't that suffice?

My understanding is that it checks if the candidate vcpu task is in
guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a
target to yield to if it is already in guest mode.  I am concerned about
a different vcpu, possibly from a different VM (let's call it g2vcpuN),
which is also located on the same runqueue as g1vcpuN -and- running.  That
vcpu, g2vcpuN, may also be doing a directed yield, and it may already be
holding the rq lock.  Or it could be in guest mode.  If it is in guest
mode, then let's still target this rq, and try to yield to g1vcpuN.
However, if g2vcpuN is not in guest mode, then don't bother trying.
Patch included below.

Here's the new, v2 result with the previous two:

10 VMs, 16-way each, all running dbench (2x cpu over-commit)
            throughput +/- stddev
                 -----     -----
ple on:           2552 +/- .70%
ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
ple on: w/fixv2:  6115*           (139% improvement)

[*] I do not have stddev yet because not all 10 runs are complete

From v1 to v2, host CPU dropped from 60% to 50%.  Time in spin_lock() is
also dropping:

v1:
>     28.60%     264266  qemu-system-x86  [kernel.kallsyms]        [k] _raw_spin_lock
>      9.21%      85109  qemu-system-x86  [kernel.kallsyms]        [k] __schedule
>      4.98%      46035  qemu-system-x86  [kernel.kallsyms]        [k] _raw_spin_lock_irq
>      4.12%      38066  qemu-system-x86  [kernel.kallsyms]        [k] yield_to
>      4.01%      37114  qemu-system-x86  [kvm]                    [k] kvm_vcpu_on_spin
>      3.52%      32507  qemu-system-x86  [kernel.kallsyms]        [k] get_pid_task
>      3.37%      31141  qemu-system-x86  [kernel.kallsyms]        [k] native_load_tr_desc
>      3.26%      30121  qemu-system-x86  [kvm]                    [k] __vcpu_run
>      3.01%      27844  qemu-system-x86  [kvm_intel]              [k] vmx_vcpu_run
>      2.98%      27592  qemu-system-x86  [kernel.kallsyms]        [k] resched_task
>      2.72%      25108  qemu-system-x86  [kernel.kallsyms]        [k] __srcu_read_lock
>      2.25%      20758  qemu-system-x86  [kernel.kallsyms]        [k] native_write_msr_safe
>      2.01%      18522  qemu-system-x86  [kvm]                    [k] kvm_vcpu_yield_to
>      1.89%      17508  qemu-system-x86  [kvm]                    [k] vcpu_enter_guest

v2: 
>     10.12%      83285  qemu-system-x86  [kernel.kallsyms]        [k] __schedule
>      9.74%      80106  qemu-system-x86  [kernel.kallsyms]        [k] yield_to
>      7.38%      60686  qemu-system-x86  [kernel.kallsyms]        [k] get_pid_task
>      6.28%      51695  qemu-system-x86  [kernel.kallsyms]        [k] _raw_spin_lock
>      5.24%      43128  qemu-system-x86  [kvm]                    [k] kvm_vcpu_on_spin
>      5.04%      41517  qemu-system-x86  [kernel.kallsyms]        [k] __srcu_read_lock
>      4.11%      33823  qemu-system-x86  [kvm_intel]              [k] vmx_vcpu_run
>      3.62%      29827  qemu-system-x86  [kernel.kallsyms]        [k] native_load_tr_desc
>      3.57%      29413  qemu-system-x86  [kvm]                    [k] kvm_vcpu_yield_to
>      3.44%      28297  qemu-system-x86  [kvm]                    [k] __vcpu_run
>      2.93%      24116  qemu-system-x86  [kvm]                    [k] vcpu_enter_guest
>      2.60%      21379  qemu-system-x86  [kernel.kallsyms]        [k] resched_task

So this seems to be working.  However, I wonder just how far we can take
this.  Ideally we need host CPU for PLE work to be in the <3-4% range, as I
observe for the 8-way VMs.  We are still way off.

-Andrew


signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..c767915 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & PF_VCPU)) {
+		goto out_no_unlock;
+	}
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
 	if (curr->sched_class != p->sched_class)
 		goto out;
 
-	if (task_running(p_rq, p) || p->state)
-		goto out;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4879,6 +4880,7 @@ again:
 
 out:
 	double_rq_unlock(rq, p_rq);
+out_no_unlock:
 	local_irq_restore(flags);
 
 	if (yielded)



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-07 19:42       ` Andrew Theurer
@ 2012-09-08  8:43         ` Srikar Dronamraju
  2012-09-10 13:16           ` Andrew Theurer
  2012-09-10 14:43         ` Raghavendra K T
  1 sibling, 1 reply; 41+ messages in thread
From: Srikar Dronamraju @ 2012-09-08  8:43 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	Srivatsa Vaddagiri, Peter Zijlstra

> 
> signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..c767915 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
> preempt)
> 
>  again:
>  	p_rq = task_rq(p);
> +	if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags &
> PF_VCPU)) {
> +		goto out_no_unlock;
> +	}
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
>  	if (curr->sched_class != p->sched_class)
>  		goto out;
> 
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;

Is it possible that, by the time the current thread takes the double rq
lock, thread p could actually be running?  i.e., is there merit in keeping
this check around even with your similar check above?

> 
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
> 
>  out:
>  	double_rq_unlock(rq, p_rq);
> +out_no_unlock:
>  	local_irq_restore(flags);
> 
>  	if (yielded)
> 
> 

-- 
thanks and regards
Srikar


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-08  8:43         ` Srikar Dronamraju
@ 2012-09-10 13:16           ` Andrew Theurer
  2012-09-10 16:03             ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Theurer @ 2012-09-10 13:16 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	Srivatsa Vaddagiri, Peter Zijlstra

On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote:
> > 
> > signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fbf1fd0..c767915 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
> > preempt)
> > 
> >  again:
> >  	p_rq = task_rq(p);
> > +	if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags &
> > PF_VCPU)) {
> > +		goto out_no_unlock;
> > +	}
> >  	double_rq_lock(rq, p_rq);
> >  	while (task_rq(p) != p_rq) {
> >  		double_rq_unlock(rq, p_rq);
> > @@ -4856,8 +4859,6 @@ again:
> >  	if (curr->sched_class != p->sched_class)
> >  		goto out;
> > 
> > -	if (task_running(p_rq, p) || p->state)
> > -		goto out;
> 
> Is it possible that by this time the current thread takes double rq
> lock, thread p could actually be running?  i.e is there merit to keep
> this check around even with your similar check above?

I think that's a good idea.  I'll add that back in.
> 
> > 
> >  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >  	if (yielded) {
> > @@ -4879,6 +4880,7 @@ again:
> > 
> >  out:
> >  	double_rq_unlock(rq, p_rq);
> > +out_no_unlock:
> >  	local_irq_restore(flags);
> > 
> >  	if (yielded)
> > 
> > 
> 

-Andrew Theurer



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-07 19:42       ` Andrew Theurer
  2012-09-08  8:43         ` Srikar Dronamraju
@ 2012-09-10 14:43         ` Raghavendra K T
  1 sibling, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-09-10 14:43 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
	KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
	Peter Zijlstra

On 09/08/2012 01:12 AM, Andrew Theurer wrote:
> On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:
>> CCing PeterZ also.
>>
>> On 09/07/2012 06:41 PM, Andrew Theurer wrote:
>>> I have noticed recently that PLE/yield_to() is still not that scalable
>>> for really large guests, sometimes even with no CPU over-commit.  I have
>>> a small change that make a very big difference.
[...]
>> We are indeed avoiding CPUS in guest mode when we check
>> task->flags&  PF_VCPU in vcpu_on_spin path.  Doesn't that suffice?
> My understanding is that it checks if the candidate vcpu task is in
> guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a
> target to yield to if it is already in guest mode.  I am concerned about
> a different vcpu, possibly from a different VM (let's call it g2vcpuN),
> but it also located on the same runqueue as g1vcpuN -and- running.  That
> vcpu, g2vcpuN, may also be doing a directed yield, and it may already be
> holding the rq lock.  Or it could be in guest mode.  If it is in guest
> mode, then let's still target this rq, and try to yield to g1vcpuN.
> However, if g2vcpuN is not in guest mode, then don't bother trying.

- If a non-vcpu task is currently running, this change can ignore a
request to yield to a target vcpu. The target vcpu could be the most
eligible vcpu, causing other vcpus to do PLE exits.
Is it possible to modify the check to deal only with vcpu tasks?

- Should we use p_rq->cfs_rq->skip instead to let us know that some
yield is active at this time?

-

cpu1                cpu2                      cpu3
a1                  a2                        a3
b1                  b2                        b3
                    c2 (yield target of a1)   c3 (yield target of a2)

If vcpu a1 is doing a directed yield to vcpu c2, and the current vcpu a2 on
the target cpu is also doing a directed yield (to some vcpu c3), then this
change will only allow vcpu a2 to do a schedule() to b2 (if the a2 -> c3
yield is successful). Do we miss yielding to vcpu c2?
a1 might not find a suitable vcpu to yield to and might go back to
spinning. Is my understanding correct?

> Patch include below.
>
> Here's the new, v2 result with the previous two:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
>              throughput +/- stddev
>                   -----     -----
> ple on:           2552 +/- .70%
> ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
> ple on: w/fixv2:  6115*           (139% improvement)
>

The numbers look great.

> [*] I do not have stdev yet because all 10 runs are not complete
>
> for v1 to v2, host CPU dropped from 60% to 50%.  Time in spin_lock() is
> also dropping:
>
[...]
>
> So this seems to be working.  However I wonder just how far we can take
> this.  Ideally we need to be in<3-4% in host for PLE work, like I
> observe for the 8-way VMs.  We are still way off.
>
> -Andrew
>
>
> signed-off-by: Andrew Theurer<habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..c767915 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
> preempt)
>
>   again:
>   	p_rq = task_rq(p);
> +	if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags&
> PF_VCPU)) {

While we are checking the flags of the p_rq->curr task, task p can
migrate to some other runqueue. In that case, will we miss yielding to
the most eligible vcpu?

> +		goto out_no_unlock;
> +	}

Nit:
We don't need the braces above.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 13:16           ` Andrew Theurer
@ 2012-09-10 16:03             ` Peter Zijlstra
  2012-09-10 16:56               ` Srikar Dronamraju
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2012-09-10 16:03 UTC (permalink / raw)
  To: habanero
  Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
> > > @@ -4856,8 +4859,6 @@ again:
> > >     if (curr->sched_class != p->sched_class)
> > >             goto out;
> > > 
> > > -   if (task_running(p_rq, p) || p->state)
> > > -           goto out;
> > 
> > Is it possible that by this time the current thread takes double rq
> > lock, thread p could actually be running?  i.e is there merit to keep
> > this check around even with your similar check above?
> 
> I think that's a good idea.  I'll add that back in. 

Right, it still needs to be there; the test before acquiring p_rq is an
optimistic test to avoid work, but you still have to test it once you
acquire p_rq, since the rest of the code relies on this not being so.

How about something like this instead.. ?

---
 kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c46a011..c9ecab2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4300,6 +4300,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	rq = this_rq();
 
 again:
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p))
+		goto out_irq;
+
 	p_rq = task_rq(p);
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
@@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 			resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 16:03             ` Peter Zijlstra
@ 2012-09-10 16:56               ` Srikar Dronamraju
  2012-09-10 17:12                 ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Srikar Dronamraju @ 2012-09-10 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

* Peter Zijlstra <peterz@infradead.org> [2012-09-10 18:03:55]:

> On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
> > > > @@ -4856,8 +4859,6 @@ again:
> > > >     if (curr->sched_class != p->sched_class)
> > > >             goto out;
> > > > 
> > > > -   if (task_running(p_rq, p) || p->state)
> > > > -           goto out;
> > > 
> > > Is it possible that by this time the current thread takes double rq
> > > lock, thread p could actually be running?  i.e is there merit to keep
> > > this check around even with your similar check above?
> > 
> > I think that's a good idea.  I'll add that back in. 
> 
> Right, it needs to still be there, the test before acquiring p_rq is an
> optimistic test to avoid work, but you have to still test it once you
> acquire p_rq since the rest of the code relies on this not being so.
> 
> How about something like this instead.. ?
> 
> ---
>  kernel/sched/core.c | 35 ++++++++++++++++++++++++++---------
>  1 file changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c46a011..c9ecab2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4300,6 +4300,23 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>  
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> +{
> +	if (!curr->sched_class->yield_to_task)
> +		return false;
> +
> +	if (curr->sched_class != p->sched_class)
> +		return false;


Peter, 

Should we also add a check for whether the runq has a skip buddy (as pointed
out by Raghu) and return if the skip buddy is already set?  Something akin
to:

	if (p_rq->cfs_rq->skip)
		return false;

So if somebody has already acquired the double runqueue lock and has almost
set the next buddy, we don't need to take the runqueue lock, and we also
avoid overwriting the already-set skip buddy.

> +
> +	if (task_running(p_rq, p) || p->state)
> +		return false;
> +
> +	return true;
> +}
> +
>  /**
>   * yield_to - yield the current processor to another thread in
>   * your thread group, or accelerate that thread toward the
> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  	rq = this_rq();
>  
>  again:
> +	/* optimistic test to avoid taking locks */
> +	if (!__yield_to_candidate(curr, p))
> +		goto out_irq;
> +
>  	p_rq = task_rq(p);
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
> @@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  		goto again;
>  	}
>  
> -	if (!curr->sched_class->yield_to_task)
> -		goto out;
> -
> -	if (curr->sched_class != p->sched_class)
> -		goto out;
> -
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +	/* validate state, holding p_rq ensures p's state cannot change */
> +	if (!__yield_to_candidate(curr, p))
> +		goto out_unlock;
>  
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  			resched_task(p_rq->curr);
>  	}
>  
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
>  
>  	if (yielded)
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 16:56               ` Srikar Dronamraju
@ 2012-09-10 17:12                 ` Peter Zijlstra
  2012-09-10 19:10                   ` Raghavendra K T
                                     ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Peter Zijlstra @ 2012-09-10 17:12 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > +{
> > +     if (!curr->sched_class->yield_to_task)
> > +             return false;
> > +
> > +     if (curr->sched_class != p->sched_class)
> > +             return false;
> 
> 
> Peter, 
> 
> Should we also add a check if the runq has a skip buddy (as pointed out
> by Raghu) and return if the skip buddy is already set. 

Oh right, I missed that suggestion... the performance improvement went
from 81% to 139% using this, right?

It might make more sense to keep that separate, outside of this
function, since it's not a strict prerequisite.

> > 
> > +     if (task_running(p_rq, p) || p->state)
> > +             return false;
> > +
> > +     return true;
> > +} 


> > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> bool preempt)
> >       rq = this_rq();
> >  
> >  again:
> > +     /* optimistic test to avoid taking locks */
> > +     if (!__yield_to_candidate(curr, p))
> > +             goto out_irq;
> > +

So add something like:

	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
	if (p_rq->cfs_rq->skip)
		goto out_irq;
> 
> 
> >       p_rq = task_rq(p);
> >       double_rq_lock(rq, p_rq);
> 
> 
But I do have a question on this optimization, though... Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip?

That is, I'd like to see this thing explained a little better.

Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 17:12                 ` Peter Zijlstra
@ 2012-09-10 19:10                   ` Raghavendra K T
  2012-09-10 20:12                   ` Andrew Theurer
  2012-09-11  7:04                   ` Srikar Dronamraju
  2 siblings, 0 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-09-10 19:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, habanero, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On 09/10/2012 10:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
>>> +{
>>> +     if (!curr->sched_class->yield_to_task)
>>> +             return false;
>>> +
>>> +     if (curr->sched_class != p->sched_class)
>>> +             return false;
>>
>>
>> Peter,
>>
>> Should we also add a check if the runq has a skip buddy (as pointed out
>> by Raghu) and return if the skip buddy is already set.
>
> Oh right, I missed that suggestion.. the performance improvement went
> from 81% to 139% using this, right?
>
> It might make more sense to keep that separate, outside of this
> function, since its not a strict prerequisite.
>
>>>
>>> +     if (task_running(p_rq, p) || p->state)
>>> +             return false;
>>> +
>>> +     return true;
>>> +}
>
>
>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
>> bool preempt)
>>>        rq = this_rq();
>>>
>>>   again:
>>> +     /* optimistic test to avoid taking locks */
>>> +     if (!__yield_to_candidate(curr, p))
>>> +             goto out_irq;
>>> +
>
> So add something like:
>
> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> 	if (p_rq->cfs_rq->skip)
> 		goto out_irq;
>>
>>
>>>        p_rq = task_rq(p);
>>>        double_rq_lock(rq, p_rq);
>>
>>
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>
> That is, I'd like to see this thing explained a little better.
>
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?
>

Yes, that is the intention (I mean checking p_rq->cfs_rq->skip). Though we
may be overdoing this check in the scenario where multiple vcpus of the same
VM are pinned to the same CPU.

I am testing the above patch. I hope to be able to get back with the
results tomorrow.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 17:12                 ` Peter Zijlstra
  2012-09-10 19:10                   ` Raghavendra K T
@ 2012-09-10 20:12                   ` Andrew Theurer
  2012-09-10 20:19                     ` Peter Zijlstra
  2012-09-11  6:08                     ` Raghavendra K T
  2012-09-11  7:04                   ` Srikar Dronamraju
  2 siblings, 2 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-10 20:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > +{
> > > +     if (!curr->sched_class->yield_to_task)
> > > +             return false;
> > > +
> > > +     if (curr->sched_class != p->sched_class)
> > > +             return false;
> > 
> > 
> > Peter, 
> > 
> > Should we also add a check if the runq has a skip buddy (as pointed out
> > by Raghu) and return if the skip buddy is already set. 
> 
> Oh right, I missed that suggestion.. the performance improvement went
> from 81% to 139% using this, right?
> 
> It might make more sense to keep that separate, outside of this
> function, since its not a strict prerequisite.
> 
> > > 
> > > +     if (task_running(p_rq, p) || p->state)
> > > +             return false;
> > > +
> > > +     return true;
> > > +} 
> 
> 
> > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > bool preempt)
> > >       rq = this_rq();
> > >  
> > >  again:
> > > +     /* optimistic test to avoid taking locks */
> > > +     if (!__yield_to_candidate(curr, p))
> > > +             goto out_irq;
> > > +
> 
> So add something like:
> 
> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> 	if (p_rq->cfs_rq->skip)
> 		goto out_irq;
> > 
> > 
> > >       p_rq = task_rq(p);
> > >       double_rq_lock(rq, p_rq);
> > 
> > 
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> 
> That is, I'd like to see this thing explained a little better.
> 
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?

Here are two new versions; both include a __yield_to_candidate(): "v3"
uses the check for p_rq->curr being in guest mode, and "v4" uses the cfs_rq
skip check.  Raghu, I am not sure if this is exactly what you want
implemented in v4.

Results:
> ple on:           2552 +/- .70%
> ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
> ple on: w/fixv2:  6115           (139% improvement)
               v3:  5735           (124% improvement)
               v4:  4524           (  3% regression)

Both patches included below

-Andrew

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..0d98a67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/*
+	 * if the target task is not running, then only yield if the
+	 * current task is in guest mode
+	 */
+	if (!(p_rq->curr->flags & PF_VCPU))
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4901,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)



****************
v4:


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2bec2ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if a yield is in progress, skip */
+	if (p_rq->cfs.skip)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4898,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 20:12                   ` Andrew Theurer
@ 2012-09-10 20:19                     ` Peter Zijlstra
  2012-09-10 20:31                       ` Rik van Riel
  2012-09-11  6:08                     ` Raghavendra K T
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2012-09-10 20:19 UTC (permalink / raw)
  To: habanero
  Cc: Srikar Dronamraju, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
> +       /*
> +        * if the target task is not running, then only yield if the
> +        * current task is in guest mode
> +        */
> +       if (!(p_rq->curr->flags & PF_VCPU))
> +               goto out_irq; 

This would make yield_to() only ever work on KVM; not that I mind this
too much, it's a horrid thing and making it less useful for (ab)use is a
good thing, but this probably wants a mention somewhere :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 20:19                     ` Peter Zijlstra
@ 2012-09-10 20:31                       ` Rik van Riel
  0 siblings, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2012-09-10 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: habanero, Srikar Dronamraju, Raghavendra K T, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On 09/10/2012 04:19 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
>> +       /*
>> +        * if the target task is not running, then only yield if the
>> +        * current task is in guest mode
>> +        */
>> +       if (!(p_rq->curr->flags & PF_VCPU))
>> +               goto out_irq;
>
> This would make yield_to() only ever work on KVM, not that I mind this
> too much, its a horrid thing and making it less useful for (ab)use is a
> good thing, still this probably wants mention somewhere :-)

Also, it would not preempt a non-kvm task, even if we need
to do that to boost a VCPU. I think the lines above should
be dropped.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 20:12                   ` Andrew Theurer
  2012-09-10 20:19                     ` Peter Zijlstra
@ 2012-09-11  6:08                     ` Raghavendra K T
  2012-09-11 12:48                       ` Andrew Theurer
  2012-09-11 18:27                       ` Andrew Theurer
  1 sibling, 2 replies; 41+ messages in thread
From: Raghavendra K T @ 2012-09-11  6:08 UTC (permalink / raw)
  To: habanero
  Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
>> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
>>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
>>>> +{
>>>> +     if (!curr->sched_class->yield_to_task)
>>>> +             return false;
>>>> +
>>>> +     if (curr->sched_class != p->sched_class)
>>>> +             return false;
>>>
>>>
>>> Peter,
>>>
>>> Should we also add a check if the runq has a skip buddy (as pointed out
>>> by Raghu) and return if the skip buddy is already set.
>>
>> Oh right, I missed that suggestion.. the performance improvement went
>> from 81% to 139% using this, right?
>>
>> It might make more sense to keep that separate, outside of this
>> function, since its not a strict prerequisite.
>>
>>>>
>>>> +     if (task_running(p_rq, p) || p->state)
>>>> +             return false;
>>>> +
>>>> +     return true;
>>>> +}
>>
>>
>>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
>>> bool preempt)
>>>>        rq = this_rq();
>>>>
>>>>   again:
>>>> +     /* optimistic test to avoid taking locks */
>>>> +     if (!__yield_to_candidate(curr, p))
>>>> +             goto out_irq;
>>>> +
>>
>> So add something like:
>>
>> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
>> 	if (p_rq->cfs_rq->skip)
>> 		goto out_irq;
>>>
>>>
>>>>        p_rq = task_rq(p);
>>>>        double_rq_lock(rq, p_rq);
>>>
>>>
>> But I do have a question on this optimization though,.. Why do we check
>> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
>>
>> That is, I'd like to see this thing explained a little better.
>>
>> Does it go something like: p_rq is the runqueue of the task we'd like to
>> yield to, rq is our own, they might be the same. If we have a ->skip,
>> there's nothing we can do about it, OTOH p_rq having a ->skip and
>> failing the yield_to() simply means us picking the next VCPU thread,
>> which might be running on an entirely different cpu (rq) and could
>> succeed?
>
> Here's two new versions, both include a __yield_to_candidate(): "v3"
> uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> skip check.  Raghu, I am not sure if this is exactly what you want
> implemented in v4.
>

Andrew, yes, that is what I had. I think there was a misunderstanding.
My intention was: if a directed yield has happened in a runqueue
(say rqA), do not bother to directed-yield to that one. But unfortunately, as
PeterZ pointed out, that would have resulted in setting the next buddy of a
different runqueue than rqA.
So we can drop this "skip" idea. Pondering more over what to do... can we
use the next buddy itself? ... thinking...


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-10 17:12                 ` Peter Zijlstra
  2012-09-10 19:10                   ` Raghavendra K T
  2012-09-10 20:12                   ` Andrew Theurer
@ 2012-09-11  7:04                   ` Srikar Dronamraju
  2 siblings, 0 replies; 41+ messages in thread
From: Srikar Dronamraju @ 2012-09-11  7:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: habanero, Raghavendra K T, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

> > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > bool preempt)
> > >       rq = this_rq();
> > >  
> > >  again:
> > > +     /* optimistic test to avoid taking locks */
> > > +     if (!__yield_to_candidate(curr, p))
> > > +             goto out_irq;
> > > +
> 
> So add something like:
> 
> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> 	if (p_rq->cfs_rq->skip)
> 		goto out_irq;
> > 
> > 
> > >       p_rq = task_rq(p);
> > >       double_rq_lock(rq, p_rq);
> > 
> > 
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> 
> That is, I'd like to see this thing explained a little better.
> 
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?
> 

Oh, this made me look back at yield_to() again.  I had misread the
yield_to_task_fair() code. I had wrongly thought that both the ->skip and
->next buddies of p_rq would be set. But it looks like only ->next
is set for p_rq, while ->skip is set for rq.
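
For reference, the relevant buddy handling looks roughly like this (a
simplified sketch only, not the exact fair.c source):

/*
 * Simplified sketch (not the exact fair.c source): yield_to_task_fair()
 * marks the target task as the ->next buddy on the target's cfs_rq,
 * i.e. on p_rq, while yield_task_fair() marks the caller's current task
 * as the ->skip buddy on our own rq.  So p_rq->cfs.next, rather than
 * p_rq->cfs.skip, is what tells us a directed yield is already aimed at
 * that runqueue.
 */
static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
{
	struct sched_entity *se = &p->se;

	if (!se->on_rq)
		return false;

	set_next_buddy(se);	/* ->next buddy lives on p's cfs_rq (p_rq) */
	yield_task_fair(rq);	/* sets the ->skip buddy for curr on our own rq */

	return true;
}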

This should also explain why Andrew saw a regression when checking the
->skip flag instead of PF_VCPU.

Can we check for p_rq->cfs.next and bail out if it is already set?
Something like this:


@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)

 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if next buddy is set, assume yield is in progress */
+	if (p_rq->cfs.next)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}

-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;

 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4898,9 @@ again:
 		rq->skip_clock_update = 0;
 	}

-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);

 	if (yielded)



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-11  6:08                     ` Raghavendra K T
@ 2012-09-11 12:48                       ` Andrew Theurer
  2012-09-11 18:27                       ` Andrew Theurer
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-11 12:48 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> >>>> +{
> >>>> +     if (!curr->sched_class->yield_to_task)
> >>>> +             return false;
> >>>> +
> >>>> +     if (curr->sched_class != p->sched_class)
> >>>> +             return false;
> >>>
> >>>
> >>> Peter,
> >>>
> >>> Should we also add a check if the runq has a skip buddy (as pointed out
> >>> by Raghu) and return if the skip buddy is already set.
> >>
> >> Oh right, I missed that suggestion.. the performance improvement went
> >> from 81% to 139% using this, right?
> >>
> >> It might make more sense to keep that separate, outside of this
> >> function, since its not a strict prerequisite.
> >>
> >>>>
> >>>> +     if (task_running(p_rq, p) || p->state)
> >>>> +             return false;
> >>>> +
> >>>> +     return true;
> >>>> +}
> >>
> >>
> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> >>> bool preempt)
> >>>>        rq = this_rq();
> >>>>
> >>>>   again:
> >>>> +     /* optimistic test to avoid taking locks */
> >>>> +     if (!__yield_to_candidate(curr, p))
> >>>> +             goto out_irq;
> >>>> +
> >>
> >> So add something like:
> >>
> >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> >> 	if (p_rq->cfs_rq->skip)
> >> 		goto out_irq;
> >>>
> >>>
> >>>>        p_rq = task_rq(p);
> >>>>        double_rq_lock(rq, p_rq);
> >>>
> >>>
> >> But I do have a question on this optimization though,.. Why do we check
> >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> >>
> >> That is, I'd like to see this thing explained a little better.
> >>
> >> Does it go something like: p_rq is the runqueue of the task we'd like to
> >> yield to, rq is our own, they might be the same. If we have a ->skip,
> >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> >> failing the yield_to() simply means us picking the next VCPU thread,
> >> which might be running on an entirely different cpu (rq) and could
> >> succeed?
> >
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check.  Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
> >
> 
> Andrew, Yes that is what I had. I think there was a mis-understanding. 
> My intention was to if there is a directed_yield happened in runqueue 
> (say rqA), do not bother to directed yield to that. But unfortunately as 
> PeterZ pointed that would have resulted in setting next buddy of a 
> different run queue than rqA.
> So we can drop this "skip" idea. Pondering more over what to do? can we 
> use next buddy itself ... thinking..

FYI, I regretfully forgot to include your recent changes to
kvm_vcpu_on_spin in my tests (found in the kvm.git next branch), so I am
going to get some results for that before I experiment any more on
3.6-rc.

-Andrew



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-11  6:08                     ` Raghavendra K T
  2012-09-11 12:48                       ` Andrew Theurer
@ 2012-09-11 18:27                       ` Andrew Theurer
  2012-09-13 11:48                         ` Raghavendra K T
  2012-09-13 12:13                         ` Avi Kivity
  1 sibling, 2 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-11 18:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> >>>> +{
> >>>> +     if (!curr->sched_class->yield_to_task)
> >>>> +             return false;
> >>>> +
> >>>> +     if (curr->sched_class != p->sched_class)
> >>>> +             return false;
> >>>
> >>>
> >>> Peter,
> >>>
> >>> Should we also add a check if the runq has a skip buddy (as pointed out
> >>> by Raghu) and return if the skip buddy is already set.
> >>
> >> Oh right, I missed that suggestion.. the performance improvement went
> >> from 81% to 139% using this, right?
> >>
> >> It might make more sense to keep that separate, outside of this
> >> function, since its not a strict prerequisite.
> >>
> >>>>
> >>>> +     if (task_running(p_rq, p) || p->state)
> >>>> +             return false;
> >>>> +
> >>>> +     return true;
> >>>> +}
> >>
> >>
> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> >>> bool preempt)
> >>>>        rq = this_rq();
> >>>>
> >>>>   again:
> >>>> +     /* optimistic test to avoid taking locks */
> >>>> +     if (!__yield_to_candidate(curr, p))
> >>>> +             goto out_irq;
> >>>> +
> >>
> >> So add something like:
> >>
> >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> >> 	if (p_rq->cfs_rq->skip)
> >> 		goto out_irq;
> >>>
> >>>
> >>>>        p_rq = task_rq(p);
> >>>>        double_rq_lock(rq, p_rq);
> >>>
> >>>
> >> But I do have a question on this optimization though,.. Why do we check
> >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> >>
> >> That is, I'd like to see this thing explained a little better.
> >>
> >> Does it go something like: p_rq is the runqueue of the task we'd like to
> >> yield to, rq is our own, they might be the same. If we have a ->skip,
> >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> >> failing the yield_to() simply means us picking the next VCPU thread,
> >> which might be running on an entirely different cpu (rq) and could
> >> succeed?
> >
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check.  Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
> >
> 
> Andrew, Yes that is what I had. I think there was a mis-understanding. 
> My intention was to if there is a directed_yield happened in runqueue 
> (say rqA), do not bother to directed yield to that. But unfortunately as 
> PeterZ pointed that would have resulted in setting next buddy of a 
> different run queue than rqA.
> So we can drop this "skip" idea. Pondering more over what to do? can we 
> use next buddy itself ... thinking..

As I mentioned earlier today, I did not have your changes from kvm.git
tree when I tested my changes.  Here are your changes and my changes
compared:

			  throughput in MB/sec

kvm_vcpu_on_spin changes:  4636 +/- 15.74%
yield_to changes:	   4515 +/- 12.73%

I would be inclined to stick with your changes, which are kept in the kvm
code.  I did try both combined, and did not get good results:

both changes:		   4074 +/- 19.12%

So, having both is probably not a good idea.  However, I feel like
there's more work to be done.  With no over-commit (10 VMs), total
throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
overhead, but a reduction to ~4500 is still terrible.  By contrast,
8-way VMs with 2x over-commit have a total throughput roughly 10% less
than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
host).  We still have what appear to be scalability problems, but now
it's not so much in the runqueue locks for yield_to() as in
get_pid_task():

perf on host:

32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to
 9.17%  91507 qemu-system-x86 [kvm]             [k] kvm_vcpu_on_spin
 7.74%  77257 qemu-system-x86 [kvm]             [k] kvm_vcpu_yield_to
 3.56%  35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
 3.00%  29951 qemu-system-x86 [kvm]             [k] __vcpu_run
 2.93%  29268 qemu-system-x86 [kvm_intel]       [k] vmx_vcpu_run
 2.88%  28783 qemu-system-x86 [kvm]             [k] vcpu_enter_guest
 2.59%  25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule
 1.40%  13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
 1.28%  12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task
 1.14%  11376 qemu-system-x86 [kvm_intel]       [k] vmcs_writel
 0.85%   8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair
 0.53%   5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
 0.46%   4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc

get_pid_task() uses some rcu functions, and I am wondering how scalable this
is....  I tend to think of rcu as -not- having issues like this... is
there an rcu stat/tracing tool which would help identify potential
problems?
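
For reference, the path the profile keeps pointing at is roughly the
following (a paraphrased sketch only, not the exact kvm.git source; details
may differ):

/*
 * Paraphrased sketch of the lookup behind the get_pid_task() and
 * kvm_vcpu_yield_to() samples above (not the exact kvm.git source):
 * the PLE handler resolves the candidate vcpu's task from its struct
 * pid under RCU, takes a reference on the task, skips candidates whose
 * task is already running in guest mode (PF_VCPU), and only then calls
 * yield_to().
 */
static bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
	struct task_struct *task = NULL;
	struct pid *pid;
	bool yielded = false;

	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);	/* takes a task reference */
	rcu_read_unlock();
	if (!task)
		return false;

	if (task->flags & PF_VCPU) {
		/* candidate is already running in guest mode, skip it */
		put_task_struct(task);
		return false;
	}

	yielded = yield_to(task, true);
	put_task_struct(task);

	return yielded;
}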

-Andrew


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-11 18:27                       ` Andrew Theurer
@ 2012-09-13 11:48                         ` Raghavendra K T
  2012-09-13 21:30                           ` Andrew Theurer
  2012-09-13 12:13                         ` Avi Kivity
  1 sibling, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-09-13 11:48 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

* Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:

> On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > >>>> +{
> > >>>> +     if (!curr->sched_class->yield_to_task)
> > >>>> +             return false;
> > >>>> +
> > >>>> +     if (curr->sched_class != p->sched_class)
> > >>>> +             return false;
> > >>>
> > >>>
> > >>> Peter,
> > >>>
> > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > >>> by Raghu) and return if the skip buddy is already set.
> > >>
> > >> Oh right, I missed that suggestion.. the performance improvement went
> > >> from 81% to 139% using this, right?
> > >>
> > >> It might make more sense to keep that separate, outside of this
> > >> function, since its not a strict prerequisite.
> > >>
> > >>>>
> > >>>> +     if (task_running(p_rq, p) || p->state)
> > >>>> +             return false;
> > >>>> +
> > >>>> +     return true;
> > >>>> +}
> > >>
> > >>
> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > >>> bool preempt)
> > >>>>        rq = this_rq();
> > >>>>
> > >>>>   again:
> > >>>> +     /* optimistic test to avoid taking locks */
> > >>>> +     if (!__yield_to_candidate(curr, p))
> > >>>> +             goto out_irq;
> > >>>> +
> > >>
> > >> So add something like:
> > >>
> > >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > >> 	if (p_rq->cfs_rq->skip)
> > >> 		goto out_irq;
> > >>>
> > >>>
> > >>>>        p_rq = task_rq(p);
> > >>>>        double_rq_lock(rq, p_rq);
> > >>>
> > >>>
> > >> But I do have a question on this optimization though,.. Why do we check
> > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > >>
> > >> That is, I'd like to see this thing explained a little better.
> > >>
> > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > >> failing the yield_to() simply means us picking the next VCPU thread,
> > >> which might be running on an entirely different cpu (rq) and could
> > >> succeed?
> > >
> > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > implemented in v4.
> > >
> > 
> > Andrew, Yes that is what I had. I think there was a mis-understanding. 
> > My intention was to if there is a directed_yield happened in runqueue 
> > (say rqA), do not bother to directed yield to that. But unfortunately as 
> > PeterZ pointed that would have resulted in setting next buddy of a 
> > different run queue than rqA.
> > So we can drop this "skip" idea. Pondering more over what to do? can we 
> > use next buddy itself ... thinking..
> 
> As I mentioned earlier today, I did not have your changes from kvm.git
> tree when I tested my changes.  Here are your changes and my changes
> compared:
> 
> 			  throughput in MB/sec
> 
> kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> yield_to changes:	   4515 +/- 12.73%
> 
> I would be inclined to stick with your changes which are kept in kvm
> code.  I did try both combined, and did not get good results:
> 
> both changes:		   4074 +/- 19.12%
> 
> So, having both is probably not a good idea.  However, I feel like
> there's more work to be done.  With no over-commit (10 VMs), total
> throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> overhead, but a reduction to ~4500 is still terrible.  By contrast,
> 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> host).  We still have what appears to be scalability problems, but now
> it's not so much in runqueue locks for yield_to(), but now
> get_pid_task():
>

Hi Andrew,
IMHO, reducing the double runqueue lock overhead is a good idea,
and maybe we will see the benefit when we increase the overcommit further.

The explanation for not seeing a good benefit on top of the PLE handler
optimization patch is that we filter the yield_to candidates,
which results in less contention for the double runqueue lock,
and the extra code overhead during a genuine yield_to might have caused
some degradation in the case you tested.

However, did you also use cfs.next? I hope it helps when we combine them.

Here is a result that shows a positive benefit.
I experimented on a 32-core (no HT) PLE machine with 32-vcpu guest(s).
  
+-----------+-----------+-----------+------------+-----------+
        kernbench time in sec, lower is better 
+-----------+-----------+-----------+------------+-----------+
       base      stddev     patched     stddev      %improve
+-----------+-----------+-----------+------------+-----------+
1x    44.3880     1.8699    40.8180     1.9173	   8.04271
2x    96.7580     4.2787    93.4188     3.5150	   3.45108
+-----------+-----------+-----------+------------+-----------+


+-----------+-----------+-----------+------------+-----------+
        ebizzy records/sec, higher is better
+-----------+-----------+-----------+------------+-----------+
       base      stddev     patched     stddev      %improve
+-----------+-----------+-----------+------------+-----------+
1x  2374.1250    50.9718   3816.2500    54.0681	  60.74343
2x  2536.2500    93.0403   2789.3750   204.7897	   9.98029
+-----------+-----------+-----------+------------+-----------+


Below is the patch which combines PeterZ's suggestions on your
original approach with the cfs.next check (already posted by Srikar in the
other thread).

----8<----
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..8551f57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,24 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
+					 struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if next buddy is set, assume yield is in progress */
+	if (p_rq->cfs.next)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4899,9 @@ again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-11 18:27                       ` Andrew Theurer
  2012-09-13 11:48                         ` Raghavendra K T
@ 2012-09-13 12:13                         ` Avi Kivity
  1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-09-13 12:13 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On 09/11/2012 09:27 PM, Andrew Theurer wrote:
> 
> So, having both is probably not a good idea.  However, I feel like
> there's more work to be done.  With no over-commit (10 VMs), total
> throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> overhead, but a reduction to ~4500 is still terrible.  By contrast,
> 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> host).  We still have what appears to be scalability problems, but now
> it's not so much in runqueue locks for yield_to(), but now
> get_pid_task():
> 
> perf on host:
> 
> 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
> 11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
> 10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to
>  9.17%  91507 qemu-system-x86 [kvm]             [k] kvm_vcpu_on_spin
>  7.74%  77257 qemu-system-x86 [kvm]             [k] kvm_vcpu_yield_to
>  3.56%  35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
>  3.00%  29951 qemu-system-x86 [kvm]             [k] __vcpu_run
>  2.93%  29268 qemu-system-x86 [kvm_intel]       [k] vmx_vcpu_run
>  2.88%  28783 qemu-system-x86 [kvm]             [k] vcpu_enter_guest
>  2.59%  25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule
>  1.40%  13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
>  1.28%  12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task
>  1.14%  11376 qemu-system-x86 [kvm_intel]       [k] vmcs_writel
>  0.85%   8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair
>  0.53%   5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
>  0.46%   4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
> 
> get_pid_task() uses some rcu fucntions, wondering how scalable this
> is....  I tend to think of rcu as -not- having issues like this... is
> there a rcu stat/tracing tool which would help identify potential
> problems?

It's not, it's the atomics + cache line bouncing.  We're basically
guaranteed to bounce here.

Here we're finally paying for the ioctl() based interface.  A syscall
based interface would have a 1:1 correspondence between vcpus and tasks,
so these games would be unnecessary.
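
For reference, the path that profile is hitting is roughly the following
(paraphrased, not the exact tree under test, so treat it as a sketch only):

int kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
	struct pid *pid;
	struct task_struct *task = NULL;
	int ret = 0;

	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);	/* atomic inc of the task ref */
	rcu_read_unlock();
	if (!task)
		return ret;

	ret = yield_to(task, 1);
	put_task_struct(task);				/* atomic dec of the same hot line */

	return ret;
}

Every spinning vcpu walks this for each candidate it scans in
kvm_vcpu_on_spin(), so the pid lookup and the task refcount are hammered
from many cpus at once.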

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-13 11:48                         ` Raghavendra K T
@ 2012-09-13 21:30                           ` Andrew Theurer
  2012-09-14 17:10                             ` Andrew Jones
                                               ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Andrew Theurer @ 2012-09-13 21:30 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Srikar Dronamraju, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, KVM, chegu vinod, LKML, X86,
	Gleb Natapov, Srivatsa Vaddagiri

On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
> 
> > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > >>>> +{
> > > >>>> +     if (!curr->sched_class->yield_to_task)
> > > >>>> +             return false;
> > > >>>> +
> > > >>>> +     if (curr->sched_class != p->sched_class)
> > > >>>> +             return false;
> > > >>>
> > > >>>
> > > >>> Peter,
> > > >>>
> > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > >>> by Raghu) and return if the skip buddy is already set.
> > > >>
> > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > >> from 81% to 139% using this, right?
> > > >>
> > > >> It might make more sense to keep that separate, outside of this
> > > >> function, since its not a strict prerequisite.
> > > >>
> > > >>>>
> > > >>>> +     if (task_running(p_rq, p) || p->state)
> > > >>>> +             return false;
> > > >>>> +
> > > >>>> +     return true;
> > > >>>> +}
> > > >>
> > > >>
> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > >>> bool preempt)
> > > >>>>        rq = this_rq();
> > > >>>>
> > > >>>>   again:
> > > >>>> +     /* optimistic test to avoid taking locks */
> > > >>>> +     if (!__yield_to_candidate(curr, p))
> > > >>>> +             goto out_irq;
> > > >>>> +
> > > >>
> > > >> So add something like:
> > > >>
> > > >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > >> 	if (p_rq->cfs_rq->skip)
> > > >> 		goto out_irq;
> > > >>>
> > > >>>
> > > >>>>        p_rq = task_rq(p);
> > > >>>>        double_rq_lock(rq, p_rq);
> > > >>>
> > > >>>
> > > >> But I do have a question on this optimization though,.. Why do we check
> > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > >>
> > > >> That is, I'd like to see this thing explained a little better.
> > > >>
> > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > >> which might be running on an entirely different cpu (rq) and could
> > > >> succeed?
> > > >
> > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > > implemented in v4.
> > > >
> > > 
> > > Andrew, Yes that is what I had. I think there was a mis-understanding. 
> > > My intention was to if there is a directed_yield happened in runqueue 
> > > (say rqA), do not bother to directed yield to that. But unfortunately as 
> > > PeterZ pointed that would have resulted in setting next buddy of a 
> > > different run queue than rqA.
> > > So we can drop this "skip" idea. Pondering more over what to do? can we 
> > > use next buddy itself ... thinking..
> > 
> > As I mentioned earlier today, I did not have your changes from kvm.git
> > tree when I tested my changes.  Here are your changes and my changes
> > compared:
> > 
> > 			  throughput in MB/sec
> > 
> > kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> > yield_to changes:	   4515 +/- 12.73%
> > 
> > I would be inclined to stick with your changes which are kept in kvm
> > code.  I did try both combined, and did not get good results:
> > 
> > both changes:		   4074 +/- 19.12%
> > 
> > So, having both is probably not a good idea.  However, I feel like
> > there's more work to be done.  With no over-commit (10 VMs), total
> > throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> > overhead, but a reduction to ~4500 is still terrible.  By contrast,
> > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > host).  We still have what appears to be scalability problems, but now
> > it's not so much in runqueue locks for yield_to(), but now
> > get_pid_task():
> >
> 
> Hi Andrew,
> IMHO, reducing the double runqueue lock overhead is a good idea,
> and may be  we see the benefits when we increase the overcommit further.
> 
> The explaination for not seeing good benefit on top of PLE handler
> optimization patch is because we filter the yield_to candidates,
> and hence resulting in less contention for double runqueue lock.
> and extra code overhead during genuine yield_to might have resulted in
> some degradation in the case you tested.
> 
> However, did you use cfs.next also?. I hope it helps, when we combine.
> 
> Here is the result that is showing positive benefit.
> I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
>   
> +-----------+-----------+-----------+------------+-----------+
>         kernbench time in sec, lower is better 
> +-----------+-----------+-----------+------------+-----------+
>        base      stddev     patched     stddev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x    44.3880     1.8699    40.8180     1.9173	   8.04271
> 2x    96.7580     4.2787    93.4188     3.5150	   3.45108
> +-----------+-----------+-----------+------------+-----------+
> 
> 
> +-----------+-----------+-----------+------------+-----------+
>         ebizzy record/sec higher is better
> +-----------+-----------+-----------+------------+-----------+
>        base      stddev     patched     stddev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2374.1250    50.9718   3816.2500    54.0681	  60.74343
> 2x  2536.2500    93.0403   2789.3750   204.7897	   9.98029
> +-----------+-----------+-----------+------------+-----------+
> 
> 
> Below is the patch which combine suggestions of peterZ on your
> original approach with cfs.next (already posted by Srikar in the other
> thread)

I did get a chance to run with the below patch and your changes in
kvm.git, but the results were not too different:

Dbench, 10 x 16-way VMs on 80-way host:

kvm_vcpu_on_spin changes:  4636 +/- 15.74%
yield_to changes:	   4515 +/- 12.73%
both changes from above:   4074 +/- 19.12%
...plus cfs.next check:    4418 +/- 16.97%

Still hovering around 4500 MB/sec

The concern I have is that even though we have gone through changes to
help reduce the candidate vcpus we yield to, we still have a very poor
idea of which vcpu really needs to run.  The result is high cpu usage in
the get_pid_task and still some contention in the double runqueue lock.
To make this scalable, we either need to significantly reduce the
occurrence of the lock-holder preemption, or do a much better job of
knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
which do not need to run).

On reducing the occurrence:  The worst case for lock-holder preemption
is having vcpus of same VM on the same runqueue.  This guarantees the
situation of 1 vcpu running while another [of the same VM] is not.  To
prove the point, I ran the same test, but with vcpus restricted to a
range of host cpus, such that any single VM's vcpus can never be on the
same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
vcpu-1's are on host cpus 5-9, and so on.  Here is the result:

kvm_cpu_spin, and all
yield_to changes, plus
restricted vcpu placement:  8823 +/- 3.20%   much, much better

On picking a better vcpu to yield to:  I really hesitate to rely on a
paravirt hint [telling us which vcpu is holding a lock], but I am not
sure how else to reduce the candidate vcpus to yield to.  I suspect we
are yielding to way more vcpus than there are preempted lock-holders, and
that IMO is just work accomplishing nothing.  Trying to think of ways to
further reduce the candidate vcpus....


-Andrew


> 
> ----8<----
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..8551f57 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4820,6 +4820,24 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
> 
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> +					 struct rq *p_rq)
> +{
> +	if (!curr->sched_class->yield_to_task)
> +		return false;
> +
> +	if (curr->sched_class != p->sched_class)
> +		return false;
> +
> +	if (task_running(p_rq, p) || p->state)
> +		return false;
> +
> +	return true;
> +}
> +
>  /**
>   * yield_to - yield the current processor to another thread in
>   * your thread group, or accelerate that thread toward the
> @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> 
>  again:
>  	p_rq = task_rq(p);
> +
> +	/* optimistic test to avoid taking locks */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_irq;
> +
> +	/* if next buddy is set, assume yield is in progress */
> +	if (p_rq->cfs.next)
> +		goto out_irq;
> +
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
>  		goto again;
>  	}
> 
> -	if (!curr->sched_class->yield_to_task)
> -		goto out;
> -
> -	if (curr->sched_class != p->sched_class)
> -		goto out;
> -
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +	/* validate state, holding p_rq ensures p's state cannot change */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_unlock;
> 
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4877,8 +4899,9 @@ again:
>  		rq->skip_clock_update = 0;
>  	}
> 
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
> 
>  	if (yielded)



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-13 21:30                           ` Andrew Theurer
@ 2012-09-14 17:10                             ` Andrew Jones
  2012-09-15 16:08                               ` Raghavendra K T
  2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
  2012-09-16  8:55                             ` Avi Kivity
  2 siblings, 1 reply; 41+ messages in thread
From: Andrew Jones @ 2012-09-14 17:10 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
> On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> > * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
> > 
> > > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > > >>>> +{
> > > > >>>> +     if (!curr->sched_class->yield_to_task)
> > > > >>>> +             return false;
> > > > >>>> +
> > > > >>>> +     if (curr->sched_class != p->sched_class)
> > > > >>>> +             return false;
> > > > >>>
> > > > >>>
> > > > >>> Peter,
> > > > >>>
> > > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > > >>> by Raghu) and return if the skip buddy is already set.
> > > > >>
> > > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > > >> from 81% to 139% using this, right?
> > > > >>
> > > > >> It might make more sense to keep that separate, outside of this
> > > > >> function, since its not a strict prerequisite.
> > > > >>
> > > > >>>>
> > > > >>>> +     if (task_running(p_rq, p) || p->state)
> > > > >>>> +             return false;
> > > > >>>> +
> > > > >>>> +     return true;
> > > > >>>> +}
> > > > >>
> > > > >>
> > > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > > >>> bool preempt)
> > > > >>>>        rq = this_rq();
> > > > >>>>
> > > > >>>>   again:
> > > > >>>> +     /* optimistic test to avoid taking locks */
> > > > >>>> +     if (!__yield_to_candidate(curr, p))
> > > > >>>> +             goto out_irq;
> > > > >>>> +
> > > > >>
> > > > >> So add something like:
> > > > >>
> > > > >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > > >> 	if (p_rq->cfs_rq->skip)
> > > > >> 		goto out_irq;
> > > > >>>
> > > > >>>
> > > > >>>>        p_rq = task_rq(p);
> > > > >>>>        double_rq_lock(rq, p_rq);
> > > > >>>
> > > > >>>
> > > > >> But I do have a question on this optimization though,.. Why do we check
> > > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > > >>
> > > > >> That is, I'd like to see this thing explained a little better.
> > > > >>
> > > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > > >> which might be running on an entirely different cpu (rq) and could
> > > > >> succeed?
> > > > >
> > > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > > > implemented in v4.
> > > > >
> > > > 
> > > > Andrew, Yes that is what I had. I think there was a mis-understanding. 
> > > > My intention was to if there is a directed_yield happened in runqueue 
> > > > (say rqA), do not bother to directed yield to that. But unfortunately as 
> > > > PeterZ pointed that would have resulted in setting next buddy of a 
> > > > different run queue than rqA.
> > > > So we can drop this "skip" idea. Pondering more over what to do? can we 
> > > > use next buddy itself ... thinking..
> > > 
> > > As I mentioned earlier today, I did not have your changes from kvm.git
> > > tree when I tested my changes.  Here are your changes and my changes
> > > compared:
> > > 
> > > 			  throughput in MB/sec
> > > 
> > > kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> > > yield_to changes:	   4515 +/- 12.73%
> > > 
> > > I would be inclined to stick with your changes which are kept in kvm
> > > code.  I did try both combined, and did not get good results:
> > > 
> > > both changes:		   4074 +/- 19.12%
> > > 
> > > So, having both is probably not a good idea.  However, I feel like
> > > there's more work to be done.  With no over-commit (10 VMs), total
> > > throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> > > overhead, but a reduction to ~4500 is still terrible.  By contrast,
> > > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > > host).  We still have what appears to be scalability problems, but now
> > > it's not so much in runqueue locks for yield_to(), but now
> > > get_pid_task():
> > >
> > 
> > Hi Andrew,
> > IMHO, reducing the double runqueue lock overhead is a good idea,
> > and may be  we see the benefits when we increase the overcommit further.
> > 
> > The explaination for not seeing good benefit on top of PLE handler
> > optimization patch is because we filter the yield_to candidates,
> > and hence resulting in less contention for double runqueue lock.
> > and extra code overhead during genuine yield_to might have resulted in
> > some degradation in the case you tested.
> > 
> > However, did you use cfs.next also?. I hope it helps, when we combine.
> > 
> > Here is the result that is showing positive benefit.
> > I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
> >   
> > +-----------+-----------+-----------+------------+-----------+
> >         kernbench time in sec, lower is better 
> > +-----------+-----------+-----------+------------+-----------+
> >        base      stddev     patched     stddev      %improve
> > +-----------+-----------+-----------+------------+-----------+
> > 1x    44.3880     1.8699    40.8180     1.9173	   8.04271
> > 2x    96.7580     4.2787    93.4188     3.5150	   3.45108
> > +-----------+-----------+-----------+------------+-----------+
> > 
> > 
> > +-----------+-----------+-----------+------------+-----------+
> >         ebizzy record/sec higher is better
> > +-----------+-----------+-----------+------------+-----------+
> >        base      stddev     patched     stddev      %improve
> > +-----------+-----------+-----------+------------+-----------+
> > 1x  2374.1250    50.9718   3816.2500    54.0681	  60.74343
> > 2x  2536.2500    93.0403   2789.3750   204.7897	   9.98029
> > +-----------+-----------+-----------+------------+-----------+
> > 
> > 
> > Below is the patch which combine suggestions of peterZ on your
> > original approach with cfs.next (already posted by Srikar in the other
> > thread)
> 
> I did get a chance to run with the below patch and your changes in
> kvm.git, but the results were not too different:
> 
> Dbench, 10 x 16-way VMs on 80-way host:
> 
> kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> yield_to changes:	   4515 +/- 12.73%
> both changes from above:   4074 +/- 19.12%
> ...plus cfs.next check:    4418 +/- 16.97%
> 
> Still hovering around 4500 MB/sec
> 
> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run.  The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).
> 
> On reducing the occurrence:  The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue.  This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not.  To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> 
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement:  8823 +/- 3.20%   much, much better
> 
> On picking a better vcpu to yield to:  I really hesitate to rely on
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to.  I suspect we
> are yielding to way more vcpus than are prempted lock-holders, and that
> IMO is just work accomplishing nothing.  Trying to think of way to
> further reduce candidate vcpus....
>

wrt yielding to vcpus on the same cpu, I recently noticed that
there's a bug in yield_to_task_fair(). yield_task_fair() calls
clear_buddies(), so if we're yielding to a task that has been running on
the same cpu that we're currently running on, and thus is also on the
current cfs runqueue, then our 'who to pick next' hint gets cleared
right after we set it.

I had hoped that the patch below would show a general improvement in
vcpu overcommit performance, however the results were variable - no worse,
no better. Based on your results above showing good improvement from
interleaving vcpus across the cpus, there must have been a decent
percentage of these types of yields going on. So since the patch didn't
change much, that indicates that the next hinting isn't generally taken
too seriously by the scheduler.  Anyway, the patch should correct the
code per its design, and testing shows that it didn't make anything worse,
so I'll post it soon. Also, in order to try and improve how far set-next
can jump ahead in the queue, I tested a kernel with group scheduling
compiled out (libvirt uses cgroups, and I'm not sure whether autogroups
affect things). I did get a slight improvement with that, but nothing to
write home to mom about.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c219bf8..7d8a21d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
+	/* We're yielding, so tell the scheduler we don't want to be picked */
+	yield_task_fair(rq);
+
 	/* Tell the scheduler that we'd really like pse to run next. */
 	set_next_buddy(se);
 
-	yield_task_fair(rq);
-
 	return true;
 }

> 
> > 
> > ----8<----
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fbf1fd0..8551f57 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4820,6 +4820,24 @@ void __sched yield(void)
> >  }
> >  EXPORT_SYMBOL(yield);
> > 
> > +/*
> > + * Tests preconditions required for sched_class::yield_to().
> > + */
> > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> > +					 struct rq *p_rq)
> > +{
> > +	if (!curr->sched_class->yield_to_task)
> > +		return false;
> > +
> > +	if (curr->sched_class != p->sched_class)
> > +		return false;
> > +
> > +	if (task_running(p_rq, p) || p->state)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /**
> >   * yield_to - yield the current processor to another thread in
> >   * your thread group, or accelerate that thread toward the
> > @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > 
> >  again:
> >  	p_rq = task_rq(p);
> > +
> > +	/* optimistic test to avoid taking locks */
> > +	if (!__yield_to_candidate(curr, p, p_rq))
> > +		goto out_irq;
> > +
> > +	/* if next buddy is set, assume yield is in progress */
> > +	if (p_rq->cfs.next)
> > +		goto out_irq;
> > +
> >  	double_rq_lock(rq, p_rq);
> >  	while (task_rq(p) != p_rq) {
> >  		double_rq_unlock(rq, p_rq);
> >  		goto again;
> >  	}
> > 
> > -	if (!curr->sched_class->yield_to_task)
> > -		goto out;
> > -
> > -	if (curr->sched_class != p->sched_class)
> > -		goto out;
> > -
> > -	if (task_running(p_rq, p) || p->state)
> > -		goto out;
> > +	/* validate state, holding p_rq ensures p's state cannot change */
> > +	if (!__yield_to_candidate(curr, p, p_rq))
> > +		goto out_unlock;
> > 
> >  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >  	if (yielded) {
> > @@ -4877,8 +4899,9 @@ again:
> >  		rq->skip_clock_update = 0;
> >  	}
> > 
> > -out:
> > +out_unlock:
> >  	double_rq_unlock(rq, p_rq);
> > +out_irq:
> >  	local_irq_restore(flags);
> > 
> >  	if (yielded)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-13 21:30                           ` Andrew Theurer
  2012-09-14 17:10                             ` Andrew Jones
@ 2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
  2012-09-17  8:02                               ` Andrew Jones
  2012-09-16  8:55                             ` Avi Kivity
  2 siblings, 1 reply; 41+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-09-14 20:34 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run.  The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).

The patches that Raghavendra  has been posting do accomplish that.
>
> On reducing the occurrence:  The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue.  This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not.  To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
>
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement:  8823 +/- 3.20%   much, much better
>
> On picking a better vcpu to yield to:  I really hesitate to rely on
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to.  I suspect we
> are yielding to way more vcpus than are prempted lock-holders, and that
> IMO is just work accomplishing nothing.  Trying to think of way to
> further reduce candidate vcpus....

... the patches are posted -  you could try them out?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-14 17:10                             ` Andrew Jones
@ 2012-09-15 16:08                               ` Raghavendra K T
  2012-09-17 13:48                                 ` Andrew Jones
  0 siblings, 1 reply; 41+ messages in thread
From: Raghavendra K T @ 2012-09-15 16:08 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Andrew Theurer, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On 09/14/2012 10:40 PM, Andrew Jones wrote:
> On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
>> On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
>>> * Andrew Theurer<habanero@linux.vnet.ibm.com>  [2012-09-11 13:27:41]:
>>>
[...]
>>
>> On picking a better vcpu to yield to:  I really hesitate to rely on
>> paravirt hint [telling us which vcpu is holding a lock], but I am not
>> sure how else to reduce the candidate vcpus to yield to.  I suspect we
>> are yielding to way more vcpus than are prempted lock-holders, and that
>> IMO is just work accomplishing nothing.  Trying to think of way to
>> further reduce candidate vcpus....
>>
>
> wrt to yielding to vcpus for the same cpu, I recently noticed that
> there's a bug in yield_to_task_fair. yield_task_fair() calls
> clear_buddies(), so if we're yielding to a task that has been running on
> the same cpu that we're currently running on, and thus is also on the
> current cfs runqueue, then our 'who to pick next' hint is getting cleared
> right after we set it.
>
> I had hoped that the patch below would show a general improvement in the
> vpu overcommit performance, however the results were variable - no worse,
> no better. Based on your results above showing good improvement from
> interleaving vcpus across the cpus, then that means there was a decent
> percent of these types of yields going on. So since the patch didn't
> change much that indicates that the next hinting isn't generally taken
> too seriously by the scheduler.  Anyway, the patch should correct the
> code per its design, and testing shows that it didn't make anything worse,
> so I'll post it soon. Also, in order to try and improve how far set-next
> can jump ahead in the queue, I tested a kernel with group scheduling
> compiled out (libvirt uses cgroups and I'm not sure autogroups may affect
> things). I did get slight improvement with that, but nothing to write home
> to mom about.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c219bf8..7d8a21d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
>   	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
>   		return false;
>
> +	/* We're yielding, so tell the scheduler we don't want to be picked */
> +	yield_task_fair(rq);
> +
>   	/* Tell the scheduler that we'd really like pse to run next. */
>   	set_next_buddy(se);
>
> -	yield_task_fair(rq);
> -
>   	return true;
>   }
>

Hi Drew, I agree with your fix and tested the patch too... the results are
pretty much the same.  Puzzled as to why.

Thinking... maybe we hit this when #vcpu (of a VM) > #pcpu?
(pigeonhole principle ;)).


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-13 21:30                           ` Andrew Theurer
  2012-09-14 17:10                             ` Andrew Jones
  2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
@ 2012-09-16  8:55                             ` Avi Kivity
  2012-09-17  8:10                               ` Andrew Jones
  2012-09-18  3:03                               ` Andrew Theurer
  2 siblings, 2 replies; 41+ messages in thread
From: Avi Kivity @ 2012-09-16  8:55 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On 09/14/2012 12:30 AM, Andrew Theurer wrote:

> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run.  The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).
> 
> On reducing the occurrence:  The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue.  This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not.  To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> 
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement:  8823 +/- 3.20%   much, much better
> 
> On picking a better vcpu to yield to:  I really hesitate to rely on
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to.  I suspect we
> are yielding to way more vcpus than are prempted lock-holders, and that
> IMO is just work accomplishing nothing.  Trying to think of way to
> further reduce candidate vcpus....

I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
That other vcpu gets work done (unless it is in pause loop itself) and
the yielding vcpu gets put to sleep for a while, so it doesn't spend
cycles spinning.  While we haven't fixed the problem at least the guest
is accomplishing work, and meanwhile the real lock holder may get
naturally scheduled and clear the lock.

The main problem with this theory is that the experiments don't seem to
bear it out.  So maybe one of the assumptions is wrong - the yielding
vcpu gets scheduled early.  That could be the case if the two vcpus are
on different runqueues - you could be changing the relative priority of
vcpus on the target runqueue, but still remain on top yourself.  Is this
possible with the current code?

Maybe we should prefer vcpus on the same runqueue as yield_to targets,
and only fall back to remote vcpus when we see it didn't help.

Let's examine a few cases:

1. spinner on cpu 0, lock holder on cpu 0

win!

2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0

Spinner gets put to sleep, random vcpus get to work, low lock contention
(no double_rq_lock), by the time spinner gets scheduled we might have won

3. spinner on cpu 0, another spinner on cpu 0

Worst case, we'll just spin some more.  Need to detect this case and
migrate something in.

4. spinner on cpu 0, alone

Similar


It seems we need to tie in to the load balancer.

Would changing the priority of the task while it is spinning help the
load balancer?
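
If we wanted to try the "prefer same-runqueue targets" idea, a minimal
sketch could look like the below.  Illustrative only - this is not a posted
patch, and vcpu_is_good_yield_candidate() is a made-up stand-in for the
eligibility checks kvm_vcpu_on_spin() already does:

/* Does this vcpu's backing task currently sit on our own runqueue? */
static bool vcpu_task_on_this_cpu(struct kvm_vcpu *vcpu)
{
	struct task_struct *task;
	bool local = false;

	rcu_read_lock();
	task = pid_task(rcu_dereference(vcpu->pid), PIDTYPE_PID);
	if (task)
		local = (task_cpu(task) == raw_smp_processor_id());
	rcu_read_unlock();

	return local;
}

static void vcpu_on_spin_local_first(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu;
	int pass, i;

	/* pass 0: same-runqueue targets only; pass 1: any eligible vcpu */
	for (pass = 0; pass < 2; pass++) {
		kvm_for_each_vcpu(i, vcpu, me->kvm) {
			if (vcpu == me || !vcpu_is_good_yield_candidate(vcpu))
				continue;
			if (pass == 0 && !vcpu_task_on_this_cpu(vcpu))
				continue;
			if (kvm_vcpu_yield_to(vcpu) > 0)
				return;
		}
	}
}

The second pass is the "fall back to remote vcpus when it didn't help"
part; whether the extra pid lookups in pass 0 pay for themselves is
exactly the open question.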

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
@ 2012-09-17  8:02                               ` Andrew Jones
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Jones @ 2012-09-17  8:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: habanero, Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
	Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM,
	chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
	rkrcmar

On Fri, Sep 14, 2012 at 04:34:24PM -0400, Konrad Rzeszutek Wilk wrote:
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> 
> The patches that Raghavendra  has been posting do accomplish that.
> >
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> >
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> >
> > On picking a better vcpu to yield to:  I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to way more vcpus than are prempted lock-holders, and that
> > IMO is just work accomplishing nothing.  Trying to think of way to
> > further reduce candidate vcpus....
> 
> ... the patches are posted -  you could try them out?

Radim and I have done some testing with the pvticketlock series. While we
saw a gain over PLE alone, it wasn't huge, and without PLE also enabled it
could hardly support 2.0x overcommit. spinlocks aren't the only place
where cpu_relax() is called within a relatively tight loop, so it's likely
that PLE yielding just generally helps by getting schedule() called more
frequently.
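
(For instance, generic code such as csd_lock_wait() in kernel/smp.c boils
down to roughly

	while (data->flags & CSD_FLAG_LOCK)
		cpu_relax();

so a vcpu stuck waiting on the far end of an IPI triggers PLE exits much
like a vcpu spinning on a preempted lock holder does.)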

Drew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-16  8:55                             ` Avi Kivity
@ 2012-09-17  8:10                               ` Andrew Jones
  2012-09-18  3:03                               ` Andrew Theurer
  1 sibling, 0 replies; 41+ messages in thread
From: Andrew Jones @ 2012-09-17  8:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: habanero, Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On Sun, Sep 16, 2012 at 11:55:28AM +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> 
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> > 
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> > 
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> > 
> > On picking a better vcpu to yield to:  I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to way more vcpus than are prempted lock-holders, and that
> > IMO is just work accomplishing nothing.  Trying to think of way to
> > further reduce candidate vcpus....
> 
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning.  While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.
> 
> The main problem with this theory is that the experiments don't seem to
> bear it out.  So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early.  That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself.  Is this
> possible with the current code?
> 
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.

I thought about this a bit recently too, but didn't pursue it, because I
figured it would actually increase the get_pid_task and double_rq_lock
contention time if we have to hunt too long for a vcpu that matches a
stricter criterion. But, I guess if we can implement a special "reschedule"
to run on the current cpu which prioritizes runnable/non-running vcpus,
then it should be just as fast or faster to look through the runqueue
first than it is to look through all the vcpus first.

Drew

> 
> Let's examine a few cases:
> 
> 1. spinner on cpu 0, lock holder on cpu 0
> 
> win!
> 
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
> 
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
> 
> 3. spinner on cpu 0, another spinner on cpu 0
> 
> Worst case, we'll just spin some more.  Need to detect this case and
> migrate something in.
> 
> 4. spinner on cpu 0, alone
> 
> Similar
> 
> 
> It seems we need to tie in to the load balancer.
> 
> Would changing the priority of the task while it is spinning help the
> load balancer?
> 
> -- 
> error compiling committee.c: too many arguments to function
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-15 16:08                               ` Raghavendra K T
@ 2012-09-17 13:48                                 ` Andrew Jones
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Jones @ 2012-09-17 13:48 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Theurer, Peter Zijlstra, Srikar Dronamraju, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On Sat, Sep 15, 2012 at 09:38:54PM +0530, Raghavendra K T wrote:
> On 09/14/2012 10:40 PM, Andrew Jones wrote:
> >On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
> >>On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> >>>* Andrew Theurer<habanero@linux.vnet.ibm.com>  [2012-09-11 13:27:41]:
> >>>
> [...]
> >>
> >>On picking a better vcpu to yield to:  I really hesitate to rely on
> >>paravirt hint [telling us which vcpu is holding a lock], but I am not
> >>sure how else to reduce the candidate vcpus to yield to.  I suspect we
> >>are yielding to way more vcpus than are prempted lock-holders, and that
> >>IMO is just work accomplishing nothing.  Trying to think of way to
> >>further reduce candidate vcpus....
> >>
> >
> >wrt to yielding to vcpus for the same cpu, I recently noticed that
> >there's a bug in yield_to_task_fair. yield_task_fair() calls
> >clear_buddies(), so if we're yielding to a task that has been running on
> >the same cpu that we're currently running on, and thus is also on the
> >current cfs runqueue, then our 'who to pick next' hint is getting cleared
> >right after we set it.
> >
> >I had hoped that the patch below would show a general improvement in the
> >vpu overcommit performance, however the results were variable - no worse,
> >no better. Based on your results above showing good improvement from
> >interleaving vcpus across the cpus, then that means there was a decent
> >percent of these types of yields going on. So since the patch didn't
> >change much that indicates that the next hinting isn't generally taken
> >too seriously by the scheduler.  Anyway, the patch should correct the
> >code per its design, and testing shows that it didn't make anything worse,
> >so I'll post it soon. Also, in order to try and improve how far set-next
> >can jump ahead in the queue, I tested a kernel with group scheduling
> >compiled out (libvirt uses cgroups and I'm not sure autogroups may affect
> >things). I did get slight improvement with that, but nothing to write home
> >to mom about.
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index c219bf8..7d8a21d 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
> >  	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
> >  		return false;
> >
> >+	/* We're yielding, so tell the scheduler we don't want to be picked */
> >+	yield_task_fair(rq);
> >+
> >  	/* Tell the scheduler that we'd really like pse to run next. */
> >  	set_next_buddy(se);
> >
> >-	yield_task_fair(rq);
> >-
> >  	return true;
> >  }
> >
> 
> Hi Drew,  Agree with your fix and tested the patch too.. results are
> pretty much same.  puzzled why so.

Looking at the code I see that the next hint might be used more frequently
if we bump up sysctl/kernel.sched_wakeup_granularity_ns. I also just found
out that some virt tuned profiles do that, so maybe I should try running
with one of those profiles.

> 
> thinking ... may be we hit this when #vcpu (of a  VM) > #pcpu?
> (pigeonhole principle ;)).

Not sure, but I haven't done any experiments where a single VM has more
vcpus than the system has pcpus. For my vcpu overcommit runs I increase the
VM count, where each VM has #vcpus <= #pcpus.

Drew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-16  8:55                             ` Avi Kivity
  2012-09-17  8:10                               ` Andrew Jones
@ 2012-09-18  3:03                               ` Andrew Theurer
  2012-09-19 13:39                                 ` Avi Kivity
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Theurer @ 2012-09-18  3:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> 
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> > 
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> > 
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> > 
> > On picking a better vcpu to yield to:  I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to way more vcpus than are prempted lock-holders, and that
> > IMO is just work accomplishing nothing.  Trying to think of way to
> > further reduce candidate vcpus....
> 
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning.  While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.

OK, yes, if the other thread gets useful work done, then it is not
wasteful.  I was thinking of the worst case scenario, where any other
vcpu would likely spin as well, and the host side cpu-time for switching
vcpu threads was not all that productive.  Well, I suppose it does help
eliminate potential lock holding vcpus; it just seems to be not that
efficient or fast enough.

> The main problem with this theory is that the experiments don't seem to
> bear it out.

Granted, my test case is quite brutal.  It's nothing but over-committed
VMs which always have some spin lock activity.  However, we really
should try to fix the worst case scenario.

>   So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early.  That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself.  Is this
> possible with the current code?
> 
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.
> 
> Let's examine a few cases:
> 
> 1. spinner on cpu 0, lock holder on cpu 0
> 
> win!
> 
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
> 
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
> 
> 3. spinner on cpu 0, another spinner on cpu 0
> 
> Worst case, we'll just spin some more.  Need to detect this case and
> migrate something in.

Well, we can certainly experiment and see what we get.

IMO, the key to getting this working really well on the large VMs is
finding the lock-holding cpu -quickly-.  What I think is happening is
that we go through a relatively long process to get to that one right
vcpu.  I guess I need to find a faster way to get there.

> 4. spinner on cpu 0, alone
> 
> Similar
> 
> 
> It seems we need to tie in to the load balancer.
> 
> Would changing the priority of the task while it is spinning help the
> load balancer?

Not sure.

-Andrew







^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
  2012-09-18  3:03                               ` Andrew Theurer
@ 2012-09-19 13:39                                 ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-09-19 13:39 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, Peter Zijlstra, Srikar Dronamraju,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, KVM, chegu vinod,
	LKML, X86, Gleb Natapov, Srivatsa Vaddagiri

On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>> 
>> > The concern I have is that even though we have gone through changes to
>> > help reduce the candidate vcpus we yield to, we still have a very poor
>> > idea of which vcpu really needs to run.  The result is high cpu usage in
>> > the get_pid_task and still some contention in the double runqueue lock.
>> > To make this scalable, we either need to significantly reduce the
>> > occurrence of the lock-holder preemption, or do a much better job of
>> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
>> > which do not need to run).
>> > 
>> > On reducing the occurrence:  The worst case for lock-holder preemption
>> > is having vcpus of same VM on the same runqueue.  This guarantees the
>> > situation of 1 vcpu running while another [of the same VM] is not.  To
>> > prove the point, I ran the same test, but with vcpus restricted to a
>> > range of host cpus, such that any single VM's vcpus can never be on the
>> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
>> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
>> > 
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
>> > 
>> > On picking a better vcpu to yield to:  I really hesitate to rely on
>> > paravirt hint [telling us which vcpu is holding a lock], but I am not
>> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
>> > are yielding to way more vcpus than are prempted lock-holders, and that
>> > IMO is just work accomplishing nothing.  Trying to think of way to
>> > further reduce candidate vcpus....
>> 
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in a pause loop itself) and
>> the yielding vcpu gets put to sleep for a while, so it doesn't spend
>> cycles spinning.  While we haven't fixed the problem, at least the guest
>> is accomplishing work, and meanwhile the real lock holder may get
>> naturally scheduled and clear the lock.
> 
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful.  I was thinking of the worst case scenario, where any other
> vcpu would likely spin as well, and the host side cpu-time for switching
> vcpu threads was not all that productive.  Well, I suppose it does help
> eliminate potential lock holding vcpus; it just seems to be not that
> efficient or fast enough.

If we have N-1 vcpus spinwaiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right.  Eventually
its not-a-timeslice will expire and we go through this again.  If
N*t_yield is comparable to the timeslice, we start losing efficiency.
Because of lock contention, t_yield can scale with the number of host
cpus.  So in this worst case, we get quadratic behaviour.
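Spelling that out as a rough back-of-the-envelope model (t_yield is the
cost of one yield_to attempt; these are scaling assumptions, not
measurements):

    T_search ~= (N - 1) * t_yield      (scan the other vcpus until we hit the holder)
    t_yield   ~  nr_host_cpus          (double_rq_lock contention grows with host size)
    N         ~  nr_host_cpus          (fixed overcommit ratio)
    =>  T_search = O(N^2)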

One way out is to increase the not-a-timeslice.  Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?

That's not guaranteed to help: if we boost a running vcpu too much, it
will skew how vcpu runtime is distributed even after the lock is released.
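To make the fallback concrete, here is a minimal sketch of the shape it
could take in the kvm_vcpu_on_spin() path -- purely illustrative;
candidate_is_running() and extend_timeslice_of() are hypothetical helpers,
not existing kernel interfaces:

#include <linux/kvm_host.h>

/*
 * Sketch only: first look for a runnable-but-not-running vcpu (the
 * existing directed-yield case); if none is found, lengthen the slice
 * of a vcpu that is already on a cpu, so a preempted critical section
 * finishes sooner.
 */
static void spin_loop_fallback(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, me->kvm) {
		if (vcpu == me)
			continue;
		if (kvm_arch_vcpu_runnable(vcpu) && !candidate_is_running(vcpu)) {
			kvm_vcpu_yield_to(vcpu);	/* existing directed yield */
			return;
		}
	}

	kvm_for_each_vcpu(i, vcpu, me->kvm) {
		if (vcpu != me && candidate_is_running(vcpu)) {
			extend_timeslice_of(vcpu);	/* the "not-a-timeslice" boost */
			return;
		}
	}
}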

> 
>> The main problem with this theory is that the experiments don't seem to
>> bear it out.
> 
> Granted, my test case is quite brutal.  It's nothing but over-committed
> VMs which always have some spin lock activity.  However, we really
> should try to fix the worst case scenario.

Yes.  And other guests may not scale as well as Linux, so they may show
this behaviour more often.

> 
>>   So maybe one of the assumptions is wrong - the yielding
>> vcpu gets scheduled early.  That could be the case if the two vcpus are
>> on different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself.  Is this
>> possible with the current code?
>> 
>> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
>> and only fall back to remote vcpus when we see it didn't help.
>> 
>> Let's examine a few cases:
>> 
>> 1. spinner on cpu 0, lock holder on cpu 0
>> 
>> win!
>> 
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>> 
>> Spinner gets put to sleep, random vcpus get to work, low lock contention
>> (no double_rq_lock), by the time spinner gets scheduled we might have won
>> 
>> 3. spinner on cpu 0, another spinner on cpu 0
>> 
>> Worst case, we'll just spin some more.  Need to detect this case and
>> migrate something in.
> 
> Well, we can certainly experiment and see what we get.
> 
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding cpu -quickly-.  What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu.  I guess I need to find a faster way to get there.

pvspinlocks will find the right one, every time.  Otherwise I see no way
to do this.
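
For reference, the reason a paravirt hint removes the search entirely: if
the guest publishes who owns the lock it is spinning on, the PLE handler
can yield straight to that vcpu instead of scanning.  A purely
illustrative sketch -- the lock layout and kvm_get_spinning_lock_owner()
below are hypothetical, not part of the pv patches discussed in this
thread:

struct pv_spinlock {
	arch_spinlock_t lock;
	int owner_vcpu;		/* written by the guest on lock/unlock */
};

static void ple_directed_yield(struct kvm_vcpu *me)
{
	/* hypothetical: read the owner hint the guest published for the
	 * lock 'me' is spinning on */
	int owner = kvm_get_spinning_lock_owner(me);

	if (owner >= 0)
		kvm_vcpu_yield_to(kvm_get_vcpu(me->kvm, owner));	/* O(1), no scan */
}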

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-09-19 13:40 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
2012-07-18 14:39   ` Raghavendra K T
2012-07-19  9:47     ` [RESEND PATCH " Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-22 12:34   ` Raghavendra K T
2012-07-22 12:43     ` Avi Kivity
2012-07-23  7:35       ` Christian Borntraeger
2012-07-22 17:58     ` Rik van Riel
2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11   ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
2012-09-07 18:06     ` Raghavendra K T
2012-09-07 19:42       ` Andrew Theurer
2012-09-08  8:43         ` Srikar Dronamraju
2012-09-10 13:16           ` Andrew Theurer
2012-09-10 16:03             ` Peter Zijlstra
2012-09-10 16:56               ` Srikar Dronamraju
2012-09-10 17:12                 ` Peter Zijlstra
2012-09-10 19:10                   ` Raghavendra K T
2012-09-10 20:12                   ` Andrew Theurer
2012-09-10 20:19                     ` Peter Zijlstra
2012-09-10 20:31                       ` Rik van Riel
2012-09-11  6:08                     ` Raghavendra K T
2012-09-11 12:48                       ` Andrew Theurer
2012-09-11 18:27                       ` Andrew Theurer
2012-09-13 11:48                         ` Raghavendra K T
2012-09-13 21:30                           ` Andrew Theurer
2012-09-14 17:10                             ` Andrew Jones
2012-09-15 16:08                               ` Raghavendra K T
2012-09-17 13:48                                 ` Andrew Jones
2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
2012-09-17  8:02                               ` Andrew Jones
2012-09-16  8:55                             ` Avi Kivity
2012-09-17  8:10                               ` Andrew Jones
2012-09-18  3:03                               ` Andrew Theurer
2012-09-19 13:39                                 ` Avi Kivity
2012-09-13 12:13                         ` Avi Kivity
2012-09-11  7:04                   ` Srikar Dronamraju
2012-09-10 14:43         ` Raghavendra K T

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).