* [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
@ 2012-07-09  6:20 Raghavendra K T
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:20 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: S390, Carsten Otte, Christian Borntraeger, KVM, Raghavendra K T,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel


Currently, the Pause Loop Exit (PLE) handler does a directed yield to a
random VCPU on PL exit. Although we already filter the candidates for
yield_to, we can do better.

The problem is that, for large vcpu guests, there is a higher probability
of yielding to a bad vcpu. We cannot prevent a directed yield to the same
vcpu that has done a PL exit recently, which perhaps spins again and
wastes CPU.

Fix that by keeping track of which vcpus have done a PL exit. The algorithm
in this series gives a chance to a VCPU which has:

 (a) Not done a PL exit at all (it is probably a preempted lock-holder)

 (b) Been skipped in the last iteration because it did a PL exit, and has
 probably become eligible now (next eligible lock holder)
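
The two rules above can be sketched as a small user-space model. This is
only an illustration that mirrors the check added in patch 2/2; the names
struct ple_state, ple_state_init() and check_and_update_eligible() are
hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-vcpu state added by this series. */
struct ple_state {
	bool pause_loop_exited;	/* set on a PL exit, cleared on guest entry */
	bool dy_eligible;	/* alternates while the vcpu stays PL-exited */
};

/* Same initial values as kvm_arch_vcpu_init() in patch 1/2. */
void ple_state_init(struct ple_state *s)
{
	assert(s != NULL);
	s->pause_loop_exited = false;
	s->dy_eligible = true;
}

/*
 * Rule (a): a vcpu that has not PL-exited is always eligible.
 * Rule (b): a vcpu that has PL-exited is eligible on every other check,
 * because dy_eligible is flipped each time it is examined.
 */
bool check_and_update_eligible(struct ple_state *s)
{
	bool eligible;

	assert(s != NULL);
	eligible = !s->pause_loop_exited || s->dy_eligible;
	if (s->pause_loop_exited)
		s->dy_eligible = !s->dy_eligible;
	return eligible;
}
```

With these initial values, a vcpu that keeps pause_loop_exited set is
accepted on the first check, skipped on the second, accepted on the third,
and so on; once the flag is cleared it is accepted on every check.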

Future enhancements:
  (1) Currently a boolean decides the eligibility of a vcpu. It would be
    nice to get feedback on large guests (>32 vcpus) on whether an
    integer counter would do better (with counter = say f(log n)).

  (2) We have not considered system load during the iteration over vcpus.
   With that information we can limit the scan and also decide whether
   schedule() is better. [I am able to use the number of kicked vcpus to
   decide on this, but maybe there are better ideas, such as information
   from the global loadavg.]

  (3) We can exploit this further with the PV patches, since they also
   know about the next eligible lock-holder.

Summary: There is a huge improvement in the moderate / no overcommit
 scenarios for a KVM-based guest on a PLE machine (which is difficult ;) ).

Result:
Base : kernel 3.5.0-rc5 with Rik's PLE handler fix

Machine : Intel(R) Xeon(R) CPU X7560  @ 2.27GHz, 4 numa node, 256GB RAM,
          32 core machine

Host: enterprise linux  gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
  with test kernels 

Guest: fedora 16 with 32 vcpus 8GB memory. 

Benchmarks:
1) kernbench: kernbench-0.5 (kernbench -f -H -M -o 2*vcpu)
The very first kernbench run is omitted.

2) sysbench: 0.4.12
sysbench --test=oltp --db-driver=pgsql prepare
sysbench --num-threads=2*vcpu --max-requests=100000 --test=oltp --oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
Note that the database driver for this test is pgsql.

3) ebizzy: release 0.3
cmd: ebizzy -S 120 

              1) kernbench (time in sec, lower is better)
+----+-----------+----------+-----------+----------+------------+
|    | base_rik  | stdev    | patched   | stdev    | %improve   |
+----+-----------+----------+-----------+----------+------------+
| 1x |   49.2300 |   1.0171 |   38.3792 |   1.3659 |  28.27261% |
| 2x |   91.9358 |   1.7768 |   85.8842 |   1.6654 |   7.04623% |
+----+-----------+----------+-----------+----------+------------+

              2) sysbench (time in sec, lower is better)
+----+-----------+----------+-----------+----------+------------+
|    | base_rik  | stdev    | patched   | stdev    | %improve   |
+----+-----------+----------+-----------+----------+------------+
| 1x |   12.1623 |   0.0942 |   12.1674 |   0.3126 |  -0.04192% |
| 2x |   14.3069 |   0.8520 |   14.1879 |   0.6811 |   0.83874% |
+----+-----------+----------+-----------+----------+------------+

Note that the 1x sysbench results differ only in the third decimal place;
no degradation or improvement for sysbench would be seen even with a
higher confidence interval.


              3) ebizzy (records/sec, higher is better)
+----+-----------+----------+-----------+----------+------------+
|    | base_rik  | stdev    | patched   | stdev    | %improve   |
+----+-----------+----------+-----------+----------+------------+
| 1x | 1129.2500 |  28.6793 | 2316.6250 |  53.0066 | 105.14722% |
| 2x | 1892.3750 |  75.1112 | 2386.5000 | 168.8033 |  26.11137% |
+----+-----------+----------+-----------+----------+------------+

kernbench 1x: 4 fast runs = 12 runs avg
kernbench 2x: 4 fast runs = 12 runs avg

sysbench 1x: 8 runs avg
sysbench 2x: 8 runs avg

ebizzy 1x: 8 runs avg
ebizzy 2x: 8 runs avg
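
The %improve columns are not defined in the text. The formulas below are an
inference from the reported numbers (they reproduce every row of the three
tables), not something stated in this posting, and the function names are
hypothetical.

```c
#include <assert.h>

/*
 * Time-based results (kernbench, sysbench; lower is better).
 * Inferred formula: the time saved, relative to the patched time.
 */
double improve_time_pct(double base, double patched)
{
	assert(patched != 0.0);
	return (base - patched) / patched * 100.0;
}

/*
 * Throughput results (ebizzy records/sec; higher is better).
 * Inferred formula: the throughput gained, relative to the base.
 */
double improve_rate_pct(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

For example, the kernbench 1x row gives (49.2300 - 38.3792) / 38.3792,
about 28.27%, and the ebizzy 1x row gives (2316.6250 - 1129.2500) /
1129.2500, about 105.15%.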

Thanks Vatsa and Srikar for brainstorming discussions regarding
optimizations.

 Raghavendra K T (2):
   kvm vcpu: Note down pause loop exit
   kvm PLE handler: Choose better candidate for directed yield

 arch/s390/include/asm/kvm_host.h |    5 +++++
 arch/x86/include/asm/kvm_host.h  |    9 ++++++++-
 arch/x86/kvm/svm.c               |    1 +
 arch/x86/kvm/vmx.c               |    1 +
 arch/x86/kvm/x86.c               |   18 +++++++++++++++++-
 virt/kvm/kvm_main.c              |    3 +++
 6 files changed, 35 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-09  6:20 ` Raghavendra K T
  2012-07-09  6:33     ` Raghavendra K T
                     ` (2 more replies)
  2012-07-09  6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:20 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel
  Cc: S390, Carsten Otte, Christian Borntraeger, KVM, Raghavendra K T,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Noting which vcpus have pause-loop exited helps in filtering the right
candidate to yield to. Yielding to the same vcpu again may waste more CPU.

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    7 +++++++
 arch/x86/kvm/svm.c              |    1 +
 arch/x86/kvm/vmx.c              |    1 +
 arch/x86/kvm/x86.c              |    4 +++-
 4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..857ca68 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
 		u64 length;
 		u64 status;
 	} osvw;
+
+	/* Pause loop exit optimization */
+	struct {
+		bool pause_loop_exited;
+		bool dy_eligible;
+	} plo;
+
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f75af40..a492f5d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
 
 static int pause_interception(struct vcpu_svm *svm)
 {
+	svm->vcpu.arch.plo.pause_loop_exited = true;
 	kvm_vcpu_on_spin(&(svm->vcpu));
 	return 1;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..600fb3c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4945,6 +4945,7 @@ out:
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
 	skip_emulated_instruction(vcpu);
+	vcpu->arch.plo.pause_loop_exited = true;
 	kvm_vcpu_on_spin(vcpu);
 
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..07dbd14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5331,7 +5331,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	if (req_immediate_exit)
 		smp_send_reschedule(vcpu->cpu);
-
+	vcpu->arch.plo.pause_loop_exited = false;
 	kvm_guest_enter();
 
 	if (unlikely(vcpu->arch.switch_db_regs)) {
@@ -6168,6 +6168,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	BUG_ON(vcpu->kvm == NULL);
 	kvm = vcpu->kvm;
 
+	vcpu->arch.plo.pause_loop_exited = false;
+	vcpu->arch.plo.dy_eligible = true;
 	vcpu->arch.emulate_ctxt.ops = &emulate_ops;
 	if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_bsp(vcpu))
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
@ 2012-07-09  6:20 ` Raghavendra K T
  2012-07-09 22:30   ` Rik van Riel
  2012-07-09  7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:20 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: S390, Carsten Otte, Christian Borntraeger, KVM, Raghavendra K T,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Currently the PLE handler can repeatedly do a directed yield to the same
vcpu that has recently done a PL exit. This can degrade performance.
Try to yield to the most eligible vcpu instead, by alternate yielding.

Precisely, give a chance to a VCPU which has:
 (a) Not done a PL exit at all (it is probably a preempted lock-holder)
 (b) Been skipped in the last iteration because it did a PL exit, and has
 probably become eligible now (next eligible lock holder)
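
How the eligibility check changes candidate selection can be sketched with
a simplified user-space model. It assumes a single round-robin pass that
starts after the last boosted vcpu and assumes the yield always succeeds;
the real kvm_vcpu_on_spin() loop is more involved. struct model_vcpu and
both functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one vcpu; not the kernel's struct kvm_vcpu. */
struct model_vcpu {
	bool runnable;		/* false models a vcpu blocked on its wait queue */
	bool pause_loop_exited;	/* vcpu recently did a PL exit */
	bool dy_eligible;	/* alternates while pause_loop_exited is set */
};

/* Same logic as the x86 check added by this patch. */
bool model_check_and_update(struct model_vcpu *v)
{
	bool eligible = !v->pause_loop_exited || v->dy_eligible;

	if (v->pause_loop_exited)
		v->dy_eligible = !v->dy_eligible;
	return eligible;
}

/*
 * Return the index of the first acceptable yield target, scanning
 * round-robin from the slot after last_boosted, or -1 if none is found.
 */
int model_pick_yield_target(struct model_vcpu *vcpus, int n, int me,
			    int last_boosted)
{
	int step;

	assert(vcpus != NULL && n > 0);
	for (step = 1; step <= n; step++) {
		int i = (last_boosted + step) % n;

		if (i == me)
			continue;	/* never yield to ourselves */
		if (!vcpus[i].runnable)
			continue;	/* blocked vcpus cannot use the boost */
		if (!model_check_and_update(&vcpus[i]))
			continue;	/* recently spinning: skip this time */
		return i;
	}
	return -1;
}
```

In this model a vcpu that is skipped once has its dy_eligible flag flipped,
so the next scan that reaches it will accept it.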

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |    5 +++++
 arch/x86/include/asm/kvm_host.h  |    2 +-
 arch/x86/kvm/x86.c               |   14 ++++++++++++++
 virt/kvm/kvm_main.c              |    3 +++
 4 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index dd17537..884f2c4 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -256,5 +256,10 @@ struct kvm_arch{
 	struct gmap *gmap;
 };
 
+static inline bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *v)
+{
+	return true;
+}
+
 extern int sie64a(struct kvm_s390_sie_block *, u64 *);
 #endif
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 857ca68..ce01db3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -962,7 +962,7 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 void kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 
 int kvm_is_in_guest(void);
-
+bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07dbd14..24ceae8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6623,6 +6623,20 @@ bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 			kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
+bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->arch.plo.pause_loop_exited ||
+			(vcpu->arch.plo.pause_loop_exited &&
+			 vcpu->arch.plo.dy_eligible);
+
+	if (vcpu->arch.plo.pause_loop_exited)
+		vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible;
+
+	return eligible;
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..519321a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1595,6 +1595,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
 				continue;
+			if (!kvm_arch_vcpu_check_and_update_eligible(vcpu)) {
+				continue;
+			}
 			if (kvm_vcpu_yield_to(vcpu)) {
 				kvm->last_boosted_vcpu = i;
 				yielded = 1;


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
@ 2012-07-09  6:33     ` Raghavendra K T
  2012-07-09 22:39   ` Rik van Riel
  2012-07-11  8:53   ` Avi Kivity
  2 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:33 UTC (permalink / raw)
  Cc: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 11:50 AM, Raghavendra K T wrote:
> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>
> Noting pause loop exited vcpu helps in filtering right candidate to yield.
> Yielding to same vcpu may result in more wastage of cpu.
>
> From: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
> ---

Oops. Sorry, somehow the Signed-off-by and From lines got interchanged.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
  2012-07-09  6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-09  7:55 ` Christian Borntraeger
  2012-07-10  8:27   ` Raghavendra K T
  2012-07-11  9:06   ` Avi Kivity
  2012-07-09 21:47   ` Andrew Theurer
  2012-07-09 22:28 ` Rik van Riel
  4 siblings, 2 replies; 52+ messages in thread
From: Christian Borntraeger @ 2012-07-09  7:55 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel

On 09/07/12 08:20, Raghavendra K T wrote:
> Currently Pause Looop Exit (PLE) handler is doing directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.
> 
> Problem is, for large vcpu guests, we have more probability of yielding
> to a bad vcpu. We are not able to prevent directed yield to same guy who
> has done PL exit recently, who perhaps spins again and wastes CPU.
> 
> Fix that by keeping track of who has done PL exit. So The Algorithm in series
> give chance to a VCPU which has:


We could do the same for s390. The appropriate exit would be diag44 (yield to hypervisor).

Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
So there is no win here, but there are other cases where diag44 is used, e.g. cpu_relax.
I have to double check with others whether these cases are critical, but for now it seems
that your dummy implementation for s390 is just fine. After all, it is a no-op until
we implement something.

Thanks

Christian


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-09 21:47   ` Andrew Theurer
  2012-07-09  6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Andrew Theurer @ 2012-07-09 21:47 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> Currently Pause Looop Exit (PLE) handler is doing directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.

Hi, Raghu.

> Problem is, for large vcpu guests, we have more probability of yielding
> to a bad vcpu. We are not able to prevent directed yield to same guy who
> has done PL exit recently, who perhaps spins again and wastes CPU.
> 
> Fix that by keeping track of who has done PL exit. So The Algorithm in series
> give chance to a VCPU which has:
> 
>  (a) Not done PLE exit at all (probably he is preempted lock-holder)
> 
>  (b) VCPU skipped in last iteration because it did PL exit, and probably
>  has become eligible now (next eligible lock holder)
> 
> Future enhancemnets:
>   (1) Currently we have a boolean to decide on eligibility of vcpu. It
>     would be nice if I get feedback on guest (>32 vcpu) whether we can
>     improve better with integer counter. (with counter = say f(log n )).
>   
>   (2) We have not considered system load during iteration of vcpu. With
>    that information we can limit the scan and also decide whether schedule()
>    is better. [ I am able to use #kicked vcpus to decide on this But may
>    be there are better ideas like information from global loadavg.]
> 
>   (3) We can exploit this further with PV patches since it also knows about
>    next eligible lock-holder.
> 
> Summary: There is a huge improvement for moderate / no overcommit scenario
>  for kvm based guest on PLE machine (which is difficult ;) ).
> 
> Result:
> Base : kernel 3.5.0-rc5 with Rik's Ple handler fix
> 
> Machine : Intel(R) Xeon(R) CPU X7560  @ 2.27GHz, 4 numa node, 256GB RAM,
>           32 core machine

Is this with HT enabled, therefore 64 CPU threads?

> Host: enterprise linux  gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
>   with test kernels 
> 
> Guest: fedora 16 with 32 vcpus 8GB memory. 

Can you briefly explain the 1x and 2x configs?  This of course is highly
dependent on whether or not HT is enabled...

FWIW, I started testing what I would call "0.5x", where I have one 40
vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
enabled, no extra load on the system).  For ebizzy, the results are
quite erratic from run to run, so I am inclined to discard it as a
workload, but maybe I should try "1x" and "2x" cpu over-commit as well.

From initial observations, at least for the ebizzy workload, the
percentage of exits that result in a yield_to() is very low, around 1%,
before these patches.  So, I am concerned that at least for this test,
reducing that number even more has diminishing returns.  I am however
still concerned about the scalability problem with yield_to(), which
shows up like this for me (perf):

> 63.56%     282095         qemu-kvm  [kernel.kallsyms]        [k] _raw_spin_lock                  
> 5.42%      24420         qemu-kvm  [kvm]                    [k] kvm_vcpu_yield_to               
> 5.33%      26481         qemu-kvm  [kernel.kallsyms]        [k] get_pid_task                    
> 4.35%      20049         qemu-kvm  [kernel.kallsyms]        [k] yield_to                        
> 2.74%      15652         qemu-kvm  [kvm]                    [k] kvm_apic_present                
> 1.70%       8657         qemu-kvm  [kvm]                    [k] kvm_vcpu_on_spin                
> 1.45%       7889         qemu-kvm  [kvm]                    [k] vcpu_enter_guest                
         
For the cpu threads in the host that are actually active (in this case
1/2 of them), ~50% of their time is in kernel and ~43% in guest.  This
is for a no-IO workload, so that's just incredible to see so much cpu
wasted.  I feel that 2 important areas to tackle are a more scalable
yield_to() and reducing the number of pause exits itself (hopefully by
just tuning ple_window for the latter).

Honestly, I am not confident that addressing this problem will improve the
ebizzy score. That workload is so erratic for me that I do not trust
the results at all.  I have however seen consistent improvements from
disabling PLE for an http guest workload and a very high IOPS guest
workload, both with much time spent in the host in the double runqueue lock
for yield_to(), so that's why I still gravitate toward that issue.


-Andrew Theurer


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
                   ` (3 preceding siblings ...)
  2012-07-09 21:47   ` Andrew Theurer
@ 2012-07-09 22:28 ` Rik van Riel
  4 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2012-07-09 22:28 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:
> Currently Pause Looop Exit (PLE) handler is doing directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.
>
> Problem is, for large vcpu guests, we have more probability of yielding
> to a bad vcpu. We are not able to prevent directed yield to same guy who
> has done PL exit recently, who perhaps spins again and wastes CPU.
>
> Fix that by keeping track of who has done PL exit. So The Algorithm in series
> give chance to a VCPU which has:
>
>   (a) Not done PLE exit at all (probably he is preempted lock-holder)
>
>   (b) VCPU skipped in last iteration because it did PL exit, and probably
>   has become eligible now (next eligible lock holder)
>
> Future enhancemnets:

Your patch series looks good to me. Simple changes with a
significant result.

However, the simple heuristic could use some comments :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield
  2012-07-09  6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T
@ 2012-07-09 22:30   ` Rik van Riel
  2012-07-10 11:46     ` Raghavendra K T
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2012-07-09 22:30 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:

> +bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu)
> +{
> +	bool eligible;
> +
> +	eligible = !vcpu->arch.plo.pause_loop_exited ||
> +			(vcpu->arch.plo.pause_loop_exited&&
> +			 vcpu->arch.plo.dy_eligible);
> +
> +	if (vcpu->arch.plo.pause_loop_exited)
> +		vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible;
> +
> +	return eligible;
> +}

This is a nice simple mechanism to skip CPUs that were
eligible last time and had pause loop exits recently.

However, it could stand some documentation.  Please
add a good comment explaining how and why the algorithm
works, when arch.plo.pause_loop_exited is cleared, etc...

It would be good to make this heuristic understandable
to people who look at the code for the first time.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
  2012-07-09  6:33     ` Raghavendra K T
@ 2012-07-09 22:39   ` Rik van Riel
  2012-07-10 11:22     ` Raghavendra K T
  2012-07-11  8:53   ` Avi Kivity
  2 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2012-07-09 22:39 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:

> @@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
>   		u64 length;
>   		u64 status;
>   	} osvw;
> +
> +	/* Pause loop exit optimization */
> +	struct {
> +		bool pause_loop_exited;
> +		bool dy_eligible;
> +	} plo;

I know kvm_vcpu_arch is traditionally not a well documented
structure, but it would be really nice if each variable inside
this sub-structure could get some documentation.

Also, do we really want to introduce another acronym here?

Or would we be better off simply calling this struct .ple,
since that is a name people are already familiar with.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09  7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger
@ 2012-07-10  8:27   ` Raghavendra K T
  2012-07-11  9:06   ` Avi Kivity
  1 sibling, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10  8:27 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 01:25 PM, Christian Borntraeger wrote:
> On 09/07/12 08:20, Raghavendra K T wrote:
>> Currently Pause Looop Exit (PLE) handler is doing directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
>>
>> Problem is, for large vcpu guests, we have more probability of yielding
>> to a bad vcpu. We are not able to prevent directed yield to same guy who
>> has done PL exit recently, who perhaps spins again and wastes CPU.
>>
>> Fix that by keeping track of who has done PL exit. So The Algorithm in series
>> give chance to a VCPU which has:
>
>
> We could do the same for s390. The appropriate exit would be diag44 (yield to hypervisor).
>
> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
> So there is no win here, but there are other cases where diag44 is used, e.g. cpu_relax.
> I have to double check with others, if these cases are critical, but for now, it seems
> that your dummy implementation  for s390 is just fine. After all it is a no-op until
> we implement something.
>

Thanks for the review. Nice to know that the patch has the potential to
help s390 also.




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09 21:47   ` Andrew Theurer
  (?)
@ 2012-07-10  9:26   ` Raghavendra K T
  -1 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10  9:26 UTC (permalink / raw)
  To: Andrew M. Theurer
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/10/2012 03:17 AM, Andrew Theurer wrote:
 > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
 >> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
 >> random VCPU on PL exit. Though we already have filtering while choosing
 >> the candidate to yield_to, we can do better.
 >
 > Hi, Raghu.
Hi Andrew,
Thank you for your analysis and inputs

 >
 >> Problem is, for large vcpu guests, we have more probability of yielding
 >> to a bad vcpu. We are not able to prevent directed yield to same guy who
 >> has done PL exit recently, who perhaps spins again and wastes CPU.
 >>
 >> Fix that by keeping track of who has done PL exit. So The Algorithm in series
 >> give chance to a VCPU which has:
 >>
 >>   (a) Not done PLE exit at all (probably he is preempted lock-holder)
 >>
 >>   (b) VCPU skipped in last iteration because it did PL exit, and probably
 >>   has become eligible now (next eligible lock holder)
 >>
 >> Future enhancements:
 >>    (1) Currently we have a boolean to decide on eligibility of vcpu. It
 >>      would be nice if I get feedback on guest (>32 vcpu) whether we can
 >>      improve better with integer counter. (with counter = say f(log n )).
 >>
 >>    (2) We have not considered system load during iteration of vcpu. With
 >>     that information we can limit the scan and also decide whether schedule()
 >>     is better. [ I am able to use #kicked vcpus to decide on this But may
 >>     be there are better ideas like information from global loadavg.]
 >>
 >>    (3) We can exploit this further with PV patches since it also knows about
 >>     next eligible lock-holder.
 >>
 >> Summary: There is a huge improvement for moderate / no overcommit scenario
 >>   for kvm based guest on PLE machine (which is difficult ;) ).
 >>
 >> Result:
 >> Base : kernel 3.5.0-rc5 with Rik's Ple handler fix
 >>
 >> Machine : Intel(R) Xeon(R) CPU X7560  @ 2.27GHz, 4 numa node, 256GB RAM,
 >>            32 core machine
 >
 > Is this with HT enabled, therefore 64 CPU threads?

No. HT disabled with 32 online CPUs

 >
 >> Host: enterprise linux  gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
 >>    with test kernels
 >>
 >> Guest: fedora 16 with 32 vcpus 8GB memory.
 >
 > Can you briefly explain the 1x and 2x configs?  This of course is highly
 > dependent whether or not HT is enabled...

1x config:  kernbench/ebizzy/sysbench running on 1 guest (32 vcpu)
  all the benchmarks have 2*#vcpu = 64 threads

2x config:  kernbench/ebizzy/sysbench running on 2 guests (32 vcpu each)
  all the benchmarks have 2*#vcpu = 64 threads

 >
 > FWIW, I started testing what I would call "0.5x", where I have one 40
 > vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
 > enabled, no extra load on the system).  For ebizzy, the results are
 > quite erratic from run to run, so I am inclined to discard it as a

I will post the full per-run details in a reply to this mail, since
they are long. I have also posted the stdev with the results; it has
not shown much deviation.

 > workload, but maybe I should try "1x" and "2x" cpu over-commit as well.
 >
 > From initial observations, at least for the ebizzy workload, the
 > percentage of exits that result in a yield_to() are very low, around 1%,
 > before these patches.

Hmm, OK.
IMO, for an undercommitted workload a low percentage of yield_to was
probably expected, though I am not sure whether 1% is too low.
More importantly, the number of successful yield_to calls by itself
does not measure benefit.

What I am trying to ensure with this patch is that a successful
yield_to actually results in a benefit.

 > So, I am concerned that at least for this test,
 > reducing that number even more has diminishing returns.  I am however
 > still concerned about the scalability problem with yield_to(),

So do you mean that you expect to see more yield_to overhead with
large guests?
As already mentioned under future enhancements, I plan to try the
following:

a. use a counter instead of a boolean for skipping yield_to
b. scan only about f(log(n)) vcpus for a yield candidate, then
   schedule() or return depending on system load

That would reduce the overall vcpu iteration in the PLE handler from
O(n * n) to O(n log n).
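
As a rough illustration of point (b), here is a minimal sketch of a
bounded scan limit. It assumes f(log(n)) is simply floor(log2(n)) with
a floor of 1; the name scan_limit() is hypothetical and nothing like it
exists in the posted patches.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many candidate vcpus to look at per PLE
 * exit. ASSUMPTION: f(log n) is floor(log2(n)), never less than 1.
 */
static unsigned int scan_limit(unsigned int nr_vcpus)
{
	unsigned int limit = 0;

	assert(nr_vcpus > 0);

	/* Integer log2 by repeated halving. */
	while (nr_vcpus > 1) {
		nr_vcpus >>= 1;
		limit++;
	}
	return limit ? limit : 1;
}
```

For a 32-vcpu guest this gives 5 candidates per PLE exit instead of 32,
so 32 exits cost about 32 * 5 = 160 candidate checks rather than
32 * 32 = 1024.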

 > which shows like this for me (perf):
 >
 >> 63.56%     282095         qemu-kvm  [kernel.kallsyms]        [k] _raw_spin_lock
 >> 5.42%      24420         qemu-kvm  [kvm]                    [k] kvm_vcpu_yield_to
 >> 5.33%      26481         qemu-kvm  [kernel.kallsyms]        [k] get_pid_task
 >> 4.35%      20049         qemu-kvm  [kernel.kallsyms]        [k] yield_to
 >> 2.74%      15652         qemu-kvm  [kvm]                    [k] kvm_apic_present
 >> 1.70%       8657         qemu-kvm  [kvm]                    [k] kvm_vcpu_on_spin
 >> 1.45%       7889         qemu-kvm  [kvm]                    [k] vcpu_enter_guest
 >
 > For the cpu threads in the host that are actually active (in this case
 > 1/2 of them), ~50% of their time is in kernel and ~43% in guest.  This
 > is for a no-IO workload, so that's just incredible to see so much cpu
 > wasted.  I feel that 2 important areas to tackle are a more scalable
 > yield_to() and reducing the number of pause exits itself (hopefully by
 > just tuning ple_window for the latter).

I think this is a concern and as you stated I agree that tuning
ple_window helps here.

 >
 > Honestly, I am not confident addressing this problem will improve the
 > ebizzy score. That workload is so erratic for me, that I do not trust
 > the results at all.  I have however seen consistent improvements in
 > disabling PLE for a http guest workload and a very high IOPS guest
 > workload, both with much time spent in host in the double runqueue lock
 > for yield_to(), so that's why I still gravitate toward that issue.

The problem (with PLE disabled) starts as soon as the workload goes
just above 1x: we start burning a lot of cpu.

IIRC, in 2x overcommit, a kernel compilation that takes 10hr on non-PLE
used to take just 1hr after the pv patches (and it should be the same
with PLE enabled).

Leaving the PLE-disabled case aside, I do not expect any degradation
even in the 0.5x scenario, though you say the results are erratic.

Could you please let me know whether, with PLE enabled, you saw any
degradation for 0.5x before versus after the patch?

 > -Andrew Theurer
 >
 >



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler : detailed result
  2012-07-09 21:47   ` Andrew Theurer
  (?)
  (?)
@ 2012-07-10 10:07   ` Raghavendra K T
  -1 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 10:07 UTC (permalink / raw)
  To: habanero
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Raghavendra

On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
>> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
>
[...]
> Honestly, I am not confident addressing this problem will improve the
> ebizzy score. That workload is so erratic for me, that I do not trust
> the results at all.  I have however seen consistent improvements in
> disabling PLE for a http guest workload and a very high IOPS guest
> workload, both with much time spent in host in the double runqueue lock
> for yield_to(), so that's why I still gravitate toward that issue.
>
Detailed result
Base + Rik patch

ebizzy
=========

overcommit 1 x
1160 records/s
real 60.00 s
user  6.28 s
sys  1078.69 s
1130 records/s
real 60.00 s
user  5.15 s
sys  1080.51 s
1073 records/s
real 60.00 s
user  5.02 s
sys  1030.21 s
1151 records/s
real 60.00 s
user  5.51 s
sys  1097.63 s
1145 records/s
real 60.00 s
user  5.21 s
sys  1093.56 s
1149 records/s
real 60.00 s
user  5.32 s
sys  1097.30 s
1111 records/s
real 60.00 s
user  5.16 s
sys  1061.77 s
1115 records/s
real 60.00 s
user  5.16 s
sys  1066.99 s

overcommit 2 x
1818 records/s
real 60.00 s
user 11.67 s
sys  843.84 s
1809 records/s
real 60.00 s
user 11.77 s
sys  845.68 s
1865 records/s
real 60.00 s
user 11.94 s
sys  866.69 s
1822 records/s
real 60.00 s
user 12.81 s
sys  843.05 s
1928 records/s
real 60.00 s
user 14.02 s
sys  887.86 s
1915 records/s
real 60.00 s
user 11.55 s
sys  888.68 s
1997 records/s
real 60.00 s
user 11.34 s
sys  923.54 s
1985 records/s
real 60.00 s
user 11.41 s
sys  923.44 s

kernbench
===============
overcommit 1 x
Elapsed Time 49.2367 (33.6921)
User Time 243.313 (343.965)
System Time 385.21 (125.151)
Percent CPU 1243.33 (79.5257)
Context Switches 58450.7 (31603.6)
Sleeps 73987 (41782.5)
--
Elapsed Time 47.8367 (37.2156)
User Time 244.79 (349.112)
System Time 338.553 (141.732)
Percent CPU 1181 (81.074)
Context Switches 56194.3 (36421.6)
Sleeps 74355.3 (40263.5)
--
Elapsed Time 49.6067 (34.7325)
User Time 250.117 (354.008)
System Time 341.277 (57.5594)
Percent CPU 1197 (46.3573)
Context Switches 55520.3 (27748.1)
Sleeps 72673 (38997.4)
--
Elapsed Time 50.24 (36.6571)
User Time 247.873 (352.427)
System Time 349.11 (79.4226)
Percent CPU 1193.67 (50.362)
Context Switches 55153.3 (27926.2)
Sleeps 73128 (39532.4)

overcommit 2 x
Elapsed Time 91.9233 (96.6304)
User Time 278.347 (371.217)
System Time 222.447 (181.378)
Percent CPU 521.667 (46.1988)
Context Switches 49597 (35766.4)
Sleeps 77939.7 (36840.1)
--
Elapsed Time 89.48 (92.7224)
User Time 275.223 (364.737)
System Time 202.473 (172.233)
Percent CPU 497.333 (53.0031)
Context Switches 44117 (30001)
Sleeps 77196 (35746.2)
--
Elapsed Time 93.6133 (95.7924)
User Time 294.767 (379.39)
System Time 235.487 (207.567)
Percent CPU 529.667 (58.2866)
Context Switches 50588 (36669.4)
Sleeps 79323.7 (38285.8)
--
Elapsed Time 92.7267 (100.928)
User Time 286.537 (384.253)
System Time 232.983 (192.233)
Percent CPU 552 (76.961)
Context Switches 51071 (35090)
Sleeps 79059 (36466.4)

sysbench
==============
overcommit 1 x
     total time:                          12.1229s
     total number of events:              100041
     total time taken by event execution: 772.8819
--
     total time:                          12.0775s
     total number of events:              100013
     total time taken by event execution: 769.5969
--
     total time:                          12.1671s
     total number of events:              100011
     total time taken by event execution: 775.5967
--
     total time:                          12.2695s
     total number of events:              100003
     total time taken by event execution: 782.3780
--
     total time:                          12.1526s
     total number of events:              100014
     total time taken by event execution: 773.9802
--
     total time:                          12.3350s
     total number of events:              100069
     total time taken by event execution: 786.2091
--
     total time:                          12.1019s
     total number of events:              100013
     total time taken by event execution: 771.5163
--
     total time:                          12.0716s
     total number of events:              100010
     total time taken by event execution: 769.8809

overcommit 2 x
     total time:                          13.6532s
     total number of events:              100011
     total time taken by event execution: 870.0869
--
     total time:                          15.8572s
     total number of events:              100010
     total time taken by event execution: 910.6689
--
     total time:                          13.6100s
     total number of events:              100008
     total time taken by event execution: 867.1782
--
     total time:                          15.4295s
     total number of events:              100008
     total time taken by event execution: 917.8441
--
     total time:                          13.8994s
     total number of events:              100004
     total time taken by event execution: 885.6729
--
     total time:                          14.2006s
     total number of events:              100005
     total time taken by event execution: 887.0262
--
     total time:                          13.8869s
     total number of events:              100011
     total time taken by event execution: 885.3583
--
     total time:                          13.9183s
     total number of events:              100007
     total time taken by event execution: 880.4344

With Rik + PLE handler optimization patch
===========================================
ebizzy
==========
overcommit 1 x
2249 records/s
real 60.00 s
user  9.87 s
sys  1529.54 s
2316 records/s
real 60.00 s
user 10.51 s
sys  1550.33 s
2353 records/s
real 60.00 s
user 10.82 s
sys  1565.10 s
2365 records/s
real 60.00 s
user 10.88 s
sys  1569.00 s
2282 records/s
real 60.00 s
user 10.77 s
sys  1540.03 s
2292 records/s
real 60.00 s
user 10.60 s
sys  1553.76 s
2272 records/s
real 60.00 s
user 10.44 s
sys  1510.90 s
2404 records/s
real 60.00 s
user 10.96 s
sys  1563.49 s

overcommit 2 x
2454 records/s
real 60.00 s
user 14.66 s
sys  880.17 s
2192 records/s
real 60.00 s
user 15.56 s
sys  881.12 s
2329 records/s
real 60.00 s
user 17.56 s
sys  933.03 s
2281 records/s
real 60.00 s
user 16.22 s
sys  925.34 s
2286 records/s
real 60.00 s
user 16.93 s
sys  902.04 s
2289 records/s
real 60.00 s
user 15.53 s
sys  909.78 s
2586 records/s
real 60.00 s
user 15.38 s
sys  857.22 s
2675 records/s
real 60.00 s
user 15.93 s
sys  842.40 s

kernbench
=============
overcommit 1 x
Elapsed Time 36.6633 (33.6422)
User Time 248.303 (359.64)
System Time 123.003 (67.1702)
Percent CPU 864 (242.52)
Context Switches 44936.3 (28799.8)
Sleeps 76076.7 (41142.1)
--
Elapsed Time 37.9167 (37.3285)
User Time 247.517 (358.659)
System Time 118.883 (86.7824)
Percent CPU 807.333 (245.133)
Context Switches 44219.3 (29480.9)
Sleeps 77137.3 (42685.4)
--
Elapsed Time 39.65 (39.0432)
User Time 248.07 (357.765)
System Time 100.76 (58.7603)
Percent CPU 748.333 (199.803)
Context Switches 42332.3 (27183.7)
Sleeps 75248.7 (41084.4)
--
Elapsed Time 39.2867 (39.8316)
User Time 245.903 (356.194)
System Time 101.783 (60.4971)
Percent CPU 762.667 (186.827)
Context Switches 42289.3 (24882.1)
Sleeps 74964.7 (38139.1)

overcommit 2 x
Elapsed Time 85.6567 (92.092)
User Time 274.607 (370.598)
System Time 172.12 (134.705)
Percent CPU 496.667 (34.2977)
Context Switches 45715.7 (29180.4)
Sleeps 76054 (34844.5)
--
Elapsed Time 86.8667 (92.72)
User Time 278.767 (365.877)
System Time 193.277 (142.811)
Percent CPU 538.667 (36.5558)
Context Switches 48035.3 (32107.3)
Sleeps 78004.7 (37835.6)
--
Elapsed Time 87.38 (91.6723)
User Time 269.133 (374.608)
System Time 165.283 (122.423)
Percent CPU 465.667 (119.068)
Context Switches 45107.3 (29571.6)
Sleeps 76942.7 (33102.4)
--
Elapsed Time 83.6333 (96.6314)
User Time 267.97 (374.691)
System Time 156.843 (123.183)
Percent CPU 503 (28.5832)
Context Switches 44406.7 (30002.8)
Sleeps 78975.7 (40787.4)

sysbench
=================
overcommit 1 x
     total time:                          11.7338s
     total number of events:              100021
     total time taken by event execution: 747.8628
--
     total time:                          11.9323s
     total number of events:              100006
     total time taken by event execution: 760.7567
--
     total time:                          12.0282s
     total number of events:              100068
     total time taken by event execution: 766.2259
--
     total time:                          12.0065s
     total number of events:              100010
     total time taken by event execution: 765.0691
--
     total time:                          12.2033s
     total number of events:              100016
     total time taken by event execution: 777.9971
--
     total time:                          12.2472s
     total number of events:              100041
     total time taken by event execution: 780.9914
--
     total time:                          12.4853s
     total number of events:              100015
     total time taken by event execution: 795.9082
--
     total time:                          12.7028s
     total number of events:              100015
     total time taken by event execution: 810.4563

overcommit 2 x
     total time:                          13.7335s
     total number of events:              100005
     total time taken by event execution: 872.0665
--
     total time:                          14.0005s
     total number of events:              100010
     total time taken by event execution: 892.4587
--
     total time:                          13.8066s
     total number of events:              100008
     total time taken by event execution: 880.2714
--
     total time:                          14.6350s
     total number of events:              100006
     total time taken by event execution: 875.3052
--
     total time:                          13.8536s
     total number of events:              100007
     total time taken by event execution: 877.8040
--
     total time:                          15.7213s
     total number of events:              100007
     total time taken by event execution: 896.5455
--
     total time:                          13.9135s
     total number of events:              100007
     total time taken by event execution: 882.0964
--
     total time:                          13.8390s
     total number of events:              100009
     total time taken by event execution: 881.8267
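
To compare the two kernels at a glance, the per-run ebizzy numbers
above can be averaged. The following is only a small sketch for the 1x
case; the helper mean() and the array names are mine, and the values
are copied from the runs listed in this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n integer samples. */
static double mean(const int *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* ebizzy records/s at 1x overcommit, base + Rik patch. */
static const int base_1x[] = {
	1160, 1130, 1073, 1151, 1145, 1149, 1111, 1115
};

/* ebizzy records/s at 1x overcommit, Rik + PLE handler optimization. */
static const int patched_1x[] = {
	2249, 2316, 2353, 2365, 2282, 2292, 2272, 2404
};
```

mean(base_1x, 8) is 1129.25 and mean(patched_1x, 8) is 2316.625
records/s, i.e. roughly a 2x improvement for ebizzy at 1x.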



* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09 22:39   ` Rik van Riel
@ 2012-07-10 11:22     ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 11:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/10/2012 04:09 AM, Rik van Riel wrote:
> On 07/09/2012 02:20 AM, Raghavendra K T wrote:
>
>> @@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
>> u64 length;
>> u64 status;
>> } osvw;
>> +
>> + /* Pause loop exit optimization */
>> + struct {
>> + bool pause_loop_exited;
>> + bool dy_eligible;
>> + } plo;
>
> I know kvm_vcpu_arch is traditionally not a well documented
> structure, but it would be really nice if each variable inside
> this sub-structure could get some documentation.

Sure. Will document it.

>
> Also, do we really want to introduce another acronym here?
>
> Or would we be better off simply calling this struct .ple,
> since that is a name people are already familiar with.

Yes, it makes sense to have .ple.




* Re: [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield
  2012-07-09 22:30   ` Rik van Riel
@ 2012-07-10 11:46     ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 11:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/10/2012 04:00 AM, Rik van Riel wrote:
> On 07/09/2012 02:20 AM, Raghavendra K T wrote:
>
>> +bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu)
>> +{
>> + bool eligible;
>> +
>> + eligible = !vcpu->arch.plo.pause_loop_exited ||
>> + (vcpu->arch.plo.pause_loop_exited&&
>> + vcpu->arch.plo.dy_eligible);
>> +
>> + if (vcpu->arch.plo.pause_loop_exited)
>> + vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible;
>> +
>> + return eligible;
>> +}
>
> This is a nice simple mechanism to skip CPUs that were
> eligible last time and had pause loop exits recently.
>
> However, it could stand some documentation. Please
> add a good comment explaining how and why the algorithm
> works, when arch.plo.pause_loop_exited is cleared, etc...
>
> It would be good to make this heuristic understandable
> to people who look at the code for the first time.
>

Thanks for the review. Will add more documentation.
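
For reference, a standalone and commented sketch of the heuristic quoted
above. It is a userspace stand-in, not the kernel code: struct ple_state
and check_and_update_eligible() are illustrative names, and the only
logic change is dropping the redundant "pause_loop_exited &&" term.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-vcpu state in the patch; not the kernel struct. */
struct ple_state {
	bool pause_loop_exited;	/* vcpu took a pause loop exit recently */
	bool dy_eligible;	/* directed-yield eligibility, see below */
};

/*
 * Return true if the vcpu is a reasonable directed-yield target.
 *
 * A vcpu that has not done a pause loop exit is always eligible
 * (it may be the preempted lock holder). A vcpu that has done one is
 * eligible only every other check: dy_eligible is flipped on each
 * call, so a vcpu skipped this time gets a chance on the next scan.
 */
static bool check_and_update_eligible(struct ple_state *s)
{
	bool eligible;

	assert(s);

	/* Same truth table as !exited || (exited && dy_eligible). */
	eligible = !s->pause_loop_exited || s->dy_eligible;

	if (s->pause_loop_exited)
		s->dy_eligible = !s->dy_eligible;

	return eligible;
}
```

With pause_loop_exited set and dy_eligible initially false, successive
calls return false, true, false, ... while a vcpu that never did a
pause loop exit always returns true.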



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09 21:47   ` Andrew Theurer
                     ` (2 preceding siblings ...)
  (?)
@ 2012-07-10 11:54   ` Raghavendra K T
  2012-07-10 13:27     ` Andrew Theurer
  -1 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 11:54 UTC (permalink / raw)
  To: habanero
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
>> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
>
> Hi, Raghu.
>
[...]
>
> Can you briefly explain the 1x and 2x configs?  This of course is highly
> dependent whether or not HT is enabled...
>

Sorry if I did not make this clear in the earlier threads. Have you
applied the following patch from Rik to your base? Without it you could
perhaps see some inconsistent results.

https://lkml.org/lkml/2012/6/19/401



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-10 11:54   ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-10 13:27     ` Andrew Theurer
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Theurer @ 2012-07-10 13:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On Tue, 2012-07-10 at 17:24 +0530, Raghavendra K T wrote:
> On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> >> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
> >> random VCPU on PL exit. Though we already have filtering while choosing
> >> the candidate to yield_to, we can do better.
> >
> > Hi, Raghu.
> >
> [...]
> >
> > Can you briefly explain the 1x and 2x configs?  This of course is highly
> > dependent whether or not HT is enabled...
> >
> 
> Sorry if I had not made very clear in earlier threads. Have you applied
> Rik's following patch for base. without this you could see some 
> inconsistent results perhaps.
> 
> https://lkml.org/lkml/2012/6/19/401

Yes, I do have that applied with your patch and in my baseline.

-Andrew Theurer




* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
  2012-07-09  6:33     ` Raghavendra K T
  2012-07-09 22:39   ` Rik van Riel
@ 2012-07-11  8:53   ` Avi Kivity
  2012-07-11 10:52     ` Raghavendra K T
  2 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11  8:53 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 09:20 AM, Raghavendra K T wrote:
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> Noting pause loop exited vcpu helps in filtering right candidate to yield.
> Yielding to same vcpu may result in more wastage of cpu.
> 
>  
>  struct kvm_lpage_info {
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f75af40..a492f5d 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
>  
>  static int pause_interception(struct vcpu_svm *svm)
>  {
> +	svm->vcpu.arch.plo.pause_loop_exited = true;
>  	kvm_vcpu_on_spin(&(svm->vcpu));
>  	return 1;
>  }
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 32eb588..600fb3c 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4945,6 +4945,7 @@ out:
>  static int handle_pause(struct kvm_vcpu *vcpu)
>  {
>  	skip_emulated_instruction(vcpu);
> +	vcpu->arch.plo.pause_loop_exited = true;
>  	kvm_vcpu_on_spin(vcpu);
>  

This code is duplicated.  Should we move it to kvm_vcpu_on_spin?

That means the .plo structure needs to be in common code, but that's not
too bad perhaps.

> index be6d549..07dbd14 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5331,7 +5331,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	if (req_immediate_exit)
>  		smp_send_reschedule(vcpu->cpu);
> -
> +	vcpu->arch.plo.pause_loop_exited = false;

This adds some tiny overhead to vcpu entry.  You could remove it by
using the vcpu->requests mechanism to clear the flag, since
vcpu->requests is already checked on every entry.
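
A minimal userspace sketch of that idea, loosely modelled on
kvm_make_request()/kvm_check_request(). Everything here is a stand-in:
REQ_CLEAR_PLE is a hypothetical bit, struct vcpu is not the kernel
structure, and plain bit operations replace the atomic bitops the
kernel would use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request bit for this sketch; not a real KVM_REQ_* value. */
#define REQ_CLEAR_PLE 0

struct vcpu {
	unsigned long requests;		/* pending request bits */
	bool pause_loop_exited;
};

static void make_request(struct vcpu *v, int req)
{
	v->requests |= 1UL << req;
}

/* Test and clear one request bit. */
static bool check_request(struct vcpu *v, int req)
{
	unsigned long bit = 1UL << req;

	if (!(v->requests & bit))
		return false;
	v->requests &= ~bit;
	return true;
}

/* Pause intercept path: mark the vcpu and ask for a clear on entry. */
static void on_pause_exit(struct vcpu *v)
{
	assert(v);
	v->pause_loop_exited = true;
	make_request(v, REQ_CLEAR_PLE);
}

/* Entry path: the requests word is tested here anyway. */
static void enter_guest(struct vcpu *v)
{
	assert(v);
	if (v->requests) {
		if (check_request(v, REQ_CLEAR_PLE))
			v->pause_loop_exited = false;
	}
}
```

The point of the design is that an entry with no pending requests
touches nothing PLE-related; the flag is only written on entries that
follow a pause exit.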

>  	kvm_guest_enter();
>  

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09 21:47   ` Andrew Theurer
                     ` (3 preceding siblings ...)
  (?)
@ 2012-07-11  9:00   ` Avi Kivity
  2012-07-11 13:59     ` Raghavendra K T
  -1 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11  9:00 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/10/2012 12:47 AM, Andrew Theurer wrote:
>          
> For the cpu threads in the host that are actually active (in this case
> 1/2 of them), ~50% of their time is in kernel and ~43% in guest.  This
> is for a no-IO workload, so that's just incredible to see so much cpu
> wasted.  I feel that 2 important areas to tackle are a more scalable
> yield_to() and reducing the number of pause exits itself (hopefully by
> just tuning ple_window for the latter).

One thing we can do is autotune ple_window.  If a ple exit fails to wake
anybody (because all vcpus are either running, sleeping, or in ple
exits) then we deduce we are not overcommitted and we can increase the
ple window.  There's the question of how to decrease it again though.
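
Something along these lines, as a sketch only. The bounds and the
doubling/halving steps are assumptions, and the decrease rule in
particular (shrink after a yield that did find a target) is just one
possible answer to the open question, not a worked-out policy.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bounds for the sketch; not taken from any existing code. */
#define PLE_WINDOW_MIN   4096u
#define PLE_WINDOW_MAX 262144u

/*
 * Return the next ple_window given the outcome of one PLE exit.
 * yielded == false means no vcpu could be woken (all running,
 * sleeping, or in PLE exits), taken as "not overcommitted".
 */
static unsigned int ple_window_next(unsigned int window, bool yielded)
{
	assert(window >= PLE_WINDOW_MIN && window <= PLE_WINDOW_MAX);

	if (!yielded) {
		/* Nobody to wake: spin longer before exiting next time. */
		window *= 2;
		if (window > PLE_WINDOW_MAX)
			window = PLE_WINDOW_MAX;
	} else {
		/* ASSUMPTION: a successful yield hints at overcommit. */
		window /= 2;
		if (window < PLE_WINDOW_MIN)
			window = PLE_WINDOW_MIN;
	}
	return window;
}
```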

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09  7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger
  2012-07-10  8:27   ` Raghavendra K T
@ 2012-07-11  9:06   ` Avi Kivity
  2012-07-11 10:17     ` Christian Borntraeger
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11  9:06 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 10:55 AM, Christian Borntraeger wrote:
> On 09/07/12 08:20, Raghavendra K T wrote:
>> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
>> 
>> Problem is, for large vcpu guests, we have more probability of yielding
>> to a bad vcpu. We are not able to prevent directed yield to same guy who
>> has done PL exit recently, who perhaps spins again and wastes CPU.
>> 
>> Fix that by keeping track of who has done PL exit. So The Algorithm in series
>> give chance to a VCPU which has:
> 
> 
> We could do the same for s390. The appropriate exit would be diag44 (yield to hypervisor).
> 
> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.

Perhaps x86 should copy this.

> So there is no win here, but there are other cases where diag44 is used, e.g. cpu_relax.
> I have to double check with others, if these cases are critical, but for now, it seems 
> that your dummy implementation  for s390 is just fine. After all it is a no-op until 
> we implement something.

Does the data structure make sense for you?  If so we can move it to
common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
have to pay anything.
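
Roughly like this, as a userspace sketch. The config symbol is the
placeholder name from above, the field names are taken from the posted
patch, and the struct and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Remove this define to model an arch without the intercept. */
#define CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT 1

struct vcpu {
	int id;
#ifdef CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT
	/* Only present where the arch can intercept spinning guests. */
	struct {
		bool pause_loop_exited;
		bool dy_eligible;
	} ple;
#endif
};

#ifdef CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT
static inline void vcpu_set_pause_loop_exited(struct vcpu *v, bool val)
{
	assert(v);
	v->ple.pause_loop_exited = val;
}

static inline bool vcpu_pause_loop_exited(const struct vcpu *v)
{
	assert(v);
	return v->ple.pause_loop_exited;
}
#else
/* No-op stubs: other archs carry no extra state and no extra code. */
static inline void vcpu_set_pause_loop_exited(struct vcpu *v, bool val)
{
	(void)v;
	(void)val;
}

static inline bool vcpu_pause_loop_exited(const struct vcpu *v)
{
	(void)v;
	return false;
}
#endif
```

Common code such as kvm_vcpu_on_spin() would then call the helpers
unconditionally and never touch the fields directly.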

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11  9:06   ` Avi Kivity
@ 2012-07-11 10:17     ` Christian Borntraeger
  2012-07-11 11:04       ` Avi Kivity
  2012-07-11 11:51       ` Raghavendra K T
  0 siblings, 2 replies; 52+ messages in thread
From: Christian Borntraeger @ 2012-07-11 10:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt

On 11/07/12 11:06, Avi Kivity wrote:
[...]
>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
> 
> Perhaps x86 should copy this.

See arch/s390/lib/spinlock.c
The basic idea is to use several heuristics:
- spin for a given number of iterations
- check if the lock holder is currently scheduled by the hypervisor
  (smp_vcpu_scheduled, which uses the sigp sense running instruction).
  I don't know if such a thing is available for x86. It must be a lot
  cheaper than a guest exit to be useful.
- if the lock holder is not running and we have looped for a while, do
  a directed yield to that cpu.
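
The decision part of these heuristics can be sketched as below. This is
not the code from arch/s390/lib/spinlock.c; the enum, the function name
and the parameters are hypothetical, and the "is the holder scheduled"
answer is passed in rather than queried from the hypervisor.

```c
#include <assert.h>
#include <stdbool.h>

enum spin_action {
	SPIN_AGAIN,		/* keep spinning on the lock */
	YIELD_TO_HOLDER,	/* directed yield to the lock holder's cpu */
};

/*
 * Decide what a waiter should do next.
 *   loops_done       - iterations spun so far
 *   retry_limit      - how long to spin before considering a yield
 *   holder_scheduled - whether the hypervisor currently runs the holder
 */
static enum spin_action spin_wait_decide(unsigned int loops_done,
					 unsigned int retry_limit,
					 bool holder_scheduled)
{
	assert(retry_limit > 0);

	/* Not spun long enough yet: a yield would be premature. */
	if (loops_done < retry_limit)
		return SPIN_AGAIN;

	/* Holder is running, so it can release the lock by itself. */
	if (holder_scheduled)
		return SPIN_AGAIN;

	/* Holder is preempted and we have waited a while: hand over. */
	return YIELD_TO_HOLDER;
}
```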

> 
>> So there is no win here, but there are other cases where diag44 is used, e.g. cpu_relax.
>> I have to double check with others, if these cases are critical, but for now, it seems 
>> that your dummy implementation  for s390 is just fine. After all it is a no-op until 
>> we implement something.
> 
> Does the data structure make sense for you?  If so we can move it to
> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
> have to pay anything.

Ignoring the name, yes, the data structure itself seems to be based on the
algorithm and not on arch-specific things. That should work. If we move that to
common code then s390 will use that scheme automatically for the cases where we
call kvm_vcpu_on_spin(). All other archs as well.

So this would probably improve guests that use cpu_relax, for example
stop_machine_run. I have no measurements, though.

Christian



* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11  8:53   ` Avi Kivity
@ 2012-07-11 10:52     ` Raghavendra K T
  2012-07-11 11:18       ` Avi Kivity
  2012-07-12 10:58       ` Nikunj A Dadhania
  0 siblings, 2 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 10:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 02:23 PM, Avi Kivity wrote:
> On 07/09/2012 09:20 AM, Raghavendra K T wrote:
>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>>
>> Noting pause loop exited vcpu helps in filtering right candidate to yield.
>> Yielding to same vcpu may result in more wastage of cpu.
>>
>>
>>   struct kvm_lpage_info {
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index f75af40..a492f5d 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
>>
>>   static int pause_interception(struct vcpu_svm *svm)
>>   {
>> +	svm->vcpu.arch.plo.pause_loop_exited = true;
>>   	kvm_vcpu_on_spin(&(svm->vcpu));
>>   	return 1;
>>   }
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 32eb588..600fb3c 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4945,6 +4945,7 @@ out:
>>   static int handle_pause(struct kvm_vcpu *vcpu)
>>   {
>>   	skip_emulated_instruction(vcpu);
>> +	vcpu->arch.plo.pause_loop_exited = true;
>>   	kvm_vcpu_on_spin(vcpu);
>>
>
> This code is duplicated.  Should we move it to kvm_vcpu_on_spin?
>
> That means the .plo structure needs to be in common code, but that's not
> too bad perhaps.
>

Since PLE is very much tied to x86, and the proposed changes are very much
specific to the PLE handler, I thought it better to make it arch specific.

So do you think it is good to move it inside vcpu_on_spin and make the ple
structure belong to common code?

>> index be6d549..07dbd14 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5331,7 +5331,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>
>>   	if (req_immediate_exit)
>>   		smp_send_reschedule(vcpu->cpu);
>> -
>> +	vcpu->arch.plo.pause_loop_exited = false;
>
> This adds some tiny overhead to vcpu entry.  You could remove it by
> using the vcpu->requests mechanism to clear the flag, since
> vcpu->requests is already checked on every entry.

So IIUC, let's have a request bit for indicating PLE:

pause_interception() / handle_pause()
{
	make_request(PLE_REQUEST)
	vcpu_on_spin()
}

check_eligibility()
{
	!test_request(PLE_REQUEST) ||
		(test_request(PLE_REQUEST) && dy_eligible())
	.
	.
}

vcpu_run()
{
	check_request(PLE_REQUEST)
	.
	.
}

Is this the expected flow you had in mind?

[ But my only concern was not resetting for cases where we do not do
guest_enter(). Will test how that goes. ]
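
The flow above, written out as a small self-contained model. All demo_ names
and the bit number are hypothetical, and a plain unsigned long replaces the
atomic bit operations that the real vcpu->requests field would need.

```c
#include <stdbool.h>

#define DEMO_REQ_PLE 0	/* hypothetical request bit number */

struct demo_vcpu {
	unsigned long requests;	/* one bit per pending request */
	bool dy_eligible;	/* "directed yield eligible" hint */
};

static void demo_make_request(struct demo_vcpu *v, int req)
{
	v->requests |= 1UL << req;
}

static bool demo_test_request(const struct demo_vcpu *v, int req)
{
	return (v->requests & (1UL << req)) != 0;
}

/* Test and clear, the way the request check on vcpu entry behaves. */
static bool demo_check_request(struct demo_vcpu *v, int req)
{
	bool was_set = demo_test_request(v, req);

	v->requests &= ~(1UL << req);
	return was_set;
}

/* PLE exit handler: only note the exit; the yield scan would follow. */
static void demo_handle_pause(struct demo_vcpu *v)
{
	demo_make_request(v, DEMO_REQ_PLE);
}

/* A vcpu that did not PLE-exit, or one marked eligible, may be yielded to. */
static bool demo_check_eligibility(const struct demo_vcpu *v)
{
	return !demo_test_request(v, DEMO_REQ_PLE) || v->dy_eligible;
}

/* Entry path: the pending bit is consumed here, so no extra store is needed. */
static void demo_vcpu_run(struct demo_vcpu *v)
{
	demo_check_request(v, DEMO_REQ_PLE);
}
```

The eligibility test is the same condition as in the pseudocode, since
!a || (a && b) reduces to !a || b.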

>
>>   	kvm_guest_enter();
>>



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 10:17     ` Christian Borntraeger
@ 2012-07-11 11:04       ` Avi Kivity
  2012-07-11 11:16         ` Alexander Graf
                           ` (3 more replies)
  2012-07-11 11:51       ` Raghavendra K T
  1 sibling, 4 replies; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 11:04 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 07/11/2012 01:17 PM, Christian Borntraeger wrote:
> On 11/07/12 11:06, Avi Kivity wrote:
> [...]
>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>> 
>> Perhaps x86 should copy this.
> 
> See arch/s390/lib/spinlock.c
> The basic idea is using several heuristics:
> - loop for a given amount of loops
> - check if the lock holder is currently scheduled by the hypervisor
>   (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>   Dont know if such thing is available for x86. It must be a lot cheaper
>   than a guest exit to be useful

We could make it available via shared memory, updated using preempt
notifiers.  Of course piling on more pv makes this less attractive.
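
Roughly, the shared-memory variant could look like the sketch below: the host
flips a flag from its sched-in/sched-out hooks and the guest reads it without
exiting. The structure layout and all demo_ names are assumptions, and C11
atomics stand in for whatever the real host/guest ABI would use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One of these per vcpu, living in memory visible to both host and guest. */
struct demo_vcpu_state {
	atomic_int running;	/* 1 while the vcpu is scheduled on a cpu */
};

static void demo_state_init(struct demo_vcpu_state *s)
{
	atomic_init(&s->running, 0);
}

/* Host side: would be called from the preempt notifier's sched-in hook. */
static void demo_sched_in(struct demo_vcpu_state *s)
{
	atomic_store(&s->running, 1);
}

/* Host side: would be called from the preempt notifier's sched-out hook. */
static void demo_sched_out(struct demo_vcpu_state *s)
{
	atomic_store(&s->running, 0);
}

/* Guest side: a plain memory read, no exit to the hypervisor. */
static bool demo_vcpu_is_running(struct demo_vcpu_state *s)
{
	return atomic_load(&s->running) != 0;
}
```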

> - if lock holder is not running and we looped for a while do a directed
>   yield to that cpu.
> 
>> 
>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax.
>>> I have to double check with others, if these cases are critical, but for now, it seems 
>>> that your dummy implementation  for s390 is just fine. After all it is a no-op until 
>>> we implement something.
>> 
>> Does the data structure make sense for you?  If so we can move it to
>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>> have to pay anything.
> 
> Ignoring the name,

What name would you suggest?

> yes the data structure itself seems based on the algorithm
> and not on arch specific things. That should work. If we move that to common 
> code then s390 will use that scheme automatically for the cases were we call 
> kvm_vcpu_on_spin(). All others archs as well.

ARM doesn't have an instruction for cpu_relax(), so it can't intercept
it.  Given ppc's dislike of overcommit, and the way it implements
cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
intercept those either, but I'm copying the ppc people in case I'm
wrong.  So it's s390 and x86.

> So this would probably improve guests that uses cpu_relax, for example
> stop_machine_run. I have no measurements, though.

smp_call_function() too (though that can be converted to directed yield
too).  It seems worthwhile.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:04       ` Avi Kivity
@ 2012-07-11 11:16         ` Alexander Graf
  2012-07-11 11:23           ` Avi Kivity
  2012-07-11 11:18         ` Christian Borntraeger
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Alexander Graf @ 2012-07-11 11:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt


On 11.07.2012, at 13:04, Avi Kivity wrote:

> On 07/11/2012 01:17 PM, Christian Borntraeger wrote:
>> On 11/07/12 11:06, Avi Kivity wrote:
>> [...]
>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>> 
>>> Perhaps x86 should copy this.
>> 
>> See arch/s390/lib/spinlock.c
>> The basic idea is using several heuristics:
>> - loop for a given amount of loops
>> - check if the lock holder is currently scheduled by the hypervisor
>>  (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>  Dont know if such thing is available for x86. It must be a lot cheaper
>>  than a guest exit to be useful
> 
> We could make it available via shared memory, updated using preempt
> notifiers.  Of course piling on more pv makes this less attractive.
> 
>> - if lock holder is not running and we looped for a while do a directed
>>  yield to that cpu.
>> 
>>> 
>>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax.
>>>> I have to double check with others, if these cases are critical, but for now, it seems 
>>>> that your dummy implementation  for s390 is just fine. After all it is a no-op until 
>>>> we implement something.
>>> 
>>> Does the data structure make sense for you?  If so we can move it to
>>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>> have to pay anything.
>> 
>> Ignoring the name,
> 
> What name would you suggest?
> 
>> yes the data structure itself seems based on the algorithm
>> and not on arch specific things. That should work. If we move that to common 
>> code then s390 will use that scheme automatically for the cases were we call 
>> kvm_vcpu_on_spin(). All others archs as well.
> 
> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> it.  Given ppc's dislike of overcommit,

What dislike of overcommit?

> and the way it implements cpu_relax() by adjusting hw thread priority,

Yeah, I don't think we can intercept relaxing. It's basically a nop-like instruction that gives hardware hints on its current priorities.
That said, we can always add PV code.


Alex



* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 10:52     ` Raghavendra K T
@ 2012-07-11 11:18       ` Avi Kivity
  2012-07-11 11:56         ` Raghavendra K T
  2012-07-12 10:58       ` Nikunj A Dadhania
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 11:18 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 01:52 PM, Raghavendra K T wrote:
> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>> On 07/09/2012 09:20 AM, Raghavendra K T wrote:
>>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> Noting pause loop exited vcpu helps in filtering right candidate to
>>> yield.
>>> Yielding to same vcpu may result in more wastage of cpu.
>>>
>>>
>>>   struct kvm_lpage_info {
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index f75af40..a492f5d 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct
>>> vcpu_svm *svm)
>>>
>>>   static int pause_interception(struct vcpu_svm *svm)
>>>   {
>>> +    svm->vcpu.arch.plo.pause_loop_exited = true;
>>>       kvm_vcpu_on_spin(&(svm->vcpu));
>>>       return 1;
>>>   }
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 32eb588..600fb3c 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -4945,6 +4945,7 @@ out:
>>>   static int handle_pause(struct kvm_vcpu *vcpu)
>>>   {
>>>       skip_emulated_instruction(vcpu);
>>> +    vcpu->arch.plo.pause_loop_exited = true;
>>>       kvm_vcpu_on_spin(vcpu);
>>>
>>
>> This code is duplicated.  Should we move it to kvm_vcpu_on_spin?
>>
>> That means the .plo structure needs to be in common code, but that's not
>> too bad perhaps.
>>
> 
> Since PLE is very much tied to x86, and proposed changes are very much
> specific to PLE handler, I thought it is better to make arch specific.
> 
> So do you think it is good to move inside vcpu_on_spin and make ple
> structure belong to common code?

See the discussion with Christian.  PLE is tied to x86, but cpu_relax()
and facilities to trap it are not.

>>
>> This adds some tiny overhead to vcpu entry.  You could remove it by
>> using the vcpu->requests mechanism to clear the flag, since
>> vcpu->requests is already checked on every entry.
> 
> So IIUC,  let's have request bit for indicating PLE,
> 
> pause_interception() /handle_pause()
> {
>  make_request(PLE_REQUEST)
>  vcpu_on_spin()
> 
> }
> 
> check_eligibility()
>  {
>  !test_request(PLE_REQUEST) || ( test_request(PLE_REQUEST)  &&
> dy_eligible())
> .
> .
> }
> 
> vcpu_run()
> {
> 
> check_request(PLE_REQUEST)
> .
> .
> }
> 
> Is this is the expected flow you had in mind?

Yes, something like that.

> 
> [ But my only concern was not resetting for cases where we do not do
> guest_enter(). will test how that goes].

Hm, suppose we're the next-in-line for a ticket lock and exit due to
PLE.  The lock holder completes and unlocks, which really assigns the
lock to us.  So now we are the lock owner, yet we are marked as don't
yield-to-us in the PLE code.
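
To make the scenario concrete, here is a single-threaded model of a ticket
lock. It is only an illustration with hypothetical demo_ names; a real ticket
lock uses atomic operations and the flag lives in the vcpu, not in the waiter.

```c
#include <stdbool.h>

struct demo_ticket_lock {
	unsigned int head;	/* ticket currently being served */
	unsigned int tail;	/* next ticket to hand out */
};

struct demo_waiter {
	unsigned int ticket;
	bool pause_loop_exited;	/* marked "don't yield to me" by the PLE code */
};

static void demo_take_ticket(struct demo_ticket_lock *l, struct demo_waiter *w)
{
	w->ticket = l->tail++;
	w->pause_loop_exited = false;
}

static bool demo_owns_lock(const struct demo_ticket_lock *l,
			   const struct demo_waiter *w)
{
	return l->head == w->ticket;
}

/* Unlocking hands the lock to the next ticket, whether or not it is running. */
static void demo_unlock(struct demo_ticket_lock *l)
{
	l->head++;
}
```

If the next waiter took a PLE exit just before demo_unlock() ran, it ends up
owning the lock while its flag still says it should be skipped.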


-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:04       ` Avi Kivity
  2012-07-11 11:16         ` Alexander Graf
@ 2012-07-11 11:18         ` Christian Borntraeger
  2012-07-11 11:39           ` Avi Kivity
  2012-07-12  2:17         ` Benjamin Herrenschmidt
  2012-07-12 10:38         ` Nikunj A Dadhania
  3 siblings, 1 reply; 52+ messages in thread
From: Christian Borntraeger @ 2012-07-11 11:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 11/07/12 13:04, Avi Kivity wrote:
> On 07/11/2012 01:17 PM, Christian Borntraeger wrote:
>> On 11/07/12 11:06, Avi Kivity wrote:
>> [...]
>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>>
>>> Perhaps x86 should copy this.
>>
>> See arch/s390/lib/spinlock.c
>> The basic idea is using several heuristics:
>> - loop for a given amount of loops
>> - check if the lock holder is currently scheduled by the hypervisor
>>   (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>   Dont know if such thing is available for x86. It must be a lot cheaper
>>   than a guest exit to be useful
> 
> We could make it available via shared memory, updated using preempt
> notifiers.  Of course piling on more pv makes this less attractive.
> 
>> - if lock holder is not running and we looped for a while do a directed
>>   yield to that cpu.
>>
>>>
>>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax.
>>>> I have to double check with others, if these cases are critical, but for now, it seems 
>>>> that your dummy implementation  for s390 is just fine. After all it is a no-op until 
>>>> we implement something.
>>>
>>> Does the data structure make sense for you?  If so we can move it to
>>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>> have to pay anything.
>>
>> Ignoring the name,
> 
> What name would you suggest?

maybe vcpu_no_progress instead of pause_loop_exited

> 
>> yes the data structure itself seems based on the algorithm
>> and not on arch specific things. That should work. If we move that to common 
>> code then s390 will use that scheme automatically for the cases were we call 
>> kvm_vcpu_on_spin(). All others archs as well.
> 
> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> it.  Given ppc's dislike of overcommit, and the way it implements
> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
> intercept those either, but I'm copying the ppc people in case I'm
> wrong.  So it's s390 and x86.
> 
>> So this would probably improve guests that uses cpu_relax, for example
>> stop_machine_run. I have no measurements, though.
> 
> smp_call_function() too (though that can be converted to directed yield
> too).  It seems worthwhile.

Indeed. For those places where it is possible I would like to see an
architecture primitive for directed yield. That could be useful in
other places as well (e.g. maybe lib/spinlock_debug.c, which has no
yielding at all)

Christian





* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:16         ` Alexander Graf
@ 2012-07-11 11:23           ` Avi Kivity
  2012-07-11 11:52             ` Alexander Graf
  2012-07-12  2:19             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 11:23 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt

On 07/11/2012 02:16 PM, Alexander Graf wrote:
>> 
>>> yes the data structure itself seems based on the algorithm
>>> and not on arch specific things. That should work. If we move that to common 
>>> code then s390 will use that scheme automatically for the cases were we call 
>>> kvm_vcpu_on_spin(). All others archs as well.
>> 
>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
>> it.  Given ppc's dislike of overcommit,
> 
> What dislike of overcommit?

I understood ppc virtualization is more of the partitioning sort.
Perhaps I misunderstood it.  But the reliance on device assignment, the
restrictions on scheduling, etc. all point to it.

> 
>> and the way it implements cpu_relax() by adjusting hw thread priority,
> 
> Yeah, I don't think we can intercept relaxing.

... and the lack of ability to intercept cpu_relax() ...

> It's basically a nop-like instruction that gives hardware hints on its current priorities.

That's what x86 PAUSE does.  But we can intercept it (and not just any
execution - we can restrict intercept to tight loops executed more than
a specific number of times).
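
A deliberately simplified software model of that kind of filtering: a PAUSE
counts towards an exit only if it follows the previous one closely, and the
exit fires after a fixed number of such executions. The field names and
parameters are assumptions; this does not reproduce the exact VMX or SVM
semantics.

```c
#include <stdbool.h>

struct demo_pause_filter {
	unsigned int count_limit;	/* PAUSEs per tight loop before exiting, >= 1 */
	unsigned long long gap;		/* max time between PAUSEs of one loop */
	unsigned int remaining;		/* PAUSEs left before the next exit */
	unsigned long long last;	/* timestamp of the previous PAUSE */
	bool seen;			/* false until the first PAUSE */
};

/* Returns true if this PAUSE, executed at time 'now', should cause an exit. */
static bool demo_on_pause(struct demo_pause_filter *f, unsigned long long now)
{
	/* A long gap means this PAUSE starts a new loop: restart the count. */
	if (!f->seen || now - f->last > f->gap)
		f->remaining = f->count_limit;

	f->seen = true;
	f->last = now;

	if (--f->remaining == 0) {
		f->remaining = f->count_limit;
		return true;
	}
	return false;
}
```

Isolated PAUSE instructions never reach the limit, so only a vcpu that keeps
spinning ends up in the exit handler.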

> That said, we can always add PV code.

Sure, but that's defeated by advancements like self-tuning PLE exits.
It's hard to get this right.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:18         ` Christian Borntraeger
@ 2012-07-11 11:39           ` Avi Kivity
  2012-07-12  5:11             ` Raghavendra K T
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 11:39 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 07/11/2012 02:18 PM, Christian Borntraeger wrote:
> On 11/07/12 13:04, Avi Kivity wrote:
>> On 07/11/2012 01:17 PM, Christian Borntraeger wrote:
>>> On 11/07/12 11:06, Avi Kivity wrote:
>>> [...]
>>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>>>
>>>> Perhaps x86 should copy this.
>>>
>>> See arch/s390/lib/spinlock.c
>>> The basic idea is using several heuristics:
>>> - loop for a given amount of loops
>>> - check if the lock holder is currently scheduled by the hypervisor
>>>   (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>>   Dont know if such thing is available for x86. It must be a lot cheaper
>>>   than a guest exit to be useful
>> 
>> We could make it available via shared memory, updated using preempt
>> notifiers.  Of course piling on more pv makes this less attractive.
>> 
>>> - if lock holder is not running and we looped for a while do a directed
>>>   yield to that cpu.
>>>
>>>>
>>>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax.
>>>>> I have to double check with others, if these cases are critical, but for now, it seems 
>>>>> that your dummy implementation  for s390 is just fine. After all it is a no-op until 
>>>>> we implement something.
>>>>
>>>> Does the data structure make sense for you?  If so we can move it to
>>>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>>> have to pay anything.
>>>
>>> Ignoring the name,
>> 
>> What name would you suggest?
> 
> maybe vcpu_no_progress instead of pause_loop_exited

Ah, I thought you objected to the CONFIG var.  Maybe call it
cpu_relax_intercepted since that's the linuxy name for the instruction.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 10:17     ` Christian Borntraeger
  2012-07-11 11:04       ` Avi Kivity
@ 2012-07-11 11:51       ` Raghavendra K T
  2012-07-11 11:55         ` Christian Borntraeger
  2012-07-11 13:04         ` Raghavendra K T
  1 sibling, 2 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 11:51 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt

On 07/11/2012 03:47 PM, Christian Borntraeger wrote:
> On 11/07/12 11:06, Avi Kivity wrote:
> [...]
>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>
>> Perhaps x86 should copy this.
>
> See arch/s390/lib/spinlock.c
> The basic idea is using several heuristics:
> - loop for a given amount of loops
> - check if the lock holder is currently scheduled by the hypervisor
>    (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>    Dont know if such thing is available for x86. It must be a lot cheaper
>    than a guest exit to be useful

Unfortunately we do not have information on the lock holder.

> - if lock holder is not running and we looped for a while do a directed
>    yield to that cpu.
>
>>
>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax.
>>> I have to double check with others, if these cases are critical, but for now, it seems
>>> that your dummy implementation  for s390 is just fine. After all it is a no-op until
>>> we implement something.
>>
>> Does the data structure make sense for you?  If so we can move it to
>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>> have to pay anything.
>
> Ignoring the name, yes the data structure itself seems based on the algorithm
> and not on arch specific things. That should work.

OK. Can you please elaborate on the flow?

  If we move that to common
> code then s390 will use that scheme automatically for the cases were we call
> kvm_vcpu_on_spin(). All others archs as well.
>
> So this would probably improve guests that uses cpu_relax, for example
> stop_machine_run. I have no measurements, though.
>
> Christian
>



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:23           ` Avi Kivity
@ 2012-07-11 11:52             ` Alexander Graf
  2012-07-11 12:48               ` Avi Kivity
  2012-07-12  2:19             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 52+ messages in thread
From: Alexander Graf @ 2012-07-11 11:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt


On 11.07.2012, at 13:23, Avi Kivity wrote:

> On 07/11/2012 02:16 PM, Alexander Graf wrote:
>>> 
>>>> yes the data structure itself seems based on the algorithm
>>>> and not on arch specific things. That should work. If we move that to common 
>>>> code then s390 will use that scheme automatically for the cases were we call 
>>>> kvm_vcpu_on_spin(). All others archs as well.
>>> 
>>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
>>> it.  Given ppc's dislike of overcommit,
>> 
>> What dislike of overcommit?
> 
> I understood ppc virtualization is more of the partitioning sort.
> Perhaps I misunderstood it.  But the reliance on device assignment, the
> restrictions on scheduling, etc. all point to it.

Well, you need to distinguish the different PPC targets here. In the embedded world, partitioning is obviously the biggest use case, though overcommit is possible. For big servers however, we usually do want overcommit and we do support it within the constraints hardware gives us.

It's really no different from x86 when it comes to the use case wideness :).

> 
>> 
>>> and the way it implements cpu_relax() by adjusting hw thread priority,
>> 
>> Yeah, I don't think we can intercept relaxing.
> 
> ... and the lack of ability to intercept cpu_relax() ...
> 
>> It's basically a nop-like instruction that gives hardware hints on its current priorities.
> 
> That's what x86 PAUSE does.  But we can intercept it (and not just any
> execution - we can restrict intercept to tight loops executed more than
> a specific number of times).

Yeah, it's pretty hard to fetch that information from PPC, since unlike x86 we split the hint from the loop. But I'll let Ben speak to that, he certainly knows way better how the hardware works.

> 
>> That said, we can always add PV code.
> 
> Sure, but that's defeated by advancements like self-tuning PLE exits.
> It's hard to get this right.

Well, eventually everything we do in PV is going to be moot as soon as hardware catches up. In most cases from what I've seen it's only useful as an interim solution. But for that time it's good to have :).


Alex



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:51       ` Raghavendra K T
@ 2012-07-11 11:55         ` Christian Borntraeger
  2012-07-11 12:04           ` Raghavendra K T
  2012-07-11 13:04         ` Raghavendra K T
  1 sibling, 1 reply; 52+ messages in thread
From: Christian Borntraeger @ 2012-07-11 11:55 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt

On 11/07/12 13:51, Raghavendra K T wrote:
>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>>
>>> Perhaps x86 should copy this.
>>
>> See arch/s390/lib/spinlock.c
>> The basic idea is using several heuristics:
>> - loop for a given amount of loops
>> - check if the lock holder is currently scheduled by the hypervisor
>>    (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>    Dont know if such thing is available for x86. It must be a lot cheaper
>>    than a guest exit to be useful
> 
> Unfortunately we do not have information on lock-holder.

That would be an independent patch and requires guest changes.




* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 11:18       ` Avi Kivity
@ 2012-07-11 11:56         ` Raghavendra K T
  2012-07-11 12:41           ` Andrew Jones
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 11:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 04:48 PM, Avi Kivity wrote:
> On 07/11/2012 01:52 PM, Raghavendra K T wrote:
>> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>>> On 07/09/2012 09:20 AM, Raghavendra K T wrote:
>>>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>>>>
>>>> Noting pause loop exited vcpu helps in filtering right candidate to
>>>> yield.
>>>> Yielding to same vcpu may result in more wastage of cpu.
>>>>
>>>>
>>>>    struct kvm_lpage_info {
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index f75af40..a492f5d 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct
>>>> vcpu_svm *svm)
>>>>
>>>>    static int pause_interception(struct vcpu_svm *svm)
>>>>    {
>>>> +    svm->vcpu.arch.plo.pause_loop_exited = true;
>>>>        kvm_vcpu_on_spin(&(svm->vcpu));
>>>>        return 1;
>>>>    }
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 32eb588..600fb3c 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -4945,6 +4945,7 @@ out:
>>>>    static int handle_pause(struct kvm_vcpu *vcpu)
>>>>    {
>>>>        skip_emulated_instruction(vcpu);
>>>> +    vcpu->arch.plo.pause_loop_exited = true;
>>>>        kvm_vcpu_on_spin(vcpu);
>>>>
>>>
>>> This code is duplicated.  Should we move it to kvm_vcpu_on_spin?
>>>
>>> That means the .plo structure needs to be in common code, but that's not
>>> too bad perhaps.
>>>
>>
>> Since PLE is very much tied to x86, and proposed changes are very much
>> specific to PLE handler, I thought it is better to make arch specific.
>>
>> So do you think it is good to move inside vcpu_on_spin and make ple
>> structure belong to common code?
>
> See the discussion with Christian.  PLE is tied to x86, but cpu_relax()
> and facilities to trap it are not.

Yep.

>
>>>
>>> This adds some tiny overhead to vcpu entry.  You could remove it by
>>> using the vcpu->requests mechanism to clear the flag, since
>>> vcpu->requests is already checked on every entry.
>>
>> So IIUC,  let's have request bit for indicating PLE,
>>
>> pause_interception() /handle_pause()
>> {
>>   make_request(PLE_REQUEST)
>>   vcpu_on_spin()
>>
>> }
>>
>> check_eligibility()
>>   {
>>   !test_request(PLE_REQUEST) || ( test_request(PLE_REQUEST)&&
>> dy_eligible())
>> .
>> .
>> }
>>
>> vcpu_run()
>> {
>>
>> check_request(PLE_REQUEST)
>> .
>> .
>> }
>>
>> Is this is the expected flow you had in mind?
>
> Yes, something like that.

ok..

>
>>
>> [ But my only concern was not resetting for cases where we do not do
>> guest_enter(). will test how that goes].
>
> Hm, suppose we're the next-in-line for a ticket lock and exit due to
> PLE.  The lock holder completes and unlocks, which really assigns the
> lock to us.  So now we are the lock owner, yet we are marked as don't
> yield-to-us in the PLE code.

Yes. Off-topic, but that is solved by the kicked flag in PV spinlocks.



* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:55         ` Christian Borntraeger
@ 2012-07-11 12:04           ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 12:04 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt

On 07/11/2012 05:25 PM, Christian Borntraeger wrote:
> On 11/07/12 13:51, Raghavendra K T wrote:
>>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>>>
>>>> Perhaps x86 should copy this.
>>>
>>> See arch/s390/lib/spinlock.c
>>> The basic idea is using several heuristics:
>>> - loop for a given amount of loops
>>> - check if the lock holder is currently scheduled by the hypervisor
>>>     (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>>     Dont know if such thing is available for x86. It must be a lot cheaper
>>>     than a guest exit to be useful
>>
>> Unfortunately we do not have information on lock-holder.
>
> That would be an independent patch and requires guest changes.
>

Yes. As far as I can see, there are two options:
(1) extend the lock and use a spare bit in the ticketlock to indicate
    that the lock is held
(2) use a percpu list entry.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 11:56         ` Raghavendra K T
@ 2012-07-11 12:41           ` Andrew Jones
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Jones @ 2012-07-11 12:41 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Avi Kivity



----- Original Message -----
> >
> > Hm, suppose we're the next-in-line for a ticket lock and exit due
> > to
> > PLE.  The lock holder completes and unlocks, which really assigns
> > the
> > lock to us.  So now we are the lock owner, yet we are marked as
> > don't
> > yield-to-us in the PLE code.
> 
> Yes.. off-topic but that is solved by kicked flag in PV spinlocks.
> 

Yeah, this is a different, but related, topic. pvticketlocks not
only help yield spinning vcpus, but also allow the use of ticketlocks
for the guaranteed fairness. If we want that fairness we'll always
need some pv-ness, even if it's a mostly PLE solution. I see that as
a good reason to take the pvticketlock series, assuming it
performs at least as well as PLE. The following options have all been
brought up at some point by various people, and all have their own
pluses and minuses:

PLE-only best-effort:
   + hardware-only, algorithm improvements can be made independent of
     guest OSes
   - has limited info about spinning vcpus, making it hard to improve
     the algorithm (improved with an auto-adjusting ple_window?)
   - perf enhancement/degradation is workload/ple_window dependent
   - impossible? to guarantee FIFO order

pvticketlocks:
   - have to maintain pv code, both hosts and guests
   + perf is only workload dependent (should disable ple to avoid
     interference?)
   + guarantees FIFO order
   + can fall-back on PLE-only if the guest doesn't support it

hybrid:
   + takes advantage of the hw support
   - still requires host and guest pv code
   - will likely make perf dependent on ple_window again
   + guarantees FIFO order

???: did I miss any?

I think more benchmarking of PLE vs. pvticketlocks is needed, which
I'm working on. If we see that pvticketlocks perform just as well or
better, then IMHO we should consider committing Raghu's latest version
of the pvticketlock series, perhaps with an additional patch that
auto-disables PLE when pvticketlocks are enabled.

Drew

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:52             ` Alexander Graf
@ 2012-07-11 12:48               ` Avi Kivity
  0 siblings, 0 replies; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 12:48 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt

On 07/11/2012 02:52 PM, Alexander Graf wrote:
> 
> On 11.07.2012, at 13:23, Avi Kivity wrote:
> 
>> On 07/11/2012 02:16 PM, Alexander Graf wrote:
>>>> 
>>>>> yes the data structure itself seems based on the algorithm
>>>>> and not on arch specific things. That should work. If we move that to common 
>>>>> code then s390 will use that scheme automatically for the cases where we call 
>>>>> kvm_vcpu_on_spin(). All other archs as well.
>>>> 
>>>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
>>>> it.  Given ppc's dislike of overcommit,
>>> 
>>> What dislike of overcommit?
>> 
>> I understood ppc virtualization is more of the partitioning sort.
>> Perhaps I misunderstood it.  But the reliance on device assignment, the
>> restrictions on scheduling, etc. all point to it.
> 
> Well, you need to distinguish the different PPC targets here. In the embedded world, partitioning is obviously the biggest use case, though overcommit is possible. For big servers however, we usually do want overcommit and we do support it within the constraints hardware gives us.
> 
> It's really no different from x86 when it comes to the use case wideness :).

Okay, thanks for the correction.

> 
>> 
>>> 
>>>> and the way it implements cpu_relax() by adjusting hw thread priority,
>>> 
>>> Yeah, I don't think we can intercept relaxing.
>> 
>> ... and the lack of ability to intercept cpu_relax() ...
>> 
>>> It's basically a nop-like instruction that gives hardware hints on its current priorities.
>> 
>> That's what x86 PAUSE does.  But we can intercept it (and not just any
>> execution - we can restrict intercept to tight loops executed more than
>> a specific number of times).
> 
> Yeah, it's pretty hard to fetch that information from PPC, since unlike x86 we split the hint from the loop. But I'll let Ben speak to that, he certainly knows way better how the hardware works.

On x86 it's split from the loop as well (inside cpu_relax() like ppc).
But the hardware detects the loop and lets us know about it.

> 
>> 
>>> That said, we can always add PV code.
>> 
>> Sure, but that's defeated by advancements like self-tuning PLE exits.
>> It's hard to get this right.
> 
> Well, eventually everything we do in PV is going to be moot as soon as hardware catches up. In most cases from what I've seen it's only useful as an interim solution. But for that time it's good to have :).

Depends on how interim it is.  If better hardware is coming, I'd rather
not add more pv-ness.


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:51       ` Raghavendra K T
  2012-07-11 11:55         ` Christian Borntraeger
@ 2012-07-11 13:04         ` Raghavendra K T
  1 sibling, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 13:04 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt

On 07/11/2012 05:21 PM, Raghavendra K T wrote:
> On 07/11/2012 03:47 PM, Christian Borntraeger wrote:
>> On 11/07/12 11:06, Avi Kivity wrote:
[...]
>>>> So there is no win here, but there are other cases where diag44 is
>>>> used, e.g. cpu_relax.
>>>> I have to double check with others, if these cases are critical, but
>>>> for now, it seems
>>>> that your dummy implementation for s390 is just fine. After all it
>>>> is a no-op until
>>>> we implement something.
>>>
>>> Does the data structure make sense for you? If so we can move it to
>>> common code (and manage it in kvm_vcpu_on_spin()). We can guard it with
>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>> have to pay anything.
>>
>> Ignoring the name, yes the data structure itself seems based on the
>> algorithm
>> and not on arch specific things. That should work.
>
> Ok. can you please elaborate, on the flow.
>

Ok got it..  Will check how the code can be common to both x86 and s390.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11  9:00   ` Avi Kivity
@ 2012-07-11 13:59     ` Raghavendra K T
  2012-07-11 14:01       ` Raghavendra K T
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 13:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 02:30 PM, Avi Kivity wrote:
> On 07/10/2012 12:47 AM, Andrew Theurer wrote:
>>
>> For the cpu threads in the host that are actually active (in this case
>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest.  This
>> is for a no-IO workload, so that's just incredible to see so much cpu
>> wasted.  I feel that 2 important areas to tackle are a more scalable
>> yield_to() and reducing the number of pause exits itself (hopefully by
>> just tuning ple_window for the latter).
>
> One thing we can do is autotune ple_window.  If a ple exit fails to wake
> anybody (because all vcpus are either running, sleeping, or in ple
> exits) then we deduce we are not overcommitted and we can increase the
> ple window.  There's the question of how to decrease it again though.
>

I see a problem here, if I interpret the situation correctly. What
happens if we have two guests, one VM with no over-commit and the
other with high over-commit (except when we have gang scheduling)?

We should have something tied to the VM rather than a rigid PLE
window.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 13:59     ` Raghavendra K T
@ 2012-07-11 14:01       ` Raghavendra K T
  2012-07-12  8:15         ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 14:01 UTC (permalink / raw)
  To: Avi Kivity
  Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 07:29 PM, Raghavendra K T wrote:
> On 07/11/2012 02:30 PM, Avi Kivity wrote:
>> On 07/10/2012 12:47 AM, Andrew Theurer wrote:
>>>
>>> For the cpu threads in the host that are actually active (in this case
>>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
>>> is for a no-IO workload, so that's just incredible to see so much cpu
>>> wasted. I feel that 2 important areas to tackle are a more scalable
>>> yield_to() and reducing the number of pause exits itself (hopefully by
>>> just tuning ple_window for the latter).
>>
>> One thing we can do is autotune ple_window. If a ple exit fails to wake
>> anybody (because all vcpus are either running, sleeping, or in ple
>> exits) then we deduce we are not overcommitted and we can increase the
>> ple window. There's the question of how to decrease it again though.
>>
>
> I see some problem here, If I interpret situation correctly. What
> happens if we have two guests with one VM having no over-commit and
> other with high over-commit. (except when we have gang scheduling).
>
Sorry, I meant less load and high load inside the guest.

> Rather we should have something tied to VM rather than rigid PLE
> window.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:04       ` Avi Kivity
  2012-07-11 11:16         ` Alexander Graf
  2012-07-11 11:18         ` Christian Borntraeger
@ 2012-07-12  2:17         ` Benjamin Herrenschmidt
  2012-07-12  8:12           ` Avi Kivity
  2012-07-12 10:38         ` Nikunj A Dadhania
  3 siblings, 1 reply; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-12  2:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Alexander Graf, Paul Mackerras

> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> it.  Given ppc's dislike of overcommit, and the way it implements
> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
> intercept those either, but I'm copying the ppc people in case I'm
> wrong.  So it's s390 and x86.

No but our spinlocks call __spin_yield()  (or __rw_yield) which does
some paravirt tricks already.

We check if the holder is currently running, and if not, we call the
H_CONFER hypercall which can be used to "give" our time slice to the
holder.

Our implementation of H_CONFER in KVM is currently a nop though.
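
A toy user-space model of that flow, just to make the idea concrete.
Everything here is a hypothetical stand-in (struct vcpu_state, struct
spinlock_model, confer_to); it is not the real __spin_yield() or the
real hypercall, only a sketch of "confer only if the holder is not
running":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vcpu_state {
    bool running;            /* is this vcpu currently on a physical cpu? */
    unsigned int conferred;  /* how many time slices were given to it */
};

struct spinlock_model {
    int holder;              /* index of the holding vcpu, -1 if unlocked */
};

/* Hypothetical stand-in for the confer hypercall. */
static void confer_to(struct vcpu_state *vcpus, int holder)
{
    vcpus[holder].conferred++;
}

/*
 * Called by a waiter that failed to take the lock.
 * Returns true if the time slice was conferred to the holder,
 * false if the waiter should simply keep spinning.
 */
static bool spin_yield(const struct spinlock_model *lock,
                       struct vcpu_state *vcpus, size_t nr_vcpus)
{
    int holder = lock->holder;

    if (holder < 0)
        return false;            /* lock was released meanwhile: retry */
    assert((size_t)holder < nr_vcpus);
    if (vcpus[holder].running)
        return false;            /* holder is making progress: spin on */
    confer_to(vcpus, holder);    /* holder is preempted: give it our slice */
    return true;
}
```

The point of the check is that the expensive call is made only when the
holder is actually preempted; a running holder is left alone.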

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:23           ` Avi Kivity
  2012-07-11 11:52             ` Alexander Graf
@ 2012-07-12  2:19             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-12  2:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexander Graf, Christian Borntraeger, Raghavendra K T,
	H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, S390, Carsten Otte, KVM, chegu vinod,
	Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
	Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Paul Mackerras

On Wed, 2012-07-11 at 14:23 +0300, Avi Kivity wrote:
> On 07/11/2012 02:16 PM, Alexander Graf wrote:
> >> 
> >>> yes the data structure itself seems based on the algorithm
> >>> and not on arch specific things. That should work. If we move that to common 
> >>> code then s390 will use that scheme automatically for the cases where we call 
> >>> kvm_vcpu_on_spin(). All other archs as well.
> >> 
> >> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> >> it.  Given ppc's dislike of overcommit,
> > 
> > What dislike of overcommit?
> 
> I understood ppc virtualization is more of the partitioning sort.
> Perhaps I misunderstood it.  But the reliance on device assignment, the
> restrictions on scheduling, etc. all point to it.

It historically was but that has changed quite a bit. Essentially the
user can configure partitions to be more of the "all virtualized" kind
or on the contrary more fixed partitions. The hypervisor does shared
processors and we have paravirt APIs to cede our time slice to the lock
holder. 

> >> and the way it implements cpu_relax() by adjusting hw thread priority,
> > 
> > Yeah, I don't think we can intercept relaxing.
> 
> ... and the lack of ability to intercept cpu_relax() ...
> 
> > It's basically a nop-like instruction that gives hardware hints on its current priorities.
> 
> > That's what x86 PAUSE does.  But we can intercept it (and not just any
> execution - we can restrict intercept to tight loops executed more than
> a specific number of times).
> 
> > That said, we can always add PV code.
> 
> Sure, but that's defeated by advancements like self-tuning PLE exits.
> It's hard to get this right.
> 

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:39           ` Avi Kivity
@ 2012-07-12  5:11             ` Raghavendra K T
  2012-07-12  8:11               ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-12  5:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christian Borntraeger, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 07/11/2012 05:09 PM, Avi Kivity wrote:
> On 07/11/2012 02:18 PM, Christian Borntraeger wrote:
>> On 11/07/12 13:04, Avi Kivity wrote:
>>> On 07/11/2012 01:17 PM, Christian Borntraeger wrote:
>>>> On 11/07/12 11:06, Avi Kivity wrote:
>>>> [...]
>>>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though.
>>>>>
>>>>> Perhaps x86 should copy this.
>>>>
>>>> See arch/s390/lib/spinlock.c
>>>> The basic idea is using several heuristics:
>>>> - loop for a given amount of loops
>>>> - check if the lock holder is currently scheduled by the hypervisor
>>>>    (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>>>    Dont know if such thing is available for x86. It must be a lot cheaper
>>>>    than a guest exit to be useful
>>>
>>> We could make it available via shared memory, updated using preempt
>>> notifiers.  Of course piling on more pv makes this less attractive.
>>>
>>>> - if lock holder is not running and we looped for a while do a directed
>>>>    yield to that cpu.
>>>>
>>>>>
>>>>>> So there is no win here, but there are other cases where diag44 is used, e.g. cpu_relax.
>>>>>> I have to double check with others, if these cases are critical, but for now, it seems
>>>>>> that your dummy implementation  for s390 is just fine. After all it is a no-op until
>>>>>> we implement something.
>>>>>
>>>>> Does the data structure make sense for you?  If so we can move it to
>>>>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>>>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>>>> have to pay anything.
>>>>
>>>> Ignoring the name,
>>>
>>> What name would you suggest?
>>
>> maybe vcpu_no_progress instead of pause_loop_exited
>
> Ah, I thought you objected to the CONFIG var.  Maybe call it
> cpu_relax_intercepted since that's the linuxy name for the instruction.
>

OK, just to be on the same page, I'll have:
1. cpu_relax_intercepted instead of pause_loop_exited.

2. CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT, which is unconditionally
selected for x86 and s390.

3. a request mechanism to clear cpu_relax_intercepted.

(I'll do the same thing for s390 as well, but I have not seen s390 code
using the request mechanism, so I am not sure if it is OK. Otherwise we
have to clear unconditionally for s390 before guest enter, and for x86 we
have to move make_request back to vmx/svm.)
I will post V3 with these changes.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-12  5:11             ` Raghavendra K T
@ 2012-07-12  8:11               ` Avi Kivity
  2012-07-12  8:32                 ` Raghavendra K T
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-12  8:11 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Christian Borntraeger, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 07/12/2012 08:11 AM, Raghavendra K T wrote:
>> Ah, I thought you objected to the CONFIG var.  Maybe call it
>> cpu_relax_intercepted since that's the linuxy name for the instruction.
>>
> 
> Ok, just to be on same page. 'll have :
> 1. cpu_relax_intercepted instead of  pause_loop_exited.
> 
> 2. CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT which is unconditionally
> selected for x86 and s390
> 
> 3. make request mechanism to clear cpu_relax_intercepted.
> 
> ('ll do same thing for s390 also but have not seen s390 code using
> request mechanism, so not sure if it ok.. otherwise we have to clear
> unconditionally for s390 before guest enter and for x86 we have to move
> make_request back to vmx/svm).
> will post V3 with these changes.

You can leave the s390 changes to the s390 people; just make sure the
generic code is ready.

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-12  2:17         ` Benjamin Herrenschmidt
@ 2012-07-12  8:12           ` Avi Kivity
  2012-07-12 11:24             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-12  8:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Alexander Graf, Paul Mackerras

On 07/12/2012 05:17 AM, Benjamin Herrenschmidt wrote:
>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
>> it.  Given ppc's dislike of overcommit, and the way it implements
>> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
>> intercept those either, but I'm copying the ppc people in case I'm
>> wrong.  So it's s390 and x86.
> 
> No but our spinlocks call __spin_yield()  (or __rw_yield) which does
> some paravirt tricks already.
> 
> We check if the holder is currently running, and if not, we call the
> H_CONFER hypercall which can be used to "give" our time slice to the
> holder.
> 
> Our implementation of H_CONFER in KVM is currently a nop though.

Okay, so you can join the party.  See yield_to() and kvm_vcpu_on_spin().

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 14:01       ` Raghavendra K T
@ 2012-07-12  8:15         ` Avi Kivity
  2012-07-12  8:25           ` Raghavendra K T
  0 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-12  8:15 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 05:01 PM, Raghavendra K T wrote:
> On 07/11/2012 07:29 PM, Raghavendra K T wrote:
>> On 07/11/2012 02:30 PM, Avi Kivity wrote:
>>> On 07/10/2012 12:47 AM, Andrew Theurer wrote:
>>>>
>>>> For the cpu threads in the host that are actually active (in this case
>>>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
>>>> is for a no-IO workload, so that's just incredible to see so much cpu
>>>> wasted. I feel that 2 important areas to tackle are a more scalable
>>>> yield_to() and reducing the number of pause exits itself (hopefully by
>>>> just tuning ple_window for the latter).
>>>
>>> One thing we can do is autotune ple_window. If a ple exit fails to wake
>>> anybody (because all vcpus are either running, sleeping, or in ple
>>> exits) then we deduce we are not overcommitted and we can increase the
>>> ple window. There's the question of how to decrease it again though.
>>>
>>
>> I see some problem here, If I interpret situation correctly. What
>> happens if we have two guests with one VM having no over-commit and
>> other with high over-commit. (except when we have gang scheduling).
>>
> Sorry, I meant less load and high load inside the guest.
> 
>> Rather we should have something tied to VM rather than rigid PLE
>> window.

The problem occurs even with no overcommit at all.  One vcpu is in a
legitimately long pause loop.  All those exits accomplish nothing, since
all vcpus are scheduled.  Better to let it spin in guest mode.

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-12  8:15         ` Avi Kivity
@ 2012-07-12  8:25           ` Raghavendra K T
  2012-07-12 12:31             ` Avi Kivity
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-12  8:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/12/2012 01:45 PM, Avi Kivity wrote:
> On 07/11/2012 05:01 PM, Raghavendra K T wrote:
>> On 07/11/2012 07:29 PM, Raghavendra K T wrote:
>>> On 07/11/2012 02:30 PM, Avi Kivity wrote:
>>>> On 07/10/2012 12:47 AM, Andrew Theurer wrote:
>>>>>
>>>>> For the cpu threads in the host that are actually active (in this case
>>>>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
>>>>> is for a no-IO workload, so that's just incredible to see so much cpu
>>>>> wasted. I feel that 2 important areas to tackle are a more scalable
>>>>> yield_to() and reducing the number of pause exits itself (hopefully by
>>>>> just tuning ple_window for the latter).
>>>>
>>>> One thing we can do is autotune ple_window. If a ple exit fails to wake
>>>> anybody (because all vcpus are either running, sleeping, or in ple
>>>> exits) then we deduce we are not overcommitted and we can increase the
>>>> ple window. There's the question of how to decrease it again though.
>>>>
>>>
>>> I see some problem here, If I interpret situation correctly. What
>>> happens if we have two guests with one VM having no over-commit and
>>> other with high over-commit. (except when we have gang scheduling).
>>>
>> Sorry, I meant less load and high load inside the guest.
>>
>>> Rather we should have something tied to VM rather than rigid PLE
>>> window.
>
> The problem occurs even with no overcommit at all.  One vcpu is in a
> legitimately long pause loop.  All those exits accomplish nothing, since
> all vcpus are scheduled.  Better to let it spin in guest mode.
>

I agree. One idea is to have a scan_window that limits the scan of all
n vcpus each time we enter vcpu_spin, to say 2*log n initially.

Then the algorithm would be something like:

if (yield fails)
    increase ple_window , increase scan_window

if (yield succeeds)
    decrease ple_window , decrease scan_window


and we have to set limits on the max and min scan_window and the max and
min ple_window.
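
A rough user-space sketch of the above, to make the clamping explicit.
The doubling/halving step, the base-2 logarithm and all the names
(struct ple_tuning, ple_tune, initial_scan_window) are assumptions for
illustration only; the real step size and limits would need
experiments.

```c
#include <assert.h>
#include <stdbool.h>

struct ple_tuning {
    unsigned int ple_window, ple_window_min, ple_window_max;
    unsigned int scan_window, scan_window_min, scan_window_max;
};

/* floor(log2(n)) for n >= 1 */
static unsigned int ilog2_uint(unsigned int n)
{
    unsigned int r = 0;

    while (n >>= 1)
        r++;
    return r;
}

/* Initial scan window of 2*log n (base 2 assumed). */
static unsigned int initial_scan_window(unsigned int nr_vcpus)
{
    return 2 * ilog2_uint(nr_vcpus);
}

/* Double v, saturating at hi (written to avoid overflow). */
static unsigned int grow(unsigned int v, unsigned int hi)
{
    return (v > hi / 2) ? hi : v * 2;
}

/* Halve v, but never go below lo. */
static unsigned int shrink(unsigned int v, unsigned int lo)
{
    return (v / 2 < lo) ? lo : v / 2;
}

/* Adjust both windows after one pass through the spin handler. */
static void ple_tune(struct ple_tuning *t, bool yield_succeeded)
{
    assert(t->ple_window_min <= t->ple_window_max);
    assert(t->scan_window_min <= t->scan_window_max);

    if (yield_succeeded) {
        /* Someone needed the cpu: exit sooner, scan fewer vcpus. */
        t->ple_window = shrink(t->ple_window, t->ple_window_min);
        t->scan_window = shrink(t->scan_window, t->scan_window_min);
    } else {
        /* Nobody to yield to: spin longer in the guest, scan wider. */
        t->ple_window = grow(t->ple_window, t->ple_window_max);
        t->scan_window = grow(t->scan_window, t->scan_window_max);
    }
}
```

The min/max fields are what keep a long run of failures (or successes)
from pushing either window to a useless value.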


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-12  8:11               ` Avi Kivity
@ 2012-07-12  8:32                 ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-12  8:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christian Borntraeger, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 07/12/2012 01:41 PM, Avi Kivity wrote:
> On 07/12/2012 08:11 AM, Raghavendra K T wrote:
>>> Ah, I thought you objected to the CONFIG var.  Maybe call it
>>> cpu_relax_intercepted since that's the linuxy name for the instruction.
>>>
>>
>> Ok, just to be on same page. 'll have :
>> 1. cpu_relax_intercepted instead of  pause_loop_exited.
>>
>> 2. CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT which is unconditionally
>> selected for x86 and s390
>>
>> 3. make request mechanism to clear cpu_relax_intercepted.
>>
>> ('ll do same thing for s390 also but have not seen s390 code using
>> request mechanism, so not sure if it ok.. otherwise we have to clear
>> unconditionally for s390 before guest enter and for x86 we have to move
>> make_request back to vmx/svm).
>> will post V3 with these changes.
>
> You can leave the s390 changes to the s390 people; just make sure the
> generic code is ready.
>
Yep.
I checked the following logic with make_request and it works fine:

vcpu_spin()
{
  ple_exited = true;
.
.
make_request(KVM_REQ_CLEAR_PLE, vcpu);
}

vcpu_enter_guest()
{
  if(check_request(KVM_REQ_CLEAR_PLE))
     ple_exited = false;
.
.
}

But the following approach also works perfectly fine:
vcpu_spin()
{
  ple_exited = true;
.
.

  ple_exited = false;
}

I hope to go with the second approach. Let me know if you find any
loophole.
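
To spell out the difference between the two, here is a small user-space
model. The names (struct vcpu_model, REQ_CLEAR_PLE and the function
names) are hypothetical stand-ins, not the actual patch; the only thing
it tries to show is how long the flag stays set in each case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REQ_CLEAR_PLE (1UL << 0)  /* stand-in for KVM_REQ_CLEAR_PLE */

struct vcpu_model {
    bool ple_exited;
    unsigned long requests;
};

/* Approach 1: leave the flag set and ask the entry path to clear it. */
static void vcpu_spin_deferred_clear(struct vcpu_model *v)
{
    assert(v != NULL);
    v->ple_exited = true;
    /* ... scan the other vcpus and try the directed yield here ... */
    v->requests |= REQ_CLEAR_PLE;
}

/* Entry path for approach 1: consume the request and drop the flag. */
static void vcpu_enter_guest(struct vcpu_model *v)
{
    assert(v != NULL);
    if (v->requests & REQ_CLEAR_PLE) {
        v->requests &= ~REQ_CLEAR_PLE;
        v->ple_exited = false;
    }
}

/* Approach 2: the flag is only set for the duration of the handler. */
static void vcpu_spin_scoped_clear(struct vcpu_model *v)
{
    assert(v != NULL);
    v->ple_exited = true;
    /* ... scan the other vcpus and try the directed yield here ... */
    v->ple_exited = false;
}
```

In the first variant the flag stays visible to other vcpus until the
next guest entry; in the second it is visible only while the vcpu is
inside the spin handler, and nothing has to be added to the entry path.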


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:04       ` Avi Kivity
                           ` (2 preceding siblings ...)
  2012-07-12  2:17         ` Benjamin Herrenschmidt
@ 2012-07-12 10:38         ` Nikunj A Dadhania
  3 siblings, 0 replies; 52+ messages in thread
From: Nikunj A Dadhania @ 2012-07-12 10:38 UTC (permalink / raw)
  To: Avi Kivity, Christian Borntraeger
  Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner,
	Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt,
	Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On Wed, 11 Jul 2012 14:04:03 +0300, Avi Kivity <avi@redhat.com> wrote:
> 
> > So this would probably improve guests that uses cpu_relax, for example
> > stop_machine_run. I have no measurements, though.
> 
> smp_call_function() too (though that can be converted to directed yield
> too).  It seems worthwhile.
> 
With 

https://lkml.org/lkml/2012/6/26/266 in tip:x86/mm

native_flush_tlb_others now uses smp_call_function_many, so it
will help that too.

Nikunj


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 10:52     ` Raghavendra K T
  2012-07-11 11:18       ` Avi Kivity
@ 2012-07-12 10:58       ` Nikunj A Dadhania
  2012-07-12 11:02         ` Raghavendra K T
  1 sibling, 1 reply; 52+ messages in thread
From: Nikunj A Dadhania @ 2012-07-12 10:58 UTC (permalink / raw)
  To: Raghavendra K T, Avi Kivity
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM,
	chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On Wed, 11 Jul 2012 16:22:29 +0530, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> On 07/11/2012 02:23 PM, Avi Kivity wrote:
> >
> > This adds some tiny overhead to vcpu entry.  You could remove it by
> > using the vcpu->requests mechanism to clear the flag, since
> > vcpu->requests is already checked on every entry.
> 
> So IIUC,  let's have request bit for indicating PLE,
> 
> pause_interception() /handle_pause()
> {
>   make_request(PLE_REQUEST)
>   vcpu_on_spin()
> 
> }
> 
> check_eligibility()
>   {
>   !test_request(PLE_REQUEST) || ( test_request(PLE_REQUEST)  && 
> dy_eligible())
> .
> .
> }
> 
> vcpu_run()
> {
> 
> check_request(PLE_REQUEST)
>
I know check_request will clear PLE_REQUEST, but you just need a
clear_request here, right?



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-12 10:58       ` Nikunj A Dadhania
@ 2012-07-12 11:02         ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-12 11:02 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/12/2012 04:28 PM, Nikunj A Dadhania wrote:
> On Wed, 11 Jul 2012 16:22:29 +0530, Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>  wrote:
>> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>>>
>>> This adds some tiny overhead to vcpu entry.  You could remove it by
>>> using the vcpu->requests mechanism to clear the flag, since
>>> vcpu->requests is already checked on every entry.
>>
>> So IIUC,  let's have request bit for indicating PLE,
>>
>> pause_interception() /handle_pause()
>> {
>>    make_request(PLE_REQUEST)
>>    vcpu_on_spin()
>>
>> }
>>
>> check_eligibility()
>>    {
>>    !test_request(PLE_REQUEST) || ( test_request(PLE_REQUEST)&&
>> dy_eligible())
>> .
>> .
>> }
>>
>> vcpu_run()
>> {
>>
>> check_request(PLE_REQUEST)
>>
> I know check_request will clear PLE_REQUEST, but you just need a
> clear_request here, right?
>

Yes. I tried to use check_request for clearing, but I ended up with a
different implementation (see the latest thread).
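
For reference, the request-bit idea from the pseudocode above can be
modelled in user space roughly as follows. This is only a sketch under
stated assumptions: struct vcpu_model, its dy_eligible field, and the bit
number chosen for PLE_REQUEST are made up for illustration, and the
helpers are C11-atomics stand-ins named after the pseudocode, not the
kernel's vcpu->requests helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request bit number; not a real KVM constant. */
#define PLE_REQUEST 0u

/* Toy stand-in for the parts of a vcpu that the pseudocode touches. */
struct vcpu_model {
    atomic_ulong requests;  /* models vcpu->requests */
    bool dy_eligible;       /* models the result of dy_eligible() */
};

/* Set a request bit (done in the PLE exit handler in the pseudocode). */
static void make_request(struct vcpu_model *v, unsigned int bit)
{
    atomic_fetch_or(&v->requests, 1UL << bit);
}

/* Read a request bit without modifying it. */
static bool test_request(struct vcpu_model *v, unsigned int bit)
{
    return (atomic_load(&v->requests) & (1UL << bit)) != 0;
}

/* Test and clear in one atomic step; returns the old state of the bit. */
static bool check_request(struct vcpu_model *v, unsigned int bit)
{
    unsigned long mask = 1UL << bit;

    return (atomic_fetch_and(&v->requests, ~mask) & mask) != 0;
}

/*
 * A vcpu is a yield candidate if it has not done a PL exit, or if it
 * has but dy_eligible says it may be tried again.
 */
static bool check_eligibility(struct vcpu_model *v)
{
    return !test_request(v, PLE_REQUEST) || v->dy_eligible;
}
```

Note that !A || (A && B) reduces to !A || B, which is the form
check_eligibility() uses here. Calling check_request() and ignoring its
return value behaves like a plain clear, which is the point Nikunj raises.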




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-12  8:12           ` Avi Kivity
@ 2012-07-12 11:24             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 52+ messages in thread
From: Benjamin Herrenschmidt @ 2012-07-12 11:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML,
	X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Alexander Graf, Paul Mackerras

On Thu, 2012-07-12 at 11:12 +0300, Avi Kivity wrote:
> On 07/12/2012 05:17 AM, Benjamin Herrenschmidt wrote:
> >> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> >> it.  Given ppc's dislike of overcommit, and the way it implements
> >> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
> >> intercept those either, but I'm copying the ppc people in case I'm
> >> wrong.  So it's s390 and x86.
> > 
> > No but our spinlocks call __spin_yield()  (or __rw_yield) which does
> > some paravirt tricks already.
> > 
> > We check if the holder is currently running, and if not, we call the
> > H_CONFER hypercall which can be used to "give" our time slice to the
> > holder.
> > 
> > Our implementation of H_CONFER in KVM is currently a nop though.
> 
> Okay, so you can join the party.  See yield_to() and kvm_vcpu_on_spin().

Thanks! I'll have a look eventually :-)

Cheers,
Ben.




* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-12  8:25           ` Raghavendra K T
@ 2012-07-12 12:31             ` Avi Kivity
  0 siblings, 0 replies; 52+ messages in thread
From: Avi Kivity @ 2012-07-12 12:31 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, S390, Carsten Otte,
	Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov,
	linux390, Srivatsa Vaddagiri, Joerg Roedel

On 07/12/2012 11:25 AM, Raghavendra K T wrote:
>>
>> The problem occurs even with no overcommit at all.  One vcpu is in a
>> legitimately long pause loop.  All those exits accomplish nothing, since
>> all vcpus are scheduled.  Better to let it spin in guest mode.
>>
> 
> I agree. One idea is we can have a scan_window to limit the scan of all
> n vcpus each time we enter vcpu_spin, to say 2*log n initially;

Not sure I agree.  The subset that we scan is in no way special, so
there's no reason to suppose it would be effective.

We can make the loop exit time scale with the number of vcpus to account
for the greater effort needed to wake a vcpu.
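
A minimal sketch of what such scaling might look like, purely as an
illustration: the thread does not specify a formula, so the logarithmic
growth below is an arbitrary assumption, and scaled_ple_window() and
ilog2_u32() are hypothetical names, not existing KVM functions.

```c
#include <stdint.h>

/* Integer log2, rounded down; returns 0 for n == 0 or n == 1. */
static unsigned int ilog2_u32(uint32_t n)
{
    unsigned int r = 0;

    while (n >>= 1)
        r++;
    return r;
}

/*
 * Hypothetical policy: grow the pause-loop window with the vcpu count,
 * so that larger guests spin longer in guest mode before exiting.
 * The (1 + log2(n)) factor is an assumption chosen for illustration.
 */
static uint64_t scaled_ple_window(uint32_t base_window, uint32_t nr_vcpus)
{
    if (nr_vcpus < 2)
        return base_window;
    return (uint64_t)base_window * (1 + ilog2_u32(nr_vcpus));
}
```

With a base window of 4096, this gives 8192 for a 2-vcpu guest and
24576 for a 32-vcpu guest.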


-- 
error compiling committee.c: too many arguments to function




end of thread, other threads:[~2012-07-12 12:32 UTC | newest]

Thread overview: 52+ messages
2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
2012-07-09  6:33   ` Raghavendra K T
2012-07-09  6:33     ` Raghavendra K T
2012-07-09 22:39   ` Rik van Riel
2012-07-10 11:22     ` Raghavendra K T
2012-07-11  8:53   ` Avi Kivity
2012-07-11 10:52     ` Raghavendra K T
2012-07-11 11:18       ` Avi Kivity
2012-07-11 11:56         ` Raghavendra K T
2012-07-11 12:41           ` Andrew Jones
2012-07-12 10:58       ` Nikunj A Dadhania
2012-07-12 11:02         ` Raghavendra K T
2012-07-09  6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T
2012-07-09 22:30   ` Rik van Riel
2012-07-10 11:46     ` Raghavendra K T
2012-07-09  7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger
2012-07-10  8:27   ` Raghavendra K T
2012-07-11  9:06   ` Avi Kivity
2012-07-11 10:17     ` Christian Borntraeger
2012-07-11 11:04       ` Avi Kivity
2012-07-11 11:16         ` Alexander Graf
2012-07-11 11:23           ` Avi Kivity
2012-07-11 11:52             ` Alexander Graf
2012-07-11 12:48               ` Avi Kivity
2012-07-12  2:19             ` Benjamin Herrenschmidt
2012-07-11 11:18         ` Christian Borntraeger
2012-07-11 11:39           ` Avi Kivity
2012-07-12  5:11             ` Raghavendra K T
2012-07-12  8:11               ` Avi Kivity
2012-07-12  8:32                 ` Raghavendra K T
2012-07-12  2:17         ` Benjamin Herrenschmidt
2012-07-12  8:12           ` Avi Kivity
2012-07-12 11:24             ` Benjamin Herrenschmidt
2012-07-12 10:38         ` Nikunj A Dadhania
2012-07-11 11:51       ` Raghavendra K T
2012-07-11 11:55         ` Christian Borntraeger
2012-07-11 12:04           ` Raghavendra K T
2012-07-11 13:04         ` Raghavendra K T
2012-07-09 21:47 ` Andrew Theurer
2012-07-09 21:47   ` Andrew Theurer
2012-07-10  9:26   ` Raghavendra K T
2012-07-10 10:07   ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler : detailed result Raghavendra K T
2012-07-10 11:54   ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-10 13:27     ` Andrew Theurer
2012-07-11  9:00   ` Avi Kivity
2012-07-11 13:59     ` Raghavendra K T
2012-07-11 14:01       ` Raghavendra K T
2012-07-12  8:15         ` Avi Kivity
2012-07-12  8:25           ` Raghavendra K T
2012-07-12 12:31             ` Avi Kivity
2012-07-09 22:28 ` Rik van Riel
