* [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
@ 2012-07-09  6:20 Raghavendra K T
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
  ` (4 more replies)
0 siblings, 5 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:20 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel
Cc: S390, Carsten Otte, Christian Borntraeger, KVM, Raghavendra K T, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

Currently the Pause Loop Exit (PLE) handler does a directed yield to a random
VCPU on PL exit. Though we already filter while choosing the candidate to
yield_to, we can do better.

The problem is that for large-vcpu guests there is a higher probability of
yielding to a bad vcpu. We are not able to prevent a directed yield to the
same vcpu that did a PL exit recently and will perhaps spin again, wasting
CPU. Fix that by keeping track of which vcpus have done a PL exit.

The algorithm in this series gives a chance to a VCPU which has:
 (a) not done a PLE exit at all (it is probably a preempted lock-holder), or
 (b) been skipped in the last iteration because it did a PL exit, and has
     probably become eligible now (the next eligible lock holder).

Future enhancements:
 (1) Currently we have a boolean to decide on the eligibility of a vcpu. It
     would be nice to get feedback on large guests (>32 vcpus) on whether we
     can improve further with an integer counter (with counter = say
     f(log n)).
 (2) We have not considered system load during the vcpu iteration. With that
     information we could limit the scan and also decide whether schedule()
     is better. [I am able to use the number of kicked vcpus to decide on
     this, but there may be better ideas, such as information from the
     global loadavg.]
 (3) We can exploit this further with the PV patches, since they also know
     the next eligible lock-holder.
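The (a)/(b) policy above boils down to a per-vcpu alternating-eligibility check. A minimal user-space C model of it is sketched below; the field names mirror the series' `pause_loop_exited`/`dy_eligible`, but this is an illustrative sketch, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-vcpu state the series tracks (user-space model). */
struct vcpu_ple {
	bool pause_loop_exited;	/* set on PL exit, cleared on guest entry */
	bool dy_eligible;	/* flip-flop: skip this vcpu every other scan */
};

/*
 * A vcpu is a good directed-yield target if it never did a PL exit
 * (case (a): probably a preempted lock-holder), or if it did but was
 * skipped last time (case (b)).  Toggling dy_eligible makes a
 * PL-exiting vcpu alternate between eligible and skipped.
 */
static bool check_and_update_eligible(struct vcpu_ple *v)
{
	bool eligible = !v->pause_loop_exited || v->dy_eligible;

	if (v->pause_loop_exited)
		v->dy_eligible = !v->dy_eligible;
	return eligible;
}
```

A vcpu that keeps PL-exiting is thus considered on every other scan, while a vcpu that never PL-exited is always considered.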
Summary: There is a huge improvement for the moderate/no-overcommit scenario
for a kvm-based guest on a PLE machine (which is difficult ;) ).

Result:
Base    : kernel 3.5.0-rc5 with Rik's PLE handler fix
Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 NUMA nodes, 256GB RAM,
          32-core machine
Host    : enterprise linux, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4)
          (GCC), with test kernels
Guest   : fedora 16 with 32 vcpus, 8GB memory

Benchmarks:
 1) kernbench: kernbench-0.5 (kernbench -f -H -M -o 2*vcpu)
    The very first kernbench run is omitted.

 2) sysbench: 0.4.12
    sysbench --test=oltp --db-driver=pgsql prepare
    sysbench --num-threads=2*vcpu --max-requests=100000 --test=oltp \
        --oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
    Note that the db driver for this is pgsql.

 3) ebizzy: release 0.3, cmd: ebizzy -S 120

1) kernbench (time in sec; lower is better)
+-----+------------+----------+------------+----------+------------+
|     |  base_rik  |  stdev   |  patched   |  stdev   |  %improve  |
+-----+------------+----------+------------+----------+------------+
| 1x  |   49.2300  |  1.0171  |   38.3792  |  1.3659  |  28.27261% |
| 2x  |   91.9358  |  1.7768  |   85.8842  |  1.6654  |   7.04623% |
+-----+------------+----------+------------+----------+------------+

2) sysbench (time in sec; lower is better)
+-----+------------+----------+------------+----------+------------+
|     |  base_rik  |  stdev   |  patched   |  stdev   |  %improve  |
+-----+------------+----------+------------+----------+------------+
| 1x  |   12.1623  |  0.0942  |   12.1674  |  0.3126  |  -0.04192% |
| 2x  |   14.3069  |  0.8520  |   14.1879  |  0.6811  |   0.83874% |
+-----+------------+----------+------------+----------+------------+

Note that the 1x scenario differs only in the third decimal place; the
degradation/improvement for sysbench would not be visible even at a higher
confidence interval.
3) ebizzy (records/sec; higher is better)
+-----+------------+-----------+------------+-----------+------------+
|     |  base_rik  |  stdev    |  patched   |  stdev    |  %improve  |
+-----+------------+-----------+------------+-----------+------------+
| 1x  |  1129.2500 |   28.6793 |  2316.6250 |   53.0066 | 105.14722% |
| 2x  |  1892.3750 |   75.1112 |  2386.5000 |  168.8033 |  26.11137% |
+-----+------------+-----------+------------+-----------+------------+

Run counts:
 kernbench 1x: 4 fast runs = 12 runs avg
 kernbench 2x: 4 fast runs = 12 runs avg
 sysbench  1x: 8 runs avg
 sysbench  2x: 8 runs avg
 ebizzy    1x: 8 runs avg
 ebizzy    2x: 8 runs avg

Thanks to Vatsa and Srikar for the brainstorming discussions regarding
optimizations.

Raghavendra K T (2):
  kvm vcpu: Note down pause loop exit
  kvm PLE handler: Choose better candidate for directed yield

 arch/s390/include/asm/kvm_host.h |    5 +++++
 arch/x86/include/asm/kvm_host.h  |    9 ++++++++-
 arch/x86/kvm/svm.c               |    1 +
 arch/x86/kvm/vmx.c               |    1 +
 arch/x86/kvm/x86.c               |   18 +++++++++++++++++-
 virt/kvm/kvm_main.c              |    3 +++
 6 files changed, 35 insertions(+), 2 deletions(-)

^ permalink raw reply	[flat|nested] 52+ messages in thread
* [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-09  6:20 ` Raghavendra K T
  2012-07-09  6:33   ` Raghavendra K T
  ` (2 more replies)
  2012-07-09  6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T
  ` (3 subsequent siblings)
4 siblings, 3 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:20 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar, Marcelo Tosatti, Rik van Riel
Cc: S390, Carsten Otte, Christian Borntraeger, KVM, Raghavendra K T, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Noting pause loop exited vcpu helps in filtering right candidate to yield.
Yielding to same vcpu may result in more wastage of cpu.

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    7 +++++++
 arch/x86/kvm/svm.c              |    1 +
 arch/x86/kvm/vmx.c              |    1 +
 arch/x86/kvm/x86.c              |    4 +++-
 4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..857ca68 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
 		u64 length;
 		u64 status;
 	} osvw;
+
+	/* Pause loop exit optimization */
+	struct {
+		bool pause_loop_exited;
+		bool dy_eligible;
+	} plo;
+
 };

 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f75af40..a492f5d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)

 static int pause_interception(struct vcpu_svm *svm)
 {
+	svm->vcpu.arch.plo.pause_loop_exited = true;
 	kvm_vcpu_on_spin(&(svm->vcpu));
 	return 1;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..600fb3c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4945,6 +4945,7 @@ out:
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
 	skip_emulated_instruction(vcpu);
+	vcpu->arch.plo.pause_loop_exited = true;
 	kvm_vcpu_on_spin(vcpu);

 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..07dbd14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5331,7 +5331,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

 	if (req_immediate_exit)
 		smp_send_reschedule(vcpu->cpu);
-
+	vcpu->arch.plo.pause_loop_exited = false;
 	kvm_guest_enter();

 	if (unlikely(vcpu->arch.switch_db_regs)) {
@@ -6168,6 +6168,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	BUG_ON(vcpu->kvm == NULL);
 	kvm = vcpu->kvm;

+	vcpu->arch.plo.pause_loop_exited = false;
+	vcpu->arch.plo.dy_eligible = true;
 	vcpu->arch.emulate_ctxt.ops = &emulate_ops;
 	if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_bsp(vcpu))
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

^ permalink raw reply related	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
@ 2012-07-09  6:33   ` Raghavendra K T
  2012-07-09 22:39   ` Rik van Riel
  2012-07-11  8:53   ` Avi Kivity
  2 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:33 UTC (permalink / raw)
Cc: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar, Marcelo Tosatti,
    Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/09/2012 11:50 AM, Raghavendra K T wrote:
> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>
> Noting pause loop exited vcpu helps in filtering right candidate to yield.
> Yielding to same vcpu may result in more wastage of cpu.
>
> From: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
> ---

Oops. Sorry, somehow the Sign-off and From lines got interchanged.

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
  2012-07-09  6:33   ` Raghavendra K T
@ 2012-07-09 22:39   ` Rik van Riel
  2012-07-10 11:22     ` Raghavendra K T
  2012-07-11  8:53   ` Avi Kivity
  2 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2012-07-09 22:39 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar, Marcelo Tosatti,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:

> @@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
>  		u64 length;
>  		u64 status;
>  	} osvw;
> +
> +	/* Pause loop exit optimization */
> +	struct {
> +		bool pause_loop_exited;
> +		bool dy_eligible;
> +	} plo;

I know kvm_vcpu_arch is traditionally not a well documented
structure, but it would be really nice if each variable inside
this sub-structure could get some documentation.

Also, do we really want to introduce another acronym here?

Or would we be better off simply calling this struct .ple,
since that is a name people are already familiar with.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09 22:39   ` Rik van Riel
@ 2012-07-10 11:22     ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 11:22 UTC (permalink / raw)
To: Rik van Riel
Cc: H. Peter Anvin, Thomas Gleixner, Avi Kivity, Ingo Molnar, Marcelo Tosatti,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/10/2012 04:09 AM, Rik van Riel wrote:
> On 07/09/2012 02:20 AM, Raghavendra K T wrote:
>
>> @@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
>>  		u64 length;
>>  		u64 status;
>>  	} osvw;
>> +
>> +	/* Pause loop exit optimization */
>> +	struct {
>> +		bool pause_loop_exited;
>> +		bool dy_eligible;
>> +	} plo;
>
> I know kvm_vcpu_arch is traditionally not a well documented
> structure, but it would be really nice if each variable inside
> this sub-structure could get some documentation.

Sure. Will document it.

> Also, do we really want to introduce another acronym here?
>
> Or would we be better off simply calling this struct .ple,
> since that is a name people are already familiar with.

Yes, it makes sense to have .ple.

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
  2012-07-09  6:33   ` Raghavendra K T
  2012-07-09 22:39   ` Rik van Riel
@ 2012-07-11  8:53   ` Avi Kivity
  2012-07-11 10:52     ` Raghavendra K T
  2 siblings, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11  8:53 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti, Rik van Riel,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/09/2012 09:20 AM, Raghavendra K T wrote:
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> Noting pause loop exited vcpu helps in filtering right candidate to yield.
> Yielding to same vcpu may result in more wastage of cpu.
>
>  struct kvm_lpage_info {
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f75af40..a492f5d 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
>
>  static int pause_interception(struct vcpu_svm *svm)
>  {
> +	svm->vcpu.arch.plo.pause_loop_exited = true;
>  	kvm_vcpu_on_spin(&(svm->vcpu));
>  	return 1;
>  }
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 32eb588..600fb3c 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4945,6 +4945,7 @@ out:
>  static int handle_pause(struct kvm_vcpu *vcpu)
>  {
>  	skip_emulated_instruction(vcpu);
> +	vcpu->arch.plo.pause_loop_exited = true;
>  	kvm_vcpu_on_spin(vcpu);

This code is duplicated.  Should we move it to kvm_vcpu_on_spin?

That means the .plo structure needs to be in common code, but that's not
too bad perhaps.

> index be6d549..07dbd14 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5331,7 +5331,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
>  	if (req_immediate_exit)
>  		smp_send_reschedule(vcpu->cpu);
> -
> +	vcpu->arch.plo.pause_loop_exited = false;

This adds some tiny overhead to vcpu entry.  You could remove it by
using the vcpu->requests mechanism to clear the flag, since
vcpu->requests is already checked on every entry.

>  	kvm_guest_enter();

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11  8:53   ` Avi Kivity
@ 2012-07-11 10:52     ` Raghavendra K T
  2012-07-11 11:18       ` Avi Kivity
  2012-07-12 10:58       ` Nikunj A Dadhania
  0 siblings, 2 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 10:52 UTC (permalink / raw)
To: Avi Kivity
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti, Rik van Riel,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/11/2012 02:23 PM, Avi Kivity wrote:
> On 07/09/2012 09:20 AM, Raghavendra K T wrote:
>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>>
>> Noting pause loop exited vcpu helps in filtering right candidate to yield.
>> Yielding to same vcpu may result in more wastage of cpu.
>>
>>  struct kvm_lpage_info {
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index f75af40..a492f5d 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
>>
>>  static int pause_interception(struct vcpu_svm *svm)
>>  {
>> +	svm->vcpu.arch.plo.pause_loop_exited = true;
>>  	kvm_vcpu_on_spin(&(svm->vcpu));
>>  	return 1;
>>  }
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 32eb588..600fb3c 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4945,6 +4945,7 @@ out:
>>  static int handle_pause(struct kvm_vcpu *vcpu)
>>  {
>>  	skip_emulated_instruction(vcpu);
>> +	vcpu->arch.plo.pause_loop_exited = true;
>>  	kvm_vcpu_on_spin(vcpu);
>
> This code is duplicated.  Should we move it to kvm_vcpu_on_spin?
>
> That means the .plo structure needs to be in common code, but that's not
> too bad perhaps.

Since PLE is very much tied to x86, and the proposed changes are very much
specific to the PLE handler, I thought it better to make this arch specific.

So do you think it is good to move it inside vcpu_on_spin and make the ple
structure belong to common code?

>> index be6d549..07dbd14 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5331,7 +5331,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>
>>  	if (req_immediate_exit)
>>  		smp_send_reschedule(vcpu->cpu);
>> -
>> +	vcpu->arch.plo.pause_loop_exited = false;
>
> This adds some tiny overhead to vcpu entry.  You could remove it by
> using the vcpu->requests mechanism to clear the flag, since
> vcpu->requests is already checked on every entry.

So IIUC, let's have a request bit for indicating PLE:

    pause_interception() / handle_pause()
    {
        make_request(PLE_REQUEST)
        vcpu_on_spin()
    }

    check_eligibility()
    {
        !test_request(PLE_REQUEST) || (test_request(PLE_REQUEST) &&
                                       dy_eligible())
        .
        .
    }

    vcpu_run()
    {
        check_request(PLE_REQUEST)
        .
        .
    }

Is this the expected flow you had in mind?

[But my only concern was not resetting for cases where we do not do
guest_enter(). Will test how that goes.]

>
>>  	kvm_guest_enter();
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread
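The request-bit flow discussed here relies on KVM's `vcpu->requests` bitmask, where making a request sets a bit and a check on guest entry tests and clears it. A stand-alone C model of those semantics follows; `PLE_REQUEST` is the hypothetical bit number from the flow above, and the helpers are simplified single-threaded stand-ins for the kernel's atomic bit operations:

```c
#include <assert.h>
#include <stdbool.h>

#define PLE_REQUEST 0	/* hypothetical request bit, as in the sketch above */

struct vcpu_model {
	unsigned long requests;	/* models vcpu->requests */
};

static void make_request(int req, struct vcpu_model *v)
{
	v->requests |= 1UL << req;	/* raised in pause_interception() */
}

static bool test_request(int req, struct vcpu_model *v)
{
	return v->requests & (1UL << req);	/* read in check_eligibility() */
}

/*
 * Models the test-and-clear done on every vcpu_run() entry, so the
 * PLE flag is reset without a dedicated store on the entry path.
 */
static bool check_request(int req, struct vcpu_model *v)
{
	if (!test_request(req, v))
		return false;
	v->requests &= ~(1UL << req);
	return true;
}
```

The point of the mechanism is the last helper: since entry already walks `vcpu->requests`, clearing the PLE bit there costs nothing extra.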
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 10:52     ` Raghavendra K T
@ 2012-07-11 11:18       ` Avi Kivity
  2012-07-11 11:56         ` Raghavendra K T
  1 sibling, 1 reply; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 11:18 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti, Rik van Riel,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/11/2012 01:52 PM, Raghavendra K T wrote:
> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>> On 07/09/2012 09:20 AM, Raghavendra K T wrote:
>>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> Noting pause loop exited vcpu helps in filtering right candidate to
>>> yield.
>>> Yielding to same vcpu may result in more wastage of cpu.
>>>
>>>  struct kvm_lpage_info {
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index f75af40..a492f5d 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct
>>> vcpu_svm *svm)
>>>
>>>  static int pause_interception(struct vcpu_svm *svm)
>>>  {
>>> +	svm->vcpu.arch.plo.pause_loop_exited = true;
>>>  	kvm_vcpu_on_spin(&(svm->vcpu));
>>>  	return 1;
>>>  }
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 32eb588..600fb3c 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -4945,6 +4945,7 @@ out:
>>>  static int handle_pause(struct kvm_vcpu *vcpu)
>>>  {
>>>  	skip_emulated_instruction(vcpu);
>>> +	vcpu->arch.plo.pause_loop_exited = true;
>>>  	kvm_vcpu_on_spin(vcpu);
>>>
>>
>> This code is duplicated.  Should we move it to kvm_vcpu_on_spin?
>>
>> That means the .plo structure needs to be in common code, but that's not
>> too bad perhaps.
>
> Since PLE is very much tied to x86, and proposed changes are very much
> specific to PLE handler, I thought it is better to make arch specific.
>
> So do you think it is good to move inside vcpu_on_spin and make ple
> structure belong to common code?

See the discussion with Christian.  PLE is tied to x86, but cpu_relax()
and facilities to trap it are not.

>> This adds some tiny overhead to vcpu entry.  You could remove it by
>> using the vcpu->requests mechanism to clear the flag, since
>> vcpu->requests is already checked on every entry.
>
> So IIUC, let's have request bit for indicating PLE,
>
>     pause_interception() / handle_pause()
>     {
>         make_request(PLE_REQUEST)
>         vcpu_on_spin()
>     }
>
>     check_eligibility()
>     {
>         !test_request(PLE_REQUEST) || (test_request(PLE_REQUEST) &&
>                                        dy_eligible())
>         .
>         .
>     }
>
>     vcpu_run()
>     {
>         check_request(PLE_REQUEST)
>         .
>         .
>     }
>
> Is this the expected flow you had in mind?

Yes, something like that.

> [ But my only concern was not resetting for cases where we do not do
> guest_enter(). will test how that goes].

Hm, suppose we're the next-in-line for a ticket lock and exit due to
PLE.  The lock holder completes and unlocks, which really assigns the
lock to us.  So now we are the lock owner, yet we are marked as don't
yield-to-us in the PLE code.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 52+ messages in thread
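The next-in-line scenario raised here can be seen with a toy ticket lock: the moment the holder releases, ownership passes to the next ticket, even though that waiter may still be sitting in the PLE handler marked as a bad yield target. A minimal sketch (illustrative model, not the kernel's `arch_spinlock_t`):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy ticket lock: tail hands out tickets, head is the ticket being served. */
struct ticketlock {
	unsigned int head;
	unsigned int tail;
};

static unsigned int take_ticket(struct ticketlock *l)
{
	return l->tail++;	/* atomically fetch-and-add in real code */
}

static bool owns_lock(struct ticketlock *l, unsigned int ticket)
{
	return l->head == ticket;	/* waiters spin until this holds */
}

static void unlock(struct ticketlock *l)
{
	l->head++;	/* silently assigns the lock to the next ticket */
}
```

The unlock is just an increment; nothing notifies the new owner, which is why a waiter that PL-exited can already own the lock while the yield heuristic still skips it.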
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 11:18       ` Avi Kivity
@ 2012-07-11 11:56         ` Raghavendra K T
  2012-07-11 12:41           ` Andrew Jones
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 11:56 UTC (permalink / raw)
To: Avi Kivity
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti, Rik van Riel,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/11/2012 04:48 PM, Avi Kivity wrote:
> On 07/11/2012 01:52 PM, Raghavendra K T wrote:
>> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>>> On 07/09/2012 09:20 AM, Raghavendra K T wrote:
>>>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>>>>
>>>> Noting pause loop exited vcpu helps in filtering right candidate to
>>>> yield.
>>>> Yielding to same vcpu may result in more wastage of cpu.
>>>>
>>>>  struct kvm_lpage_info {
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index f75af40..a492f5d 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct
>>>> vcpu_svm *svm)
>>>>
>>>>  static int pause_interception(struct vcpu_svm *svm)
>>>>  {
>>>> +	svm->vcpu.arch.plo.pause_loop_exited = true;
>>>>  	kvm_vcpu_on_spin(&(svm->vcpu));
>>>>  	return 1;
>>>>  }
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 32eb588..600fb3c 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -4945,6 +4945,7 @@ out:
>>>>  static int handle_pause(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	skip_emulated_instruction(vcpu);
>>>> +	vcpu->arch.plo.pause_loop_exited = true;
>>>>  	kvm_vcpu_on_spin(vcpu);
>>>>
>>>
>>> This code is duplicated.  Should we move it to kvm_vcpu_on_spin?
>>>
>>> That means the .plo structure needs to be in common code, but that's not
>>> too bad perhaps.
>>
>> Since PLE is very much tied to x86, and proposed changes are very much
>> specific to PLE handler, I thought it is better to make arch specific.
>>
>> So do you think it is good to move inside vcpu_on_spin and make ple
>> structure belong to common code?
>
> See the discussion with Christian.  PLE is tied to x86, but cpu_relax()
> and facilities to trap it are not.

Yep.

>>> This adds some tiny overhead to vcpu entry.  You could remove it by
>>> using the vcpu->requests mechanism to clear the flag, since
>>> vcpu->requests is already checked on every entry.
>>
>> So IIUC, let's have request bit for indicating PLE,
>>
>>     pause_interception() / handle_pause()
>>     {
>>         make_request(PLE_REQUEST)
>>         vcpu_on_spin()
>>     }
>>
>>     check_eligibility()
>>     {
>>         !test_request(PLE_REQUEST) || (test_request(PLE_REQUEST) &&
>>                                        dy_eligible())
>>         .
>>         .
>>     }
>>
>>     vcpu_run()
>>     {
>>         check_request(PLE_REQUEST)
>>         .
>>         .
>>     }
>>
>> Is this the expected flow you had in mind?
>
> Yes, something like that.

OK.

>> [ But my only concern was not resetting for cases where we do not do
>> guest_enter(). will test how that goes].
>
> Hm, suppose we're the next-in-line for a ticket lock and exit due to
> PLE.  The lock holder completes and unlocks, which really assigns the
> lock to us.  So now we are the lock owner, yet we are marked as don't
> yield-to-us in the PLE code.

Yes. Off-topic, but that is solved by the kicked flag in PV spinlocks.

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 11:56         ` Raghavendra K T
@ 2012-07-11 12:41           ` Andrew Jones
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Jones @ 2012-07-11 12:41 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti, Rik van Riel,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel, Avi Kivity

----- Original Message -----
>> Hm, suppose we're the next-in-line for a ticket lock and exit due to
>> PLE.  The lock holder completes and unlocks, which really assigns
>> the lock to us.  So now we are the lock owner, yet we are marked as
>> don't yield-to-us in the PLE code.
>
> Yes.. off-topic but that is solved by kicked flag in PV spinlocks.

Yeah, this is a different, but related, topic. pvticketlocks not only
help yield spinning vcpus, but also allow the use of ticketlocks for the
guaranteed fairness. If we want that fairness we'll always need some
pv-ness, even if it's a mostly PLE solution. I see that as a reasonable
reason to take the pvticketlock series, assuming it performs at least as
well as PLE.

The following options have all been brought up at some point by various
people, and all have their own pluses and minuses:

PLE-only best-effort:
 + hardware-only; algorithm improvements can be made independent of
   guest OSes
 - has limited info about spinning vcpus, making it hard to improve the
   algorithm (improved with an auto-adjusting ple_window?)
 - perf enhancement/degradation is workload/ple_window dependent
 - impossible? to guarantee FIFO order

pvticketlocks:
 - have to maintain pv code, both hosts and guests
 + perf is only workload dependent (should disable ple to avoid
   interference?)
 + guarantees FIFO order
 + can fall back on PLE-only if the guest doesn't support it

hybrid:
 + takes advantage of the hw support
 - still requires host and guest pv code
 - will likely make perf dependent on ple_window again
 + guarantees FIFO order

???: did I miss any?

I think more benchmarking of PLE vs. pvticketlocks is needed, which I'm
working on. If we see that it performs just as well or better, then IMHO
we should consider committing Raghu's latest version of the pvticketlock
series, perhaps with an additional patch that auto-disables PLE when
pvticketlocks are enabled.

Drew

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-11 10:52     ` Raghavendra K T
  2012-07-11 11:18       ` Avi Kivity
@ 2012-07-12 10:58       ` Nikunj A Dadhania
  2012-07-12 11:02         ` Raghavendra K T
  1 sibling, 1 reply; 52+ messages in thread
From: Nikunj A Dadhania @ 2012-07-12 10:58 UTC (permalink / raw)
To: Raghavendra K T, Avi Kivity
Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti, Rik van Riel,
    S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On Wed, 11 Jul 2012 16:22:29 +0530, Raghavendra K T
<raghavendra.kt@linux.vnet.ibm.com> wrote:
> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>>
>> This adds some tiny overhead to vcpu entry.  You could remove it by
>> using the vcpu->requests mechanism to clear the flag, since
>> vcpu->requests is already checked on every entry.
>
> So IIUC, let's have request bit for indicating PLE,
>
>     pause_interception() / handle_pause()
>     {
>         make_request(PLE_REQUEST)
>         vcpu_on_spin()
>     }
>
>     check_eligibility()
>     {
>         !test_request(PLE_REQUEST) || (test_request(PLE_REQUEST) &&
>                                        dy_eligible())
>         .
>         .
>     }
>
>     vcpu_run()
>     {
>         check_request(PLE_REQUEST)
>     }

I know check_request will clear PLE_REQUEST, but you just need a
clear_request here, right?

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
  2012-07-12 10:58       ` Nikunj A Dadhania
@ 2012-07-12 11:02         ` Raghavendra K T
  0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-12 11:02 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Marcelo Tosatti,
    Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri,
    Joerg Roedel

On 07/12/2012 04:28 PM, Nikunj A Dadhania wrote:
> On Wed, 11 Jul 2012 16:22:29 +0530, Raghavendra K T
> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>> On 07/11/2012 02:23 PM, Avi Kivity wrote:
>>>
>>> This adds some tiny overhead to vcpu entry.  You could remove it by
>>> using the vcpu->requests mechanism to clear the flag, since
>>> vcpu->requests is already checked on every entry.
>>
>> So IIUC, let's have request bit for indicating PLE,
>>
>>     pause_interception() / handle_pause()
>>     {
>>         make_request(PLE_REQUEST)
>>         vcpu_on_spin()
>>     }
>>
>>     check_eligibility()
>>     {
>>         !test_request(PLE_REQUEST) || (test_request(PLE_REQUEST) &&
>>                                        dy_eligible())
>>         .
>>         .
>>     }
>>
>>     vcpu_run()
>>     {
>>         check_request(PLE_REQUEST)
>>     }
>
> I know check_request will clear PLE_REQUEST, but you just need a
> clear_request here, right?

Yes. I tried to use check_request for clearing, but I ended up with a
different implementation (latest thread).

^ permalink raw reply	[flat|nested] 52+ messages in thread
* [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield
  2012-07-09  6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
  2012-07-09  6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T
@ 2012-07-09  6:20 ` Raghavendra K T
  2012-07-09 22:30   ` Rik van Riel
  2012-07-09  7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger
  ` (2 subsequent siblings)
4 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-09  6:20 UTC (permalink / raw)
To: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel
Cc: S390, Carsten Otte, Christian Borntraeger, KVM, Raghavendra K T, chegu vinod,
    Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Currently the PLE handler can repeatedly do a directed yield to the same
vcpu that has recently done a PL exit. This can degrade performance. Try to
yield to the most eligible candidate instead, by alternate yielding.

Precisely, give a chance to a VCPU which has:
 (a) not done a PLE exit at all (it is probably a preempted lock-holder), or
 (b) been skipped in the last iteration because it did a PL exit, and has
     probably become eligible now (the next eligible lock holder).

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |    5 +++++
 arch/x86/include/asm/kvm_host.h  |    2 +-
 arch/x86/kvm/x86.c               |   14 ++++++++++++++
 virt/kvm/kvm_main.c              |    3 +++
 4 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index dd17537..884f2c4 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -256,5 +256,10 @@ struct kvm_arch{
 	struct gmap *gmap;
 };

+static inline bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *v)
+{
+	return true;
+}
+
 extern int sie64a(struct kvm_s390_sie_block *, u64 *);
 #endif
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 857ca68..ce01db3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -962,7 +962,7 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 void kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);

 int kvm_is_in_guest(void);
-
+bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07dbd14..24ceae8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6623,6 +6623,20 @@ bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 		kvm_x86_ops->interrupt_allowed(vcpu);
 }

+bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->arch.plo.pause_loop_exited ||
+		    (vcpu->arch.plo.pause_loop_exited &&
+		     vcpu->arch.plo.dy_eligible);
+
+	if (vcpu->arch.plo.pause_loop_exited)
+		vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible;
+
+	return eligible;
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..519321a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1595,6 +1595,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			continue;
 		if (waitqueue_active(&vcpu->wq))
 			continue;
+		if (!kvm_arch_vcpu_check_and_update_eligible(vcpu)) {
+			continue;
+		}
 		if (kvm_vcpu_yield_to(vcpu)) {
 			kvm->last_boosted_vcpu = i;
 			yielded = 1;

^ permalink raw reply related	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield 2012-07-09 6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T @ 2012-07-09 22:30 ` Rik van Riel 2012-07-10 11:46 ` Raghavendra K T 0 siblings, 1 reply; 52+ messages in thread From: Rik van Riel @ 2012-07-09 22:30 UTC (permalink / raw) To: Raghavendra K T Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/09/2012 02:20 AM, Raghavendra K T wrote: > +bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu) > +{ > + bool eligible; > + > + eligible = !vcpu->arch.plo.pause_loop_exited || > + (vcpu->arch.plo.pause_loop_exited&& > + vcpu->arch.plo.dy_eligible); > + > + if (vcpu->arch.plo.pause_loop_exited) > + vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible; > + > + return eligible; > +} This is a nice simple mechanism to skip CPUs that were eligible last time and had pause loop exits recently. However, it could stand some documentation. Please add a good comment explaining how and why the algorithm works, when arch.plo.pause_loop_exited is cleared, etc... It would be good to make this heuristic understandable to people who look at the code for the first time. -- All rights reversed ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield 2012-07-09 22:30 ` Rik van Riel @ 2012-07-10 11:46 ` Raghavendra K T 0 siblings, 0 replies; 52+ messages in thread From: Raghavendra K T @ 2012-07-10 11:46 UTC (permalink / raw) To: Rik van Riel Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/10/2012 04:00 AM, Rik van Riel wrote: > On 07/09/2012 02:20 AM, Raghavendra K T wrote: > >> +bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu) >> +{ >> + bool eligible; >> + >> + eligible = !vcpu->arch.plo.pause_loop_exited || >> + (vcpu->arch.plo.pause_loop_exited&& >> + vcpu->arch.plo.dy_eligible); >> + >> + if (vcpu->arch.plo.pause_loop_exited) >> + vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible; >> + >> + return eligible; >> +} > > This is a nice simple mechanism to skip CPUs that were > eligible last time and had pause loop exits recently. > > However, it could stand some documentation. Please > add a good comment explaining how and why the algorithm > works, when arch.plo.pause_loop_exited is cleared, etc... > > It would be good to make this heuristic understandable > to people who look at the code for the first time. > Thanks for the review. will do more documentation. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-09 6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T 2012-07-09 6:20 ` [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit Raghavendra K T 2012-07-09 6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T @ 2012-07-09 7:55 ` Christian Borntraeger 2012-07-10 8:27 ` Raghavendra K T 2012-07-11 9:06 ` Avi Kivity 2012-07-09 21:47 ` Andrew Theurer 2012-07-09 22:28 ` Rik van Riel 4 siblings, 2 replies; 52+ messages in thread From: Christian Borntraeger @ 2012-07-09 7:55 UTC (permalink / raw) To: Raghavendra K T Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 09/07/12 08:20, Raghavendra K T wrote: > Currently Pause Looop Exit (PLE) handler is doing directed yield to a > random VCPU on PL exit. Though we already have filtering while choosing > the candidate to yield_to, we can do better. > > Problem is, for large vcpu guests, we have more probability of yielding > to a bad vcpu. We are not able to prevent directed yield to same guy who > has done PL exit recently, who perhaps spins again and wastes CPU. > > Fix that by keeping track of who has done PL exit. So The Algorithm in series > give chance to a VCPU which has: We could do the same for s390. The appropriate exit would be diag44 (yield to hypervisor). Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. I have to double check with others, if these cases are critical, but for now, it seems that your dummy implementation for s390 is just fine. After all it is a no-op until we implement something. Thanks Christian ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-09 7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger @ 2012-07-10 8:27 ` Raghavendra K T 2012-07-11 9:06 ` Avi Kivity 1 sibling, 0 replies; 52+ messages in thread From: Raghavendra K T @ 2012-07-10 8:27 UTC (permalink / raw) To: Christian Borntraeger Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/09/2012 01:25 PM, Christian Borntraeger wrote: > On 09/07/12 08:20, Raghavendra K T wrote: >> Currently Pause Looop Exit (PLE) handler is doing directed yield to a >> random VCPU on PL exit. Though we already have filtering while choosing >> the candidate to yield_to, we can do better. >> >> Problem is, for large vcpu guests, we have more probability of yielding >> to a bad vcpu. We are not able to prevent directed yield to same guy who >> has done PL exit recently, who perhaps spins again and wastes CPU. >> >> Fix that by keeping track of who has done PL exit. So The Algorithm in series >> give chance to a VCPU which has: > > > We could do the same for s390. The appropriate exit would be diag44 (yield to hypervisor). > > Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. > So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. > I have to double check with others, if these cases are critical, but for now, it seems > that your dummy implementation for s390 is just fine. After all it is a no-op until > we implement something. > Thanks for the review. Nice to know that, patch has potential to help s390 also. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-09 7:55 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Christian Borntraeger 2012-07-10 8:27 ` Raghavendra K T @ 2012-07-11 9:06 ` Avi Kivity 2012-07-11 10:17 ` Christian Borntraeger 1 sibling, 1 reply; 52+ messages in thread From: Avi Kivity @ 2012-07-11 9:06 UTC (permalink / raw) To: Christian Borntraeger Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/09/2012 10:55 AM, Christian Borntraeger wrote: > On 09/07/12 08:20, Raghavendra K T wrote: >> Currently Pause Looop Exit (PLE) handler is doing directed yield to a >> random VCPU on PL exit. Though we already have filtering while choosing >> the candidate to yield_to, we can do better. >> >> Problem is, for large vcpu guests, we have more probability of yielding >> to a bad vcpu. We are not able to prevent directed yield to same guy who >> has done PL exit recently, who perhaps spins again and wastes CPU. >> >> Fix that by keeping track of who has done PL exit. So The Algorithm in series >> give chance to a VCPU which has: > > > We could do the same for s390. The appropriate exit would be diag44 (yield to hypervisor). > > Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. Perhaps x86 should copy this. > So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. > I have to double check with others, if these cases are critical, but for now, it seems > that your dummy implementation for s390 is just fine. After all it is a no-op until > we implement something. Does the data structure make sense for you? If so we can move it to common code (and manage it in kvm_vcpu_on_spin()). 
We can guard it with CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't have to pay anything. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 9:06 ` Avi Kivity @ 2012-07-11 10:17 ` Christian Borntraeger 2012-07-11 11:04 ` Avi Kivity 2012-07-11 11:51 ` Raghavendra K T 0 siblings, 2 replies; 52+ messages in thread From: Christian Borntraeger @ 2012-07-11 10:17 UTC (permalink / raw) To: Avi Kivity Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt On 11/07/12 11:06, Avi Kivity wrote: [...] >> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. > > Perhaps x86 should copy this. See arch/s390/lib/spinlock.c The basic idea is using several heuristics: - loop for a given amount of loops - check if the lock holder is currently scheduled by the hypervisor (smp_vcpu_scheduled, which uses the sigp sense running instruction) Dont know if such thing is available for x86. It must be a lot cheaper than a guest exit to be useful - if lock holder is not running and we looped for a while do a directed yield to that cpu. > >> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >> I have to double check with others, if these cases are critical, but for now, it seems >> that your dummy implementation for s390 is just fine. After all it is a no-op until >> we implement something. > > Does the data structure make sense for you? If so we can move it to > common code (and manage it in kvm_vcpu_on_spin()). We can guard it with > CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't > have to pay anything. Ignoring the name, yes the data structure itself seems based on the algorithm and not on arch specific things. That should work. If we move that to common code then s390 will use that scheme automatically for the cases were we call kvm_vcpu_on_spin(). 
All other archs as well. So this would probably improve guests that use
cpu_relax, for example stop_machine_run. I have no measurements, though.

Christian

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 10:17 ` Christian Borntraeger @ 2012-07-11 11:04 ` Avi Kivity 2012-07-11 11:16 ` Alexander Graf ` (3 more replies) 2012-07-11 11:51 ` Raghavendra K T 1 sibling, 4 replies; 52+ messages in thread From: Avi Kivity @ 2012-07-11 11:04 UTC (permalink / raw) To: Christian Borntraeger Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt On 07/11/2012 01:17 PM, Christian Borntraeger wrote: > On 11/07/12 11:06, Avi Kivity wrote: > [...] >>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >> >> Perhaps x86 should copy this. > > See arch/s390/lib/spinlock.c > The basic idea is using several heuristics: > - loop for a given amount of loops > - check if the lock holder is currently scheduled by the hypervisor > (smp_vcpu_scheduled, which uses the sigp sense running instruction) > Dont know if such thing is available for x86. It must be a lot cheaper > than a guest exit to be useful We could make it available via shared memory, updated using preempt notifiers. Of course piling on more pv makes this less attractive. > - if lock holder is not running and we looped for a while do a directed > yield to that cpu. > >> >>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >>> I have to double check with others, if these cases are critical, but for now, it seems >>> that your dummy implementation for s390 is just fine. After all it is a no-op until >>> we implement something. >> >> Does the data structure make sense for you? If so we can move it to >> common code (and manage it in kvm_vcpu_on_spin()). 
We can guard it with >> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't >> have to pay anything. > > Ignoring the name, What name would you suggest? > yes the data structure itself seems based on the algorithm > and not on arch specific things. That should work. If we move that to common > code then s390 will use that scheme automatically for the cases were we call > kvm_vcpu_on_spin(). All others archs as well. ARM doesn't have an instruction for cpu_relax(), so it can't intercept it. Given ppc's dislike of overcommit, and the way it implements cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't intercept those either, but I'm copying the ppc people in case I'm wrong. So it's s390 and x86. > So this would probably improve guests that uses cpu_relax, for example > stop_machine_run. I have no measurements, though. smp_call_function() too (though that can be converted to directed yield too). It seems worthwhile. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:04 ` Avi Kivity @ 2012-07-11 11:16 ` Alexander Graf 2012-07-11 11:23 ` Avi Kivity 2012-07-11 11:18 ` Christian Borntraeger ` (2 subsequent siblings) 3 siblings, 1 reply; 52+ messages in thread From: Alexander Graf @ 2012-07-11 11:16 UTC (permalink / raw) To: Avi Kivity Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt On 11.07.2012, at 13:04, Avi Kivity wrote: > On 07/11/2012 01:17 PM, Christian Borntraeger wrote: >> On 11/07/12 11:06, Avi Kivity wrote: >> [...] >>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >>> >>> Perhaps x86 should copy this. >> >> See arch/s390/lib/spinlock.c >> The basic idea is using several heuristics: >> - loop for a given amount of loops >> - check if the lock holder is currently scheduled by the hypervisor >> (smp_vcpu_scheduled, which uses the sigp sense running instruction) >> Dont know if such thing is available for x86. It must be a lot cheaper >> than a guest exit to be useful > > We could make it available via shared memory, updated using preempt > notifiers. Of course piling on more pv makes this less attractive. > >> - if lock holder is not running and we looped for a while do a directed >> yield to that cpu. >> >>> >>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >>>> I have to double check with others, if these cases are critical, but for now, it seems >>>> that your dummy implementation for s390 is just fine. After all it is a no-op until >>>> we implement something. >>> >>> Does the data structure make sense for you? If so we can move it to >>> common code (and manage it in kvm_vcpu_on_spin()). 
We can guard it with >>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't >>> have to pay anything. >> >> Ignoring the name, > > What name would you suggest? > >> yes the data structure itself seems based on the algorithm >> and not on arch specific things. That should work. If we move that to common >> code then s390 will use that scheme automatically for the cases were we call >> kvm_vcpu_on_spin(). All others archs as well. > > ARM doesn't have an instruction for cpu_relax(), so it can't intercept > it. Given ppc's dislike of overcommit, What dislike of overcommit? > and the way it implements cpu_relax() by adjusting hw thread priority, Yeah, I don't think we can intercept relaxing. It's basically a nop-like instruction that gives hardware hints on its current priorities. That said, we can always add PV code. Alex ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 11:16 ` Alexander Graf
@ 2012-07-11 11:23 ` Avi Kivity
  2012-07-11 11:52   ` Alexander Graf
  2012-07-12  2:19   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 52+ messages in thread
From: Avi Kivity @ 2012-07-11 11:23 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin,
	Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390,
	Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86,
	Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel,
	Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt

On 07/11/2012 02:16 PM, Alexander Graf wrote:
>>
>>> yes the data structure itself seems based on the algorithm
>>> and not on arch specific things. That should work. If we move that to common
>>> code then s390 will use that scheme automatically for the cases were we call
>>> kvm_vcpu_on_spin(). All others archs as well.
>>
>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
>> it. Given ppc's dislike of overcommit,
>
> What dislike of overcommit?

I understood ppc virtualization is more of the partitioning sort.
Perhaps I misunderstood it. But the reliance on device assignment, the
restrictions on scheduling, etc. all point to it.

>
>> and the way it implements cpu_relax() by adjusting hw thread priority,
>
> Yeah, I don't think we can intercept relaxing.

... and the lack of ability to intercept cpu_relax() ...

> It's basically a nop-like instruction that gives hardware hints on its current priorities.

That's what x86 PAUSE does. But we can intercept it (and not just any
execution - we can restrict intercept to tight loops executed more than
a specific number of times).

> That said, we can always add PV code.

Sure, but that's defeated by advancements like self-tuning PLE exits.
It's hard to get this right.
-- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:23 ` Avi Kivity @ 2012-07-11 11:52 ` Alexander Graf 2012-07-11 12:48 ` Avi Kivity 2012-07-12 2:19 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 52+ messages in thread From: Alexander Graf @ 2012-07-11 11:52 UTC (permalink / raw) To: Avi Kivity Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt On 11.07.2012, at 13:23, Avi Kivity wrote: > On 07/11/2012 02:16 PM, Alexander Graf wrote: >>> >>>> yes the data structure itself seems based on the algorithm >>>> and not on arch specific things. That should work. If we move that to common >>>> code then s390 will use that scheme automatically for the cases were we call >>>> kvm_vcpu_on_spin(). All others archs as well. >>> >>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept >>> it. Given ppc's dislike of overcommit, >> >> What dislike of overcommit? > > I understood ppc virtualization is more of the partitioning sort. > Perhaps I misunderstood it. But the reliance on device assignment, the > restrictions on scheduling, etc. all point to it. Well, you need to distinguish the different PPC targets here. In the embedded world, partitioning is obviously the biggest use case, though overcommit is possible. For big servers however, we usually do want overcommit and we do support it within the constraints hardware gives us. It's really no different from x86 when it comes to the use case wideness :). > >> >>> and the way it implements cpu_relax() by adjusting hw thread priority, >> >> Yeah, I don't think we can intercept relaxing. > > ... and the lack of ability to intercept cpu_relax() ... 
> >> It's basically a nop-like instruction that gives hardware hints on its current priorities. > > That's what x88 PAUSE does. But we can intercept it (and not just any > execution - we can restrict intercept to tight loops executed more than > a specific number of times). Yeah, it's pretty hard to fetch that information from PPC, since unlike x86 we split the hint from the loop. But I'll let Ben speak to that, he certainly knows way better how the hardware works. > >> That said, we can always add PV code. > > Sure, but that's defeated by advancements like self-tuning PLE exits. > It's hard to get this right. Well, eventually everything we do in PV is going to be moot as soon as hardware catches up. In most cases from what I've seen it's only useful as an interim solution. But for that time it's good to have :). Alex ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:52 ` Alexander Graf @ 2012-07-11 12:48 ` Avi Kivity 0 siblings, 0 replies; 52+ messages in thread From: Avi Kivity @ 2012-07-11 12:48 UTC (permalink / raw) To: Alexander Graf Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Paul Mackerras, Benjamin Herrenschmidt On 07/11/2012 02:52 PM, Alexander Graf wrote: > > On 11.07.2012, at 13:23, Avi Kivity wrote: > >> On 07/11/2012 02:16 PM, Alexander Graf wrote: >>>> >>>>> yes the data structure itself seems based on the algorithm >>>>> and not on arch specific things. That should work. If we move that to common >>>>> code then s390 will use that scheme automatically for the cases were we call >>>>> kvm_vcpu_on_spin(). All others archs as well. >>>> >>>> ARM doesn't have an instruction for cpu_relax(), so it can't intercept >>>> it. Given ppc's dislike of overcommit, >>> >>> What dislike of overcommit? >> >> I understood ppc virtualization is more of the partitioning sort. >> Perhaps I misunderstood it. But the reliance on device assignment, the >> restrictions on scheduling, etc. all point to it. > > Well, you need to distinguish the different PPC targets here. In the embedded world, partitioning is obviously the biggest use case, though overcommit is possible. For big servers however, we usually do want overcommit and we do support it within the constraints hardware gives us. > > It's really no different from x86 when it comes to the use case wideness :). Okay, thanks for the correction. > >> >>> >>>> and the way it implements cpu_relax() by adjusting hw thread priority, >>> >>> Yeah, I don't think we can intercept relaxing. >> >> ... and the lack of ability to intercept cpu_relax() ... 
>> >>> It's basically a nop-like instruction that gives hardware hints on its current priorities. >> >> That's what x88 PAUSE does. But we can intercept it (and not just any >> execution - we can restrict intercept to tight loops executed more than >> a specific number of times). > > Yeah, it's pretty hard to fetch that information from PPC, since unlike x86 we split the hint from the loop. But I'll let Ben speak to that, he certainly knows way better how the hardware works. On x86 it's split from the loop as well (inside cpu_relax() like ppc). But the hardware detects the loop and lets us know about it. > >> >>> That said, we can always add PV code. >> >> Sure, but that's defeated by advancements like self-tuning PLE exits. >> It's hard to get this right. > > Well, eventually everything we do in PV is going to be moot as soon as hardware catches up. In most cases from what I've seen it's only useful as an interim solution. But for that time it's good to have :). Depends on how interim it is. If better hardware is coming, I'd rather not add more pv-ness. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:23 ` Avi Kivity 2012-07-11 11:52 ` Alexander Graf @ 2012-07-12 2:19 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 52+ messages in thread From: Benjamin Herrenschmidt @ 2012-07-12 2:19 UTC (permalink / raw) To: Avi Kivity Cc: Alexander Graf, Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Paul Mackerras On Wed, 2012-07-11 at 14:23 +0300, Avi Kivity wrote: > On 07/11/2012 02:16 PM, Alexander Graf wrote: > >> > >>> yes the data structure itself seems based on the algorithm > >>> and not on arch specific things. That should work. If we move that to common > >>> code then s390 will use that scheme automatically for the cases were we call > >>> kvm_vcpu_on_spin(). All others archs as well. > >> > >> ARM doesn't have an instruction for cpu_relax(), so it can't intercept > >> it. Given ppc's dislike of overcommit, > > > > What dislike of overcommit? > > I understood ppc virtualization is more of the partitioning sort. > Perhaps I misunderstood it. But the reliance on device assignment, the > restrictions on scheduling, etc. all point to it. It historically was but that has changed quite a bit. Essentially the user can configure partitions to be more of the "all virtualized" kind or on the contrary more fixed partitions. The hypervisor does shared processors and we have paravirt APIs to cede our time slice to the lock holder. > >> and the way it implements cpu_relax() by adjusting hw thread priority, > > > > Yeah, I don't think we can intercept relaxing. > > ... and the lack of ability to intercept cpu_relax() ... > > > It's basically a nop-like instruction that gives hardware hints on its current priorities. > > That's what x88 PAUSE does. 
But we can intercept it (and not just any > execution - we can restrict intercept to tight loops executed more than > a specific number of times). > > > That said, we can always add PV code. > > Sure, but that's defeated by advancements like self-tuning PLE exits. > It's hard to get this right. > Cheers, Ben. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:04 ` Avi Kivity 2012-07-11 11:16 ` Alexander Graf @ 2012-07-11 11:18 ` Christian Borntraeger 2012-07-11 11:39 ` Avi Kivity 2012-07-12 2:17 ` Benjamin Herrenschmidt 2012-07-12 10:38 ` Nikunj A Dadhania 3 siblings, 1 reply; 52+ messages in thread From: Christian Borntraeger @ 2012-07-11 11:18 UTC (permalink / raw) To: Avi Kivity Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt On 11/07/12 13:04, Avi Kivity wrote: > On 07/11/2012 01:17 PM, Christian Borntraeger wrote: >> On 11/07/12 11:06, Avi Kivity wrote: >> [...] >>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >>> >>> Perhaps x86 should copy this. >> >> See arch/s390/lib/spinlock.c >> The basic idea is using several heuristics: >> - loop for a given amount of loops >> - check if the lock holder is currently scheduled by the hypervisor >> (smp_vcpu_scheduled, which uses the sigp sense running instruction) >> Dont know if such thing is available for x86. It must be a lot cheaper >> than a guest exit to be useful > > We could make it available via shared memory, updated using preempt > notifiers. Of course piling on more pv makes this less attractive. > >> - if lock holder is not running and we looped for a while do a directed >> yield to that cpu. >> >>> >>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >>>> I have to double check with others, if these cases are critical, but for now, it seems >>>> that your dummy implementation for s390 is just fine. After all it is a no-op until >>>> we implement something. >>> >>> Does the data structure make sense for you? 
If so we can move it to
>>> common code (and manage it in kvm_vcpu_on_spin()). We can guard it with
>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>> have to pay anything.
>>
>> Ignoring the name,
>
> What name would you suggest?

maybe vcpu_no_progress instead of pause_loop_exited

>
>> yes the data structure itself seems based on the algorithm
>> and not on arch specific things. That should work. If we move that to common
>> code then s390 will use that scheme automatically for the cases were we call
>> kvm_vcpu_on_spin(). All others archs as well.
>
> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> it. Given ppc's dislike of overcommit, and the way it implements
> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
> intercept those either, but I'm copying the ppc people in case I'm
> wrong. So it's s390 and x86.
>
>> So this would probably improve guests that uses cpu_relax, for example
>> stop_machine_run. I have no measurements, though.
>
> smp_call_function() too (though that can be converted to directed yield
> too). It seems worthwhile.

Indeed. For those places where it is possible I would like to see an
architecture primitive for directed yield. That could be useful for
other places as well (e.g. maybe lib/spinlock_debug.c, which has no
yielding at all).

Christian

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:18 ` Christian Borntraeger @ 2012-07-11 11:39 ` Avi Kivity 2012-07-12 5:11 ` Raghavendra K T 0 siblings, 1 reply; 52+ messages in thread From: Avi Kivity @ 2012-07-11 11:39 UTC (permalink / raw) To: Christian Borntraeger Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt On 07/11/2012 02:18 PM, Christian Borntraeger wrote: > On 11/07/12 13:04, Avi Kivity wrote: >> On 07/11/2012 01:17 PM, Christian Borntraeger wrote: >>> On 11/07/12 11:06, Avi Kivity wrote: >>> [...] >>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >>>> >>>> Perhaps x86 should copy this. >>> >>> See arch/s390/lib/spinlock.c >>> The basic idea is using several heuristics: >>> - loop for a given amount of loops >>> - check if the lock holder is currently scheduled by the hypervisor >>> (smp_vcpu_scheduled, which uses the sigp sense running instruction) >>> Dont know if such thing is available for x86. It must be a lot cheaper >>> than a guest exit to be useful >> >> We could make it available via shared memory, updated using preempt >> notifiers. Of course piling on more pv makes this less attractive. >> >>> - if lock holder is not running and we looped for a while do a directed >>> yield to that cpu. >>> >>>> >>>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >>>>> I have to double check with others, if these cases are critical, but for now, it seems >>>>> that your dummy implementation for s390 is just fine. After all it is a no-op until >>>>> we implement something. >>>> >>>> Does the data structure make sense for you? 
If so we can move it to
>>>> common code (and manage it in kvm_vcpu_on_spin()). We can guard it with
>>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>>> have to pay anything.
>>>
>>> Ignoring the name,
>>
>> What name would you suggest?
>
> maybe vcpu_no_progress instead of pause_loop_exited

Ah, I thought you objected to the CONFIG var. Maybe call it
cpu_relax_intercepted since that's the linuxy name for the instruction.

--
error compiling committee.c: too many arguments to function

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:39 ` Avi Kivity @ 2012-07-12 5:11 ` Raghavendra K T 2012-07-12 8:11 ` Avi Kivity 0 siblings, 1 reply; 52+ messages in thread From: Raghavendra K T @ 2012-07-12 5:11 UTC (permalink / raw) To: Avi Kivity Cc: Christian Borntraeger, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt On 07/11/2012 05:09 PM, Avi Kivity wrote: > On 07/11/2012 02:18 PM, Christian Borntraeger wrote: >> On 11/07/12 13:04, Avi Kivity wrote: >>> On 07/11/2012 01:17 PM, Christian Borntraeger wrote: >>>> On 11/07/12 11:06, Avi Kivity wrote: >>>> [...] >>>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >>>>> >>>>> Perhaps x86 should copy this. >>>> >>>> See arch/s390/lib/spinlock.c >>>> The basic idea is using several heuristics: >>>> - loop for a given amount of loops >>>> - check if the lock holder is currently scheduled by the hypervisor >>>> (smp_vcpu_scheduled, which uses the sigp sense running instruction) >>>> Dont know if such thing is available for x86. It must be a lot cheaper >>>> than a guest exit to be useful >>> >>> We could make it available via shared memory, updated using preempt >>> notifiers. Of course piling on more pv makes this less attractive. >>> >>>> - if lock holder is not running and we looped for a while do a directed >>>> yield to that cpu. >>>> >>>>> >>>>>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >>>>>> I have to double check with others, if these cases are critical, but for now, it seems >>>>>> that your dummy implementation for s390 is just fine. After all it is a no-op until >>>>>> we implement something. >>>>> >>>>> Does the data structure make sense for you? 
If so we can move it to >>>>> common code (and manage it in kvm_vcpu_on_spin()). We can guard it with >>>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't >>>>> have to pay anything. >>>> >>>> Ignoring the name, >>> >>> What name would you suggest? >> >> maybe vcpu_no_progress instead of pause_loop_exited > > Ah, I thouht you objected to the CONFIG var. Maybe call it > cpu_relax_intercepted since that's the linuxy name for the instruction. > Ok, just to be on same page. 'll have : 1. cpu_relax_intercepted instead of pause_loop_exited. 2. CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT which is unconditionally selected for x86 and s390 3. make request mechanism to clear cpu_relax_intercepted. ('ll do same thing for s390 also but have not seen s390 code using request mechanism, so not sure if it ok.. otherwise we have to clear unconditionally for s390 before guest enter and for x86 we have to move make_request back to vmx/svm). will post V3 with these changes. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-12 5:11 ` Raghavendra K T @ 2012-07-12 8:11 ` Avi Kivity 2012-07-12 8:32 ` Raghavendra K T 0 siblings, 1 reply; 52+ messages in thread From: Avi Kivity @ 2012-07-12 8:11 UTC (permalink / raw) To: Raghavendra K T Cc: Christian Borntraeger, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt On 07/12/2012 08:11 AM, Raghavendra K T wrote: >> Ah, I thouht you objected to the CONFIG var. Maybe call it >> cpu_relax_intercepted since that's the linuxy name for the instruction. >> > > Ok, just to be on same page. 'll have : > 1. cpu_relax_intercepted instead of pause_loop_exited. > > 2. CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT which is unconditionally > selected for x86 and s390 > > 3. make request mechanism to clear cpu_relax_intercepted. > > ('ll do same thing for s390 also but have not seen s390 code using > request mechanism, so not sure if it ok.. otherwise we have to clear > unconditionally for s390 before guest enter and for x86 we have to move > make_request back to vmx/svm). > will post V3 with these changes. You can leave the s390 changes to the s390 people; just make sure the generic code is ready. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-12 8:11 ` Avi Kivity @ 2012-07-12 8:32 ` Raghavendra K T 0 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-12 8:32 UTC (permalink / raw)
To: Avi Kivity
Cc: Christian Borntraeger, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt

On 07/12/2012 01:41 PM, Avi Kivity wrote:
> On 07/12/2012 08:11 AM, Raghavendra K T wrote:
>>> Ah, I thouht you objected to the CONFIG var. Maybe call it
>>> cpu_relax_intercepted since that's the linuxy name for the instruction.
>>>
>>
>> Ok, just to be on same page. 'll have :
>> 1. cpu_relax_intercepted instead of pause_loop_exited.
>>
>> 2. CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT which is unconditionally
>> selected for x86 and s390
>>
>> 3. make request mechanism to clear cpu_relax_intercepted.
>>
>> ('ll do same thing for s390 also but have not seen s390 code using
>> request mechanism, so not sure if it ok.. otherwise we have to clear
>> unconditionally for s390 before guest enter and for x86 we have to move
>> make_request back to vmx/svm).
>> will post V3 with these changes.
>
> You can leave the s390 changes to the s390 people; just make sure the
> generic code is ready.
>

Yep. I checked the following logic with make_request and it works fine:

vcpu_spin()
{
        ple_exited = true;
        .
        .
        make_request(KVM_REQ_CLEAR_PLE, vcpu);
}

vcpu_enter_guest()
{
        if (check_request(KVM_REQ_CLEAR_PLE))
                ple_exited = false;
        .
        .
}

But the following approach also works perfectly fine:

vcpu_spin()
{
        ple_exited = true;
        .
        .
        ple_exited = false;
}

I hope to go with the second approach. Let me know if you find any loophole.

^ permalink raw reply [flat|nested] 52+ messages in thread
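The two flag-clearing schemes sketched above can be contrasted in a toy model. KVM_REQ_CLEAR_PLE and the flag name come from the mail; the surrounding harness is an assumption for illustration, not KVM code:

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_REQ_CLEAR_PLE (1u << 0)  /* toy request bit */

struct toy_vcpu {
    bool cpu_relax_intercepted;
    unsigned int requests;           /* stands in for vcpu->requests */
};

/* Approach 1: the spin handler posts a request, and the flag is
 * cleared lazily on the next guest entry. */
void vcpu_spin_with_request(struct toy_vcpu *v)
{
    v->cpu_relax_intercepted = true;
    /* ... scan vcpus and do the directed yield ... */
    v->requests |= KVM_REQ_CLEAR_PLE;
}

void toy_vcpu_enter_guest(struct toy_vcpu *v)
{
    if (v->requests & KVM_REQ_CLEAR_PLE) {
        v->requests &= ~KVM_REQ_CLEAR_PLE;
        v->cpu_relax_intercepted = false;
    }
}

/* Approach 2: set and clear entirely inside the handler, so no
 * arch-specific request plumbing is needed. */
void vcpu_spin_self_clearing(struct toy_vcpu *v)
{
    v->cpu_relax_intercepted = true;
    /* ... scan vcpus and do the directed yield ... */
    v->cpu_relax_intercepted = false;
}
```

In both variants the flag only matters while other vcpus' PLE handlers are scanning; the second keeps the visible window to the handler itself, which is why it needs no per-arch clearing hook.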
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:04 ` Avi Kivity 2012-07-11 11:16 ` Alexander Graf 2012-07-11 11:18 ` Christian Borntraeger @ 2012-07-12 2:17 ` Benjamin Herrenschmidt 2012-07-12 8:12 ` Avi Kivity 2012-07-12 10:38 ` Nikunj A Dadhania 3 siblings, 1 reply; 52+ messages in thread From: Benjamin Herrenschmidt @ 2012-07-12 2:17 UTC (permalink / raw) To: Avi Kivity Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras > ARM doesn't have an instruction for cpu_relax(), so it can't intercept > it. Given ppc's dislike of overcommit, and the way it implements > cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't > intercept those either, but I'm copying the ppc people in case I'm > wrong. So it's s390 and x86. No but our spinlocks call __spin_yield() (or __rw_yield) which does some paravirt tricks already. We check if the holder is currently running, and if not, we call the H_CONFER hypercall which can be used to "give" our time slice to the holder. Our implementation of H_CONFER in KVM is currently a nop though. Cheers, Ben. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-12 2:17 ` Benjamin Herrenschmidt @ 2012-07-12 8:12 ` Avi Kivity 2012-07-12 11:24 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 52+ messages in thread From: Avi Kivity @ 2012-07-12 8:12 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras On 07/12/2012 05:17 AM, Benjamin Herrenschmidt wrote: >> ARM doesn't have an instruction for cpu_relax(), so it can't intercept >> it. Given ppc's dislike of overcommit, and the way it implements >> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't >> intercept those either, but I'm copying the ppc people in case I'm >> wrong. So it's s390 and x86. > > No but our spinlocks call __spin_yield() (or __rw_yield) which does > some paravirt tricks already. > > We check if the holder is currently running, and if not, we call the > H_CONFER hypercall which can be used to "give" our time slice to the > holder. > > Our implementation of H_CONFER in KVM is currently a nop though. Okay, so you can join the party. See yield_to() and kvm_vcpu_on_spin(). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-12 8:12 ` Avi Kivity @ 2012-07-12 11:24 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 52+ messages in thread From: Benjamin Herrenschmidt @ 2012-07-12 11:24 UTC (permalink / raw) To: Avi Kivity Cc: Christian Borntraeger, Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras On Thu, 2012-07-12 at 11:12 +0300, Avi Kivity wrote: > On 07/12/2012 05:17 AM, Benjamin Herrenschmidt wrote: > >> ARM doesn't have an instruction for cpu_relax(), so it can't intercept > >> it. Given ppc's dislike of overcommit, and the way it implements > >> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't > >> intercept those either, but I'm copying the ppc people in case I'm > >> wrong. So it's s390 and x86. > > > > No but our spinlocks call __spin_yield() (or __rw_yield) which does > > some paravirt tricks already. > > > > We check if the holder is currently running, and if not, we call the > > H_CONFER hypercall which can be used to "give" our time slice to the > > holder. > > > > Our implementation of H_CONFER in KVM is currently a nop though. > > Okay, so you can join the party. See yield_to() and kvm_vcpu_on_spin(). Thanks ! I'll have a look eventually :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:04 ` Avi Kivity ` (2 preceding siblings ...) 2012-07-12 2:17 ` Benjamin Herrenschmidt @ 2012-07-12 10:38 ` Nikunj A Dadhania 3 siblings, 0 replies; 52+ messages in thread From: Nikunj A Dadhania @ 2012-07-12 10:38 UTC (permalink / raw) To: Avi Kivity, Christian Borntraeger Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt, Alexander Graf, Paul Mackerras, Benjamin Herrenschmidt On Wed, 11 Jul 2012 14:04:03 +0300, Avi Kivity <avi@redhat.com> wrote: > > > So this would probably improve guests that uses cpu_relax, for example > > stop_machine_run. I have no measurements, though. > > smp_call_function() too (though that can be converted to directed yield > too). It seems worthwhile. > With https://lkml.org/lkml/2012/6/26/266 in tip:x86/mm which now uses smp_call_function_many in native_flush_tlb_others. It will help that too. Nikunj ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 10:17 ` Christian Borntraeger 2012-07-11 11:04 ` Avi Kivity @ 2012-07-11 11:51 ` Raghavendra K T 2012-07-11 11:55 ` Christian Borntraeger 2012-07-11 13:04 ` Raghavendra K T 1 sibling, 2 replies; 52+ messages in thread From: Raghavendra K T @ 2012-07-11 11:51 UTC (permalink / raw) To: Christian Borntraeger Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt On 07/11/2012 03:47 PM, Christian Borntraeger wrote: > On 11/07/12 11:06, Avi Kivity wrote: > [...] >>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >> >> Perhaps x86 should copy this. > > See arch/s390/lib/spinlock.c > The basic idea is using several heuristics: > - loop for a given amount of loops > - check if the lock holder is currently scheduled by the hypervisor > (smp_vcpu_scheduled, which uses the sigp sense running instruction) > Dont know if such thing is available for x86. It must be a lot cheaper > than a guest exit to be useful Unfortunately we do not have information on lock-holder. > - if lock holder is not running and we looped for a while do a directed > yield to that cpu. > >> >>> So there is no win here, but there are other cases were diag44 is used, e.g. cpu_relax. >>> I have to double check with others, if these cases are critical, but for now, it seems >>> that your dummy implementation for s390 is just fine. After all it is a no-op until >>> we implement something. >> >> Does the data structure make sense for you? If so we can move it to >> common code (and manage it in kvm_vcpu_on_spin()). We can guard it with >> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't >> have to pay anything. 
> > Ignoring the name, yes the data structure itself seems based on the algorithm > and not on arch specific things. That should work. Ok. can you please elaborate, on the flow. If we move that to common > code then s390 will use that scheme automatically for the cases were we call > kvm_vcpu_on_spin(). All others archs as well. > > So this would probably improve guests that uses cpu_relax, for example > stop_machine_run. I have no measurements, though. > > Christian > ^ permalink raw reply [flat|nested] 52+ messages in thread
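Condensing the heuristic Christian quotes from arch/s390/lib/spinlock.c into a single decision function may help with the flow being asked about. The sigp sense running check and diag9c are real s390 facilities, but the function, names and budget below are only an illustrative sketch:

```c
#include <assert.h>
#include <stdbool.h>

enum spin_action {
    SPIN_AGAIN,           /* keep spinning (cpu_relax) */
    SPIN_YIELD_TO_HOLDER, /* directed yield, i.e. diag9c */
};

/* One step of the lock-acquire slow path: spin while under the retry
 * budget; once the budget is spent, keep spinning only if the lock
 * holder is actually scheduled by the hypervisor (checked via sigp
 * sense running on s390), otherwise donate our slice to it. */
enum spin_action spin_step(int loops_spent, int spin_retry,
                           bool holder_scheduled)
{
    if (loops_spent < spin_retry)
        return SPIN_AGAIN;
    if (holder_scheduled)
        return SPIN_AGAIN;
    return SPIN_YIELD_TO_HOLDER;
}
```

The key cost argument in the mail is that the "holder scheduled?" probe must be much cheaper than a guest exit for this to pay off.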
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:51 ` Raghavendra K T @ 2012-07-11 11:55 ` Christian Borntraeger 2012-07-11 12:04 ` Raghavendra K T 2012-07-11 13:04 ` Raghavendra K T 1 sibling, 1 reply; 52+ messages in thread From: Christian Borntraeger @ 2012-07-11 11:55 UTC (permalink / raw) To: Raghavendra K T Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt On 11/07/12 13:51, Raghavendra K T wrote: >>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >>> >>> Perhaps x86 should copy this. >> >> See arch/s390/lib/spinlock.c >> The basic idea is using several heuristics: >> - loop for a given amount of loops >> - check if the lock holder is currently scheduled by the hypervisor >> (smp_vcpu_scheduled, which uses the sigp sense running instruction) >> Dont know if such thing is available for x86. It must be a lot cheaper >> than a guest exit to be useful > > Unfortunately we do not have information on lock-holder. That would be an independent patch and requires guest changes. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:55 ` Christian Borntraeger @ 2012-07-11 12:04 ` Raghavendra K T 0 siblings, 0 replies; 52+ messages in thread From: Raghavendra K T @ 2012-07-11 12:04 UTC (permalink / raw) To: Christian Borntraeger Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt On 07/11/2012 05:25 PM, Christian Borntraeger wrote: > On 11/07/12 13:51, Raghavendra K T wrote: >>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for spinlocks, though. >>>> >>>> Perhaps x86 should copy this. >>> >>> See arch/s390/lib/spinlock.c >>> The basic idea is using several heuristics: >>> - loop for a given amount of loops >>> - check if the lock holder is currently scheduled by the hypervisor >>> (smp_vcpu_scheduled, which uses the sigp sense running instruction) >>> Dont know if such thing is available for x86. It must be a lot cheaper >>> than a guest exit to be useful >> >> Unfortunately we do not have information on lock-holder. > > That would be an independent patch and requires guest changes. > Yes, AFAI think, there are two options: (1) extend lock and use spare bit in ticketlock indicate lock is held (2) use percpu list entry. ^ permalink raw reply [flat|nested] 52+ messages in thread
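Option (1) above — stealing a spare bit of the ticket lock to advertise "currently held" — could be sketched as follows. The layout is a made-up illustration, not the real x86 ticketlock encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICKET_HELD 0x8000u  /* hypothetical spare bit */

struct toy_ticketlock {
    uint16_t head;  /* ticket being served, plus the held bit */
    uint16_t tail;  /* next ticket to hand out */
};

void toy_acquire(struct toy_ticketlock *l)
{
    /* owner took its turn: publish "held" for cheap probes */
    l->head |= TICKET_HELD;
}

void toy_release(struct toy_ticketlock *l)
{
    /* clear the bit and serve the next ticket; the real thing
     * would have to do this atomically */
    l->head = (uint16_t)((l->head & ~TICKET_HELD) + 1);
}

/* what a PLE handler or pv hook could cheaply test */
bool toy_is_held(const struct toy_ticketlock *l)
{
    return l->head & TICKET_HELD;
}
```

As Christian notes, this is an independent patch and needs guest-side changes, since the guest owns the lock word layout.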
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 11:51 ` Raghavendra K T 2012-07-11 11:55 ` Christian Borntraeger @ 2012-07-11 13:04 ` Raghavendra K T 1 sibling, 0 replies; 52+ messages in thread From: Raghavendra K T @ 2012-07-11 13:04 UTC (permalink / raw) To: Christian Borntraeger Cc: Avi Kivity, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, KVM, chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel, Christian Ehrhardt On 07/11/2012 05:21 PM, Raghavendra K T wrote: > On 07/11/2012 03:47 PM, Christian Borntraeger wrote: >> On 11/07/12 11:06, Avi Kivity wrote: [...] >>>> So there is no win here, but there are other cases were diag44 is >>>> used, e.g. cpu_relax. >>>> I have to double check with others, if these cases are critical, but >>>> for now, it seems >>>> that your dummy implementation for s390 is just fine. After all it >>>> is a no-op until >>>> we implement something. >>> >>> Does the data structure make sense for you? If so we can move it to >>> common code (and manage it in kvm_vcpu_on_spin()). We can guard it with >>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't >>> have to pay anything. >> >> Ignoring the name, yes the data structure itself seems based on the >> algorithm >> and not on arch specific things. That should work. > > Ok. can you please elaborate, on the flow. > Ok got it.. Will check how the code can be common to both x86 and s390. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-09 6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T @ 2012-07-09 21:47 ` Andrew Theurer 2012-07-09 6:20 ` [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield Raghavendra K T ` (3 subsequent siblings) 4 siblings, 0 replies; 52+ messages in thread From: Andrew Theurer @ 2012-07-09 21:47 UTC (permalink / raw) To: Raghavendra K T Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote: > Currently Pause Looop Exit (PLE) handler is doing directed yield to a > random VCPU on PL exit. Though we already have filtering while choosing > the candidate to yield_to, we can do better. Hi, Raghu. > Problem is, for large vcpu guests, we have more probability of yielding > to a bad vcpu. We are not able to prevent directed yield to same guy who > has done PL exit recently, who perhaps spins again and wastes CPU. > > Fix that by keeping track of who has done PL exit. So The Algorithm in series > give chance to a VCPU which has: > > (a) Not done PLE exit at all (probably he is preempted lock-holder) > > (b) VCPU skipped in last iteration because it did PL exit, and probably > has become eligible now (next eligible lock holder) > > Future enhancemnets: > (1) Currently we have a boolean to decide on eligibility of vcpu. It > would be nice if I get feedback on guest (>32 vcpu) whether we can > improve better with integer counter. (with counter = say f(log n )). > > (2) We have not considered system load during iteration of vcpu. With > that information we can limit the scan and also decide whether schedule() > is better. 
[ I am able to use #kicked vcpus to decide on this But may
> be there are better ideas like information from global loadavg.]
>
> (3) We can exploit this further with PV patches since it also knows about
> next eligible lock-holder.
>
> Summary: There is a huge improvement for moderate / no overcommit scenario
> for kvm based guest on PLE machine (which is difficult ;) ).
>
> Result:
> Base : kernel 3.5.0-rc5 with Rik's Ple handler fix
>
> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM,
> 32 core machine

Is this with HT enabled, therefore 64 CPU threads?

> Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
> with test kernels
>
> Guest: fedora 16 with 32 vcpus 8GB memory.

Can you briefly explain the 1x and 2x configs? This of course is highly
dependent on whether or not HT is enabled...

FWIW, I started testing what I would call "0.5x", where I have one 40
vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
enabled, no extra load on the system). For ebizzy, the results are quite
erratic from run to run, so I am inclined to discard it as a workload,
but maybe I should try "1x" and "2x" cpu over-commit as well.

From initial observations, at least for the ebizzy workload, the
percentage of exits that result in a yield_to() are very low, around 1%,
before these patches. So, I am concerned that at least for this test,
reducing that number even more has diminishing returns.
I am however still concerned about the scalability problem with
yield_to(), which shows up like this for me (perf):

> 63.56%    282095    qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>  5.42%     24420    qemu-kvm  [kvm]              [k] kvm_vcpu_yield_to
>  5.33%     26481    qemu-kvm  [kernel.kallsyms]  [k] get_pid_task
>  4.35%     20049    qemu-kvm  [kernel.kallsyms]  [k] yield_to
>  2.74%     15652    qemu-kvm  [kvm]              [k] kvm_apic_present
>  1.70%      8657    qemu-kvm  [kvm]              [k] kvm_vcpu_on_spin
>  1.45%      7889    qemu-kvm  [kvm]              [k] vcpu_enter_guest

For the cpu threads in the host that are actually active (in this case
1/2 of them), ~50% of their time is in kernel and ~43% in guest. This is
for a no-IO workload, so that's just incredible to see so much cpu
wasted. I feel that 2 important areas to tackle are a more scalable
yield_to() and reducing the number of pause exits itself (hopefully by
just tuning ple_window for the latter).

Honestly, I am not confident addressing this problem will improve the
ebizzy score. That workload is so erratic for me, that I do not trust
the results at all. I have however seen consistent improvements in
disabling PLE for an http guest workload and a very high IOPS guest
workload, both with much time spent in host in the double runqueue lock
for yield_to(), so that's why I still gravitate toward that issue.

-Andrew Theurer

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-09 21:47 ` Andrew Theurer (?) @ 2012-07-10 9:26 ` Raghavendra K T -1 siblings, 0 replies; 52+ messages in thread From: Raghavendra K T @ 2012-07-10 9:26 UTC (permalink / raw) To: Andrew M. Theurer Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/10/2012 03:17 AM, Andrew Theurer wrote: > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote: >> Currently Pause Looop Exit (PLE) handler is doing directed yield to a >> random VCPU on PL exit. Though we already have filtering while choosing >> the candidate to yield_to, we can do better. > > Hi, Raghu. Hi Andrew, Thank you for your analysis and inputs > >> Problem is, for large vcpu guests, we have more probability of yielding >> to a bad vcpu. We are not able to prevent directed yield to same guy who >> has done PL exit recently, who perhaps spins again and wastes CPU. >> >> Fix that by keeping track of who has done PL exit. So The Algorithm in series >> give chance to a VCPU which has: >> >> (a) Not done PLE exit at all (probably he is preempted lock-holder) >> >> (b) VCPU skipped in last iteration because it did PL exit, and probably >> has become eligible now (next eligible lock holder) >> >> Future enhancemnets: >> (1) Currently we have a boolean to decide on eligibility of vcpu. It >> would be nice if I get feedback on guest (>32 vcpu) whether we can >> improve better with integer counter. (with counter = say f(log n )). >> >> (2) We have not considered system load during iteration of vcpu. With >> that information we can limit the scan and also decide whether schedule() >> is better. [ I am able to use #kicked vcpus to decide on this But may >> be there are better ideas like information from global loadavg.] 
>> >> (3) We can exploit this further with PV patches since it also knows about >> next eligible lock-holder. >> >> Summary: There is a huge improvement for moderate / no overcommit scenario >> for kvm based guest on PLE machine (which is difficult ;) ). >> >> Result: >> Base : kernel 3.5.0-rc5 with Rik's Ple handler fix >> >> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM, >> 32 core machine > > Is this with HT enabled, therefore 64 CPU threads? No. HT disabled with 32 online CPUs > >> Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) >> with test kernels >> >> Guest: fedora 16 with 32 vcpus 8GB memory. > > Can you briefly explain the 1x and 2x configs? This of course is highly > dependent whether or not HT is enabled... 1x config: kernbench/ebizzy/sysbench running on 1 guest (32 vcpu) all the benchmarks have 2*#vcpu = 64 threads 2x config: kernbench/ebizzy/sysbench running on 2 guests each with 32 vcpu) all the benchmarks have 2*#vcpu = 64 threads > > FWIW, I started testing what I would call "0.5x", where I have one 40 > vcpu guest running on a host with 40 cores and 80 CPU threads total (HT > enabled, no extra load on the system). For ebizzy, the results are > quite erratic from run to run, so I am inclined to discard it as a I will be posting full run detail (individual run) in reply to this mail since it is big. I have posted stdev also with the result.. it has not shown too much deviation. > workload, but maybe I should try "1x" and "2x" cpu over-commit as well. > >> From initial observations, at least for the ebizzy workload, the > percentage of exits that result in a yield_to() are very low, around 1%, > before these patches. Hmm Ok.. IMO for a under-committed workload, probably low percentage of yield_to was expected, but not sure whether 1% is too less though. But importantly, number of successful yield_to can never measure benefit. 
With this patch, what I am trying to address is to ensure that a
successful yield_to results in a benefit.

> So, I am concerned that at least for this test,
> reducing that number even more has diminishing returns. I am however
> still concerned about the scalability problem with yield_to(),

So did you mean you expect to see more yield_to overhead with large
guests? As already mentioned under future enhancements, one thing I will
be trying in future would be:

a. have a counter instead of a boolean for skipping yield_to
b. just scan probably f(log(n)) vcpus to yield, and then schedule()/
   return depending on system load.

So we will be reducing the overall vcpu iteration in the PLE handler
from O(n * n) to O(n log n).

> which shows like this for me (perf):
>
>> 63.56%  282095  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>>  5.42%   24420  qemu-kvm  [kvm]              [k] kvm_vcpu_yield_to
>>  5.33%   26481  qemu-kvm  [kernel.kallsyms]  [k] get_pid_task
>>  4.35%   20049  qemu-kvm  [kernel.kallsyms]  [k] yield_to
>>  2.74%   15652  qemu-kvm  [kvm]              [k] kvm_apic_present
>>  1.70%    8657  qemu-kvm  [kvm]              [k] kvm_vcpu_on_spin
>>  1.45%    7889  qemu-kvm  [kvm]              [k] vcpu_enter_guest
>
> For the cpu threads in the host that are actually active (in this case
> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
> is for a no-IO workload, so that's just incredible to see so much cpu
> wasted. I feel that 2 important areas to tackle are a more scalable
> yield_to() and reducing the number of pause exits itself (hopefully by
> just tuning ple_window for the latter).

I think this is a concern, and as you stated, I agree that tuning
ple_window helps here.

> Honestly, I'm not confident addressing this problem will improve the
> ebizzy score. That workload is so erratic for me, that I do not trust
> the results at all.
> I have however seen consistent improvements in
> disabling PLE for a http guest workload and a very high IOPS guest
> workload, both with much time spent in host in the double runqueue lock
> for yield_to(), so that's why I still gravitate toward that issue.

The problem starts (with PLE disabled) when we have a workload just
above 1x: we start burning so much cpu. IIRC, in 2x overcommit, a kernel
compilation that takes 10hr on non-PLE used to take just 1hr after the
pv patches (and should be the same with PLE enabled).

If we leave out the PLE disabled case, I do not expect any degradation
even in the 0.5x scenario, though you say the results are erratic. Could
you please let me know: when PLE was enabled, before and after the
patch, did you see any degradation for 0.5x?

> -Andrew Theurer
>

^ permalink raw reply [flat|nested] 52+ messages in thread
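The skip-the-recent-PLE-exiter heuristic described in the series can be sketched roughly as below. This is an illustrative standalone sketch, not the actual patch: the struct, field names, and function are invented here, and the real code operates on kvm_vcpu inside kvm_vcpu_on_spin().

```c
#include <stdbool.h>

/* Illustrative per-vcpu state; names are hypothetical. */
struct vcpu {
	bool ple_exited;   /* set by the PLE handler on a pause-loop exit */
	bool skipped_once; /* skipped in the previous candidate scan      */
};

/* Return true if this vcpu is a good directed-yield target. */
static bool eligible_for_yield(struct vcpu *v)
{
	/* (a) never did a PL exit: likely a preempted lock holder */
	if (!v->ple_exited)
		return true;

	/* (b) was skipped last iteration; may be the next lock holder */
	if (v->skipped_once) {
		v->skipped_once = false;
		v->ple_exited = false;
		return true;
	}

	/* recently spun and exited: skip it now, retry next round */
	v->skipped_once = true;
	return false;
}
```

The "future enhancement" discussed above would replace the two booleans with an integer counter (say capped at f(log n)), so a vcpu that pause-loop-exits repeatedly is skipped for several rounds instead of exactly one.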
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler : detailed result
  2012-07-09 21:47 ` Andrew Theurer
@ 2012-07-10 10:07 ` Raghavendra K T
  -1 siblings, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 10:07 UTC (permalink / raw)
To: habanero
Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
    Avi Kivity, Rik van Riel, S390, Carsten Otte, Christian Borntraeger,
    KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390,
    Srivatsa Vaddagiri, Joerg Roedel, Raghavendra

On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
>> Currently the Pause Loop Exit (PLE) handler is doing a directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
> [...]
> Honestly, I'm not confident addressing this problem will improve the
> ebizzy score. That workload is so erratic for me, that I do not trust
> the results at all. I have however seen consistent improvements in
> disabling PLE for a http guest workload and a very high IOPS guest
> workload, both with much time spent in host in the double runqueue lock
> for yield_to(), so that's why I still gravitate toward that issue.
Detailed result

Base + Rik patch
================

ebizzy
------
overcommit 1x
1160 records/s real 60.00 s user  6.28 s sys 1078.69 s
1130 records/s real 60.00 s user  5.15 s sys 1080.51 s
1073 records/s real 60.00 s user  5.02 s sys 1030.21 s
1151 records/s real 60.00 s user  5.51 s sys 1097.63 s
1145 records/s real 60.00 s user  5.21 s sys 1093.56 s
1149 records/s real 60.00 s user  5.32 s sys 1097.30 s
1111 records/s real 60.00 s user  5.16 s sys 1061.77 s
1115 records/s real 60.00 s user  5.16 s sys 1066.99 s

overcommit 2x
1818 records/s real 60.00 s user 11.67 s sys 843.84 s
1809 records/s real 60.00 s user 11.77 s sys 845.68 s
1865 records/s real 60.00 s user 11.94 s sys 866.69 s
1822 records/s real 60.00 s user 12.81 s sys 843.05 s
1928 records/s real 60.00 s user 14.02 s sys 887.86 s
1915 records/s real 60.00 s user 11.55 s sys 888.68 s
1997 records/s real 60.00 s user 11.34 s sys 923.54 s
1985 records/s real 60.00 s user 11.41 s sys 923.44 s

kernbench
---------
overcommit 1x
Elapsed Time 49.2367 (33.6921) User Time 243.313 (343.965) System Time 385.21 (125.151) Percent CPU 1243.33 (79.5257) Context Switches 58450.7 (31603.6) Sleeps 73987 (41782.5)
Elapsed Time 47.8367 (37.2156) User Time 244.79 (349.112) System Time 338.553 (141.732) Percent CPU 1181 (81.074) Context Switches 56194.3 (36421.6) Sleeps 74355.3 (40263.5)
Elapsed Time 49.6067 (34.7325) User Time 250.117 (354.008) System Time 341.277 (57.5594) Percent CPU 1197 (46.3573) Context Switches 55520.3 (27748.1) Sleeps 72673 (38997.4)
Elapsed Time 50.24 (36.6571) User Time 247.873 (352.427) System Time 349.11 (79.4226) Percent CPU 1193.67 (50.362) Context Switches 55153.3 (27926.2) Sleeps 73128 (39532.4)

overcommit 2x
Elapsed Time 91.9233 (96.6304) User Time 278.347 (371.217) System Time 222.447 (181.378) Percent CPU 521.667 (46.1988) Context Switches 49597 (35766.4) Sleeps 77939.7 (36840.1)
Elapsed Time 89.48 (92.7224) User Time 275.223 (364.737) System Time 202.473 (172.233) Percent CPU 497.333 (53.0031) Context Switches 44117 (30001) Sleeps 77196 (35746.2)
Elapsed Time 93.6133 (95.7924) User Time 294.767 (379.39) System Time 235.487 (207.567) Percent CPU 529.667 (58.2866) Context Switches 50588 (36669.4) Sleeps 79323.7 (38285.8)
Elapsed Time 92.7267 (100.928) User Time 286.537 (384.253) System Time 232.983 (192.233) Percent CPU 552 (76.961) Context Switches 51071 (35090) Sleeps 79059 (36466.4)

sysbench
--------
overcommit 1x
total time: 12.1229s  total number of events: 100041  total time taken by event execution: 772.8819
total time: 12.0775s  total number of events: 100013  total time taken by event execution: 769.5969
total time: 12.1671s  total number of events: 100011  total time taken by event execution: 775.5967
total time: 12.2695s  total number of events: 100003  total time taken by event execution: 782.3780
total time: 12.1526s  total number of events: 100014  total time taken by event execution: 773.9802
total time: 12.3350s  total number of events: 100069  total time taken by event execution: 786.2091
total time: 12.1019s  total number of events: 100013  total time taken by event execution: 771.5163
total time: 12.0716s  total number of events: 100010  total time taken by event execution: 769.8809

overcommit 2x
total time: 13.6532s  total number of events: 100011  total time taken by event execution: 870.0869
total time: 15.8572s  total number of events: 100010  total time taken by event execution: 910.6689
total time: 13.6100s  total number of events: 100008  total time taken by event execution: 867.1782
total time: 15.4295s  total number of events: 100008  total time taken by event execution: 917.8441
total time: 13.8994s  total number of events: 100004  total time taken by event execution: 885.6729
total time: 14.2006s  total number of events: 100005  total time taken by event execution: 887.0262
total time: 13.8869s  total number of events: 100011  total time taken by event execution: 885.3583
total time: 13.9183s  total number of events: 100007  total time taken by event execution: 880.4344

With Rik + PLE handler optimization patch
=========================================

ebizzy
------
overcommit 1x
2249 records/s real 60.00 s user  9.87 s sys 1529.54 s
2316 records/s real 60.00 s user 10.51 s sys 1550.33 s
2353 records/s real 60.00 s user 10.82 s sys 1565.10 s
2365 records/s real 60.00 s user 10.88 s sys 1569.00 s
2282 records/s real 60.00 s user 10.77 s sys 1540.03 s
2292 records/s real 60.00 s user 10.60 s sys 1553.76 s
2272 records/s real 60.00 s user 10.44 s sys 1510.90 s
2404 records/s real 60.00 s user 10.96 s sys 1563.49 s

overcommit 2x
2454 records/s real 60.00 s user 14.66 s sys 880.17 s
2192 records/s real 60.00 s user 15.56 s sys 881.12 s
2329 records/s real 60.00 s user 17.56 s sys 933.03 s
2281 records/s real 60.00 s user 16.22 s sys 925.34 s
2286 records/s real 60.00 s user 16.93 s sys 902.04 s
2289 records/s real 60.00 s user 15.53 s sys 909.78 s
2586 records/s real 60.00 s user 15.38 s sys 857.22 s
2675 records/s real 60.00 s user 15.93 s sys 842.40 s

kernbench
---------
overcommit 1x
Elapsed Time 36.6633 (33.6422) User Time 248.303 (359.64) System Time 123.003 (67.1702) Percent CPU 864 (242.52) Context Switches 44936.3 (28799.8) Sleeps 76076.7 (41142.1)
Elapsed Time 37.9167 (37.3285) User Time 247.517 (358.659) System Time 118.883 (86.7824) Percent CPU 807.333 (245.133) Context Switches 44219.3 (29480.9) Sleeps 77137.3 (42685.4)
Elapsed Time 39.65 (39.0432) User Time 248.07 (357.765) System Time 100.76 (58.7603) Percent CPU 748.333 (199.803) Context Switches 42332.3 (27183.7) Sleeps 75248.7 (41084.4)
Elapsed Time 39.2867 (39.8316) User Time 245.903 (356.194) System Time 101.783 (60.4971) Percent CPU 762.667 (186.827) Context Switches 42289.3 (24882.1) Sleeps 74964.7 (38139.1)

overcommit 2x
Elapsed Time 85.6567 (92.092) User Time 274.607 (370.598) System Time 172.12 (134.705) Percent CPU 496.667 (34.2977) Context Switches 45715.7 (29180.4) Sleeps 76054 (34844.5)
Elapsed Time 86.8667 (92.72) User Time 278.767 (365.877) System Time 193.277 (142.811) Percent CPU 538.667 (36.5558) Context Switches 48035.3 (32107.3) Sleeps 78004.7 (37835.6)
Elapsed Time 87.38 (91.6723) User Time 269.133 (374.608) System Time 165.283 (122.423) Percent CPU 465.667 (119.068) Context Switches 45107.3 (29571.6) Sleeps 76942.7 (33102.4)
Elapsed Time 83.6333 (96.6314) User Time 267.97 (374.691) System Time 156.843 (123.183) Percent CPU 503 (28.5832) Context Switches 44406.7 (30002.8) Sleeps 78975.7 (40787.4)

sysbench
--------
overcommit 1x
total time: 11.7338s  total number of events: 100021  total time taken by event execution: 747.8628
total time: 11.9323s  total number of events: 100006  total time taken by event execution: 760.7567
total time: 12.0282s  total number of events: 100068  total time taken by event execution: 766.2259
total time: 12.0065s  total number of events: 100010  total time taken by event execution: 765.0691
total time: 12.2033s  total number of events: 100016  total time taken by event execution: 777.9971
total time: 12.2472s  total number of events: 100041  total time taken by event execution: 780.9914
total time: 12.4853s  total number of events: 100015  total time taken by event execution: 795.9082
total time: 12.7028s  total number of events: 100015  total time taken by event execution: 810.4563

overcommit 2x
total time: 13.7335s  total number of events: 100005  total time taken by event execution: 872.0665
total time: 14.0005s  total number of events: 100010  total time taken by event execution: 892.4587
total time: 13.8066s  total number of events: 100008  total time taken by event execution: 880.2714
total time: 14.6350s  total number of events: 100006  total time taken by event execution: 875.3052
total time: 13.8536s  total number of events: 100007  total time taken by event execution: 877.8040
total time: 15.7213s  total number of events: 100007  total time taken by event execution: 896.5455
total time: 13.9135s  total number of events: 100007  total time taken by event execution: 882.0964
total time: 13.8390s  total number of events: 100009  total time taken by event execution: 881.8267

^ permalink raw reply [flat|nested] 52+ messages in thread
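As a sanity check, the summary style of the cover letter (mean throughput and percent improvement) can be recomputed from the raw ebizzy 1x runs quoted in this message. The helper below is purely illustrative; only the throughput numbers are taken from the runs above.

```c
/* Throughput (records/s) of the eight ebizzy 1x runs quoted above. */
static const double ebizzy_base_1x[8] = {
	1160, 1130, 1073, 1151, 1145, 1149, 1111, 1115
};
static const double ebizzy_patched_1x[8] = {
	2249, 2316, 2353, 2365, 2282, 2292, 2272, 2404
};

static double mean(const double *v, int n)
{
	double s = 0.0;
	for (int i = 0; i < n; i++)
		s += v[i];
	return s / n;
}

static double pct_improve(const double *base, const double *patched, int n)
{
	/* throughput: higher is better */
	return 100.0 * (mean(patched, n) - mean(base, n)) / mean(base, n);
}
```

On these runs the patched mean (~2317 records/s) is roughly double the base mean (~1129 records/s), consistent with the large ebizzy gains reported in this thread.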
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09 21:47 ` Andrew Theurer
@ 2012-07-10 11:54 ` Raghavendra K T
  2012-07-10 13:27   ` Andrew Theurer
  -1 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-10 11:54 UTC (permalink / raw)
To: habanero
Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
    Avi Kivity, Rik van Riel, S390, Carsten Otte, Christian Borntraeger,
    KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390,
    Srivatsa Vaddagiri, Joerg Roedel

On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
>> Currently the Pause Loop Exit (PLE) handler is doing a directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
>
> Hi, Raghu.
> [...]
>
> Can you briefly explain the 1x and 2x configs? This of course is highly
> dependent whether or not HT is enabled...

Sorry if I had not made it very clear in earlier threads. Have you applied
Rik's following patch to the base? Without it, you could perhaps see some
inconsistent results.

https://lkml.org/lkml/2012/6/19/401

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-10 11:54 ` [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-10 13:27 ` Andrew Theurer
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Theurer @ 2012-07-10 13:27 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
    Avi Kivity, Rik van Riel, S390, Carsten Otte, Christian Borntraeger,
    KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390,
    Srivatsa Vaddagiri, Joerg Roedel

On Tue, 2012-07-10 at 17:24 +0530, Raghavendra K T wrote:
> On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> >> Currently the Pause Loop Exit (PLE) handler is doing a directed yield to a
> >> random VCPU on PL exit. Though we already have filtering while choosing
> >> the candidate to yield_to, we can do better.
> >
> > Hi, Raghu.
> [...]
> > Can you briefly explain the 1x and 2x configs? This of course is highly
> > dependent whether or not HT is enabled...
>
> Sorry if I had not made it very clear in earlier threads. Have you applied
> Rik's following patch to the base? Without it, you could perhaps see some
> inconsistent results.
>
> https://lkml.org/lkml/2012/6/19/401

Yes, I do have that applied with your patch and in my baseline.

-Andrew Theurer

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-09 21:47 ` Andrew Theurer ` (3 preceding siblings ...) (?) @ 2012-07-11 9:00 ` Avi Kivity 2012-07-11 13:59 ` Raghavendra K T -1 siblings, 1 reply; 52+ messages in thread From: Avi Kivity @ 2012-07-11 9:00 UTC (permalink / raw) To: habanero Cc: Raghavendra K T, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/10/2012 12:47 AM, Andrew Theurer wrote: > > For the cpu threads in the host that are actually active (in this case > 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This > is for a no-IO workload, so that's just incredible to see so much cpu > wasted. I feel that 2 important areas to tackle are a more scalable > yield_to() and reducing the number of pause exits itself (hopefully by > just tuning ple_window for the latter). One thing we can do is autotune ple_window. If a ple exit fails to wake anybody (because all vcpus are either running, sleeping, or in ple exits) then we deduce we are not overcommitted and we can increase the ple window. There's the question of how to decrease it again though. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-11 9:00 ` Avi Kivity
@ 2012-07-11 13:59 ` Raghavendra K T
  2012-07-11 14:01   ` Raghavendra K T
  0 siblings, 1 reply; 52+ messages in thread
From: Raghavendra K T @ 2012-07-11 13:59 UTC (permalink / raw)
To: Avi Kivity
Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti,
    Ingo Molnar, Rik van Riel, S390, Carsten Otte, Christian Borntraeger,
    KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390,
    Srivatsa Vaddagiri, Joerg Roedel

On 07/11/2012 02:30 PM, Avi Kivity wrote:
> On 07/10/2012 12:47 AM, Andrew Theurer wrote:
>>
>> For the cpu threads in the host that are actually active (in this case
>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
>> is for a no-IO workload, so that's just incredible to see so much cpu
>> wasted. I feel that 2 important areas to tackle are a more scalable
>> yield_to() and reducing the number of pause exits itself (hopefully by
>> just tuning ple_window for the latter).
>
> One thing we can do is autotune ple_window. If a ple exit fails to wake
> anybody (because all vcpus are either running, sleeping, or in ple
> exits) then we deduce we are not overcommitted and we can increase the
> ple window. There's the question of how to decrease it again though.

I see a problem here, if I interpret the situation correctly. What
happens if we have two guests, one with no over-commit and the other
with high over-commit? (Except when we have gang scheduling.) Rather, we
should have something tied to the VM instead of a rigid PLE window.

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 13:59 ` Raghavendra K T @ 2012-07-11 14:01 ` Raghavendra K T 2012-07-12 8:15 ` Avi Kivity 0 siblings, 1 reply; 52+ messages in thread From: Raghavendra K T @ 2012-07-11 14:01 UTC (permalink / raw) To: Avi Kivity Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/11/2012 07:29 PM, Raghavendra K T wrote: > On 07/11/2012 02:30 PM, Avi Kivity wrote: >> On 07/10/2012 12:47 AM, Andrew Theurer wrote: >>> >>> For the cpu threads in the host that are actually active (in this case >>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This >>> is for a no-IO workload, so that's just incredible to see so much cpu >>> wasted. I feel that 2 important areas to tackle are a more scalable >>> yield_to() and reducing the number of pause exits itself (hopefully by >>> just tuning ple_window for the latter). >> >> One thing we can do is autotune ple_window. If a ple exit fails to wake >> anybody (because all vcpus are either running, sleeping, or in ple >> exits) then we deduce we are not overcommitted and we can increase the >> ple window. There's the question of how to decrease it again though. >> > > I see some problem here, If I interpret situation correctly. What > happens if we have two guests with one VM having no over-commit and > other with high over-commit. (except when we have gang scheduling). > Sorry, I meant less load and high load inside the guest. > Rather we should have something tied to VM rather than rigid PLE > window. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-11 14:01 ` Raghavendra K T @ 2012-07-12 8:15 ` Avi Kivity 2012-07-12 8:25 ` Raghavendra K T 0 siblings, 1 reply; 52+ messages in thread From: Avi Kivity @ 2012-07-12 8:15 UTC (permalink / raw) To: Raghavendra K T Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/11/2012 05:01 PM, Raghavendra K T wrote: > On 07/11/2012 07:29 PM, Raghavendra K T wrote: >> On 07/11/2012 02:30 PM, Avi Kivity wrote: >>> On 07/10/2012 12:47 AM, Andrew Theurer wrote: >>>> >>>> For the cpu threads in the host that are actually active (in this case >>>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This >>>> is for a no-IO workload, so that's just incredible to see so much cpu >>>> wasted. I feel that 2 important areas to tackle are a more scalable >>>> yield_to() and reducing the number of pause exits itself (hopefully by >>>> just tuning ple_window for the latter). >>> >>> One thing we can do is autotune ple_window. If a ple exit fails to wake >>> anybody (because all vcpus are either running, sleeping, or in ple >>> exits) then we deduce we are not overcommitted and we can increase the >>> ple window. There's the question of how to decrease it again though. >>> >> >> I see some problem here, If I interpret situation correctly. What >> happens if we have two guests with one VM having no over-commit and >> other with high over-commit. (except when we have gang scheduling). >> > Sorry, I meant less load and high load inside the guest. > >> Rather we should have something tied to VM rather than rigid PLE >> window. The problem occurs even with no overcommit at all. One vcpu is in a legitimately long pause loop. All those exits accomplish nothing, since all vcpus are scheduled. Better to let it spin in guest mode. 
All those exits accomplish nothing, since all vcpus are scheduled. Better to let it spin in guest mode.
-- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-12 8:15 ` Avi Kivity @ 2012-07-12 8:25 ` Raghavendra K T 2012-07-12 12:31 ` Avi Kivity 0 siblings, 1 reply; 52+ messages in thread From: Raghavendra K T @ 2012-07-12 8:25 UTC (permalink / raw) To: Avi Kivity Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/12/2012 01:45 PM, Avi Kivity wrote: > On 07/11/2012 05:01 PM, Raghavendra K T wrote: >> On 07/11/2012 07:29 PM, Raghavendra K T wrote: >>> On 07/11/2012 02:30 PM, Avi Kivity wrote: >>>> On 07/10/2012 12:47 AM, Andrew Theurer wrote: >>>>> >>>>> For the cpu threads in the host that are actually active (in this case >>>>> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This >>>>> is for a no-IO workload, so that's just incredible to see so much cpu >>>>> wasted. I feel that 2 important areas to tackle are a more scalable >>>>> yield_to() and reducing the number of pause exits itself (hopefully by >>>>> just tuning ple_window for the latter). >>>> >>>> One thing we can do is autotune ple_window. If a ple exit fails to wake >>>> anybody (because all vcpus are either running, sleeping, or in ple >>>> exits) then we deduce we are not overcommitted and we can increase the >>>> ple window. There's the question of how to decrease it again though. >>>> >>> >>> I see some problem here, If I interpret situation correctly. What >>> happens if we have two guests with one VM having no over-commit and >>> other with high over-commit. (except when we have gang scheduling). >>> >> Sorry, I meant less load and high load inside the guest. >> >>> Rather we should have something tied to VM rather than rigid PLE >>> window. > > The problem occurs even with no overcommit at all. One vcpu is in a > legitimately long pause loop. 
> All those exits accomplish nothing, since
> all vcpus are scheduled. Better to let it spin in guest mode.

I agree. One idea is that we can have a scan_window to limit the scan of
all n vcpus each time we enter vcpu_spin, set to say 2*log(n) initially;
then the algorithm would be:

if (yield fails)    increase ple_window, increase scan_window
if (yield succeeds) decrease ple_window, decrease scan_window

and we have to set limits on the max and min scan_window and on the max
and min ple_window.

^ permalink raw reply [flat|nested] 52+ messages in thread
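A minimal sketch of the adaptive scheme proposed above: grow ple_window (and the scan window) when a directed yield finds nobody to wake, shrink both on success, clamped to fixed bounds. All constants and names here are invented for illustration; this is not code from the series or from KVM.

```c
#define PLE_WINDOW_MIN 4096u
#define PLE_WINDOW_MAX (16u * 4096u)
#define SCAN_MIN       4u
#define SCAN_MAX       64u

static unsigned int ple_window  = 4096u;
static unsigned int scan_window = 8u;   /* ~2*log2(n) for a 16-vcpu guest */

static unsigned int clamp_uint(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

static void tune_after_yield(int yield_succeeded)
{
	if (yield_succeeded) {
		/* overcommit detected: exit sooner, scan fewer vcpus */
		ple_window  = clamp_uint(ple_window / 2, PLE_WINDOW_MIN, PLE_WINDOW_MAX);
		scan_window = clamp_uint(scan_window - 1, SCAN_MIN, SCAN_MAX);
	} else {
		/* nobody worth waking: likely undercommitted, spin longer */
		ple_window  = clamp_uint(ple_window * 2, PLE_WINDOW_MIN, PLE_WINDOW_MAX);
		scan_window = clamp_uint(scan_window + 1, SCAN_MIN, SCAN_MAX);
	}
}
```

Multiplicative growth/shrink is just one choice; the thread leaves open both the update rule and how a per-VM (rather than global) window would be maintained.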
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler 2012-07-12 8:25 ` Raghavendra K T @ 2012-07-12 12:31 ` Avi Kivity 0 siblings, 0 replies; 52+ messages in thread From: Avi Kivity @ 2012-07-12 12:31 UTC (permalink / raw) To: Raghavendra K T Cc: habanero, H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar, Rik van Riel, S390, Carsten Otte, Christian Borntraeger, KVM, chegu vinod, LKML, X86, Gleb Natapov, linux390, Srivatsa Vaddagiri, Joerg Roedel On 07/12/2012 11:25 AM, Raghavendra K T wrote: >> >> The problem occurs even with no overcommit at all. One vcpu is in a >> legitimately long pause loop. All those exits accomplish nothing, since >> all vcpus are scheduled. Better to let it spin in guest mode. >> > > I agree. One idea is we can have a scan_window to limit the scan of all > n vcpus each time we enter vcpu_spin, to say 2*log n initially; Not sure I agree. The subset that we scan is in no way special, there's no reason to suppose it would be effective. We can make the loop exit time scale with the number of vcpus to account for the greater effort needed to wake a vcpu. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 52+ messages in thread
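One way to read the suggestion above, that the loop-exit time should scale with the number of vcpus, is a PLE window that grows linearly with guest size, so a vcpu in a larger guest spins proportionally longer before exiting. Base and slope below are made-up values for illustration, not anything from KVM.

```c
#define PLE_WINDOW_BASE     4096u   /* hypothetical spin budget for a tiny guest */
#define PLE_WINDOW_PER_VCPU  128u   /* hypothetical extra budget per vcpu */

/* Spin budget (pause-loop iterations) before a PL exit, by guest size. */
static unsigned int ple_window_for(unsigned int num_vcpus)
{
	return PLE_WINDOW_BASE + PLE_WINDOW_PER_VCPU * num_vcpus;
}
```

The rationale is that with more vcpus, finding and waking the right candidate costs more, so each exit should be amortized over a longer spin.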
* Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
  2012-07-09 6:20 [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler Raghavendra K T
@ 2012-07-09 22:28 ` Rik van Riel
  4 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2012-07-09 22:28 UTC (permalink / raw)
To: Raghavendra K T
Cc: H. Peter Anvin, Thomas Gleixner, Marcelo Tosatti, Ingo Molnar,
    Avi Kivity, S390, Carsten Otte, Christian Borntraeger, KVM,
    chegu vinod, Andrew M. Theurer, LKML, X86, Gleb Natapov, linux390,
    Srivatsa Vaddagiri, Joerg Roedel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:
> Currently the Pause Loop Exit (PLE) handler is doing a directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.
>
> The problem is, for large-vcpu guests, we have a higher probability of
> yielding to a bad vcpu. We are not able to prevent a directed yield to
> the same vcpu that has done a PL exit recently, which perhaps spins
> again and wastes CPU.
>
> Fix that by keeping track of who has done a PL exit. The algorithm in
> this series gives a chance to a VCPU which has:
>
> (a) not done a PLE exit at all (probably it is a preempted lock holder)
>
> (b) been skipped in the last iteration because it did a PL exit, and
> probably has become eligible now (next eligible lock holder)
>
> Future enhancements:

Your patch series looks good to me. Simple changes with a significant
result. However, the simple heuristic could use some comments :)

-- All rights reversed

^ permalink raw reply [flat|nested] 52+ messages in thread