linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
@ 2012-09-21 11:59 Raghavendra K T
  2012-09-21 12:00 ` [PATCH RFC 1/2] kvm: Handle undercommitted guest case " Raghavendra K T
                   ` (3 more replies)
  0 siblings, 4 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-21 11:59 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

In some special scenarios, such as #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and burn CPU on unsuccessful yield_to attempts.

One idea to solve this, as Avi had proposed, is to modify the hardware
ple_window dynamically to avoid frequent PLE exits. (IMHO, it is
difficult to decide on a window when we have a mixed set of VMs.)

Another idea, implemented in the first patch, is to identify the
non-overcommitted case and simply return from the PLE handler.

There are many ways to identify a non-overcommit scenario:
1) Using loadavg etc. (get_avenrun/calc_global_load
 /this_cpu_load)

2) Explicitly checking nr_running()/num_online_cpus()

3) Checking the source vcpu's runqueue length.

I am not sure how to make effective use of (1). (2) has significant
overhead since it iterates over all cpus, so this patch uses the third
method; a rough sketch of (2) versus (3) follows. (I feel it is ugly to
export the runqueue length, but I am expecting suggestions on this.)
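
A minimal sketch of (2) versus (3) side by side (illustration only; it
assumes the code lives in kernel/sched/core.c, where this_rq() is
visible):

  /* (2): accurate, but sums a per-cpu counter across every online cpu */
  static bool undercommitted_global(void)
  {
  	return nr_running() <= num_online_cpus();
  }

  /* (3): O(1), what patch 1 uses, but only sees the local runqueue */
  static bool undercommitted_local(void)
  {
  	return this_rq()->nr_running == 1;
  }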

In the second patch: when we have a large number of small guests, it
is possible that a spinning vcpu fails to yield_to any vcpu of the same
VM and goes back to spinning. This is also not effective when we are
over-committed. Instead, we do a schedule() so that we give other VMs a
chance to run.

Raghavendra K T (2):
 Handle undercommitted guest case in PLE handler
 Be courteous to other VMs in overcommitted scenario in PLE handler 

Results:
base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 cores, HT enabled (32 cpu threads).
32-vcpu guest with 8GB RAM.

+-----------+-----------+-----------+------------+-----------+
         ebizzy (records/sec, higher is better)
+-----------+-----------+-----------+------------+-----------+
   base        stddev      patched     stddev      %improve
+-----------+-----------+-----------+------------+-----------+
 11293.3750   624.4378	 18209.6250   371.7061	  61.24166
  3641.8750   468.9400	  3725.5000   253.7823	   2.29621
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
        kernbench (time in sec, lower is better)
+-----------+-----------+-----------+------------+-----------+
   base        stddev      patched     stddev      %improve
+-----------+-----------+-----------+------------+-----------+
    30.6020     1.3018	    30.8287     1.1517	  -0.74080
    64.0825     2.3764	    63.4721     5.0191	   0.95252
    95.8638     8.7030	    94.5988     8.3832	   1.31958
+-----------+-----------+-----------+------------+-----------+

Note:
On an mx3850x5 machine with 32 cores and HT disabled, I got around
ebizzy      209%
kernbench     6%
improvement for the 1x scenario.

Thanks to Srikar for his active participation in discussing ideas and
reviewing the patch.

Please let me know your suggestions and comments.
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    6 ++++++
 virt/kvm/kvm_main.c   |    7 +++++++
 3 files changed, 14 insertions(+), 0 deletions(-)


^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-21 11:59 [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler Raghavendra K T
@ 2012-09-21 12:00 ` Raghavendra K T
  2012-09-21 13:02   ` Rik van Riel
  2012-09-24 11:33   ` Peter Zijlstra
  2012-09-21 12:00 ` [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario " Raghavendra K T
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-21 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

When the total number of VCPUs in the system is less than or equal to
the number of physical CPUs, PLE exits become costly, since each VCPU
can have a dedicated PCPU and trying to find a target VCPU to yield_to
just burns time in the PLE handler.

This patch reduces that overhead by simply returning in such scenarios,
based on the length of the current cpu's runqueue.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    6 ++++++
 virt/kvm/kvm_main.c   |    3 +++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8c8664..3645458 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -138,6 +138,7 @@ extern int nr_threads;
 DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
 extern unsigned long nr_running(void);
+extern unsigned long rq_nr_running(void);
 extern unsigned long nr_uninterruptible(void);
 extern unsigned long nr_iowait(void);
 extern unsigned long nr_iowait_cpu(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2170b81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,12 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+unsigned long rq_nr_running(void)
+{
+	return this_rq()->nr_running;
+}
+EXPORT_SYMBOL(rq_nr_running);
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 28f00bc..8323685 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1629,6 +1629,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;
 
+	if (unlikely(rq_nr_running() == 1))
+		return;
+
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 11:59 [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler Raghavendra K T
  2012-09-21 12:00 ` [PATCH RFC 1/2] kvm: Handle undercommitted guest case " Raghavendra K T
@ 2012-09-21 12:00 ` Raghavendra K T
  2012-09-21 13:22   ` Rik van Riel
                     ` (2 more replies)
  2012-09-21 13:18 ` [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios " Chegu Vinod
  2012-09-24 11:34 ` Peter Zijlstra
  3 siblings, 3 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-21 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

When the PLE handler fails to find a better candidate to yield_to, it
goes back and spins again. This is acceptable when we are not
overcommitted.
But in overcommitted scenarios (especially with a large number of
small guests), it is better to yield.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 virt/kvm/kvm_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8323685..713b677 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 		}
 	}
+	/* In overcommitted cases, yield instead of spinning */
+	if (!yielded && rq_nr_running() > 1)
+		schedule();
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-21 12:00 ` [PATCH RFC 1/2] kvm: Handle undercommitted guest case " Raghavendra K T
@ 2012-09-21 13:02   ` Rik van Riel
  2012-09-21 17:24     ` Raghavendra K T
  2012-09-24 11:33   ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: Rik van Riel @ 2012-09-21 13:02 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/21/2012 08:00 AM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> When total number of VCPUs of system is less than or equal to physical CPUs,
> PLE exits become costly since each VCPU can have dedicated PCPU, and
> trying to find a target VCPU to yield_to just burns time in PLE handler.
>
> This patch reduces overhead, by simply doing a return in such scenarios by
> checking the length of current cpu runqueue.

I am not convinced this is the way to go.

The VCPU that is holding the lock, and is not releasing it,
probably got scheduled out. That implies that VCPU is on a
runqueue with at least one other task.

> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1629,6 +1629,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   	int pass;
>   	int i;
>
> +	if (unlikely(rq_nr_running() == 1))
> +		return;
> +
>   	kvm_vcpu_set_in_spin_loop(me, true);
>   	/*
>   	 * We boost the priority of a VCPU that is runnable but not
>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-21 11:59 [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler Raghavendra K T
  2012-09-21 12:00 ` [PATCH RFC 1/2] kvm: Handle undercommitted guest case " Raghavendra K T
  2012-09-21 12:00 ` [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario " Raghavendra K T
@ 2012-09-21 13:18 ` Chegu Vinod
  2012-09-21 17:36   ` Raghavendra K T
  2012-09-24 11:34 ` Peter Zijlstra
  3 siblings, 1 reply; 126+ messages in thread
From: Chegu Vinod @ 2012-09-21 13:18 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 9/21/2012 4:59 AM, Raghavendra K T wrote:
> In some special scenarios like #vcpu <= #pcpu, PLE handler may
> prove very costly,

Yes.
>   because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU.
>
> An idea to solve this is:
> 1) As Avi had proposed we can modify hardware ple_window
> dynamically to avoid frequent PL-exit.

Yes. We had to do this to get around some scaling issues for large
(>20-way) guests (with no overcommitment).

As part of some experimentation we even tried "switching off" PLE too :(



> (IMHO, it is difficult to
> decide when we have mixed type of VMs).

Agree.

Not sure if the following alternatives have also been looked at (a
sketch of the first follows):

- Could the behavior associated with the "ple_window" be modified to
be a function of some [new] per-guest attribute (which can be conveyed
to the host as part of the guest launch sequence)? The user can choose
to set this [new] attribute for a given guest. This would help avoid
the frequent exits due to PLE (as Avi had mentioned earlier).

- Can the PLE feature (in VT) be "enhanced" to be made a per-guest
attribute?
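
For illustration only, such a knob might look like the hypothetical VM
ioctl below (KVM_SET_PLE_WINDOW and kvm->arch.ple_window are invented
names, not an existing interface):

  	case KVM_SET_PLE_WINDOW: {	/* hypothetical VM ioctl */
  		u32 window;

  		r = -EFAULT;
  		if (copy_from_user(&window, argp, sizeof(window)))
  			goto out;
  		kvm->arch.ple_window = window;	/* applied at vcpu setup */
  		r = 0;
  		break;
  	}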


IMHO, the approach of not taking a frequent exit is better than taking
an exit and returning from the handler, etc.

Thanks
Vinod




>
> Another idea, proposed in the first patch, is to identify
> non-overcommit case and just return from the PLE handler.
>
> There are many ways to identify a non-overcommit scenario.
> 1) Using loadavg etc (get_avenrun/calc_global_load
>   /this_cpu_load)
>
> 2) Explicitly check nr_running()/num_online_cpus()
>
> 3) Check source vcpu runqueue length.
>
> Not sure how we can make use of (1) effectively / how to use it.
> (2) has significant overhead since it iterates over all cpus,
> so this patch uses the third method. (I feel it is ugly to export
> the runqueue length, but I am expecting suggestions on this.)
>
> In the second patch: when we have a large number of small guests, it
> is possible that a spinning vcpu fails to yield_to any vcpu of the
> same VM and goes back to spinning. This is also not effective when we
> are over-committed. Instead, we do a schedule() so that we give other
> VMs a chance to run.
>
> Raghavendra K T(2):
>   Handle undercommitted guest case in PLE handler
>   Be courteous to other VMs in overcommitted scenario in PLE handler
>
> Results:
> base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
> patched = base + patch1 + patch2
> machine: x240 with 16 core with HT enabled (32 cpu thread).
> 32 vcpu guest with 8GB RAM.
>
> +-----------+-----------+-----------+------------+-----------+
>           ebizzy (record/sec higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stddev       patched    stdev        %improve
> +-----------+-----------+-----------+------------+-----------+
>   11293.3750   624.4378	 18209.6250   371.7061	  61.24166
>    3641.8750   468.9400	  3725.5000   253.7823	   2.29621
> +-----------+-----------+-----------+------------+-----------+
>
> +-----------+-----------+-----------+------------+-----------+
>          kernbench (time in sec lower is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stddev       patched    stdev        %improve
> +-----------+-----------+-----------+------------+-----------+
>      30.6020     1.3018	    30.8287     1.1517	  -0.74080
>      64.0825     2.3764	    63.4721     5.0191	   0.95252
>      95.8638     8.7030	    94.5988     8.3832	   1.31958
> +-----------+-----------+-----------+------------+-----------+
>
> Note:
> on mx3850x5 machine with 32 cores HT disabled I got around
> ebizzy      209%
> kernbench   6%
> improvement for 1x scenario.
>
> Thanks to Srikar for his active participation in discussing ideas and
> reviewing the patch.
>
> Please let me know your suggestions and comments.
> ---
>   include/linux/sched.h |    1 +
>   kernel/sched/core.c   |    6 ++++++
>   virt/kvm/kvm_main.c   |    7 +++++++
>   3 files changed, 14 insertions(+), 0 deletions(-)
>


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 12:00 ` [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario " Raghavendra K T
@ 2012-09-21 13:22   ` Rik van Riel
  2012-09-21 13:46   ` Takuya Yoshikawa
  2012-09-24 15:26   ` Avi Kivity
  2 siblings, 0 replies; 126+ messages in thread
From: Rik van Riel @ 2012-09-21 13:22 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/21/2012 08:00 AM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 12:00 ` [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario " Raghavendra K T
  2012-09-21 13:22   ` Rik van Riel
@ 2012-09-21 13:46   ` Takuya Yoshikawa
  2012-09-21 13:52     ` Rik van Riel
  2012-09-24 15:26   ` Avi Kivity
  2 siblings, 1 reply; 126+ messages in thread
From: Takuya Yoshikawa @ 2012-09-21 13:46 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Fri, 21 Sep 2012 17:30:20 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
> 
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8323685..713b677 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  		}
>  	}
> +	/* In overcommitted cases, yield instead of spinning */
> +	if (!yielded && rq_nr_running() > 1)
> +		schedule();

How about doing cond_resched() instead?

I'm not sure whether checking more sched stuff in KVM code is a
good thing.
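
(For reference, cond_resched() is essentially a no-op unless a
reschedule is already pending; trimmed from kernel/sched/core.c of
that era:

  int __sched _cond_resched(void)
  {
  	if (should_resched()) {		/* resched already pending? */
  		__cond_resched();	/* does the actual schedule() */
  		return 1;
  	}
  	return 0;
  }

so in this path, where nothing has set TIF_NEED_RESCHED, it would
usually do nothing.)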

	Takuya

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 13:46   ` Takuya Yoshikawa
@ 2012-09-21 13:52     ` Rik van Riel
  2012-09-21 17:45       ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Rik van Riel @ 2012-09-21 13:52 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Raghavendra K T, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
> On Fri, 21 Sep 2012 17:30:20 +0530
> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> When PLE handler fails to find a better candidate to yield_to, it
>> goes back and does spin again. This is acceptable when we do not
>> have overcommit.
>> But in overcommitted scenarios (especially when we have large
>> number of small guests), it is better to yield.
>>
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   virt/kvm/kvm_main.c |    4 ++++
>>   1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 8323685..713b677 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   			}
>>   		}
>>   	}
>> +	/* In overcommitted cases, yield instead of spinning */
>> +	if (!yielded && rq_nr_running() > 1)
>> +		schedule();
>
> How about doing cond_resched() instead?

Actually, an actual call to yield() may be better.

That will set scheduler hints to make the scheduler pick
another task for one round, while preserving this task's
top position in the runqueue.
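
For reference, the CFS side of that looks roughly like this (paraphrased
and trimmed from kernel/sched/fair.c of that era; not new code):

  /* yield() ends up here: current is marked as the "skip" buddy, which
   * pick_next_entity() then avoids for one pick, without requeueing
   * the task. */
  static void yield_task_fair(struct rq *rq)
  {
  	struct task_struct *curr = rq->curr;
  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
  	struct sched_entity *se = &curr->se;

  	if (unlikely(rq->nr_running == 1))
  		return;			/* nobody else to run anyway */

  	clear_buddies(cfs_rq, se);
  	set_skip_buddy(se);		/* the scheduler hint Rik mentions */
  }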

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-21 13:02   ` Rik van Riel
@ 2012-09-21 17:24     ` Raghavendra K T
  2012-09-24 15:41       ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-21 17:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/21/2012 06:32 PM, Rik van Riel wrote:
> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> When total number of VCPUs of system is less than or equal to physical
>> CPUs,
>> PLE exits become costly since each VCPU can have dedicated PCPU, and
>> trying to find a target VCPU to yield_to just burns time in PLE handler.
>>
>> This patch reduces overhead, by simply doing a return in such
>> scenarios by
>> checking the length of current cpu runqueue.
>
> I am not convinced this is the way to go.
>
> The VCPU that is holding the lock, and is not releasing it,
> probably got scheduled out. That implies that VCPU is on a
> runqueue with at least one other task.

I see your point here, we have two cases:

case 1)

rq1 : vcpu1->wait(lockA) (spinning)
rq2 : vcpu2->holding(lockA) (running)

Ideally, vcpu1 should not enter the PLE handler here, since it would
surely get the lock within a ple_window's worth of cycles (assuming
ple_window is perfectly tuned for that workload).

Maybe this explains why we are not seeing a benefit with kernbench.

On the other side, since we cannot have a perfect ple_window tuned for
all types of workloads, we gain for those workloads that may need more
than 4096 cycles. Is that what we are seeing in the benefited cases?

case 2)
rq1 : vcpu1->wait(lockA) (spinning)
rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]

I agree that checking rq1's length is not proper in this case, and as
you rightly pointed out, we are in trouble here.
nr_running()/num_online_cpus() would give a more accurate picture, but
it seemed costly. Maybe the load balancer saves us a bit here by not
running into such cases. (I agree the load balancer is far too
complex.)


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-21 13:18 ` [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios " Chegu Vinod
@ 2012-09-21 17:36   ` Raghavendra K T
  2012-09-24  8:42     ` Dor Laor
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-21 17:36 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/21/2012 06:48 PM, Chegu Vinod wrote:
> On 9/21/2012 4:59 AM, Raghavendra K T wrote:
>> In some special scenarios like #vcpu <= #pcpu, PLE handler may
>> prove very costly,
>
> Yes.
>> because there is no need to iterate over vcpus
>> and do unsuccessful yield_to burning CPU.
>>
>> An idea to solve this is:
>> 1) As Avi had proposed we can modify hardware ple_window
>> dynamically to avoid frequent PL-exit.
>
> Yes. We had to do this to get around some scaling issues for large
> (>20way) guests (with no overcommitment)

Do you mean you already have some solution tested for this?

>
> As part of some experimentation we even tried "switching off" PLE too :(
>

Honestly, this experiment of yours and Andrew Theurer's observations
were the motivation for this patch.

>
>
>> (IMHO, it is difficult to
>> decide when we have mixed type of VMs).
>
> Agree.
>
> Not sure if the following alternatives have also been looked at :
>
> - Could the behavior associated with the "ple_window" be modified to be
> a function of some [new] per-guest attribute (which can be conveyed to
> the host as part of the guest launch sequence). The user can choose to
> set this [new] attribute for a given guest. This would help avoid the
> frequent exits due to PLE (as Avi had mentioned earlier) ?

Ccing Drew also. We had a good discussion on this idea last time.
(Sorry that I forgot to include him on the patch series.)

Maybe a good idea when we know the load in advance..

>
> - Can the PLE feature ( in VT) be "enhanced" to be made a per guest
> attribute ?
>
>
> IMHO, the approach of not taking a frequent exit is better than taking
> an exit and returning back from the handler etc.

I entirely agree on this point (though I have not tried the above
approaches). Hope to see more expert opinions pouring in.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 13:52     ` Rik van Riel
@ 2012-09-21 17:45       ` Raghavendra K T
  2012-09-24 13:43         ` Takuya Yoshikawa
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-21 17:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Takuya Yoshikawa, Peter Zijlstra, H. Peter Anvin,
	Marcelo Tosatti, Ingo Molnar, Avi Kivity, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, chegu vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Gleb Natapov

On 09/21/2012 07:22 PM, Rik van Riel wrote:
> On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
>> On Fri, 21 Sep 2012 17:30:20 +0530
>> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>
>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> When PLE handler fails to find a better candidate to yield_to, it
>>> goes back and does spin again. This is acceptable when we do not
>>> have overcommit.
>>> But in overcommitted scenarios (especially when we have large
>>> number of small guests), it is better to yield.
>>>
>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> ---
>>> virt/kvm/kvm_main.c | 4 ++++
>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 8323685..713b677 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>> }
>>> }
>>> }
>>> + /* In overcommitted cases, yield instead of spinning */
>>> + if (!yielded && rq_nr_running() > 1)
>>> + schedule();
>>
>> How about doing cond_resched() instead?
>
> Actually, an actual call to yield() may be better.
>
> That will set scheduler hints to make the scheduler pick
> another task for one round, while preserving this task's
> top position in the runqueue.

I am not a scheduler expert, but I am also inclined towards
Rik's suggestion here since we set skip buddy here. Takuya?


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-21 17:36   ` Raghavendra K T
@ 2012-09-24  8:42     ` Dor Laor
  2012-09-24 12:02       ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Dor Laor @ 2012-09-24  8:42 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Chegu Vinod, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

In order to help PLE and pvticketlock converge, I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawns a new thread each
time you write to a /sys/.... entry.

Each such thread spins on a spin lock. The specific spin lock is also
chosen via the /sys/ interface. Let's say we have an array of spin
locks, 10 times the number of vcpus.

All the threads are running a

  while (1) {
    spin_lock(my_lock);
    sum += execute_dummy_cpu_computation(time);
    spin_unlock(my_lock);

    if (sys_tells_thread_to_die())
      break;
  }

  print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make 
the ticket lock order deterministic and known (like a linear walk of all 
the threads trying to catch that lock).

This way you can easy calculate:
  1. the score of a single vcpu running a single thread
  2. the score of sum of all thread scores when #thread==#vcpu all
     taking the same spin lock. The overall sum should be close as
     possible to #1.
  3. Like #2 but #threads > #vcpus and other versions of #total vcpus
     (belonging to all VMs)  > #pcpus.
  4. Create #thread == #vcpus but let each thread have it's own spin
     lock
  5. Like 4 + 2

Hopefully this will allow you to judge and evaluate the exact overhead
of scheduling VMs and threads, since you have the ideal result in hand
and you know what the threads are doing. A sketch of such a module
follows.
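
A minimal sketch of such a module's worker thread (illustrative only:
NR_LOCKS, compute_time and dummy_compute() are invented names, and the
deterministic ticket-order rework is left out):

  #include <linux/kthread.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  static spinlock_t lock_array[NR_LOCKS];	/* ~10 * #vcpus locks */

  static int spinner_fn(void *arg)
  {
  	long lock_idx = (long)arg;		/* chosen via the /sys entry */
  	u64 sum = 0;

  	while (!kthread_should_stop()) {	/* the "die" signal */
  		spin_lock(&lock_array[lock_idx]);
  		sum += dummy_compute(compute_time); /* invented helper */
  		spin_unlock(&lock_array[lock_idx]);
  	}
  	pr_info("spinner %ld: sum=%llu\n", lock_idx, sum);
  	return 0;
  }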

My 2 cents, Dor

On 09/21/2012 08:36 PM, Raghavendra K T wrote:
> On 09/21/2012 06:48 PM, Chegu Vinod wrote:
>> On 9/21/2012 4:59 AM, Raghavendra K T wrote:
>>> In some special scenarios like #vcpu <= #pcpu, PLE handler may
>>> prove very costly,
>>
>> Yes.
>>> because there is no need to iterate over vcpus
>>> and do unsuccessful yield_to burning CPU.
>>>
>>> An idea to solve this is:
>>> 1) As Avi had proposed we can modify hardware ple_window
>>> dynamically to avoid frequent PL-exit.
>>
>> Yes. We had to do this to get around some scaling issues for large
>> (>20way) guests (with no overcommitment)
>
> Do you mean you already have some solution tested for this?
>
>>
>> As part of some experimentation we even tried "switching off" PLE too :(
>>
>
> Honestly,
> Your this experiment and Andrew Theurer's observations were the
> motivation for this patch.
>
>>
>>
>>> (IMHO, it is difficult to
>>> decide when we have mixed type of VMs).
>>
>> Agree.
>>
>> Not sure if the following alternatives have also been looked at :
>>
>> - Could the behavior associated with the "ple_window" be modified to be
>> a function of some [new] per-guest attribute (which can be conveyed to
>> the host as part of the guest launch sequence). The user can choose to
>> set this [new] attribute for a given guest. This would help avoid the
>> frequent exits due to PLE (as Avi had mentioned earlier) ?
>
> Ccing Drew also. We had a good discussion on this idea last time.
> (sorry that I forgot to include in patch series)
>
> May be a good idea when we know the load in advance..
>
>>
>> - Can the PLE feature ( in VT) be "enhanced" to be made a per guest
>> attribute ?
>>
>>
>> IMHO, the approach of not taking a frequent exit is better than taking
>> an exit and returning back from the handler etc.
>
> I entirely agree on this point. (though have not tried above
> approaches). Hope to see more expert opinions pouring in.
>


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-21 12:00 ` [PATCH RFC 1/2] kvm: Handle undercommitted guest case " Raghavendra K T
  2012-09-21 13:02   ` Rik van Riel
@ 2012-09-24 11:33   ` Peter Zijlstra
  2012-09-24 11:40     ` Raghavendra K T
  1 sibling, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 11:33 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Avi Kivity, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Fri, 2012-09-21 at 17:30 +0530, Raghavendra K T wrote:
> +unsigned long rq_nr_running(void)
> +{
> +       return this_rq()->nr_running;
> +}
> +EXPORT_SYMBOL(rq_nr_running); 

Uhm,.. no, that's a horrible thing to export.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-21 11:59 [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler Raghavendra K T
                   ` (2 preceding siblings ...)
  2012-09-21 13:18 ` [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios " Chegu Vinod
@ 2012-09-24 11:34 ` Peter Zijlstra
  2012-09-24 11:52   ` Raghavendra K T
  3 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 11:34 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> In some special scenarios like #vcpu <= #pcpu, PLE handler may 
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU. 

What's the costly thing? The vm-exit, the yield (which should be a nop
if its the only task there) or something else entirely?



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 11:33   ` Peter Zijlstra
@ 2012-09-24 11:40     ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-24 11:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Avi Kivity, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 05:03 PM, Peter Zijlstra wrote:
> On Fri, 2012-09-21 at 17:30 +0530, Raghavendra K T wrote:
>> +unsigned long rq_nr_running(void)
>> +{
>> +       return this_rq()->nr_running;
>> +}
>> +EXPORT_SYMBOL(rq_nr_running);
>
> Uhm,.. no, that's a horrible thing to export.
>

True.. I had the same fear :). Other options I thought of were
something like nr_running()/num_online_cpus(), this_cpu_load(), etc.

Could you please let me know if there is some good metric we can rely
on to say whether the system is overcommitted or not?



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 11:34 ` Peter Zijlstra
@ 2012-09-24 11:52   ` Raghavendra K T
  2012-09-24 12:36     ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-24 11:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
>> In some special scenarios like #vcpu<= #pcpu, PLE handler may
>> prove very costly, because there is no need to iterate over vcpus
>> and do unsuccessful yield_to burning CPU.
>
> What's the costly thing? The vm-exit, the yield (which should be a nop
> if its the only task there) or something else entirely?
>
Both the vmexit and the yield_to(), actually, because unsuccessful
yield_to() is what makes the PLE handler costly overall.

This is because when we have large guests, say 32/16 vcpus, with one
vcpu holding a lock and the rest waiting for it, each vcpu that takes
a PLE exit tries to iterate over the rest of the vcpu list in the VM
and do a directed yield, unsuccessfully: O(n^2) tries (with 32 vcpus,
that is on the order of 32*31 ~ 1000 yield_to attempts per round of
contention).

This results in a fairly high amount of cpu burning and double-runqueue
lock contention.

(If they had kept spinning, lock progress would probably have been
faster.) As Avi/Chegu Vinod felt, it is better to avoid the vmexit
itself, but that seems a little complex to achieve currently.






^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24  8:42     ` Dor Laor
@ 2012-09-24 12:02       ` Raghavendra K T
  2012-09-25 15:00         ` Dor Laor
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-24 12:02 UTC (permalink / raw)
  To: dlaor
  Cc: Chegu Vinod, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 09/24/2012 02:12 PM, Dor Laor wrote:
> In order to help PLE and pvticketlock converge I thought that a small
> test code should be developed to test this in a predictable,
> deterministic way.
>
> The idea is to have a guest kernel module that spawn a new thread each
> time you write to a /sys/.... entry.
>
> Each such a thread spins over a spin lock. The specific spin lock is
> also chosen by the /sys/ interface. Let's say we have an array of spin
> locks *10 times the amount of vcpus.
>
> All the threads are running a
>
>   while (1) {
>     spin_lock(my_lock);
>     sum += execute_dummy_cpu_computation(time);
>     spin_unlock(my_lock);
>
>     if (sys_tells_thread_to_die())
>       break;
>   }
>
>   print_result(sum);
>
> Instead of calling the kernel's spin_lock functions, clone them and make
> the ticket lock order deterministic and known (like a linear walk of all
> the threads trying to catch that lock).

By cloning, do you mean a hierarchy of the locks?
Also, I believe the time should be passed via sysfs, or hardcoded for
each type of lock we are mimicking.

>
> This way you can easy calculate:
> 1. the score of a single vcpu running a single thread
> 2. the score of sum of all thread scores when #thread==#vcpu all
> taking the same spin lock. The overall sum should be close as
> possible to #1.
> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
> (belonging to all VMs) > #pcpus.
> 4. Create #thread == #vcpus but let each thread have it's own spin
> lock
> 5. Like 4 + 2
>
> Hopefully this way will allows you to judge and evaluate the exact
> overhead of scheduling VMs and threads since you have the ideal result
> in hand and you know what the threads are doing.
>
> My 2 cents, Dor
>

Thank you,
I think this is an excellent idea (though I am still trying to put
together all the pieces you mentioned). So overall we should be able
to measure the performance of pvspinlock/PLE improvements with a
deterministic load in the guest.

The only thing I am missing is how to generate the different
combinations of locks.

Okay, let me see if I can come up with a solid model for this.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 11:52   ` Raghavendra K T
@ 2012-09-24 12:36     ` Peter Zijlstra
  2012-09-24 13:29       ` Raghavendra K T
  2012-09-26 12:57       ` Andrew Jones
  0 siblings, 2 replies; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 12:36 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> >> In some special scenarios like #vcpu<= #pcpu, PLE handler may
> >> prove very costly, because there is no need to iterate over vcpus
> >> and do unsuccessful yield_to burning CPU.
> >
> > What's the costly thing? The vm-exit, the yield (which should be a nop
> > if its the only task there) or something else entirely?
> >
> Both vmexit and yield_to() actually,
> 
> because unsuccessful yield_to() overall is costly in PLE handler.
> 
> This is because when we have large guests, say 32/16 vcpus, and one
> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
> 
> this results in a fairly high amount of cpu burning and double run queue
> lock contention.
> 
> (if they were spinning probably lock progress would have been faster).
> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
> seems little complex to achieve currently.

OK, so the vmexit stays and we need to improve yield_to.

How about something like the below, that would allow breaking out of the
for-each-vcpu loop and simply going back into the vm, right?

---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *   true (>0) if we indeed boosted the target task.
+ *   false (0) if we failed to boost the target.
+ *   -ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	local_irq_save(flags);
 	rq = this_rq();
 
+	/*
+	 * If we're the only runnable task on the rq, there's absolutely no
+	 * point in yielding.
+	 */
+	if (rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 again:
 	p_rq = task_rq(p);
 	double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 			resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 12:36     ` Peter Zijlstra
@ 2012-09-24 13:29       ` Raghavendra K T
  2012-09-24 13:54         ` Peter Zijlstra
  2012-09-26 12:57       ` Andrew Jones
  1 sibling, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-24 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/24/2012 06:06 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
>> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
>>> On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
>>>> In some special scenarios like #vcpu<= #pcpu, PLE handler may
>>>> prove very costly, because there is no need to iterate over vcpus
>>>> and do unsuccessful yield_to burning CPU.
>>>
>>> What's the costly thing? The vm-exit, the yield (which should be a nop
>>> if its the only task there) or something else entirely?
>>>
>> Both vmexit and yield_to() actually,
>>
>> because unsuccessful yield_to() overall is costly in PLE handler.
>>
>> This is because when we have large guests, say 32/16 vcpus, and one
>> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
>> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
>> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
>>
>> this results in a fairly high amount of cpu burning and double run queue
>> lock contention.
>>
>> (if they were spinning probably lock progress would have been faster).
>> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
>> seems little complex to achieve currently.
>
> OK, so the vmexit stays and we need to improve yield_to.
>
> How about something like the below, that would allow breaking out of the
> for-each-vcpu loop and simply going back into the vm, right?
>
> ---
>   kernel/sched/core.c | 25 +++++++++++++++++++------
>   1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b38f00e..5d5b355 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
>    * It's the caller's job to ensure that the target task struct
>    * can't go away on us before we can do any checks.
>    *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *   true (>0) if we indeed boosted the target task.
> + *   false (0) if we failed to boost the target.
> + *   -ESRCH if there's no task to yield to.
>    */
>   bool __sched yield_to(struct task_struct *p, bool preempt)
>   {
> @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>   	local_irq_save(flags);
>   	rq = this_rq();
>
> +	/*
> +	 * If we're the only runnable task on the rq, there's absolutely no
> +	 * point in yielding.
> +	 */
> +	if (rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}
> +
>   again:
>   	p_rq = task_rq(p);
>   	double_rq_lock(rq, p_rq);
> @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>   	}
>
>   	if (!curr->sched_class->yield_to_task)
> -		goto out;
> +		goto out_unlock;
>
>   	if (curr->sched_class != p->sched_class)
> -		goto out;
> +		goto out_unlock;
>
>   	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +		goto out_unlock;
>
>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>   	if (yielded) {
> @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>   			resched_task(p_rq->curr);
>   	}
>
> -out:
> +out_unlock:
>   	double_rq_unlock(rq, p_rq);
> +out_irq:
>   	local_irq_restore(flags);
>
> -	if (yielded)
> +	if (yielded>  0)
>   		schedule();
>
>   	return yielded;
>
>

Yes, I think this is a nice idea. Any future users of yield_to would
also benefit from this: we will have to iterate only until the first
attempt to yield_to. A sketch of the consumer side is below.

I'll run the test with this patch.

However, Rik had a genuine concern about cases where the runqueues are
not equally distributed and the lock holder might actually be on a
different runqueue, but not running.

Do you think that instead of using rq->nr_running, we could get a
global sense of load using avenrun (something like
avenrun/num_online_cpus)?
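
Something like this, I think (my illustration, not a posted patch; it
assumes kvm_vcpu_yield_to() passes the new return value through):

  	/* In kvm_vcpu_on_spin(): stop scanning once yield_to() says we
  	 * are alone on this runqueue, instead of trying every
  	 * remaining vcpu. */
  	kvm_for_each_vcpu(i, vcpu, kvm) {
  		...
  		yielded = kvm_vcpu_yield_to(vcpu);
  		if (yielded > 0)
  			break;		/* boosted somebody */
  		if (yielded == -ESRCH)
  			goto out;	/* nothing to yield to: back to the guest */
  	}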


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 17:45       ` Raghavendra K T
@ 2012-09-24 13:43         ` Takuya Yoshikawa
  0 siblings, 0 replies; 126+ messages in thread
From: Takuya Yoshikawa @ 2012-09-24 13:43 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Fri, 21 Sep 2012 23:15:40 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> >> How about doing cond_resched() instead?
> >
> > Actually, an actual call to yield() may be better.
> >
> > That will set scheduler hints to make the scheduler pick
> > another task for one round, while preserving this task's
> > top position in the runqueue.
> 
> I am not a scheduler expert, but I am also inclined towards
> Rik's suggestion here since we set skip buddy here. Takuya?
> 

Yes, I think it's better.
But I hope that experts in Cc will suggest the best way.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 13:29       ` Raghavendra K T
@ 2012-09-24 13:54         ` Peter Zijlstra
  2012-09-24 14:16           ` Raghavendra K T
  2012-09-24 15:51           ` Avi Kivity
  0 siblings, 2 replies; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 13:54 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> However Rik had a genuine concern in the cases where runqueue is not
> equally distributed and lockholder might actually be on a different run 
> queue but not running.

Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non-running vcpu in this case; that's what
yield_to_task_fair() is about. If it's still not eligible to run, tough
luck.

> Do you think instead of using rq->nr_running, we could get a global 
> sense of load using avenrun (something like avenrun/num_onlinecpus) 

To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 13:54         ` Peter Zijlstra
@ 2012-09-24 14:16           ` Raghavendra K T
  2012-09-25 13:40             ` Raghavendra K T
  2012-09-24 15:51           ` Avi Kivity
  1 sibling, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-24 14:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> However Rik had a genuine concern in the cases where runqueue is not
>> equally distributed and lockholder might actually be on a different run
>> queue but not running.
>
> Load should eventually get distributed equally -- that's what the
> load-balancer is for -- so this is a temporary situation.
>
> We already try and favour the non running vcpu in this case, that's what
> yield_to_task_fair() is about. If its still not eligible to run, tough
> luck.

Yes, I agree.

>
>> Do you think instead of using rq->nr_running, we could get a global
>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>
> To what purpose? Also, global stuff is expensive, so you should try and
> stay away from it as hard as you possibly can.

Yes, that very concern made me fall back to rq->nr_running.

Will come back with the results soon.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-21 12:00 ` [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario " Raghavendra K T
  2012-09-21 13:22   ` Rik van Riel
  2012-09-21 13:46   ` Takuya Yoshikawa
@ 2012-09-24 15:26   ` Avi Kivity
  2012-09-24 15:34     ` Peter Zijlstra
  2 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 15:26 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/21/2012 03:00 PM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
> 
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8323685..713b677 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  		}
>  	}
> +	/* In overcommitted cases, yield instead of spinning */
> +	if (!yielded && rq_nr_running() > 1)
> +		schedule();
> +

I think this is a no-op these (CFS) days.  To get schedule() to do
anything, you need to wake up a task, or let time pass, or block.
Otherwise it will see that nothing has changed and as far as it's
concerned you're still the best task to be running (otherwise it
wouldn't have picked you in the first place).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 15:26   ` Avi Kivity
@ 2012-09-24 15:34     ` Peter Zijlstra
  2012-09-24 15:43       ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 15:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
> I think this is a no-op these (CFS) days.  To get schedule() to do
> anything, you need to wake up a task, or let time pass, or block.
> Otherwise it will see that nothing has changed and as far as it's
> concerned you're still the best task to be running (otherwise it
> wouldn't have picked you in the first place). 

Time could have passed enough before calling this that there's now a
different/more eligible task around to schedule.

Esp. for a !PREEMPT kernel this is could be significant.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-21 17:24     ` Raghavendra K T
@ 2012-09-24 15:41       ` Avi Kivity
  2012-09-24 16:06         ` Avi Kivity
                           ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 15:41 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/21/2012 08:24 PM, Raghavendra K T wrote:
> On 09/21/2012 06:32 PM, Rik van Riel wrote:
>> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> When total number of VCPUs of system is less than or equal to physical
>>> CPUs,
>>> PLE exits become costly since each VCPU can have dedicated PCPU, and
>>> trying to find a target VCPU to yield_to just burns time in PLE handler.
>>>
>>> This patch reduces overhead, by simply doing a return in such
>>> scenarios by
>>> checking the length of current cpu runqueue.
>>
>> I am not convinced this is the way to go.
>>
>> The VCPU that is holding the lock, and is not releasing it,
>> probably got scheduled out. That implies that VCPU is on a
>> runqueue with at least one other task.
> 
> I see your point here, we have two cases:
> 
> case 1)
> 
> rq1 : vcpu1->wait(lockA) (spinning)
> rq2 : vcpu2->holding(lockA) (running)
> 
> Here Ideally vcpu1 should not enter PLE handler, since it would surely
> get the lock within ple_window cycle. (assuming ple_window is tuned for
> that workload perfectly).
> 
> Maybe this explains why we are not seeing a benefit with kernbench.
> 
> On the other side, since we cannot have a perfect ple_window tuned for
> all types of workloads, we gain for those workloads that may need more
> than 4096 cycles. Is that what we are seeing in the benefited cases?

Maybe we need to increase the ple window regardless.  4096 cycles is 2
microseconds or less (call it t_spin).  The overhead from
kvm_vcpu_on_spin() and the associated task switches is at least a few
microseconds, increasing as contention is added (call it t_yield).  The
time for a natural context switch is several milliseconds (call it
t_slice).  There is also the time the lock holder owns the lock,
assuming no contention (t_hold).

If t_yield > t_spin, then in the undercommitted case it dominates
t_spin.  If t_hold > t_spin we lose badly.

If t_spin > t_yield, then the undercommitted case doesn't suffer as much
as most of the spinning happens in the guest instead of the host, so it
can pick up the unlock timely.  We don't lose too much in the
overcommitted case provided the values aren't too far apart (say a
factor of 3).

Obviously t_spin must be significantly smaller than t_slice, otherwise
it accomplishes nothing.

Regarding t_hold: if it is small, then a larger t_spin helps avoid false
exits.  If it is large, then we're not very sensitive to t_spin.  It
doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
yielding for several milliseconds.

So I think it's worth trying again with ple_window of 20000-40000.
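
A rough worked example (assuming the ~2 GHz implied by "4096 cycles is 2
microseconds or less", i.e. ~2000 cycles/usec):

	ple_window =  4096 cycles  ->  t_spin ~=  2 usec
	ple_window = 30000 cycles  ->  t_spin ~= 15 usec

Even 15 usec is still two orders of magnitude below a t_slice of several
milliseconds, so a larger window costs little in the overcommitted case
while covering much longer lock hold times in the undercommitted one.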

> 
> case 2)
> rq1 : vcpu1->wait(lockA) (spinning)
> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
> 
> I agree that checking rq1 length is not proper in this case, and as you
> rightly pointed out, we are in trouble here.
> nr_running()/num_online_cpus() would give a more accurate picture here,
> but it seemed costly. Maybe the load balancer saves us a bit here by
> not running into such cases. (I agree the load balancer is far too
> complex.)

In theory preempt notifier can tell us whether a vcpu is preempted or
not (except for exits to userspace), so we can keep track of whether
we're overcommitted in kvm itself.  It also avoids false positives
from other guests and/or processes being overcommitted while our vm is fine.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 15:34     ` Peter Zijlstra
@ 2012-09-24 15:43       ` Avi Kivity
  2012-09-24 15:52         ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 05:34 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
>> I think this is a no-op these (CFS) days.  To get schedule() to do
>> anything, you need to wake up a task, or let time pass, or block.
>> Otherwise it will see that nothing has changed and as far as it's
>> concerned you're still the best task to be running (otherwise it
>> wouldn't have picked you in the first place). 
> 
> Time could have passed enough before calling this that there's now a
> different/more eligible task around to schedule.

Wouldn't this correspond to the scheduler interrupt firing and causing a
reschedule?  I thought the timer was programmed for exactly the point in
time that CFS considers the right time for a switch.  But I'm basing
this on my mental model of CFS, not CFS itself.

> Esp. for a !PREEMPT kernel this could be significant.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 13:54         ` Peter Zijlstra
  2012-09-24 14:16           ` Raghavendra K T
@ 2012-09-24 15:51           ` Avi Kivity
  2012-09-24 16:03             ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> However Rik had a genuine concern in the cases where runqueue is not
>> equally distributed and lockholder might actually be on a different run 
>> queue but not running.
> 
> Load should eventually get distributed equally -- that's what the
> load-balancer is for -- so this is a temporary situation.

What's the expected latency?  This is the whole problem.  Eventually the
scheduler would pick the lock holder as well, the problem is that it's
in the millisecond scale while lock hold times are in the microsecond
scale, leading to a 1000x slowdown.

If we want to yield, we really want to boost someone.

> We already try and favour the non running vcpu in this case, that's what
> yield_to_task_fair() is about. If its still not eligible to run, tough
> luck.

Crazy idea: instead of yielding, just run that other vcpu in the thread
that would otherwise spin.  I can see about a million objections to this
already though.

>> Do you think instead of using rq->nr_running, we could get a global 
>> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> 
> To what purpose? Also, global stuff is expensive, so you should try and
> stay away from it as hard as you possibly can.

Spinning is also expensive.  How about we do the global stuff every N
times, to amortize the cost (and reduce contention)?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 15:43       ` Avi Kivity
@ 2012-09-24 15:52         ` Peter Zijlstra
  2012-09-24 15:58           ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 15:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
> Wouldn't this correspond to the scheduler interrupt firing and causing a
> reschedule?  I thought the timer was programmed for exactly the point in
> time that CFS considers the right time for a switch.  But I'm basing
> this on my mental model of CFS, not CFS itself. 

No, we tried this for hrtimer kernels for a while, but programming
hrtimers the whole time (every actual task-switch) turns out to be far
too expensive. So we're back to HZ ticks and 'polling' the preemption
state.

Even if we remove all the hrtimer infrastructure overhead (can do with a
few hacks) setting the hardware requires going out to the LAPIC, which
is stupid slow.

Some hardware actually has fast/reliable/usable timers, sadly none of it
is popular.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 15:52         ` Peter Zijlstra
@ 2012-09-24 15:58           ` Avi Kivity
  2012-09-24 16:05             ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 05:52 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
>> Wouldn't this correspond to the scheduler interrupt firing and causing a
>> reschedule?  I thought the timer was programmed for exactly the point in
>> time that CFS considers the right time for a switch.  But I'm basing
>> this on my mental model of CFS, not CFS itself. 
> 
> No, we tried this for hrtimer kernels for a while, but programming
> hrtimers the whole time (every actual task-switch) turns out to be far
> too expensive. So we're back to HZ ticks and 'polling' the preemption
> state.

Ok, so I wasn't completely off base.

With HZ=1000, calling schedule() here can only beat the tick-driven
('polled') reschedule by at most a millisecond, and we need to be a lot
faster.

> Even if we remove all the hrtimer infrastructure overhead (can do with a
> few hacks) setting the hardware requires going out to the LAPIC, which
> is stupid slow.
> 
> Some hardware actually has fast/reliable/usable timers, sadly none of it
> is popular.

There is the TSC deadline timer mode of newer Intels.  Programming the
timer is a simple wrmsr, and it will fire immediately if it already
expired.  Unfortunately on AMDs it is not available, and on virtual
hardware it will be slow (~1-2 usec).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 15:51           ` Avi Kivity
@ 2012-09-24 16:03             ` Peter Zijlstra
  2012-09-24 16:20               ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> However Rik had a genuine concern in the cases where runqueue is not
> >> equally distributed and lockholder might actually be on a different run 
> >> queue but not running.
> > 
> > Load should eventually get distributed equally -- that's what the
> > load-balancer is for -- so this is a temporary situation.
> 
> What's the expected latency?  This is the whole problem.  Eventually the
> scheduler would pick the lock holder as well, the problem is that it's
> in the millisecond scale while lock hold times are in the microsecond
> scale, leading to a 1000x slowdown.

Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
something like accurate or fast, never both.

> If we want to yield, we really want to boost someone.

Now if only you knew which someone ;-) This non-modified guest nonsense
is such a snake pit.. but you know how I feel about all that.

> > We already try and favour the non running vcpu in this case, that's what
> > yield_to_task_fair() is about. If its still not eligible to run, tough
> > luck.
> 
> Crazy idea: instead of yielding, just run that other vcpu in the thread
> that would otherwise spin.  I can see about a million objections to this
> already though.

Yah.. you want me to list a few? :-) It would require synchronization
with the other cpu to pull its task -- one really wants to avoid it also
running it.

Do this at a high enough frequency and you're dead too.

Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
associated with a vcpu thread and use the preemption notifiers to sort
things against the scheduler or somesuch.

> >> Do you think instead of using rq->nr_running, we could get a global 
> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> > 
> > To what purpose? Also, global stuff is expensive, so you should try and
> > stay away from it as hard as you possibly can.
> 
> Spinning is also expensive.  How about we do the global stuff every N
> times, to amortize the cost (and reduce contention)?

Nah, spinning isn't expensive, it's a waste of time -- similar end result
for someone who wants to do useful work, but not the same cause.

Pick N and I'll come up with a scenario for which its wrong ;-)

Anyway, its an ugly problem and one I really want to contain inside the
insanity that created it (virt), lets not taint the rest of the kernel
more than we need to. 

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 15:58           ` Avi Kivity
@ 2012-09-24 16:05             ` Peter Zijlstra
  2012-09-24 16:10               ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
> There is the TSC deadline timer mode of newer Intels.  Programming the
> timer is a simple wrmsr, and it will fire immediately if it already
> expired.  Unfortunately on AMDs it is not available, and on virtual
> hardware it will be slow (~1-2 usec). 

It's also still a LAPIC write -- disguised as an MSR though :/

Also, who gives a hoot about virtual crap ;-)

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 15:41       ` Avi Kivity
@ 2012-09-24 16:06         ` Avi Kivity
  2012-09-24 16:14           ` Peter Zijlstra
  2012-09-25  8:09           ` Raghavendra K T
  2012-09-25  7:36         ` Raghavendra K T
  2012-10-03 12:22         ` Raghavendra K T
  2 siblings, 2 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 16:06 UTC (permalink / raw)
  To: Raghavendra K T, Peter Zijlstra
  Cc: Rik van Riel, H. Peter Anvin, Ingo Molnar, Marcelo Tosatti,
	Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang, chegu vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Gleb Natapov

On 09/24/2012 05:41 PM, Avi Kivity wrote:
> 
>> 
>> case 2)
>> rq1 : vcpu1->wait(lockA) (spinning)
>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
>> 
>> I agree that checking rq1 length is not proper in this case, and as you
>> rightly pointed out, we are in trouble here.
>> nr_running()/num_online_cpus() would give a more accurate picture here,
>> but it seemed costly. Maybe the load balancer saves us a bit here by
>> not running into such cases. (I agree the load balancer is far too
>> complex.)
> 
> In theory preempt notifier can tell us whether a vcpu is preempted or
> not (except for exits to userspace), so we can keep track of whether
> we're overcommitted in kvm itself.  It also avoids false positives
> from other guests and/or processes being overcommitted while our vm is fine.

It also allows us to cheaply skip running vcpus.

We would probably need a ->sched_exit() preempt notifier to make this
work.  Peter, I know how much you love those, would it be acceptable?
We'd still need yield_to() but the pressure on it might be reduced.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 16:05             ` Peter Zijlstra
@ 2012-09-24 16:10               ` Avi Kivity
  2012-09-24 16:13                 ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 06:05 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
>> There is the TSC deadline timer mode of newer Intels.  Programming the
>> timer is a simple wrmsr, and it will fire immediately if it already
>> expired.  Unfortunately on AMDs it is not available, and on virtual
>> hardware it will be slow (~1-2 usec). 
> 
> It's also still a LAPIC write -- disguised as an MSR though :/

It's probably a whole lot faster though.

> Also, who gives a hoot about virtual crap ;-)

I only mentioned it to see if your virtual crap detector is still
working.  Looks like it's still in top condition, low latency and 100%
hit rate.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 16:10               ` Avi Kivity
@ 2012-09-24 16:13                 ` Peter Zijlstra
  2012-09-24 16:21                   ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> > It's also still a LAPIC write -- disguised as an MSR though :/
> 
> It's probably a whole lot faster though. 

I've been told it's not; I haven't tried it.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 16:06         ` Avi Kivity
@ 2012-09-24 16:14           ` Peter Zijlstra
  2012-09-24 16:25             ` Avi Kivity
  2012-09-25  8:09           ` Raghavendra K T
  1 sibling, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Mon, 2012-09-24 at 18:06 +0200, Avi Kivity wrote:
> 
> We would probably need a ->sched_exit() preempt notifier to make this
> work.  Peter, I know how much you love those, would it be acceptable? 

Where exactly do you want this? TASK_DEAD? or another exit?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 16:03             ` Peter Zijlstra
@ 2012-09-24 16:20               ` Avi Kivity
  2012-09-26 13:20                 ` Andrew Jones
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
>> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
>> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> >> However Rik had a genuine concern in the cases where runqueue is not
>> >> equally distributed and lockholder might actually be on a different run 
>> >> queue but not running.
>> > 
>> > Load should eventually get distributed equally -- that's what the
>> > load-balancer is for -- so this is a temporary situation.
>> 
>> What's the expected latency?  This is the whole problem.  Eventually the
>> scheduler would pick the lock holder as well, the problem is that it's
>> in the millisecond scale while lock hold times are in the microsecond
>> scale, leading to a 1000x slowdown.
> 
> Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> something like accurate or fast, never both.
> 
>> If we want to yield, we really want to boost someone.
> 
> Now if only you knew which someone ;-) This non-modified guest nonsense
> is such a snake pit.. but you know how I feel about all that.

Actually if I knew that in addition to boosting someone, I also unboost
myself enough to be preempted, it wouldn't matter.  While boosting the
lock holder is good, the main point is not spinning and doing useful
work instead.  We can detect spinners and avoid boosting them.

That's the motivation for the "donate vruntime" approach I wanted earlier.

> 
>> > We already try and favour the non running vcpu in this case, that's what
>> > yield_to_task_fair() is about. If its still not eligible to run, tough
>> > luck.
>> 
>> Crazy idea: instead of yielding, just run that other vcpu in the thread
>> that would otherwise spin.  I can see about a million objections to this
>> already though.
> 
> Yah.. you want me to list a few? :-) It would require synchronization
> with the other cpu to pull its task -- one really wants to avoid it also
> running it.

Yeah, it's quite a horrible idea.

> 
> Do this at a high enough frequency and you're dead too.
> 
> Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> associated with a vcpu thread and use the preemption notifiers to sort
> things against the scheduler or somesuch.

That's what I thought when I wrote this, but I can't, I might be
preempted in random kvm code.  So my state includes the host stack and
registers.  Maybe we can special-case when we interrupt guest mode.

> 
>> >> Do you think instead of using rq->nr_running, we could get a global 
>> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
>> > 
>> > To what purpose? Also, global stuff is expensive, so you should try and
>> > stay away from it as hard as you possibly can.
>> 
>> Spinning is also expensive.  How about we do the global stuff every N
>> times, to amortize the cost (and reduce contention)?
> 
> Nah, spinning isn't expensive, it's a waste of time -- similar end result
> for someone who wants to do useful work, but not the same cause.
> 
> Pick N and I'll come up with a scenario for which its wrong ;-)

Sure.  But if it's rare enough, then that's okay for us.

> Anyway, its an ugly problem and one I really want to contain inside the
> insanity that created it (virt), lets not taint the rest of the kernel
> more than we need to. 

Agreed.  Though given that postgres and others use userspace spinlocks,
maybe it's not just virt.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 16:13                 ` Peter Zijlstra
@ 2012-09-24 16:21                   ` Avi Kivity
  2012-09-25 10:11                     ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
>> > It's also still a LAPIC write -- disguised as an MSR though :/
>> 
>> It's probably a whole lot faster though. 
> 
> I've been told it's not; I haven't tried it.

I'll see if I can find a machine with it (don't see it on my Westmere,
it's probably on one of the Bridges).  Or maybe the other Peter knows.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 16:14           ` Peter Zijlstra
@ 2012-09-24 16:25             ` Avi Kivity
  0 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-24 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 06:14 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:06 +0200, Avi Kivity wrote:
>> 
>> We would probably need a ->sched_exit() preempt notifier to make this
>> work.  Peter, I know how much you love those, would it be acceptable? 
> 
> Where exactly do you want this? TASK_DEAD? or another exit?

TASK_DEAD of the task that registered the preempt notifier.

The idea is that I want to hold on to the notifier even when exiting to
userspace.  Since userspace is under no obligation to call kvm again, I
need a chance to unregister the notifier and otherwise clean up.

Eh, looking at the code, we'll have a ->sched_out() after the state is
set to TASK_DEAD.  So all we need to do is examine the state.  We'll
need to examine the state anyway to see if we were preempted or blocking.
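
For example (a sketch only; kvm_vcpu_thread_dead() is a made-up name for
whatever cleanup we would want to do):

	/* in kvm's ->sched_out() notifier */
	if (current->state == TASK_DEAD) {
		/* vcpu thread is exiting: last chance to clean up */
		kvm_vcpu_thread_dead(vcpu);
		return;
	}
	/* otherwise TASK_RUNNING means preempted, anything else blocked */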

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 15:41       ` Avi Kivity
  2012-09-24 16:06         ` Avi Kivity
@ 2012-09-25  7:36         ` Raghavendra K T
  2012-09-25  8:12           ` Avi Kivity
  2012-10-03 12:22         ` Raghavendra K T
  2 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-25  7:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 09:11 PM, Avi Kivity wrote:
> On 09/21/2012 08:24 PM, Raghavendra K T wrote:
>> On 09/21/2012 06:32 PM, Rik van Riel wrote:
>>> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
>>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>
>>>> When total number of VCPUs of system is less than or equal to physical
>>>> CPUs,
>>>> PLE exits become costly since each VCPU can have dedicated PCPU, and
>>>> trying to find a target VCPU to yield_to just burns time in PLE handler.
>>>>
>>>> This patch reduces overhead, by simply doing a return in such
>>>> scenarios by
>>>> checking the length of current cpu runqueue.
>>>
>>> I am not convinced this is the way to go.
>>>
>>> The VCPU that is holding the lock, and is not releasing it,
>>> probably got scheduled out. That implies that VCPU is on a
>>> runqueue with at least one other task.
>>
>> I see your point here, we have two cases:
>>
>> case 1)
>>
>> rq1 : vcpu1->wait(lockA) (spinning)
>> rq2 : vcpu2->holding(lockA) (running)
>>
>> Ideally vcpu1 should not enter the PLE handler here, since it would
>> surely get the lock within a ple_window cycle (assuming ple_window is
>> tuned perfectly for that workload).
>>
>> Maybe this explains why we are not seeing a benefit with kernbench.
>>
>> On the other side, since we cannot have a perfect ple_window tuned for
>> all types of workloads, we gain for those workloads which may need more
>> than 4096 cycles. Is that what we are seeing in the benefited cases?
>
> Maybe we need to increase the ple window regardless.  4096 cycles is 2
> microseconds or less (call it t_spin).  The overhead from
> kvm_vcpu_on_spin() and the associated task switches is at least a few
> microseconds, increasing as contention is added (call it t_yield).  The
> time for a natural context switch is several milliseconds (call it
> t_slice).  There is also the time the lock holder owns the lock,
> assuming no contention (t_hold).
>
> If t_yield > t_spin, then in the undercommitted case it dominates
> t_spin.  If t_hold > t_spin we lose badly.
>
> If t_spin > t_yield, then the undercommitted case doesn't suffer as much
> as most of the spinning happens in the guest instead of the host, so it
> can pick up the unlock timely.  We don't lose too much in the
> overcommitted case provided the values aren't too far apart (say a
> factor of 3).
>
> Obviously t_spin must be significantly smaller than t_slice, otherwise
> it accomplishes nothing.
>
> Regarding t_hold: if it is small, then a larger t_spin helps avoid false
> exits.  If it is large, then we're not very sensitive to t_spin.  It
> doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
> yielding for several milliseconds.
>
> So I think it's worth trying again with ple_window of 20000-40000.
>

Agreed that spinning is not costly, and I have tried increasing
ple_window earlier. I'll give it one more shot.

I was thinking that unnecessary spinning of vcpus (spinning when the
lockholder is preempted) adds up to significant degradation, and the
ticketlock scenario is especially problematic, no?


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 16:06         ` Avi Kivity
  2012-09-24 16:14           ` Peter Zijlstra
@ 2012-09-25  8:09           ` Raghavendra K T
  2012-09-25  8:54             ` Avi Kivity
  1 sibling, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-25  8:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 09:36 PM, Avi Kivity wrote:
> On 09/24/2012 05:41 PM, Avi Kivity wrote:
>>
>>>
>>> case 2)
>>> rq1 : vcpu1->wait(lockA) (spinning)
>>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
>>>
>>> I agree that checking rq1 length is not proper in this case, and as you
>>> rightly pointed out, we are in trouble here.
>>> nr_running()/num_online_cpus() would give a more accurate picture here,
>>> but it seemed costly. Maybe the load balancer saves us a bit here by
>>> not running into such cases. (I agree the load balancer is far too
>>> complex.)
>>
>> In theory preempt notifier can tell us whether a vcpu is preempted or
>> not (except for exits to userspace), so we can keep track of whether
>> we're overcommitted in kvm itself.  It also avoids false positives
>> from other guests and/or processes being overcommitted while our vm is fine.
>
> It also allows us to cheaply skip running vcpus.

Hi Avi,

Could you please elaborate on how preempt notifiers can be used
here to keep track of overcommit or skip running vcpus?

Are we planning set some flag in sched_out() handler etc?


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-25  7:36         ` Raghavendra K T
@ 2012-09-25  8:12           ` Avi Kivity
  2012-09-25 14:21             ` Takuya Yoshikawa
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-25  8:12 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/25/2012 09:36 AM, Raghavendra K T wrote:
> On 09/24/2012 09:11 PM, Avi Kivity wrote:
>> On 09/21/2012 08:24 PM, Raghavendra K T wrote:
>>> On 09/21/2012 06:32 PM, Rik van Riel wrote:
>>>> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
>>>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>>
>>>>> When total number of VCPUs of system is less than or equal to
>>>>> physical
>>>>> CPUs,
>>>>> PLE exits become costly since each VCPU can have dedicated PCPU, and
>>>>> trying to find a target VCPU to yield_to just burns time in PLE
>>>>> handler.
>>>>>
>>>>> This patch reduces overhead, by simply doing a return in such
>>>>> scenarios by
>>>>> checking the length of current cpu runqueue.
>>>>
>>>> I am not convinced this is the way to go.
>>>>
>>>> The VCPU that is holding the lock, and is not releasing it,
>>>> probably got scheduled out. That implies that VCPU is on a
>>>> runqueue with at least one other task.
>>>
>>> I see your point here, we have two cases:
>>>
>>> case 1)
>>>
>>> rq1 : vcpu1->wait(lockA) (spinning)
>>> rq2 : vcpu2->holding(lockA) (running)
>>>
>>> Ideally vcpu1 should not enter the PLE handler here, since it would
>>> surely get the lock within a ple_window cycle (assuming ple_window is
>>> tuned perfectly for that workload).
>>>
>>> Maybe this explains why we are not seeing a benefit with kernbench.
>>>
>>> On the other side, since we cannot have a perfect ple_window tuned for
>>> all types of workloads, we gain for those workloads which may need more
>>> than 4096 cycles. Is that what we are seeing in the benefited cases?
>>
>> Maybe we need to increase the ple window regardless.  4096 cycles is 2
>> microseconds or less (call it t_spin).  The overhead from
>> kvm_vcpu_on_spin() and the associated task switches is at least a few
>> microseconds, increasing as contention is added (call it t_yield).  The
>> time for a natural context switch is several milliseconds (call it
>> t_slice).  There is also the time the lock holder owns the lock,
>> assuming no contention (t_hold).
>>
>> If t_yield > t_spin, then in the undercommitted case it dominates
>> t_spin.  If t_hold > t_spin we lose badly.
>>
>> If t_spin > t_yield, then the undercommitted case doesn't suffer as much
>> as most of the spinning happens in the guest instead of the host, so it
>> can pick up the unlock timely.  We don't lose too much in the
>> overcommitted case provided the values aren't too far apart (say a
>> factor of 3).
>>
>> Obviously t_spin must be significantly smaller than t_slice, otherwise
>> it accomplishes nothing.
>>
>> Regarding t_hold: if it is small, then a larger t_spin helps avoid false
>> exits.  If it is large, then we're not very sensitive to t_spin.  It
>> doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
>> yielding for several milliseconds.
>>
>> So I think it's worth trying again with ple_window of 20000-40000.
>>
>
> Agreed that spinning is not costly, and I have tried increasing
> ple_window earlier. I'll give it one more shot.
>
> I was thinking that unnecessary spinning of vcpus (spinning when the
> lockholder is preempted) adds up to significant degradation, and the
> ticketlock scenario is especially problematic, no?
>

It will.  The tradeoff is between false-positive costs (undercommit) and
true positive costs (overcommit).  I think undercommit should perform
well no matter what.

If we utilize preempt notifiers to track overcommit dynamically, then we
can vary the spin time dynamically.  Keep it long initially, and make it
shorter as we see more preempted vcpus.
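
Something along these lines, perhaps (a sketch only; the constants and
the nr_preempted counter are invented, the counter comes up later in the
thread):

	/* shrink the PLE window as more of our vcpus are seen preempted */
	int preempted = atomic_read(&kvm->nr_preempted);
	u32 window = PLE_WINDOW_LONG / (1 + preempted);

	if (window < PLE_WINDOW_MIN)
		window = PLE_WINDOW_MIN;
	vmcs_write32(PLE_WINDOW, window);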

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-25  8:09           ` Raghavendra K T
@ 2012-09-25  8:54             ` Avi Kivity
  2012-09-25 13:49               ` Raghavendra K T
                                 ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-25  8:54 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/25/2012 10:09 AM, Raghavendra K T wrote:
> On 09/24/2012 09:36 PM, Avi Kivity wrote:
>> On 09/24/2012 05:41 PM, Avi Kivity wrote:
>>>
>>>>
>>>> case 2)
>>>> rq1 : vcpu1->wait(lockA) (spinning)
>>>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
>>>>
>>>> I agree that checking rq1 length is not proper in this case, and as
>>>> you
>>>> rightly pointed out, we are in trouble here.
>>>> nr_running()/num_online_cpus() would give a more accurate picture here,
>>>> but it seemed costly. Maybe the load balancer saves us a bit here by
>>>> not running into such cases. (I agree the load balancer is far too
>>>> complex.)
>>>
>>> In theory preempt notifier can tell us whether a vcpu is preempted or
>>> not (except for exits to userspace), so we can keep track of whether
>>> we're overcommitted in kvm itself.  It also avoids false positives
>>> from other guests and/or processes being overcommitted while our vm
>>> is fine.
>>
>> It also allows us to cheaply skip running vcpus.
>
> Hi Avi,
>
> Could you please elaborate on how preempt notifiers can be used
> here to keep track of overcommit or skip running vcpus?
>
> Are we planning set some flag in sched_out() handler etc?
>

Keep a bitmap kvm->preempted_vcpus.

In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
of preempted vcpus.

We can use the bitmap and the counter to quickly see if spinning is
worthwhile (if the counter is zero, better to spin).  If not, we can use
the bitmap to select target vcpus quickly.

The only problem is that in order to keep this accurate we need to keep
the preempt notifiers active during exits to userspace.  But we can
prototype this without this change, and add it later if it works.
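
A minimal sketch of the bookkeeping (untested; the vcpu->preempted flag
and the kvm-side fields are invented names):

	static void kvm_sched_out(struct preempt_notifier *pn,
				  struct task_struct *next)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		/* TASK_RUNNING here means we were preempted, not blocked */
		if (current->state == TASK_RUNNING && !vcpu->preempted) {
			vcpu->preempted = true;
			set_bit(vcpu->vcpu_id, vcpu->kvm->preempted_vcpus);
			atomic_inc(&vcpu->kvm->nr_preempted);
		}
	}

	static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		if (vcpu->preempted) {
			vcpu->preempted = false;
			clear_bit(vcpu->vcpu_id, vcpu->kvm->preempted_vcpus);
			atomic_dec(&vcpu->kvm->nr_preempted);
		}
	}

	/* then, in kvm_vcpu_on_spin(): if none of our vcpus are
	 * preempted, spinning in the guest is the better bet */
	if (atomic_read(&vcpu->kvm->nr_preempted) == 0)
		return;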

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
  2012-09-24 16:21                   ` Avi Kivity
@ 2012-09-25 10:11                     ` Avi Kivity
  0 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-25 10:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/24/2012 06:21 PM, Avi Kivity wrote:
> On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> >> > It's also still a LAPIC write -- disguised as an MSR though :/
> >> 
> >> It's probably a whole lot faster though. 
> > 
> > I've been told it's not; I haven't tried it.
>
> I'll see if I can find a machine with it (don't see it on my Westmere,
> it's probably on one of the Bridges).  Or maybe the other Peter knows.
>
>

Before measuring TSC_DEADLINE, I measured writing to TMICT; it costs 32
cycles.  This is on a Sandy Bridge.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 14:16           ` Raghavendra K T
@ 2012-09-25 13:40             ` Raghavendra K T
  2012-09-27  8:36               ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-25 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/24/2012 07:46 PM, Raghavendra K T wrote:
> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>> However Rik had a genuine concern in the cases where runqueue is not
>>> equally distributed and lockholder might actually be on a different run
>>> queue but not running.
>>
>> Load should eventually get distributed equally -- that's what the
>> load-balancer is for -- so this is a temporary situation.
>>
>> We already try and favour the non running vcpu in this case, that's what
>> yield_to_task_fair() is about. If its still not eligible to run, tough
>> luck.
>
> Yes, I agree.
>
>>
>>> Do you think instead of using rq->nr_running, we could get a global
>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>
>> To what purpose? Also, global stuff is expensive, so you should try and
>> stay away from it as hard as you possibly can.
>
> Yes, that concern only had made me to fall back to rq->nr_running.
>
> Will come back with the result soon.

Got the results with the patches. Here they are.

Tried this on a 32-core PLE box with HT disabled, with 32 guest vcpus
at 1x and 2x overcommit.

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

---+-----------+-----------+-----------+-----------+
   |   Ebizzy result (rec/sec, higher is better)   |
---+-----------+-----------+-----------+-----------+
   |    Base   |     A     |     B     |     C     |
---+-----------+-----------+-----------+-----------+
1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
---+-----------+-----------+-----------+-----------+

   % improvement w.r.t. Base
---+------------+------------+------------+
   |     A      |      B     |      C     |
---+------------+------------+------------+
1x |  206.37603 |  139.70410 |  210.19323 |
2x |   -3.06555 |   -4.33218 |  -98.08773 |
---+------------+------------+------------+

With this approach we get almost the benefit of the PLE-disabled case.
With patch B the gain drops a bit (because we still iterate over vcpus
until we decide to do a directed yield).

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-25  8:54             ` Avi Kivity
@ 2012-09-25 13:49               ` Raghavendra K T
  2012-09-27  7:44               ` Gleb Natapov
       [not found]               ` <CAJocwcf+8u84_yDC-PK0Yni93YSTWzYvr69nq6b3pNv1MwVJzQ@mail.gmail.com>
  2 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-25 13:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/25/2012 02:24 PM, Avi Kivity wrote:
> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
>> On 09/24/2012 09:36 PM, Avi Kivity wrote:
>>> On 09/24/2012 05:41 PM, Avi Kivity wrote:
>>>>
>>>>>
>>>>> case 2)
>>>>> rq1 : vcpu1->wait(lockA) (spinning)
>>>>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
>>>>>
>>>>> I agree that checking rq1 length is not proper in this case, and as
>>>>> you
>>>>> rightly pointed out, we are in trouble here.
>>>>> nr_running()/num_online_cpus() would give a more accurate picture here,
>>>>> but it seemed costly. Maybe the load balancer saves us a bit here by
>>>>> not running into such cases. (I agree the load balancer is far too
>>>>> complex.)
>>>>
>>>> In theory preempt notifier can tell us whether a vcpu is preempted or
>>>> not (except for exits to userspace), so we can keep track of whether
>>>> we're overcommitted in kvm itself.  It also avoids false positives
>>>> from other guests and/or processes being overcommitted while our vm
>>>> is fine.
>>>
>>> It also allows us to cheaply skip running vcpus.
>>
>> Hi Avi,
>>
>> Could you please elaborate on how preempt notifiers can be used
>> here to keep track of overcommit or skip running vcpus?
>>
>> Are we planning set some flag in sched_out() handler etc?
>>
>
> Keep a bitmap kvm->preempted_vcpus.
>
> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
> flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
> set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
> of preempted vcpus.
>
> We can use the bitmap and the counter to quickly see if spinning is
> worthwhile (if the counter is zero, better to spin).  If not, we can use
> the bitmap to select target vcpus quickly.
>
> The only problem is that in order to keep this accurate we need to keep
> the preempt notifiers active during exits to userspace.  But we can
> prototype this without this change, and add it later if it works.
>

Avi, thanks for the idea. I want to try this some time soon.

So ideally it means that if we are under-committed, the counter / bitmap
value is effectively zero.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-25  8:12           ` Avi Kivity
@ 2012-09-25 14:21             ` Takuya Yoshikawa
  2012-09-27  8:43               ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Takuya Yoshikawa @ 2012-09-25 14:21 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Rik van Riel, Peter Zijlstra, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Tue, 25 Sep 2012 10:12:49 +0200
Avi Kivity <avi@redhat.com> wrote:

> It will.  The tradeoff is between false-positive costs (undercommit) and
> true positive costs (overcommit).  I think undercommit should perform
> well no matter what.
> 
> If we utilize preempt notifiers to track overcommit dynamically, then we
> can vary the spin time dynamically.  Keep it long initially, and make it
> shorter as we see more preempted vcpus.

What will happen if we pin each vcpu thread to some core?
I don't want to see so many vcpu threads moving around without
being pinned at all.

In that case, we don't want to make KVM do any work searching for
a vcpu thread to yield to.

Thanks,
	Takuya

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 12:02       ` Raghavendra K T
@ 2012-09-25 15:00         ` Dor Laor
  2012-09-26 12:27           ` Konrad Rzeszutek Wilk
  2012-09-27  9:49           ` Raghavendra K T
  0 siblings, 2 replies; 126+ messages in thread
From: Dor Laor @ 2012-09-25 15:00 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Chegu Vinod, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 09/24/2012 02:02 PM, Raghavendra K T wrote:
> On 09/24/2012 02:12 PM, Dor Laor wrote:
>> In order to help PLE and pvticketlock converge I thought that a small
>> test code should be developed to test this in a predictable,
>> deterministic way.
>>
>> The idea is to have a guest kernel module that spawn a new thread each
>> time you write to a /sys/.... entry.
>>
>> Each such a thread spins over a spin lock. The specific spin lock is
>> also chosen by the /sys/ interface. Let's say we have an array of spin
>> locks *10 times the amount of vcpus.
>>
>> All the threads are running a
>> while (1) {
>>
>> spin_lock(my_lock);
>> sum += execute_dummy_cpu_computation(time);
>> spin_unlock(my_lock);
>>
>> if (sys_tells_thread_to_die()) break;
>> }
>>
>> print_result(sum);
>>
>> Instead of calling the kernel's spin_lock functions, clone them and make
>> the ticket lock order deterministic and known (like a linear walk of all
>> the threads trying to catch that lock).
>
> By cloning, do you mean a hierarchy of locks?

No, I meant to clone the implementation of the current spin lock code in 
order to set any order you may like for the ticket selection.
(even for a non-pvticketlock version)

For instance, let's say you have N threads trying to grab the lock, you 
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.
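
A minimal sketch of such a guest module (hypothetical; a single lock and
a fixed thread count stand in for the /sys interface, untested):

	#include <linux/module.h>
	#include <linux/kthread.h>
	#include <linux/spinlock.h>
	#include <linux/jiffies.h>

	#define NTHREADS 4

	static DEFINE_SPINLOCK(test_lock);	/* could be an array of locks */
	static struct task_struct *spinners[NTHREADS];
	static unsigned long sum;

	static int spinner_fn(void *data)
	{
		while (!kthread_should_stop()) {
			spin_lock(&test_lock);
			sum += jiffies;		/* dummy computation */
			spin_unlock(&test_lock);
			cpu_relax();
		}
		return 0;
	}

	static int __init spin_test_init(void)
	{
		int i;

		for (i = 0; i < NTHREADS; i++)
			spinners[i] = kthread_run(spinner_fn, NULL,
						  "spin_test/%d", i);
		return 0;
	}

	static void __exit spin_test_exit(void)
	{
		int i;

		for (i = 0; i < NTHREADS; i++)
			kthread_stop(spinners[i]);
		pr_info("spin_test: sum=%lu\n", sum);
	}

	module_init(spin_test_init);
	module_exit(spin_test_exit);
	MODULE_LICENSE("GPL");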

> Also I believe time should be passed via sysfs / hardcoded for each
> type of lock we are mimicking

Yap

>
>>
>> This way you can easy calculate:
>> 1. the score of a single vcpu running a single thread
>> 2. the score of sum of all thread scores when #thread==#vcpu all
>> taking the same spin lock. The overall sum should be close as
>> possible to #1.
>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>> (belonging to all VMs) > #pcpus.
>> 4. Create #thread == #vcpus but let each thread have it's own spin
>> lock
>> 5. Like 4 + 2
>>
>> Hopefully this way will allows you to judge and evaluate the exact
>> overhead of scheduling VMs and threads since you have the ideal result
>> in hand and you know what the threads are doing.
>>
>> My 2 cents, Dor
>>
>
> Thank you,
> I think this is an excellent idea (though I am still trying to put
> together all the pieces you mentioned). So overall we should be able to
> measure the performance of pvspinlock/PLE improvements with a
> deterministic load in the guest.
>
> Only thing I am missing is,
> How to generate different combinations of the lock.
>
> Okay, let me see if I can come with a solid model for this.
>

Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed it's static, but it can also be controlled
through the temporary /sys interface.

Thanks for following up!
Dor

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-25 15:00         ` Dor Laor
@ 2012-09-26 12:27           ` Konrad Rzeszutek Wilk
  2012-09-27 10:07             ` Raghavendra K T
  2012-09-27  9:49           ` Raghavendra K T
  1 sibling, 1 reply; 126+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-09-26 12:27 UTC (permalink / raw)
  To: Dor Laor
  Cc: Raghavendra K T, Chegu Vinod, Peter Zijlstra, H. Peter Anvin,
	Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:
> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
> >On 09/24/2012 02:12 PM, Dor Laor wrote:
> >>In order to help PLE and pvticketlock converge I thought that a small
> >>test code should be developed to test this in a predictable,
> >>deterministic way.
> >>
> >>The idea is to have a guest kernel module that spawn a new thread each
> >>time you write to a /sys/.... entry.
> >>
> >>Each such a thread spins over a spin lock. The specific spin lock is
> >>also chosen by the /sys/ interface. Let's say we have an array of spin
> >>locks *10 times the amount of vcpus.
> >>
> >>All the threads are running a
> >>while (1) {
> >>
> >>spin_lock(my_lock);
> >>sum += execute_dummy_cpu_computation(time);
> >>spin_unlock(my_lock);
> >>
> >>if (sys_tells_thread_to_die()) break;
> >>}
> >>
> >>print_result(sum);
> >>
> >>Instead of calling the kernel's spin_lock functions, clone them and make
> >>the ticket lock order deterministic and known (like a linear walk of all
> >>the threads trying to catch that lock).
> >
> >By cloning, do you mean a hierarchy of locks?
> 
> No, I meant to clone the implementation of the current spin lock
> code in order to set any order you may like for the ticket
> selection.
> (even for a non pvticket lock version)

Wouldn't that defeat the purpose of trying to test the different
implementations that try to fix the lock-holder preemption problem?
You want something that you can drop in for all workloads -- including
this test system.
> 
> For instance, let's say you have N threads trying to grab the lock,
> you can always make the ticket go linearly from 1->2...->N.
> Not sure it's a good idea, just a recommendation.

So round-robin. Could you make NCPUS threads, pin them to CPUs, and set
them to be SCHED_RR? Or NCPUS*2 to overcommit.
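
For a guest test module like the one sketched earlier in the thread,
that could look roughly like this (a sketch, error handling omitted;
spinner_fn is the spinner thread function):

	struct sched_param sp = { .sched_priority = 1 };
	struct task_struct *t;

	t = kthread_create(spinner_fn, NULL, "spin_test/%d", cpu);
	kthread_bind(t, cpu);			/* pin to one (v)cpu */
	sched_setscheduler(t, SCHED_RR, &sp);	/* round-robin policy */
	wake_up_process(t);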

> 
> >Also I believe time should be passed via sysfs / hardcoded for each
> >type of lock we are mimicking
> 
> Yap
> 
> >
> >>
> >>This way you can easy calculate:
> >>1. the score of a single vcpu running a single thread
> >>2. the score of sum of all thread scores when #thread==#vcpu all
> >>taking the same spin lock. The overall sum should be close as
> >>possible to #1.
> >>3. Like #2 but #threads > #vcpus and other versions of #total vcpus
> >>(belonging to all VMs) > #pcpus.
> >>4. Create #thread == #vcpus but let each thread have it's own spin
> >>lock
> >>5. Like 4 + 2
> >>
> >>Hopefully this way will allows you to judge and evaluate the exact
> >>overhead of scheduling VMs and threads since you have the ideal result
> >>in hand and you know what the threads are doing.
> >>
> >>My 2 cents, Dor
> >>
> >
> >Thank you,
> >I think this is an excellent idea (though I am still trying to put
> >together all the pieces you mentioned). So overall we should be able
> >to measure the performance of pvspinlock/PLE improvements with a
> >deterministic load in the guest.
> >
> >Only thing I am missing is,
> >How to generate different combinations of the lock.
> >
> >Okay, let me see if I can come with a solid model for this.
> >
> 
> Do you mean the various options for PLE/pvticket/other? I haven't
> thought of it and assumed it's static, but it can also be controlled
> through the temporary /sys interface.
> 
> Thanks for following up!
> Dor

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 12:36     ` Peter Zijlstra
  2012-09-24 13:29       ` Raghavendra K T
@ 2012-09-26 12:57       ` Andrew Jones
  2012-09-27 10:21         ` Raghavendra K T
  1 sibling, 1 reply; 126+ messages in thread
From: Andrew Jones @ 2012-09-26 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> > On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> > >> In some special scenarios like #vcpu<= #pcpu, PLE handler may
> > >> prove very costly, because there is no need to iterate over vcpus
> > >> and do unsuccessful yield_to burning CPU.
> > >
> > > What's the costly thing? The vm-exit, the yield (which should be a nop
> > > if its the only task there) or something else entirely?
> > >
> > Both vmexit and yield_to() actually,
> > 
> > because an unsuccessful yield_to() is overall costly in the PLE handler.
> > 
> > This is because when we have large guests, say 32/16 vcpus, one vcpu
> > is holding a lock and the rest of the vcpus are waiting for it; when
> > they do a PL-exit, each vcpu tries to iterate over the rest of the
> > vcpu list in the VM and do a directed yield (unsuccessfully) -- O(n^2)
> > tries.
> > 
> > This results in a fairly high amount of cpu burning and double runqueue
> > lock contention.
> > 
> > (If they were spinning, lock progress would probably have been faster.)
> > As Avi/Chegu Vinod felt, it is better to avoid the vmexit itself, which
> > seems a little complex to achieve currently.
> 
> OK, so the vmexit stays and we need to improve yield_to.

Can't we do this check sooner as well, as it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
into kvm code would allow us to do other kvm things as a result of the
check in order to avoid some vmexits. It looks like we should be able to
avoid some without much complexity by just making a per-vm ple_window
variable, and then, when we hit the nr_running == 1 condition, also doing
vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP)).
Reset the window to the default value when we successfully yield (and
maybe we should limit the number of bumps).
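
Roughly like this (a sketch; the _BUMP/_MAX/_DEFAULT constants are
invented, PLE_WINDOW is the real VMCS field):

	/* on the nr_running == 1 path: back off, spin longer in guest */
	if (kvm->ple_window < PLE_WINDOW_MAX)
		kvm->ple_window += PLE_WINDOW_BUMP;
	vmcs_write32(PLE_WINDOW, kvm->ple_window);

	/* after a successful yield: restore the default */
	kvm->ple_window = PLE_WINDOW_DEFAULT;
	vmcs_write32(PLE_WINDOW, kvm->ple_window);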

Drew

> 
> How about something like the below, that would allow breaking out of the
> for-each-vcpu loop and simply going back into the vm, right?
> 
> ---
>  kernel/sched/core.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b38f00e..5d5b355 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *   true (>0) if we indeed boosted the target task.
> + *   false (0) if we failed to boost the target.
> + *   -ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  	local_irq_save(flags);
>  	rq = this_rq();
>  
> +	/*
> +	 * If we're the only runnable task on the rq, there's absolutely no
> +	 * point in yielding.
> +	 */
> +	if (rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}
> +
>  again:
>  	p_rq = task_rq(p);
>  	double_rq_lock(rq, p_rq);
> @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  	}
>  
>  	if (!curr->sched_class->yield_to_task)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (curr->sched_class != p->sched_class)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +		goto out_unlock;
>  
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  			resched_task(p_rq->curr);
>  	}
>  
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
>  
> -	if (yielded)
> +	if (yielded > 0)
>  		schedule();
>  
>  	return yielded;
> 

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-24 16:20               ` Avi Kivity
@ 2012-09-26 13:20                 ` Andrew Jones
  2012-09-26 13:26                   ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Andrew Jones @ 2012-09-26 13:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Raghavendra K T, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Mon, Sep 24, 2012 at 06:20:12PM +0200, Avi Kivity wrote:
> On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> >> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> >> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> >> However Rik had a genuine concern in the cases where runqueue is not
> >> >> equally distributed and lockholder might actually be on a different run 
> >> >> queue but not running.
> >> > 
> >> > Load should eventually get distributed equally -- that's what the
> >> > load-balancer is for -- so this is a temporary situation.
> >> 
> >> What's the expected latency?  This is the whole problem.  Eventually the
> >> scheduler would pick the lock holder as well, the problem is that it's
> >> in the millisecond scale while lock hold times are in the microsecond
> >> scale, leading to a 1000x slowdown.
> > 
> > Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> > something like accurate or fast, never both.
> > 
> >> If we want to yield, we really want to boost someone.
> > 
> > Now if only you knew which someone ;-) This non-modified guest nonsense
> > is such a snake pit.. but you know how I feel about all that.
> 
> Actually if I knew that in addition to boosting someone, I also unboost
> myself enough to be preempted, it wouldn't matter.  While boosting the
> lock holder is good, the main point is not spinning and doing useful
> work instead.  We can detect spinners and avoid boosting them.
> 
> That's the motivation for the "donate vruntime" approach I wanted earlier.

I'll probably get shot for the suggestion, but doesn't this problem merit
another scheduler class? We want FIFO order for a special class of tasks,
"spinners". Wouldn't a clean solution be to promote a task's scheduler
class to the spinner class when we PLE (or come from some special syscall
for userspace spinlocks?)? That class would be higher priority than the
fair class and would schedule in FIFO order, but it would only run its
tasks for short periods before switching. Also, after each task is run
its scheduler class would get reset down to its original class (fair).
At least at first thought, this looks cleaner to me than the next-and-skip
hinting, plus it helps guarantee that the lock holder gets scheduled before
the tasks waiting on that lock.

Drew
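
Purely as a sketch of the promote-on-PLE/demote-after-run idea (the
spinner_sched_class and the saved orig_class field are hypothetical; a
real implementation would also have to dequeue/requeue the task around
the class change, as sched_setscheduler() does):

/* on PLE exit (or the special syscall): make the task a "spinner" */
static void promote_to_spinner(struct task_struct *p)
{
	p->orig_class = p->sched_class;		/* e.g. &fair_sched_class */
	p->sched_class = &spinner_sched_class;	/* FIFO order, short slices */
}

/* from the spinner class, once the task has had its short run */
static void demote_spinner(struct task_struct *p)
{
	p->sched_class = p->orig_class;
}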

> 
> > 
> >> > We already try and favour the non running vcpu in this case, that's what
> >> > yield_to_task_fair() is about. If its still not eligible to run, tough
> >> > luck.
> >> 
> >> Crazy idea: instead of yielding, just run that other vcpu in the thread
> >> that would otherwise spin.  I can see about a million objections to this
> >> already though.
> > 
> > Yah.. you want me to list a few? :-) It would require synchronization
> > with the other cpu to pull its task -- one really wants to avoid it also
> > running it.
> 
> Yeah, it's quite a horrible idea.
> 
> > 
> > Do this at a high enough frequency and you're dead too.
> > 
> > Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> > associated with a vcpu thread and use the preemption notifiers to sort
> > things against the scheduler or somesuch.
> 
> That's what I thought when I wrote this, but I can't, I might be
> preempted in random kvm code.  So my state includes the host stack and
> registers.  Maybe we can special-case when we interrupt guest mode.
> 
> > 
> >> >> Do you think instead of using rq->nr_running, we could get a global 
> >> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> >> > 
> >> > To what purpose? Also, global stuff is expensive, so you should try and
> >> > stay away from it as hard as you possibly can.
> >> 
> >> Spinning is also expensive.  How about we do the global stuff every N
> >> times, to amortize the cost (and reduce contention)?
> > 
> > Nah, spinning isn't expensive, its a waste of time, similar end result
> > for someone who wants to do useful work though, but not the same cause.
> > 
> > Pick N and I'll come up with a scenario for which its wrong ;-)
> 
> Sure.  But if it's rare enough, then that's okay for us.
> 
> > Anyway, its an ugly problem and one I really want to contain inside the
> > insanity that created it (virt), lets not taint the rest of the kernel
> > more than we need to. 
> 
> Agreed.  Though given that postgres and others use userspace spinlocks,
> maybe it's not just virt.
> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-26 13:20                 ` Andrew Jones
@ 2012-09-26 13:26                   ` Peter Zijlstra
  2012-09-26 13:39                     ` Andrew Jones
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-26 13:26 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Avi Kivity, Raghavendra K T, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> Wouldn't a clean solution be to promote a task's scheduler
> class to the spinner class when we PLE (or come from some special
> syscall
> for userspace spinlocks?)? 

Userspace spinlocks are typically employed to avoid syscalls..

> That class would be higher priority than the
> fair class and would schedule in FIFO order, but it would only run its
> tasks for short periods before switching. 

Since lock hold times aren't limited, esp. for things like userspace
'spin' locks, you've got a very good denial of service / opportunity for
abuse right there.



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-26 13:26                   ` Peter Zijlstra
@ 2012-09-26 13:39                     ` Andrew Jones
  2012-09-26 13:45                       ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Andrew Jones @ 2012-09-26 13:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Raghavendra K T, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > Wouldn't a clean solution be to promote a task's scheduler
> > class to the spinner class when we PLE (or come from some special
> > syscall
> > for userspace spinlocks?)? 
> 
> Userspace spinlocks are typically employed to avoid syscalls..

I'm guessing there could be a slow path - spin N times and then give
up and yield.

> 
> > That class would be higher priority than the
> > fair class and would schedule in FIFO order, but it would only run its
> > tasks for short periods before switching. 
> 
> Since lock hold times aren't limited, esp. for things like userspace
> 'spin' locks, you've got a very good denial of service / opportunity for
> abuse right there.

Maybe add some throttling to avoid overuse/maliciousness?


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-26 13:39                     ` Andrew Jones
@ 2012-09-26 13:45                       ` Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-26 13:45 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Avi Kivity, Raghavendra K T, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Wed, 2012-09-26 at 15:39 +0200, Andrew Jones wrote:
> On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > > Wouldn't a clean solution be to promote a task's scheduler
> > > class to the spinner class when we PLE (or come from some special
> > > syscall
> > > for userspace spinlocks?)? 
> > 
> > Userspace spinlocks are typically employed to avoid syscalls..
> 
> I'm guessing there could be a slow path - spin N times and then give
> up and yield.

Much better that they do a blocking futex call or so; once you do the
syscall you're in kernel space anyway and have paid the transition cost.

> > 
> > > That class would be higher priority than the
> > > fair class and would schedule in FIFO order, but it would only run its
> > > tasks for short periods before switching. 
> > 
> > Since lock hold times aren't limited, esp. for things like userspace
> > 'spin' locks, you've got a very good denial of service / opportunity for
> > abuse right there.
> 
> Maybe add some throttling to avoid overuse/maliciousness?

At which point you're pretty much back to where you started.

A much better approach is using things like priority inheritance, which
can be extended to cover the fair class just fine..

Also note that user-space spinning is inherently prone to live-locks
when combined with the static priority RT scheduling classes.

In general it's a very bad idea..

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-25  8:54             ` Avi Kivity
  2012-09-25 13:49               ` Raghavendra K T
@ 2012-09-27  7:44               ` Gleb Natapov
  2012-09-27  8:59                 ` Avi Kivity
       [not found]               ` <CAJocwcf+8u84_yDC-PK0Yni93YSTWzYvr69nq6b3pNv1MwVJzQ@mail.gmail.com>
  2 siblings, 1 reply; 126+ messages in thread
From: Gleb Natapov @ 2012-09-27  7:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
> > On 09/24/2012 09:36 PM, Avi Kivity wrote:
> >> On 09/24/2012 05:41 PM, Avi Kivity wrote:
> >>>
> >>>>
> >>>> case 2)
> >>>> rq1 : vcpu1->wait(lockA) (spinning)
> >>>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
> >>>>
> >>>> I agree that checking rq1 length is not proper in this case, and as
> >>>> you
> >>>> rightly pointed out, we are in trouble here.
> >>>> nr_running()/num_online_cpus() would give more accurate picture here,
> >>>> but it seemed costly. May be load balancer save us a bit here in not
> >>>> running to such sort of cases. ( I agree load balancer is far too
> >>>> complex).
> >>>
> >>> In theory preempt notifier can tell us whether a vcpu is preempted or
> >>> not (except for exits to userspace), so we can keep track of whether
> >>> it's we're overcommitted in kvm itself.  It also avoids false positives
> >>> from other guests and/or processes being overcommitted while our vm
> >>> is fine.
> >>
> >> It also allows us to cheaply skip running vcpus.
> >
> > Hi Avi,
> >
> > Could you please elaborate on how preempt notifiers can be used
> > here to keep track of overcommit or skip running vcpus?
> >
> > Are we planning set some flag in sched_out() handler etc?
> >
> 
> Keep a bitmap kvm->preempted_vcpus.
> 
> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
> flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
> set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
> of preempted vcpus.
> 
> We can use the bitmap and the counter to quickly see if spinning is
> worthwhile (if the counter is zero, better to spin).  If not, we can use
> the bitmap to select target vcpus quickly.
> 
> The only problem is that in order to keep this accurate we need to keep
> the preempt notifiers active during exits to userspace.  But we can
> prototype this without this change, and add it later if it works.
> 
Can a user return notifier be used instead? Set the bit in
kvm->preempted_vcpus on return to userspace.

--
			Gleb.
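
A minimal sketch of the tracking Avi describes (kvm_sched_in/out and
preempt_notifier_to_vcpu() exist in virt/kvm/kvm_main.c; the preempted
flag, the preempted_vcpus bitmap and the nr_preempted counter are the
assumed new fields):

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/* TASK_RUNNING here means we were preempted, not sleeping */
	if (current->state == TASK_RUNNING) {
		vcpu->preempted = true;
		set_bit(vcpu->vcpu_id, vcpu->kvm->preempted_vcpus);
		atomic_inc(&vcpu->kvm->nr_preempted);
	}
}

static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	if (vcpu->preempted) {
		vcpu->preempted = false;
		clear_bit(vcpu->vcpu_id, vcpu->kvm->preempted_vcpus);
		atomic_dec(&vcpu->kvm->nr_preempted);
	}
}

The PLE handler could then spin when atomic_read(&kvm->nr_preempted) == 0
and otherwise pick its yield candidates from the bitmap.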

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-25 13:40             ` Raghavendra K T
@ 2012-09-27  8:36               ` Avi Kivity
  2012-09-27 11:23                 ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-27  8:36 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/25/2012 03:40 PM, Raghavendra K T wrote:
> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>> equally distributed and lockholder might actually be on a different run
>>>> queue but not running.
>>>
>>> Load should eventually get distributed equally -- that's what the
>>> load-balancer is for -- so this is a temporary situation.
>>>
>>> We already try and favour the non running vcpu in this case, that's what
>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>> luck.
>>
>> Yes, I agree.
>>
>>>
>>>> Do you think instead of using rq->nr_running, we could get a global
>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>
>>> To what purpose? Also, global stuff is expensive, so you should try and
>>> stay away from it as hard as you possibly can.
>>
>> Yes, that concern only had made me to fall back to rq->nr_running.
>>
>> Will come back with the result soon.
> 
> Got the result with the patches:
> So here is the result,
> 
> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
> 1x and 2x overcommits
> 
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
> 
> ---+-----------+-----------+-----------+-----------+
>    |    Ebizzy result (rec/sec higher is better)   |
> ---+-----------+-----------+-----------+-----------+
>    |    Base   |     A     |      B    |     C     |
> ---+-----------+-----------+-----------+-----------+
> 1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
> 2x | 2536.2500 | 2458.5000 | 2426.3750 |    48.5000|
> ---+-----------+-----------+-----------+-----------+
> 
>    % improvements w.r.t BASE
> ---+------------+------------+------------+
>    |      A     |    B       |     C      |
> ---+------------+------------+------------+
> 1x | 206.37603  |  139.70410 |  210.19323 |
> 2x | -3.06555   |  -4.33218  |  -98.08773 |
> ---+------------+------------+------------+
> 
> we are getting the benefit of almost PLE disabled case with this
> approach. With patch B, we have dropped a bit in gain.
> (because we still would iterate vcpus until we decide to do a directed
> yield).

This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.
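
As a fragment, the fast path this enables in the PLE handler might look
like the following (kvm->nr_preempted and ple_window_bump() are the
assumed names from the sketches earlier in the thread):

	/* in kvm_vcpu_on_spin(), before scanning kvm->vcpus */
	if (atomic_read(&kvm->nr_preempted) == 0) {
		ple_window_bump(kvm);	/* stay in guest longer next time */
		return;			/* nobody preempted: no candidates */
	}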


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-25 14:21             ` Takuya Yoshikawa
@ 2012-09-27  8:43               ` Avi Kivity
  0 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-27  8:43 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Raghavendra K T, Rik van Riel, Peter Zijlstra, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On 09/25/2012 04:21 PM, Takuya Yoshikawa wrote:
> On Tue, 25 Sep 2012 10:12:49 +0200
> Avi Kivity <avi@redhat.com> wrote:
> 
>> It will.  The tradeoff is between false-positive costs (undercommit) and
>> true positive costs (overcommit).  I think undercommit should perform
>> well no matter what.
>> 
>> If we utilize preempt notifiers to track overcommit dynamically, then we
>> can vary the spin time dynamically.  Keep it long initially, as we get
>> more preempted vcpus make it shorter.
> 
> What will happen if we pin each vcpu thread to some core?
> I don't want to see so many vcpu threads moving around without
> being pinned at all.

If you do that you've removed a lot of flexibility from the scheduler,
so overcommit becomes even less likely to work well (a trivial example
is pinning two vcpus from the same vm to the same core -- it's so
obviously bad no one considers doing it).

> In that case, we don't want to make KVM do any work of searching
> a vcpu thread to yield to.

Why not?  If a vcpu thread on another core has been preempted, and is
the lock holder, and we can boost it, then we've fixed our problem.
Even if the spinning thread keeps spinning because it is the only task
eligible to run on its core.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
       [not found]               ` <CAJocwcf+8u84_yDC-PK0Yni93YSTWzYvr69nq6b3pNv1MwVJzQ@mail.gmail.com>
@ 2012-09-27  8:50                 ` Avi Kivity
  2012-09-27 11:26                   ` Raghavendra K T
       [not found]                   ` <CAJocwcc19F+PtsQ5okGMvYeVnkEigpZRpwWY9JgeRPFqfcVoXA@mail.gmail.com>
  0 siblings, 2 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-27  8:50 UTC (permalink / raw)
  To: Jiannan Ouyang
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
> I've actually implemented this preempted_bitmap idea. 

Interesting, please share the code if you can.

> However, I'm doing this to expose this information to the guest, so the
> guest is able to know if the lock holder is preempted or not before
> spining. Right now, I'm doing experiment to show that this idea works.
> 
> I'm wondering what do you guys think of the relationship between the
> pv_ticketlock approach and PLE handler approach. Are we going to adopt
> PLE instead of the pv ticketlock, and why?

Right now we're searching for the best solution.  The tradeoffs are more
or less:

PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface

PV:
- has more information, so it can perform better

Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27  7:44               ` Gleb Natapov
@ 2012-09-27  8:59                 ` Avi Kivity
  2012-09-27  9:11                   ` Gleb Natapov
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-27  8:59 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 09/27/2012 09:44 AM, Gleb Natapov wrote:
> On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
>> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
>> > On 09/24/2012 09:36 PM, Avi Kivity wrote:
>> >> On 09/24/2012 05:41 PM, Avi Kivity wrote:
>> >>>
>> >>>>
>> >>>> case 2)
>> >>>> rq1 : vcpu1->wait(lockA) (spinning)
>> >>>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
>> >>>>
>> >>>> I agree that checking rq1 length is not proper in this case, and as
>> >>>> you
>> >>>> rightly pointed out, we are in trouble here.
>> >>>> nr_running()/num_online_cpus() would give more accurate picture here,
>> >>>> but it seemed costly. May be load balancer save us a bit here in not
>> >>>> running to such sort of cases. ( I agree load balancer is far too
>> >>>> complex).
>> >>>
>> >>> In theory preempt notifier can tell us whether a vcpu is preempted or
>> >>> not (except for exits to userspace), so we can keep track of whether
>> >>> it's we're overcommitted in kvm itself.  It also avoids false positives
>> >>> from other guests and/or processes being overcommitted while our vm
>> >>> is fine.
>> >>
>> >> It also allows us to cheaply skip running vcpus.
>> >
>> > Hi Avi,
>> >
>> > Could you please elaborate on how preempt notifiers can be used
>> > here to keep track of overcommit or skip running vcpus?
>> >
>> > Are we planning set some flag in sched_out() handler etc?
>> >
>> 
>> Keep a bitmap kvm->preempted_vcpus.
>> 
>> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
>> flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
>> set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
>> of preempted vcpus.
>> 
>> We can use the bitmap and the counter to quickly see if spinning is
>> worthwhile (if the counter is zero, better to spin).  If not, we can use
>> the bitmap to select target vcpus quickly.
>> 
>> The only problem is that in order to keep this accurate we need to keep
>> the preempt notifiers active during exits to userspace.  But we can
>> prototype this without this change, and add it later if it works.
>> 
> Can a user return notifier be used instead? Set the bit in
> kvm->preempted_vcpus on return to userspace.
> 

User return notifier is per-cpu, not per-task.  There is a new task_work
(<linux/task_work.h>) that does what you want.  With these
technicalities out of the way, I think it's the wrong idea.  If a vcpu
thread is in userspace, that doesn't mean it's preempted, there's no
point in boosting it if it's already running.

btw, we can have secondary effects.  A vcpu can be waiting for a lock in
the host kernel, or for a host page fault.  There's no point in boosting
anything for that.  Or a vcpu in userspace can be waiting for a lock
that is held by another thread, which has been preempted.  This is (like
I think Peter already said) a priority inheritance problem.  However
with fine-grained locking in userspace, we can make it go away.  The
guest kernel is unlikely to access one device simultaneously from two
threads (and if it does, we just need to improve the threading in the
device model).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27  8:59                 ` Avi Kivity
@ 2012-09-27  9:11                   ` Gleb Natapov
  2012-09-27  9:33                     ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Gleb Natapov @ 2012-09-27  9:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Thu, Sep 27, 2012 at 10:59:21AM +0200, Avi Kivity wrote:
> On 09/27/2012 09:44 AM, Gleb Natapov wrote:
> > On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
> >> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
> >> > On 09/24/2012 09:36 PM, Avi Kivity wrote:
> >> >> On 09/24/2012 05:41 PM, Avi Kivity wrote:
> >> >>>
> >> >>>>
> >> >>>> case 2)
> >> >>>> rq1 : vcpu1->wait(lockA) (spinning)
> >> >>>> rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
> >> >>>>
> >> >>>> I agree that checking rq1 length is not proper in this case, and as
> >> >>>> you
> >> >>>> rightly pointed out, we are in trouble here.
> >> >>>> nr_running()/num_online_cpus() would give more accurate picture here,
> >> >>>> but it seemed costly. May be load balancer save us a bit here in not
> >> >>>> running to such sort of cases. ( I agree load balancer is far too
> >> >>>> complex).
> >> >>>
> >> >>> In theory preempt notifier can tell us whether a vcpu is preempted or
> >> >>> not (except for exits to userspace), so we can keep track of whether
> >> >>> it's we're overcommitted in kvm itself.  It also avoids false positives
> >> >>> from other guests and/or processes being overcommitted while our vm
> >> >>> is fine.
> >> >>
> >> >> It also allows us to cheaply skip running vcpus.
> >> >
> >> > Hi Avi,
> >> >
> >> > Could you please elaborate on how preempt notifiers can be used
> >> > here to keep track of overcommit or skip running vcpus?
> >> >
> >> > Are we planning set some flag in sched_out() handler etc?
> >> >
> >> 
> >> Keep a bitmap kvm->preempted_vcpus.
> >> 
> >> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
> >> flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
> >> set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
> >> of preempted vcpus.
> >> 
> >> We can use the bitmap and the counter to quickly see if spinning is
> >> worthwhile (if the counter is zero, better to spin).  If not, we can use
> >> the bitmap to select target vcpus quickly.
> >> 
> >> The only problem is that in order to keep this accurate we need to keep
> >> the preempt notifiers active during exits to userspace.  But we can
> >> prototype this without this change, and add it later if it works.
> >> 
> > Can a user return notifier be used instead? Set the bit in
> > kvm->preempted_vcpus on return to userspace.
> > 
> 
> User return notifier is per-cpu, not per-task.  There is a new task_work
> (<linux/task_work.h>) that does what you want.  With these
> technicalities out of the way, I think it's the wrong idea.  If a vcpu
> thread is in userspace, that doesn't mean it's preempted, there's no
> point in boosting it if it's already running.
> 
Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task 
is in userspace it is definitely not preempted.
 
> btw, we can have secondary effects.  A vcpu can be waiting for a lock in
> the host kernel, or for a host page fault.  There's no point in boosting
> anything for that.  Or a vcpu in userspace can be waiting for a lock
> that is held by another thread, which has been preempted. 
Do you mean userspace spinlock? Because otherwise task that's waits on
a kernel lock will sleep in the kernel.

>                                                            This is (like
> I think Peter already said) a priority inheritance problem.  However
> with fine-grained locking in userspace, we can make it go away.  The
> guest kernel is unlikely to access one device simultaneously from two
> threads (and if it does, we just need to improve the threading in the
> device model).
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27  9:11                   ` Gleb Natapov
@ 2012-09-27  9:33                     ` Avi Kivity
  2012-09-27  9:58                       ` Gleb Natapov
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-27  9:33 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 09/27/2012 11:11 AM, Gleb Natapov wrote:
>> 
>> User return notifier is per-cpu, not per-task.  There is a new task_work
>> (<linux/task_work.h>) that does what you want.  With these
>> technicalities out of the way, I think it's the wrong idea.  If a vcpu
>> thread is in userspace, that doesn't mean it's preempted, there's no
>> point in boosting it if it's already running.
>> 
> Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
> TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task 
> is in userspace it is definitely not preempted.

No, as I originally wrote.  If it's TASK_RUNNING when it saw sched_out,
then it is preempted (i.e. runnable), not sleeping on some waitqueue,
voluntarily (HLT) or involuntarily (page fault).

>  
>> btw, we can have secondary effects.  A vcpu can be waiting for a lock in
>> the host kernel, or for a host page fault.  There's no point in boosting
>> anything for that.  Or a vcpu in userspace can be waiting for a lock
>> that is held by another thread, which has been preempted. 
> Do you mean userspace spinlock? Because otherwise task that's waits on
> a kernel lock will sleep in the kernel.

I meant a kernel mutex.

vcpu 0: take guest spinlock
vcpu 0: vmexit
vcpu 0: spin_lock(some_lock)
vcpu 1: take same guest spinlock
vcpu 1: PLE vmexit
vcpu 1: wtf?

Waiting on a host kernel spinlock is not too bad because we expect to be
out shortly.  Waiting on a host kernel mutex can be a lot worse.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-25 15:00         ` Dor Laor
  2012-09-26 12:27           ` Konrad Rzeszutek Wilk
@ 2012-09-27  9:49           ` Raghavendra K T
  2012-09-27 10:28             ` Andrew Jones
  2012-09-27 10:33             ` Dor Laor
  1 sibling, 2 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-27  9:49 UTC (permalink / raw)
  To: dlaor
  Cc: Chegu Vinod, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 09/25/2012 08:30 PM, Dor Laor wrote:
> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>> In order to help PLE and pvticketlock converge I thought that a small
>>> test code should be developed to test this in a predictable,
>>> deterministic way.
>>>
>>> The idea is to have a guest kernel module that spawn a new thread each
>>> time you write to a /sys/.... entry.
>>>
>>> Each such a thread spins over a spin lock. The specific spin lock is
>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>> locks *10 times the amount of vcpus.
>>>
>>> All the threads are running a
>>> while (1) {
>>>
>>> spin_lock(my_lock);
>>> sum += execute_dummy_cpu_computation(time);
>>> spin_unlock(my_lock);
>>>
>>> if (sys_tells_thread_to_die()) break;
>>> }
>>>
>>> print_result(sum);
>>>
>>> Instead of calling the kernel's spin_lock functions, clone them and make
>>> the ticket lock order deterministic and known (like a linear walk of all
>>> the threads trying to catch that lock).
>>
>> By Cloning you mean hierarchy of the locks?
>
> No, I meant to clone the implementation of the current spin lock code in
> order to set any order you may like for the ticket selection.
> (even for a non pvticket lock version)
>
> For instance, let's say you have N threads trying to grab the lock, you
> can always make the ticket go linearly from 1->2...->N.
> Not sure it's a good idea, just a recommendation.
>
>> Also I believe time should be passed via sysfs / hardcoded for each
>> type of lock we are mimicking
>
> Yap
>
>>
>>>
>>> This way you can easy calculate:
>>> 1. the score of a single vcpu running a single thread
>>> 2. the score of sum of all thread scores when #thread==#vcpu all
>>> taking the same spin lock. The overall sum should be close as
>>> possible to #1.
>>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>>> (belonging to all VMs) > #pcpus.
>>> 4. Create #thread == #vcpus but let each thread have it's own spin
>>> lock
>>> 5. Like 4 + 2
>>>
>>> Hopefully this way will allows you to judge and evaluate the exact
>>> overhead of scheduling VMs and threads since you have the ideal result
>>> in hand and you know what the threads are doing.
>>>
>>> My 2 cents, Dor
>>>
>>
>> Thank you,
>> I think this is an excellent idea. ( Though I am trying to put all the
>> pieces together you mentioned). So overall we should be able to measure
>> the performance of pvspinlock/PLE improvements with a deterministic
>> load in guest.
>>
>> Only thing I am missing is,
>> How to generate different combinations of the lock.
>>
>> Okay, let me see if I can come with a solid model for this.
>>
>
> Do you mean the various options for PLE/pvticket/other? I haven't
> thought of it and assumed its static but it can also be controlled
> through the temporary /sys interface.
>

No, I am not there yet.

So, in summary, we are suffering from inconsistent benchmark results
while measuring the benefit of our improvements in PLE/pvlock etc.

So the good points from your suggestion are:
- it gives predictability to the workload that runs in the guest, so that
we have a pi-pi comparison of the improvement.

- we can easily tune the workload via sysfs, and we can have scripts to
automate the runs.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding times / lock hierarchies close to the way
they are seen with real workloads (e.g. a highly contended zone lru lock
with similar lock-holding times)?
- How close would it be when we forget about other types of spinning
(e.g. flush_tlb)?

So I feel it is not as trivial as it looks.
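
For reference, a rough sketch of the guest module Dor outlines above (the
names, tunables and /sys wiring are all assumptions; the deterministic
variant would further replace spin_lock() with a cloned ticket lock whose
ordering is controlled):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>

#define NR_TEST_LOCKS	320		/* ~10 * #vcpus for a 32-vcpu guest */

static spinlock_t test_locks[NR_TEST_LOCKS];
static unsigned int hold_loops = 10000;	/* lock hold time knob */

static u64 dummy_computation(unsigned int loops)
{
	volatile u64 sum = 0;

	while (loops--)
		sum++;
	return sum;
}

static int spinner_fn(void *data)
{
	spinlock_t *lock = data;
	u64 sum = 0;

	while (!kthread_should_stop()) {	/* sys_tells_thread_to_die() */
		spin_lock(lock);
		sum += dummy_computation(hold_loops);
		spin_unlock(lock);
	}
	pr_info("spinner: sum=%llu\n", (unsigned long long)sum);
	return 0;
}

static int __init spinner_init(void)
{
	int i;

	for (i = 0; i < NR_TEST_LOCKS; i++)
		spin_lock_init(&test_locks[i]);
	/*
	 * A /sys store method would pick a lock index i and do:
	 *	kthread_run(spinner_fn, &test_locks[i], "spinner-%d", i);
	 */
	return 0;
}
module_init(spinner_init);
MODULE_LICENSE("GPL");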


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27  9:33                     ` Avi Kivity
@ 2012-09-27  9:58                       ` Gleb Natapov
  2012-09-27 10:04                         ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Gleb Natapov @ 2012-09-27  9:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Thu, Sep 27, 2012 at 11:33:56AM +0200, Avi Kivity wrote:
> On 09/27/2012 11:11 AM, Gleb Natapov wrote:
> >> 
> >> User return notifier is per-cpu, not per-task.  There is a new task_work
> >> (<linux/task_work.h>) that does what you want.  With these
> >> technicalities out of the way, I think it's the wrong idea.  If a vcpu
> >> thread is in userspace, that doesn't mean it's preempted, there's no
> >> point in boosting it if it's already running.
> >> 
> > Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
> > TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task 
> > is in userspace it is definitely not preempted.
> 
> No, as I originally wrote.  If it's TASK_RUNNING when it saw sched_out,
> then it is preempted (i.e. runnable), not sleeping on some waitqueue,
> voluntarily (HLT) or involuntarily (page fault).
> 
Of course, I got it all backwards. Need more coffee.

> >  
> >> btw, we can have secondary effects.  A vcpu can be waiting for a lock in
> >> the host kernel, or for a host page fault.  There's no point in boosting
> >> anything for that.  Or a vcpu in userspace can be waiting for a lock
> >> that is held by another thread, which has been preempted. 
> > Do you mean userspace spinlock? Because otherwise task that's waits on
> > a kernel lock will sleep in the kernel.
> 
> I meant a kernel mutex.
> 
> vcpu 0: take guest spinlock
> vcpu 0: vmexit
> vcpu 0: spin_lock(some_lock)
> vcpu 1: take same guest spinlock
> vcpu 1: PLE vmexit
> vcpu 1: wtf?
> 
> Waiting on a host kernel spinlock is not too bad because we expect to be
> out shortly.  Waiting on a host kernel mutex can be a lot worse.
> 
We can't do much about it without PV spinlocks, since there is no
information about which vcpu holds which guest spinlock, no?

--
			Gleb.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27  9:58                       ` Gleb Natapov
@ 2012-09-27 10:04                         ` Avi Kivity
  2012-09-27 10:08                           ` Gleb Natapov
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-27 10:04 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 09/27/2012 11:58 AM, Gleb Natapov wrote:
> 
>> >  
>> >> btw, we can have secondary effects.  A vcpu can be waiting for a lock in
>> >> the host kernel, or for a host page fault.  There's no point in boosting
>> >> anything for that.  Or a vcpu in userspace can be waiting for a lock
>> >> that is held by another thread, which has been preempted. 
>> > Do you mean userspace spinlock? Because otherwise task that's waits on
>> > a kernel lock will sleep in the kernel.
>> 
>> I meant a kernel mutex.
>> 
>> vcpu 0: take guest spinlock
>> vcpu 0: vmexit
>> vcpu 0: spin_lock(some_lock)
>> vcpu 1: take same guest spinlock
>> vcpu 1: PLE vmexit
>> vcpu 1: wtf?
>> 
>> Waiting on a host kernel spinlock is not too bad because we expect to be
>> out shortly.  Waiting on a host kernel mutex can be a lot worse.
>> 
> We can't do much about it without PV spinlock since there is not
> information about what vcpu holds which guest spinlock, no?

It doesn't help.  If the lock holder is waiting for another lock in the
host kernel, boosting it doesn't help even if we know who it is.  We
need to boost the real lock holder, but we have no idea who it is (and
even if we did, we often can't do anything about it).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-26 12:27           ` Konrad Rzeszutek Wilk
@ 2012-09-27 10:07             ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-27 10:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Dor Laor
  Cc: Chegu Vinod, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 09/26/2012 05:57 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:
>> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>>> In order to help PLE and pvticketlock converge I thought that a small
>>>> test code should be developed to test this in a predictable,
>>>> deterministic way.
>>>>
>>>> The idea is to have a guest kernel module that spawn a new thread each
>>>> time you write to a /sys/.... entry.
>>>>
>>>> Each such a thread spins over a spin lock. The specific spin lock is
>>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>>> locks *10 times the amount of vcpus.
>>>>
>>>> All the threads are running a
>>>> while (1) {
>>>>
>>>> spin_lock(my_lock);
>>>> sum += execute_dummy_cpu_computation(time);
>>>> spin_unlock(my_lock);
>>>>
>>>> if (sys_tells_thread_to_die()) break;
>>>> }
>>>>
>>>> print_result(sum);
>>>>
>>>> Instead of calling the kernel's spin_lock functions, clone them and make
>>>> the ticket lock order deterministic and known (like a linear walk of all
>>>> the threads trying to catch that lock).
>>>
>>> By Cloning you mean hierarchy of the locks?
>>
>> No, I meant to clone the implementation of the current spin lock
>> code in order to set any order you may like for the ticket
>> selection.
>> (even for a non pvticket lock version)
>
> Wouldn't that defeat the purpose of trying the test the different
> implementations that try to fix the lock-holder preemption problem?
> You want something that you can shoe-in for all work-loads - also
> for this test system.

Hmm true. I think it is indeed difficult to shoe-in all workloads.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27 10:04                         ` Avi Kivity
@ 2012-09-27 10:08                           ` Gleb Natapov
  2012-09-27 10:15                             ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Gleb Natapov @ 2012-09-27 10:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
> On 09/27/2012 11:58 AM, Gleb Natapov wrote:
> > 
> >> >  
> >> >> btw, we can have secondary effects.  A vcpu can be waiting for a lock in
> >> >> the host kernel, or for a host page fault.  There's no point in boosting
> >> >> anything for that.  Or a vcpu in userspace can be waiting for a lock
> >> >> that is held by another thread, which has been preempted. 
> >> > Do you mean userspace spinlock? Because otherwise task that's waits on
> >> > a kernel lock will sleep in the kernel.
> >> 
> >> I meant a kernel mutex.
> >> 
> >> vcpu 0: take guest spinlock
> >> vcpu 0: vmexit
> >> vcpu 0: spin_lock(some_lock)
> >> vcpu 1: take same guest spinlock
> >> vcpu 1: PLE vmexit
> >> vcpu 1: wtf?
> >> 
> >> Waiting on a host kernel spinlock is not too bad because we expect to be
> >> out shortly.  Waiting on a host kernel mutex can be a lot worse.
> >> 
> > We can't do much about it without PV spinlocks, since there is no
> > information about which vcpu holds which guest spinlock, no?
> 
> It doesn't help.  If the lock holder is waiting for another lock in the
> host kernel, boosting it doesn't help even if we know who it is.  We
> need to boost the real lock holder, but we have no idea who it is (and
> even if we did, we often can't do anything about it).
> 
Without PV lock we will boost random preempted vcpu instead of going to
sleep in the situation you described.

--
			Gleb.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27 10:08                           ` Gleb Natapov
@ 2012-09-27 10:15                             ` Avi Kivity
  0 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-27 10:15 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Raghavendra K T, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 09/27/2012 12:08 PM, Gleb Natapov wrote:
> On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
>> On 09/27/2012 11:58 AM, Gleb Natapov wrote:
>> > 
>> >> >  
>> >> >> btw, we can have secondary effects.  A vcpu can be waiting for a lock in
>> >> >> the host kernel, or for a host page fault.  There's no point in boosting
>> >> >> anything for that.  Or a vcpu in userspace can be waiting for a lock
>> >> >> that is held by another thread, which has been preempted. 
>> >> > Do you mean userspace spinlock? Because otherwise task that's waits on
>> >> > a kernel lock will sleep in the kernel.
>> >> 
>> >> I meant a kernel mutex.
>> >> 
>> >> vcpu 0: take guest spinlock
>> >> vcpu 0: vmexit
>> >> vcpu 0: spin_lock(some_lock)
>> >> vcpu 1: take same guest spinlock
>> >> vcpu 1: PLE vmexit
>> >> vcpu 1: wtf?
>> >> 
>> >> Waiting on a host kernel spinlock is not too bad because we expect to be
>> >> out shortly.  Waiting on a host kernel mutex can be a lot worse.
>> >> 
>> > We can't do much about it without PV spinlock since there is not
>> > information about what vcpu holds which guest spinlock, no?
>> 
>> It doesn't help.  If the lock holder is waiting for another lock in the
>> host kernel, boosting it doesn't help even if we know who it is.  We
>> need to boost the real lock holder, but we have no idea who it is (and
>> even if we did, we often can't do anything about it).
>> 
> Without PV lock we will boost random preempted vcpu instead of going to
> sleep in the situation you described.

True.  In theory boosting a random vcpu shouldn't have any negative
effects though.  Right now the problem is that the boosting itself is
expensive.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-26 12:57       ` Andrew Jones
@ 2012-09-27 10:21         ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-27 10:21 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Avi Kivity, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On 09/26/2012 06:27 PM, Andrew Jones wrote:
> On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
>> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
>>> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
>>>> On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
>>>>> In some special scenarios like #vcpu<= #pcpu, PLE handler may
>>>>> prove very costly, because there is no need to iterate over vcpus
>>>>> and do unsuccessful yield_to burning CPU.
>>>>
>>>> What's the costly thing? The vm-exit, the yield (which should be a nop
>>>> if its the only task there) or something else entirely?
>>>>
>>> Both vmexit and yield_to() actually,
>>>
>>> because unsuccessful yield_to() overall is costly in PLE handler.
>>>
>>> This is because when we have large guests, say 32/16 vcpus, and one
>>> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
>>> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
>>> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
>>>
>>> this results is fairly high amount of cpu burning and double run queue
>>> lock contention.
>>>
>>> (if they were spinning probably lock progress would have been faster).
>>> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
>>> seems little complex to achieve currently.
>>
>> OK, so the vmexit stays and we need to improve yield_to.
>
> Can't we do this check sooner as well, as it only requires per-cpu data?
> If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
> and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
> into kvm code would allow us to do other kvm things as a result of the
> check in order to avoid some vmexits. It looks like we should be able to
> avoid some without much complexity by just making a per-vm ple_window
> variable, and then, when we hit the nr_running == 1 condition, also doing
> vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))
> Reset the window to the default value when we successfully yield (and
> maybe we should limit the number of bumps).

We indeed checked early in the original undercommit patch, and it gave
results closer to the PLE-disabled case. But I agree with Peter that it is
ugly to export nr_running info to the PLE handler.

Looking at the results and comparing A and C:
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
>
>    % improvements w.r.t BASE
> ---+------------+------------+------------+
>    |      A     |    B       |     C      |
> ---+------------+------------+------------+
> 1x | 206.37603  |  139.70410 |  210.19323 |

I have a feeling that the vmexit itself does not cause significant overhead
compared to iterating over the vcpus in the PLE handler. Does that sound
right?

But
> vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))

is worth trying. I will have to see it eventually.
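
A sketch of the early exit being discussed (kvm_vcpu_on_spin() is the real
entry point; rq_nr_running_on_this_cpu() stands in for whatever scheduler
export the patch settles on, and is hypothetical):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/*
	 * Undercommit fast path: if nothing else is runnable on this
	 * cpu, there is nobody to yield to; go straight back to the
	 * guest instead of scanning the vcpu list.
	 */
	if (rq_nr_running_on_this_cpu() == 1)
		return;

	/* ... existing directed-yield loop over kvm->vcpus ... */
}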






^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27  9:49           ` Raghavendra K T
@ 2012-09-27 10:28             ` Andrew Jones
  2012-09-27 10:44               ` Avi Kivity
  2012-09-27 11:31               ` Raghavendra K T
  2012-09-27 10:33             ` Dor Laor
  1 sibling, 2 replies; 126+ messages in thread
From: Andrew Jones @ 2012-09-27 10:28 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: dlaor, Chegu Vinod, Peter Zijlstra, H. Peter Anvin,
	Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:
> On 09/25/2012 08:30 PM, Dor Laor wrote:
> >On 09/24/2012 02:02 PM, Raghavendra K T wrote:
> >>On 09/24/2012 02:12 PM, Dor Laor wrote:
> >>>In order to help PLE and pvticketlock converge I thought that a small
> >>>test code should be developed to test this in a predictable,
> >>>deterministic way.
> >>>
> >>>The idea is to have a guest kernel module that spawn a new thread each
> >>>time you write to a /sys/.... entry.
> >>>
> >>>Each such a thread spins over a spin lock. The specific spin lock is
> >>>also chosen by the /sys/ interface. Let's say we have an array of spin
> >>>locks *10 times the amount of vcpus.
> >>>
> >>>All the threads are running a
> >>>while (1) {
> >>>
> >>>spin_lock(my_lock);
> >>>sum += execute_dummy_cpu_computation(time);
> >>>spin_unlock(my_lock);
> >>>
> >>>if (sys_tells_thread_to_die()) break;
> >>>}
> >>>
> >>>print_result(sum);
> >>>
> >>>Instead of calling the kernel's spin_lock functions, clone them and make
> >>>the ticket lock order deterministic and known (like a linear walk of all
> >>>the threads trying to catch that lock).
> >>
> >>By Cloning you mean hierarchy of the locks?
> >
> >No, I meant to clone the implementation of the current spin lock code in
> >order to set any order you may like for the ticket selection.
> >(even for a non pvticket lock version)
> >
> >For instance, let's say you have N threads trying to grab the lock, you
> >can always make the ticket go linearly from 1->2...->N.
> >Not sure it's a good idea, just a recommendation.
> >
> >>Also I believe time should be passed via sysfs / hardcoded for each
> >>type of lock we are mimicking
> >
> >Yap
> >
> >>
> >>>
> >>>This way you can easy calculate:
> >>>1. the score of a single vcpu running a single thread
> >>>2. the score of sum of all thread scores when #thread==#vcpu all
> >>>taking the same spin lock. The overall sum should be close as
> >>>possible to #1.
> >>>3. Like #2 but #threads > #vcpus and other versions of #total vcpus
> >>>(belonging to all VMs) > #pcpus.
> >>>4. Create #thread == #vcpus but let each thread have it's own spin
> >>>lock
> >>>5. Like 4 + 2
> >>>
> >>>Hopefully this way will allows you to judge and evaluate the exact
> >>>overhead of scheduling VMs and threads since you have the ideal result
> >>>in hand and you know what the threads are doing.
> >>>
> >>>My 2 cents, Dor
> >>>
> >>
> >>Thank you,
> >>I think this is an excellent idea. ( Though I am trying to put all the
> >>pieces together you mentioned). So overall we should be able to measure
> >>the performance of pvspinlock/PLE improvements with a deterministic
> >>load in guest.
> >>
> >>Only thing I am missing is,
> >>How to generate different combinations of the lock.
> >>
> >>Okay, let me see if I can come with a solid model for this.
> >>
> >
> >Do you mean the various options for PLE/pvticket/other? I haven't
> >thought of it and assumed its static but it can also be controlled
> >through the temporary /sys interface.
> >
> 
> No, I am not there yet.
> 
> So, in summary, we are suffering from inconsistent benchmark results
> while measuring the benefit of our improvements in PLE/pvlock etc.

Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?

I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.

To try and get consistent results I do the following:
- interleave the memory of all guests across all numa nodes on the
  machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
  guest
- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
  each run
- use a ramdisk for the benchmark output files on all running guests
- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
  with it turned back on

Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...

Drew

> 
> So the good points from your suggestion are:
> - it gives predictability to the workload that runs in the guest, so that
> we have a pi-pi comparison of the improvement.
> 
> - we can easily tune the workload via sysfs, and we can have scripts to
> automate the runs.
> 
> What is complicated is:
> - How can we simulate a workload close to what we measure with
> benchmarks?
> - How can we mimic lock holding times / lock hierarchies close to the way
> they are seen with real workloads (e.g. a highly contended zone lru lock
> with similar lock-holding times)?
> - How close would it be when we forget about other types of spinning
> (e.g. flush_tlb)?
> 
> So I feel it is not as trivial as it looks.
> 

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27  9:49           ` Raghavendra K T
  2012-09-27 10:28             ` Andrew Jones
@ 2012-09-27 10:33             ` Dor Laor
  1 sibling, 0 replies; 126+ messages in thread
From: Dor Laor @ 2012-09-27 10:33 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Chegu Vinod, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 09/27/2012 11:49 AM, Raghavendra K T wrote:
> On 09/25/2012 08:30 PM, Dor Laor wrote:
>> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>>> In order to help PLE and pvticketlock converge I thought that a small
>>>> test code should be developed to test this in a predictable,
>>>> deterministic way.
>>>>
>>>> The idea is to have a guest kernel module that spawns a new thread each
>>>> time you write to a /sys/.... entry.
>>>>
>>>> Each such thread spins on a spin lock. The specific spin lock is
>>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>>> locks *10 times the amount of vcpus.
>>>>
>>>> All the threads are running a
>>>> while (1) {
>>>>
>>>> spin_lock(my_lock);
>>>> sum += execute_dummy_cpu_computation(time);
>>>> spin_unlock(my_lock);
>>>>
>>>> if (sys_tells_thread_to_die()) break;
>>>> }
>>>>
>>>> print_result(sum);
>>>>
>>>> Instead of calling the kernel's spin_lock functions, clone them and
>>>> make
>>>> the ticket lock order deterministic and known (like a linear walk of
>>>> all
>>>> the threads trying to catch that lock).
>>>
>>> By cloning, do you mean a hierarchy of the locks?
>>
>> No, I meant to clone the implementation of the current spin lock code in
>> order to set any order you may like for the ticket selection.
>> (even for a non pvticket lock version)
>>
>> For instance, let's say you have N threads trying to grab the lock, you
>> can always make the ticket go linearly from 1->2...->N.
>> Not sure it's a good idea, just a recommendation.
>>
>>> Also I believe time should be passed via sysfs / hardcoded for each
>>> type of lock we are mimicking
>>
>> Yap
>>
>>>
>>>>
>>>> This way you can easily calculate:
>>>> 1. the score of a single vcpu running a single thread
>>>> 2. the score of sum of all thread scores when #thread==#vcpu all
>>>> taking the same spin lock. The overall sum should be as close as
>>>> possible to #1.
>>>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>>>> (belonging to all VMs) > #pcpus.
>>>> 4. Create #thread == #vcpus but let each thread have its own spin
>>>> lock
>>>> 5. Like 4 + 2
>>>>
>>>> Hopefully this way will allow you to judge and evaluate the exact
>>>> overhead of scheduling VMs and threads since you have the ideal result
>>>> in hand and you know what the threads are doing.
>>>>
>>>> My 2 cents, Dor
>>>>
>>>
>>> Thank you,
>>> I think this is an excellent idea. (Though I am still trying to put
>>> together all the pieces you mentioned.) So overall we should be able
>>> to measure the performance of pvspinlock/PLE improvements with a
>>> deterministic load in the guest.
>>>
>>> The only thing I am missing is how to generate different combinations
>>> of the lock.
>>>
>>> Okay, let me see if I can come up with a solid model for this.
>>>
>>
>> Do you mean the various options for PLE/pvticket/other? I haven't
>> thought of it and assumed it's static, but it can also be controlled
>> through the temporary /sys interface.
>>
>
> No, I am not there yet.
>
> So in summary, we are suffering from inconsistent benchmark results
> while measuring the benefit of our improvements in PLE/pvlock etc.
>
> So the good points from your suggestion are,
> - it gives predictability to the workload that runs in the guest, so
> that we have a like-for-like comparison of the improvement.
>
> - we can easily tune the workload via sysfs, and we can have scripts
> to automate the runs.
>
> What is complicated is:
> - How can we simulate a workload close to what we measure with
> benchmarks?
> - How can we mimic lock-holding times / lock hierarchy close to the
> way they are seen with real workloads (e.g. a highly contended zone
> lru lock with similar lock-holding times)?

You can spin for a similar instruction count to the one you're interested in.

> - How close would it be when we leave out other types of spinning
> (e.g. flush_tlb)?
>
> So I feel it is not as trivial as it looks.


Indeed this is mainly a tool that can serve to optimize a few
synthetic workloads.
I still believe it is worth going through this exercise, since a 100%
predictable and controlled case can help us purely assess the state of
the PLE and pvticket code. Otherwise we're dealing w/ too many
parameters and assumptions at once.
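
To make this concrete, a minimal userspace sketch of the loop (pthreads
instead of the proposed in-guest module, a fixed thread count instead
of the /sys control -- it shows the methodology, not the real tool):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS	4
#define ITERS		100000	/* lock acquisitions per thread */
#define WORK		10000	/* dummy computation per critical section */

static pthread_spinlock_t lock;	/* one shared lock, i.e. case #2 above */
static unsigned long score[NTHREADS];

static void *worker(void *arg)
{
	long id = (long)arg;
	unsigned long sum = 0;

	for (long i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		for (int j = 0; j < WORK; j++)	/* execute_dummy_cpu_computation() */
			sum += j;
		pthread_spin_unlock(&lock);
	}
	score[id] = sum;	/* stands in for print_result(sum) */
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (long i = 0; i < NTHREADS; i++) {
		pthread_join(t[i], NULL);
		printf("thread %ld score %lu\n", i, score[i]);
	}
	return 0;
}

Giving each thread its own lock (case #4) is just a matter of indexing
an array of locks by thread id.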

Dor


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27 10:28             ` Andrew Jones
@ 2012-09-27 10:44               ` Avi Kivity
  2012-09-27 11:31               ` Raghavendra K T
  1 sibling, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-27 10:44 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Raghavendra K T, dlaor, Chegu Vinod, Peter Zijlstra,
	H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Gleb Natapov

On 09/27/2012 12:28 PM, Andrew Jones wrote:

>> No, I am not there yet.
>> 
>> So in summary, we are suffering from inconsistent benchmark results
>> while measuring the benefit of our improvements in PLE/pvlock etc.
> 
> Are you measuring the combined throughput of all running guests, or
> just looking at the results of the benchmarks in a single test guest?
> 
> I've done some benchmarking as well and my stddevs look pretty good for
> kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
> overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
> full sequence of tests (one with the overcommit levels in scrambled
> order). The relative stddevs for each of the sets of 5 runs look pretty
> good, and the data for the 2 runs match nicely as well.
> 
> To try and get consistent results I do the following 
> - interleave the memory of all guests across all numa nodes on the
>   machine
> - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
>   guest
> - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
>   each run
> - use a ramdisk for the benchmark output files on all running guests
> - no periodically running services installed on the test guest
> - HT is turned off as you do, although I'd like to try running again
>   with it turned back on
> 
> I still need to run again measuring the combined throughput of all
> running vms (including the ones launched just to generate busy vcpus).
> Maybe my results won't be as consistent then...
> 


Another way to test is to execute

 perf stat -e 'kvm_exit exit_reason==40' sleep 10

to see how many PAUSEs were intercepted in a given time (except I just
invented the filter syntax).  The fewer we get, the more useful work the
system does.  This ignores kvm_vcpu_on_spin overhead though, so it's
just a rough measure.
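
Something along these lines should work with the real tracepoint
filter syntax (a sketch -- double-check the flags; exit reason 40 is
EXIT_REASON_PAUSE_INSTRUCTION on VMX):

  # count all kvm exits system-wide over 10s
  perf stat -e 'kvm:kvm_exit' -a sleep 10

  # record only the PAUSE exits, then count them
  perf record -e 'kvm:kvm_exit' --filter 'exit_reason == 40' -a sleep 10
  perf script | wc -l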

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27  8:36               ` Avi Kivity
@ 2012-09-27 11:23                 ` Raghavendra K T
  2012-09-27 12:03                   ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-27 11:23 UTC (permalink / raw)
  To: Avi Kivity, Peter Zijlstra
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang, chegu vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Gleb Natapov,
	Andrew Jones

On 09/27/2012 02:06 PM, Avi Kivity wrote:
> On 09/25/2012 03:40 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>>> equally distributed and lockholder might actually be on a different run
>>>>> queue but not running.
>>>>
>>>> Load should eventually get distributed equally -- that's what the
>>>> load-balancer is for -- so this is a temporary situation.
>>>>
>>>> We already try and favour the non running vcpu in this case, that's what
>>>> yield_to_task_fair() is about. If it's still not eligible to run, tough
>>>> luck.
>>>
>>> Yes, I agree.
>>>
>>>>
>>>>> Do you think instead of using rq->nr_running, we could get a global
>>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>>
>>>> To what purpose? Also, global stuff is expensive, so you should try and
>>>> stay away from it as hard as you possibly can.
>>>
>>> Yes, that concern is what made me fall back to rq->nr_running.
>>>
>>> Will come back with the result soon.
>>
>> Got the results with the patches, so here they are:
>>
>> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
>> 1x and 2x overcommits
>>
>> Base = 3.6.0-rc5 + ple handler optimization patches
>> A = Base + checking rq_running in vcpu_on_spin() patch
>> B = Base + checking rq->nr_running in sched/core
>> C = Base - PLE
>>
>> ---+-----------+-----------+-----------+-----------+
>>     |    Ebizzy result (rec/sec higher is better)   |
>> ---+-----------+-----------+-----------+-----------+
>>     |    Base   |     A     |      B    |     C     |
>> ---+-----------+-----------+-----------+-----------+
>> 1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
>> 2x | 2536.2500 | 2458.5000 | 2426.3750 |    48.5000|
>> ---+-----------+-----------+-----------+-----------+
>>
>>     % improvements w.r.t BASE
>> ---+------------+------------+------------+
>>     |      A     |    B       |     C      |
>> ---+------------+------------+------------+
>> 1x | 206.37603  |  139.70410 |  210.19323 |
>> 2x | -3.06555   |  -4.33218  |  -98.08773 |
>> ---+------------+------------+------------+
>>
>> we are getting almost the benefit of the PLE-disabled case with this
>> approach. With patch B, we have dropped a bit in gain
>> (because we would still iterate over vcpus until we decide to do a
>> directed yield).
>
> This gives us a good case for tracking preemption on a per-vm basis.  As
> long as we aren't preempted, we can keep the PLE window high, and also
> return immediately from the handler without looking for candidates.

1) So do you think the defer-preemption patch (Vatsa was mentioning
it long back) is also worth trying, so that we reduce the chance
of LHP?

IIRC, with defer preemption:
we will have a hook in the spinlock/unlock path to measure the depth
of locks held, shared with the host scheduler (maybe via MSRs now).
The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
rather gives it, say, one chance).

2) Looking at the results (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared even to a
vmexit), so we would still need the undercommit fix suggested by
PeterZ (improving by 140%)?

So looking back at the threads/discussions so far, I am trying to
summarize them. I feel at least these are the few potential
candidates to go in:

1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
2) Dynamically changing PLE window (Avi/Andrew/Chegu)
3) preempt_notify handler to identify preempted VCPUs (Avi)
4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
6) Pv spinlock
7) Jiannan's proposed improvements
8) Defer preemption patches

Did we miss anything (or add anything extra)?

So here are my action items:
- I plan to repost this series with what PeterZ and Rik suggested,
with performance analysis.
- I'll go back and explore (3) and (6).

Please let me know.







^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27  8:50                 ` Avi Kivity
@ 2012-09-27 11:26                   ` Raghavendra K T
  2012-09-27 12:06                     ` Avi Kivity
       [not found]                   ` <CAJocwcc19F+PtsQ5okGMvYeVnkEigpZRpwWY9JgeRPFqfcVoXA@mail.gmail.com>
  1 sibling, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-27 11:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jiannan Ouyang, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/27/2012 02:20 PM, Avi Kivity wrote:
> On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
>> I've actually implemented this preempted_bitmap idea.
>
> Interesting, please share the code if you can.
>
>> However, I'm doing this to expose this information to the guest, so the
>> guest is able to know if the lock holder is preempted or not before
>> spinning. Right now, I'm doing experiments to show that this idea works.
>>
>> I'm wondering what do you guys think of the relationship between the
>> pv_ticketlock approach and PLE handler approach. Are we going to adopt
>> PLE instead of the pv ticketlock, and why?
>
> Right now we're searching for the best solution.  The tradeoffs are more
> or less:
>
> PLE:
> - works for unmodified / non-Linux guests
> - works for all types of spins (e.g. smp_call_function*())
> - utilizes an existing hardware interface (PAUSE instruction) so likely
> more robust compared to a software interface
>
> PV:
> - has more information, so it can perform better

Should we also consider that we always have an edge here for non-PLE
machines?

>
> Given these tradeoffs, if we can get PLE to work for moderate amounts of
> overcommit then I'll prefer it (even if it slightly underperforms PV).
> If we are unable to make it work well, then we'll have to add PV.
>
Avi,
Thanks for this summary. It is of great help in proceeding in the
right direction.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27 10:28             ` Andrew Jones
  2012-09-27 10:44               ` Avi Kivity
@ 2012-09-27 11:31               ` Raghavendra K T
  1 sibling, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-27 11:31 UTC (permalink / raw)
  To: Andrew Jones
  Cc: dlaor, Chegu Vinod, Peter Zijlstra, H. Peter Anvin,
	Marcelo Tosatti, Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On 09/27/2012 03:58 PM, Andrew Jones wrote:
> On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:
>> On 09/25/2012 08:30 PM, Dor Laor wrote:
>>> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>>>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>>>> In order to help PLE and pvticketlock converge I thought that a small
>>>>> test code should be developed to test this in a predictable,
>>>>> deterministic way.
>>>>>
>>>>> The idea is to have a guest kernel module that spawns a new thread each
>>>>> time you write to a /sys/.... entry.
>>>>>
>>>>> Each such thread spins on a spin lock. The specific spin lock is
>>>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>>>> locks *10 times the amount of vcpus.
>>>>>
>>>>> All the threads are running a
>>>>> while (1) {
>>>>>
>>>>> spin_lock(my_lock);
>>>>> sum += execute_dummy_cpu_computation(time);
>>>>> spin_unlock(my_lock);
>>>>>
>>>>> if (sys_tells_thread_to_die()) break;
>>>>> }
>>>>>
>>>>> print_result(sum);
>>>>>
>>>>> Instead of calling the kernel's spin_lock functions, clone them and make
>>>>> the ticket lock order deterministic and known (like a linear walk of all
>>>>> the threads trying to catch that lock).
>>>>
>>>> By cloning, do you mean a hierarchy of the locks?
>>>
>>> No, I meant to clone the implementation of the current spin lock code in
>>> order to set any order you may like for the ticket selection.
>>> (even for a non pvticket lock version)
>>>
>>> For instance, let's say you have N threads trying to grab the lock, you
>>> can always make the ticket go linearly from 1->2...->N.
>>> Not sure it's a good idea, just a recommendation.
>>>
>>>> Also I believe time should be passed via sysfs / hardcoded for each
>>>> type of lock we are mimicking
>>>
>>> Yap
>>>
>>>>
>>>>>
>>>>> This way you can easily calculate:
>>>>> 1. the score of a single vcpu running a single thread
>>>>> 2. the score of sum of all thread scores when #thread==#vcpu all
>>>>> taking the same spin lock. The overall sum should be as close as
>>>>> possible to #1.
>>>>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>>>>> (belonging to all VMs) > #pcpus.
>>>>> 4. Create #thread == #vcpus but let each thread have its own spin
>>>>> lock
>>>>> 5. Like 4 + 2
>>>>>
>>>>> Hopefully this way will allow you to judge and evaluate the exact
>>>>> overhead of scheduling VMs and threads since you have the ideal result
>>>>> in hand and you know what the threads are doing.
>>>>>
>>>>> My 2 cents, Dor
>>>>>
>>>>
>>>> Thank you,
>>>> I think this is an excellent idea. (Though I am still trying to put
>>>> together all the pieces you mentioned.) So overall we should be able
>>>> to measure the performance of pvspinlock/PLE improvements with a
>>>> deterministic load in the guest.
>>>>
>>>> The only thing I am missing is how to generate different combinations
>>>> of the lock.
>>>>
>>>> Okay, let me see if I can come up with a solid model for this.
>>>>
>>>
>>> Do you mean the various options for PLE/pvticket/other? I haven't
>>> thought of it and assumed it's static, but it can also be controlled
>>> through the temporary /sys interface.
>>>
>>
>> No, I am not there yet.
>>
>> So in summary, we are suffering from inconsistent benchmark results
>> while measuring the benefit of our improvements in PLE/pvlock etc.
>
> Are you measuring the combined throughput of all running guests, or
> just looking at the results of the benchmarks in a single test guest?
>
> I've done some benchmarking as well and my stddevs look pretty good for
> kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
> overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
> full sequence of tests (one with the overcommit levels in scrambled
> order). The relative stddevs for each of the sets of 5 runs look pretty
> good, and the data for the 2 runs match nicely as well.
>
> To try and get consistent results I do the following
> - interleave the memory of all guests across all numa nodes on the
>    machine
> - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
>    guest

I was not doing this.

> - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
>    each run

I was doing that already, as you know.

> - use a ramdisk for the benchmark output files on all running guests

Yes, this is also helpful.

> - no periodically running services installed on the test guest
> - HT is turned off as you do, although I'd like to try running again
>    with it turned back on
> I still need to run again measuring the combined throughput of all
> running vms (including the ones launched just to generate busy vcpus).
> Maybe my results won't be as consistent then...

Maybe. I take the average from all the VMs.

>
> Drew
>
>>
>> So the good points from your suggestion are,
>> - it gives predictability to the workload that runs in the guest, so
>> that we have a like-for-like comparison of the improvement.
>>
>> - we can easily tune the workload via sysfs, and we can have scripts
>> to automate the runs.
>>
>> What is complicated is:
>> - How can we simulate a workload close to what we measure with
>> benchmarks?
>> - How can we mimic lock-holding times / lock hierarchy close to the
>> way they are seen with real workloads (e.g. a highly contended zone
>> lru lock with similar lock-holding times)?
>> - How close would it be when we leave out other types of spinning
>> (e.g. flush_tlb)?
>>
>> So I feel it is not as trivial as it looks.
>>
>
>
>


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27 11:23                 ` Raghavendra K T
@ 2012-09-27 12:03                   ` Avi Kivity
  2012-09-27 12:25                     ` Andrew Theurer
                                       ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-27 12:03 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>
>> This gives us a good case for tracking preemption on a per-vm basis.  As
>> long as we aren't preempted, we can keep the PLE window high, and also
>> return immediately from the handler without looking for candidates.
> 
> 1) So do you think the defer-preemption patch (Vatsa was mentioning
> it long back) is also worth trying, so that we reduce the chance
> of LHP?

Yes, we have to keep it in mind.  It will be useful for fine-grained
locks, not so much for coarse locks or IPIs.

I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.

> 
> IIRC, with defer preemption:
> we will have a hook in the spinlock/unlock path to measure the depth
> of locks held, shared with the host scheduler (maybe via MSRs now).
> The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
> rather gives it, say, one chance).

A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).

> 
> 2) Looking at the results (comparing A & C), I do feel we have
> significant overhead in iterating over vcpus (when compared even to a
> vmexit), so we would still need the undercommit fix suggested by
> PeterZ (improving by 140%)?

Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?

> 
> So looking back at the threads/discussions so far, I am trying to
> summarize them. I feel at least these are the few potential
> candidates to go in:
> 
> 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> 3) preempt_notify handler to identify preempted VCPUs (Avi)
> 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> 6) Pv spinlock
> 7) Jiannan's proposed improvements
> 8) Defer preemption patches
> 
> Did we miss anything (or add anything extra)?
> 
> So here are my action items:
> - I plan to repost this series with what PeterZ and Rik suggested,
> with performance analysis.
> - I'll go back and explore (3) and (6).
> 
> Please let me know.

Undoubtedly we'll think of more stuff.  But this looks like a good start.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27 11:26                   ` Raghavendra K T
@ 2012-09-27 12:06                     ` Avi Kivity
  2012-09-28 18:18                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-27 12:06 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jiannan Ouyang, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/27/2012 01:26 PM, Raghavendra K T wrote:
> On 09/27/2012 02:20 PM, Avi Kivity wrote:
>> On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
>>> I've actually implemented this preempted_bitmap idea.
>>
>> Interesting, please share the code if you can.
>>
>>> However, I'm doing this to expose this information to the guest, so the
>>> guest is able to know if the lock holder is preempted or not before
>>> spinning. Right now, I'm doing experiments to show that this idea works.
>>>
>>> I'm wondering what do you guys think of the relationship between the
>>> pv_ticketlock approach and PLE handler approach. Are we going to adopt
>>> PLE instead of the pv ticketlock, and why?
>>
>> Right now we're searching for the best solution.  The tradeoffs are more
>> or less:
>>
>> PLE:
>> - works for unmodified / non-Linux guests
>> - works for all types of spins (e.g. smp_call_function*())
>> - utilizes an existing hardware interface (PAUSE instruction) so likely
>> more robust compared to a software interface
>>
>> PV:
>> - has more information, so it can perform better
> 
> Should we also consider that we always have an edge here for non-PLE
> machines?

True.  The deployment share for these is decreasing rapidly though.  I
hate optimizing for obsolete hardware.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27 12:03                   ` Avi Kivity
@ 2012-09-27 12:25                     ` Andrew Theurer
  2012-09-28  5:38                     ` Raghavendra K T
  2012-10-03 14:29                     ` Raghavendra K T
  2 siblings, 0 replies; 126+ messages in thread
From: Andrew Theurer @ 2012-09-27 12:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote:
> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
> >> This gives us a good case for tracking preemption on a per-vm basis.  As
> >> long as we aren't preempted, we can keep the PLE window high, and also
> >> return immediately from the handler without looking for candidates.
> > 
> > 1) So do you think the defer-preemption patch (Vatsa was mentioning
> > it long back) is also worth trying, so that we reduce the chance
> > of LHP?
> 
> Yes, we have to keep it in mind.  It will be useful for fine-grained
> locks, not so much for coarse locks or IPIs.
> 
> I would still of course prefer a PLE solution, but if we can't get it to
> work we can consider preemption deferral.
> 
> > 
> > IIRC, with defer preemption:
> > we will have a hook in the spinlock/unlock path to measure the depth
> > of locks held, shared with the host scheduler (maybe via MSRs now).
> > The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
> > rather gives it, say, one chance).
> 
> A downside is that we have to do that even when undercommitted.
> 
> Also there may be a lot of false positives (deferred preemptions even
> when there is no contention).
> 
> > 
> > 2) Looking at the results (comparing A & C), I do feel we have
> > significant overhead in iterating over vcpus (when compared even to a
> > vmexit), so we would still need the undercommit fix suggested by
> > PeterZ (improving by 140%)?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 
> Or were you referring to something else?
> 
> > 
> > So looking back at the threads/discussions so far, I am trying to
> > summarize them. I feel at least these are the few potential
> > candidates to go in:
> > 
> > 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> > 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> > 3) preempt_notify handler to identify preempted VCPUs (Avi)
> > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> > 6) Pv spinlock
> > 7) Jiannan's proposed improvements
> > 8) Defer preemption patches
> > 
> > Did we miss anything (or add anything extra)?
> > 
> > So here are my action items:
> > - I plan to repost this series with what PeterZ and Rik suggested,
> > with performance analysis.
> > - I'll go back and explore (3) and (6).
> > 
> > Please let me know.
> 
> Undoubtedly we'll think of more stuff.  But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like
exceptions  (/me runs and hides from scheduler folks)

-Andrew Theurer


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27 12:03                   ` Avi Kivity
  2012-09-27 12:25                     ` Andrew Theurer
@ 2012-09-28  5:38                     ` Raghavendra K T
  2012-09-28  5:45                       ` H. Peter Anvin
                                         ` (2 more replies)
  2012-10-03 14:29                     ` Raghavendra K T
  2 siblings, 3 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-28  5:38 UTC (permalink / raw)
  To: Avi Kivity, Peter Zijlstra
  Cc: H. Peter Anvin, Marcelo Tosatti, Ingo Molnar, Rik van Riel,
	Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang, chegu vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Gleb Natapov,
	Andrew Jones

On 09/27/2012 05:33 PM, Avi Kivity wrote:
> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>
>>> This gives us a good case for tracking preemption on a per-vm basis.  As
>>> long as we aren't preempted, we can keep the PLE window high, and also
>>> return immediately from the handler without looking for candidates.
>>
>> 1) So do you think the defer-preemption patch (Vatsa was mentioning
>> it long back) is also worth trying, so that we reduce the chance
>> of LHP?
>
> Yes, we have to keep it in mind.  It will be useful for fine-grained
> locks, not so much for coarse locks or IPIs.
>

Agree.

> I would still of course prefer a PLE solution, but if we can't get it to
> work we can consider preemption deferral.
>

Okay.

>>
>> IIRC, with defer preemption:
>> we will have a hook in the spinlock/unlock path to measure the depth
>> of locks held, shared with the host scheduler (maybe via MSRs now).
>> The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
>> rather gives it, say, one chance).
>
> A downside is that we have to do that even when undercommitted.
>
> Also there may be a lot of false positives (deferred preemptions even
> when there is no contention).

Yes. That is a worry.

>
>>
>> 2) Looking at the results (comparing A & C), I do feel we have
>> significant overhead in iterating over vcpus (when compared even to a
>> vmexit), so we would still need the undercommit fix suggested by
>> PeterZ (improving by 140%)?
>
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
>
> Or were you referring to something else?

No. I was referring to the same thing.

However, I had also tried the following (which works well to check
the undercommitted scenario). But I am thinking of using it only for
yielding in the overcommit case (yield in overcommit suggested by
Rik), and keeping the undercommit patch as suggested by PeterZ.

[ The patch is not in proper diff format, I suppose. ]

Will test them.

Peter, can I post your patch with your From/Signed-off-by in V2?
Please let me know.

---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 28f00bc..9ed3759 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
  	return eligible;
  }
  #endif
+
+bool kvm_overcommitted()
+{
+	unsigned long load;
+
+	load = avenrun[0] + FIXED_1/200;
+	load = load >> FSHIFT;
+	load = (load << 7) / num_online_cpus();
+
+	if (load > 128)
+		return true;
+
+	return false;
+}
+
  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  {
  	struct kvm *kvm = me->kvm;
@@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  	int pass;
  	int i;

+	if (!kvm_overcommitted())
+		return;
+
  	kvm_vcpu_set_in_spin_loop(me, true);
  	/*
  	 * We boost the priority of a VCPU that is runnable but not
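
For reference, the fixed-point arithmetic above as a standalone
userspace sketch (assuming FSHIFT = 11 and FIXED_1 = (1 << FSHIFT), as
in include/linux/sched.h; the helper mirrors kvm_overcommitted()):

#include <stdbool.h>
#include <stdio.h>

#define FSHIFT	11			/* bits of fixed-point precision */
#define FIXED_1	(1 << FSHIFT)		/* 1.0 in fixed point */

/* true when 1-min loadavg / ncpus > 1.0 (128 represents 1.0 below) */
static bool overcommitted(unsigned long avenrun0, int ncpus)
{
	unsigned long load;

	load = avenrun0 + FIXED_1/200;	/* round, as /proc/loadavg does */
	load = load >> FSHIFT;		/* drop the fractional bits */
	load = (load << 7) / ncpus;	/* rescale so that 128 == 1.0 */

	return load > 128;
}

int main(void)
{
	/* loadavg 33.0 on 32 cpus -> overcommitted (prints 1) */
	printf("%d\n", overcommitted(33 * FIXED_1, 32));
	/* loadavg 16.0 on 32 cpus -> not overcommitted (prints 0) */
	printf("%d\n", overcommitted(16 * FIXED_1, 32));
	return 0;
}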


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28  5:38                     ` Raghavendra K T
@ 2012-09-28  5:45                       ` H. Peter Anvin
  2012-09-28  6:03                         ` Raghavendra K T
  2012-09-28  8:38                       ` Peter Zijlstra
  2012-09-28 11:40                       ` Andrew Theurer
  2 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2012-09-28  5:45 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/27/2012 10:38 PM, Raghavendra K T wrote:
> +
> +bool kvm_overcommitted()
> +{

This better not be C...

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28  5:45                       ` H. Peter Anvin
@ 2012-09-28  6:03                         ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-28  6:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Avi Kivity, Peter Zijlstra, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/28/2012 11:15 AM, H. Peter Anvin wrote:
> On 09/27/2012 10:38 PM, Raghavendra K T wrote:
>> +
>> +bool kvm_overcommitted()
>> +{
>
> This better not be C...

I think you meant I should have had kvm_overcommitted(void) (and a
different function name, perhaps)?

Or is it the body of the function?
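
For reference, the usual complaint here is the empty parameter list;
in C (unlike C++) it leaves the parameters unspecified instead of
declaring none:

bool kvm_overcommitted();	/* old-style: parameters unspecified, so a
				 * bogus kvm_overcommitted(42) call need not
				 * be diagnosed */
bool kvm_overcommitted(void);	/* prototype: explicitly takes no arguments;
				 * kernel style requires this form */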



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
       [not found]                   ` <CAJocwcc19F+PtsQ5okGMvYeVnkEigpZRpwWY9JgeRPFqfcVoXA@mail.gmail.com>
@ 2012-09-28  6:16                     ` Raghavendra K T
  2012-09-30  8:18                       ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-09-28  6:16 UTC (permalink / raw)
  To: Jiannan Ouyang
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/28/2012 02:37 AM, Jiannan Ouyang wrote:
>
>
> On Thu, Sep 27, 2012 at 4:50 AM, Avi Kivity <avi@redhat.com> wrote:
>
>     On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
>      > I've actually implemented this preempted_bitmap idea.
>
>     Interesting, please share the code if you can.
>
>      > However, I'm doing this to expose this information to the guest,
>      > so the guest is able to know if the lock holder is preempted or
>      > not before spinning. Right now, I'm doing experiments to show
>      > that this idea works.
>      >
>      > I'm wondering what do you guys think of the relationship between
>      > the pv_ticketlock approach and PLE handler approach. Are we going
>      > to adopt PLE instead of the pv ticketlock, and why?
>
>     Right now we're searching for the best solution.  The tradeoffs are more
>     or less:
>
>     PLE:
>     - works for unmodified / non-Linux guests
>     - works for all types of spins (e.g. smp_call_function*())
>     - utilizes an existing hardware interface (PAUSE instruction) so likely
>     more robust compared to a software interface
>
>     PV:
>     - has more information, so it can perform better
>
>     Given these tradeoffs, if we can get PLE to work for moderate amounts of
>     overcommit then I'll prefer it (even if it slightly underperforms PV).
>     If we are unable to make it work well, then we'll have to add PV.
>
>     --
>     error compiling committee.c: too many arguments to function
>
>
> FYI, the preempted_bitmap patch.
>
> I deleted some unrelated code in the generated patch file, and that
> seems to have broken the patch file format... I hope someone can
> suggest a fix.
> However, it's pretty straightforward, four things: declaration,
> initialization, set and clear. I think you guys can figure it out easily!
>
> As Avi suggested, you could check task state TASK_RUNNING in sched_out.
>
> Signed-off-by: Jiannan Ouyang <ouyang@cs.pitt.edu>
>
>     diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>     index 8613cbb..4fcb648 100644
>     --- a/arch/x86/include/asm/paravirt_types.h
>     +++ b/arch/x86/include/asm/paravirt_types.h
>     @@ -73,6 +73,16 @@ struct pv_info {
>              const char *name;
>       };

I suppose we need this in a common place, since s390 should also have
it if we are using this information in vcpu_on_spin().

>
>     +struct pv_sched_info {
>     +       unsigned long   sched_bitmap;

Thinking whether we need something similar to cpumask here?
The only thing is that we are representing a guest (v)cpu mask.

>     +} __attribute__((__packed__));
>     +
>       struct pv_init_ops {
>              /*
>               * Patch may replace one of the defined code sequences with
>     diff --git a/arch/x86/kernel/paravirt-spinlocks.c
>     b/arch/x86/kernel/paravirt-spinlocks.c
>     index 676b8c7..2242d22 100644
>     --- a/arch/x86/kernel/paravirt-spinlocks.c
>     +++ b/arch/x86/kernel/paravirt-spinlocks.c
>
>     +struct pv_sched_info pv_sched_info = {
>     +        .sched_bitmap = (unsigned long)-1,
>     +};
>     diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>     index 44ee712..3eb277e 100644
>     --- a/virt/kvm/kvm_main.c
>     +++ b/virt/kvm/kvm_main.c
>     @@ -494,6 +494,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
>              mutex_init(&kvm->slots_lock);
>              atomic_set(&kvm->users_count, 1);
>
>     +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>     +        kvm->pv_sched_info.sched_bitmap = (unsigned long)-1;
>     +#endif
>     +
>              r = kvm_init_mmu_notifier(kvm);
>              if (r)
>                      goto out_err;
>     @@ -2697,7 +2702,13 @@ struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
>       static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
>       {
>              struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>
>     +       set_bit(vcpu->vcpu_id, &vcpu->kvm->pv_sched_info.sched_bitmap);
>              kvm_arch_vcpu_load(vcpu, cpu);
>       }
>
>     @@ -2705,7 +2716,13 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>                                struct task_struct *next)
>       {
>              struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>
>     +       clear_bit(vcpu->vcpu_id, &vcpu->kvm->pv_sched_info.sched_bitmap);
>              kvm_arch_vcpu_put(vcpu);
>       }


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28  5:38                     ` Raghavendra K T
  2012-09-28  5:45                       ` H. Peter Anvin
@ 2012-09-28  8:38                       ` Peter Zijlstra
  2012-09-28 11:40                       ` Andrew Theurer
  2 siblings, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-28  8:38 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
> 
> Peter, can I post your patch with your From/Signed-off-by in V2?
> Please let me know.

Yeah I guess ;-)

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28  5:38                     ` Raghavendra K T
  2012-09-28  5:45                       ` H. Peter Anvin
  2012-09-28  8:38                       ` Peter Zijlstra
@ 2012-09-28 11:40                       ` Andrew Theurer
  2012-09-28 14:11                         ` Raghavendra K T
                                           ` (2 more replies)
  2 siblings, 3 replies; 126+ messages in thread
From: Andrew Theurer @ 2012-09-28 11:40 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
> On 09/27/2012 05:33 PM, Avi Kivity wrote:
> > On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>>
> >>> This gives us a good case for tracking preemption on a per-vm basis.  As
> >>> long as we aren't preempted, we can keep the PLE window high, and also
> >>> return immediately from the handler without looking for candidates.
> >>
> >> 1) So do you think the defer-preemption patch (Vatsa was mentioning
> >> it long back) is also worth trying, so that we reduce the chance
> >> of LHP?
> >
> > Yes, we have to keep it in mind.  It will be useful for fine-grained
> > locks, not so much for coarse locks or IPIs.
> >
> 
> Agree.
> 
> > I would still of course prefer a PLE solution, but if we can't get it to
> > work we can consider preemption deferral.
> >
> 
> Okay.
> 
> >>
> >> IIRC, with defer preemption:
> >> we will have a hook in the spinlock/unlock path to measure the depth
> >> of locks held, shared with the host scheduler (maybe via MSRs now).
> >> The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
> >> rather gives it, say, one chance).
> >
> > A downside is that we have to do that even when undercommitted.

Hopefully vcpu preemption is very rare when undercommitted, so it should
not happen much at all.
> >
> > Also there may be a lot of false positives (deferred preemptions even
> > when there is no contention).

It will be interesting to see how this behaves with a very high lock
activity in a guest.  Once the scheduler defers preemption, is it for a
fixed amount of time, or does it know to cut the deferral short as soon
as the lock depth is reduced [by x]?
> 
> Yes. That is a worry.
> 
> >
> >>
> >> 2) Looking at the results (comparing A & C), I do feel we have
> >> significant overhead in iterating over vcpus (when compared even to a
> >> vmexit), so we would still need the undercommit fix suggested by
> >> PeterZ (improving by 140%)?
> >
> > Looking only at the current runqueue?  My worry is that it misses a lot
> > of cases.  Maybe try the current runqueue first and then others.
> >
> > Or were you referring to something else?
> 
> No. I was referring to the same thing.
> 
> However, I had also tried the following (which works well to check
> the undercommitted scenario). But I am thinking of using it only for
> yielding in the overcommit case (yield in overcommit suggested by
> Rik), and keeping the undercommit patch as suggested by PeterZ.
> 
> [ The patch is not in proper diff format, I suppose. ]
> 
> Will test them.
> 
> Peter, can I post your patch with your From/Signed-off-by in V2?
> Please let me know.
> 
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 28f00bc..9ed3759 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>   	return eligible;
>   }
>   #endif
> +
> +bool kvm_overcommitted()
> +{
> +	unsigned long load;
> +
> +	load = avenrun[0] + FIXED_1/200;
> +	load = load >> FSHIFT;
> +	load = (load << 7) / num_online_cpus();
> +
> +	if (load > 128)
> +		return true;
> +
> +	return false;
> +}
> +
>   void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   {
>   	struct kvm *kvm = me->kvm;
> @@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   	int pass;
>   	int i;
> 
> +	if (!kvm_overcommitted())
> +		return;
> +
>   	kvm_vcpu_set_in_spin_loop(me, true);
>   	/*
>   	 * We boost the priority of a VCPU that is runnable but not



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28 11:40                       ` Andrew Theurer
@ 2012-09-28 14:11                         ` Raghavendra K T
  2012-09-28 14:13                         ` Peter Zijlstra
  2012-09-30  8:24                         ` Avi Kivity
  2 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-09-28 14:11 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/28/2012 05:10 PM, Andrew Theurer wrote:
> On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
>> On 09/27/2012 05:33 PM, Avi Kivity wrote:
>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>
[...]
>>>
>>> Also there may be a lot of false positives (deferred preemptions even
>>> when there is no contention).
>
> It will be interesting to see how this behaves with a very high lock
> activity in a guest.  Once the scheduler defers preemption, is it for a
> fixed amount of time, or does it know to cut the deferral short as soon
> as the lock depth is reduced [by x]?

The design/protocol that Vatsa had in mind was something like this:

- The scheduler does not defer preemption of a lock-holding vcpu
forever; it may give one chance worth only a few ticks. In addition
to giving that chance, the scheduler also sets an indication that the
vcpu has been given its chance.

- Once the vcpu releases (all) the lock(s), if it had been given a
chance, it should clear that indication (ACK) and relinquish the cpu.
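
A minimal sketch of that handshake (all names are hypothetical -- the
real guest/host channel, MSRs or shared memory, is still open):

#include <stdbool.h>

/* Hypothetical per-vcpu state shared between guest and host. */
struct pv_defer_info {
	int lock_depth;		/* guest: ++ on spin_lock, -- on unlock */
	bool chance_given;	/* host: set when a preemption was deferred */
};

/* Guest side, unlock path: ACK the chance and get off the cpu. */
void guest_unlock_hook(struct pv_defer_info *d, void (*yield)(void))
{
	if (--d->lock_depth == 0 && d->chance_given) {
		d->chance_given = false;	/* ACK */
		yield();			/* relinquish voluntarily */
	}
}

/* Host side, at preemption time: give one short chance, then preempt. */
bool host_may_preempt(struct pv_defer_info *d)
{
	if (d->lock_depth > 0 && !d->chance_given) {
		d->chance_given = true;		/* defer once, a few ticks */
		return false;
	}
	return true;
}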





^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28 11:40                       ` Andrew Theurer
  2012-09-28 14:11                         ` Raghavendra K T
@ 2012-09-28 14:13                         ` Peter Zijlstra
  2012-09-30  8:24                         ` Avi Kivity
  2 siblings, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2012-09-28 14:13 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, Avi Kivity, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Fri, 2012-09-28 at 06:40 -0500, Andrew Theurer wrote:
> It will be interesting to see how this behaves with a very high lock
> activity in a guest.  Once the scheduler defers preemption, is it for
> a fixed amount of time, or does it know to cut the deferral short as
> soon as the lock depth is reduced [by x]?

Since the locks live in a guest/userspace, we don't even know they're
held at all, let alone when state changes.

Also, afaik PLE simply exits the guest whenever you do a busy-wait,
there's no guarantee it's due to a lock at all, we could be waiting for a
'virtual' hardware resource or whatnot.



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-27 12:06                     ` Avi Kivity
@ 2012-09-28 18:18                       ` Konrad Rzeszutek Wilk
  2012-09-30  8:16                         ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-09-28 18:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

> >> PLE:
> >> - works for unmodified / non-Linux guests
> >> - works for all types of spins (e.g. smp_call_function*())
> >> - utilizes an existing hardware interface (PAUSE instruction) so likely
> >> more robust compared to a software interface
> >>
> >> PV:
> >> - has more information, so it can perform better
> > 
> > Should we also consider that we always have an edge here for non-PLE
> > machines?
> 
> True.  The deployment share for these is decreasing rapidly though.  I
> hate optimizing for obsolete hardware.

Keep in mind that the patchset that Jeremy provided also cleans up
(removes) parts of the pv spinlock code. It removes the various
spin_lock, spin_unlock, etc. hooks that touch paravirt code. Instead
the pv code is only in the slowpath. And if you don't compile with
CONFIG_PARAVIRT_SPINLOCK the end code is the same as it is now.

On a different subject: I am curious whether Haswell's new locking
instructions (the transactional ones?) can be put to use for the slow
path?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-28 18:18                       ` Konrad Rzeszutek Wilk
@ 2012-09-30  8:16                         ` Avi Kivity
  0 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-30  8:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Raghavendra K T, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

On 09/28/2012 08:18 PM, Konrad Rzeszutek Wilk wrote:
>> >> PLE:
>> >> - works for unmodified / non-Linux guests
>> >> - works for all types of spins (e.g. smp_call_function*())
>> >> - utilizes an existing hardware interface (PAUSE instruction) so likely
>> >> more robust compared to a software interface
>> >>
>> >> PV:
>> >> - has more information, so it can perform better
>> > 
>> > Should we also consider that we always have an edge here for non-PLE
>> > machines?
>> 
>> True.  The deployment share for these is decreasing rapidly though.  I
>> hate optimizing for obsolete hardware.
> 
> Keep in mind that the patchset that Jeremy provided also cleans up
> (removes) parts of the pv spinlock code. It removes the various
> spin_lock, spin_unlock, etc. hooks that touch paravirt code. Instead
> the pv code is only in the slowpath. And if you don't compile with
> CONFIG_PARAVIRT_SPINLOCK the end code is the same as it is now.

We still need to maintain information about the lock holder if pv is
enabled at all, even if it is never used.

> On a different subject: I am curious whether Haswell's new locking
> instructions (the transactional ones?) can be put to use for the slow
> path?

Transactions are blown on any context switch, so no.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-28  6:16                     ` Raghavendra K T
@ 2012-09-30  8:18                       ` Avi Kivity
  2012-09-30 11:07                         ` Gleb Natapov
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-30  8:18 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jiannan Ouyang, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 09/28/2012 08:16 AM, Raghavendra K T wrote:
> 
>>
>>     +struct pv_sched_info {
>>     +       unsigned long   sched_bitmap;
> 
> Thinking whether we need something similar to cpumask here?
> The only thing is that we are representing a guest (v)cpu mask.
> 

DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)

cpumask is for host masks, this is a guest mask.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-28 11:40                       ` Andrew Theurer
  2012-09-28 14:11                         ` Raghavendra K T
  2012-09-28 14:13                         ` Peter Zijlstra
@ 2012-09-30  8:24                         ` Avi Kivity
  2 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-09-30  8:24 UTC (permalink / raw)
  To: habanero
  Cc: Raghavendra K T, Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 09/28/2012 01:40 PM, Andrew Theurer wrote:
>> 
>> >>
>> >> IIRC, with defer preemption:
>> >> we will have a hook in the spinlock/unlock path to measure the depth
>> >> of locks held, shared with the host scheduler (maybe via MSRs now).
>> >> The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
>> >> rather gives it, say, one chance).
>> >
>> > A downside is that we have to do that even when undercommitted.
> 
> Hopefully vcpu preemption is very rare when undercommitted, so it should
> not happen much at all.

As soon as you're preempted, you're effectively overcommitted (even if
the system as a whole is undercommitted).  What I meant was that you
need to communicate your lock state to the host, and with fine-grained
locking this can happen a lot.  It may be as simple as an
increment/decrement instruction though.



-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-30  8:18                       ` Avi Kivity
@ 2012-09-30 11:07                         ` Gleb Natapov
  2012-09-30 11:13                           ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Gleb Natapov @ 2012-09-30 11:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
> > 
> >>
> >>     +struct pv_sched_info {
> >>     +       unsigned long   sched_bitmap;
> > 
> > Thinking whether we need something similar to cpumask here?
> > The only thing is that we are representing a guest (v)cpu mask.
> > 
> 
> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
> 
vcpu_id can be greater than KVM_MAX_VCPUS.

> cpumask is for host masks, this is a guest mask.
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-30 11:07                         ` Gleb Natapov
@ 2012-09-30 11:13                           ` Avi Kivity
  2012-10-03 14:17                             ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-09-30 11:13 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Raghavendra K T, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 09/30/2012 01:07 PM, Gleb Natapov wrote:
> On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
>> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
>> > 
>> >>
>> >>     +struct pv_sched_info {
>> >>     +       unsigned long   sched_bitmap;
>> > 
>> > Thinking, whether we need something similar to cpumask here?
>> > Only thing is we are representing guest (v)cpumask.
>> > 
>> 
>> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
>> 
> vcpu_id can be greater than KVM_MAX_VCPUS.

Use the index into the vcpu table as the bitmap index then.  In fact
it's better because then the lookup to get the vcpu pointer is trivial.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-24 15:41       ` Avi Kivity
  2012-09-24 16:06         ` Avi Kivity
  2012-09-25  7:36         ` Raghavendra K T
@ 2012-10-03 12:22         ` Raghavendra K T
  2012-10-03 17:05           ` Avi Kivity
  2 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-03 12:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Rik van Riel, Peter Zijlstra, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Gleb Natapov

* Avi Kivity <avi@redhat.com> [2012-09-24 17:41:19]:

> On 09/21/2012 08:24 PM, Raghavendra K T wrote:
> > On 09/21/2012 06:32 PM, Rik van Riel wrote:
> >> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
> >>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >>>
> >>> When total number of VCPUs of system is less than or equal to physical
> >>> CPUs,
> >>> PLE exits become costly since each VCPU can have dedicated PCPU, and
> >>> trying to find a target VCPU to yield_to just burns time in PLE handler.
> >>>
> >>> This patch reduces overhead, by simply doing a return in such
> >>> scenarios by
> >>> checking the length of current cpu runqueue.
> >>
> >> I am not convinced this is the way to go.
> >>
> >> The VCPU that is holding the lock, and is not releasing it,
> >> probably got scheduled out. That implies that VCPU is on a
> >> runqueue with at least one other task.
> > 
> > I see your point here, we have two cases:
> > 
> > case 1)
> > 
> > rq1 : vcpu1->wait(lockA) (spinning)
> > rq2 : vcpu2->holding(lockA) (running)
> > 
> > Here Ideally vcpu1 should not enter PLE handler, since it would surely
> > get the lock within ple_window cycle. (assuming ple_window is tuned for
> > that workload perfectly).
> > 
> > May be this explains why we are not seeing benefit with kernbench.
> > 
> > On the other side, since we cannot have a perfect ple_window tuned for
> > all types of workloads, we gain for those workloads which may need more
> > than 4096 cycles. Is that what we are seeing in the benefited cases?
> 
> Maybe we need to increase the ple window regardless.  4096 cycles is 2
> microseconds or less (call it t_spin).  The overhead from
> kvm_vcpu_on_spin() and the associated task switches is at least a few
> microseconds, increasing as contention is added (call it t_yield).  The
> time for a natural context switch is several milliseconds (call it
> t_slice).  There is also the time the lock holder owns the lock,
> assuming no contention (t_hold).
> 
> If t_yield > t_spin, then in the undercommitted case it dominates
> t_spin.  If t_hold > t_spin we lose badly.
> 
> If t_spin > t_yield, then the undercommitted case doesn't suffer as much
> as most of the spinning happens in the guest instead of the host, so it
> can pick up the unlock timely.  We don't lose too much in the
> overcommitted case provided the values aren't too far apart (say a
> factor of 3).
> 
> Obviously t_spin must be significantly smaller than t_slice, otherwise
> it accomplishes nothing.
> 
> Regarding t_hold: if it is small, then a larger t_spin helps avoid false
> exits.  If it is large, then we're not very sensitive to t_spin.  It
> doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
> yielding for several milliseconds.
> 
> So I think it's worth trying again with ple_window of 20000-40000.
> 

Hi Avi,

I ran different benchmarks with increasing ple_window, and the results
do not seem encouraging for increasing ple_window.

Results:
16 core PLE machine with 16 vcpu guest. 

base kernel = 3.6-rc5 + ple handler optimization patch 
base_pleopt_8k = base kernel + ple window = 8k
base_pleopt_16k = base kernel + ple window = 16k
base_pleopt_32k = base kernel + ple window = 32k


Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096

		base_pleopt_8k	base_pleopt_16k	base_pleopt_32k
-----------------------------------------------------------------			
kernbench_1x	-5.54915	-15.94529	-44.31562
kernbench_2x	-7.89399	-17.75039	-37.73498
-----------------------------------------------------------------			
sysbench_1x	0.45955		-0.98778	0.05252
sysbench_2x	1.44071		-0.81625	1.35620
sysbench_3x 	0.45549		1.51795		-0.41573
-----------------------------------------------------------------			
			
hackbench_1x	-3.80272	-13.91456	-40.79059
hackbench_2x 	-4.78999	-7.61382	-7.24475
-----------------------------------------------------------------			
ebizzy_1x	-2.54626	-16.86050	-38.46109
ebizzy_2x	-8.75526	-19.29116	-48.33314
-----------------------------------------------------------------			

I also got perf top output to analyse the difference. The difference
comes from flushtlb (the TLB flush path) and also from spinlocks.

Ebizzy run for 4k ple_window
-  87.20%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
      - 100.00% _raw_spin_unlock_irqrestore
         + 52.89% release_pages
         + 47.10% pagevec_lru_move_fn
-   5.71%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
      + 86.03% default_send_IPI_mask_allbutself_phys
      + 13.96% default_send_IPI_mask_sequence_phys
-   3.10%  [kernel]  [k] smp_call_function_many
     smp_call_function_many


Ebizzy run for 32k ple_window

-  91.40%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
      - 100.00% _raw_spin_unlock_irqrestore
         + 53.13% release_pages
         + 46.86% pagevec_lru_move_fn
-   4.38%  [kernel]  [k] smp_call_function_many
     smp_call_function_many
-   2.51%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
      + 90.76% default_send_IPI_mask_allbutself_phys
      + 9.24% default_send_IPI_mask_sequence_phys


Below is the detailed result:			
patch = base_pleopt_8k 
+-----------+-----------+-----------+------------+-----------+
                              kernbench 
+-----------+-----------+-----------+------------+-----------+
    base         stddev    patch       stdev       %improve    
+-----------+-----------+-----------+------------+-----------+
    41.0027     0.7990	    43.2780     0.5180	  -5.54915
    89.2983     1.2406	    96.3475     1.8891	  -7.89399
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              sysbench 
+-----------+-----------+-----------+------------+-----------+
     9.9010     0.0558	     9.8555     0.1246	   0.45955
    19.7611     0.4290	    19.4764     0.0835	   1.44071
    29.1775     0.9903	    29.0446     0.8641	   0.45549
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              hackbench 
+-----------+-----------+-----------+------------+-----------+
    77.1580     1.9787	    80.0921     2.9696	  -3.80272
   239.2490     1.5660	   250.7090     2.6074	  -4.78999
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              ebizzy 
+-----------+-----------+-----------+------------+-----------+
  4256.2500   186.8053	  4147.8750   206.1840	  -2.54626
  2197.2500    93.1048	  2004.8750    85.7995	  -8.75526
+-----------+-----------+-----------+------------+-----------+

patch = base_pleopt_16k
+-----------+-----------+-----------+------------+-----------+
                              kernbench 
+-----------+-----------+-----------+------------+-----------+
    base         stddev    patch       stdev       %improve    
+-----------+-----------+-----------+------------+-----------+
    41.0027     0.7990	    47.5407     0.5739	 -15.94529
    89.2983     1.2406	   105.1491     1.2244	 -17.75039
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              sysbench 
+-----------+-----------+-----------+------------+-----------+
     9.9010     0.0558	     9.9988     0.1106	  -0.98778
    19.7611     0.4290	    19.9224     0.9016	  -0.81625
    29.1775     0.9903	    28.7346     0.2788	   1.51795
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              hackbench 
+-----------+-----------+-----------+------------+-----------+
    77.1580     1.9787	    87.8942     2.2132	 -13.91456
   239.2490     1.5660	   257.4650     5.3674	  -7.61382
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              ebizzy 
+-----------+-----------+-----------+------------+-----------+
  4256.2500   186.8053	  3538.6250   101.1165	 -16.86050
  2197.2500    93.1048	  1773.3750    91.8414	 -19.29116
+-----------+-----------+-----------+------------+-----------+

patch = base_pleopt_32k
+-----------+-----------+-----------+------------+-----------+
                              kernbench 
+-----------+-----------+-----------+------------+-----------+
    base         stddev    patch       stdev       %improve    
+-----------+-----------+-----------+------------+-----------+
    41.0027     0.7990	    59.1733     0.8102	 -44.31562
    89.2983     1.2406	   122.9950     1.5534	 -37.73498
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              sysbench 
+-----------+-----------+-----------+------------+-----------+
     9.9010     0.0558	     9.8958     0.0593	   0.05252
    19.7611     0.4290	    19.4931     0.1767	   1.35620
    29.1775     0.9903	    29.2988     1.0420	  -0.41573
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              hackbench 
+-----------+-----------+-----------+------------+-----------+
    77.1580     1.9787	   108.6312    13.1500	 -40.79059
   239.2490     1.5660	   256.5820     2.2722	  -7.24475
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              ebizzy 
+-----------+-----------+-----------+------------+-----------+
  4256.2500   186.8053	  2619.2500    80.8150	 -38.46109
  2197.2500    93.1048	  1135.2500    22.2887	 -48.33314
+-----------+-----------+-----------+------------+-----------+


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-09-30 11:13                           ` Avi Kivity
@ 2012-10-03 14:17                             ` Raghavendra K T
  2012-10-03 14:56                               ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-03 14:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gleb Natapov, Raghavendra K T, Jiannan Ouyang, Peter Zijlstra,
	Rik van Riel, H. Peter Anvin, Ingo Molnar, Marcelo Tosatti,
	Srikar, Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer,
	LKML, Srivatsa Vaddagiri

* Avi Kivity <avi@redhat.com> [2012-09-30 13:13:09]:

> On 09/30/2012 01:07 PM, Gleb Natapov wrote:
> > On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
> >> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
> >> > 
> >> >>
> >> >>     +struct pv_sched_info {
> >> >>     +       unsigned long   sched_bitmap;
> >> > 
> >> > Thinking, whether we need something similar to cpumask here?
> >> > Only thing is we are representing guest (v)cpumask.
> >> > 
> >> 
> >> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
> >> 
> > vcpu_id can be greater than KVM_MAX_VCPUS.
> 
> Use the index into the vcpu table as the bitmap index then.  In fact
> it's better because then the lookup to get the vcpu pointer is trivial.

Did you mean, while setting the bitmap,

we should do 
for (i = 1..n)
if (kvm->vcpus[i] == vcpu) set ith position in bitmap?

I just wanted to know whether there is any easy way to convert from 
vcpu  pointer to index in kvm vcpu table.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-09-27 12:03                   ` Avi Kivity
  2012-09-27 12:25                     ` Andrew Theurer
  2012-09-28  5:38                     ` Raghavendra K T
@ 2012-10-03 14:29                     ` Raghavendra K T
  2012-10-03 17:25                       ` Avi Kivity
  2 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-03 14:29 UTC (permalink / raw)
  To: Avi Kivity, Peter Zijlstra
  Cc: Raghavendra K T, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

* Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:

> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
[...]
> > 2) looking at the result (comparing A & C), I do feel we have
> > significant overhead in iterating over vcpus (when compared to even vmexit),
> > so we still would need the undercommit fix suggested by PeterZ (improving by
> > 140%)?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 

Okay. Do you mean we can have something like

+       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+               yielded = -ESRCH;
+               goto out_irq;
+       }

in Peter's patch?

(I thought a lot about && vs ||; both seem to have their own cons.)
But that should apply only when we have a short-term imbalance, as
PeterZ said.

I am experimenting with all of these for the V2 patch, and will come
back with analysis and a patch.
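
For reference, here is roughly where the check above would sit inside
yield_to() in kernel/sched/core.c (a simplified sketch; it assumes the
int-returning yield_to() from Peter's patch, and the surrounding code
is illustrative rather than quoted from it):

	int yield_to(struct task_struct *p, bool preempt)
	{
		struct rq *rq = this_rq();	/* source vcpu's runqueue */
		struct rq *p_rq = task_rq(p);	/* target vcpu's runqueue */
		int yielded = 0;
		...
		/* both runqueues hold a single runnable task: almost
		 * certainly undercommitted, yielding cannot help */
		if (rq->nr_running == 1 && p_rq->nr_running == 1) {
			yielded = -ESRCH;
			goto out_irq;
		}
		...
	}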

> Or were you referring to something else?
> 


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-03 14:17                             ` Raghavendra K T
@ 2012-10-03 14:56                               ` Avi Kivity
  2012-10-04  7:29                                 ` Gleb Natapov
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-10-03 14:56 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Gleb Natapov, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 10/03/2012 04:17 PM, Raghavendra K T wrote:
> * Avi Kivity <avi@redhat.com> [2012-09-30 13:13:09]:
> 
>> On 09/30/2012 01:07 PM, Gleb Natapov wrote:
>> > On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
>> >> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
>> >> > 
>> >> >>
>> >> >>     +struct pv_sched_info {
>> >> >>     +       unsigned long   sched_bitmap;
>> >> > 
>> >> > Thinking, whether we need something similar to cpumask here?
>> >> > Only thing is we are representing guest (v)cpumask.
>> >> > 
>> >> 
>> >> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
>> >> 
>> > vcpu_id can be greater than KVM_MAX_VCPUS.
>> 
>> Use the index into the vcpu table as the bitmap index then.  In fact
>> it's better because then the lookup to get the vcpu pointer is trivial.
> 
> Did you mean, while setting the bitmap,
> 
> we should do 
> for (i = 1..n)
> if (kvm->vcpus[i] == vcpu) set ith position in bitmap?

You can store i in the vcpu itself:

  set_bit(vcpu->index, &kvm->preempted);

> 
> I just wanted to know whether there is any easy way to convert from 
> vcpu  pointer to index in kvm vcpu table.
> 
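
Putting the pieces together, a hedged sketch (kvm->vcpus[] and
set_bit() are real; vcpu->index, kvm->preempted and the assignment
point are illustrative assumptions, not what the posted patches do):

	/* at vcpu creation, remember which slot the vcpu lands in */
	vcpu->index = atomic_read(&kvm->online_vcpus);
	kvm->vcpus[vcpu->index] = vcpu;

	/* marking a preempted vcpu is then a single bit operation */
	set_bit(vcpu->index, &kvm->preempted);

	/* and the reverse lookup from bit to vcpu pointer is trivial */
	vcpu = kvm->vcpus[bit];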



-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-03 12:22         ` Raghavendra K T
@ 2012-10-03 17:05           ` Avi Kivity
  2012-10-04 10:49             ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-10-03 17:05 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>> So I think it's worth trying again with ple_window of 20000-40000.
>> 
> 
> Hi Avi,
> 
> I ran different benchmarks increasing ple_window, and results does not
> seem to be encouraging for increasing ple_window.

Thanks for testing! Comments below.

> Results:
> 16 core PLE machine with 16 vcpu guest. 
> 
> base kernel = 3.6-rc5 + ple handler optimization patch 
> base_pleopt_8k = base kernel + ple window = 8k
> base_pleopt_16k = base kernel + ple window = 16k
> base_pleopt_32k = base kernel + ple window = 32k
> 
> 
> Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096
> 
> 		base_pleopt_8k	base_pleopt_16k	base_pleopt_32k
> -----------------------------------------------------------------			
> kernbench_1x	-5.54915	-15.94529	-44.31562
> kernbench_2x	-7.89399	-17.75039	-37.73498

So, 44% degradation even with no overcommit?  That's surprising.

> I also got perf top output to analyse the difference. Difference comes
> because of flushtlb (and also spinlock).

That's in the guest, yes?

> 
> Ebizzy run for 4k ple_window
> -  87.20%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       - 100.00% _raw_spin_unlock_irqrestore
>          + 52.89% release_pages
>          + 47.10% pagevec_lru_move_fn
> -   5.71%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       + 86.03% default_send_IPI_mask_allbutself_phys
>       + 13.96% default_send_IPI_mask_sequence_phys
> -   3.10%  [kernel]  [k] smp_call_function_many
>      smp_call_function_many
> 
> 
> Ebizzy run for 32k ple_window
> 
> -  91.40%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       - 100.00% _raw_spin_unlock_irqrestore
>          + 53.13% release_pages
>          + 46.86% pagevec_lru_move_fn
> -   4.38%  [kernel]  [k] smp_call_function_many
>      smp_call_function_many
> -   2.51%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       + 90.76% default_send_IPI_mask_allbutself_phys
>       + 9.24% default_send_IPI_mask_sequence_phys
> 

Both the 4k and the 32k results are crazy.  Why is
arch_local_irq_restore() so prominent?  Do you have a very high
interrupt rate in the guest?




-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-10-03 14:29                     ` Raghavendra K T
@ 2012-10-03 17:25                       ` Avi Kivity
  2012-10-04 10:56                         ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-10-03 17:25 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/03/2012 04:29 PM, Raghavendra K T wrote:
> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
> 
>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>> >>
> [...]
>> > 2) looking at the result (comparing A & C) , I do feel we have
>> > significant in iterating over vcpus (when compared to even vmexit)
>> > so We still would need undercommit fix sugested by PeterZ (improving by
>> > 140%). ?
>> 
>> Looking only at the current runqueue?  My worry is that it misses a lot
>> of cases.  Maybe try the current runqueue first and then others.
>> 
> 
> Okay. Do you mean we can have something like
> 
> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +               yielded = -ESRCH;
> +               goto out_irq;
> +       }
> 
> in the Peter's patch ?
> 
> ( I thought lot about && or || . Both seem to have their own cons ).
> But that should be only when we have short term imbalance, as PeterZ
> told.

I'm missing the context.  What is p_rq?

What I mean was:

  if can_yield_to_process_in_current_rq
     do that
  else if can_yield_to_process_in_other_rq
     do that
  else
     return -ESRCH
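
In C-style pseudocode (the helper names are invented purely for
illustration, and neither exists in kvm or the scheduler; yield_to()
is assumed to be the int-returning variant from Peter's patch):

	struct task_struct *t;

	t = find_yield_candidate(this_rq());	/* cheap: same runqueue */
	if (!t)
		t = find_yield_candidate_other_rqs();	/* wider, costlier walk */
	if (!t)
		return -ESRCH;
	return yield_to(t, false);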


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-03 14:56                               ` Avi Kivity
@ 2012-10-04  7:29                                 ` Gleb Natapov
  2012-10-05  8:36                                   ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Gleb Natapov @ 2012-10-04  7:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Wed, Oct 03, 2012 at 04:56:57PM +0200, Avi Kivity wrote:
> On 10/03/2012 04:17 PM, Raghavendra K T wrote:
> > * Avi Kivity <avi@redhat.com> [2012-09-30 13:13:09]:
> > 
> >> On 09/30/2012 01:07 PM, Gleb Natapov wrote:
> >> > On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
> >> >> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
> >> >> > 
> >> >> >>
> >> >> >>     +struct pv_sched_info {
> >> >> >>     +       unsigned long   sched_bitmap;
> >> >> > 
> >> >> > Thinking, whether we need something similar to cpumask here?
> >> >> > Only thing is we are representing guest (v)cpumask.
> >> >> > 
> >> >> 
> >> >> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
> >> >> 
> >> > vcpu_id can be greater than KVM_MAX_VCPUS.
> >> 
> >> Use the index into the vcpu table as the bitmap index then.  In fact
> >> it's better because then the lookup to get the vcpu pointer is trivial.
> > 
> > Did you mean, while setting the bitmap,
> > 
> > we should do 
> > for (i = 1..n)
> > if (kvm->vcpus[i] == vcpu) set ith position in bitmap?
> 
> You can store i in the vcpu itself:
> 
>   set_bit(vcpu->index, &kvm->preempted);
> 
This will make the fact that vcpus are stored in an array part of the
API instead of an implementation detail :( There were patches for vcpu
destruction that changed the array to an rculist. Well, it will still
be possible to make the array rcu protected and copy it every time a
vcpu is deleted/added, I guess.

--
			Gleb.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-03 17:05           ` Avi Kivity
@ 2012-10-04 10:49             ` Raghavendra K T
  2012-10-04 12:41               ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-04 10:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 10/03/2012 10:35 PM, Avi Kivity wrote:
> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>
>>
>> Hi Avi,
>>
>> I ran different benchmarks increasing ple_window, and results does not
>> seem to be encouraging for increasing ple_window.
>
> Thanks for testing! Comments below.
>
>> Results:
>> 16 core PLE machine with 16 vcpu guest.
>>
>> base kernel = 3.6-rc5 + ple handler optimization patch
>> base_pleopt_8k = base kernel + ple window = 8k
>> base_pleopt_16k = base kernel + ple window = 16k
>> base_pleopt_32k = base kernel + ple window = 32k
>>
>>
>> Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096
>>
>> 		base_pleopt_8k	base_pleopt_16k	base_pleopt_32k
>> -----------------------------------------------------------------			
>> kernbench_1x	-5.54915	-15.94529	-44.31562
>> kernbench_2x	-7.89399	-17.75039	-37.73498
>
> So, 44% degradation even with no overcommit?  That's surprising.

Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is spending
8 times the original ple_window in cycles (32k vs. the default 4k, i.e.
roughly 16 us vs. 2 us of spinning at ~2 GHz) significant for 16 vcpus?

>
>> I also got perf top output to analyse the difference. Difference comes
>> because of flushtlb (and also spinlock).
>
> That's in the guest, yes?

Yes. Perf is in guest.

>
>>
>> Ebizzy run for 4k ple_window
>> -  87.20%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        - 100.00% _raw_spin_unlock_irqrestore
>>           + 52.89% release_pages
>>           + 47.10% pagevec_lru_move_fn
>> -   5.71%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        + 86.03% default_send_IPI_mask_allbutself_phys
>>        + 13.96% default_send_IPI_mask_sequence_phys
>> -   3.10%  [kernel]  [k] smp_call_function_many
>>       smp_call_function_many
>>
>>
>> Ebizzy run for 32k ple_window
>>
>> -  91.40%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        - 100.00% _raw_spin_unlock_irqrestore
>>           + 53.13% release_pages
>>           + 46.86% pagevec_lru_move_fn
>> -   4.38%  [kernel]  [k] smp_call_function_many
>>       smp_call_function_many
>> -   2.51%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        + 90.76% default_send_IPI_mask_allbutself_phys
>>        + 9.24% default_send_IPI_mask_sequence_phys
>>
>
> Both the 4k and the 32k results are crazy.  Why is
> arch_local_irq_restore() so prominent?  Do you have a very high
> interrupt rate in the guest?

How do I measure whether I have a high interrupt rate in the guest?
From the /proc/interrupts numbers I am not able to judge :(

I went back and got the results on a 32 core machine with a 32 vcpu guest.
Strangely, I got results supporting the claim that increasing ple_window
helps in the non-overcommitted scenario.

32 core 32 vcpu guest 1x scenarios.

ple_gap = 0
kernbench: Elapsed Time 38.61
ebizzy: 7463 records/s

ple_window = 4k
kernbench: Elapsed Time 43.5067
ebizzy:    2528 records/s

ple_window = 32k
kernbench: Elapsed Time 39.4133
ebizzy: 7196 records/s


perf top for ebizzy for above:
ple_gap = 0
-  84.74%  [kernel]  [k] arch_local_irq_restore
    - arch_local_irq_restore
       - 100.00% _raw_spin_unlock_irqrestore
          + 50.96% release_pages
          + 49.02% pagevec_lru_move_fn
-   6.57%  [kernel]  [k] arch_local_irq_restore
    - arch_local_irq_restore
       + 92.54% default_send_IPI_mask_allbutself_phys
       + 7.46% default_send_IPI_mask_sequence_phys
-   1.54%  [kernel]  [k] smp_call_function_many
      smp_call_function_many

ple_window = 32k
-  84.47%  [kernel]  [k] arch_local_irq_restore
    + arch_local_irq_restore
-   6.46%  [kernel]  [k] arch_local_irq_restore
    - arch_local_irq_restore
       + 93.51% default_send_IPI_mask_allbutself_phys
       + 6.49% default_send_IPI_mask_sequence_phys
-   1.80%  [kernel]  [k] smp_call_function_many
    - smp_call_function_many
       + 99.98% native_flush_tlb_others


ple_window = 4k
-  91.35%  [kernel]  [k] arch_local_irq_restore
    - arch_local_irq_restore
       - 100.00% _raw_spin_unlock_irqrestore
          + 53.19% release_pages
          + 46.81% pagevec_lru_move_fn
-   3.90%  [kernel]  [k] smp_call_function_many
      smp_call_function_many
-   2.94%  [kernel]  [k] arch_local_irq_restore
    - arch_local_irq_restore
       + 93.12% default_send_IPI_mask_allbutself_phys
       + 6.88% default_send_IPI_mask_sequence_phys

Let me know if I can try something here..
/me confused :(


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-10-03 17:25                       ` Avi Kivity
@ 2012-10-04 10:56                         ` Raghavendra K T
  2012-10-04 12:44                           ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-04 10:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/03/2012 10:55 PM, Avi Kivity wrote:
> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
>>
>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>
>> [...]
>>>> 2) looking at the result (comparing A & C) , I do feel we have
>>>> significant in iterating over vcpus (when compared to even vmexit)
>>>> so We still would need undercommit fix sugested by PeterZ (improving by
>>>> 140%). ?
>>>
>>> Looking only at the current runqueue?  My worry is that it misses a lot
>>> of cases.  Maybe try the current runqueue first and then others.
>>>
>>
>> Okay. Do you mean we can have something like
>>
>> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>> +               yielded = -ESRCH;
>> +               goto out_irq;
>> +       }
>>
>> in the Peter's patch ?
>>
>> ( I thought lot about && or || . Both seem to have their own cons ).
>> But that should be only when we have short term imbalance, as PeterZ
>> told.
>
> I'm missing the context.  What is p_rq?

p_rq is the run queue of the target vcpu.
What I was trying below was to address Rik's concern. Suppose the
rq of the source vcpu has one task, but the target has two tasks,
with an eligible vcpu waiting to be scheduled.

>
> What I mean was:
>
>    if can_yield_to_process_in_current_rq
>       do that
>    else if can_yield_to_process_in_other_rq
>       do that
>    else
>       return -ESRCH

I think you are saying we should check the run queue of the
source vcpu: if we have a vcpu of the same VM there, try to yield to
that one, ignoring whatever target vcpu we received for yield_to.

Or is it that kvm_vcpu_yield_to should first check the vcpus of the
same VM belonging to the same run queue, and if we don't succeed, go
again for a vcpu on a different runqueue?
Does that add more overhead, especially in the <= 1x scenario?


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 10:49             ` Raghavendra K T
@ 2012-10-04 12:41               ` Avi Kivity
  2012-10-04 13:07                 ` Peter Zijlstra
                                   ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Avi Kivity @ 2012-10-04 12:41 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 10/04/2012 12:49 PM, Raghavendra K T wrote:
> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>
>>>
>>> Hi Avi,
>>>
>>> I ran different benchmarks increasing ple_window, and results does not
>>> seem to be encouraging for increasing ple_window.
>>
>> Thanks for testing! Comments below.
>>
>>> Results:
>>> 16 core PLE machine with 16 vcpu guest.
>>>
>>> base kernel = 3.6-rc5 + ple handler optimization patch
>>> base_pleopt_8k = base kernel + ple window = 8k
>>> base_pleopt_16k = base kernel + ple window = 16k
>>> base_pleopt_32k = base kernel + ple window = 32k
>>>
>>>
>>> Percentage improvements of benchmarks w.r.t base_pleopt with
>>> ple_window = 4096
>>>
>>>         base_pleopt_8k    base_pleopt_16k    base_pleopt_32k
>>> -----------------------------------------------------------------           
>>>
>>> kernbench_1x    -5.54915    -15.94529    -44.31562
>>> kernbench_2x    -7.89399    -17.75039    -37.73498
>>
>> So, 44% degradation even with no overcommit?  That's surprising.
> 
> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
> spending 8 times the original ple_window cycles for 16 vcpus
> significant?

A PLE exit when not overcommitted cannot do any good; it is better to
spin in the guest rather than look for candidates on the host.  In fact
when we benchmark we often disable PLE completely.

> 
>>
>>> I also got perf top output to analyse the difference. Difference comes
>>> because of flushtlb (and also spinlock).
>>
>> That's in the guest, yes?
> 
> Yes. Perf is in guest.
> 
>>
>>>
>>> Ebizzy run for 4k ple_window
>>> -  87.20%  [kernel]  [k] arch_local_irq_restore
>>>     - arch_local_irq_restore
>>>        - 100.00% _raw_spin_unlock_irqrestore
>>>           + 52.89% release_pages
>>>           + 47.10% pagevec_lru_move_fn
>>> -   5.71%  [kernel]  [k] arch_local_irq_restore
>>>     - arch_local_irq_restore
>>>        + 86.03% default_send_IPI_mask_allbutself_phys
>>>        + 13.96% default_send_IPI_mask_sequence_phys
>>> -   3.10%  [kernel]  [k] smp_call_function_many
>>>       smp_call_function_many
>>>
>>>
>>> Ebizzy run for 32k ple_window
>>>
>>> -  91.40%  [kernel]  [k] arch_local_irq_restore
>>>     - arch_local_irq_restore
>>>        - 100.00% _raw_spin_unlock_irqrestore
>>>           + 53.13% release_pages
>>>           + 46.86% pagevec_lru_move_fn
>>> -   4.38%  [kernel]  [k] smp_call_function_many
>>>       smp_call_function_many
>>> -   2.51%  [kernel]  [k] arch_local_irq_restore
>>>     - arch_local_irq_restore
>>>        + 90.76% default_send_IPI_mask_allbutself_phys
>>>        + 9.24% default_send_IPI_mask_sequence_phys
>>>
>>
>> Both the 4k and the 32k results are crazy.  Why is
>> arch_local_irq_restore() so prominent?  Do you have a very high
>> interrupt rate in the guest?
> 
> How to measure if I have high interrupt rate in guest?
> From /proc/interrupt numbers I am not able to judge :(

'vmstat 1'

> 
> I went back and got the results on a 32 core machine with 32 vcpu guest.
> Strangely, I got result supporting the claim that increasing ple_window
> helps for non-overcommitted scenario.
> 
> 32 core 32 vcpu guest 1x scenarios.
> 
> ple_gap = 0
> kernbench: Elapsed Time 38.61
> ebizzy: 7463 records/s
> 
> ple_window = 4k
> kernbench: Elapsed Time 43.5067
> ebizzy:    2528 records/s
> 
> ple_window = 32k
> kernebench : Elapsed Time 39.4133
> ebizzy: 7196 records/s

So maybe something was wrong with the first measurement.

> 
> 
> perf top for ebizzy for above:
> ple_gap = 0
> -  84.74%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       - 100.00% _raw_spin_unlock_irqrestore
>          + 50.96% release_pages
>          + 49.02% pagevec_lru_move_fn
> -   6.57%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       + 92.54% default_send_IPI_mask_allbutself_phys
>       + 7.46% default_send_IPI_mask_sequence_phys
> -   1.54%  [kernel]  [k] smp_call_function_many
>      smp_call_function_many

Again the numbers are ridiculously high for arch_local_irq_restore.
Maybe there's a bad perf/kvm interaction when we're injecting an
interrupt, I can't believe we're spending 84% of the time running the
popf instruction.

> 
> ple_window = 32k
> -  84.47%  [kernel]  [k] arch_local_irq_restore
>    + arch_local_irq_restore
> -   6.46%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       + 93.51% default_send_IPI_mask_allbutself_phys
>       + 6.49% default_send_IPI_mask_sequence_phys
> -   1.80%  [kernel]  [k] smp_call_function_many
>    - smp_call_function_many
>       + 99.98% native_flush_tlb_others
> 
> 
> ple_window = 4k
> -  91.35%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       - 100.00% _raw_spin_unlock_irqrestore
>          + 53.19% release_pages
>          + 46.81% pagevec_lru_move_fn
> -   3.90%  [kernel]  [k] smp_call_function_many
>      smp_call_function_many
> -   2.94%  [kernel]  [k] arch_local_irq_restore
>    - arch_local_irq_restore
>       + 93.12% default_send_IPI_mask_allbutself_phys
>       + 6.88% default_send_IPI_mask_sequence_phys
> 
> Let me know if I can try something here..
> /me confused :(
> 

I'm even more confused.  Please try 'perf kvm' from the host, it does
fewer dirty tricks with the PMU and so may be more accurate.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-10-04 10:56                         ` Raghavendra K T
@ 2012-10-04 12:44                           ` Avi Kivity
  2012-10-05  9:04                             ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-10-04 12:44 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/04/2012 12:56 PM, Raghavendra K T wrote:
> On 10/03/2012 10:55 PM, Avi Kivity wrote:
>> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>>> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
>>>
>>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>>
>>> [...]
>>>>> 2) looking at the result (comparing A & C) , I do feel we have
>>>>> significant in iterating over vcpus (when compared to even vmexit)
>>>>> so We still would need undercommit fix sugested by PeterZ
>>>>> (improving by
>>>>> 140%). ?
>>>>
>>>> Looking only at the current runqueue?  My worry is that it misses a lot
>>>> of cases.  Maybe try the current runqueue first and then others.
>>>>
>>>
>>> Okay. Do you mean we can have something like
>>>
>>> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>> +               yielded = -ESRCH;
>>> +               goto out_irq;
>>> +       }
>>>
>>> in the Peter's patch ?
>>>
>>> ( I thought lot about && or || . Both seem to have their own cons ).
>>> But that should be only when we have short term imbalance, as PeterZ
>>> told.
>>
>> I'm missing the context.  What is p_rq?
> 
> p_rq is the run queue of target vcpu.
> What I was trying below was to address Rik concern. Suppose
> rq of source vcpu has one task, but target probably has two task,
> with a eligible vcpu waiting to be scheduled.
> 
>>
>> What I mean was:
>>
>>    if can_yield_to_process_in_current_rq
>>       do that
>>    else if can_yield_to_process_in_other_rq
>>       do that
>>    else
>>       return -ESRCH
> 
> I think you are saying we have to check the run queue of the
> source vcpu, if we have a vcpu belonging to same VM and try yield to
> that? ignoring whatever the target vcpu we received for yield_to.
> 
> Or is it that kvm_vcpu_yield_to should now check the vcpus of same vm
> belonging to same run queue first. If we don't succeed, go again for
> a vcpu in different runqueue.

Right.  Prioritize vcpus that are cheap to yield to.  But may return bad
results if all vcpus on the current runqueue are spinners, so probably
not a good idea.

> Does it add more overhead especially in <= 1x scenario?

The current runqueue should have just our vcpu in that case, so low
overhead.  But it's a bad idea due to the above scenario.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 12:41               ` Avi Kivity
@ 2012-10-04 13:07                 ` Peter Zijlstra
  2012-10-04 15:00                   ` Avi Kivity
  2012-10-04 14:41                 ` Andrew Theurer
  2012-10-05  9:02                 ` Raghavendra K T
  2 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2012-10-04 13:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> 
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt, I can't believe we're spending 84% of the time running the
> popf instruction. 

Smells like a software fallback that doesn't do NMI; hrtimer-based
sampling typically hits popf, where we re-enable interrupts.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 12:41               ` Avi Kivity
  2012-10-04 13:07                 ` Peter Zijlstra
@ 2012-10-04 14:41                 ` Andrew Theurer
  2012-10-05  9:06                   ` Raghavendra K T
  2012-10-05  9:02                 ` Raghavendra K T
  2 siblings, 1 reply; 126+ messages in thread
From: Andrew Theurer @ 2012-10-04 14:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, Rik van Riel, Peter Zijlstra, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
> > On 10/03/2012 10:35 PM, Avi Kivity wrote:
> >> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
> >>>> So I think it's worth trying again with ple_window of 20000-40000.
> >>>>
> >>>
> >>> Hi Avi,
> >>>
> >>> I ran different benchmarks increasing ple_window, and results does not
> >>> seem to be encouraging for increasing ple_window.
> >>
> >> Thanks for testing! Comments below.
> >>
> >>> Results:
> >>> 16 core PLE machine with 16 vcpu guest.
> >>>
> >>> base kernel = 3.6-rc5 + ple handler optimization patch
> >>> base_pleopt_8k = base kernel + ple window = 8k
> >>> base_pleopt_16k = base kernel + ple window = 16k
> >>> base_pleopt_32k = base kernel + ple window = 32k
> >>>
> >>>
> >>> Percentage improvements of benchmarks w.r.t base_pleopt with
> >>> ple_window = 4096
> >>>
> >>>         base_pleopt_8k    base_pleopt_16k    base_pleopt_32k
> >>> -----------------------------------------------------------------           
> >>>
> >>> kernbench_1x    -5.54915    -15.94529    -44.31562
> >>> kernbench_2x    -7.89399    -17.75039    -37.73498
> >>
> >> So, 44% degradation even with no overcommit?  That's surprising.
> > 
> > Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
> > spending 8 times the original ple_window cycles for 16 vcpus
> > significant?
> 
> A PLE exit when not overcommitted cannot do any good, it is better to
> spin in the guest rather that look for candidates on the host.  In fact
> when we benchmark we often disable PLE completely.

Agreed.  However, I really do not understand why the kernbench regressed
with bigger ple_window.  It should stay the same or improve.  Raghu, do
you have perf data for the kernbench runs?
> 
> > 
> >>
> >>> I also got perf top output to analyse the difference. Difference comes
> >>> because of flushtlb (and also spinlock).
> >>
> >> That's in the guest, yes?
> > 
> > Yes. Perf is in guest.
> > 
> >>
> >>>
> >>> Ebizzy run for 4k ple_window
> >>> -  87.20%  [kernel]  [k] arch_local_irq_restore
> >>>     - arch_local_irq_restore
> >>>        - 100.00% _raw_spin_unlock_irqrestore
> >>>           + 52.89% release_pages
> >>>           + 47.10% pagevec_lru_move_fn
> >>> -   5.71%  [kernel]  [k] arch_local_irq_restore
> >>>     - arch_local_irq_restore
> >>>        + 86.03% default_send_IPI_mask_allbutself_phys
> >>>        + 13.96% default_send_IPI_mask_sequence_phys
> >>> -   3.10%  [kernel]  [k] smp_call_function_many
> >>>       smp_call_function_many
> >>>
> >>>
> >>> Ebizzy run for 32k ple_window
> >>>
> >>> -  91.40%  [kernel]  [k] arch_local_irq_restore
> >>>     - arch_local_irq_restore
> >>>        - 100.00% _raw_spin_unlock_irqrestore
> >>>           + 53.13% release_pages
> >>>           + 46.86% pagevec_lru_move_fn
> >>> -   4.38%  [kernel]  [k] smp_call_function_many
> >>>       smp_call_function_many
> >>> -   2.51%  [kernel]  [k] arch_local_irq_restore
> >>>     - arch_local_irq_restore
> >>>        + 90.76% default_send_IPI_mask_allbutself_phys
> >>>        + 9.24% default_send_IPI_mask_sequence_phys
> >>>
> >>
> >> Both the 4k and the 32k results are crazy.  Why is
> >> arch_local_irq_restore() so prominent?  Do you have a very high
> >> interrupt rate in the guest?
> > 
> > How to measure if I have high interrupt rate in guest?
> > From /proc/interrupt numbers I am not able to judge :(
> 
> 'vmstat 1'
> 
> > 
> > I went back and got the results on a 32 core machine with 32 vcpu guest.
> > Strangely, I got result supporting the claim that increasing ple_window
> > helps for non-overcommitted scenario.
> > 
> > 32 core 32 vcpu guest 1x scenarios.
> > 
> > ple_gap = 0
> > kernbench: Elapsed Time 38.61
> > ebizzy: 7463 records/s
> > 
> > ple_window = 4k
> > kernbench: Elapsed Time 43.5067
> > ebizzy:    2528 records/s
> > 
> > ple_window = 32k
> > kernebench : Elapsed Time 39.4133
> > ebizzy: 7196 records/s
> 
> So maybe something was wrong with the first measurement.

OK, this is more in line with what I expected for kernbench.  FWIW, in
order to show an improvement for a larger ple_window, we really need a
workload which we know has a longer lock holding time (without factoring
in LHP, i.e. lock-holder preemption).  We have noticed this mostly on IO
based locks.  We saw it with
a massive disk IO test (qla2xxx lock), and also with a large web serving
test (some vfs related lock, but I forget what exactly it was).
> 
> > 
> > 
> > perf top for ebizzy for above:
> > ple_gap = 0
> > -  84.74%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       - 100.00% _raw_spin_unlock_irqrestore
> >          + 50.96% release_pages
> >          + 49.02% pagevec_lru_move_fn
> > -   6.57%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       + 92.54% default_send_IPI_mask_allbutself_phys
> >       + 7.46% default_send_IPI_mask_sequence_phys
> > -   1.54%  [kernel]  [k] smp_call_function_many
> >      smp_call_function_many
> 
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt, I can't believe we're spending 84% of the time running the
> popf instruction.

I do have a feeling that ebizzy just has too many variables and LHP is
just one of many problems.  However, I am curious what perf kvm from
the host shows, as Avi suggested below.
> 
> > 
> > ple_window = 32k
> > -  84.47%  [kernel]  [k] arch_local_irq_restore
> >    + arch_local_irq_restore
> > -   6.46%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       + 93.51% default_send_IPI_mask_allbutself_phys
> >       + 6.49% default_send_IPI_mask_sequence_phys
> > -   1.80%  [kernel]  [k] smp_call_function_many
> >    - smp_call_function_many
> >       + 99.98% native_flush_tlb_others
> > 
> > 
> > ple_window = 4k
> > -  91.35%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       - 100.00% _raw_spin_unlock_irqrestore
> >          + 53.19% release_pages
> >          + 46.81% pagevec_lru_move_fn
> > -   3.90%  [kernel]  [k] smp_call_function_many
> >      smp_call_function_many
> > -   2.94%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       + 93.12% default_send_IPI_mask_allbutself_phys
> >       + 6.88% default_send_IPI_mask_sequence_phys
> > 
> > Let me know if I can try something here..
> > /me confused :(
> > 
> 
> I'm even more confused.  Please try 'perf kvm' from the host, it does
> fewer dirty tricks with the PMU and so may be more accurate.
> 



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 13:07                 ` Peter Zijlstra
@ 2012-10-04 15:00                   ` Avi Kivity
  2012-10-09 18:51                     ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-10-04 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raghavendra K T, Rik van Riel, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>> 
>> Again the numbers are ridiculously high for arch_local_irq_restore.
>> Maybe there's a bad perf/kvm interaction when we're injecting an
>> interrupt, I can't believe we're spending 84% of the time running the
>> popf instruction. 
> 
> Smells like a software fallback that doesn't do NMI, hrtimer based
> sampling typically hits popf where we re-enable interrupts.

Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
host will expose it (and it is a good idea anyway, to get the best
performance).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04  7:29                                 ` Gleb Natapov
@ 2012-10-05  8:36                                   ` Raghavendra K T
  2012-10-07  9:51                                     ` Avi Kivity
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-05  8:36 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Avi Kivity, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 10/04/2012 12:59 PM, Gleb Natapov wrote:
> On Wed, Oct 03, 2012 at 04:56:57PM +0200, Avi Kivity wrote:
>> On 10/03/2012 04:17 PM, Raghavendra K T wrote:
>>> * Avi Kivity <avi@redhat.com> [2012-09-30 13:13:09]:
>>>
>>>> On 09/30/2012 01:07 PM, Gleb Natapov wrote:
>>>>> On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
>>>>>> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>      +struct pv_sched_info {
>>>>>>>>      +       unsigned long   sched_bitmap;
>>>>>>>
>>>>>>> Thinking, whether we need something similar to cpumask here?
>>>>>>> Only thing is we are representing guest (v)cpumask.
>>>>>>>
>>>>>>
>>>>>> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
>>>>>>
>>>>> vcpu_id can be greater than KVM_MAX_VCPUS.
>>>>
>>>> Use the index into the vcpu table as the bitmap index then.  In fact
>>>> it's better because then the lookup to get the vcpu pointer is trivial.
>>>
>>> Did you mean, while setting the bitmap,
>>>
>>> we should do
>>> for (i = 1..n)
>>> if (kvm->vcpus[i] == vcpu) set ith position in bitmap?
>>
>> You can store i in the vcpu itself:
>>
>>    set_bit(vcpu->index, &kvm->preempted);
>>
> This will make the fact that vcpus are stored in an array into API
> instead of implementation detail :( There were patches for vcpu
> destruction that changed the array to rculist. Well, it will be still
> possible to make the array rcu protected and copy it every time vcpu is
> deleted/added I guess.
>

If I understand correctly, the summary is that we are going with:
- let the vcpu array be RCU protected.
- store the index inside the vcpu, updating it when a vcpu is
added/deleted.
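
Something like this hedged sketch for the lookup side, assuming
kvm->vcpus becomes an RCU-managed pointer to the array (purely
illustrative, none of this is in the posted patches):

	rcu_read_lock();
	vcpus = rcu_dereference(kvm->vcpus);
	vcpu  = vcpus[idx];	/* idx stored in the vcpu at add time */
	kvm_vcpu_kick(vcpu);	/* use it inside the read-side section */
	rcu_read_unlock();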


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 12:41               ` Avi Kivity
  2012-10-04 13:07                 ` Peter Zijlstra
  2012-10-04 14:41                 ` Andrew Theurer
@ 2012-10-05  9:02                 ` Raghavendra K T
  2 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-05  9:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Peter Zijlstra, H. Peter Anvin, Ingo Molnar,
	Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 10/04/2012 06:11 PM, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
>> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>>
>>>>
>>>> Hi Avi,
>>>>
>>>> I ran different benchmarks increasing ple_window, and results does not
>>>> seem to be encouraging for increasing ple_window.
>>>
>>> Thanks for testing! Comments below.
>>>
>>>> Results:
>>>> 16 core PLE machine with 16 vcpu guest.
>>>>
>>>> base kernel = 3.6-rc5 + ple handler optimization patch
>>>> base_pleopt_8k = base kernel + ple window = 8k
>>>> base_pleopt_16k = base kernel + ple window = 16k
>>>> base_pleopt_32k = base kernel + ple window = 32k
>>>>
>>>>
>>>> Percentage improvements of benchmarks w.r.t base_pleopt with
>>>> ple_window = 4096
>>>>
>>>>          base_pleopt_8k    base_pleopt_16k    base_pleopt_32k
>>>> -----------------------------------------------------------------
>>>>
>>>> kernbench_1x    -5.54915    -15.94529    -44.31562
>>>> kernbench_2x    -7.89399    -17.75039    -37.73498
>>>
>>> So, 44% degradation even with no overcommit?  That's surprising.
>>
>> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
>> spending 8 times the original ple_window cycles for 16 vcpus
>> significant?
>
> A PLE exit when not overcommitted cannot do any good, it is better to
> spin in the guest rather that look for candidates on the host.  In fact
> when we benchmark we often disable PLE completely.
>
>>
>>>
>>>> I also got perf top output to analyse the difference. Difference comes
>>>> because of flushtlb (and also spinlock).
>>>
>>> That's in the guest, yes?
>>
>> Yes. Perf is in guest.
>>
>>>
>>>>
>>>> Ebizzy run for 4k ple_window
>>>> -  87.20%  [kernel]  [k] arch_local_irq_restore
>>>>      - arch_local_irq_restore
>>>>         - 100.00% _raw_spin_unlock_irqrestore
>>>>            + 52.89% release_pages
>>>>            + 47.10% pagevec_lru_move_fn
>>>> -   5.71%  [kernel]  [k] arch_local_irq_restore
>>>>      - arch_local_irq_restore
>>>>         + 86.03% default_send_IPI_mask_allbutself_phys
>>>>         + 13.96% default_send_IPI_mask_sequence_phys
>>>> -   3.10%  [kernel]  [k] smp_call_function_many
>>>>        smp_call_function_many
>>>>
>>>>
>>>> Ebizzy run for 32k ple_window
>>>>
>>>> -  91.40%  [kernel]  [k] arch_local_irq_restore
>>>>      - arch_local_irq_restore
>>>>         - 100.00% _raw_spin_unlock_irqrestore
>>>>            + 53.13% release_pages
>>>>            + 46.86% pagevec_lru_move_fn
>>>> -   4.38%  [kernel]  [k] smp_call_function_many
>>>>        smp_call_function_many
>>>> -   2.51%  [kernel]  [k] arch_local_irq_restore
>>>>      - arch_local_irq_restore
>>>>         + 90.76% default_send_IPI_mask_allbutself_phys
>>>>         + 9.24% default_send_IPI_mask_sequence_phys
>>>>
>>>
>>> Both the 4k and the 32k results are crazy.  Why is
>>> arch_local_irq_restore() so prominent?  Do you have a very high
>>> interrupt rate in the guest?
>>
>> How to measure if I have high interrupt rate in guest?
>>  From /proc/interrupt numbers I am not able to judge :(
>
> 'vmstat 1'
>

Thank you, I'll save this. Apart from in/cs, I think r (the number of
processes waiting for run time) would also be useful for me in vmstat.

>>
>> I went back and got the results on a 32 core machine with 32 vcpu guest.
>> Strangely, I got result supporting the claim that increasing ple_window
>> helps for non-overcommitted scenario.
>>
>> 32 core 32 vcpu guest 1x scenarios.
>>
>> ple_gap = 0
>> kernbench: Elapsed Time 38.61
>> ebizzy: 7463 records/s
>>
>> ple_window = 4k
>> kernbench: Elapsed Time 43.5067
>> ebizzy:    2528 records/s
>>
>> ple_window = 32k
>> kernebench : Elapsed Time 39.4133
>> ebizzy: 7196 records/s
>
> So maybe something was wrong with the first measurement.

Maybe I was not clear. The first run was on an x240 (SandyBridge)
16 core cpu; then I ran on a 32 core x3850 to confirm the perf top
results. But yes, both had the

[    0.018997] Performance Events: Broken PMU hardware detected, using 
software events only.

problem, as rightly pointed out by you and PeterZ.

After adding -cpu host, I see that it is fixed on the x240:

[    0.017997] Performance Events: 16-deep LBR, SandyBridge events, 
Intel PMU driver.
[    0.018868] NMI watchdog: enabled on all CPUs, permanently consumes 
one hw-PMU counter.

So I'll try it on the x240 again.

(Somehow -cpu host on the mx3850 resulted in
[    0.026995] Performance Events: unsupported p6 CPU model 26 no PMU 
driver, software events only.
I think qemu needs a fix, as pointed out in
http://www.mail-archive.com/kvm@vger.kernel.org/msg55836.html )

>
>>
>>
>> perf top for ebizzy for above:
>> ple_gap = 0
>> -  84.74%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        - 100.00% _raw_spin_unlock_irqrestore
>>           + 50.96% release_pages
>>           + 49.02% pagevec_lru_move_fn
>> -   6.57%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        + 92.54% default_send_IPI_mask_allbutself_phys
>>        + 7.46% default_send_IPI_mask_sequence_phys
>> -   1.54%  [kernel]  [k] smp_call_function_many
>>       smp_call_function_many
>
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt, I can't believe we're spending 84% of the time running the
> popf instruction.
>
>>
>> ple_window = 32k
>> -  84.47%  [kernel]  [k] arch_local_irq_restore
>>     + arch_local_irq_restore
>> -   6.46%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        + 93.51% default_send_IPI_mask_allbutself_phys
>>        + 6.49% default_send_IPI_mask_sequence_phys
>> -   1.80%  [kernel]  [k] smp_call_function_many
>>     - smp_call_function_many
>>        + 99.98% native_flush_tlb_others
>>
>>
>> ple_window = 4k
>> -  91.35%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        - 100.00% _raw_spin_unlock_irqrestore
>>           + 53.19% release_pages
>>           + 46.81% pagevec_lru_move_fn
>> -   3.90%  [kernel]  [k] smp_call_function_many
>>       smp_call_function_many
>> -   2.94%  [kernel]  [k] arch_local_irq_restore
>>     - arch_local_irq_restore
>>        + 93.12% default_send_IPI_mask_allbutself_phys
>>        + 6.88% default_send_IPI_mask_sequence_phys
>>
>> Let me know if I can try something here..
>> /me confused :(
>>
>
> I'm even more confused.  Please try 'perf kvm' from the host, it does
> fewer dirty tricks with the PMU and so may be more accurate.
>

I will try perf kvm from the host this time.



* Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
  2012-10-04 12:44                           ` Avi Kivity
@ 2012-10-05  9:04                             ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-05  9:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Ingo Molnar,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	chegu vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/04/2012 06:14 PM, Avi Kivity wrote:
> On 10/04/2012 12:56 PM, Raghavendra K T wrote:
>> On 10/03/2012 10:55 PM, Avi Kivity wrote:
>>> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>>>> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
>>>>
>>>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>>>
>>>> [...]
>>>>>> 2) Looking at the result (comparing A & C), I do feel we have
>>>>>> significant overhead in iterating over vcpus (when compared to
>>>>>> even vmexit), so we would still need the undercommit fix
>>>>>> suggested by PeterZ (improving by 140%)?
>>>>>
>>>>> Looking only at the current runqueue?  My worry is that it misses a lot
>>>>> of cases.  Maybe try the current runqueue first and then others.
>>>>>
>>>>
>>>> Okay. Do you mean we can have something like
>>>>
>>>> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +               yielded = -ESRCH;
>>>> +               goto out_irq;
>>>> +       }
>>>>
>>>> in the Peter's patch ?
>>>>
>>>> (I thought a lot about && vs ||; both seem to have their own cons.)
>>>> But that should apply only when we have a short-term imbalance, as
>>>> PeterZ said.
>>>
>>> I'm missing the context.  What is p_rq?
>>
>> p_rq is the run queue of the target vcpu.
>> What I was trying with that check was to address Rik's concern:
>> suppose the rq of the source vcpu has one task, but the target has
>> two tasks, with an eligible vcpu waiting to be scheduled.
>>
>>>
>>> What I mean was:
>>>
>>>     if can_yield_to_process_in_current_rq
>>>        do that
>>>     else if can_yield_to_process_in_other_rq
>>>        do that
>>>     else
>>>        return -ESRCH
>>
>> I think you are saying we should check the run queue of the source
>> vcpu, and if we have a vcpu belonging to the same VM there, try to
>> yield to that, ignoring whatever target vcpu we received for yield_to?
>>
>> Or is it that kvm_vcpu_yield_to should first check the vcpus of the
>> same VM on the same run queue, and only if we don't succeed go for a
>> vcpu on a different runqueue?
>
> Right.  Prioritize vcpus that are cheap to yield to.  But it may return
> bad results if all vcpus on the current runqueue are spinners, so it is
> probably not a good idea.

Okay, I'll drop the vcpu-from-same-rq idea now.

>
>> Does it add more overhead especially in <= 1x scenario?
>
> The current runqueue should have just our vcpu in that case, so low
> overhead.  But it's a bad idea due to the above scenario.
>
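
For reference, here is roughly where the check quoted above would sit.
This is only an illustrative sketch of that yield_to() change, with the
surrounding mainline code paraphrased from memory rather than taken
from Peter's actual patch:

/* kernel/sched/core.c -- illustrative sketch only.  The idea: bail out
 * of the directed yield early when neither the source nor the target
 * runqueue has more than one task, i.e. the undercommit case where
 * yielding cannot help anyone.
 */
int yield_to(struct task_struct *p, bool preempt)
{
	struct rq *rq, *p_rq;
	unsigned long flags;
	int yielded = 0;

	local_irq_save(flags);
	rq = this_rq();
	p_rq = task_rq(p);

	/* both runqueues hold a single (current) task: nothing to gain */
	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
		yielded = -ESRCH;
		goto out_irq;
	}

	/* ... double_rq_lock(), the eligibility checks and the actual
	 * sched_class->yield_to_task() call follow here, as in mainline ...
	 */

out_irq:
	local_irq_restore(flags);
	return yielded;
}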



* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 14:41                 ` Andrew Theurer
@ 2012-10-05  9:06                   ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-05  9:06 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Rik van Riel, Peter Zijlstra, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov

On 10/04/2012 08:11 PM, Andrew Theurer wrote:
> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
>>> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>>>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>>>
>>>>>
>>>>> Hi Avi,
>>>>>
>>>>> I ran different benchmarks increasing ple_window, and results does not
>>>>> seem to be encouraging for increasing ple_window.
>>>>
>>>> Thanks for testing! Comments below.
>>>>
>>>>> Results:
>>>>> 16 core PLE machine with 16 vcpu guest.
>>>>>
>>>>> base kernel = 3.6-rc5 + ple handler optimization patch
>>>>> base_pleopt_8k = base kernel + ple window = 8k
>>>>> base_pleopt_16k = base kernel + ple window = 16k
>>>>> base_pleopt_32k = base kernel + ple window = 32k
>>>>>
>>>>>
>>>>> Percentage improvements of benchmarks w.r.t base_pleopt with
>>>>> ple_window = 4096
>>>>>
>>>>>          base_pleopt_8k    base_pleopt_16k    base_pleopt_32k
>>>>> -----------------------------------------------------------------
>>>>>
>>>>> kernbench_1x    -5.54915    -15.94529    -44.31562
>>>>> kernbench_2x    -7.89399    -17.75039    -37.73498
>>>>
>>>> So, 44% degradation even with no overcommit?  That's surprising.
>>>
>>> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
>>> spending 8 times the original ple_window cycles for 16 vcpus
>>> significant?
>>
>> A PLE exit when not overcommitted cannot do any good, it is better to
>> spin in the guest rather than look for candidates on the host.  In fact
>> when we benchmark we often disable PLE completely.
>
> Agreed.  However, I really do not understand why the kernbench regressed
> with bigger ple_window.  It should stay the same or improve.  Raghu, do
> you have perf data for the kernbench runs?

Andrew, no. I'll get this with perf kvm.




* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-05  8:36                                   ` Raghavendra K T
@ 2012-10-07  9:51                                     ` Avi Kivity
  0 siblings, 0 replies; 126+ messages in thread
From: Avi Kivity @ 2012-10-07  9:51 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Gleb Natapov, Jiannan Ouyang, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, chegu vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 10/05/2012 10:36 AM, Raghavendra K T wrote:
>>>
>>> You can store i in the vcpu itself:
>>>
>>>    set_bit(vcpu->index, &kvm->preempted);
>>>
>> This will make the fact that vcpus are stored in an array into API
>> instead of an implementation detail :( There were patches for vcpu
>> destruction that changed the array to an rculist. Well, it will still
>> be possible to make the array rcu-protected and copy it every time a
>> vcpu is deleted/added, I guess.
>>
> 
> If I understand correctly, the summary is that we are going with
> - Let the vcpu array be rcu protected.

That's for the future.  For now ->vcpus[] is statically allocated.

> - We use an index inside the vcpu, which should be updated when a
> vcpu is added/deleted.

Yes.
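
Sketched out, the idea would look something like this (illustrative
only; kvm->preempted and vcpu->index are placeholders from this thread,
not existing fields):

/* Track preempted vcpus in a bitmap in struct kvm, indexed by a
 * per-vcpu index that must be kept valid across vcpu add/delete.
 */
struct kvm {
	/* ... existing members ... */
	DECLARE_BITMAP(preempted, KVM_MAX_VCPUS);
};

static void kvm_vcpu_sched_out(struct kvm_vcpu *vcpu)
{
	set_bit(vcpu->index, vcpu->kvm->preempted);
}

static void kvm_vcpu_sched_in(struct kvm_vcpu *vcpu)
{
	clear_bit(vcpu->index, vcpu->kvm->preempted);
}

/* The PLE handler could then skip yield candidates that are not
 * preempted instead of blindly iterating over ->vcpus[].
 */
static bool vcpu_is_preempted(struct kvm_vcpu *vcpu)
{
	return test_bit(vcpu->index, vcpu->kvm->preempted);
}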


-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-04 15:00                   ` Avi Kivity
@ 2012-10-09 18:51                     ` Raghavendra K T
  2012-10-10  2:59                       ` Andrew Theurer
                                         ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-09 18:51 UTC (permalink / raw)
  To: Avi Kivity, Andrew M. Theurer
  Cc: Peter Zijlstra, Raghavendra K T, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

* Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:

> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >> 
> >> Again the numbers are ridiculously high for arch_local_irq_restore.
> >> Maybe there's a bad perf/kvm interaction when we're injecting an
> >> interrupt, I can't believe we're spending 84% of the time running the
> >> popf instruction. 
> > 
> > Smells like a software fallback that doesn't do NMI, hrtimer based
> > sampling typically hits popf where we re-enable interrupts.
> 
> Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
> is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
> host will expose it (and a good idea anyway to get best performance).
> 

Hi Avi, you are right: the SandyBridge machine result was not proper.
I cleaned up the services, enabled the PMU, and re-ran all the tests.

Here is the summary:
We do get a good benefit by increasing the ple window. Though we don't
see much benefit for kernbench and sysbench, for ebizzy we get a huge
improvement in the 1x scenario (almost 2/3rd of the PLE-disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.

I am experimenting with the V2 version of this undercommit improvement
patch series, but I think if you wish to go for an increase of the
default ple_window, then we would have to measure the benefit of the
patches with ple_window = 16k.

I can respin the whole series including this default ple_window change.
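
For what it's worth, the change itself would just be against the PLE
defaults in arch/x86/kvm/vmx.c. A sketch (the context lines here are
reproduced from memory of the 3.6 source, so treat them as approximate):

/* arch/x86/kvm/vmx.c -- sketch of the proposed default change only */
#define KVM_VMX_DEFAULT_PLE_GAP		128
#define KVM_VMX_DEFAULT_PLE_WINDOW	16384	/* was 4096 */

static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
module_param(ple_gap, int, S_IRUGO);

static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
module_param(ple_window, int, S_IRUGO);

Until then, the same effect can be had at module load time via the
kvm-intel ple_window parameter.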

I also have the perf kvm top results for both ebizzy and kernbench.
I think they are along expected lines now.

Improvements
================

16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
---------------+---------------+-----------------+-------------------+
               |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
---------------+---------------+-----------------+-------------------+
kernbench_1x   |  0.42371      |  1.15164        |   0.09320         |
kernbench_2x   | -1.40981      | -17.48282       |  -570.77053       |
---------------+---------------+-----------------+-------------------+
sysbench_1x    | -0.92367      | 0.24241         | -0.27027          |
sysbench_2x    | -2.22706      |-0.30896         | -1.27573          |
sysbench_3x    | -0.75509      | 0.09444         | -2.97756          |
---------------+---------------+-----------------+-------------------+
ebizzy_1x      | 54.99976      | 67.29460        |  74.14076         |
ebizzy_2x      | -8.83386      |-27.38403        | -96.22066         |
---------------+---------------+-----------------+-------------------+

perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 
========================================================================
pleopt   ple_gap=0
--------------------
ebizzy : 18131 records/s
63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    5.65%  [guest.kernel]  [g] smp_call_function_many
    3.12%  [guest.kernel]  [g] clear_page
    3.02%  [guest.kernel]  [g] down_read_trylock
    1.85%  [guest.kernel]  [g] async_page_fault
    1.81%  [guest.kernel]  [g] up_read
    1.76%  [guest.kernel]  [g] native_apic_mem_write
    1.70%  [guest.kernel]  [g] find_vma

kernbench: Elapsed Time 29.4933 (27.6007)
   5.72%  [guest.kernel]  [g] async_page_fault
    3.48%  [guest.kernel]  [g] pvclock_clocksource_read
    2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
    2.58%  [guest.kernel]  [g] clear_page
    2.09%  [guest.kernel]  [g] page_cache_get_speculative
    2.00%  [guest.kernel]  [g] do_raw_spin_lock
    1.78%  [guest.kernel]  [g] unmap_single_vma
    1.74%  [guest.kernel]  [g] kmem_cache_alloc

pleopt ple_window = 4k
---------------------------
ebizzy: 10176 records/s
   69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    3.34%  [guest.kernel]  [g] clear_page
    2.16%  [guest.kernel]  [g] down_read_trylock
    1.94%  [guest.kernel]  [g] async_page_fault
    1.89%  [guest.kernel]  [g] native_apic_mem_write
    1.63%  [guest.kernel]  [g] smp_call_function_many
    1.58%  [guest.kernel]  [g] SetPageLRU
    1.37%  [guest.kernel]  [g] up_read
    1.01%  [guest.kernel]  [g] find_vma


kernbench: 29.9533
    6.04%  [guest.kernel]  [g] async_page_fault
    4.17%  [guest.kernel]  [g] pvclock_clocksource_read
    3.28%  [guest.kernel]  [g] clear_page
    2.57%  [guest.kernel]  [g] copy_user_generic_unrolled
    2.30%  [guest.kernel]  [g] do_raw_spin_lock
    2.13%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    1.93%  [guest.kernel]  [g] page_cache_get_speculative
    1.92%  [guest.kernel]  [g] unmap_single_vma
    1.77%  [guest.kernel]  [g] kmem_cache_alloc
    1.61%  [guest.kernel]  [g] __d_lookup_rcu
    1.19%  [guest.kernel]  [g] find_vma
    1.19%  [guest.kernel]  [g] __list_del_entry


pleopt: ple_window=16k
-------------------------
ebizzy: 16990
 62.35%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    5.22%  [guest.kernel]  [g] smp_call_function_many
    3.57%  [guest.kernel]  [g] down_read_trylock
    3.20%  [guest.kernel]  [g] clear_page
    2.16%  [guest.kernel]  [g] up_read
    1.89%  [guest.kernel]  [g] find_vma
    1.86%  [guest.kernel]  [g] async_page_fault
    1.81%  [guest.kernel]  [g] native_apic_mem_write

kernbench: 28.5
 6.24%  [guest.kernel]  [g] async_page_fault
    4.16%  [guest.kernel]  [g] pvclock_clocksource_read
    3.33%  [guest.kernel]  [g] clear_page
    2.50%  [guest.kernel]  [g] copy_user_generic_unrolled
    2.08%  [guest.kernel]  [g] do_raw_spin_lock
    1.98%  [guest.kernel]  [g] unmap_single_vma
    1.89%  [guest.kernel]  [g] kmem_cache_alloc
    1.82%  [guest.kernel]  [g] page_cache_get_speculative
    1.46%  [guest.kernel]  [g] __d_lookup_rcu
    1.42%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    1.15%  [guest.kernel]  [g] __list_del_entry
    1.10%  [guest.kernel]  [g] find_vma



Detailed result for the run
=============================
patched = base_pleopt_16k 
+-----------+-----------+-----------+------------+-----------+
                              kernbench 
+-----------+-----------+-----------+------------+-----------+
   base        stddev       patched    stdev        %improve     
+-----------+-----------+-----------+------------+-----------+
1x    30.0440     1.1896    29.9167     1.6755	   0.42371
2x    62.0083     3.4884    62.8825     2.5509	  -1.40981
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              sysbench 
+-----------+-----------+-----------+------------+-----------+
1x     7.1779     0.0577     7.2442     0.0479	  -0.92367
2x    15.5362     0.3370    15.8822     0.3591	  -2.22706
3x    23.8249     0.1513    24.0048     0.1844	  -0.75509
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              ebizzy 
+-----------+-----------+-----------+------------+-----------+
1x 10358.0000   442.6598   16054.8750  252.5088    54.99976
2x  2705.5000   130.0286   2466.5000   120.0024	  -8.83386
+-----------+-----------+-----------+------------+-----------+

patched = base_pleopt_32k
+-----------+-----------+-----------+------------+-----------+
                              kernbench 
+-----------+-----------+-----------+------------+-----------+
   base        stddev       patched    stdev        %improve     
+-----------+-----------+-----------+------------+-----------+
1x    30.0440     1.1896    29.6980     0.6760	   1.15164
2x    62.0083     3.4884    72.8491     4.4616	 -17.48282
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              sysbench 
+-----------+-----------+-----------+------------+-----------+
1x     7.1779     0.0577     7.1605     0.0447	   0.24241
2x    15.5362     0.3370    15.5842     0.1731	  -0.30896
3x    23.8249     0.1513    23.8024     0.2342	   0.09444
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              ebizzy 
+-----------+-----------+-----------+------------+-----------+
1x  10358.0000   442.6598   17328.3750   281.4569   67.29460
2x  2705.5000   130.0286    1964.6250   143.0793   -27.38403
+-----------+-----------+-----------+------------+-----------+

patched = base_pleopt_nople
+-----------+-----------+-----------+------------+-----------+
                              kernbench 
+-----------+-----------+-----------+------------+-----------+
   base        stddev       patched    stdev        %improve     
+-----------+-----------+-----------+------------+-----------+
1x    30.0440     1.1896    30.0160     0.7523	   0.09320
2x    62.0083     3.4884   415.9334   189.9901	  -570.77053
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              sysbench 
+-----------+-----------+-----------+------------+-----------+
1x     7.1779     0.0577     7.1973     0.0354	  -0.27027
2x    15.5362     0.3370    15.7344     0.2315	  -1.27573
3x    23.8249     0.1513    24.5343     0.3437	  -2.97756
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
                              ebizzy 
+-----------+-----------+-----------+------------+-----------+
1x 10358.0000   442.6598 18037.5000   315.2074	   74.14076
2x  2705.5000   130.0286   102.2500   104.3521	  -96.22066
+-----------+-----------+-----------+------------+-----------+



* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-09 18:51                     ` Raghavendra K T
@ 2012-10-10  2:59                       ` Andrew Theurer
  2012-10-10 17:54                         ` Raghavendra K T
  2012-10-10 14:24                       ` Andrew Theurer
  2012-10-18 12:39                       ` Avi Kivity
  2 siblings, 1 reply; 126+ messages in thread
From: Andrew Theurer @ 2012-10-10  2:59 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
> 
> > On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> > > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> > >> 
> > >> Again the numbers are ridiculously high for arch_local_irq_restore.
> > >> Maybe there's a bad perf/kvm interaction when we're injecting an
> > >> interrupt, I can't believe we're spending 84% of the time running the
> > >> popf instruction. 
> > > 
> > > Smells like a software fallback that doesn't do NMI, hrtimer based
> > > sampling typically hits popf where we re-enable interrupts.
> > 
> > Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
> > is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
> > host will expose it (and a good idea anyway to get best performance).
> > 
> 
> Hi Avi, you are right. SandyBridge machine result was not proper.
> I cleaned up the services, enabled PMU, re-ran all the test again.
> 
> Here is the summary:
> We do get good benefit by increasing ple window. Though we don't
> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
> 
> Let me know if you think we can increase the default ple_window
> itself to 16k.
> 
> I am experimenting with V2 version of undercommit improvement(this) patch
> series, But I think if you wish  to go for increase of
> default ple_window, then we would have to measure the benefit of patches
> when ple_window = 16k.
> 
> I can respin the whole series including this default ple_window change.
> 
> I also have the perf kvm top result for both ebizzy and kernbench.
> I think they are in expected lines now.
> 
> [...]
> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 
> ========================================================================

Is the perf data for 1x overcommit?

> pleopt   ple_gap=0
> --------------------
> ebizzy : 18131 records/s
> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
>     5.65%  [guest.kernel]  [g] smp_call_function_many
>     3.12%  [guest.kernel]  [g] clear_page
>     3.02%  [guest.kernel]  [g] down_read_trylock
>     1.85%  [guest.kernel]  [g] async_page_fault
>     1.81%  [guest.kernel]  [g] up_read
>     1.76%  [guest.kernel]  [g] native_apic_mem_write
>     1.70%  [guest.kernel]  [g] find_vma

Does 'perf kvm top' not give host samples at the same time?  Would be
nice to see the host overhead as a function of varying ple window.  I
would expect that to be the major difference between 4/16/32k window
sizes.

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.

> [...]




* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-09 18:51                     ` Raghavendra K T
  2012-10-10  2:59                       ` Andrew Theurer
@ 2012-10-10 14:24                       ` Andrew Theurer
  2012-10-10 17:43                         ` Raghavendra K T
  2012-10-11 10:39                         ` Nikunj A Dadhania
  2012-10-18 12:39                       ` Avi Kivity
  2 siblings, 2 replies; 126+ messages in thread
From: Andrew Theurer @ 2012-10-10 14:24 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results.  I think it helps to
visualize what's going on regarding the yielding.

These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record').  The Y
axis is the host cpus, each row being 10 pixels high.  For these tests,
there are 80 host cpus, so the total height is 800 pixels.  The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.

Each row (each host cpu) is assigned a color based on what thread is
running.  vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc), and each vCPU has a unique brightness for that
color.  There are a maximum of 12 assignable colors, so the vCPUs of
any VM beyond the twelfth revert to gray. I would use more colors, but
it becomes
harder to distinguish one color from another.  The white color
represents missing data from perf, and black color represents any thread
which is not a vCPU.
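
To make the processing concrete, below is a toy sketch in C of this
kind of converter. It is purely illustrative: it assumes the 'perf
sched map' output has already been flattened to one whitespace-separated
task token per host cpu per line (one line per 1-microsecond sample),
and it emits PPM rather than PNG. The real script also does the per-VM
color grouping described above; here every distinct token just gets a
hashed color.

#include <stdio.h>
#include <string.h>

#define MAX_CPUS    80
#define MAX_SAMPLES 30000

static unsigned char img[MAX_CPUS][MAX_SAMPLES][3];

/* Hash a task token (e.g. "A0") to a stable color; "." (idle) maps
 * to black. */
static void token_color(const char *tok, unsigned char rgb[3])
{
	unsigned int h = 5381;
	const char *p;

	if (!strcmp(tok, ".")) {
		rgb[0] = rgb[1] = rgb[2] = 0;
		return;
	}
	for (p = tok; *p; p++)
		h = h * 33 + (unsigned char)*p;
	rgb[0] = 64 + (h & 0x7f);
	rgb[1] = 64 + ((h >> 7) & 0x7f);
	rgb[2] = 64 + ((h >> 14) & 0x7f);
}

int main(void)
{
	char line[4096];
	int samples = 0, cpu, t;

	while (samples < MAX_SAMPLES && fgets(line, sizeof(line), stdin)) {
		char *tok;

		cpu = 0;
		for (tok = strtok(line, " \t\n");
		     tok && cpu < MAX_CPUS;
		     tok = strtok(NULL, " \t\n"))
			token_color(tok, img[cpu++][samples]);
		samples++;
	}

	/* PPM header: width = time samples, height = host cpus */
	printf("P6\n%d %d\n255\n", samples, MAX_CPUS);
	for (cpu = 0; cpu < MAX_CPUS; cpu++)
		for (t = 0; t < samples; t++)
			fwrite(img[cpu][t], 1, 3, stdout);
	return 0;
}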

For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).

Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
described above, only 12 of the VMs have a color; the rest are gray).

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler.  They are pretty well aligned across all
cpus.  Normally, for cpu bound workloads, we would expect to see each
thread to run for 4 ms, then something else getting to run, and so on.
That is mostly true in this test.  We have 2x over-commit and we
generally see the switching of threads at 4ms.  One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on.  This is most likely
because of the yield_to initiated by the PLE handler.  In this case
there is not that much yielding to do.   It's quite clean, and the
performance is quite good.

Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This one looks quite different.  In short, it's a mess.  The switching
between tasks can be lower than 10 microseconds.  It basically never
recovers.  There is constant yielding all the time.  

Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches.  While I am not recommending gang scheduling, I
think it's a good data point.  The performance is 3.88x the PLE result.

https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Note that the task switching intervals of 4ms are quite obvious again,
and this time all vCPUs from same VM run at the same time.  It
represents the best possible outcome.


Anyway, I thought the bitmaps might help better visualize what's going
on.

-Andrew





* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 14:24                       ` Andrew Theurer
@ 2012-10-10 17:43                         ` Raghavendra K T
  2012-10-10 19:27                           ` Andrew Theurer
  2012-10-11 10:39                         ` Nikunj A Dadhania
  1 sibling, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-10 17:43 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> I ran 'perf sched map' on the dbench workload for medium and large VMs,
> and I thought I would share some of the results.  I think it helps to
> visualize what's going on regarding the yielding.
>
> These files are png bitmaps, generated from processing output from 'perf
> sched map' (and perf data generated from 'perf sched record').  The Y
> axis is the host cpus, each row being 10 pixels high.  For these tests,
> there are 80 host cpus, so the total height is 800 pixels.  The X axis
> is time (in microseconds), with each pixel representing 1 microsecond.
> Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
> obviously, and zooming in/out while viewing is recommended.
>
> Each row (each host cpu) is assigned a color based on what thread is
> running.  vCPUs of the same VM are assigned a common color (like red,
> blue, magenta, etc), and each vCPU has a unique brightness for that
> color.  There are a maximum of 12 assignable colors, so in any VMs >12
> revert to vCPU color of gray. I would use more colors, but it becomes
> harder to distinguish one color from another.  The white color
> represents missing data from perf, and black color represents any thread
> which is not a vCPU.
>
> For the following tests, VMs were pinned to host NUMA nodes and to
> specific cpus to help with consistency and operate within the
> constraints of the last test (gang scheduler).
>
> Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
> described above only 12 of the VMs have a color, rest are gray).
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This is a very nice way to visualize what is happening. The beginning
of the graph looks a little messy, but later it is clear.

>
> If you zoom out and look at the whole bitmap, you may notice the 4ms
> intervals of the scheduler.  They are pretty well aligned across all
> cpus.  Normally, for cpu bound workloads, we would expect to see each
> thread to run for 4 ms, then something else getting to run, and so on.
> That is mostly true in this test.  We have 2x over-commit and we
> generally see the switching of threads at 4ms.  One thing to note is
> that not all vCPU threads for the same VM run at exactly the same time,
> and that is expected and the whole reason for lock-holder preemption.
> Now, if you zoom in on the bitmap, you should notice within the 4ms
> intervals there is some task switching going on.  This is most likely
> because of the yield_to initiated by the PLE handler.  In this case
> there is not that much yielding to do.   It's quite clean, and the
> performance is quite good.
>
> Below is an example of PLE, but this time with 20-way VMs, 8 of them.
> CPU over-commit is still 2x.
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

I think this link is still the 10x16 one. Could you paste the link again?

> [...]



* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10  2:59                       ` Andrew Theurer
@ 2012-10-10 17:54                         ` Raghavendra K T
  2012-10-10 18:03                           ` David Ahern
  2012-10-10 19:36                           ` Andrew Theurer
  0 siblings, 2 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-10 17:54 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
>>
>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>
>>>>> Again the numbers are ridiculously high for arch_local_irq_restore.
>>>>> Maybe there's a bad perf/kvm interaction when we're injecting an
>>>>> interrupt, I can't believe we're spending 84% of the time running the
>>>>> popf instruction.
>>>>
>>>> Smells like a software fallback that doesn't do NMI, hrtimer based
>>>> sampling typically hits popf where we re-enable interrupts.
>>>
>>> Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
>>> is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
>>> host will expose it (and a good idea anyway to get best performance).
>>>
>>
>> Hi Avi, you are right. SandyBridge machine result was not proper.
>> I cleaned up the services, enabled PMU, re-ran all the test again.
>>
>> Here is the summary:
>> We do get good benefit by increasing ple window. Though we don't
>> see good benefit for kernbench and sysbench, for ebizzy, we get huge
>> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
>>
>> Let me know if you think we can increase the default ple_window
>> itself to 16k.
>>
>> I am experimenting with V2 version of undercommit improvement(this) patch
>> series, But I think if you wish  to go for increase of
>> default ple_window, then we would have to measure the benefit of patches
>> when ple_window = 16k.
>>
>> I can respin the whole series including this default ple_window change.
>>
>> I also have the perf kvm top result for both ebizzy and kernbench.
>> I think they are in expected lines now.
>>
>> [...]
>> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
>> ========================================================================
>
> Is the perf data for 1x overcommit?

Yes, a 16-vcpu guest on 16 cores.

>
>> pleopt   ple_gap=0
>> --------------------
>> ebizzy : 18131 records/s
>> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
>>      5.65%  [guest.kernel]  [g] smp_call_function_many
>>      3.12%  [guest.kernel]  [g] clear_page
>>      3.02%  [guest.kernel]  [g] down_read_trylock
>>      1.85%  [guest.kernel]  [g] async_page_fault
>>      1.81%  [guest.kernel]  [g] up_read
>>      1.76%  [guest.kernel]  [g] native_apic_mem_write
>>      1.70%  [guest.kernel]  [g] find_vma
>
> Does 'perf kvm top' not give host samples at the same time?  Would be
> nice to see the host overhead as a function of varying ple window.  I
> would expect that to be the major difference between 4/16/32k window
> sizes.

No, I did something like this:
perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3
Yes, that is a good idea.

(I am getting some segfaults with perf top; I think it is already
fixed, but I have yet to see the patch that fixes it.)



>
> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> has just terrible scalability to begin with.  I do not think we should
> try to optimize such a bad workload.
>

I think my way of running dbench has some flaw, so I switched to ebizzy.
Could you let me know how you generally run dbench?



* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 17:54                         ` Raghavendra K T
@ 2012-10-10 18:03                           ` David Ahern
  2012-10-10 18:14                             ` Raghavendra K T
  2012-10-10 19:36                           ` Andrew Theurer
  1 sibling, 1 reply; 126+ messages in thread
From: David Ahern @ 2012-10-10 18:03 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: habanero, Avi Kivity, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, chegu vinod, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 10/10/12 11:54 AM, Raghavendra K T wrote:
> No, I did something like this
> perf kvm  --guestvmlinux ./vmlinux.guest top -g  -U -d 3. Yes that is a
> good idea.
>
> (I am getting some segfaults with perf top, I think it is already fixed
> but yet to see the patch that fixes)

What version of perf:  perf --version




* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 18:03                           ` David Ahern
@ 2012-10-10 18:14                             ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-10 18:14 UTC (permalink / raw)
  To: David Ahern
  Cc: habanero, Avi Kivity, Peter Zijlstra, Rik van Riel,
	H. Peter Anvin, Ingo Molnar, Marcelo Tosatti, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, chegu vinod, LKML,
	Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones

On 10/10/2012 11:33 PM, David Ahern wrote:
> On 10/10/12 11:54 AM, Raghavendra K T wrote:
>> No, I did something like this
>> perf kvm  --guestvmlinux ./vmlinux.guest top -g  -U -d 3. Yes that is a
>> good idea.
>>
>> (I am getting some segfaults with perf top, I think it is already fixed
>> but yet to see the patch that fixes)
>
> What version of perf:  perf --version
>

perf version 2.6.32-279.el6.x86_64.debug

(I found that it is fixed in 288, but could not dig out the actual
patch.)



* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 17:43                         ` Raghavendra K T
@ 2012-10-10 19:27                           ` Andrew Theurer
  2012-10-11 17:13                             ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Andrew Theurer @ 2012-10-10 19:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> > I ran 'perf sched map' on the dbench workload for medium and large VMs,
> > and I thought I would share some of the results.  I think it helps to
> > visualize what's going on regarding the yielding.
> >
> > These files are png bitmaps, generated from processing output from 'perf
> > sched map' (and perf data generated from 'perf sched record').  The Y
> > axis is the host cpus, each row being 10 pixels high.  For these tests,
> > there are 80 host cpus, so the total height is 800 pixels.  The X axis
> > is time (in microseconds), with each pixel representing 1 microsecond.
> > Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
> > obviously, and zooming in/out while viewing is recommended.
> >
> > Each row (each host cpu) is assigned a color based on what thread is
> > running.  vCPUs of the same VM are assigned a common color (like red,
> > blue, magenta, etc), and each vCPU has a unique brightness for that
> > color.  There are a maximum of 12 assignable colors, so in any VMs >12
> > revert to vCPU color of gray. I would use more colors, but it becomes
> > harder to distinguish one color from another.  The white color
> > represents missing data from perf, and black color represents any thread
> > which is not a vCPU.
> >
> > For the following tests, VMs were pinned to host NUMA nodes and to
> > specific cpus to help with consistency and operate within the
> > constraints of the last test (gang scheduler).
> >
> > Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
> > described above only 12 of the VMs have a color, rest are gray).
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
> 
> This looks very nice to visualize what is happening. Beginning of the 
> graph looks little messy but later it is clear.
> 
> >
> > If you zoom out and look at the whole bitmap, you may notice the 4ms
> > intervals of the scheduler.  They are pretty well aligned across all
> > cpus.  Normally, for cpu bound workloads, we would expect to see each
> > thread to run for 4 ms, then something else getting to run, and so on.
> > That is mostly true in this test.  We have 2x over-commit and we
> > generally see the switching of threads at 4ms.  One thing to note is
> > that not all vCPU threads for the same VM run at exactly the same time,
> > and that is expected and the whole reason for lock-holder preemption.
> > Now, if you zoom in on the bitmap, you should notice within the 4ms
> > intervals there is some task switching going on.  This is most likely
> > because of the yield_to initiated by the PLE handler.  In this case
> > there is not that much yielding to do.   It's quite clean, and the
> > performance is quite good.
> >
> > Below is an example of PLE, but this time with 20-way VMs, 8 of them.
> > CPU over-commit is still 2x.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
> 
> I think this link still 10x16. Could you paste the link again?

Oops
https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ

> [...]




* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 17:54                         ` Raghavendra K T
  2012-10-10 18:03                           ` David Ahern
@ 2012-10-10 19:36                           ` Andrew Theurer
  2012-10-15 12:10                             ` Raghavendra K T
  1 sibling, 1 reply; 126+ messages in thread
From: Andrew Theurer @ 2012-10-10 19:36 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
> >>
> >>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>
> >>>>> Again the numbers are ridiculously high for arch_local_irq_restore.
> >>>>> Maybe there's a bad perf/kvm interaction when we're injecting an
> >>>>> interrupt, I can't believe we're spending 84% of the time running the
> >>>>> popf instruction.
> >>>>
> >>>> Smells like a software fallback that doesn't do NMI, hrtimer based
> >>>> sampling typically hits popf where we re-enable interrupts.
> >>>
> >>> Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
> >>> is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
> >>> host will expose it (and a good idea anyway to get best performance).
> >>>
> >>
> >> Hi Avi, you are right. SandyBridge machine result was not proper.
> >> I cleaned up the services, enabled PMU, re-ran all the test again.
> >>
> >> Here is the summary:
> >> We do get good benefit by increasing ple window. Though we don't
> >> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> >> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
> >>
> >> Let me know if you think we can increase the default ple_window
> >> itself to 16k.
> >>
> >> I am experimenting with V2 version of undercommit improvement(this) patch
> >> series, But I think if you wish  to go for increase of
> >> default ple_window, then we would have to measure the benefit of patches
> >> when ple_window = 16k.
> >>
> >> I can respin the whole series including this default ple_window change.
> >>
> >> I also have the perf kvm top result for both ebizzy and kernbench.
> >> I think they are in expected lines now.
> >>
> >> [...]
> >> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
> >> ========================================================================
> >
> > Is the perf data for 1x overcommit?
> 
> Yes, 16vcpu guest on 16 core
> 
> >
> >> pleopt   ple_gap=0
> >> --------------------
> >> ebizzy : 18131 records/s
> >> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
> >>      5.65%  [guest.kernel]  [g] smp_call_function_many
> >>      3.12%  [guest.kernel]  [g] clear_page
> >>      3.02%  [guest.kernel]  [g] down_read_trylock
> >>      1.85%  [guest.kernel]  [g] async_page_fault
> >>      1.81%  [guest.kernel]  [g] up_read
> >>      1.76%  [guest.kernel]  [g] native_apic_mem_write
> >>      1.70%  [guest.kernel]  [g] find_vma
> >
> > Does 'perf kvm top' not give host samples at the same time?  Would be
> > nice to see the host overhead as a function of varying ple window.  I
> > would expect that to be the major difference between 4/16/32k window
> > sizes.
> 
> No, I did something like this:
> perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3. Yes, that is a
> good idea.
> 
> (I am getting some segfaults with perf top; I think it is already fixed,
> but I have yet to see the patch that fixes it.)
> 
> 
> 
> >
> > A big concern I have (if this is 1x overcommit) for ebizzy is that it
> > has just terrible scalability to begin with.  I do not think we should
> > try to optimize such a bad workload.
> >
> 
> I think my way of running dbench has some flaw, so I went to ebizzy.
> Could you let me know how you generally run dbench?

I mount a tmpfs and then specify that mount for dbench to run on.  This
eliminates all IO.  I use a 300 second run time and number of threads is
equal to number of vcpus.  All of the VMs of course need to have a
synchronized start.

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved.  Without any lock-holder
preemption, the time in spin_lock should be very low:


>     21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
>      3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
>      2.81%      10176         dbench  dbench              [.] child_run
>      2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
>      2.33%       8423         dbench  dbench              [.] next_token
>      2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
>      1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
>      1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
>      1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
>      1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
>      1.38%       5009         dbench  libc-2.12.so        [.] memmove
>      1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
>      1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit

-Andrew



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 14:24                       ` Andrew Theurer
  2012-10-10 17:43                         ` Raghavendra K T
@ 2012-10-11 10:39                         ` Nikunj A Dadhania
  1 sibling, 0 replies; 126+ messages in thread
From: Nikunj A Dadhania @ 2012-10-11 10:39 UTC (permalink / raw)
  To: habanero, Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, KVM, Jiannan Ouyang,
	chegu vinod, LKML, Srivatsa Vaddagiri, Gleb Natapov,
	Andrew Jones

On Wed, 10 Oct 2012 09:24:55 -0500, Andrew Theurer <habanero@linux.vnet.ibm.com> wrote:
> 
> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> scheduling patches.  While I am not recommending gang scheduling, I
> think it's a good data point.  The performance is 3.88x the PLE result.
> 
> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

That looks pretty good and serves the purpose. And the result says it all.

> Note that the task switching intervals of 4ms are quite obvious again,
> and this time all vCPUs from same VM run at the same time.  It
> represents the best possible outcome.
> 
> 
> Anyway, I thought the bitmaps might help better visualize what's going
> on.
> 
> -Andrew
> 

Regards
Nikunj


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 19:27                           ` Andrew Theurer
@ 2012-10-11 17:13                             ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-11 17:13 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/11/2012 12:57 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
>> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
>>> I ran 'perf sched map' on the dbench workload for medium and large VMs,
>>> and I thought I would share some of the results.  I think it helps to
>>> visualize what's going on regarding the yielding.
>>>
>>> These files are png bitmaps, generated from processing output from 'perf
>>> sched map' (and perf data generated from 'perf sched record').  The Y
>>> axis is the host cpus, each row being 10 pixels high.  For these tests,
>>> there are 80 host cpus, so the total height is 800 pixels.  The X axis
>>> is time (in microseconds), with each pixel representing 1 microsecond.
>>> Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
>>> obviously, and zooming in/out while viewing is recommended.
>>>
>>> Each row (each host cpu) is assigned a color based on what thread is
>>> running.  vCPUs of the same VM are assigned a common color (like red,
>>> blue, magenta, etc), and each vCPU has a unique brightness for that
>>> color.  There are a maximum of 12 assignable colors, so the vCPUs of any
>>> VMs beyond the first 12 revert to gray.  I would use more colors, but it becomes
>>> harder to distinguish one color from another.  The white color
>>> represents missing data from perf, and black color represents any thread
>>> which is not a vCPU.
>>>
>>> For the following tests, VMs were pinned to host NUMA nodes and to
>>> specific cpus to help with consistency and operate within the
>>> constraints of the last test (gang scheduler).
>>>
>>> Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
>>> described above only 12 of the VMs have a color, rest are gray).
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>>
>> This is a very nice way to visualize what is happening. The beginning of
>> the graph looks a little messy, but later it is clear.
>>
>>>
>>> If you zoom out and look at the whole bitmap, you may notice the 4ms
>>> intervals of the scheduler.  They are pretty well aligned across all
>>> cpus.  Normally, for cpu bound workloads, we would expect to see each
>>> thread to run for 4 ms, then something else getting to run, and so on.
>>> That is mostly true in this test.  We have 2x over-commit and we
>>> generally see the switching of threads at 4ms.  One thing to note is
>>> that not all vCPU threads for the same VM run at exactly the same time,
>>> and that is expected and the whole reason for lock-holder preemption.
>>> Now, if you zoom in on the bitmap, you should notice within the 4ms
>>> intervals there is some task switching going on.  This is most likely
>>> because of the yield_to initiated by the PLE handler.  In this case
>>> there is not that much yielding to do.   It's quite clean, and the
>>> performance is quite good.
>>>
>>> Below is an example of PLE, but this time with 20-way VMs, 8 of them.
>>> CPU over-commit is still 2x.
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>>
>> I think this link is still the 10x16 one. Could you paste the link again?
>
> Oops
> https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ
>
>>
>>>
>>> This one looks quite different.  In short, it's a mess.  The switching
>>> between tasks can be lower than 10 microseconds.  It basically never
>>> recovers.  There is constant yielding all the time.
>>>
>>> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
>>> scheduling patches.  While I am not recommending gang scheduling, I
>>> think it's a good data point.  The performance is 3.88x the PLE result.
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Yes, we see a lot of yields.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-10 19:36                           ` Andrew Theurer
@ 2012-10-15 12:10                             ` Raghavendra K T
  2012-10-15 14:34                               ` Andrew Theurer
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-15 12:10 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>>>> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
>>>>
>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>>>
[...]
>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
>>> has just terrible scalability to begin with.  I do not think we should
>>> try to optimize such a bad workload.
>>>
>>
>> I think my way of running dbench has some flaw, so I went to ebizzy.
>> Could you let me know how you generally run dbench?
>
> I mount a tmpfs and then specify that mount for dbench to run on.  This
> eliminates all IO.  I use a 300 second run time and number of threads is
> equal to number of vcpus.  All of the VMs of course need to have a
> synchronized start.
>
> I would also make sure you are using a recent kernel for dbench, where
> the dcache scalability is much improved.  Without any lock-holder
> preemption, the time in spin_lock should be very low:
>
>
>>      21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
>>       3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
>>       2.81%      10176         dbench  dbench              [.] child_run
>>       2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
>>       2.33%       8423         dbench  dbench              [.] next_token
>>       2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
>>       1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
>>       1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
>>       1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
>>       1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
>>       1.38%       5009         dbench  libc-2.12.so        [.] memmove
>>       1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
>>       1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
>

Hi Andrew,
I ran the test with dbench on tmpfs. I do not see any improvement in
dbench with the 16k ple window.

So it seems that apart from ebizzy, no workload benefited from that, and I
agree that it may not be good to optimize for ebizzy.
I shall drop the change to a 16k default window and continue with the
original patch series. I need to experiment with the latest kernel.

(PS: Thanks for pointing me towards perf in the latest kernel. It works fine.)

Results:
dbench run for 120 sec with a 30 sec warmup, 8 iterations, using tmpfs
base = 3.6.0-rc5 with ple handler optimization patch.

x => base + ple_window = 4k
+ => base + ple_window = 16k
* => base + ple_gap = 0

dbench 1x overcommit case
=========================
     N           Min           Max        Median           Avg        Stddev
x   8        5322.5       5519.05       5482.71     5461.0962     63.522276
+   8       5255.45       5530.55       5496.94     5455.2137     93.070363
*   8       5350.85       5477.81      5408.065     5418.4338     44.762697


dbench 2x overcommit case
==========================

     N           Min           Max        Median           Avg        Stddev
x   8       3054.32       3194.47       3137.33      3132.625     54.491615
+   8        3040.8       3148.87      3088.615     3088.1887     32.862336
*   8       3031.51       3171.99        3083.6     3097.4612     50.526977


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-15 12:10                             ` Raghavendra K T
@ 2012-10-15 14:34                               ` Andrew Theurer
  2012-10-19  8:30                                 ` Raghavendra K T
  0 siblings, 1 reply; 126+ messages in thread
From: Andrew Theurer @ 2012-10-15 14:34 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
> >>>>
> >>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>
> [...]
> >>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>> has just terrible scalability to begin with.  I do not think we should
> >>> try to optimize such a bad workload.
> >>>
> >>
> >> I think my way of running dbench has some flaw, so I went to ebizzy.
> >> Could you let me know how you generally run dbench?
> >
> > I mount a tmpfs and then specify that mount for dbench to run on.  This
> > eliminates all IO.  I use a 300 second run time and number of threads is
> > equal to number of vcpus.  All of the VMs of course need to have a
> > synchronized start.
> >
> > I would also make sure you are using a recent kernel for dbench, where
> > the dcache scalability is much improved.  Without any lock-holder
> > preemption, the time in spin_lock should be very low:
> >
> >
> >>      21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
> >>       3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
> >>       2.81%      10176         dbench  dbench              [.] child_run
> >>       2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
> >>       2.33%       8423         dbench  dbench              [.] next_token
> >>       2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
> >>       1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
> >>       1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
> >>       1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
> >>       1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
> >>       1.38%       5009         dbench  libc-2.12.so        [.] memmove
> >>       1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
> >>       1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
> >
> 
> Hi Andrew,
> I ran the test with dbench on tmpfs. I do not see any improvement in
> dbench with the 16k ple window.
> 
> So it seems that apart from ebizzy, no workload benefited from that, and I
> agree that it may not be good to optimize for ebizzy.
> I shall drop the change to a 16k default window and continue with the
> original patch series. I need to experiment with the latest kernel.

Thanks for running this again.  I do believe there are some workloads
that, when run at 1x overcommit, would benefit from a larger ple_window
[with the current ple handling code], but I also do not want to potentially
degrade >1x with a larger window.  I do, however, think there may be
another option.  I have not fully worked this out, but I think I am on
to something.

I decided to revert back to just a yield() instead of a yield_to().  My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go....   Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again.  The more we can do to
get all vcpus running at the same time, the less we deal with the
preemption problem.  The other benefit is that yield() is far, far lower
overhead than yield_to().

This does assume that vcpus from same VM do not share same runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs
in the same way that yield_to() is not.  My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue.  I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet it could also help user apps which use spin locks).
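
As a rough illustration of that placement restriction (this is only a
userspace sketch, not part of any patch; the vcpu thread id and the cpu
assignment policy are assumed to come from the management stack), pinning
each vcpu thread to its own host cpu guarantees sibling vcpus never share
a runqueue:

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* illustrative only: give each vcpu thread of a VM a dedicated host cpu
 * so sibling vcpus can never queue up behind one another */
static int pin_vcpu_thread(pid_t vcpu_tid, int host_cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(host_cpu, &set);
	return sched_setaffinity(vcpu_tid, sizeof(set), &set);
}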

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to().  The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself.  If we use yield() most of the time, this overhead
will go away.
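
In other words, the fast path would collapse to little more than the
following (a sketch only; the real handler would still keep a throttled
yield_to() case, as discussed below):

/* sketch: on most PLE exits, skip the vcpu-scanning boost logic and
 * simply let the scheduler pick something else to run, ideally a vcpu
 * of some other VM */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	yield();
}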

Here is a comparison of yield_to() and yield():

dbench with 20-way VMs, 8 of them on 80-way host:

no PLE			  426 +/- 11.03%
no PLE w/ gangsched	32001 +/- .37%
PLE with yield()	29207 +/- .28%
PLE with yield_to()	 8175 +/- 1.37%

Yield() is far and away better than yield_to() here and almost approaches
the gang sched result.  Here is a link for the perf sched map bitmap:

https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU

The thrashing is way down and sibling vcpus tend to run together,
approximating the behavior of the gang scheduling without needing to
actually implement gang scheduling.

I did test a smaller VM:

dbench with 10-way VMs, 16 of them on 80-way host:

no PLE			 6248 +/- 7.69%	  
no PLE w/ gangsched	28379 +/- .07%
PLE with yield()	29196 +/- 1.62%
PLE with yield_to()	32217 +/- 1.76%

There is some degradation going from yield_to() to yield() here, but not
nearly as large as the uplift we see on the larger VMs.  Regardless, I have
an idea to fix that: instead of using yield() all the time, we could use
yield_to(), but limit the rate per vcpu to something like 1 per jiffy.
All other exits use yield().  That rate of yield_to() should be more
than enough for the smaller VMs, and the result should be hopefully just
the same as the current code.  I have not coded this up yet, but it's my
next step.

I am also hopeful the limitation of yield_to() will also make the 1x
issue just go away as well (even with 4096 ple_window).  The vast
majority of exits will result in yield() which should be harmless.

Keep in mind this did require ensuring sibling vcpus do not share host
runqueues - I do think that can be possible given some optional scheduler
tweaks.

> 
> (PS: Thanks for pointing me towards perf in the latest kernel. It works fine.)
> 
> Results:
> dbench run for 120 sec with a 30 sec warmup, 8 iterations, using tmpfs
> base = 3.6.0-rc5 with ple handler optimization patch.
> 
> x => base + ple_window = 4k
> + => base + ple_window = 16k
> * => base + ple_gap = 0
> 
> dbench 1x overcommit case
> =========================
>      N           Min           Max        Median           Avg        Stddev
> x   8        5322.5       5519.05       5482.71     5461.0962     63.522276
> +   8       5255.45       5530.55       5496.94     5455.2137     93.070363
> *   8       5350.85       5477.81      5408.065     5418.4338     44.762697
> 
> 
> dbench 2x overcommit case
> ==========================
> 
>      N           Min           Max        Median           Avg        Stddev
> x   8       3054.32       3194.47       3137.33      3132.625     54.491615
> +   8        3040.8       3148.87      3088.615     3088.1887     32.862336
> *   8       3031.51       3171.99        3083.6     3097.4612     50.526977
> 

-Andrew


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-09 18:51                     ` Raghavendra K T
  2012-10-10  2:59                       ` Andrew Theurer
  2012-10-10 14:24                       ` Andrew Theurer
@ 2012-10-18 12:39                       ` Avi Kivity
  2012-10-19  8:19                         ` Raghavendra K T
  2 siblings, 1 reply; 126+ messages in thread
From: Avi Kivity @ 2012-10-18 12:39 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew M. Theurer, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/09/2012 08:51 PM, Raghavendra K T wrote:
> Here is the summary:
> We do get a good benefit by increasing the ple window. Though we don't
> see much benefit for kernbench and sysbench, for ebizzy we get a huge
> improvement in the 1x scenario (almost 2/3rd of the ple-disabled case).
> 
> Let me know if you think we can increase the default ple_window
> itself to 16k.
>

I think so; there is no point running with untuned defaults.

> 
> I can respin the whole series including this default ple_window change.

It can come as a separate patch.

> 
> I also have the perf kvm top result for both ebizzy and kernbench.
> I think they are along expected lines now.
> 
> Improvements
> ================
> 
> 16 core PLE machine with 16 vcpu guest
> 
> base = 3.6.0-rc5 + ple handler optimization patches
> base_pleopt_16k = base + ple_window = 16k
> base_pleopt_32k = base + ple_window = 32k
> base_pleopt_nople = base + ple_gap = 0
> kernbench, hackbench, sysbench (time in sec lower is better)
> ebizzy (rec/sec higher is better)
> 
> % improvements w.r.t base (ple_window = 4k)
> ---------------+---------------+-----------------+-------------------+
>                |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> ---------------+---------------+-----------------+-------------------+
> kernbench_1x   |  0.42371      |  1.15164        |   0.09320         |
> kernbench_2x   | -1.40981      | -17.48282       |  -570.77053       |
> ---------------+---------------+-----------------+-------------------+
> sysbench_1x    | -0.92367      | 0.24241         | -0.27027          |
> sysbench_2x    | -2.22706      |-0.30896         | -1.27573          |
> sysbench_3x    | -0.75509      | 0.09444         | -2.97756          |
> ---------------+---------------+-----------------+-------------------+
> ebizzy_1x      | 54.99976      | 67.29460        |  74.14076         |
> ebizzy_2x      | -8.83386      |-27.38403        | -96.22066         |
> ---------------+---------------+-----------------+-------------------+

So it seems we want dynamic PLE windows.  As soon as we enter overcommit
we need to decrease the window.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-18 12:39                       ` Avi Kivity
@ 2012-10-19  8:19                         ` Raghavendra K T
  0 siblings, 0 replies; 126+ messages in thread
From: Raghavendra K T @ 2012-10-19  8:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrew M. Theurer, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/18/2012 06:09 PM, Avi Kivity wrote:
> On 10/09/2012 08:51 PM, Raghavendra K T wrote:
>> Here is the summary:
>> We do get a good benefit by increasing the ple window. Though we don't
>> see much benefit for kernbench and sysbench, for ebizzy we get a huge
>> improvement in the 1x scenario (almost 2/3rd of the ple-disabled case).
>>
>> Let me know if you think we can increase the default ple_window
>> itself to 16k.
>>
>
> I think so; there is no point running with untuned defaults.
>

Okay.

>>
>> I can respin the whole series including this default ple_window change.
>
> It can come as a separate patch.

Yes. Will spin it separately.

>
>>
>> I also have the perf kvm top result for both ebizzy and kernbench.
>> I think they are along expected lines now.
>>
>> Improvements
>> ================
>>
>> 16 core PLE machine with 16 vcpu guest
>>
>> base = 3.6.0-rc5 + ple handler optimization patches
>> base_pleopt_16k = base + ple_window = 16k
>> base_pleopt_32k = base + ple_window = 32k
>> base_pleopt_nople = base + ple_gap = 0
>> kernbench, hackbench, sysbench (time in sec lower is better)
>> ebizzy (rec/sec higher is better)
>>
>> % improvements w.r.t base (ple_window = 4k)
>> ---------------+---------------+-----------------+-------------------+
>>                 |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
>> ---------------+---------------+-----------------+-------------------+
>> kernbench_1x   |  0.42371      |  1.15164        |   0.09320         |
>> kernbench_2x   | -1.40981      | -17.48282       |  -570.77053       |
>> ---------------+---------------+-----------------+-------------------+
>> sysbench_1x    | -0.92367      | 0.24241         | -0.27027          |
>> sysbench_2x    | -2.22706      |-0.30896         | -1.27573          |
>> sysbench_3x    | -0.75509      | 0.09444         | -2.97756          |
>> ---------------+---------------+-----------------+-------------------+
>> ebizzy_1x      | 54.99976      | 67.29460        |  74.14076         |
>> ebizzy_2x      | -8.83386      |-27.38403        | -96.22066         |
>> ---------------+---------------+-----------------+-------------------+
>
> So it seems we want dynamic PLE windows.  As soon as we enter overcommit
> we need to decrease the window.
>

Okay.
I have a rough idea on the implementation. I'll try that after these
V2 experiments are over.
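
Roughly, the dynamic window could look something like this (an untested
sketch; the struct, field and constant names are all made up for
illustration, and the real code would live in the vmx/svm backends):

struct vcpu_ple {
	unsigned int ple_window;	/* current per-vcpu window */
};

#define PLE_WINDOW_MIN	4096
#define PLE_WINDOW_MAX	(16 * 1024)

static void ple_window_update(struct vcpu_ple *v, int overcommitted)
{
	unsigned int w = v->ple_window;

	if (overcommitted)
		w /= 2;		/* exit spinning guests early when pcpus are contended */
	else
		w *= 2;		/* let vcpus spin longer when each one has a pcpu */

	if (w < PLE_WINDOW_MIN)
		w = PLE_WINDOW_MIN;
	if (w > PLE_WINDOW_MAX)
		w = PLE_WINDOW_MAX;
	v->ple_window = w;	/* propagated to the VMCS on the next vmentry */
}
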
So in brief, I have this in my queue, priority-wise:

1) V2 version of this patch series (in progress)
2) default PLE window
3) preemption notifiers
4) PV spinlock


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-15 14:34                               ` Andrew Theurer
@ 2012-10-19  8:30                                 ` Raghavendra K T
  2012-10-19 13:31                                   ` Andrew Theurer
  0 siblings, 1 reply; 126+ messages in thread
From: Raghavendra K T @ 2012-10-19  8:30 UTC (permalink / raw)
  To: habanero
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
>> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
>>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
>>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
>>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>>>>>> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
>>>>>>
>>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>>>>>
>> [...]
>>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
>>>>> has just terrible scalability to begin with.  I do not think we should
>>>>> try to optimize such a bad workload.
>>>>>
>>>>
>>>> I think my way of running dbench has some flaw, so I went to ebizzy.
>>>> Could you let me know how you generally run dbench?
>>>
>>> I mount a tmpfs and then specify that mount for dbench to run on.  This
>>> eliminates all IO.  I use a 300 second run time and number of threads is
>>> equal to number of vcpus.  All of the VMs of course need to have a
>>> synchronized start.
>>>
>>> I would also make sure you are using a recent kernel for dbench, where
>>> the dcache scalability is much improved.  Without any lock-holder
>>> preemption, the time in spin_lock should be very low:
>>>
>>>
>>>>       21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
>>>>        3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
>>>>        2.81%      10176         dbench  dbench              [.] child_run
>>>>        2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
>>>>        2.33%       8423         dbench  dbench              [.] next_token
>>>>        2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
>>>>        1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
>>>>        1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
>>>>        1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
>>>>        1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
>>>>        1.38%       5009         dbench  libc-2.12.so        [.] memmove
>>>>        1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
>>>>        1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
>>>
>>
>> Hi Andrew,
>> I ran the test with dbench on tmpfs. I do not see any improvement in
>> dbench with the 16k ple window.
>>
>> So it seems that apart from ebizzy, no workload benefited from that, and I
>> agree that it may not be good to optimize for ebizzy.
>> I shall drop the change to a 16k default window and continue with the
>> original patch series. I need to experiment with the latest kernel.
>
> Thanks for running this again.  I do believe there are some workloads
> that, when run at 1x overcommit, would benefit from a larger ple_window
> [with the current ple handling code], but I also do not want to potentially
> degrade >1x with a larger window.  I do, however, think there may be
> another option.  I have not fully worked this out, but I think I am on
> to something.
>
> I decided to revert back to just a yield() instead of a yield_to().  My
> motivation was that yield_to() [for large VMs] is like a dog chasing its
> tail, round and round we go....   Just yield(), in particular a yield()
> which results in yielding to something -other- than the current VM's
> vcpus, helps synchronize the execution of sibling vcpus by deferring
> them until the lock holder vcpu is running again.  The more we can do to
> get all vcpus running at the same time, the less we deal with the
> preemption problem.  The other benefit is that yield() is far, far lower
> overhead than yield_to().
>
> This does assume that vcpus from same VM do not share same runqueues.
> Yielding to a sibling vcpu with yield() is not productive for larger VMs
> in the same way that yield_to() is not.  My recent results include
> restricting vcpu placement so that sibling vcpus do not get to run on
> the same runqueue.  I do believe we could implement an initial placement
> and load balance policy to strive for this restriction (making it purely
> optional, but I bet it could also help user apps which use spin locks).
>
> For 1x VMs which still vm_exit due to PLE, I believe we could probably
> just leave the ple_window alone, as long as we mostly use yield()
> instead of yield_to().  The problem with the unneeded exits in this case
> has been the overhead in routines leading up to yield_to() and the
> yield_to() itself.  If we use yield() most of the time, this overhead
> will go away.
>
> Here is a comparison of yield_to() and yield():
>
> dbench with 20-way VMs, 8 of them on 80-way host:
>
> no PLE			  426 +/- 11.03%
> no PLE w/ gangsched	32001 +/- .37%
> PLE with yield()	29207 +/- .28%
> PLE with yield_to()	 8175 +/- 1.37%
>
> Yield() is far and away better than yield_to() here and almost approaches
> the gang sched result.  Here is a link for the perf sched map bitmap:
>
> https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
>
> The thrashing is way down and sibling vcpus tend to run together,
> approximating the behavior of the gang scheduling without needing to
> actually implement gang scheduling.
>
> I did test a smaller VM:
>
> dbench with 10-way VMs, 16 of them on 80-way host:
>
> no PLE			 6248 +/- 7.69%	
> no PLE w/ gangsched	28379 +/- .07%
> PLE with yield()	29196 +/- 1.62%
> PLE with yield_to()	32217 +/- 1.76%

Hi Andrew, the results are encouraging.

>
> There is some degradation going from yield_to() to yield() here, but not
> nearly as large as the uplift we see on the larger VMs.  Regardless, I have
> an idea to fix that: instead of using yield() all the time, we could use
> yield_to(), but limit the rate per vcpu to something like 1 per jiffy.
> All other exits use yield().  That rate of yield_to() should be more
> than enough for the smaller VMs, and the result should be hopefully just
> the same as the current code.  I have not coded this up yet, but it's my
> next step.

I personally feel rate-limiting yield_to() may be a good idea.

>
> I am also hopeful the limitation of yield_to() will also make the 1x
> issue just go away as well (even with 4096 ple_window).  The vast
> majority of exits will result in yield() which should be harmless.
>
> Keep in mind this did require ensuring sibling vcpus do not share host
> runqueues - I do think that can be possible given some optional scheduler
> tweaks.

I think this (the placement) is a concern. Having the rate limit alone may
suffice. Maybe tuning it to also take the overcommitted/non-overcommitted
scenario into account would be better.

Okay, below is the V2 implementation I am experimenting with:

1) check the source -and- target runq to decide on exiting the ple handler (see the sketch at the end of this mail)
2)

vcpu_on_spin()
{

  .....
  if yield_to_same_vm did not succeed and we are overcommitted
     yield()

}

I think combining your thoughts and (2) complicates the scenario a bit.
Anyway, let me see how my experiment goes. I will also check how yield()
performs without any pinning.
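
To make (1) and (2) concrete, here is roughly what I am experimenting
with (a sketch only: rq_nr_running_on(), target_cpu_of() and
yield_to_same_vm_vcpu() are hypothetical helpers; exporting the runqueue
length is exactly what the series has to add):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	int src_cpu = task_cpu(current);

	/* (1) undercommitted: neither our runqueue nor the target's is
	 * contended, so the lock holder will run soon -- just return */
	if (rq_nr_running_on(src_cpu) <= 1 &&
	    rq_nr_running_on(target_cpu_of(me)) <= 1)
		return;

	/* (2) try to boost a sibling vcpu of the same VM; if that fails
	 * and we are overcommitted, hand the pcpu to some other VM */
	if (!yield_to_same_vm_vcpu(me))
		yield();
}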


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
  2012-10-19  8:30                                 ` Raghavendra K T
@ 2012-10-19 13:31                                   ` Andrew Theurer
  0 siblings, 0 replies; 126+ messages in thread
From: Andrew Theurer @ 2012-10-19 13:31 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, H. Peter Anvin,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, chegu vinod, LKML, Srivatsa Vaddagiri,
	Gleb Natapov, Andrew Jones

On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity <avi@redhat.com> [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>>>> has just terrible scalability to begin with.  I do not think we should
> >>>>> try to optimize such a bad workload.
> >>>>>
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to ebizzy.
> >>>> Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on.  This
> >>> eliminates all IO.  I use a 300 second run time and number of threads is
> >>> equal to number of vcpus.  All of the VMs of course need to have a
> >>> synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench, where
> >>> the dcache scalability is much improved.  Without any lock-holder
> >>> preemption, the time in spin_lock should be very low:
> >>>
> >>>
> >>>>       21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
> >>>>        3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
> >>>>        2.81%      10176         dbench  dbench              [.] child_run
> >>>>        2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
> >>>>        2.33%       8423         dbench  dbench              [.] next_token
> >>>>        2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
> >>>>        1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
> >>>>        1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
> >>>>        1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
> >>>>        1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
> >>>>        1.38%       5009         dbench  libc-2.12.so        [.] memmove
> >>>>        1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
> >>>>        1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
> >>>
> >>
> >> Hi Andrew,
> >> I ran the test with dbench on tmpfs. I do not see any improvement in
> >> dbench with the 16k ple window.
> >> 
> >> So it seems that apart from ebizzy, no workload benefited from that, and I
> >> agree that it may not be good to optimize for ebizzy.
> >> I shall drop the change to a 16k default window and continue with the
> >> original patch series. I need to experiment with the latest kernel.
> >
> > Thanks for running this again.  I do believe there are some workloads
> > that, when run at 1x overcommit, would benefit from a larger ple_window
> > [with the current ple handling code], but I also do not want to potentially
> > degrade >1x with a larger window.  I do, however, think there may be
> > another option.  I have not fully worked this out, but I think I am on
> > to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to().  My
> > motivation was that yield_to() [for large VMs] is like a dog chasing its
> > tail, round and round we go....   Just yield(), in particular a yield()
> > which results in yielding to something -other- than the current VM's
> > vcpus, helps synchronize the execution of sibling vcpus by deferring
> > them until the lock holder vcpu is running again.  The more we can do to
> > get all vcpus running at the same time, the less we deal with the
> > preemption problem.  The other benefit is that yield() is far, far lower
> > overhead than yield_to().
> >
> > This does assume that vcpus from same VM do not share same runqueues.
> > Yielding to a sibling vcpu with yield() is not productive for larger VMs
> > in the same way that yield_to() is not.  My recent results include
> > restricting vcpu placement so that sibling vcpus do not get to run on
> > the same runqueue.  I do believe we could implement an initial placement
> > and load balance policy to strive for this restriction (making it purely
> > optional, but I bet it could also help user apps which use spin locks).
> >
> > For 1x VMs which still vm_exit due to PLE, I believe we could probably
> > just leave the ple_window alone, as long as we mostly use yield()
> > instead of yield_to().  The problem with the unneeded exits in this case
> > has been the overhead in routines leading up to yield_to() and the
> > yield_to() itself.  If we use yield() most of the time, this overhead
> > will go away.
> >
> > Here is a comparison of yield_to() and yield():
> >
> > dbench with 20-way VMs, 8 of them on 80-way host:
> >
> > no PLE			  426 +/- 11.03%
> > no PLE w/ gangsched	32001 +/- .37%
> > PLE with yield()	29207 +/- .28%
> > PLE with yield_to()	 8175 +/- 1.37%
> >
> > Yield() is far and away better than yield_to() here and almost approaches
> > the gang sched result.  Here is a link for the perf sched map bitmap:
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
> >
> > The thrashing is way down and sibling vcpus tend to run together,
> > approximating the behavior of the gang scheduling without needing to
> > actually implement gang scheduling.
> >
> > I did test a smaller VM:
> >
> > dbench with 10-way VMs, 16 of them on 80-way host:
> >
> > no PLE			 6248 +/- 7.69%	
> > no PLE w/ gangsched	28379 +/- .07%
> > PLE with yield()	29196 +/- 1.62%
> > PLE with yield_to()	32217 +/- 1.76%
> 
> Hi Andrew, the results are encouraging.
> 
> >
> > There is some degradation going from yield_to() to yield() here, but not
> > nearly as large as the uplift we see on the larger VMs.  Regardless, I have
> > an idea to fix that: instead of using yield() all the time, we could use
> > yield_to(), but limit the rate per vcpu to something like 1 per jiffy.
> > All other exits use yield().  That rate of yield_to() should be more
> > than enough for the smaller VMs, and the result should be hopefully just
> > the same as the current code.  I have not coded this up yet, but it's my
> > next step.
> 
> I personally feel rate-limiting yield_to() may be a good idea.
> 
> >
> > I am also hopeful the limitation of yield_to() will also make the 1x
> > issue just go away as well (even with 4096 ple_window).  The vast
> > majority of exits will result in yield() which should be harmless.
> >
> > Keep in mind this did require ensuring sibling vcpus do not share host
> > runqueues - I do think that can be possible given some optional scheduler
> > tweaks.
> 
> I think this (the placement) is a concern. Having the rate limit alone may
> suffice. Maybe tuning it to also take the overcommitted/non-overcommitted
> scenario into account would be better.
> 
> Okay, below is the V2 implementation I am experimenting with:
> 
> 1) check the source -and- target runq to decide on exiting the ple handler
> 2)
> 
> vcpu_on_spin()
> {
> 
>   .....
>   if yield_to_same_vm did not succeed and we are overcommitted
>      yield()
> 
> }
> 
> I think combining your thoughts and (2) complicates the scenario a bit.
> Anyway, let me see how my experiment goes. I will also check how yield()
> performs without any pinning.

FWIW, below is the latest with throttling yield_to().  Results were
slightly higher than the above with just yield().  Although I can see an
improvement when not forcing non-shared runqueues among same-VM vcpus
(via binding), it's not as effective.  I am more concerned that this problem
requires a multi-part solution, and reducing lock-holder preemption is
the other part (by not allowing sequential execution of same-VM vcpus by
virtue of sharing runqueues).

signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..595ef3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,6 +153,7 @@ struct kvm_vcpu {
 	int mode;
 	unsigned long requests;
 	unsigned long guest_debug;
+	unsigned long last_yield_to;
 
 	struct mutex mutex;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..1f0ec36 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/jiffies.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -228,6 +229,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
+	vcpu->last_yield_to = 0;
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -1590,27 +1592,39 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int i;
 
 	/*
+	 * A yield_to() can be quite expensive, so we try to limit its use
+	 * to one per jiffy.  Subsequent exits just yield the current vcpu
+	 * in hopes of having it run again when the lock-holding vcpu
+	 * gets to run again.  This is most effective when vcpus from
+	 * the same VM do not share a runqueue.
+	 */
+	if (me->last_yield_to == jiffies) {
+		yield();
+	} else {
+	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
-			if (vcpu == me)
-				continue;
-			if (waitqueue_active(&vcpu->wq))
-				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
-				yielded = 1;
-				break;
+		for (pass = 0; pass < 2 && !yielded; pass++) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				if (!pass && i <= last_boosted_vcpu) {
+					i = last_boosted_vcpu;
+					continue;
+				} else if (pass && i > last_boosted_vcpu)
+					break;
+				if (vcpu == me)
+					continue;
+				if (waitqueue_active(&vcpu->wq))
+					continue;
+				if (kvm_vcpu_yield_to(vcpu)) {
+					kvm->last_boosted_vcpu = i;
+					me->last_yield_to = jiffies;
+					yielded = 1;
+					break;
+				}
 			}
 		}
 	}



^ permalink raw reply related	[flat|nested] 126+ messages in thread

end of thread, other threads:[~2012-10-19 13:31 UTC | newest]

Thread overview: 126+ messages
2012-09-21 11:59 [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler Raghavendra K T
2012-09-21 12:00 ` [PATCH RFC 1/2] kvm: Handle undercommitted guest case " Raghavendra K T
2012-09-21 13:02   ` Rik van Riel
2012-09-21 17:24     ` Raghavendra K T
2012-09-24 15:41       ` Avi Kivity
2012-09-24 16:06         ` Avi Kivity
2012-09-24 16:14           ` Peter Zijlstra
2012-09-24 16:25             ` Avi Kivity
2012-09-25  8:09           ` Raghavendra K T
2012-09-25  8:54             ` Avi Kivity
2012-09-25 13:49               ` Raghavendra K T
2012-09-27  7:44               ` Gleb Natapov
2012-09-27  8:59                 ` Avi Kivity
2012-09-27  9:11                   ` Gleb Natapov
2012-09-27  9:33                     ` Avi Kivity
2012-09-27  9:58                       ` Gleb Natapov
2012-09-27 10:04                         ` Avi Kivity
2012-09-27 10:08                           ` Gleb Natapov
2012-09-27 10:15                             ` Avi Kivity
     [not found]               ` <CAJocwcf+8u84_yDC-PK0Yni93YSTWzYvr69nq6b3pNv1MwVJzQ@mail.gmail.com>
2012-09-27  8:50                 ` Avi Kivity
2012-09-27 11:26                   ` Raghavendra K T
2012-09-27 12:06                     ` Avi Kivity
2012-09-28 18:18                       ` Konrad Rzeszutek Wilk
2012-09-30  8:16                         ` Avi Kivity
     [not found]                   ` <CAJocwcc19F+PtsQ5okGMvYeVnkEigpZRpwWY9JgeRPFqfcVoXA@mail.gmail.com>
2012-09-28  6:16                     ` Raghavendra K T
2012-09-30  8:18                       ` Avi Kivity
2012-09-30 11:07                         ` Gleb Natapov
2012-09-30 11:13                           ` Avi Kivity
2012-10-03 14:17                             ` Raghavendra K T
2012-10-03 14:56                               ` Avi Kivity
2012-10-04  7:29                                 ` Gleb Natapov
2012-10-05  8:36                                   ` Raghavendra K T
2012-10-07  9:51                                     ` Avi Kivity
2012-09-25  7:36         ` Raghavendra K T
2012-09-25  8:12           ` Avi Kivity
2012-09-25 14:21             ` Takuya Yoshikawa
2012-09-27  8:43               ` Avi Kivity
2012-10-03 12:22         ` Raghavendra K T
2012-10-03 17:05           ` Avi Kivity
2012-10-04 10:49             ` Raghavendra K T
2012-10-04 12:41               ` Avi Kivity
2012-10-04 13:07                 ` Peter Zijlstra
2012-10-04 15:00                   ` Avi Kivity
2012-10-09 18:51                     ` Raghavendra K T
2012-10-10  2:59                       ` Andrew Theurer
2012-10-10 17:54                         ` Raghavendra K T
2012-10-10 18:03                           ` David Ahern
2012-10-10 18:14                             ` Raghavendra K T
2012-10-10 19:36                           ` Andrew Theurer
2012-10-15 12:10                             ` Raghavendra K T
2012-10-15 14:34                               ` Andrew Theurer
2012-10-19  8:30                                 ` Raghavendra K T
2012-10-19 13:31                                   ` Andrew Theurer
2012-10-10 14:24                       ` Andrew Theurer
2012-10-10 17:43                         ` Raghavendra K T
2012-10-10 19:27                           ` Andrew Theurer
2012-10-11 17:13                             ` Raghavendra K T
2012-10-11 10:39                         ` Nikunj A Dadhania
2012-10-18 12:39                       ` Avi Kivity
2012-10-19  8:19                         ` Raghavendra K T
2012-10-04 14:41                 ` Andrew Theurer
2012-10-05  9:06                   ` Raghavendra K T
2012-10-05  9:02                 ` Raghavendra K T
2012-09-24 11:33   ` Peter Zijlstra
2012-09-24 11:40     ` Raghavendra K T
2012-09-21 12:00 ` [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario " Raghavendra K T
2012-09-21 13:22   ` Rik van Riel
2012-09-21 13:46   ` Takuya Yoshikawa
2012-09-21 13:52     ` Rik van Riel
2012-09-21 17:45       ` Raghavendra K T
2012-09-24 13:43         ` Takuya Yoshikawa
2012-09-24 15:26   ` Avi Kivity
2012-09-24 15:34     ` Peter Zijlstra
2012-09-24 15:43       ` Avi Kivity
2012-09-24 15:52         ` Peter Zijlstra
2012-09-24 15:58           ` Avi Kivity
2012-09-24 16:05             ` Peter Zijlstra
2012-09-24 16:10               ` Avi Kivity
2012-09-24 16:13                 ` Peter Zijlstra
2012-09-24 16:21                   ` Avi Kivity
2012-09-25 10:11                     ` Avi Kivity
2012-09-21 13:18 ` [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios " Chegu Vinod
2012-09-21 17:36   ` Raghavendra K T
2012-09-24  8:42     ` Dor Laor
2012-09-24 12:02       ` Raghavendra K T
2012-09-25 15:00         ` Dor Laor
2012-09-26 12:27           ` Konrad Rzeszutek Wilk
2012-09-27 10:07             ` Raghavendra K T
2012-09-27  9:49           ` Raghavendra K T
2012-09-27 10:28             ` Andrew Jones
2012-09-27 10:44               ` Avi Kivity
2012-09-27 11:31               ` Raghavendra K T
2012-09-27 10:33             ` Dor Laor
2012-09-24 11:34 ` Peter Zijlstra
2012-09-24 11:52   ` Raghavendra K T
2012-09-24 12:36     ` Peter Zijlstra
2012-09-24 13:29       ` Raghavendra K T
2012-09-24 13:54         ` Peter Zijlstra
2012-09-24 14:16           ` Raghavendra K T
2012-09-25 13:40             ` Raghavendra K T
2012-09-27  8:36               ` Avi Kivity
2012-09-27 11:23                 ` Raghavendra K T
2012-09-27 12:03                   ` Avi Kivity
2012-09-27 12:25                     ` Andrew Theurer
2012-09-28  5:38                     ` Raghavendra K T
2012-09-28  5:45                       ` H. Peter Anvin
2012-09-28  6:03                         ` Raghavendra K T
2012-09-28  8:38                       ` Peter Zijlstra
2012-09-28 11:40                       ` Andrew Theurer
2012-09-28 14:11                         ` Raghavendra K T
2012-09-28 14:13                         ` Peter Zijlstra
2012-09-30  8:24                         ` Avi Kivity
2012-10-03 14:29                     ` Raghavendra K T
2012-10-03 17:25                       ` Avi Kivity
2012-10-04 10:56                         ` Raghavendra K T
2012-10-04 12:44                           ` Avi Kivity
2012-10-05  9:04                             ` Raghavendra K T
2012-09-24 15:51           ` Avi Kivity
2012-09-24 16:03             ` Peter Zijlstra
2012-09-24 16:20               ` Avi Kivity
2012-09-26 13:20                 ` Andrew Jones
2012-09-26 13:26                   ` Peter Zijlstra
2012-09-26 13:39                     ` Andrew Jones
2012-09-26 13:45                       ` Peter Zijlstra
2012-09-26 12:57       ` Andrew Jones
2012-09-27 10:21         ` Raghavendra K T
