* Regarding improving ple handler (vcpu_on_spin)
@ 2012-06-19 20:20 Raghavendra K T
  2012-06-19 20:51 ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Raghavendra K T @ 2012-06-19 20:20 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania,
	KVM, Raghavendra K T, Ingo Molnar, LKML


In the PLE handler code, the last_boosted_vcpu (lbv) variable serves
as the reference point for where the scan starts when we enter.

    lbv = kvm->lbv;
    for each vcpu i of kvm, starting at lbv
       if i is eligible
          if yield_to(i) succeeds
             lbv = i

Currently this variable is per VM, and it is set after we do
yield_to(target). Unfortunately, after a successful yield it may take
a little longer than we expect (depending on our lag in the rb tree)
for us to come back and set the value.

So when several ple_handler entries happen before it is set, all of
them start from the same place (and the overall round-robin is also
slower).

Statistical analysis (below) also shows that lbv is not very well
distributed with the current approach.

Naturally, the first approach is to move the lbv update before
yield_to, without bothering about the failure case, to make the
round-robin fast (this was in Rik's V4 vcpu_on_spin patch series).

But when I did performance analysis in the no-overcommit scenario,
I saw violent/cascaded directed yields happening, leading to more CPU
wasted in spinning (a huge degradation in 1x and improvement in 3x;
I assume this was the reason the update was moved back after yield_to
in V5 of the vcpu_on_spin series).

The second approach I tried was:
(1) get rid of the per-KVM lbv variable;
(2) have everybody who enters the handler start from a random vcpu as
the reference point.

The above gave a good distribution of starting points (and a
performance improvement in the 32-vcpu guest I tested), and IMO it
also scales well for larger VMs.

Analysis
=============
Four 32-vcpu guests running, with one of them running kernbench.

"PLE handler yield stat" is the per-vcpu count of successful yields
(for 32 vcpus).

"PLE handler start stat" is the frequency with which each vcpu index
was used as the starting point (for 32 vcpus).

snapshot1
=============
PLE handler yield stat :
274391  33088  32554  46688  46653  48742  48055  37491
38839  31799  28974  30303  31466  45936  36208  51580
32754  53441  28956  30738  37940  37693  26183  40022
31725  41879  23443  35826  40985  30447  37352  35445  

PLE handler start stat :
433590  383318  204835  169981  193508  203954  175960  139373
153835  125245  118532  140092  135732  134903  119349  149467
109871  160404  117140  120554  144715  125099  108527  125051
111416  141385  94815  138387  154710  116270  123130  173795

snapshot2
============
PLE handler yield stat :
1957091  59383  67866  65474  100335  77683  80958  64073
53783  44620  80131  81058  66493  56677  74222  74974
42398  132762  48982  70230  78318  65198  54446  104793
59937  57974  73367  96436  79922  59476  58835  63547  

PLE handler start stat :
2555089  611546  461121  346769  435889  452398  407495  314403
354277  298006  364202  461158  344783  288263  342165  357270
270887  451660  300020  332120  378403  317848  307969  414282
351443  328501  352840  426094  375050  330016  347540  371819

So the questions I have in mind are:

1. Do you think randomizing last_boosted_vcpu and getting rid of the
per-VM variable is better?

2. Can we have, or do we have, a mechanism by which we can decide not
to yield to a vcpu that is doing frequent PLE exits (possibly because
it is doing unnecessary busy-waits), or to yield_to a better candidate
instead?

On a side note: with the pv patches I have tried doing yield_to to a
kicked vcpu in the vcpu_block path, and it gives some performance
improvement.

Please let me know if you have any comments/suggestions.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-19 20:20 Regarding improving ple handler (vcpu_on_spin) Raghavendra K T
@ 2012-06-19 20:51 ` Rik van Riel
  2012-06-20 20:12   ` Raghavendra K T
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Rik van Riel @ 2012-06-19 20:51 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, Ingo Molnar, LKML

On Wed, 20 Jun 2012 01:50:50 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> 
> In ple handler code, last_boosted_vcpu (lbv) variable is
> serving as reference point to start when we enter.

> Also statistical analysis (below) is showing lbv is not very well
> distributed with current approach.

You are the second person to spot this bug today (yes, today).

Due to time zones, the first person has not had a chance yet to
test the patch below, which might fix the issue...

Please let me know how it goes.

====8<====

If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0.  With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 virt/kvm/kvm_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-19 20:51 ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Rik van Riel
@ 2012-06-20 20:12   ` Raghavendra K T
  2012-06-21  2:11     ` Rik van Riel
  2012-06-21 11:26     ` Raghavendra K T
  2012-06-21  6:43   ` Gleb Natapov
  2012-07-06 17:11   ` Marcelo Tosatti
  2 siblings, 2 replies; 21+ messages in thread
From: Raghavendra K T @ 2012-06-20 20:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Avi Kivity, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, Ingo Molnar, LKML

On 06/20/2012 02:21 AM, Rik van Riel wrote:
> On Wed, 20 Jun 2012 01:50:50 +0530
> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>>
>> In ple handler code, last_boosted_vcpu (lbv) variable is
>> serving as reference point to start when we enter.
>
>> Also statistical analysis (below) is showing lbv is not very well
>> distributed with current approach.
>
> You are the second person to spot this bug today (yes, today).

Oh! Really interesting.

>
> Due to time zones, the first person has not had a chance yet to
> test the patch below, which might fix the issue...

Maybe his timezone also falls near mine. I am also pretty late
now. :)

>
> Please let me know how it goes.

Yes, I got the result today; too tired to summarize, but I got a
better performance result too. Will come back again tomorrow morning.
I also have to post the randomized starting point patch, which I
discussed, to get opinions.

>
> ====8<====
>
> If last_boosted_vcpu == 0, then we fall through all test cases and
> may end up with all VCPUs pouncing on vcpu 0.  With a large enough
> guest, this can result in enormous runqueue lock contention, which
> can prevent vcpu0 from running, leading to a livelock.
>
> Changing < to <= makes sure we properly handle that case.

Analysis shows the distribution is flatter now than before.
Here are the snapshots:
snapshot1
PLE handler yield stat :
66447   132222  75510   65875   121298  92543  111267  79523
118134  105366  116441  114195  107493  66666  86779   87733
84415   105778  94210   73197   55626   93036  112959  92035
95742   78558   72190   101719  94667   108593 63832   81580

PLE handler start stat :
334301  687807  384077  344917  504917  343988  439810  371389
466908  415509  394304  484276  376510  292821  370478  363727
366989  423441  392949  309706  292115  437900  413763  346135
364181  323031  348405  399593  336714  373995  302301  347383


snapshot2
PLE handler yield stat :
320547  267528  264316  164213  249246  182014  246468  225386
277179  310659  349767  310281  238680  187645  225791  266290
216202  316974  231077  216586  151679  356863  266031  213047
306229  182629  229334  241204  275975  265086  282218  242207

PLE handler start stat :
1335370  1378184  1252001  925414   1196973  951298   1219835  1108788
1265427  1290362  1308553  1271066  1107575  980036   1077210  1278611
1110779  1365130  1151200  1049859  937159   1577830  1209099  993391
1173766  987307   1144775  1102960  1100082  1177134  1207862  1119551


>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>   virt/kvm/kvm_main.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7e14068..1da542b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   	 */
>   	for (pass = 0; pass < 2 && !yielded; pass++) {
>   		kvm_for_each_vcpu(i, vcpu, kvm) {
> -			if (!pass && i < last_boosted_vcpu) {
> +			if (!pass && i <= last_boosted_vcpu) {

Hmm, true, great catch. It was biased towards zero earlier.

>   				i = last_boosted_vcpu;
>   				continue;
>   			} else if (pass && i > last_boosted_vcpu)
>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-20 20:12   ` Raghavendra K T
@ 2012-06-21  2:11     ` Rik van Riel
  2012-06-21 11:26     ` Raghavendra K T
  1 sibling, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2012-06-21  2:11 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Avi Kivity, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, Ingo Molnar, LKML

On 06/20/2012 04:12 PM, Raghavendra K T wrote:
> On 06/20/2012 02:21 AM, Rik van Riel wrote:

>> Please let me know how it goes.
>
> Yes, have got result today, too tired to summarize. got better
> performance result too. will come back again tomorrow morning.
> have to post, randomized start point patch also, which I discussed to
> know the opinion.

The other person's problem has also gone away with this
patch.

Avi, could I convince you to apply this obvious bugfix
to kvm.git? :)

>> ====8<====
>>
>> If last_boosted_vcpu == 0, then we fall through all test cases and
>> may end up with all VCPUs pouncing on vcpu 0. With a large enough
>> guest, this can result in enormous runqueue lock contention, which
>> can prevent vcpu0 from running, leading to a livelock.
>>
>> Changing < to <= makes sure we properly handle that case.

>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>> virt/kvm/kvm_main.c |    2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7e14068..1da542b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  	 */
>>  	for (pass = 0; pass < 2 && !yielded; pass++) {
>>  		kvm_for_each_vcpu(i, vcpu, kvm) {
>> -			if (!pass && i < last_boosted_vcpu) {
>> +			if (!pass && i <= last_boosted_vcpu) {
>>  				i = last_boosted_vcpu;
>>  				continue;
>>  			} else if (pass && i > last_boosted_vcpu)
>>
>>
>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-19 20:51 ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Rik van Riel
  2012-06-20 20:12   ` Raghavendra K T
@ 2012-06-21  6:43   ` Gleb Natapov
  2012-06-21 10:23     ` Raghavendra K T
  2012-06-28  2:14     ` Raghavendra K T
  2012-07-06 17:11   ` Marcelo Tosatti
  2 siblings, 2 replies; 21+ messages in thread
From: Gleb Natapov @ 2012-06-21  6:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Raghavendra K T, Avi Kivity, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	Ingo Molnar, LKML

On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
> On Wed, 20 Jun 2012 01:50:50 +0530
> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> 
> > 
> > In ple handler code, last_boosted_vcpu (lbv) variable is
> > serving as reference point to start when we enter.
> 
> > Also statistical analysis (below) is showing lbv is not very well
> > distributed with current approach.
> 
> You are the second person to spot this bug today (yes, today).
> 
> Due to time zones, the first person has not had a chance yet to
> test the patch below, which might fix the issue...
> 
> Please let me know how it goes.
> 
> ====8<====
> 
> If last_boosted_vcpu == 0, then we fall through all test cases and
> may end up with all VCPUs pouncing on vcpu 0.  With a large enough
> guest, this can result in enormous runqueue lock contention, which
> can prevent vcpu0 from running, leading to a livelock.
> 
> Changing < to <= makes sure we properly handle that case.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  virt/kvm/kvm_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7e14068..1da542b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 */
>  	for (pass = 0; pass < 2 && !yielded; pass++) {
>  		kvm_for_each_vcpu(i, vcpu, kvm) {
> -			if (!pass && i < last_boosted_vcpu) {
> +			if (!pass && i <= last_boosted_vcpu) {
>  				i = last_boosted_vcpu;
>  				continue;
>  			} else if (pass && i > last_boosted_vcpu)
> 
Looks correct. We can simplify this by introducing something like:

#define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
        for (n = atomic_read(&kvm->online_vcpus); \
             n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
             n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))

--
			Gleb.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-21  6:43   ` Gleb Natapov
@ 2012-06-21 10:23     ` Raghavendra K T
  2012-06-28  2:14     ` Raghavendra K T
  1 sibling, 0 replies; 21+ messages in thread
From: Raghavendra K T @ 2012-06-21 10:23 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Rik van Riel, Avi Kivity, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	Ingo Molnar, LKML

On 06/21/2012 12:13 PM, Gleb Natapov wrote:
> On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
>> On Wed, 20 Jun 2012 01:50:50 +0530
>> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>
>>>
>>> In ple handler code, last_boosted_vcpu (lbv) variable is
>>> serving as reference point to start when we enter.
>>
>>> Also statistical analysis (below) is showing lbv is not very well
>>> distributed with current approach.
>>
>> You are the second person to spot this bug today (yes, today).
>>
>> Due to time zones, the first person has not had a chance yet to
>> test the patch below, which might fix the issue...
>>
>> Please let me know how it goes.
>>
>> ====8<====
>>
>> If last_boosted_vcpu == 0, then we fall through all test cases and
>> may end up with all VCPUs pouncing on vcpu 0.  With a large enough
>> guest, this can result in enormous runqueue lock contention, which
>> can prevent vcpu0 from running, leading to a livelock.
>>
>> Changing < to <= makes sure we properly handle that case.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>>   virt/kvm/kvm_main.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7e14068..1da542b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   	 */
>>   	for (pass = 0; pass < 2 && !yielded; pass++) {
>>   		kvm_for_each_vcpu(i, vcpu, kvm) {
>> -			if (!pass && i < last_boosted_vcpu) {
>> +			if (!pass && i <= last_boosted_vcpu) {
>>   				i = last_boosted_vcpu;
>>   				continue;
>>   			} else if (pass && i > last_boosted_vcpu)
>>
> Looks correct. We can simplify this by introducing something like:
>
> #define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
>          for (n = atomic_read(&kvm->online_vcpus); \
>               n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>               n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))
>

Thumbs up for this simplification. It really helps in all the places
where we want to start iterating from the middle.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-20 20:12   ` Raghavendra K T
  2012-06-21  2:11     ` Rik van Riel
@ 2012-06-21 11:26     ` Raghavendra K T
  2012-06-22 15:11       ` Andrew Jones
  1 sibling, 1 reply; 21+ messages in thread
From: Raghavendra K T @ 2012-06-21 11:26 UTC (permalink / raw)
  To: Rik van Riel, Avi Kivity
  Cc: Marcelo Tosatti, Srikar, Srivatsa Vaddagiri, Peter Zijlstra,
	Nikunj A. Dadhania, KVM, Ingo Molnar, LKML, Gleb Natapov,
	chegu_vinod

[-- Attachment #1: Type: text/plain, Size: 4582 bytes --]

On 06/21/2012 01:42 AM, Raghavendra K T wrote:
> On 06/20/2012 02:21 AM, Rik van Riel wrote:
>> On Wed, 20 Jun 2012 01:50:50 +0530
>> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>
[...]
>> Please let me know how it goes.
>
> Yes, have got result today, too tired to summarize. got better
> performance result too. will come back again tomorrow morning.
> have to post, randomized start point patch also, which I discussed to
> know the opinion.
>

Here are the results from kernbench.

PS: I think the only takeaway should be that both patches perform
better, rather than reading into the actual numbers, since I am seeing
more variance, especially in 3x. Maybe I can test with a more stable
benchmark if somebody points one out.

+-----------+-----------+-----------+--------------+-----------+
| base      | Rik patch | % improve | Random patch | % improve |
+-----------+-----------+-----------+--------------+-----------+
| 49.98     | 49.935    | 0.0901172 | 49.924286    | 0.111597  |
| 106.0051  | 89.25806  | 18.7625   | 88.122217    | 20.2933   |
| 189.82067 | 175.58783 | 8.10582   | 166.99989    | 13.6651   |
+-----------+-----------+-----------+--------------+-----------+

I have also posted the results of the randomized starting point patch.

I agree that Rik's fix should ideally go into git ASAP, and when the
above patches go into git, feel free to add,

Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

But I still see some questions unanswered.

1) Why can't we move the setting of last_boosted_vcpu up? It gives
more randomness. (As I said earlier, it gave degradation in the 1x
case because of violent yields, but a performance benefit in the 3x
case. The degradation comes from most of them yielding back to the
same spinning guy, increasing busy-wait; but it gives a huge benefit
with ple_window set to higher values such as 32k/64k. That is a
different issue altogether, though.)

2) Having the update of last_boosted_vcpu after yield_to does not seem
entirely correct, and having a common variable as the starting point
may not be that good either. Also, the round-robin is a little slower.

Suppose we have a 64-vcpu guest and 4 vcpus enter the ple_handler;
all of them jumping on the same guy to yield may not be good. Rather,
I personally feel each of them starting at a different point would be
a good idea.

But this alone will not help; we need more filtering of eligible
vcpus. For e.g., in the first pass, don't choose a vcpu that has
recently done a PL exit (thanks Vatsa for brainstorming this). Maybe
Peter/Avi/Rik/Vatsa can give more ideas in this area (I mean, how can
we identify that a vcpu has done a PL exit, or has exited from
spinlock context, etc.).

Another idea may be something like identifying the next eligible
lock-holder (which is already possible with the PV patches), and
doing yield_to to him.

Here are the stats from the randomized starting point patch. We can
see that the patch has amazing fairness w.r.t. the starting point.
IMO, this would be great only after we add more eligibility criteria
for the target vcpus (of yield_to).

Randomizing start index
===========================
snapshot1
PLE handler yield stat :
218416  176802  164554  141184  148495  154709  159871  145157
135476  158025  139997  247638  152498  133338  122774  248228
158469  121825  138542  113351  164988  120432  136391  129855
172764  214015  158710  133049  83485   112134  81651   190878

PLE handler start stat :
547772  547725  547545  547931  547836  548656  548272  547849
548879  549012  547285  548185  548700  547132  548310  547286
547236  547307  548328  548059  547842  549152  547870  548340
548170  546996  546678  547842  547716  548096  547918  547546

snapshot2
==============
PLE handler yield stat :
310690  222992  275829  156876  187354  185373  187584  155534
151578  205994  223731  320894  194995  167011  153415  286910
181290  143653  173988  181413  194505  170330  194455  181617
251108  226577  192070  143843  137878  166393  131405  250657

PLE handler start stat :
781335  782388  781837  782942  782025  781357  781950  781695
783183  783312  782004  782804  783766  780825  783232  781013
781587  781228  781642  781595  781665  783530  781546  781950
782268  781443  781327  781666  781907  781593  782105  781073


Sorry for attaching the patch inline; I am using a dumb client. Will
post it separately if needed.

====8<====

Currently the PLE handler uses a per-VM variable as the starting
point. Get rid of the variable and use a randomized starting point.
Thanks Vatsa for the scheduler-related clarifications.

Suggested-by: Srikar <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---

[-- Attachment #2: randomize_starting_vcpu.patch --]
[-- Type: text/x-patch, Size: 1943 bytes --]

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c446435..9799cab 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,7 +275,6 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
-	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..6bab9f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/random.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -1572,31 +1573,32 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int vcpu_to_boost;
 	int yielded = 0;
 	int pass;
 	int i;
+	int num_vcpus = atomic_read(&kvm->online_vcpus);
 
+	vcpu_to_boost = (random32() % num_vcpus);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
-	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * We approximate round-robin by starting at a random VCPU.
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
-				i = last_boosted_vcpu;
+			if (!pass && i < vcpu_to_boost) {
+				i = vcpu_to_boost;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass && i > vcpu_to_boost)
 				break;
 			if (vcpu == me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
 				continue;
 			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
 				yielded = 1;
 				break;
 			}

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-21 11:26     ` Raghavendra K T
@ 2012-06-22 15:11       ` Andrew Jones
  2012-06-22 21:00         ` Raghavendra K T
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Jones @ 2012-06-22 15:11 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Avi Kivity, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	Ingo Molnar, LKML, Gleb Natapov, chegu_vinod

On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
> Here are the results from kernbench.
> 
> PS: I think we have to only take that, both the patches perform better,
> than reading into actual numbers since I am seeing more variance in
> especially 3x. may be I can test with some more stable benchmark if
> somebody points
> 

Hi Raghu,

I wonder if we should back up and try to determine the best
benchmark/test environment first. I think kernbench is good, but
I wonder about how to simulate the overcommit, and to what degree
(1x, 3x, ??). What are you currently running to simulate overcommit
now? Originally we were running kernbench in one VM and cpu hogs
(bash infinite loops) in other VMs. Then we added vcpus and infinite
loops to get up to the desired overcommit. I saw later that you've
experimented with running kernbench in the other VMs as well, rather
than cpu hogs. Is that still the case?

I started playing with benchmarking these proposals myself, but so
far have stuck to the cpu hog, since I wanted to keep variability
limited.  However, when targeting a reasonable host loadavg with a
bunch of cpu hog vcpus, it limits the overcommit too much. I certainly
haven't tried 3x this way. So I'm inclined to throw out the cpu hog
approach as well. The question is, what to replace it with? It appears
that the performance of the PLE and pvticketlock proposals are quite
dependant on the level of overcommit, so we should choose a target
overcommit level and also a constraint on the host loadavg first,
then determine how to setup a test environment that fits it and yields
results with low variance.

Here are results from my 1.125x overcommit test environment using
cpu hogs.

kcbench (a.k.a kernbench) results; 'mean-time (stddev)'
  base-noPLE:           235.730 (25.932)
  base-PLE:             238.820 (11.199)
  rand_start-PLE:       283.193 (23.262)
  pvticketlocks-noPLE:  244.987 (7.562)
  pvticketlocks-PLE:    247.597 (17.200)

base kernel:          3.5.0-rc3 + Rik's new last_boosted patch
rand_start kernel:    3.5.0-rc3 + Raghu's proposed random start patch
pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch
                                + Raghu's pvticketlock series

The relative standard deviations are as high as 11%, so I'm not
really pleased with the results, and they show degradation everywhere.
Below are the details of the benchmarking. Everything is there except
the kernel config, but our benchmarking should be reproducible with
nearly random configs anyway.

Drew

= host =
  - Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
  - 64 cpus, 4 nodes, 64G mem
  - Fedora 17 with test kernels (see tests)

= benchmark =
  - one cpu hog F17 VM
    - 64 vcpus, 8G mem
    - all vcpus run a bash infinite loop
    - kernel: 3.5.0-rc3
  - one kcbench (a.k.a kernbench) F17 VM
    - 8 vcpus, 8G mem
    - 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs
    - kcbench-0.3-8.1.noarch, kcbench-data-2.6.38-0.1-9.fc17.noarch,
      kcbench-data-0.1-9.fc17.noarch
    - gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5)
    - kernel: same test kernel as host

= test 1: base, PLE disabled (ple_gap=0) =
  - kernel: 3.5.0-rc3 + Rik's last_boosted patch

Run 1 (-j 16):      4211 (e:237.43 P:637% U:697.98 S:815.46 F:0)
Run 2 (-j 16):      3834 (e:260.77 P:631% U:729.69 S:917.56 F:0)
Run 3 (-j 16):      4784 (e:208.99 P:644% U:638.17 S:708.63 F:0)

mean: 235.730 stddev: 25.932

= test 2: base, PLE enabled =
  - kernel: 3.5.0-rc3 + Rik's last_boosted patch

Run 1 (-j 16):      4335 (e:230.67 P:639% U:657.74 S:818.28 F:0)
Run 2 (-j 16):      4269 (e:234.20 P:647% U:743.43 S:772.52 F:0)
Run 3 (-j 16):      3974 (e:251.59 P:639% U:724.29 S:884.21 F:0)

mean: 238.820 stddev: 11.199

= test 3: rand_start, PLE enabled =
  - kernel: 3.5.0-rc3 + Raghu's random start patch

Run 1 (-j 16):      3898 (e:256.52 P:639% U:756.14 S:884.63 F:0)
Run 2 (-j 16):      3341 (e:299.27 P:633% U:857.49 S:1039.62 F:0)
Run 3 (-j 16):      3403 (e:293.79 P:635% U:857.21 S:1008.83 F:0)

mean: 283.193 stddev: 23.262

= test 4: pvticketlocks, PLE disabled (ple_gap=0) =
  - kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
                      + PARAVIRT_SPINLOCKS=y config change

Run 1 (-j 16):      3963 (e:252.29 P:647% U:736.43 S:897.16 F:0)
Run 2 (-j 16):      4216 (e:237.19 P:650% U:706.68 S:837.42 F:0)
Run 3 (-j 16):      4073 (e:245.48 P:649% U:709.46 S:884.68 F:0)

mean: 244.987 stddev: 7.562

= test 5: pvticketlocks, PLE enabled =
  - kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
                      + PARAVIRT_SPINLOCKS=y config change

Run 1 (-j 16):      3978 (e:251.32 P:629% U:758.86 S:824.29 F:0)
Run 2 (-j 16):      4369 (e:228.84 P:634% U:708.32 S:743.71 F:0)
Run 3 (-j 16):      3807 (e:262.63 P:626% U:767.03 S:877.96 F:0)

mean: 247.597 stddev: 17.200

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-22 15:11       ` Andrew Jones
@ 2012-06-22 21:00         ` Raghavendra K T
  2012-06-23 18:34           ` Raghavendra K T
  0 siblings, 1 reply; 21+ messages in thread
From: Raghavendra K T @ 2012-06-22 21:00 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Rik van Riel, Avi Kivity, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	Ingo Molnar, LKML, Gleb Natapov, chegu_vinod,
	Jeremy Fitzhardinge

On 06/22/2012 08:41 PM, Andrew Jones wrote:
> On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
>> Here are the results from kernbench.
>>
>> PS: I think we have to only take that, both the patches perform better,
>> than reading into actual numbers since I am seeing more variance in
>> especially 3x. may be I can test with some more stable benchmark if
>> somebody points
>>
>
> Hi Raghu,
>

First of all, thank you for your test and for raising valid points.
It also opened an avenue for discussing all the different experiments
done over the past month (apart from tuning/benchmarking), which may
bring more feedback and precious ideas from the community to optimize
the performance further.

I shall discuss that in a separate reply to this mail.

> I wonder if we should back up and try to determine the best
> benchmark/test environment first.

I agree; we have to be able to reproduce similar results
independently. So far sysbench (and even pgbench) has been consistent.
I am currently trying to see whether other benchmarks like hackbench
(with modified #loops), ebizzy, or dbench have low variance.

[ but they too are dependent on #clients/threads etc ]

> I think kernbench is good, but

Yes, kernbench at least helped me to tune the SPIN_THRESHOLD to a good
extent. But Jeremy also pointed out that kernbench is a little
inconsistent.

> I wonder about how to simulate the overcommit, and to what degree
> (1x, 3x, ??). What are you currently running to simulate overcommit
> now? Originally we were running kernbench in one VM and cpu hogs
> (bash infinite loops) in other VMs. Then we added vcpus and infinite
> loops to get up to the desired overcommit. I saw later that you've
> experimented with running kernbench in the other VMs as well, rather
> than cpu hogs. Is that still the case?
>

Yes, I am now running the same benchmark on all the guests.

On non-PLE hardware, cpu hogs played a good role in simulating LHP,
but on a PLE machine that did not seem to be the case.

> I started playing with benchmarking these proposals myself, but so
> far have stuck to the cpu hog, since I wanted to keep variability
> limited.  However, when targeting a reasonable host loadavg with a
> bunch of cpu hog vcpus, it limits the overcommit too much. I certainly
> haven't tried 3x this way. So I'm inclined to throw out the cpu hog
> approach as well. The question is, what to replace it with? It appears
> that the performance of the PLE and pvticketlock proposals are quite
> dependant on the level of overcommit, so we should choose a target
> overcommit level and also a constraint on the host loadavg first,
> then determine how to setup a test environment that fits it and yields
> results with low variance.
>
> Here are results from my 1.125x overcommit test environment using
> cpu hogs.

At first the results seemed backward, but after looking at the
individual runs and their variation, I believe all the results except
rand start should converge to zero difference. So if we run the same
tests again we may get completely different results.

IMO, running -j16 on a 64 vcpu guest may not represent a 1x load, so I
believe it has produced more of an under-commit/nearly 1x result.
Maybe we should try at least #threads = #vcpu or 2*#vcpu.

>
> kcbench (a.k.a kernbench) results; 'mean-time (stddev)'
>    base-noPLE:           235.730 (25.932)
>    base-PLE:             238.820 (11.199)
>    rand_start-PLE:       283.193 (23.262)

The problem currently, as we know, is that in the PLE handler we may
end up choosing the same VCPU that was in the spin loop, which
unfortunately results in more cpu burning.

And by randomizing start_vcpu, we are increasing that probability. We
need logic to avoid choosing a vcpu that has recently pause-loop
exited, since it cannot be a lock holder; the next eligible lock
holder can be picked up easily with the PV patches.
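To make that concrete, here is a minimal user-space sketch (Python) of a random-start scan that skips such vcpus. The per-vcpu `pl_exited` flags are a hypothetical stand-in for whatever per-vcpu state the real handler would track, not an existing KVM field:

```python
import random

def pick_boost_candidate(num_vcpus, pl_exited, me, rng=random):
    """Model of a PLE-handler scan: start from a random vcpu and
    return the first one that is not ourselves and has not recently
    pause-loop exited (a recent PL exit means that vcpu was itself
    spinning, so it cannot be the lock holder)."""
    start = rng.randrange(num_vcpus)      # random reference point
    for off in range(num_vcpus):          # one circular pass
        i = (start + off) % num_vcpus
        if i == me or pl_exited[i]:
            continue
        return i                          # the real handler would yield_to(i)
    return -1                             # nobody looks eligible

# vcpus 0 and 2 recently pause-loop exited; vcpu 3 is the caller,
# so vcpu 1 is the only eligible target whatever the random start is
candidate = pick_boost_candidate(4, [True, False, True, False], me=3)
```

With the PV patches the same scan could instead go straight to the next eligible lock holder, as noted above.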

>    pvticketlocks-noPLE:  244.987 (7.562)
>    pvticketlocks-PLE:    247.597 (17.200)
>
> base kernel:          3.5.0-rc3 + Rik's new last_boosted patch
> rand_start kernel:    3.5.0-rc3 + Raghu's proposed random start patch
> pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch
>                                  + Raghu's pvticketlock series

OK, I believe SPIN_THRESHOLD was 2k, right? What I had observed is
that with a 2k threshold we see halt-exit overheads; currently I am
mostly trying with 4k.

>
> The relative standard deviations are as high as 11%. So I'm not
> real pleased with the results, and they show degradation everywhere.
> Below are the details of the benchmarking. Everything is there except
> the kernel config, but our benchmarking should be reproducible with
> nearly random configs anyway.
>
> Drew
>
> = host =
>    - Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
>    - 64 cpus, 4 nodes, 64G mem
>    - Fedora 17 with test kernels (see tests)
>
> = benchmark =
>    - one cpu hog F17 VM
>      - 64 vcpus, 8G mem
>      - all vcpus run a bash infinite loop
>      - kernel: 3.5.0-rc3
>    - one kcbench (a.k.a kernbench) F17 VM
>      - 8 vcpus, 8G mem
>      - 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs

We may have to check whether 1 GB of ramfs is enough when we have 128
threads; I am not sure.

>      - kcbench-0.3-8.1.noarch, kcbench-data-2.6.38-0.1-9.fc17.noarch,
>        kcbench-data-0.1-9.fc17.noarch
>      - gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5)
>      - kernel: same test kernel as host
>
> = test 1: base, PLE disabled (ple_gap=0) =
>    - kernel: 3.5.0-rc3 + Rik's last_boosted patch
>
> Run 1 (-j 16):      4211 (e:237.43 P:637% U:697.98 S:815.46 F:0)
> Run 2 (-j 16):      3834 (e:260.77 P:631% U:729.69 S:917.56 F:0)
> Run 3 (-j 16):      4784 (e:208.99 P:644% U:638.17 S:708.63 F:0)
>
> mean: 235.730 stddev: 25.932
>
> = test 2: base, PLE enabled =
>    - kernel: 3.5.0-rc3 + Rik's last_boosted patch
>
> Run 1 (-j 16):      4335 (e:230.67 P:639% U:657.74 S:818.28 F:0)
> Run 2 (-j 16):      4269 (e:234.20 P:647% U:743.43 S:772.52 F:0)
> Run 3 (-j 16):      3974 (e:251.59 P:639% U:724.29 S:884.21 F:0)
>
> mean: 238.820 stddev: 11.199
>
> = test 3: rand_start, PLE enabled =
>    - kernel: 3.5.0-rc3 + Raghu's random start patch
>
> Run 1 (-j 16):      3898 (e:256.52 P:639% U:756.14 S:884.63 F:0)
> Run 2 (-j 16):      3341 (e:299.27 P:633% U:857.49 S:1039.62 F:0)
> Run 3 (-j 16):      3403 (e:293.79 P:635% U:857.21 S:1008.83 F:0)
>
> mean: 283.193 stddev: 23.262
>
> = test 4: pvticketlocks, PLE disabled (ple_gap=0) =
>    - kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
>                        + PARAVIRT_SPINLOCKS=y config change
>
> Run 1 (-j 16):      3963 (e:252.29 P:647% U:736.43 S:897.16 F:0)
> Run 2 (-j 16):      4216 (e:237.19 P:650% U:706.68 S:837.42 F:0)
> Run 3 (-j 16):      4073 (e:245.48 P:649% U:709.46 S:884.68 F:0)
>
> mean: 244.987 stddev: 7.562
>
> = test 5: pvticketlocks, PLE enabled =
>    - kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
>                        + PARAVIRT_SPINLOCKS=y config change
>
> Run 1 (-j 16):      3978 (e:251.32 P:629% U:758.86 S:824.29 F:0)
> Run 2 (-j 16):      4369 (e:228.84 P:634% U:708.32 S:743.71 F:0)
> Run 3 (-j 16):      3807 (e:262.63 P:626% U:767.03 S:877.96 F:0)
>
> mean: 247.597 stddev: 17.200
>
>
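For reference, the mean/stddev figures above match the sample statistics (n-1 denominator) over the three elapsed (e:) times; checking test 1:

```python
from statistics import mean, stdev

# elapsed times (e:) from the three -j16 runs of test 1 (base, no PLE)
runs = [237.43, 260.77, 208.99]

m = round(mean(runs), 3)   # matches the reported mean 235.730
s = round(stdev(runs), 3)  # sample stddev; matches the reported 25.932
```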

OK, in summary: for kernbench, can we agree that 1x = -j (2*#vcpu) in
1 VM, 1.5x = -j (2*#vcpu) in 1 VM plus -j (#vcpu) in another, and so
on? Also a SPIN_THRESHOLD of 4k?

Ideas on benchmarks are welcome from all.

- Raghu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-22 21:00         ` Raghavendra K T
@ 2012-06-23 18:34           ` Raghavendra K T
  2012-06-27 20:27             ` Raghavendra K T
  0 siblings, 1 reply; 21+ messages in thread
From: Raghavendra K T @ 2012-06-23 18:34 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Rik van Riel, Avi Kivity, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	Ingo Molnar, LKML, Gleb Natapov, chegu_vinod,
	Jeremy Fitzhardinge

On 06/23/2012 02:30 AM, Raghavendra K T wrote:
> On 06/22/2012 08:41 PM, Andrew Jones wrote:
>> On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
>>> Here are the results from kernbench.
>>>
>>> PS: I think we have to only take that, both the patches perform better,
>>> than reading into actual numbers since I am seeing more variance in
>>> especially 3x. may be I can test with some more stable benchmark if
>>> somebody points
>>>
[...]
> can we agree like, for kernbench 1x= -j (2*#vcpu) in 1 vm.
> 1.5x = -j (2*#vcpu) in 1 vm and -j (#vcpu) in other.. and so on.
> also a SPIN_THRESHOLD of 4k?

Please ignore the 1.5x definition above; I am not too sure about it.

>
> Any ideas on benchmarks is welcome from all.
>

My runs of the other benchmarks did not have Rik's patch, so I am
re-spinning everything with it now.

Here is the detailed info on the environment and benchmarks I am
currently using. Let me know if you have any comments.

=======
kernel 3.5.0-rc1 with Rik's PLE handler fix as base

Machine : Intel(R) Xeon(R) CPU X7560  @ 2.27GHz, 4 numa node, 256GB RAM, 
32 core machine

Host: enterprise linux  gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) 
(GCC) with test kernels
Guest: Fedora 16 with different kernels built from the same source
tree. 32 vcpus, 8 GB memory. (Configs are not changed across patches
except for CONFIG_PARAVIRT_SPINLOCKS.)

Note: for the PV patches, SPIN_THRESHOLD is set to 4k

Benchmarks:
1) kernbench: kernbench-0.50

cmd:
echo "3" > /proc/sys/vm/drop_caches
ccache -C
kernbench -f -H -M -o 2*vcpu

The very first kernbench run is omitted.

2) dbench: dbench version 4.00
cmd: dbench --warmup=30 -t 120 2*vcpu

3) hackbench:
https://build.opensuse.org/package/files?package=hackbench&project=benchmark
hackbench.c modified with loops=10000
used hackbench with num-threads = 2* vcpu

4) Specjbb: specjbb2000-1.02
Input Properties:
   ramp_up_seconds = 30
   measurement_seconds = 120
   forcegc = true
   starting_number_warehouses = 1
   increment_number_warehouses = 1
   ending_number_warehouses = 8


5) sysbench: 0.4.12
sysbench --test=oltp --db-driver=pgsql prepare
sysbench --num-threads=2*vcpu --max-requests=100000 --test=oltp 
--oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
Note that the database driver used here is pgsql.


6) ebizzy: release 0.3
cmd: ebizzy -S 120

- specjbb ran for 1x and 2x; the others mostly for 1x, 2x, and 3x
  overcommit.
- an overcommit of 2x means the same benchmark running on 2 guests.
- the sample count for each overcommit level is mostly 8

Note: I ran kernbench with the old kernbench-0.50; maybe I can try
kcbench with ramfs if necessary.

I will come back soon with detailed results.
> - Raghu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-23 18:34           ` Raghavendra K T
@ 2012-06-27 20:27             ` Raghavendra K T
  2012-06-27 20:29               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case with benchmark detail attachment Raghavendra K T
  2012-06-28 16:00               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Andrew Jones
  0 siblings, 2 replies; 21+ messages in thread
From: Raghavendra K T @ 2012-06-27 20:27 UTC (permalink / raw)
  To: Andrew Jones, Avi Kivity, Ingo Molnar
  Cc: Rik van Riel, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, LKML, Gleb Natapov,
	chegu_vinod, Jeremy Fitzhardinge

On 06/24/2012 12:04 AM, Raghavendra K T wrote:
> On 06/23/2012 02:30 AM, Raghavendra K T wrote:
>> On 06/22/2012 08:41 PM, Andrew Jones wrote:
[...]
> My run for other benchmarks did not have Rik's patches, so re-spinning
> everything with that now.
>
> Here is the detailed info on env and benchmark I am currently trying.
> Let me know if you have any comments
>
> =======
> kernel 3.5.0-rc1 with Rik's Ple handler fix as base
>
> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM,
> 32 core machine
>
> Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4)
> (GCC) with test kernels
> Guest: fedora 16 with different built-in kernel from same source tree.
> 32 vcpus 8GB memory. (configs not changed with patches except for
> CONFIG_PARAVIRT_SPINLOCK)
>
> Note: for Pv patches, SPIN_THRESHOLD is set to 4k
>
> Benchmarks:
> 1) kernbench: kernbench-0.50
>
> cmd:
> echo "3" > /proc/sys/vm/drop_caches
> ccache -C
> kernbench -f -H -M -o 2*vcpu
>
> Very first run in kernbench is omitted.
>
> 2) dbench: dbench version 4.00
> cmd: dbench --warmup=30 -t 120 2*vcpu
>
> 3) hackbench:
>https://build.opensuse.org/package/files?package=hackbench&project=benchmark
>
> hackbench.c modified with loops=10000
> used hackbench with num-threads = 2* vcpu
>
> 4) Specjbb: specjbb2000-1.02
> Input Properties:
> ramp_up_seconds = 30
> measurement_seconds = 120
> forcegc = true
> starting_number_warehouses = 1
> increment_number_warehouses = 1
> ending_number_warehouses = 8
>
>
> 5) sysbench: 0.4.12
> sysbench --test=oltp --db-driver=pgsql prepare
> sysbench --num-threads=2*vcpu --max-requests=100000 --test=oltp
> --oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
> Note that driver for this pgsql.
>
>
> 6) ebizzy: release 0.3
> cmd: ebizzy -S 120
>
> - specjbb ran for 1x and 2x others mostly for 1x, 2x, 3x overcommit.
> - overcommit of 2x means same benchmark running on 2 guests.
> - sample for each overcommit is mostly 8
>
> Note: I ran kernbench with old kernbench0.50, may be I can try kcbench
> with ramfs if necessary
>
> will soon come with detailed results

With the above environment, here are the results I have for a 4k
SPIN_THRESHOLD.

Lower is better for following benchmarks:
kernbench: (time in sec)
hackbench: (time in sec)
sysbench : (time in sec)

Higher is better for following benchmarks:
specjbb: score (Throughput)
dbench : Throughput in MB/sec
ebizzy : records/sec

In summary, the current PV approach has a huge benefit on non-PLE machines.

On a PLE machine, the results become very sensitive to the load, the
type of workload, and SPIN_THRESHOLD. PLE interference also has a
significant effect on them. But PV still has a slight edge over non-PV.

Overall, specjbb, sysbench, kernbench seem to do well with PV.

dbench has been a little unreliable (for the same reason I have not
published the 2x, 3x results, though the experimental values are
included in the tarball) but seems to be on par with PV.

hackbench is better in the non-overcommit case, and ebizzy is better
in the overcommit case.
[ebizzy seems to be very sensitive w.r.t. SPIN_THRESHOLD.]

I have still not experimented with SPIN_THRESHOLD values of 2k/8k, or
with/without PLE, after applying Rik's fix.

+-----------+-----------+-----------+------------+---------+
                               specjbb
+-----------+-----------+-----------+------------+---------+
|   value   |   stdev   |   value   |    stdev   | %improve|
+-----------+-----------+-----------+------------+---------+
|114232.2500|21774.0660	|122591.0000| 18239.0900 | 7.31733 |
|112154.5000|19696.6860	|113386.2500| 22262.5890 | 1.09826 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
                               kernbench
+-----------+-----------+-----------+------------+---------+
|   value   |   stdev   |   value   |    stdev   | %improve|
+-----------+-----------+-----------+------------+---------+
|   48.9150 |   0.8608  |   48.5550 |   0.7372   | 0.74143 |
|   96.3691 |   7.9724  |   96.6367 |   1.6938   |-0.27691 |
|  192.6972 |   9.1881  |  188.3195 |   8.1267   | 2.32461 |
|  320.6500 |  29.6892  |  302.1225 |  16.0515   | 6.13245 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
                               sysbench
+-----------+-----------+-----------+------------+---------+
|   value   |   stdev   |   value   |    stdev   | %improve|
+-----------+-----------+-----------+------------+---------+
|   12.4082 |   0.2370  |   12.2797 |   0.1037   | 1.04644 |
|   14.1705 |   0.4272  |   14.0300 |   1.1478   | 1.00143 |
|   19.3769 |   1.0833  |   18.9745 |   0.0560   | 2.12074 |
|   24.5373 |   1.3237  |   22.3078 |   0.8999   | 9.99426 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
                               hackbench
+-----------+-----------+-----------+------------+---------+
|   value   |   stdev   |   value   |    stdev   | %improve|
+-----------+-----------+-----------+------------+---------+
|   73.2627 |  11.2413  |   67.5125 |   2.5722   |  8.51724|
|  134.4294 |   1.9688  |  153.6160 |   5.2033   |-12.48998|
|  215.4521 |   3.8672  |  238.8965 |   3.0035   | -9.81362|
|  303.8553 |   5.0427  |  310.3569 |   6.1463   | -2.09488|
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
                               ebizzy
+-----------+-----------+-----------+------------+---------+
|   value   |   stdev   |   value   |    stdev   | %improve|
+-----------+-----------+-----------+------------+---------+
| 1108.6250 |  19.3090  | 1088.2500 |   11.0809	 |-1.83786 |
| 1662.6250 | 150.5466  | 1064.0000 |    2.8284	 |-36.00481|
| 1394.0000 |  85.0867  | 1073.2857 |   10.3877	 |-23.00676|
| 1172.1250 |  20.3501  | 1245.8750 |   25.3852	 | 6.29199 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
                               dbench
+-----------+-----------+-----------+------------+---------+
|   value   |   stdev   |   value   |    stdev   | %improve|
+-----------+-----------+-----------+------------+---------+
|   29.0378 | 1.1625    | 28.8466   |    1.1132  |-0.65845 |
+-----------+-----------+-----------+------------+---------+

(The benchmark values will be attached in a reply to this mail.)

I am planning to post the patches rebased to 3.5-rc. Avi, Ingo, please
let me know.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case with benchmark detail attachment
  2012-06-27 20:27             ` Raghavendra K T
@ 2012-06-27 20:29               ` Raghavendra K T
  2012-06-28 16:00               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Andrew Jones
  1 sibling, 0 replies; 21+ messages in thread
From: Raghavendra K T @ 2012-06-27 20:29 UTC (permalink / raw)
  To: Andrew Jones, Avi Kivity, Ingo Molnar
  Cc: Rik van Riel, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, LKML, Gleb Natapov,
	chegu_vinod, Jeremy Fitzhardinge

[-- Attachment #1: Type: text/plain, Size: 262 bytes --]

On 06/28/2012 01:57 AM, Raghavendra K T wrote:
> On 06/24/2012 12:04 AM, Raghavendra K T wrote:
>> On 06/23/2012 02:30 AM, Raghavendra K T wrote:
>>> On 06/22/2012 08:41 PM, Andrew Jones wrote:
[...]
>
> (benchmark values will be attached in reply to this mail)

[-- Attachment #2: pv_benchmark_summary.bz2 --]
[-- Type: application/x-bzip, Size: 7068 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-21  6:43   ` Gleb Natapov
  2012-06-21 10:23     ` Raghavendra K T
@ 2012-06-28  2:14     ` Raghavendra K T
  1 sibling, 0 replies; 21+ messages in thread
From: Raghavendra K T @ 2012-06-28  2:14 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Rik van Riel, Avi Kivity, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	Ingo Molnar, LKML

On 06/21/2012 12:13 PM, Gleb Natapov wrote:
> On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
>> On Wed, 20 Jun 2012 01:50:50 +0530
>> Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>  wrote:
>>
>>>
>>> In ple handler code, last_boosted_vcpu (lbv) variable is
>>> serving as reference point to start when we enter.
>>
>>> Also statistical analysis (below) is showing lbv is not very well
>>> distributed with current approach.
>>
>> You are the second person to spot this bug today (yes, today).
>>
>> Due to time zones, the first person has not had a chance yet to
>> test the patch below, which might fix the issue...
>>
>> Please let me know how it goes.
>>
>> ====8<====
>>
>> If last_boosted_vcpu == 0, then we fall through all test cases and
>> may end up with all VCPUs pouncing on vcpu 0.  With a large enough
>> guest, this can result in enormous runqueue lock contention, which
>> can prevent vcpu0 from running, leading to a livelock.
>>
>> Changing<  to<= makes sure we properly handle that case.
>>
>> Signed-off-by: Rik van Riel<riel@redhat.com>
>> ---
>>   virt/kvm/kvm_main.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7e14068..1da542b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   	 */
>>   	for (pass = 0; pass<  2&&  !yielded; pass++) {
>>   		kvm_for_each_vcpu(i, vcpu, kvm) {
>> -			if (!pass&&  i<  last_boosted_vcpu) {
>> +			if (!pass&&  i<= last_boosted_vcpu) {
>>   				i = last_boosted_vcpu;
>>   				continue;
>>   			} else if (pass&&  i>  last_boosted_vcpu)
>>
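To illustrate the one-character fix above: a small Python model of the two-pass scan (a simplification of the kernel loop, with yield_to assumed to always fail so that the full visit order is visible):

```python
def scan_order(n, last_boosted_vcpu, fixed):
    """Visit order produced by the two-pass loop in kvm_vcpu_on_spin.
    With the buggy '<' and last_boosted_vcpu == 0, every caller starts
    its scan at vcpu 0 (and visits it twice); with '<=' the scan
    starts after vcpu 0 and reaches it last."""
    order = []
    for p in range(2):
        i = 0
        while i < n:
            skip = (i <= last_boosted_vcpu) if fixed else (i < last_boosted_vcpu)
            if p == 0 and skip:
                i = last_boosted_vcpu + 1  # 'i = last_boosted_vcpu; continue'
                continue
            if p == 1 and i > last_boosted_vcpu:
                break
            order.append(i)
            i += 1
    return order

buggy = scan_order(4, 0, fixed=False)    # vcpu 0 is hit first by everyone
patched = scan_order(4, 0, fixed=True)   # vcpu 0 is now considered last
```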
> Looks correct. We can simplify this by introducing something like:
>
> #define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
>          for (n = atomic_read(&kvm->online_vcpus); \
>               n&&  (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>               n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))
>

Gleb, Rik,
Any updates on this, or on the status of Rik's patch?
I can come up with the cleanup patch suggested above, with Gleb's
From and Signed-off-by.

Please let me know.
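For reference, the traversal the suggested kvm_for_each_vcpu_from() macro would produce is one full circle over the online vcpus starting at idx; modeled with plain indices (hypothetical stand-ins for the kvm structures):

```python
def vcpu_scan_order(idx, online_vcpus):
    """Indices visited by the proposed macro: one pass over all online
    vcpus starting at idx and wrapping modulo online_vcpus, which
    replaces the explicit two-pass loop in kvm_vcpu_on_spin."""
    return [(idx + n) % online_vcpus for n in range(online_vcpus)]

order_from_0 = vcpu_scan_order(0, 4)   # no special-casing of vcpu 0
order_from_2 = vcpu_scan_order(2, 4)   # wraps around past the end
```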


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-27 20:27             ` Raghavendra K T
  2012-06-27 20:29               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case with benchmark detail attachment Raghavendra K T
@ 2012-06-28 16:00               ` Andrew Jones
  2012-06-28 16:22                 ` Raghavendra K T
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Jones @ 2012-06-28 16:00 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Rik van Riel, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, LKML, Gleb Natapov,
	chegu vinod, Jeremy Fitzhardinge, Avi Kivity, Ingo Molnar



----- Original Message -----
> In summary, current PV has huge benefit on non-PLE machine.
> 
> On PLE machine, the results become very sensitive to load, type of
> workload and SPIN_THRESHOLD. Also PLE interference has significant
> effect on them. But still it has slight edge over non PV.
> 

Hi Raghu,

sorry for my slow response. I'm on vacation right now (until the
9th of July) and I have limited access to mail. Also, thanks for
continuing the benchmarking. Question, when you compare PLE vs.
non-PLE, are you using different machines (one with and one
without), or are you disabling its use by loading the kvm module
with the ple_gap=0 modparam as I did?

Drew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-28 16:00               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Andrew Jones
@ 2012-06-28 16:22                 ` Raghavendra K T
  2012-06-28 22:55                     ` Vinod, Chegu
  0 siblings, 1 reply; 21+ messages in thread
From: Raghavendra K T @ 2012-06-28 16:22 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Rik van Riel, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, LKML, Gleb Natapov,
	chegu vinod, Jeremy Fitzhardinge, Avi Kivity, Ingo Molnar

On 06/28/2012 09:30 PM, Andrew Jones wrote:
>
>
> ----- Original Message -----
>> In summary, current PV has huge benefit on non-PLE machine.
>>
>> On PLE machine, the results become very sensitive to load, type of
>> workload and SPIN_THRESHOLD. Also PLE interference has significant
>> effect on them. But still it has slight edge over non PV.
>>
>
> Hi Raghu,
>
> sorry for my slow response. I'm on vacation right now (until the
> 9th of July) and I have limited access to mail.

Ok. Happy Vacation :)

> Also, thanks for
> continuing the benchmarking. Question, when you compare PLE vs.
> non-PLE, are you using different machines (one with and one
> without), or are you disabling its use by loading the kvm module
> with the ple_gap=0 modparam as I did?

Yes, when I say PLE is disabled for a benchmark comparison, I am doing
the same thing (i.e. loading the kvm module with ple_gap=0).

But the older non-PLE results were on a different machine altogether
(I had limited access to the PLE machine).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-28 16:22                 ` Raghavendra K T
@ 2012-06-28 22:55                     ` Vinod, Chegu
  0 siblings, 0 replies; 21+ messages in thread
From: Vinod, Chegu @ 2012-06-28 22:55 UTC (permalink / raw)
  To: Raghavendra K T, Andrew Jones
  Cc: Rik van Riel, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, LKML, Gleb Natapov,
	Jeremy Fitzhardinge, Avi Kivity, Ingo Molnar

[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2371 bytes --]

Hello,

I am just catching up on this email thread... 

Perhaps one of you may be able to help answer this query.. preferably along with some data.  [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ]

In a use case where the host has fewer but much larger guests (say 40VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests <= physical cpus in the host  and perhaps each guest has their vcpu's pinned to specific physical cpus for other reasons), I would like to understand if/how  the PLE really helps ?  For these use cases would it be ok to turn PLE off (ple_gap=0) since is no real need to take an exit and find some other VCPU to yield to ? 

Thanks
Vinod

-----Original Message-----
From: Raghavendra K T [mailto:raghavendra.kt@linux.vnet.ibm.com] 
Sent: Thursday, June 28, 2012 9:22 AM
To: Andrew Jones
Cc: Rik van Riel; Marcelo Tosatti; Srikar; Srivatsa Vaddagiri; Peter Zijlstra; Nikunj A. Dadhania; KVM; LKML; Gleb Natapov; Vinod, Chegu; Jeremy Fitzhardinge; Avi Kivity; Ingo Molnar
Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/28/2012 09:30 PM, Andrew Jones wrote:
>
>
> ----- Original Message -----
>> In summary, current PV has huge benefit on non-PLE machine.
>>
>> On PLE machine, the results become very sensitive to load, type of 
>> workload and SPIN_THRESHOLD. Also PLE interference has significant 
>> effect on them. But still it has slight edge over non PV.
>>
>
> Hi Raghu,
>
> sorry for my slow response. I'm on vacation right now (until the 9th 
> of July) and I have limited access to mail.

Ok. Happy Vacation :)

> Also, thanks for
> continuing the benchmarking. Question, when you compare PLE vs.
> non-PLE, are you using different machines (one with and one without), 
> or are you disabling its use by loading the kvm module with the 
> ple_gap=0 modparam as I did?

Yes, I am doing the same when I say with PLE disabled and comparing the benchmarks (i.e loading kvm module with ple_gap=0).

But older non-PLE results were on a different machine altogether. (I had limited access to PLE machine).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-28 22:55                     ` Vinod, Chegu
  (?)
@ 2012-07-02 14:49                     ` Rik van Riel
  2012-07-03  3:30                       ` Raghavendra K T
  2012-07-05 14:45                       ` Andrew Theurer
  -1 siblings, 2 replies; 21+ messages in thread
From: Rik van Riel @ 2012-07-02 14:49 UTC (permalink / raw)
  To: Vinod, Chegu
  Cc: Raghavendra K T, Andrew Jones, Marcelo Tosatti, Srikar,
	Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania, KVM,
	LKML, Gleb Natapov, Jeremy Fitzhardinge, Avi Kivity, Ingo Molnar

On 06/28/2012 06:55 PM, Vinod, Chegu wrote:
> Hello,
>
> I am just catching up on this email thread...
>
> Perhaps one of you may be able to help answer this query.. preferably along with some data.  [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ]
>
> In a use case where the host has fewer but much larger guests (say 40VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests<= physical cpus in the host  and perhaps each guest has their vcpu's pinned to specific physical cpus for other reasons), I would like to understand if/how  the PLE really helps ?  For these use cases would it be ok to turn PLE off (ple_gap=0) since is no real need to take an exit and find some other VCPU to yield to ?

Yes, that should be ok.

On a related note, I wonder if we should increase the ple_gap
significantly.

After all, 4096 cycles of spinning is not that much, when you
consider how much time is spent doing the subsequent vmexit,
scanning the other VCPU's status (200 cycles per cache miss),
deciding what to do, maybe poking another CPU, and eventually
a vmenter.

A factor 4 increase in ple_gap might be what it takes to
get the amount of time spent spinning equal to the amount of
time spent on the host side doing KVM stuff...

-- 
All rights reversed
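Putting rough numbers on that reasoning (the vmexit/vmenter round-trip cost and the cache-miss count below are illustrative assumptions; only the 4096-cycle window and the 200 cycles per cache miss come from the discussion above): at the 2.27 GHz of the test machines, 4096 cycles is under 2 µs of spinning, while the host-side work can cost several times that.

```python
freq_hz = 2.27e9           # clock of the X7560 test machines above

spin_cycles = 4096         # in-guest spinning before the PL exit
spin_us = spin_cycles / freq_hz * 1e6   # ~1.8 microseconds

vmexit_vmenter = 2000      # assumed round-trip cost, illustrative only
cache_misses = 64          # assumed misses while scanning vcpu status
host_cycles = vmexit_vmenter + cache_misses * 200

# how much larger the spin window would need to be to match the
# host-side cost -- in the same ballpark as Rik's "factor 4"
factor = host_cycles / spin_cycles
```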

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-07-02 14:49                     ` Rik van Riel
@ 2012-07-03  3:30                       ` Raghavendra K T
  2012-07-05 14:45                       ` Andrew Theurer
  1 sibling, 0 replies; 21+ messages in thread
From: Raghavendra K T @ 2012-07-03  3:30 UTC (permalink / raw)
  To: Rik van Riel, Vinod, Chegu
  Cc: Andrew Jones, Marcelo Tosatti, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, LKML, Gleb Natapov,
	Jeremy Fitzhardinge, Avi Kivity, Ingo Molnar

On 07/02/2012 08:19 PM, Rik van Riel wrote:
> On 06/28/2012 06:55 PM, Vinod, Chegu wrote:
>> Hello,
>>
>> I am just catching up on this email thread...
>>
>> Perhaps one of you may be able to help answer this query.. preferably
>> along with some data. [BTW, I do understand the basic intent behind
>> PLE in a typical [sweet spot] use case where there is over
>> subscription etc. and the need to optimize the PLE handler in the host
>> etc. ]
>>
>> In a use case where the host has fewer but much larger guests (say
>> 40 VCPUs and higher) and there is no oversubscription (i.e. # of vcpus
>> across guests <= physical cpus in the host, and perhaps each guest has
>> its vcpus pinned to specific physical cpus for other reasons), I
>> would like to understand if/how PLE really helps. For these use
>> cases would it be ok to turn PLE off (ple_gap=0), since there is no
>> real need to take an exit and find some other VCPU to yield to ?
>
> Yes, that should be ok.

I think this should be true when we have ple_window tuned to the
correct value for the guest (the same point you raise below).

But otherwise, IMO, it is a very tricky question to answer. PLE is
currently benefiting even flush_tlb IPIs etc., apart from spinlocks.
Having a properly tuned value for all types of workload (+load) is
really complicated.

Coming back to the ple handler, IMHO, if there is even a slight
increase in run-queue length, directed yield may worsen the scenario.

(In the case Vinod explained, even though we succeed in setting the
other vcpu's task as next_buddy, the caller itself gets scheduled out,
so the ganging effect is reduced. On top of this we always have the
question: have we chosen the right guy, or a really bad guy, to
yield to?)

>
> On a related note, I wonder if we should increase the ple_gap
> significantly.

Did you mean ple_window?

>
> After all, 4096 cycles of spinning is not that much, when you
> consider how much time is spent doing the subsequent vmexit,
> scanning the other VCPU's status (200 cycles per cache miss),
> deciding what to do, maybe poking another CPU, and eventually
> a vmenter.
>
> A factor 4 increase in ple_gap might be what it takes to
> get the amount of time spent spinning equal to the amount of
> time spent on the host side doing KVM stuff...
>

I agree. I am experimenting with all these things left and right,
along with several optimization ideas I have. I hope to come back with
the experiment results soon.



* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-07-02 14:49                     ` Rik van Riel
  2012-07-03  3:30                       ` Raghavendra K T
@ 2012-07-05 14:45                       ` Andrew Theurer
  1 sibling, 0 replies; 21+ messages in thread
From: Andrew Theurer @ 2012-07-05 14:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Vinod, Chegu, Raghavendra K T, Andrew Jones, Marcelo Tosatti,
	Srikar, Srivatsa Vaddagiri, Peter Zijlstra, Nikunj A. Dadhania,
	KVM, LKML, Gleb Natapov, Jeremy Fitzhardinge, Avi Kivity,
	Ingo Molnar

On Mon, 2012-07-02 at 10:49 -0400, Rik van Riel wrote:
> On 06/28/2012 06:55 PM, Vinod, Chegu wrote:
> > Hello,
> >
> > I am just catching up on this email thread...
> >
> > Perhaps one of you may be able to help answer this query.. preferably along with some data.  [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ]
> >
> > In a use case where the host has fewer but much larger guests (say 40 VCPUs and higher) and there is no oversubscription (i.e. # of vcpus across guests <= physical cpus in the host, and perhaps each guest has its vcpus pinned to specific physical cpus for other reasons), I would like to understand if/how PLE really helps.  For these use cases would it be ok to turn PLE off (ple_gap=0), since there is no real need to take an exit and find some other VCPU to yield to ?
> 
> Yes, that should be ok.
> 
> On a related note, I wonder if we should increase the ple_gap
> significantly.
> 
> After all, 4096 cycles of spinning is not that much, when you
> consider how much time is spent doing the subsequent vmexit,
> scanning the other VCPU's status (200 cycles per cache miss),
> deciding what to do, maybe poking another CPU, and eventually
> a vmenter.
> 
> A factor 4 increase in ple_gap might be what it takes to
> get the amount of time spent spinning equal to the amount of
> time spent on the host side doing KVM stuff...

I was recently thinking the same thing as I have observed over 180,000
exits/sec from a 40-way VM on a 80-way host, where there should be no
cpu overcommit.  Also, the number of directed yields for this was only
1800/sec, so only about 1% of our exits were useful.  I am wondering if
the ple_window should be similar to the host scheduler's task-switching
granularity, rather than what we think the typical maximum cycle count
for holding a lock should be.

BTW, I have a patch to add a couple PLE stats to kvmstat which I will
send out shortly.

-Andrew






* Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
  2012-06-19 20:51 ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Rik van Riel
  2012-06-20 20:12   ` Raghavendra K T
  2012-06-21  6:43   ` Gleb Natapov
@ 2012-07-06 17:11   ` Marcelo Tosatti
  2 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2012-07-06 17:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Raghavendra K T, Avi Kivity, Srikar, Srivatsa Vaddagiri,
	Peter Zijlstra, Nikunj A. Dadhania, KVM, Ingo Molnar, LKML

On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
> On Wed, 20 Jun 2012 01:50:50 +0530
> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> 
> > 
> > In ple handler code, last_boosted_vcpu (lbv) variable is
> > serving as reference point to start when we enter.
> 
> > Also statistical analysis (below) is showing lbv is not very well
> > distributed with current approach.
> 
> You are the second person to spot this bug today (yes, today).
> 
> Due to time zones, the first person has not had a chance yet to
> test the patch below, which might fix the issue...
> 
> Please let me know how it goes.
> 
> ====8<====
> 
> If last_boosted_vcpu == 0, then we fall through all test cases and
> may end up with all VCPUs pouncing on vcpu 0.  With a large enough
> guest, this can result in enormous runqueue lock contention, which
> can prevent vcpu0 from running, leading to a livelock.
> 
> Changing < to <= makes sure we properly handle that case.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

Applied, thanks.



end of thread, other threads:[~2012-07-06 18:12 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-19 20:20 Regarding improving ple handler (vcpu_on_spin) Raghavendra K T
2012-06-19 20:51 ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Rik van Riel
2012-06-20 20:12   ` Raghavendra K T
2012-06-21  2:11     ` Rik van Riel
2012-06-21 11:26     ` Raghavendra K T
2012-06-22 15:11       ` Andrew Jones
2012-06-22 21:00         ` Raghavendra K T
2012-06-23 18:34           ` Raghavendra K T
2012-06-27 20:27             ` Raghavendra K T
2012-06-27 20:29               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case with benchmark detail attachment Raghavendra K T
2012-06-28 16:00               ` [PATCH] kvm: handle last_boosted_vcpu = 0 case Andrew Jones
2012-06-28 16:22                 ` Raghavendra K T
2012-06-28 22:55                   ` Vinod, Chegu
2012-06-28 22:55                     ` Vinod, Chegu
2012-07-02 14:49                     ` Rik van Riel
2012-07-03  3:30                       ` Raghavendra K T
2012-07-05 14:45                       ` Andrew Theurer
2012-06-21  6:43   ` Gleb Natapov
2012-06-21 10:23     ` Raghavendra K T
2012-06-28  2:14     ` Raghavendra K T
2012-07-06 17:11   ` Marcelo Tosatti
