linux-kernel.vger.kernel.org archive mirror
* [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios
@ 2013-01-22  7:38 Raghavendra K T
  2013-01-22  7:39 ` [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Raghavendra K T @ 2013-01-22  7:38 UTC (permalink / raw)
  To: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

 In some special scenarios, such as #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no point in iterating over the vcpus
and doing unsuccessful yield_to calls that only burn CPU.

 The first patch optimizes yield_to by bailing out when there is no
 need to continue (i.e., when both the source and the target rq have
 only one task).

 The second patch uses that return code in the PLE handler. Further, when
 a yield_to fails we do not immediately leave the PLE handler; instead we
 try thrice, to reduce the statistical chance of a false bail-out that
 would otherwise hurt moderate overcommit cases.
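
 To illustrate, here is a simplified sketch (illustrative only; the real
 change is in patch 2/2, and names follow that patch) of the resulting
 PLE retry loop:

	int try = 3;	/* give up only after three -ESRCH failures */

	kvm_for_each_vcpu(i, vcpu, kvm) {
		/* >0: boosted, 0: could not boost, -ESRCH: no point yielding */
		yielded = kvm_vcpu_yield_to(vcpu);
		if (yielded > 0)
			break;		/* boosted a likely lock holder */
		if (yielded < 0 && --try == 0)
			break;		/* probably undercommitted, leave the handler */
	}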
 
 Results on a 3.7.0-rc6 kernel show around 140% improvement for ebizzy 1x
 and around 51% for dbench 1x on a 32 core PLE machine with 32 vcpu guests.


base = 3.7.0-rc6 
machine: 32 core mx3850 x5, PLE enabled

--+-----------+-----------+-----------+------------+-----------+
               ebizzy (rec/sec higher is better)
--+-----------+-----------+-----------+------------+-----------+
    base        stdev       patched     stdev       %improve     
--+-----------+-----------+-----------+------------+-----------+
1x   2511.3000    21.5409    6051.8000   170.2592   140.98276   
2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
--+-----------+-----------+-----------+------------+-----------+

--+-----------+-----------+-----------+------------+-----------+
        dbench (throughput in MB/sec. higher is better)
--+-----------+-----------+-----------+------------+-----------+
    base        stdev       patched     stdev       %improve     
--+-----------+-----------+-----------+------------+-----------+
1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
--+-----------+-----------+-----------+------------+-----------+

Here is the reference no-PLE result for comparison.
 ebizzy-1x_nople 7592.6000 rec/sec
 dbench_1x_nople 7853.6960 MB/sec

The no-PLE reference suggests we can still improve by around 60% for ebizzy,
but overall we are getting impressive performance with the patches.

 Changes Since V2:
 - Dropped global measures usage patch (Peter Zijlstra)
 - Do not bail out on first failure (Avi Kivity)
 - Retry thrice on yield_to failure to get statistically more correct
   behaviour.

 Changes since V1:
 - Discard the idea of exporting nr_running and optimize in the core scheduler (Peter)
 - Use yield() instead of schedule() in overcommit scenarios (Rik)
 - Use loadavg knowledge to detect undercommit/overcommit

 Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

 Raghavendra K T (1):
  Handle yield_to failure return for potential undercommit case

 Please let me know your comments and suggestions.

 Link for the discussion of V3 original:
 https://lkml.org/lkml/2012/11/26/166

 Link for V2:
 https://lkml.org/lkml/2012/10/29/287

 Link for V1:
 https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++++++++++++++++++------
 virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
 2 files changed, 35 insertions(+), 16 deletions(-)



* [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-22  7:38 [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
@ 2013-01-22  7:39 ` Raghavendra K T
  2013-01-24 10:32   ` Ingo Molnar
  2013-01-22  7:39 ` [PATCH V3 RESEND RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Raghavendra K T @ 2013-01-22  7:39 UTC (permalink / raw)
  To: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

From: Peter Zijlstra <peterz@infradead.org>

In undercommitted scenarios, especially with large guests, yield_to
overhead is significantly high. When the run queue length of both the
source and the target is one, take the opportunity to bail out and
return -ESRCH. This return code can then be exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE
 handler upon seeing source runqueue length = 1, but that required
 exporting the rq length. Peter came up with the elegant idea of
 returning -ESRCH from the scheduler core.)

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Raghavendra: added the check on the target vcpu's rq length (thanks Avi)]
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Andrew Jones <drjones@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---

 kernel/sched/core.c |   25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
 			resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;



* [PATCH V3 RESEND RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2013-01-22  7:38 [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
  2013-01-22  7:39 ` [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
@ 2013-01-22  7:39 ` Raghavendra K T
  2013-01-23 13:57 ` [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Andrew Jones
  2013-01-29 14:05 ` Gleb Natapov
  3 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2013-01-22  7:39 UTC (permalink / raw)
  To: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

yield_to now returns -ESRCH when the run queue length of both the source
and the target is one. When we see three successive failures of yield_to
we assume we are in a potential undercommit case and abort from the PLE
handler.
The assumption is backed by the low probability of a wrong decision
even for worst-case scenarios such as an average runqueue length
between 1 and 2.

More detail on the rationale behind using three tries:
if p is the probability of finding rq length one on a particular cpu,
and we do n tries, then the probability of exiting the PLE handler is:

 p^(n+1)  [ because we would have come across one source with rq length
1 and n target cpu rqs with length 1 ]

so, taking p = 1/2 (roughly 1.5x overcommit):
num tries:         probability of aborting ple handler
 1                 1/4
 2                 1/8
 3                 1/16

We could reduce this false-abort probability further with more tries, but
the problem is the overhead.
Also, if we have tried three times, that means we would have iterated
over 3 good eligible vcpus along with many non-eligible candidates. In
the worst case, if we iterate over all the vcpus, we lose the 1x benefit
and overcommit performance also gets hit.
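
For reference, here is a small user-space sketch (illustrative only, not
part of the patch) that reproduces the table above, assuming p = 1/2 at
roughly 1.5x overcommit:

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		double p = 0.5;	/* assumed probability of finding rq length == 1 */
		int n;

		/* P(abort) = p^(n+1): one source rq plus n target rqs of length 1 */
		for (n = 1; n <= 3; n++)
			printf("tries = %d  P(abort) = %g\n", n, pow(p, n + 1));

		return 0;
	}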

Note that we do not update last_boosted_vcpu in failure cases.
Thanks to Avi for raising the question of aborting after the first
yield_to failure.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
---
 Note: updated with the rationale for why the number of yield_to tries is three.

 virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..053f494 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
 	struct pid *pid;
 	struct task_struct *task = NULL;
+	bool ret = false;
 
 	rcu_read_lock();
 	pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 		task = get_pid_task(target->pid, PIDTYPE_PID);
 	rcu_read_unlock();
 	if (!task)
-		return false;
+		return ret;
 	if (task->flags & PF_VCPU) {
 		put_task_struct(task);
-		return false;
-	}
-	if (yield_to(task, 1)) {
-		put_task_struct(task);
-		return true;
+		return ret;
 	}
+	ret = yield_to(task, 1);
 	put_task_struct(task);
-	return false;
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return eligible;
 }
 #endif
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
+	int try = 3;
 	int pass;
 	int i;
 
@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 2 && !yielded && try; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
+
+			yielded = kvm_vcpu_yield_to(vcpu);
+			if (yielded > 0) {
 				kvm->last_boosted_vcpu = i;
-				yielded = 1;
 				break;
+			} else if (yielded < 0) {
+				try--;
+				if (!try)
+					break;
 			}
 		}
 	}



* Re: [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios
  2013-01-22  7:38 [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
  2013-01-22  7:39 ` [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
  2013-01-22  7:39 ` [PATCH V3 RESEND RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
@ 2013-01-23 13:57 ` Andrew Jones
  2013-01-24  8:27   ` Raghavendra K T
  2013-01-29 14:05 ` Gleb Natapov
  3 siblings, 1 reply; 14+ messages in thread
From: Andrew Jones @ 2013-01-23 13:57 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri

On Tue, Jan 22, 2013 at 01:08:54PM +0530, Raghavendra K T wrote:
>  In some special scenarios like #vcpu <= #pcpu, PLE handler may
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU.
> 
>  The first patch optimizes all the yield_to by bailing out when there
>  is no need to continue in yield_to (i.e., when there is only one task 
>  in source and target rq).
> 
>  Second patch uses that in PLE handler. Further when a yield_to fails
>  we do not immediately go out of PLE handler instead we try thrice 
>  to have better statistical possibility of false return. Otherwise that
>  would affect moderate overcommit cases.
>  
>  Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
>  around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.
> 
> 
> base = 3.7.0-rc6 
> machine: 32 core mx3850 x5 PLE mc
> 
> --+-----------+-----------+-----------+------------+-----------+
>                ebizzy (rec/sec higher is beter)
> --+-----------+-----------+-----------+------------+-----------+
>     base        stdev       patched     stdev       %improve     
> --+-----------+-----------+-----------+------------+-----------+
> 1x   2511.3000    21.5409    6051.8000   170.2592   140.98276   
> 2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
> 3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
> --+-----------+-----------+-----------+------------+-----------+
> 
> --+-----------+-----------+-----------+------------+-----------+
>         dbench (throughput in MB/sec. higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>     base        stdev       patched     stdev       %improve     
> --+-----------+-----------+-----------+------------+-----------+
> 1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
> 2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
> 3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
> --+-----------+-----------+-----------+------------+-----------+
> 
> Here is the refernce of no ple result.
>  ebizzy-1x_nople 7592.6000 rec/sec
>  dbench_1x_nople 7853.6960 MB/sec

I'm not sure how much we should trust ebizzy results, but even
so, the dbench results are stranger. The percent error is huge
(34%) and somehow we do much better for 1x overcommit with PLE
enabled than without (for the patched version). How does that
happen? How many guests are running in the 1x test? And are the
throughput results the combined throughput of all of them? I
wonder if this jump in throughput is just the guests' perceived
throughput, but wrong due to bad virtual time keeping. Can we
run a long-lasting benchmark and measure the elapsed time with
a clock external to the guests?

Drew

> 
> The result says we can still improve by 60% for ebizzy, but overall we are
> getting impressive performance with the patches.
> 
>  Changes Since V2:
>  - Dropped global measures usage patch (Peter Zilstra)
>  - Do not bail out on first failure (Avi Kivity)
>  - Try thrice for the failure of yield_to to get statistically more correct
>    behaviour.
> 
>  Changes since V1:
>  - Discard the idea of exporting nrrunning and optimize in core scheduler (Peter)
>  - Use yield() instead of schedule in overcommit scenarios (Rik)
>  - Use loadavg knowledge to detect undercommit/overcommit
> 
>  Peter Zijlstra (1):
>   Bail out of yield_to when source and target runqueue has one task
> 
>  Raghavendra K T (1):
>   Handle yield_to failure return for potential undercommit case
> 
>  Please let me know your comments and suggestions.
> 
>  Link for the discussion of V3 original:
>  https://lkml.org/lkml/2012/11/26/166
> 
>  Link for V2:
>  https://lkml.org/lkml/2012/10/29/287
> 
>  Link for V1:
>  https://lkml.org/lkml/2012/9/21/168
> 
>  kernel/sched/core.c | 25 +++++++++++++++++++------
>  virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
>  2 files changed, 35 insertions(+), 16 deletions(-)
> 


* Re: [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios
  2013-01-23 13:57 ` [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Andrew Jones
@ 2013-01-24  8:27   ` Raghavendra K T
  0 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2013-01-24  8:27 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri

On 01/23/2013 07:27 PM, Andrew Jones wrote:
> On Tue, Jan 22, 2013 at 01:08:54PM +0530, Raghavendra K T wrote:
>>   In some special scenarios like #vcpu <= #pcpu, PLE handler may
>> prove very costly, because there is no need to iterate over vcpus
>> and do unsuccessful yield_to burning CPU.
>>
>>   The first patch optimizes all the yield_to by bailing out when there
>>   is no need to continue in yield_to (i.e., when there is only one task
>>   in source and target rq).
>>
>>   Second patch uses that in PLE handler. Further when a yield_to fails
>>   we do not immediately go out of PLE handler instead we try thrice
>>   to have better statistical possibility of false return. Otherwise that
>>   would affect moderate overcommit cases.
>>
>>   Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
>>   around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.
>>
>>
>> base = 3.7.0-rc6
>> machine: 32 core mx3850 x5 PLE mc
>>
>> --+-----------+-----------+-----------+------------+-----------+
>>                 ebizzy (rec/sec higher is beter)
>> --+-----------+-----------+-----------+------------+-----------+
>>      base        stdev       patched     stdev       %improve
>> --+-----------+-----------+-----------+------------+-----------+
>> 1x   2511.3000    21.5409    6051.8000   170.2592   140.98276
>> 2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
>> 3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
>> --+-----------+-----------+-----------+------------+-----------+
>>
>> --+-----------+-----------+-----------+------------+-----------+
>>          dbench (throughput in MB/sec. higher is better)
>> --+-----------+-----------+-----------+------------+-----------+
>>      base        stdev       patched     stdev       %improve
>> --+-----------+-----------+-----------+------------+-----------+
>> 1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
>> 2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
>> 3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
>> --+-----------+-----------+-----------+------------+-----------+
>>
>> Here is the refernce of no ple result.
>>   ebizzy-1x_nople 7592.6000 rec/sec
>>   dbench_1x_nople 7853.6960 MB/sec
>
> I'm not sure how much we should trust ebizzy results,

In fact, on my box ebizzy gives very consistent results.

> but even
> so, the dbench results are stranger. The percent error is huge
> (34%) and somehow we do much better for 1x overcommit with PLE
> enabled then without (for the patched version). How does that
> happen? How many guests are running in the 1x test?

Yes, the dbench 1x result has big variance. I was running 4 guests,
with 3 of them idle, for the 1x case.

> And are the
> throughput results the combined throughput of all of them? I
> wonder if this jump in throughput is just the guests' perceived
> throughput, but wrong due to bad virtual time keeping.  Can we
> run a long-lasting benchmark and measure the elapsed time with
> a clock external from the guests?

Are you saying guest time keeping is not reliable and hence results
in high variance? The dbench tests are 3 minute runs plus a 30 sec
warmup, and look very consistent in the 2x, 3x, 4x cases.

I am happy to go ahead and test with whatever you suggest.

But in general I am seeing undercommit cases improve very well,
especially for large guests. Vinod had posted AIM7 benchmark results
which supported that for lower overcommits. However, for near-1x
cases he saw variations, but definite improvements of around 100-200%
IIRC against base PLE.



* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-22  7:39 ` [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
@ 2013-01-24 10:32   ` Ingo Molnar
  2013-01-25 10:40     ` Raghavendra K T
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2013-01-24 10:32 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Andrew Jones


* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> In case of undercomitted scenarios, especially in large guests
> yield_to overhead is significantly high. when run queue length of
> source and target is one, take an opportunity to bail out and return
> -ESRCH. This return condition can be further exploited to quickly come
> out of PLE handler.
> 
> (History: Raghavendra initially worked on break out of kvm ple handler upon
>  seeing source runqueue length = 1, but it had to export rq length).
>  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Acked-by: Andrew Jones <drjones@redhat.com>
> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
> 
>  kernel/sched/core.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8927f..fc219a5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *	true (>0) if we indeed boosted the target task.
> + *	false (0) if we failed to boost the target.
> + *	-ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  
>  again:
>  	p_rq = task_rq(p);
> +	/*
> +	 * If we're the only runnable task on the rq and target rq also
> +	 * has only one task, there's absolutely no point in yielding.
> +	 */
> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}

Looks good to me in principle.

Would be nice to get more consistent benchmark numbers. Once 
those are unambiguously showing that this is a win:

  Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo


* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-24 10:32   ` Ingo Molnar
@ 2013-01-25 10:40     ` Raghavendra K T
  2013-01-25 10:47       ` Ingo Molnar
  2013-01-25 11:05       ` Andrew Jones
  0 siblings, 2 replies; 14+ messages in thread
From: Raghavendra K T @ 2013-01-25 10:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Raghavendra K T, Peter Zijlstra, Avi Kivity, H. Peter Anvin,
	Thomas Gleixner, Gleb Natapov, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

* Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:

> 
> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> 
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > In case of undercomitted scenarios, especially in large guests
> > yield_to overhead is significantly high. when run queue length of
> > source and target is one, take an opportunity to bail out and return
> > -ESRCH. This return condition can be further exploited to quickly come
> > out of PLE handler.
> > 
> > (History: Raghavendra initially worked on break out of kvm ple handler upon
> >  seeing source runqueue length = 1, but it had to export rq length).
> >  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> > 
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > Acked-by: Andrew Jones <drjones@redhat.com>
> > Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> > ---
> > 
> >  kernel/sched/core.c |   25 +++++++++++++++++++------
> >  1 file changed, 19 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2d8927f..fc219a5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >   * It's the caller's job to ensure that the target task struct
> >   * can't go away on us before we can do any checks.
> >   *
> > - * Returns true if we indeed boosted the target task.
> > + * Returns:
> > + *	true (>0) if we indeed boosted the target task.
> > + *	false (0) if we failed to boost the target.
> > + *	-ESRCH if there's no task to yield to.
> >   */
> >  bool __sched yield_to(struct task_struct *p, bool preempt)
> >  {
> > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >  
> >  again:
> >  	p_rq = task_rq(p);
> > +	/*
> > +	 * If we're the only runnable task on the rq and target rq also
> > +	 * has only one task, there's absolutely no point in yielding.
> > +	 */
> > +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> > +		yielded = -ESRCH;
> > +		goto out_irq;
> > +	}
> 
> Looks good to me in principle.
> 
> Would be nice to get more consistent benchmark numbers. Once 
> those are unambiguously showing that this is a win:
> 
>   Acked-by: Ingo Molnar <mingo@kernel.org>
>

I ran the tests with kernbench and sysbench again on the 32 core mx3850
machine with 32 vcpu guests. The results show definite improvements.

ebizzy and dbench show similar improvement for 1x overcommit
(note that the stdev for 1x dbench is smaller; the improvement is now
only 20%).

[ all the results are averages of 8 runs ].

The patches benefit large-guest undercommit scenarios, so I believe
with larger guests the performance improvement is even more significant.
[ Chegu Vinod's results show performance close to the no-PLE case ].
Unfortunately I do not have a machine to test larger guests (>32 vcpus).

Ingo, please let me know if this is okay with you.

base kernel = 3.8.0-rc4

+-----------+-----------+-----------+------------+-----------+
                kernbench  (time in sec lower is better)
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
               sysbench (time in sec lower is better) 
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
                ebizzy  (records/sec higher is better)
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
                dbench (throughput MB/sec  higher is better)
+-----------+-----------+-----------+------------+-----------+
    base        stdev        patched    stdev      %improve
+-----------+-----------+-----------+------------+-----------+
1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
+-----------+-----------+-----------+------------+-----------+



* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-25 10:40     ` Raghavendra K T
@ 2013-01-25 10:47       ` Ingo Molnar
  2013-01-25 15:54         ` Raghavendra K T
  2013-01-25 11:05       ` Andrew Jones
  1 sibling, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2013-01-25 10:47 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Andrew Jones


* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
> 
> > 
> > * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> > 
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > In case of undercomitted scenarios, especially in large guests
> > > yield_to overhead is significantly high. when run queue length of
> > > source and target is one, take an opportunity to bail out and return
> > > -ESRCH. This return condition can be further exploited to quickly come
> > > out of PLE handler.
> > > 
> > > (History: Raghavendra initially worked on break out of kvm ple handler upon
> > >  seeing source runqueue length = 1, but it had to export rq length).
> > >  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> > > 
> > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> > > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > > Acked-by: Andrew Jones <drjones@redhat.com>
> > > Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> > > ---
> > > 
> > >  kernel/sched/core.c |   25 +++++++++++++++++++------
> > >  1 file changed, 19 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 2d8927f..fc219a5 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> > >   * It's the caller's job to ensure that the target task struct
> > >   * can't go away on us before we can do any checks.
> > >   *
> > > - * Returns true if we indeed boosted the target task.
> > > + * Returns:
> > > + *	true (>0) if we indeed boosted the target task.
> > > + *	false (0) if we failed to boost the target.
> > > + *	-ESRCH if there's no task to yield to.
> > >   */
> > >  bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  {
> > > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  
> > >  again:
> > >  	p_rq = task_rq(p);
> > > +	/*
> > > +	 * If we're the only runnable task on the rq and target rq also
> > > +	 * has only one task, there's absolutely no point in yielding.
> > > +	 */
> > > +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> > > +		yielded = -ESRCH;
> > > +		goto out_irq;
> > > +	}
> > 
> > Looks good to me in principle.
> > 
> > Would be nice to get more consistent benchmark numbers. Once 
> > those are unambiguously showing that this is a win:
> > 
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> >
> 
> I ran the test with kernbench and sysbench again on 32 core mx3850
> machine with 32 vcpu guests. Results shows definite improvements.
> 
> ebizzy and dbench show similar improvement for 1x overcommit
> (note that stdev for 1x in dbench is lesser improvemet is now seen at
> only 20%)
> 
> [ all the experiments are taken out of 8 run averages ].
> 
> The patches benefit large guest undercommit scenarios, so I believe
> with large guest performance improvemnt is even significant. [ Chegu
> Vinod results show performance near to no ple cases ]. Unfortunately I
> do not have a machine to test larger guest (>32).
> 
> Ingo, Please let me know if this is okay to you.
> 
> base kernel = 3.8.0-rc4
> 
> +-----------+-----------+-----------+------------+-----------+
>                 kernbench  (time in sec lower is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
>                sysbench (time in sec lower is better) 
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 ebizzy  (records/sec higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 dbench (throughput MB/sec  higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
> +-----------+-----------+-----------+------------+-----------+

The numbers look pretty convincing, thanks. The workloads were 
CPU bound most of the time, right?

Thanks,

	Ingo


* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-25 10:40     ` Raghavendra K T
  2013-01-25 10:47       ` Ingo Molnar
@ 2013-01-25 11:05       ` Andrew Jones
  2013-01-25 15:58         ` Raghavendra K T
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Jones @ 2013-01-25 11:05 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Ingo Molnar, Peter Zijlstra, Avi Kivity, H. Peter Anvin,
	Thomas Gleixner, Gleb Natapov, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri

On Fri, Jan 25, 2013 at 04:10:25PM +0530, Raghavendra K T wrote:
> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
> 
> > 
> > * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> > 
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > In case of undercomitted scenarios, especially in large guests
> > > yield_to overhead is significantly high. when run queue length of
> > > source and target is one, take an opportunity to bail out and return
> > > -ESRCH. This return condition can be further exploited to quickly come
> > > out of PLE handler.
> > > 
> > > (History: Raghavendra initially worked on break out of kvm ple handler upon
> > >  seeing source runqueue length = 1, but it had to export rq length).
> > >  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> > > 
> > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> > > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > > Acked-by: Andrew Jones <drjones@redhat.com>
> > > Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> > > ---
> > > 
> > >  kernel/sched/core.c |   25 +++++++++++++++++++------
> > >  1 file changed, 19 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 2d8927f..fc219a5 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> > >   * It's the caller's job to ensure that the target task struct
> > >   * can't go away on us before we can do any checks.
> > >   *
> > > - * Returns true if we indeed boosted the target task.
> > > + * Returns:
> > > + *	true (>0) if we indeed boosted the target task.
> > > + *	false (0) if we failed to boost the target.
> > > + *	-ESRCH if there's no task to yield to.
> > >   */
> > >  bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  {
> > > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> > >  
> > >  again:
> > >  	p_rq = task_rq(p);
> > > +	/*
> > > +	 * If we're the only runnable task on the rq and target rq also
> > > +	 * has only one task, there's absolutely no point in yielding.
> > > +	 */
> > > +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> > > +		yielded = -ESRCH;
> > > +		goto out_irq;
> > > +	}
> > 
> > Looks good to me in principle.
> > 
> > Would be nice to get more consistent benchmark numbers. Once 
> > those are unambiguously showing that this is a win:
> > 
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> >
> 
> I ran the test with kernbench and sysbench again on 32 core mx3850
> machine with 32 vcpu guests. Results shows definite improvements.
> 
> ebizzy and dbench show similar improvement for 1x overcommit
> (note that stdev for 1x in dbench is lesser improvemet is now seen at
> only 20%)
> 
> [ all the experiments are taken out of 8 run averages ].
> 
> The patches benefit large guest undercommit scenarios, so I believe
> with large guest performance improvemnt is even significant. [ Chegu
> Vinod results show performance near to no ple cases ].

The last results you posted for dbench for the patched 1x case were
showing much better throughput than the no-ple 1x case, which is what
was strange. Is that still happening? You don't have the no-ple 1x
data here this time. The percent errors look a lot better.

> Unfortunately I
> do not have a machine to test larger guest (>32).
> 
> Ingo, Please let me know if this is okay to you.
> 
> base kernel = 3.8.0-rc4
> 
> +-----------+-----------+-----------+------------+-----------+
>                 kernbench  (time in sec lower is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
>                sysbench (time in sec lower is better) 
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 ebizzy  (records/sec higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
> +-----------+-----------+-----------+------------+-----------+
> 
> +-----------+-----------+-----------+------------+-----------+
>                 dbench (throughput MB/sec  higher is better)
> +-----------+-----------+-----------+------------+-----------+
>     base        stdev        patched    stdev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
> +-----------+-----------+-----------+------------+-----------+
> 


* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-25 10:47       ` Ingo Molnar
@ 2013-01-25 15:54         ` Raghavendra K T
  2013-01-25 18:49           ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Raghavendra K T @ 2013-01-25 15:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Andrew Jones

On 01/25/2013 04:17 PM, Ingo Molnar wrote:
>
> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
>>
>>>
>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> In case of undercomitted scenarios, especially in large guests
>>>> yield_to overhead is significantly high. when run queue length of
>>>> source and target is one, take an opportunity to bail out and return
>>>> -ESRCH. This return condition can be further exploited to quickly come
>>>> out of PLE handler.
>>>>
>>>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>>>
>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>>>> ---
>>>>
>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8927f..fc219a5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>    * It's the caller's job to ensure that the target task struct
>>>>    * can't go away on us before we can do any checks.
>>>>    *
>>>> - * Returns true if we indeed boosted the target task.
>>>> + * Returns:
>>>> + *	true (>0) if we indeed boosted the target task.
>>>> + *	false (0) if we failed to boost the target.
>>>> + *	-ESRCH if there's no task to yield to.
>>>>    */
>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>   {
>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>
>>>>   again:
>>>>   	p_rq = task_rq(p);
>>>> +	/*
>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>> +	 */
>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +		yielded = -ESRCH;
>>>> +		goto out_irq;
>>>> +	}
>>>
>>> Looks good to me in principle.
>>>
>>> Would be nice to get more consistent benchmark numbers. Once
>>> those are unambiguously showing that this is a win:
>>>
>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>>
>>
>> I ran the test with kernbench and sysbench again on 32 core mx3850
>> machine with 32 vcpu guests. Results shows definite improvements.
>>
>> ebizzy and dbench show similar improvement for 1x overcommit
>> (note that stdev for 1x in dbench is lesser improvemet is now seen at
>> only 20%)
>>
>> [ all the experiments are taken out of 8 run averages ].
>>
>> The patches benefit large guest undercommit scenarios, so I believe
>> with large guest performance improvemnt is even significant. [ Chegu
>> Vinod results show performance near to no ple cases ]. Unfortunately I
>> do not have a machine to test larger guest (>32).
>>
>> Ingo, Please let me know if this is okay to you.
>>
>> base kernel = 3.8.0-rc4
>>
>> +-----------+-----------+-----------+------------+-----------+
>>                  kernbench  (time in sec lower is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
>> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
>> +-----------+-----------+-----------+------------+-----------+
>> +-----------+-----------+-----------+------------+-----------+
>>                 sysbench (time in sec lower is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
>> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
>> +-----------+-----------+-----------+------------+-----------+
>>
>> +-----------+-----------+-----------+------------+-----------+
>>                  ebizzy  (records/sec higher is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
>> +-----------+-----------+-----------+------------+-----------+
>>
>> +-----------+-----------+-----------+------------+-----------+
>>                  dbench (throughput MB/sec  higher is better)
>> +-----------+-----------+-----------+------------+-----------+
>>      base        stdev        patched    stdev      %improve
>> +-----------+-----------+-----------+------------+-----------+
>> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
>> +-----------+-----------+-----------+------------+-----------+
>
> The numbers look pretty convincing, thanks. The workloads were
> CPU bound most of the time, right?

Yes, CPU bound most of the time. I also used tmpfs to reduce IO
overhead (for dbench).



* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-25 11:05       ` Andrew Jones
@ 2013-01-25 15:58         ` Raghavendra K T
  0 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2013-01-25 15:58 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Ingo Molnar, Peter Zijlstra, Avi Kivity, H. Peter Anvin,
	Thomas Gleixner, Gleb Natapov, Ingo Molnar, Marcelo Tosatti,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri

On 01/25/2013 04:35 PM, Andrew Jones wrote:
> On Fri, Jan 25, 2013 at 04:10:25PM +0530, Raghavendra K T wrote:
>> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
>>
>>>
>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> In case of undercomitted scenarios, especially in large guests
>>>> yield_to overhead is significantly high. when run queue length of
>>>> source and target is one, take an opportunity to bail out and return
>>>> -ESRCH. This return condition can be further exploited to quickly come
>>>> out of PLE handler.
>>>>
>>>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>>>
>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>>>> ---
>>>>
>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8927f..fc219a5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>    * It's the caller's job to ensure that the target task struct
>>>>    * can't go away on us before we can do any checks.
>>>>    *
>>>> - * Returns true if we indeed boosted the target task.
>>>> + * Returns:
>>>> + *	true (>0) if we indeed boosted the target task.
>>>> + *	false (0) if we failed to boost the target.
>>>> + *	-ESRCH if there's no task to yield to.
>>>>    */
>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>   {
>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>
>>>>   again:
>>>>   	p_rq = task_rq(p);
>>>> +	/*
>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>> +	 */
>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +		yielded = -ESRCH;
>>>> +		goto out_irq;
>>>> +	}
>>>
>>> Looks good to me in principle.
>>>
>>> Would be nice to get more consistent benchmark numbers. Once
>>> those are unambiguously showing that this is a win:
>>>
>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>>
>>
>> I ran the test with kernbench and sysbench again on 32 core mx3850
>> machine with 32 vcpu guests. Results shows definite improvements.
>>
>> ebizzy and dbench show similar improvement for 1x overcommit
>> (note that stdev for 1x in dbench is lesser improvemet is now seen at
>> only 20%)
>>
>> [ all the experiments are taken out of 8 run averages ].
>>
>> The patches benefit large guest undercommit scenarios, so I believe
>> with large guest performance improvemnt is even significant. [ Chegu
>> Vinod results show performance near to no ple cases ].
>
> The last results you posted for dbench for the patched 1x case were
> showing much better throughput than the no-ple 1x case, which is what
> was strange. Is that still happening? You don't have the no-ple 1x
> data here this time. The percent errors look a lot better.

I re-ran the experiment, and the no-PLE case got almost 4% less
throughput (13500 vs 14100) compared to the patched case. (I believe
this variation may be due to having 4 guests with 3 of them idle, as
no-PLE is very sensitive beyond 1x.)





* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-25 15:54         ` Raghavendra K T
@ 2013-01-25 18:49           ` Ingo Molnar
  2013-01-27 16:58             ` Raghavendra K T
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2013-01-25 18:49 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Gleb Natapov, Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Andrew Jones


* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> On 01/25/2013 04:17 PM, Ingo Molnar wrote:
> >
> >* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> >
> >>* Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
> >>
> >>>
> >>>* Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> >>>
> >>>>From: Peter Zijlstra <peterz@infradead.org>
> >>>>
> >>>>In case of undercomitted scenarios, especially in large guests
> >>>>yield_to overhead is significantly high. when run queue length of
> >>>>source and target is one, take an opportunity to bail out and return
> >>>>-ESRCH. This return condition can be further exploited to quickly come
> >>>>out of PLE handler.
> >>>>
> >>>>(History: Raghavendra initially worked on break out of kvm ple handler upon
> >>>>  seeing source runqueue length = 1, but it had to export rq length).
> >>>>  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> >>>>
> >>>>Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >>>>Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> >>>>Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >>>>Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >>>>Acked-by: Andrew Jones <drjones@redhat.com>
> >>>>Tested-by: Chegu Vinod <chegu_vinod@hp.com>
> >>>>---
> >>>>
> >>>>  kernel/sched/core.c |   25 +++++++++++++++++++------
> >>>>  1 file changed, 19 insertions(+), 6 deletions(-)
> >>>>
> >>>>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>>>index 2d8927f..fc219a5 100644
> >>>>--- a/kernel/sched/core.c
> >>>>+++ b/kernel/sched/core.c
> >>>>@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >>>>   * It's the caller's job to ensure that the target task struct
> >>>>   * can't go away on us before we can do any checks.
> >>>>   *
> >>>>- * Returns true if we indeed boosted the target task.
> >>>>+ * Returns:
> >>>>+ *	true (>0) if we indeed boosted the target task.
> >>>>+ *	false (0) if we failed to boost the target.
> >>>>+ *	-ESRCH if there's no task to yield to.
> >>>>   */
> >>>>  bool __sched yield_to(struct task_struct *p, bool preempt)
> >>>>  {
> >>>>@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >>>>
> >>>>  again:
> >>>>  	p_rq = task_rq(p);
> >>>>+	/*
> >>>>+	 * If we're the only runnable task on the rq and target rq also
> >>>>+	 * has only one task, there's absolutely no point in yielding.
> >>>>+	 */
> >>>>+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> >>>>+		yielded = -ESRCH;
> >>>>+		goto out_irq;
> >>>>+	}
> >>>
> >>>Looks good to me in principle.
> >>>
> >>>Would be nice to get more consistent benchmark numbers. Once
> >>>those are unambiguously showing that this is a win:
> >>>
> >>>   Acked-by: Ingo Molnar <mingo@kernel.org>
> >>>
> >>
> >>I ran the tests with kernbench and sysbench again on a 32 core mx3850
> >>machine with 32 vcpu guests. The results show definite improvements.
> >>
> >>ebizzy and dbench show a similar improvement for 1x overcommit
> >>(note that the stdev for 1x dbench is smaller now; the improvement
> >>is seen at only 20%)
> >>
> >>[ all results are averages over 8 runs ].
> >>
> >>The patches benefit large-guest undercommit scenarios, so I believe
> >>that with larger guests the performance improvement is even more
> >>significant. [ Chegu Vinod's results show performance close to the
> >>no-ple case ]. Unfortunately I do not have a machine to test larger
> >>guests (>32 vcpus).
> >>
> >>Ingo, please let me know if this is okay with you.
> >>
> >>base kernel = 3.8.0-rc4
> >>
> >>+-----------+-----------+-----------+------------+-----------+
> >>                 kernbench  (time in sec lower is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
> >>2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
> >>+-----------+-----------+-----------+------------+-----------+
> >>+-----------+-----------+-----------+------------+-----------+
> >>                sysbench (time in sec lower is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
> >>2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
> >>+-----------+-----------+-----------+------------+-----------+
> >>
> >>+-----------+-----------+-----------+------------+-----------+
> >>                 ebizzy  (records/sec higher is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
> >>+-----------+-----------+-----------+------------+-----------+
> >>
> >>+-----------+-----------+-----------+------------+-----------+
> >>                 dbench (throughput MB/sec  higher is better)
> >>+-----------+-----------+-----------+------------+-----------+
> >>     base        stdev        patched    stdev      %improve
> >>+-----------+-----------+-----------+------------+-----------+
> >>1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
> >>+-----------+-----------+-----------+------------+-----------+
> >
> >The numbers look pretty convincing, thanks. The workloads were
> >CPU bound most of the time, right?
> 
> Yes, CPU bound most of the time. I also used tmpfs to reduce
> io overhead (for dbench).

Ok, cool.

Which tree will this be upstreamed through - the KVM tree? I'd 
suggest the KVM tree because KVM will be the one exposed to the 
effects of this change.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread
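
A minimal sketch of how a caller is expected to act on the three yield_to()
return values documented in the patch above. The helper name
try_yield_to_task() is invented purely for illustration, and the sketch
assumes the return value reaches the caller as an int (the quoted hunk shows
the bool signature only as unchanged context; the -ESRCH value is only
meaningful once the result is handled as an int, which is what the KVM side
in patch 2/2 relies on).

	/* Illustrative kernel-context helper, not part of the posted patches. */
	static int try_yield_to_task(struct task_struct *target)
	{
		int ret = yield_to(target, false);

		if (ret > 0)
			return 1;	/* we boosted the target task */
		if (ret == -ESRCH)
			return -ESRCH;	/* one task on both rqs: stop trying */
		return 0;		/* boost failed; another task may be tried */
	}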

* Re: [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2013-01-25 18:49           ` Ingo Molnar
@ 2013-01-27 16:58             ` Raghavendra K T
  0 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2013-01-27 16:58 UTC (permalink / raw)
  To: Ingo Molnar, Gleb Natapov, Marcelo Tosatti
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 01/26/2013 12:19 AM, Ingo Molnar wrote:
>
> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> On 01/25/2013 04:17 PM, Ingo Molnar wrote:
>>>
>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>
>>>> * Ingo Molnar <mingo@kernel.org> [2013-01-24 11:32:13]:
>>>>
>>>>>
>>>>> * Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>>>
>>>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>>>
> >>>>>> In undercommitted scenarios, especially with large guests, the
> >>>>>> yield_to overhead is significantly high. When the run queue length of
> >>>>>> both source and target is one, take the opportunity to bail out and
> >>>>>> return -ESRCH. This return condition can then be exploited to quickly
> >>>>>> come out of the PLE handler.
> >>>>>>
> >>>>>> (History: Raghavendra initially worked on breaking out of the kvm ple
> >>>>>>   handler upon seeing source runqueue length = 1, but that required
> >>>>>>   exporting the rq length). Peter came up with the elegant idea of
> >>>>>>   returning -ESRCH from the scheduler core.
>>>>>>
>>>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >>>>>> Raghavendra: added the check for the target vcpu's rq length (thanks Avi)
>>>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>>>> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>>>>>> ---
>>>>>>
>>>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>>> index 2d8927f..fc219a5 100644
>>>>>> --- a/kernel/sched/core.c
>>>>>> +++ b/kernel/sched/core.c
>>>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>>>    * It's the caller's job to ensure that the target task struct
>>>>>>    * can't go away on us before we can do any checks.
>>>>>>    *
>>>>>> - * Returns true if we indeed boosted the target task.
>>>>>> + * Returns:
>>>>>> + *	true (>0) if we indeed boosted the target task.
>>>>>> + *	false (0) if we failed to boost the target.
>>>>>> + *	-ESRCH if there's no task to yield to.
>>>>>>    */
>>>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>>>   {
>>>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>>>
>>>>>>   again:
>>>>>>   	p_rq = task_rq(p);
>>>>>> +	/*
>>>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>>>> +	 */
>>>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>>>> +		yielded = -ESRCH;
>>>>>> +		goto out_irq;
>>>>>> +	}
>>>>>
>>>>> Looks good to me in principle.
>>>>>
>>>>> Would be nice to get more consistent benchmark numbers. Once
>>>>> those are unambiguously showing that this is a win:
>>>>>
>>>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>>>>
>>>>
>>>> I ran the tests with kernbench and sysbench again on a 32 core mx3850
>>>> machine with 32 vcpu guests. The results show definite improvements.
>>>>
>>>> ebizzy and dbench show a similar improvement for 1x overcommit
>>>> (note that the stdev for 1x dbench is smaller now; the improvement
>>>> is seen at only 20%)
>>>>
>>>> [ all results are averages over 8 runs ].
>>>>
>>>> The patches benefit large-guest undercommit scenarios, so I believe
>>>> that with larger guests the performance improvement is even more
>>>> significant. [ Chegu Vinod's results show performance close to the
>>>> no-ple case ]. Unfortunately I do not have a machine to test larger
>>>> guests (>32 vcpus).
>>>>
>>>> Ingo, please let me know if this is okay with you.
>>>>
>>>> base kernel = 3.8.0-rc4
>>>>
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                  kernbench  (time in sec lower is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x   46.6028     1.8672	    42.4494     1.1390	   8.91234
>>>> 2x   99.9074     9.1859	    90.4050     2.6131	   9.51121
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                 sysbench (time in sec lower is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x   18.7402     0.3764	    17.7431     0.3589	   5.32065
>>>> 2x   13.2238     0.1935	    13.0096     0.3152	   1.61981
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                  ebizzy  (records/sec higher is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x  2421.9000    19.1801	  5883.1000   112.7243	 142.91259
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>                  dbench (throughput MB/sec  higher is better)
>>>> +-----------+-----------+-----------+------------+-----------+
>>>>      base        stdev        patched    stdev      %improve
>>>> +-----------+-----------+-----------+------------+-----------+
>>>> 1x  11675.9900   857.4154	 14103.5000   215.8425	  20.79061
>>>> +-----------+-----------+-----------+------------+-----------+
>>>
>>> The numbers look pretty convincing, thanks. The workloads were
>>> CPU bound most of the time, right?
>>
>> Yes, CPU bound most of the time. I also used tmpfs to reduce
>> io overhead (for dbench).
>
> Ok, cool.
>
> Which tree will this be upstreamed through - the KVM tree? I'd
> suggest the KVM tree because KVM will be the one exposed to the
> effects of this change.

Thanks Ingo.

Marcelo, could you please take this into the kvm tree?



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios
  2013-01-22  7:38 [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
                   ` (2 preceding siblings ...)
  2013-01-23 13:57 ` [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Andrew Jones
@ 2013-01-29 14:05 ` Gleb Natapov
  3 siblings, 0 replies; 14+ messages in thread
From: Gleb Natapov @ 2013-01-29 14:05 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, Avi Kivity, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Andrew Jones

On Tue, Jan 22, 2013 at 01:08:54PM +0530, Raghavendra K T wrote:
>  In some special scenarios like #vcpu <= #pcpu, PLE handler may
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU.
> 
>  The first patch optimizes all the yield_to by bailing out when there
>  is no need to continue in yield_to (i.e., when there is only one task 
>  in source and target rq).
> 
>  Second patch uses that in PLE handler. Further when a yield_to fails
>  we do not immediately go out of PLE handler instead we try thrice 
>  to have better statistical possibility of false return. Otherwise that
>  would affect moderate overcommit cases.
>  
>  Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
>  around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.
> 
> 
> base = 3.7.0-rc6 
> machine: 32 core mx3850 x5 PLE mc
> 
> --+-----------+-----------+-----------+------------+-----------+
>                ebizzy (rec/sec higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>     base        stdev       patched     stdev       %improve     
> --+-----------+-----------+-----------+------------+-----------+
> 1x   2511.3000    21.5409    6051.8000   170.2592   140.98276   
> 2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
> 3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
> --+-----------+-----------+-----------+------------+-----------+
> 
> --+-----------+-----------+-----------+------------+-----------+
>         dbench (throughput in MB/sec. higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>     base        stdev       patched     stdev       %improve     
> --+-----------+-----------+-----------+------------+-----------+
> 1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
> 2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
> 3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
> --+-----------+-----------+-----------+------------+-----------+
> 
> For reference, here are the no-ple results.
>  ebizzy-1x_nople 7592.6000 rec/sec
>  dbench_1x_nople 7853.6960 MB/sec
> 
> The result says we can still improve by 60% for ebizzy, but overall we are
> getting impressive performance with the patches.
> 
>  Changes Since V2:
>  - Dropped global measures usage patch (Peter Zijlstra)
>  - Do not bail out on first failure (Avi Kivity)
>  - Try thrice for the failure of yield_to to get statistically more correct
>    behaviour.
> 
>  Changes since V1:
>  - Discard the idea of exporting nrrunning and optimize in core scheduler (Peter)
>  - Use yield() instead of schedule in overcommit scenarios (Rik)
>  - Use loadavg knowledge to detect undercommit/overcommit
> 
>  Peter Zijlstra (1):
>   Bail out of yield_to when source and target runqueue has one task
> 
>  Raghavendra K T (1):
>   Handle yield_to failure return for potential undercommit case
> 
>  Please let me know your comments and suggestions.
> 
>  Link for the discussion of V3 original:
>  https://lkml.org/lkml/2012/11/26/166
> 
>  Link for V2:
>  https://lkml.org/lkml/2012/10/29/287
> 
>  Link for V1:
>  https://lkml.org/lkml/2012/9/21/168
> 
>  kernel/sched/core.c | 25 +++++++++++++++++++------
>  virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
>  2 files changed, 35 insertions(+), 16 deletions(-)
Applied, thanks.

--
			Gleb.

^ permalink raw reply	[flat|nested] 14+ messages in thread
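
Patch 2/2 itself is not quoted anywhere in this view. Below is a rough sketch
of the retry behaviour described in the cover letter above (do not leave the
PLE handler on the first yield_to failure, but only after -ESRCH has been seen
three times); the candidate filtering and last-boosted-vcpu bookkeeping of the
real kvm_vcpu_on_spin() are omitted, and kvm_vcpu_yield_to() is assumed to
propagate yield_to()'s int result.

	/* Heavily simplified PLE-handler loop; illustrative only. */
	static void ple_handler_sketch(struct kvm_vcpu *me)
	{
		struct kvm_vcpu *vcpu;
		int try = 3;	/* tolerate up to three -ESRCH returns */
		int i;

		kvm_for_each_vcpu(i, vcpu, me->kvm) {
			int yielded;

			if (vcpu == me)
				continue;

			yielded = kvm_vcpu_yield_to(vcpu);
			if (yielded > 0)
				break;		/* boosted a vcpu, done */
			if (yielded < 0 && !--try)
				break;		/* likely undercommitted, bail out */
		}
	}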

end of thread, other threads:[~2013-01-29 14:06 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-22  7:38 [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
2013-01-22  7:39 ` [PATCH V3 RESEND RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
2013-01-24 10:32   ` Ingo Molnar
2013-01-25 10:40     ` Raghavendra K T
2013-01-25 10:47       ` Ingo Molnar
2013-01-25 15:54         ` Raghavendra K T
2013-01-25 18:49           ` Ingo Molnar
2013-01-27 16:58             ` Raghavendra K T
2013-01-25 11:05       ` Andrew Jones
2013-01-25 15:58         ` Raghavendra K T
2013-01-22  7:39 ` [PATCH V3 RESEND RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
2013-01-23 13:57 ` [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios Andrew Jones
2013-01-24  8:27   ` Raghavendra K T
2013-01-29 14:05 ` Gleb Natapov
