* [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
@ 2012-11-26 12:07 Raghavendra K T
  2012-11-26 12:07 ` [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-26 12:07 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

 In some special scenarios like #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and perform unsuccessful yield_to calls that only burn CPU.

 The first patch optimizes yield_to by bailing out when there is
 nothing to gain from continuing (i.e., when both the source and the
 target rq have only one task).

 The second patch uses that return code in the PLE handler. Further,
 when a yield_to fails we do not immediately leave the PLE handler;
 instead we retry up to three times, to reduce the statistical
 possibility of a false exit, which would otherwise hurt moderate
 overcommit cases.
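
 In outline, the resulting PLE handler flow is captured by the
 standalone sketch below (my illustration, not the patch itself:
 fake_yield_to() is a random stand-in for the kernel's yield_to(),
 and locking and vcpu-eligibility checks are omitted; the real
 changes are in the diffs that follow):

#include <errno.h>
#include <stdlib.h>

#define NR_VCPUS 8

/* Stand-in for the patched yield_to(): > 0 means we boosted the
 * target, 0 means the boost failed, -ESRCH means both source and
 * target runqueues had a single task (a hint of undercommit). */
static int fake_yield_to(int vcpu)
{
	(void)vcpu;
	switch (rand() % 3) {
	case 0:
		return 1;
	case 1:
		return 0;
	default:
		return -ESRCH;
	}
}

int main(void)
{
	int boosted = 0, try = 3;
	int pass, i, ret;

	/* Two passes over the vcpus, as in kvm_vcpu_on_spin(), but give
	 * up once three yield_to calls have returned -ESRCH. */
	for (pass = 0; pass < 2 && !boosted && try; pass++) {
		for (i = 0; i < NR_VCPUS; i++) {
			ret = fake_yield_to(i);
			if (ret > 0) {
				boosted = 1;
				break;
			} else if (ret < 0 && !--try) {
				break;
			}
		}
	}
	return !boosted;
}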
 
 Results on the 3.7.0-rc6 kernel show around a 140% improvement for
 ebizzy 1x and around 51% for dbench 1x, on a 32-core PLE machine with
 a 32-vcpu guest.


base = 3.7.0-rc6 
machine: 32 core mx3850 x5 PLE mc

--+-----------+-----------+-----------+------------+-----------+
               ebizzy (rec/sec, higher is better)
--+-----------+-----------+-----------+------------+-----------+
    base        stdev       patched     stdev       %improve     
--+-----------+-----------+-----------+------------+-----------+
1x   2511.3000    21.5409    6051.8000   170.2592   140.98276   
2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
4x   1784.3750   102.2699    2018.7500   187.5723    13.13485
--+-----------+-----------+-----------+------------+-----------+

--+-----------+-----------+-----------+------------+-----------+
        dbench (throughput in MB/sec, higher is better)
--+-----------+-----------+-----------+------------+-----------+
    base        stdev       patched     stdev       %improve     
--+-----------+-----------+-----------+------------+-----------+
1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
4x  3043.1725  3243.7281    4664.4662   5946.5741      53.27643
--+-----------+-----------+-----------+------------+-----------+

For reference, here are the no-PLE results:
 ebizzy_1x_nople 7592.6000 rec/sec
 dbench_1x_nople 7853.6960 MB/sec

The no-PLE numbers say there is still about 60% headroom (relative to
base) for ebizzy, but overall we are getting impressive performance
with the patches.

 Changes since V2:
 - Dropped the global measures usage patch (Peter Zijlstra)
 - Do not bail out on first failure (Avi Kivity)
 - Retry up to three times on yield_to failure, for statistically more
   correct behaviour.

 Changes since V1:
 - Discarded the idea of exporting nr_running and optimized in the core
   scheduler (Peter)
 - Use yield() instead of schedule() in overcommit scenarios (Rik)
 - Use loadavg knowledge to detect undercommit/overcommit

 Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

 Raghavendra K T (1):
  Handle yield_to failure return for potential undercommit case

 Please let me know your comments and suggestions.

 Link for V2:
 https://lkml.org/lkml/2012/10/29/287

 Link for V1:
 https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++++++++++++++++++------
 virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
 2 files changed, 35 insertions(+), 16 deletions(-)



* [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-26 12:07 [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
@ 2012-11-26 12:07 ` Raghavendra K T
  2012-11-26 13:35   ` Andrew Jones
  2012-12-14  0:29   ` Marcelo Tosatti
  2012-11-26 12:08 ` [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
  2012-11-29  2:07 ` [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Chegu Vinod
  2 siblings, 2 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-26 12:07 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Gleb Natapov,
	Ingo Molnar, Avi Kivity, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

From: Peter Zijlstra <peterz@infradead.org>

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
both source and target is one, take the opportunity to bail out and
return -ESRCH. This return condition can be further exploited to
quickly come out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE
 handler upon seeing source runqueue length = 1, but that required
 exporting the rq length. Peter came up with the elegant idea of
 returning -ESRCH from the scheduler core.)

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Raghavendra: added the condition checking the target vcpu's rq length (thanks Avi)]
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---

 kernel/sched/core.c |   25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+	/*
+	 * If we're the only runnable task on the rq and target rq also
+	 * has only one task, there's absolutely no point in yielding.
+	 */
+	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4329,11 +4341,12 @@ again:
 			resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;



* [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-26 12:07 [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
  2012-11-26 12:07 ` [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
@ 2012-11-26 12:08 ` Raghavendra K T
  2012-11-26 13:43   ` Andrew Jones
  2012-11-28  1:12   ` Marcelo Tosatti
  2012-11-29  2:07 ` [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Chegu Vinod
  2 siblings, 2 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-26 12:08 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel
  Cc: Srikar, Nikunj A. Dadhania, KVM, Raghavendra K T, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

yield_to returns -ESRCH when the run queue length of both the source
and the target of yield_to is one. When we see three successive such
failures of yield_to, we assume we are in a potential undercommit case
and abort from the PLE handler.
The assumption is backed by the low probability of a wrong decision
even for worst case scenarios such as an average runqueue length
between 1 and 2.

Note that we do not update the last boosted vcpu in the failure cases.
Thanks to Avi for raising the question about aborting after the first
failure from yield_to.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..053f494 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
 	struct pid *pid;
 	struct task_struct *task = NULL;
+	bool ret = false;
 
 	rcu_read_lock();
 	pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
 		task = get_pid_task(target->pid, PIDTYPE_PID);
 	rcu_read_unlock();
 	if (!task)
-		return false;
+		return ret;
 	if (task->flags & PF_VCPU) {
 		put_task_struct(task);
-		return false;
-	}
-	if (yield_to(task, 1)) {
-		put_task_struct(task);
-		return true;
+		return ret;
 	}
+	ret = yield_to(task, 1);
 	put_task_struct(task);
-	return false;
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
 
@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return eligible;
 }
 #endif
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
+	int try = 3;
 	int pass;
 	int i;
 
@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 2 && !yielded && try; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
+
+			yielded = kvm_vcpu_yield_to(vcpu);
+			if (yielded > 0) {
 				kvm->last_boosted_vcpu = i;
-				yielded = 1;
 				break;
+			} else if (yielded < 0) {
+				try--;
+				if (!try)
+					break;
 			}
 		}
 	}



* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-26 12:07 ` [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
@ 2012-11-26 13:35   ` Andrew Jones
  2012-11-27 10:30     ` Raghavendra K T
  2012-12-14  0:29   ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Jones @ 2012-11-26 13:35 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Marcelo Tosatti, Gleb Natapov,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> In case of undercomitted scenarios, especially in large guests
> yield_to overhead is significantly high. when run queue length of
> source and target is one, take an opportunity to bail out and return
> -ESRCH. This return condition can be further exploited to quickly come
> out of PLE handler.
> 
> (History: Raghavendra initially worked on break out of kvm ple handler upon
>  seeing source runqueue length = 1, but it had to export rq length).
>  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
> 
>  kernel/sched/core.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8927f..fc219a5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *	true (>0) if we indeed boosted the target task.
> + *	false (0) if we failed to boost the target.
> + *	-ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  
>  again:
>  	p_rq = task_rq(p);
> +	/*
> +	 * If we're the only runnable task on the rq and target rq also
> +	 * has only one task, there's absolutely no point in yielding.
> +	 */
> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}
> +
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
> @@ -4310,13 +4322,13 @@ again:
>  	}
>  
>  	if (!curr->sched_class->yield_to_task)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (curr->sched_class != p->sched_class)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +		goto out_unlock;
>  
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4329,11 +4341,12 @@ again:
>  			resched_task(p_rq->curr);
>  	}
>  
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
>  
> -	if (yielded)
> +	if (yielded > 0)
>  		schedule();
>  
>  	return yielded;
>

Acked-by: Andrew Jones <drjones@redhat.com>


* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-26 12:08 ` [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
@ 2012-11-26 13:43   ` Andrew Jones
  2012-11-26 14:06     ` Andrew Jones
  2012-11-27 10:27     ` Raghavendra K T
  2012-11-28  1:12   ` Marcelo Tosatti
  1 sibling, 2 replies; 27+ messages in thread
From: Andrew Jones @ 2012-11-26 13:43 UTC (permalink / raw)
  To: Raghavendra K T, riel
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri

On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> yield_to returns -ESRCH, When source and target of yield_to
> run queue length is one. When we see three successive failures of
> yield_to we assume we are in potential undercommit case and abort
> from PLE handler.
> The assumption is backed by low probability of wrong decision
> for even worst case scenarios such as average runqueue length
> between 1 and 2.
> 
> note that we do not update last boosted vcpu in failure cases.
> Thank Avi for raising question on aborting after first fail from yield_to.
> 
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index be70035..053f494 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
>  {
>  	struct pid *pid;
>  	struct task_struct *task = NULL;
> +	bool ret = false;
>  
>  	rcu_read_lock();
>  	pid = rcu_dereference(target->pid);
> @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
>  		task = get_pid_task(target->pid, PIDTYPE_PID);
>  	rcu_read_unlock();
>  	if (!task)
> -		return false;
> +		return ret;
>  	if (task->flags & PF_VCPU) {
>  		put_task_struct(task);
> -		return false;
> -	}
> -	if (yield_to(task, 1)) {
> -		put_task_struct(task);
> -		return true;
> +		return ret;
>  	}
> +	ret = yield_to(task, 1);
>  	put_task_struct(task);
> -	return false;
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
>  
> @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>  	return eligible;
>  }
>  #endif
> +
>  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>  	struct kvm *kvm = me->kvm;
>  	struct kvm_vcpu *vcpu;
>  	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>  	int yielded = 0;
> +	int try = 3;
>  	int pass;
>  	int i;
>  
> @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * VCPU is holding the lock that we need and will release it.
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
> -	for (pass = 0; pass < 2 && !yielded; pass++) {
> +	for (pass = 0; pass < 2 && !yielded && try; pass++) {
>  		kvm_for_each_vcpu(i, vcpu, kvm) {
>  			if (!pass && i <= last_boosted_vcpu) {
>  				i = last_boosted_vcpu;
> @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  				continue;
>  			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>  				continue;
> -			if (kvm_vcpu_yield_to(vcpu)) {
> +
> +			yielded = kvm_vcpu_yield_to(vcpu);
> +			if (yielded > 0) {
>  				kvm->last_boosted_vcpu = i;
> -				yielded = 1;
>  				break;
> +			} else if (yielded < 0) {
> +				try--;
> +				if (!try)
> +					break;
>  			}
>  		}
>  	}
> 

The check done in patch 1/2 is done before the double_rq_lock, so it's
cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
wonder if it would make more sense to change the vcpu state from tracking
the pid to tracking the task. If that was done, then I don't believe this
patch is necessary.

Rik,
for 34bb10b79de7 was there a reason pid was used instead of task?

Drew


* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-26 13:43   ` Andrew Jones
@ 2012-11-26 14:06     ` Andrew Jones
  2012-11-27 10:27     ` Raghavendra K T
  1 sibling, 0 replies; 27+ messages in thread
From: Andrew Jones @ 2012-11-26 14:06 UTC (permalink / raw)
  To: Raghavendra K T, riel
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Mon, Nov 26, 2012 at 02:43:02PM +0100, Andrew Jones wrote:
> On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
> > From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > 
> > yield_to returns -ESRCH, When source and target of yield_to
> > run queue length is one. When we see three successive failures of
> > yield_to we assume we are in potential undercommit case and abort
> > from PLE handler.
> > The assumption is backed by low probability of wrong decision
> > for even worst case scenarios such as average runqueue length
> > between 1 and 2.
> > 
> > note that we do not update last boosted vcpu in failure cases.
> > Thank Avi for raising question on aborting after first fail from yield_to.
> > 
> > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > ---
> >  virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
> >  1 file changed, 16 insertions(+), 10 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index be70035..053f494 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
> >  {
> >  	struct pid *pid;
> >  	struct task_struct *task = NULL;
> > +	bool ret = false;
> >  
> >  	rcu_read_lock();
> >  	pid = rcu_dereference(target->pid);
> > @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
> >  		task = get_pid_task(target->pid, PIDTYPE_PID);
> >  	rcu_read_unlock();
> >  	if (!task)
> > -		return false;
> > +		return ret;
> >  	if (task->flags & PF_VCPU) {
> >  		put_task_struct(task);
> > -		return false;
> > -	}
> > -	if (yield_to(task, 1)) {
> > -		put_task_struct(task);
> > -		return true;
> > +		return ret;
> >  	}
> > +	ret = yield_to(task, 1);
> >  	put_task_struct(task);
> > -	return false;
> > +
> > +	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
> >  
> > @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
> >  	return eligible;
> >  }
> >  #endif
> > +
> >  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >  {
> >  	struct kvm *kvm = me->kvm;
> >  	struct kvm_vcpu *vcpu;
> >  	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> >  	int yielded = 0;
> > +	int try = 3;
> >  	int pass;
> >  	int i;
> >  
> > @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >  	 * VCPU is holding the lock that we need and will release it.
> >  	 * We approximate round-robin by starting at the last boosted VCPU.
> >  	 */
> > -	for (pass = 0; pass < 2 && !yielded; pass++) {
> > +	for (pass = 0; pass < 2 && !yielded && try; pass++) {
> >  		kvm_for_each_vcpu(i, vcpu, kvm) {
> >  			if (!pass && i <= last_boosted_vcpu) {
> >  				i = last_boosted_vcpu;
> > @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >  				continue;
> >  			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
> >  				continue;
> > -			if (kvm_vcpu_yield_to(vcpu)) {
> > +
> > +			yielded = kvm_vcpu_yield_to(vcpu);
> > +			if (yielded > 0) {
> >  				kvm->last_boosted_vcpu = i;
> > -				yielded = 1;
> >  				break;
> > +			} else if (yielded < 0) {
> > +				try--;
> > +				if (!try)
> > +					break;
> >  			}
> >  		}
> >  	}
> > 
> 
> The check done in patch 1/2 is done before the double_rq_lock, so it's
> cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
> wonder if it would make more sense to change the vcpu state from tracking
> the pid to tracking the task. If that was done, then I don't believe this
> patch is necessary.
> 
> Rik,
> for 34bb10b79de7 was there a reason pid was used instead of task?

Nevermind, I guess there's no way to validate the task pointer without
checking the pid, since, as your git commit says, there is no guarantee
that the same task always keeps the same vcpu. We'd only know it's
valid if it's running, and if it's running, it's of no interest.

> 
> Drew


* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-26 13:43   ` Andrew Jones
  2012-11-26 14:06     ` Andrew Jones
@ 2012-11-27 10:27     ` Raghavendra K T
  2012-11-27 13:22       ` Andrew Jones
  1 sibling, 1 reply; 27+ messages in thread
From: Raghavendra K T @ 2012-11-27 10:27 UTC (permalink / raw)
  To: Andrew Jones
  Cc: riel, Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On 11/26/2012 07:13 PM, Andrew Jones wrote:
> On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> yield_to returns -ESRCH, When source and target of yield_to
>> run queue length is one. When we see three successive failures of
>> yield_to we assume we are in potential undercommit case and abort
>> from PLE handler.
>> The assumption is backed by low probability of wrong decision
>> for even worst case scenarios such as average runqueue length
>> between 1 and 2.
>>
>> note that we do not update last boosted vcpu in failure cases.
>> Thank Avi for raising question on aborting after first fail from yield_to.
>>
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
>>   1 file changed, 16 insertions(+), 10 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index be70035..053f494 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
>>   {
>>   	struct pid *pid;
>>   	struct task_struct *task = NULL;
>> +	bool ret = false;
>>
>>   	rcu_read_lock();
>>   	pid = rcu_dereference(target->pid);
>> @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
>>   		task = get_pid_task(target->pid, PIDTYPE_PID);
>>   	rcu_read_unlock();
>>   	if (!task)
>> -		return false;
>> +		return ret;
>>   	if (task->flags & PF_VCPU) {
>>   		put_task_struct(task);
>> -		return false;
>> -	}
>> -	if (yield_to(task, 1)) {
>> -		put_task_struct(task);
>> -		return true;
>> +		return ret;
>>   	}
>> +	ret = yield_to(task, 1);
>>   	put_task_struct(task);
>> -	return false;
>> +
>> +	return ret;
>>   }
>>   EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
>>
>> @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>>   	return eligible;
>>   }
>>   #endif
>> +
>>   void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   {
>>   	struct kvm *kvm = me->kvm;
>>   	struct kvm_vcpu *vcpu;
>>   	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>>   	int yielded = 0;
>> +	int try = 3;
>>   	int pass;
>>   	int i;
>>
>> @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   	 * VCPU is holding the lock that we need and will release it.
>>   	 * We approximate round-robin by starting at the last boosted VCPU.
>>   	 */
>> -	for (pass = 0; pass < 2 && !yielded; pass++) {
>> +	for (pass = 0; pass < 2 && !yielded && try; pass++) {
>>   		kvm_for_each_vcpu(i, vcpu, kvm) {
>>   			if (!pass && i <= last_boosted_vcpu) {
>>   				i = last_boosted_vcpu;
>> @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   				continue;
>>   			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>>   				continue;
>> -			if (kvm_vcpu_yield_to(vcpu)) {
>> +
>> +			yielded = kvm_vcpu_yield_to(vcpu);
>> +			if (yielded > 0) {
>>   				kvm->last_boosted_vcpu = i;
>> -				yielded = 1;
>>   				break;
>> +			} else if (yielded < 0) {
>> +				try--;
>> +				if (!try)
>> +					break;
>>   			}
>>   		}
>>   	}
>>

Drew, Thanks for reviewing this.
>
> The check done in patch 1/2 is done before the double_rq_lock, so it's
> cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
> wonder if it would make more sense to change the vcpu state from tracking
> the pid to tracking the task. If that was done, then I don't believe this
> patch is necessary.

We would need logic not to break out upon the first failure of
yield_to (which is what happens otherwise with patch 1 alone).
Breaking out of the PLE handler upon the first failure resulted in
degradation in moderate overcommit, due to false exits even when we
have more than one task in other cpus' run queues.

But your suggestion triggered an idea: what would be the cost of
iterating over all the vcpus despite yield_to failures?

(Where we break out of the PLE handler only if we have a successful
yield, i.e. yielded > 0) with something like this:

-       for (pass = 0; pass < 2 && !yielded; pass++) {
+       for (pass = 0; pass < 2 && yielded <=0 ; pass++) {
                 kvm_for_each_vcpu(i, vcpu, kvm) {
                         if (!pass && i <= last_boosted_vcpu) {
                                 i = last_boosted_vcpu;
@@ -1727,11 +1727,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
                                 continue;
                         if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
                                 continue;
-                       if (kvm_vcpu_yield_to(vcpu)) {
+
+                       yielded = kvm_vcpu_yield_to(vcpu);
+                       if (yielded > 0) {
                                 kvm->last_boosted_vcpu = i;
-                               yielded = 1;
                                 break;

Here is the result of the above patch w.r.t. the base and the current
patch series.

benchmark     improvement w.r.t base   improvement w.r.t current patch
ebizzy_1x      131.22287               -9.76%
ebizzy_4x      -7.97198                -21.1%

dbench_1x       25.67077               -25.55%
dbench_4x       -69.19086              -122.46%


Current patches perform better. So this means iterating over vcpus
does have some overhead, and IMO for bigger machines with large guests
this overhead is significant.

Let me know if this patch sounds good to you.



* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-26 13:35   ` Andrew Jones
@ 2012-11-27 10:30     ` Raghavendra K T
  2012-11-27 14:04       ` Andrew Theurer
  2012-11-27 14:23       ` Chegu Vinod
  0 siblings, 2 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-27 10:30 UTC (permalink / raw)
  To: Andrew Jones, Marcelo Tosatti, Gleb Natapov, Chegu Vinod,
	Andrew M. Theurer
  Cc: Peter Zijlstra, H. Peter Anvin, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	LKML, Srivatsa Vaddagiri

On 11/26/2012 07:05 PM, Andrew Jones wrote:
> On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> In case of undercomitted scenarios, especially in large guests
>> yield_to overhead is significantly high. when run queue length of
>> source and target is one, take an opportunity to bail out and return
>> -ESRCH. This return condition can be further exploited to quickly come
>> out of PLE handler.
>>
>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>   seeing source runqueue length = 1, but it had to export rq length).
>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>
>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2d8927f..fc219a5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>    * It's the caller's job to ensure that the target task struct
>>    * can't go away on us before we can do any checks.
>>    *
>> - * Returns true if we indeed boosted the target task.
>> + * Returns:
>> + *	true (>0) if we indeed boosted the target task.
>> + *	false (0) if we failed to boost the target.
>> + *	-ESRCH if there's no task to yield to.
>>    */
>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>   {
>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>
>>   again:
>>   	p_rq = task_rq(p);
>> +	/*
>> +	 * If we're the only runnable task on the rq and target rq also
>> +	 * has only one task, there's absolutely no point in yielding.
>> +	 */
>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>> +		yielded = -ESRCH;
>> +		goto out_irq;
>> +	}
>> +
>>   	double_rq_lock(rq, p_rq);
>>   	while (task_rq(p) != p_rq) {
>>   		double_rq_unlock(rq, p_rq);
>> @@ -4310,13 +4322,13 @@ again:
>>   	}
>>
>>   	if (!curr->sched_class->yield_to_task)
>> -		goto out;
>> +		goto out_unlock;
>>
>>   	if (curr->sched_class != p->sched_class)
>> -		goto out;
>> +		goto out_unlock;
>>
>>   	if (task_running(p_rq, p) || p->state)
>> -		goto out;
>> +		goto out_unlock;
>>
>>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>>   	if (yielded) {
>> @@ -4329,11 +4341,12 @@ again:
>>   			resched_task(p_rq->curr);
>>   	}
>>
>> -out:
>> +out_unlock:
>>   	double_rq_unlock(rq, p_rq);
>> +out_irq:
>>   	local_irq_restore(flags);
>>
>> -	if (yielded)
>> +	if (yielded > 0)
>>   		schedule();
>>
>>   	return yielded;
>>
>
> Acked-by: Andrew Jones <drjones@redhat.com>
>

Thank you Drew.

Marcelo, Gleb: please let me know if you have comments / concerns on
the patches.

Andrew, Vinod: IMO, the patch set looks good for undercommit scenarios,
especially for large guests where we do have the overhead of vcpu
iteration in the PLE handler.



* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-27 10:27     ` Raghavendra K T
@ 2012-11-27 13:22       ` Andrew Jones
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Jones @ 2012-11-27 13:22 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: riel, Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri

On Tue, Nov 27, 2012 at 03:57:25PM +0530, Raghavendra K T wrote:
> On 11/26/2012 07:13 PM, Andrew Jones wrote:
> >On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
> >>From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >>
> >>yield_to returns -ESRCH, When source and target of yield_to
> >>run queue length is one. When we see three successive failures of
> >>yield_to we assume we are in potential undercommit case and abort
> >>from PLE handler.
> >>The assumption is backed by low probability of wrong decision
> >>for even worst case scenarios such as average runqueue length
> >>between 1 and 2.
> >>
> >>note that we do not update last boosted vcpu in failure cases.
> >>Thank Avi for raising question on aborting after first fail from yield_to.
> >>
> >>Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >>Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >>---
> >>  virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
> >>  1 file changed, 16 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>index be70035..053f494 100644
> >>--- a/virt/kvm/kvm_main.c
> >>+++ b/virt/kvm/kvm_main.c
> >>@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
> >>  {
> >>  	struct pid *pid;
> >>  	struct task_struct *task = NULL;
> >>+	bool ret = false;
> >>
> >>  	rcu_read_lock();
> >>  	pid = rcu_dereference(target->pid);
> >>@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
> >>  		task = get_pid_task(target->pid, PIDTYPE_PID);
> >>  	rcu_read_unlock();
> >>  	if (!task)
> >>-		return false;
> >>+		return ret;
> >>  	if (task->flags & PF_VCPU) {
> >>  		put_task_struct(task);
> >>-		return false;
> >>-	}
> >>-	if (yield_to(task, 1)) {
> >>-		put_task_struct(task);
> >>-		return true;
> >>+		return ret;
> >>  	}
> >>+	ret = yield_to(task, 1);
> >>  	put_task_struct(task);
> >>-	return false;
> >>+
> >>+	return ret;
> >>  }
> >>  EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
> >>
> >>@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
> >>  	return eligible;
> >>  }
> >>  #endif
> >>+
> >>  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >>  {
> >>  	struct kvm *kvm = me->kvm;
> >>  	struct kvm_vcpu *vcpu;
> >>  	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> >>  	int yielded = 0;
> >>+	int try = 3;
> >>  	int pass;
> >>  	int i;
> >>
> >>@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >>  	 * VCPU is holding the lock that we need and will release it.
> >>  	 * We approximate round-robin by starting at the last boosted VCPU.
> >>  	 */
> >>-	for (pass = 0; pass < 2 && !yielded; pass++) {
> >>+	for (pass = 0; pass < 2 && !yielded && try; pass++) {
> >>  		kvm_for_each_vcpu(i, vcpu, kvm) {
> >>  			if (!pass && i <= last_boosted_vcpu) {
> >>  				i = last_boosted_vcpu;
> >>@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >>  				continue;
> >>  			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
> >>  				continue;
> >>-			if (kvm_vcpu_yield_to(vcpu)) {
> >>+
> >>+			yielded = kvm_vcpu_yield_to(vcpu);
> >>+			if (yielded > 0) {
> >>  				kvm->last_boosted_vcpu = i;
> >>-				yielded = 1;
> >>  				break;
> >>+			} else if (yielded < 0) {
> >>+				try--;
> >>+				if (!try)
> >>+					break;
> >>  			}
> >>  		}
> >>  	}
> >>
> 
> Drew, Thanks for reviewing this.
> >
> >The check done in patch 1/2 is done before the double_rq_lock, so it's
> >cheap. Now, this patch is to avoid doing too many get_pid_task calls. I
> >wonder if it would make more sense to change the vcpu state from tracking
> >the pid to tracking the task. If that was done, then I don't believe this
> >patch is necessary.
> 
> We would need a logic not to break upon first failure of yield_to.
> (which happens otherwise with patch1 alone). Breaking upon first
> failure out of ple handler resulted in degradation in moderate
> overcommits due to false exits even when we have more than one task in
> other cpu run queues.
> 
> But your suggestion triggered an idea to me, what would be the cost of
> iterating over all vcpus despite of yield_to failure?
> 
> (Where we breakout of PLE handler only if we have successful yield
> i.e yielded > 0) with something like this:
> 
> -       for (pass = 0; pass < 2 && !yielded; pass++) {
> +       for (pass = 0; pass < 2 && yielded <=0 ; pass++) {
>                 kvm_for_each_vcpu(i, vcpu, kvm) {
>                         if (!pass && i <= last_boosted_vcpu) {
>                                 i = last_boosted_vcpu;
> @@ -1727,11 +1727,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>                                 continue;
>                         if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>                                 continue;
> -                       if (kvm_vcpu_yield_to(vcpu)) {
> +
> +                       yielded = kvm_vcpu_yield_to(vcpu);
> +                       if (yielded > 0) {
>                                 kvm->last_boosted_vcpu = i;
> -                               yielded = 1;
>                                 break;
> 

OK, I had actually assumed that the first round of testing had been
implemented this way, but then the cost of get_pid_task() forced the
introduction of a try limit.

> Here is the result of the above patch w.r.t to base and current patch
> series.
> 
> benchmark     improvement w.r.t base   improvement w.r.t current patch
> ebizzy_1x      131.22287               -9.76%
> ebizzy_4x      -7.97198                -21.1%
> 
> dbench_1x       25.67077               -25.55%
> dbench_4x       -69.19086              -122.46%
> 
> 
> Current patches perform better. So this means iterating over vcpus
> has some overhead.  Though we have IMO for bigger machine with large
> guests, this is
> significant..
> 
> Let me know if this patch sounds good to you..
> 

It does what it advertises - reduces the impact of the vcpu-on-spin
loop for undercommit scenarios, so I'm ok with it.


* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-27 10:30     ` Raghavendra K T
@ 2012-11-27 14:04       ` Andrew Theurer
  2012-11-28  7:03         ` Raghavendra K T
  2012-11-27 14:23       ` Chegu Vinod
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Theurer @ 2012-11-27 14:04 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Jones, Marcelo Tosatti, Gleb Natapov, Chegu Vinod,
	Peter Zijlstra, H. Peter Anvin, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	LKML, Srivatsa Vaddagiri

On Tue, 2012-11-27 at 16:00 +0530, Raghavendra K T wrote:
> On 11/26/2012 07:05 PM, Andrew Jones wrote:
> > On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
> >> From: Peter Zijlstra <peterz@infradead.org>
> >>
> >> In case of undercomitted scenarios, especially in large guests
> >> yield_to overhead is significantly high. when run queue length of
> >> source and target is one, take an opportunity to bail out and return
> >> -ESRCH. This return condition can be further exploited to quickly come
> >> out of PLE handler.
> >>
> >> (History: Raghavendra initially worked on break out of kvm ple handler upon
> >>   seeing source runqueue length = 1, but it had to export rq length).
> >>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
> >>
> >> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> >> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >> ---
> >>
> >>   kernel/sched/core.c |   25 +++++++++++++++++++------
> >>   1 file changed, 19 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8927f..fc219a5 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >>    * It's the caller's job to ensure that the target task struct
> >>    * can't go away on us before we can do any checks.
> >>    *
> >> - * Returns true if we indeed boosted the target task.
> >> + * Returns:
> >> + *	true (>0) if we indeed boosted the target task.
> >> + *	false (0) if we failed to boost the target.
> >> + *	-ESRCH if there's no task to yield to.
> >>    */
> >>   bool __sched yield_to(struct task_struct *p, bool preempt)
> >>   {
> >> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >>
> >>   again:
> >>   	p_rq = task_rq(p);
> >> +	/*
> >> +	 * If we're the only runnable task on the rq and target rq also
> >> +	 * has only one task, there's absolutely no point in yielding.
> >> +	 */
> >> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> >> +		yielded = -ESRCH;
> >> +		goto out_irq;
> >> +	}
> >> +
> >>   	double_rq_lock(rq, p_rq);
> >>   	while (task_rq(p) != p_rq) {
> >>   		double_rq_unlock(rq, p_rq);
> >> @@ -4310,13 +4322,13 @@ again:
> >>   	}
> >>
> >>   	if (!curr->sched_class->yield_to_task)
> >> -		goto out;
> >> +		goto out_unlock;
> >>
> >>   	if (curr->sched_class != p->sched_class)
> >> -		goto out;
> >> +		goto out_unlock;
> >>
> >>   	if (task_running(p_rq, p) || p->state)
> >> -		goto out;
> >> +		goto out_unlock;
> >>
> >>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >>   	if (yielded) {
> >> @@ -4329,11 +4341,12 @@ again:
> >>   			resched_task(p_rq->curr);
> >>   	}
> >>
> >> -out:
> >> +out_unlock:
> >>   	double_rq_unlock(rq, p_rq);
> >> +out_irq:
> >>   	local_irq_restore(flags);
> >>
> >> -	if (yielded)
> >> +	if (yielded > 0)
> >>   		schedule();
> >>
> >>   	return yielded;
> >>
> >
> > Acked-by: Andrew Jones <drjones@redhat.com>
> >
> 
> Thank you Drew.
> 
> Marcelo Gleb.. Please let me know if you have comments / concerns on the 
> patches..
> 
> Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
> especially for large guests where we do have overhead of vcpu iteration
> of ple handler..

I agree, looks fine for undercommit scenarios.  I do wonder what happens
with 1.5x overcommit, where we might see 1/2 the host cpus with runqueue
of 2 and 1/2 of the host cpus with a runqueue of 1.  Even with this
change that scenario still might be fine, but it would be nice to see a
comparison.

-Andrew




* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-27 10:30     ` Raghavendra K T
  2012-11-27 14:04       ` Andrew Theurer
@ 2012-11-27 14:23       ` Chegu Vinod
       [not found]         ` <50B68F94.3080907@hp.com>
       [not found]         ` <50B6B5B5.5060108@hp.com>
  1 sibling, 2 replies; 27+ messages in thread
From: Chegu Vinod @ 2012-11-27 14:23 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Jones, Marcelo Tosatti, Gleb Natapov, Andrew M. Theurer,
	Peter Zijlstra, H. Peter Anvin, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	LKML, Srivatsa Vaddagiri

On 11/27/2012 2:30 AM, Raghavendra K T wrote:
> On 11/26/2012 07:05 PM, Andrew Jones wrote:
>> On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> In case of undercomitted scenarios, especially in large guests
>>> yield_to overhead is significantly high. when run queue length of
>>> source and target is one, take an opportunity to bail out and return
>>> -ESRCH. This return condition can be further exploited to quickly come
>>> out of PLE handler.
>>>
>>> (History: Raghavendra initially worked on break out of kvm ple 
>>> handler upon
>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>   Peter came up with the elegant idea of return -ESRCH in scheduler 
>>> core.
>>>
>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>> Raghavendra, Checking the rq length of target vcpu condition 
>>> added.(thanks Avi)
>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> ---
>>>
>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 2d8927f..fc219a5 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>    * It's the caller's job to ensure that the target task struct
>>>    * can't go away on us before we can do any checks.
>>>    *
>>> - * Returns true if we indeed boosted the target task.
>>> + * Returns:
>>> + *    true (>0) if we indeed boosted the target task.
>>> + *    false (0) if we failed to boost the target.
>>> + *    -ESRCH if there's no task to yield to.
>>>    */
>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>   {
>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, 
>>> bool preempt)
>>>
>>>   again:
>>>       p_rq = task_rq(p);
>>> +    /*
>>> +     * If we're the only runnable task on the rq and target rq also
>>> +     * has only one task, there's absolutely no point in yielding.
>>> +     */
>>> +    if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>> +        yielded = -ESRCH;
>>> +        goto out_irq;
>>> +    }
>>> +
>>>       double_rq_lock(rq, p_rq);
>>>       while (task_rq(p) != p_rq) {
>>>           double_rq_unlock(rq, p_rq);
>>> @@ -4310,13 +4322,13 @@ again:
>>>       }
>>>
>>>       if (!curr->sched_class->yield_to_task)
>>> -        goto out;
>>> +        goto out_unlock;
>>>
>>>       if (curr->sched_class != p->sched_class)
>>> -        goto out;
>>> +        goto out_unlock;
>>>
>>>       if (task_running(p_rq, p) || p->state)
>>> -        goto out;
>>> +        goto out_unlock;
>>>
>>>       yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>>>       if (yielded) {
>>> @@ -4329,11 +4341,12 @@ again:
>>>               resched_task(p_rq->curr);
>>>       }
>>>
>>> -out:
>>> +out_unlock:
>>>       double_rq_unlock(rq, p_rq);
>>> +out_irq:
>>>       local_irq_restore(flags);
>>>
>>> -    if (yielded)
>>> +    if (yielded > 0)
>>>           schedule();
>>>
>>>       return yielded;
>>>
>>
>> Acked-by: Andrew Jones <drjones@redhat.com>
>>
>
> Thank you Drew.
>
> Marcelo Gleb.. Please let me know if you have comments / concerns on 
> the patches..
>
> Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
> especially for large guests where we do have overhead of vcpu iteration
> of ple handler..
>
> .
>
Thanks Raghu. Will try to get this latest patch set evaluated and get 
back to you.

Vinod



* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-26 12:08 ` [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
  2012-11-26 13:43   ` Andrew Jones
@ 2012-11-28  1:12   ` Marcelo Tosatti
  2012-11-28  5:10     ` Raghavendra K T
  1 sibling, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2012-11-28  1:12 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones


Don't understand the reasoning behind why 3 is a good choice.

On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> yield_to returns -ESRCH, When source and target of yield_to
> run queue length is one. When we see three successive failures of
> yield_to we assume we are in potential undercommit case and abort
> from PLE handler.
> The assumption is backed by low probability of wrong decision
> for even worst case scenarios such as average runqueue length
> between 1 and 2.
> 
> note that we do not update last boosted vcpu in failure cases.
> Thank Avi for raising question on aborting after first fail from yield_to.
> 
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c |   26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index be70035..053f494 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
>  {
>  	struct pid *pid;
>  	struct task_struct *task = NULL;
> +	bool ret = false;
>  
>  	rcu_read_lock();
>  	pid = rcu_dereference(target->pid);
> @@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
>  		task = get_pid_task(target->pid, PIDTYPE_PID);
>  	rcu_read_unlock();
>  	if (!task)
> -		return false;
> +		return ret;
>  	if (task->flags & PF_VCPU) {
>  		put_task_struct(task);
> -		return false;
> -	}
> -	if (yield_to(task, 1)) {
> -		put_task_struct(task);
> -		return true;
> +		return ret;
>  	}
> +	ret = yield_to(task, 1);
>  	put_task_struct(task);
> -	return false;
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
>  
> @@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>  	return eligible;
>  }
>  #endif
> +
>  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>  	struct kvm *kvm = me->kvm;
>  	struct kvm_vcpu *vcpu;
>  	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>  	int yielded = 0;
> +	int try = 3;
>  	int pass;
>  	int i;
>  
> @@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * VCPU is holding the lock that we need and will release it.
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
> -	for (pass = 0; pass < 2 && !yielded; pass++) {
> +	for (pass = 0; pass < 2 && !yielded && try; pass++) {
>  		kvm_for_each_vcpu(i, vcpu, kvm) {
>  			if (!pass && i <= last_boosted_vcpu) {
>  				i = last_boosted_vcpu;
> @@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  				continue;
>  			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
>  				continue;
> -			if (kvm_vcpu_yield_to(vcpu)) {
> +
> +			yielded = kvm_vcpu_yield_to(vcpu);
> +			if (yielded > 0) {
>  				kvm->last_boosted_vcpu = i;
> -				yielded = 1;
>  				break;
> +			} else if (yielded < 0) {
> +				try--;
> +				if (!try)
> +					break;
>  			}
>  		}
>  	}
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-28  1:12   ` Marcelo Tosatti
@ 2012-11-28  5:10     ` Raghavendra K T
  2012-11-29 12:16       ` Gleb Natapov
  2012-12-03 19:56       ` Marcelo Tosatti
  0 siblings, 2 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-28  5:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
>
> Don't understand the reasoning behind why 3 is a good choice.

Here is where I came from. (explaining from scratch for completeness, 
forgive me :))
In moderate overcommit, we can falsely exit from the PLE handler even
when we have a preempted task of the same VM waiting on other cpus. To
reduce this problem, we try a few times before exiting.
The problem boils down to: what is the probability that we exit the
PLE handler even when we have more than one runnable task on other
cpus? The theoretical worst case should be around 1.5x overcommit (as
also pointed out by Andrew Theurer). [But the practical worst case may
be around 2x-3x overcommit, as indicated by the results for the patch
series.]

So if p is the probability of finding rq length one on a particular
cpu, and if we do n tries, then the probability of exiting the PLE
handler is:

  p^(n+1) [because we would have come across one source with rq length
1 and n target cpu rqs with length 1]

so
num tries:         probability of aborting ple handler (1.5x overcommit)
  1                 1/4
  2                 1/8
  3                 1/16

We can reduce this probability further with more tries, but the
problem is the overhead.
Also, if we have tried three times, that means we would have iterated
over 3 good eligible vcpus along with many non-eligible candidates. In
the worst case, if we iterate over all the vcpus, we hurt 1x
performance and overcommit performance also gets hit [as in the
results above].

I have already tried num_tries = 1, 2, 3 and n (not 4 yet). So I
concluded 3 is enough.
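
As a quick sanity check of the arithmetic, here is a trivial
standalone program (not kernel code; p = 1/2 models the 1.5x
overcommit worst case, where half the host cpus have rq length 1):

/* build: gcc prob.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double p = 0.5;	/* chance that a given cpu's rq has length 1 */
	int n;

	/* P(abort PLE handler) = p^(n+1): one source rq of length 1
	 * plus n target rqs of length 1. */
	for (n = 1; n <= 4; n++)
		printf("tries = %d -> P = %.5f\n", n, pow(p, n + 1));
	return 0;
}

For tries = 1, 2, 3 this prints 0.25, 0.125, 0.0625 (i.e. 1/4, 1/8,
1/16), matching the table above; a fourth try would only buy another
factor of two.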

In fact I have also run kernbench and hackbench, which are showing a
5-20% improvement.

[As a side note, I also considered num_tries = f(n) =
ceil(log(num_online_cpus)/2), but the calculation seemed like too much
overhead, and there is probably no point in making it dependent on the
number of online cpus.]

Please let me know if you are happy with this rationale, or correct me
if you foresee a problem. (In fact, Avi's and Rik's concern about
false exits is what made me arrive at the 'try' logic, which I did not
have earlier.)

I am currently trying out 1.5x overcommit and will post the result.

>
> On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> yield_to returns -ESRCH, When source and target of yield_to
>> run queue length is one. When we see three successive failures of
>> yield_to we assume we are in potential undercommit case and abort
>> from PLE handler.
>> The assumption is backed by low probability of wrong decision
>> for even worst case scenarios such as average runqueue length
>> between 1 and 2.
>>
>> note that we do not update last boosted vcpu in failure cases.
>> Thank Avi for raising question on aborting after first fail from yield_to.
>>
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
[...]



* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-27 14:04       ` Andrew Theurer
@ 2012-11-28  7:03         ` Raghavendra K T
  0 siblings, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-28  7:03 UTC (permalink / raw)
  To: habanero
  Cc: Andrew Jones, Marcelo Tosatti, Gleb Natapov, Chegu Vinod,
	Peter Zijlstra, H. Peter Anvin, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	LKML, Srivatsa Vaddagiri

On 11/27/2012 07:34 PM, Andrew Theurer wrote:
> On Tue, 2012-11-27 at 16:00 +0530, Raghavendra K T wrote:
>> On 11/26/2012 07:05 PM, Andrew Jones wrote:
>>> On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> In case of undercomitted scenarios, especially in large guests
>>>> yield_to overhead is significantly high. when run queue length of
>>>> source and target is one, take an opportunity to bail out and return
>>>> -ESRCH. This return condition can be further exploited to quickly come
>>>> out of PLE handler.
>>>>
>>>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>>>    seeing source runqueue length = 1, but it had to export rq length).
>>>>    Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>>>
>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>> ---
>>>>
>>>>    kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>    1 file changed, 19 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8927f..fc219a5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>     * It's the caller's job to ensure that the target task struct
>>>>     * can't go away on us before we can do any checks.
>>>>     *
>>>> - * Returns true if we indeed boosted the target task.
>>>> + * Returns:
>>>> + *	true (>0) if we indeed boosted the target task.
>>>> + *	false (0) if we failed to boost the target.
>>>> + *	-ESRCH if there's no task to yield to.
>>>>     */
>>>>    bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>    {
>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>
>>>>    again:
>>>>    	p_rq = task_rq(p);
>>>> +	/*
>>>> +	 * If we're the only runnable task on the rq and target rq also
>>>> +	 * has only one task, there's absolutely no point in yielding.
>>>> +	 */
>>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +		yielded = -ESRCH;
>>>> +		goto out_irq;
>>>> +	}
>>>> +
>>>>    	double_rq_lock(rq, p_rq);
>>>>    	while (task_rq(p) != p_rq) {
>>>>    		double_rq_unlock(rq, p_rq);
>>>> @@ -4310,13 +4322,13 @@ again:
>>>>    	}
>>>>
>>>>    	if (!curr->sched_class->yield_to_task)
>>>> -		goto out;
>>>> +		goto out_unlock;
>>>>
>>>>    	if (curr->sched_class != p->sched_class)
>>>> -		goto out;
>>>> +		goto out_unlock;
>>>>
>>>>    	if (task_running(p_rq, p) || p->state)
>>>> -		goto out;
>>>> +		goto out_unlock;
>>>>
>>>>    	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>>>>    	if (yielded) {
>>>> @@ -4329,11 +4341,12 @@ again:
>>>>    			resched_task(p_rq->curr);
>>>>    	}
>>>>
>>>> -out:
>>>> +out_unlock:
>>>>    	double_rq_unlock(rq, p_rq);
>>>> +out_irq:
>>>>    	local_irq_restore(flags);
>>>>
>>>> -	if (yielded)
>>>> +	if (yielded > 0)
>>>>    		schedule();
>>>>
>>>>    	return yielded;
>>>>
>>>
>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>
>>
>> Thank you Drew.
>>
>> Marcelo Gleb.. Please let me know if you have comments / concerns on the
>> patches..
>>
>> Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
>> especially for large guests where we do have overhead of vcpu iteration
>> of ple handler..
>
> I agree, looks fine for undercommit scenarios.  I do wonder what happens
> with 1.5x overcommit, where we might see 1/2 the host cpus with runqueue
> of 2 and 1/2 of the host cpus with a runqueue of 1.  Even with this
> change that scenario still might be fine, but it would be nice to see a
> comparison.
>

Hi Andrew, yes, thanks for pointing out the 1.5x case, which should be
the theoretical worst case.
I tried with two 24-vcpu guests on the same 32 core machine. Here is
the result:

Ebizzy (rec/sec higher is better)
x base
+ patched
     N       Avg        Stddev
x  10     2688.6     347.55917
+  10     2707.6     260.93728

Improvement 0.706%

dbench (Throughput MB/sec higher is better)
x base
+ patched
     N         Avg        Stddev
x  10    3164.712     140.24468
+  10    3244.021     185.92434

Improvement 2.5%

So there is no significant improvement / degradation seen at
1.5x.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
       [not found]         ` <50B68F94.3080907@hp.com>
@ 2012-11-29  2:00           ` Andrew Theurer
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Theurer @ 2012-11-29  2:00 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Raghavendra K T, Andrew Jones, Marcelo Tosatti, Gleb Natapov,
	Peter Zijlstra, H. Peter Anvin, Ingo Molnar, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	LKML, Srivatsa Vaddagiri

On Wed, 2012-11-28 at 14:26 -0800, Chegu Vinod wrote:
> On 11/27/2012 6:23 AM, Chegu Vinod wrote:
> 
> > On 11/27/2012 2:30 AM, Raghavendra K T wrote: 
> > > On 11/26/2012 07:05 PM, Andrew Jones wrote: 
> > > > On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T
> > > > wrote: 
> > > > > From: Peter Zijlstra <peterz@infradead.org> 
> > > > > 
> > > > > In case of undercomitted scenarios, especially in large
> > > > > guests 
> > > > > yield_to overhead is significantly high. when run queue length
> > > > > of 
> > > > > source and target is one, take an opportunity to bail out and
> > > > > return 
> > > > > -ESRCH. This return condition can be further exploited to
> > > > > quickly come 
> > > > > out of PLE handler. 
> > > > > 
> > > > > (History: Raghavendra initially worked on break out of kvm ple
> > > > > handler upon 
> > > > >   seeing source runqueue length = 1, but it had to export rq
> > > > > length). 
> > > > >   Peter came up with the elegant idea of return -ESRCH in
> > > > > scheduler core. 
> > > > > 
> > > > > Signed-off-by: Peter Zijlstra <peterz@infradead.org> 
> > > > > Raghavendra, Checking the rq length of target vcpu condition
> > > > > added.(thanks Avi) 
> > > > > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> 
> > > > > Signed-off-by: Raghavendra K T
> > > > > <raghavendra.kt@linux.vnet.ibm.com> 
> > > > > --- 
> > > > > 
> > > > >   kernel/sched/core.c |   25 +++++++++++++++++++------ 
> > > > >   1 file changed, 19 insertions(+), 6 deletions(-) 
> > > > > 
> > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c 
> > > > > index 2d8927f..fc219a5 100644 
> > > > > --- a/kernel/sched/core.c 
> > > > > +++ b/kernel/sched/core.c 
> > > > > @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); 
> > > > >    * It's the caller's job to ensure that the target task
> > > > > struct 
> > > > >    * can't go away on us before we can do any checks. 
> > > > >    * 
> > > > > - * Returns true if we indeed boosted the target task. 
> > > > > + * Returns: 
> > > > > + *    true (>0) if we indeed boosted the target task. 
> > > > > + *    false (0) if we failed to boost the target. 
> > > > > + *    -ESRCH if there's no task to yield to. 
> > > > >    */ 
> > > > >   bool __sched yield_to(struct task_struct *p, bool preempt) 
> > > > >   { 
> > > > > @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct
> > > > > task_struct *p, bool preempt) 
> > > > > 
> > > > >   again: 
> > > > >       p_rq = task_rq(p); 
> > > > > +    /* 
> > > > > +     * If we're the only runnable task on the rq and target
> > > > > rq also 
> > > > > +     * has only one task, there's absolutely no point in
> > > > > yielding. 
> > > > > +     */ 
> > > > > +    if (rq->nr_running == 1 && p_rq->nr_running == 1) { 
> > > > > +        yielded = -ESRCH; 
> > > > > +        goto out_irq; 
> > > > > +    } 
> > > > > + 
> > > > >       double_rq_lock(rq, p_rq); 
> > > > >       while (task_rq(p) != p_rq) { 
> > > > >           double_rq_unlock(rq, p_rq); 
> > > > > @@ -4310,13 +4322,13 @@ again: 
> > > > >       } 
> > > > > 
> > > > >       if (!curr->sched_class->yield_to_task) 
> > > > > -        goto out; 
> > > > > +        goto out_unlock; 
> > > > > 
> > > > >       if (curr->sched_class != p->sched_class) 
> > > > > -        goto out; 
> > > > > +        goto out_unlock; 
> > > > > 
> > > > >       if (task_running(p_rq, p) || p->state) 
> > > > > -        goto out; 
> > > > > +        goto out_unlock; 
> > > > > 
> > > > >       yielded = curr->sched_class->yield_to_task(rq, p,
> > > > > preempt); 
> > > > >       if (yielded) { 
> > > > > @@ -4329,11 +4341,12 @@ again: 
> > > > >               resched_task(p_rq->curr); 
> > > > >       } 
> > > > > 
> > > > > -out: 
> > > > > +out_unlock: 
> > > > >       double_rq_unlock(rq, p_rq); 
> > > > > +out_irq: 
> > > > >       local_irq_restore(flags); 
> > > > > 
> > > > > -    if (yielded) 
> > > > > +    if (yielded > 0) 
> > > > >           schedule(); 
> > > > > 
> > > > >       return yielded; 
> > > > > 
> > > > 
> > > > Acked-by: Andrew Jones <drjones@redhat.com> 
> > > > 
> > > 
> > > Thank you Drew. 
> > > 
> > > Marcelo Gleb.. Please let me know if you have comments / concerns
> > > on the patches.. 
> > > 
> > > Andrew, Vinod, IMO, the patch set looks good for undercommit
> > > scenarios 
> > > especially for large guests where we do have overhead of vcpu
> > > iteration 
> > > of ple handler.. 
> > > 
> > > . 
> > > 
> > Thanks Raghu. Will try to get this latest patch set evaluated and
> > get back to you. 
> > 
> > 
> Hi Raghu,
> 
> Here is some preliminary data with your latest set of  PLE patches (&
> also with Andrew's throttled yield_to() change).
> 
> Ran a single guest on an 80 core Westmere platform. [Note: Host and
> Guest had the latest kernel from kvm.git and also using the latest
>  qemu from qemu.git as of yesterday morning]. 
> 
> The guest was running an AIM7 high_systime workload. (Note:
> high_systime is a kernel intensive micro-benchmark, but in this case
> it was run just as a workload in the guest to trigger spinlock etc.
> contention in the guest OS and hence PLE; i.e. this is not a real
> benchmark run.) I have run this workload with a constant # (i.e. 2000)
> users with 100 jobs per user. The numbers below represent the # of
> jobs per minute (JPM); higher the better.
> 
>                              40VCPU  60VCPU  80VCPU 
> 
> a) 3.7.0-rc6+ w/ ple_gap=0   ~102K   ~88K    ~81K
> 
> b) 3.7.0-rc6+                 ~53K   ~25K    ~18-20K

> c) 3.7.0-rc6+ w/ PLE patches ~100K   ~81K    ~48K-69K  <- lot of variation from run to run.
> 
> d) 3.7.0-rc6+ w/  throttled  ~101K   ~87K    ~78K
>           yield_to() change
> 

FYI here's the latest throttled yield_to() patch (the one Vinod tested).

Signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ecc5543..61d12ea 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -192,6 +192,7 @@ struct kvm_vcpu {
 	int mode;
 	unsigned long requests;
 	unsigned long guest_debug;
+	unsigned long last_yield_to;
 
 	struct mutex mutex;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..987a339 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/jiffies.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -222,6 +223,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
 	vcpu->pid = NULL;
+	vcpu->last_yield_to = 0;
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
 
@@ -1708,29 +1710,38 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
+	 * A yield_to() can be quite expensive, so we try to limit
+	 * its use to just 1 per jiffie.
+	 */
+	if (me->last_yield_to == jiffies)
+		yield();
+	else {
+	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
-			if (vcpu == me)
-				continue;
-			if (waitqueue_active(&vcpu->wq))
-				continue;
-			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
-				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
-				yielded = 1;
-				break;
+		for (pass = 0; pass < 2 && !yielded; pass++) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				if (!pass && i <= last_boosted_vcpu) {
+					i = last_boosted_vcpu;
+					continue;
+				} else if (pass && i > last_boosted_vcpu)
+					break;
+				if (vcpu == me)
+					continue;
+				if (waitqueue_active(&vcpu->wq))
+					continue;
+				if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+					continue;
+				if (kvm_vcpu_yield_to(vcpu)) {
+					kvm->last_boosted_vcpu = i;
+					me->last_yield_to = jiffies;
+					yielded = 1;
+					break;
+				}
 			}
 		}
 	}




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
  2012-11-26 12:07 [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
  2012-11-26 12:07 ` [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
  2012-11-26 12:08 ` [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
@ 2012-11-29  2:07 ` Chegu Vinod
  2012-11-29  9:49   ` Raghavendra K T
  2 siblings, 1 reply; 27+ messages in thread
From: Chegu Vinod @ 2012-11-29  2:07 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 11/26/2012 4:07 AM, Raghavendra K T wrote:
>   In some special scenarios like #vcpu <= #pcpu, PLE handler may
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU.
>
>   The first patch optimizes all the yield_to by bailing out when there
>   is no need to continue in yield_to (i.e., when there is only one task
>   in source and target rq).
>
>   Second patch uses that in PLE handler. Further when a yield_to fails
>   we do not immediately go out of PLE handler instead we try thrice
>   to have better statistical possibility of false return. Otherwise that
>   would affect moderate overcommit cases.
>   
>   Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
>   around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.
>
>
> base = 3.7.0-rc6
> machine: 32 core mx3850 x5 PLE mc
>
> --+-----------+-----------+-----------+------------+-----------+
>                 ebizzy (rec/sec higher is beter)
> --+-----------+-----------+-----------+------------+-----------+
>      base        stdev       patched     stdev       %improve
> --+-----------+-----------+-----------+------------+-----------+
> 1x   2511.3000    21.5409    6051.8000   170.2592   140.98276
> 2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
> 3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
> 4x   1784.3750   102.2699    2018.7500   187.5723    13.13485
> --+-----------+-----------+-----------+------------+-----------+
>
> --+-----------+-----------+-----------+------------+-----------+
>          dbench (throughput in MB/sec. higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>      base        stdev       patched     stdev       %improve
> --+-----------+-----------+-----------+------------+-----------+
> 1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
> 2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
> 3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
> 4x  3043.1725  3243.7281    4664.4662   5946.5741      53.27643
> --+-----------+-----------+-----------+------------+-----------+
>
> Here is the refernce of no ple result.
>   ebizzy-1x_nople 7592.6000 rec/sec
>   dbench_1x_nople 7853.6960 MB/sec
>
> The result says we can still improve by 60% for ebizzy, but overall we are
> getting impressive performance with the patches.
>
>   Changes Since V2:
>   - Dropped global measures usage patch (Peter Zilstra)
>   - Do not bail out on first failure (Avi Kivity)
>   - Try thrice for the failure of yield_to to get statistically more correct
>     behaviour.
>
>   Changes since V1:
>   - Discard the idea of exporting nrrunning and optimize in core scheduler (Peter)
>   - Use yield() instead of schedule in overcommit scenarios (Rik)
>   - Use loadavg knowledge to detect undercommit/overcommit
>
>   Peter Zijlstra (1):
>    Bail out of yield_to when source and target runqueue has one task
>
>   Raghavendra K T (1):
>    Handle yield_to failure return for potential undercommit case
>
>   Please let me know your comments and suggestions.
>
>   Link for V2:
>   https://lkml.org/lkml/2012/10/29/287
>
>   Link for V1:
>   https://lkml.org/lkml/2012/9/21/168
>
>   kernel/sched/core.c | 25 +++++++++++++++++++------
>   virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
>   2 files changed, 35 insertions(+), 16 deletions(-)
>
> .
>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
       [not found]         ` <50B6B5B5.5060108@hp.com>
@ 2012-11-29  2:20           ` Chegu Vinod
  0 siblings, 0 replies; 27+ messages in thread
From: Chegu Vinod @ 2012-11-29  2:20 UTC (permalink / raw)
  Cc: KVM, LKML

On 11/28/2012 5:09 PM, Chegu Vinod wrote:
> On 11/27/2012 6:23 AM, Chegu Vinod wrote:
>> On 11/27/2012 2:30 AM, Raghavendra K T wrote:
>>> On 11/26/2012 07:05 PM, Andrew Jones wrote:
>>>> On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
>>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>>
>>>>> In case of undercomitted scenarios, especially in large guests
>>>>> yield_to overhead is significantly high. when run queue length of
>>>>> source and target is one, take an opportunity to bail out and return
>>>>> -ESRCH. This return condition can be further exploited to quickly 
>>>>> come
>>>>> out of PLE handler.
>>>>>
>>>>> (History: Raghavendra initially worked on break out of kvm ple 
>>>>> handler upon
>>>>>   seeing source runqueue length = 1, but it had to export rq length).
>>>>>   Peter came up with the elegant idea of return -ESRCH in 
>>>>> scheduler core.
>>>>>
>>>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>>>> Raghavendra, Checking the rq length of target vcpu condition 
>>>>> added.(thanks Avi)
>>>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>> ---
>>>>>
>>>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>> index 2d8927f..fc219a5 100644
>>>>> --- a/kernel/sched/core.c
>>>>> +++ b/kernel/sched/core.c
>>>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>>>    * It's the caller's job to ensure that the target task struct
>>>>>    * can't go away on us before we can do any checks.
>>>>>    *
>>>>> - * Returns true if we indeed boosted the target task.
>>>>> + * Returns:
>>>>> + *    true (>0) if we indeed boosted the target task.
>>>>> + *    false (0) if we failed to boost the target.
>>>>> + *    -ESRCH if there's no task to yield to.
>>>>>    */
>>>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>>>   {
>>>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct 
>>>>> *p, bool preempt)
>>>>>
>>>>>   again:
>>>>>       p_rq = task_rq(p);
>>>>> +    /*
>>>>> +     * If we're the only runnable task on the rq and target rq also
>>>>> +     * has only one task, there's absolutely no point in yielding.
>>>>> +     */
>>>>> +    if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>>> +        yielded = -ESRCH;
>>>>> +        goto out_irq;
>>>>> +    }
>>>>> +
>>>>>       double_rq_lock(rq, p_rq);
>>>>>       while (task_rq(p) != p_rq) {
>>>>>           double_rq_unlock(rq, p_rq);
>>>>> @@ -4310,13 +4322,13 @@ again:
>>>>>       }
>>>>>
>>>>>       if (!curr->sched_class->yield_to_task)
>>>>> -        goto out;
>>>>> +        goto out_unlock;
>>>>>
>>>>>       if (curr->sched_class != p->sched_class)
>>>>> -        goto out;
>>>>> +        goto out_unlock;
>>>>>
>>>>>       if (task_running(p_rq, p) || p->state)
>>>>> -        goto out;
>>>>> +        goto out_unlock;
>>>>>
>>>>>       yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>>>>>       if (yielded) {
>>>>> @@ -4329,11 +4341,12 @@ again:
>>>>>               resched_task(p_rq->curr);
>>>>>       }
>>>>>
>>>>> -out:
>>>>> +out_unlock:
>>>>>       double_rq_unlock(rq, p_rq);
>>>>> +out_irq:
>>>>>       local_irq_restore(flags);
>>>>>
>>>>> -    if (yielded)
>>>>> +    if (yielded > 0)
>>>>>           schedule();
>>>>>
>>>>>       return yielded;
>>>>>
>>>>
>>>> Acked-by: Andrew Jones <drjones@redhat.com>
>>>>
>>>
>>> Thank you Drew.
>>>
>>> Marcelo Gleb.. Please let me know if you have comments / concerns on 
>>> the patches..
>>>
>>> Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
>>> especially for large guests where we do have overhead of vcpu iteration
>>> of ple handler..
>>>
>>> .
>>>
>> Thanks Raghu. Will try to get this latest patch set evaluated and get 
>> back to you.
>>
>> Vinod
>>
>

< Resending as prev. email to the kvm and lkml email aliases bounced 
twice... Apologies for any repeats! >

> Hi Raghu,
>
> Here is some preliminary data with your latest set of PLE patches (&
> also with Andrew's throttled yield_to() change).
>
> Ran a single guest on an 80 core Westmere platform. [Note: Host and
> Guest had the latest kernel from kvm.git and also using the latest
> qemu from qemu.git as of yesterday morning].
>
> The guest was running an AIM7 high_systime workload. (Note:
> high_systime is a kernel intensive micro-benchmark, but in this case
> it was run just as a workload in the guest to trigger spinlock etc.
> contention in the guest OS and hence PLE; i.e. this is not a real
> benchmark run.) I have run this workload with a constant # (i.e. 2000)
> users with 100 jobs per user. The numbers below represent the # of
> jobs per minute (JPM); higher the better.
>
>                              40VCPU  60VCPU  80VCPU
>
> a) 3.7.0-rc6+ w/ ple_gap=0   ~102K   ~88K    ~81K
>
> b) 3.7.0-rc6+                 ~53K   ~25K    ~18-20K
>
> c) 3.7.0-rc6+ w/ PLE patches ~100K   ~81K    ~48K-69K  <- lot of
>    variation from run to run.
>
> d) 3.7.0-rc6+ w/ throttled   ~101K   ~87K    ~78K
>    yield_to() change
>
> ---
>
> The PLE patches case (c) does show improvements in this non-overcommit
> large guest case when compared to case (b). However, at 80-way we
> started to observe quite a bit of variation from run to run, and the
> JPM was lower when compared with the throttled yield_to() change case (d).
>
> For this 80-way run, in case (c) we also noticed that the average time
> spent in the PLE exit (as reported by small samplings from perf kvm
> stat) varied quite a bit and was at times much greater than in the
> throttled yield_to() change case (d). More details are included below.
>
> --
>
> Thanks
>
> Vinod
>
> Case c: PLE patches (80-way)
> -------------------------------
>
> Analyze events for all VCPUs:
>
> VM-EXIT              Samples  Samples%   Time%        Avg time
>
> PAUSE_INSTRUCTION    2478144    91.97%  96.71%   88.38us ( +-  1.63% )
> MSR_WRITE             159384     5.91%   1.05%   14.90us ( +-  1.07% )
> EXTERNAL_INTERRUPT     39507     1.47%   1.31%   74.91us ( +- 25.57% )
> PENDING_INTERRUPT      13607     0.50%   0.12%   20.60us ( +-  7.64% )
> HLT                     1673     0.06%   0.73%  985.40us ( +-  1.30% )
> CPUID                   1508     0.06%   0.01%   10.48us ( +-  3.64% )
> EXCEPTION_NMI            513     0.02%   0.01%   50.90us ( +- 12.10% )
> EPT_MISCONFIG            220     0.01%   0.06%  598.15us ( +- 23.24% )
> MSR_READ                  60     0.00%   0.00%  101.37us ( +- 78.48% )
> RDPMC                     22     0.00%   0.00%   14.30us ( +- 22.46% )
> CR_ACCESS                  2     0.00%   0.00%   18.07us ( +- 55.64% )
> NMI_WINDOW                 1     0.00%   0.00%    6.81us ( +-  -nan% )
>
> Total Samples: 2694641, Total events handled time: 226458587.95us.
>
> Case d: throttled yield_to() change (80-way)
> ----------------------------------------------
>
> Analyze events for all VCPUs:
>
> VM-EXIT              Samples  Samples%   Time%        Avg time
>
> MSR_WRITE            1335080    34.82%   0.52%    5.70us ( +-  0.08% )
> HLT                   945458    24.66%  98.67% 1513.60us ( +-  1.04% )
> PAUSE_INSTRUCTION     792236    20.66%   0.42%    7.66us ( +-  0.18% )
> EXTERNAL_INTERRUPT    446803    11.65%   0.34%   11.01us ( +-  0.16% )
> CPUID                 158986     4.15%   0.02%    1.78us ( +-  0.25% )
> PENDING_INTERRUPT     111164     2.90%   0.02%    2.32us ( +-  0.22% )
> EXCEPTION_NMI          41770     1.09%   0.01%    3.83us ( +-  0.69% )
> EPT_MISCONFIG           1652     0.04%   0.00%   29.02us ( +-  3.56% )
> MSR_READ                 618     0.02%   0.00%    3.30us ( +-  4.16% )
> RDPMC                    228     0.01%   0.00%    2.16us ( +-  1.38% )
> CR_ACCESS                  9     0.00%   0.00%    4.94us ( +-  8.58% )
> NMI_WINDOW                 8     0.00%   0.00%    1.95us ( +-  4.33% )
> IO_INSTRUCTION             1     0.00%   0.00%   15.48us ( +-  -nan% )
> EPT_VIOLATION              1     0.00%   0.00%  752.38us ( +-  -nan% )
>
> Total Samples: 3834014, Total events handled time: 1450387642.32us.
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
  2012-11-29  2:07 ` [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Chegu Vinod
@ 2012-11-29  9:49   ` Raghavendra K T
  0 siblings, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-29  9:49 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Marcelo Tosatti, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 11/29/2012 07:37 AM, Chegu Vinod wrote:
> On 11/26/2012 4:07 AM, Raghavendra K T wrote:
>>   In some special scenarios like #vcpu <= #pcpu, PLE handler may
>> prove very costly, because there is no need to iterate over vcpus
>> and do unsuccessful yield_to burning CPU.
>>
>>   The first patch optimizes all the yield_to by bailing out when there
>>   is no need to continue in yield_to (i.e., when there is only one task
>>   in source and target rq).
>>
>>   Second patch uses that in PLE handler. Further when a yield_to fails
>>   we do not immediately go out of PLE handler instead we try thrice
>>   to have better statistical possibility of false return. Otherwise that
>>   would affect moderate overcommit cases.
>>   Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy
>> 1x and
>>   around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.
>>
>>
>> base = 3.7.0-rc6
>> machine: 32 core mx3850 x5 PLE mc
>>
>> --+-----------+-----------+-----------+------------+-----------+
>>                 ebizzy (rec/sec higher is beter)
>> --+-----------+-----------+-----------+------------+-----------+
>>      base        stdev       patched     stdev       %improve
>> --+-----------+-----------+-----------+------------+-----------+
>> 1x   2511.3000    21.5409    6051.8000   170.2592   140.98276
>> 2x   2679.4000   332.4482    2692.3000   251.4005     0.48145
>> 3x   2253.5000   266.4243    2192.1667   178.9753    -2.72169
>> 4x   1784.3750   102.2699    2018.7500   187.5723    13.13485
>> --+-----------+-----------+-----------+------------+-----------+
>>
>> --+-----------+-----------+-----------+------------+-----------+
>>          dbench (throughput in MB/sec. higher is better)
>> --+-----------+-----------+-----------+------------+-----------+
>>      base        stdev       patched     stdev       %improve
>> --+-----------+-----------+-----------+------------+-----------+
>> 1x  6677.4080   638.5048    10098.0060   3449.7026     51.22643
>> 2x  2012.6760    64.7642    2019.0440     62.6702       0.31639
>> 3x  1302.0783    40.8336    1292.7517     27.0515      -0.71629
>> 4x  3043.1725  3243.7281    4664.4662   5946.5741      53.27643
>> --+-----------+-----------+-----------+------------+-----------+
>>
>> Here is the refernce of no ple result.
>>   ebizzy-1x_nople 7592.6000 rec/sec
>>   dbench_1x_nople 7853.6960 MB/sec
>>
>> The result says we can still improve by 60% for ebizzy, but overall we
>> are
>> getting impressive performance with the patches.
>>
>>   Changes Since V2:
>>   - Dropped global measures usage patch (Peter Zilstra)
>>   - Do not bail out on first failure (Avi Kivity)
>>   - Try thrice for the failure of yield_to to get statistically more
>> correct
>>     behaviour.
>>
>>   Changes since V1:
>>   - Discard the idea of exporting nrrunning and optimize in core
>> scheduler (Peter)
>>   - Use yield() instead of schedule in overcommit scenarios (Rik)
>>   - Use loadavg knowledge to detect undercommit/overcommit
>>
>>   Peter Zijlstra (1):
>>    Bail out of yield_to when source and target runqueue has one task
>>
>>   Raghavendra K T (1):
>>    Handle yield_to failure return for potential undercommit case
>>
>>   Please let me know your comments and suggestions.
>>
>>   Link for V2:
>>   https://lkml.org/lkml/2012/10/29/287
>>
>>   Link for V1:
>>   https://lkml.org/lkml/2012/9/21/168
>>
>>   kernel/sched/core.c | 25 +++++++++++++++++++------
>>   virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
>>   2 files changed, 35 insertions(+), 16 deletions(-)
>>
>> .
>>
> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>

Thanks for testing..


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-28  5:10     ` Raghavendra K T
@ 2012-11-29 12:16       ` Gleb Natapov
  2012-11-30  5:04         ` Raghavendra K T
  2012-12-03 19:56       ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Gleb Natapov @ 2012-11-29 12:16 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Marcelo Tosatti, Peter Zijlstra, H. Peter Anvin, Avi Kivity,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
> On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
> >
> >Don't understand the reasoning behind why 3 is a good choice.
> 
> Here is where I came from. (explaining from scratch for
> completeness, forgive me :))
> In moderate overcommits, we can falsely exit from ple handler even when
> we have preempted task of same VM waiting on other cpus. To reduce this
> problem, we try few times before exiting.
> The problem boils down to:
> what is the probability that we exit ple handler even when we have more
> than 1 task in other cpus. Theoretical worst case should be around 1.5x
> overcommit (As also pointed by Andrew Theurer). [But practical
> worstcase may be around 2x,3x overcommits as indicated by the results
> for the patch series]
> 
> So if p is the probability of finding rq length one on a particular cpu,
> and if we do n tries, then probability of exiting ple handler is:
> 
>  p^(n+1) [ because we would have come across one source with rq length
> 1 and n target cpu rqs  with length 1 ]
> 
> so
> num tries:         probability of aborting ple handler (1.5x overcommit)
>  1                 1/4
>  2                 1/8
>  3                 1/16
> 
> We can increase this probability with more tries, but the problem is
> the overhead.
IIRC Avi (again) had an idea to track vcpu preemption. When a vcpu
thread is preempted we do kvm->preempted_vcpus++, and when it runs
again we do kvm->preempted_vcpus--. The PLE handler can then try harder
if kvm->preempted_vcpus is big, or not try at all if it is zero. A
sketch follows.
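
[ A minimal sketch of what such tracking could look like, purely
illustrative: the counter kvm->preempted_vcpus and the vcpu->preempted
flag are made-up names, not from any posted patch. KVM already routes
vcpu scheduling through the kvm_sched_in()/kvm_sched_out() preempt
notifiers in virt/kvm/kvm_main.c, so the hooks could live there. ]

/* Hypothetical additions: atomic_t preempted_vcpus in struct kvm,
 * bool preempted in struct kvm_vcpu. */

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/* scheduled out while still runnable => preempted */
	if (current->state == TASK_RUNNING) {
		vcpu->preempted = true;
		atomic_inc(&vcpu->kvm->preempted_vcpus);
	}
	kvm_arch_vcpu_put(vcpu);
}

static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	if (vcpu->preempted) {
		vcpu->preempted = false;
		atomic_dec(&vcpu->kvm->preempted_vcpus);
	}
	kvm_arch_vcpu_load(vcpu, cpu);
}

The PLE handler could then scale its number of tries with
atomic_read(&kvm->preempted_vcpus), and skip the vcpu iteration
entirely when the counter reads zero.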

> Also, If we have tried three times that means we would have iterated
> over 3 good eligible vcpus along with many non-eligible candidates. In
> worst case if we iterate all the vcpus, we reduce 1x performance and
> overcommit performance get hit. [ as in results ].
> 
> I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
> concluded 3 is enough.
> 
> Infact I have also run kernbench and hackbench which are giving 5-20%
> improvement.
> 
> [ As a side note , I also thought how about having num_tries = f(n) =
> ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
> overhead and also there is no point in probably making it dependent on
> online cpus ]
> 
> Please let me know if you are happy with this rationale/ or correct me
> if you foresee some problem. (Infact Avi, Rik's concern about false
> exiting made me arrive at 'try' logic which I did not have earlier).
> 
> I am currently trying out the result for 1.5x overcommit will post the
> result.
> 
> >
> >On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
> >>From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> >>
> >>yield_to returns -ESRCH, When source and target of yield_to
> >>run queue length is one. When we see three successive failures of
> >>yield_to we assume we are in potential undercommit case and abort
> >>from PLE handler.
> >>The assumption is backed by low probability of wrong decision
> >>for even worst case scenarios such as average runqueue length
> >>between 1 and 2.
> >>
> >>note that we do not update last boosted vcpu in failure cases.
> >>Thank Avi for raising question on aborting after first fail from yield_to.
> >>
> >>Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >>Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> [...]

--
			Gleb.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-29 12:16       ` Gleb Natapov
@ 2012-11-30  5:04         ` Raghavendra K T
  0 siblings, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-11-30  5:04 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, Peter Zijlstra, H. Peter Anvin, Avi Kivity,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 11/29/2012 05:46 PM, Gleb Natapov wrote:
> On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
>> On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
>>>
>>> Don't understand the reasoning behind why 3 is a good choice.
>>
>> Here is where I came from. (explaining from scratch for
>> completeness, forgive me :))
>> In moderate overcommits, we can falsely exit from ple handler even when
>> we have preempted task of same VM waiting on other cpus. To reduce this
>> problem, we try few times before exiting.
>> The problem boils down to:
>> what is the probability that we exit ple handler even when we have more
>> than 1 task in other cpus. Theoretical worst case should be around 1.5x
>> overcommit (As also pointed by Andrew Theurer). [But practical
>> worstcase may be around 2x,3x overcommits as indicated by the results
>> for the patch series]
>>
>> So if p is the probability of finding rq length one on a particular cpu,
>> and if we do n tries, then probability of exiting ple handler is:
>>
>>   p^(n+1) [ because we would have come across one source with rq length
>> 1 and n target cpu rqs  with length 1 ]
>>
>> so
>> num tries:         probability of aborting ple handler (1.5x overcommit)
>>   1                 1/4
>>   2                 1/8
>>   3                 1/16
>>
>> We can increase this probability with more tries, but the problem is
>> the overhead.
> IIRC Avi (again) had an idea to track vcpu preemption. When vcpu thread
> is preempted we do kvm->preempted_vcpus++, when it runs again we do
> kvm->preempted_vcpus--. PLE handler can try harder if kvm->preempted_vcpus
> is big or do not try at all if it is zero.

Thanks for the reply, Gleb.

Yes.. it was on my next TODO, as you know, and it makes sense to weigh
all these approaches (undercommit patches / throttled yield / preempt
notifier / pvspinlock, and their combinations) to a good extent before
going further. I am happy if these patches are now in a 'good shape to
compare' state. (For the same reason I had posted the dynamic PLE
approach too.)





^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-11-28  5:10     ` Raghavendra K T
  2012-11-29 12:16       ` Gleb Natapov
@ 2012-12-03 19:56       ` Marcelo Tosatti
  2012-12-04 17:49         ` Raghavendra K T
  2012-12-06  6:59         ` Raghavendra K T
  1 sibling, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2012-12-03 19:56 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
> On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
> >
> >Don't understand the reasoning behind why 3 is a good choice.
> 
> Here is where I came from. (explaining from scratch for
> completeness, forgive me :))
> In moderate overcommits, we can falsely exit from ple handler even when
> we have preempted task of same VM waiting on other cpus. To reduce this
> problem, we try few times before exiting.
> The problem boils down to:
> what is the probability that we exit ple handler even when we have more
> than 1 task in other cpus. Theoretical worst case should be around 1.5x
> overcommit (As also pointed by Andrew Theurer). [But practical
> worstcase may be around 2x,3x overcommits as indicated by the results
> for the patch series]
> 
> So if p is the probability of finding rq length one on a particular cpu,
> and if we do n tries, then probability of exiting ple handler is:
> 
>  p^(n+1) [ because we would have come across one source with rq length
> 1 and n target cpu rqs  with length 1 ]
> 
> so
> num tries:         probability of aborting ple handler (1.5x overcommit)
>  1                 1/4
>  2                 1/8
>  3                 1/16
> 
> We can increase this probability with more tries, but the problem is
> the overhead.
> Also, If we have tried three times that means we would have iterated
> over 3 good eligible vcpus along with many non-eligible candidates. In
> worst case if we iterate all the vcpus, we reduce 1x performance and
> overcommit performance get hit. [ as in results ].
> 
> I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
> concluded 3 is enough.
> 
> Infact I have also run kernbench and hackbench which are giving 5-20%
> improvement.
> 
> [ As a side note , I also thought how about having num_tries = f(n) =
> ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
> overhead and also there is no point in probably making it dependent on
> online cpus ]
> 
> Please let me know if you are happy with this rationale/ or correct me
> if you foresee some problem. (Infact Avi, Rik's concern about false
> exiting made me arrive at 'try' logic which I did not have earlier).
> 
> I am currently trying out the result for 1.5x overcommit will post the
> result.

Raghavendra

Makes sense to me. Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-12-03 19:56       ` Marcelo Tosatti
@ 2012-12-04 17:49         ` Raghavendra K T
  2012-12-06  6:59         ` Raghavendra K T
  1 sibling, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-12-04 17:49 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 12/04/2012 01:26 AM, Marcelo Tosatti wrote:
> On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
>> On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
>>>
>>> Don't understand the reasoning behind why 3 is a good choice.
>>
>> Here is where I came from. (explaining from scratch for
>> completeness, forgive me :))
>> In moderate overcommits, we can falsely exit from ple handler even when
>> we have preempted task of same VM waiting on other cpus. To reduce this
>> problem, we try few times before exiting.
>> The problem boils down to:
>> what is the probability that we exit ple handler even when we have more
>> than 1 task in other cpus. Theoretical worst case should be around 1.5x
>> overcommit (As also pointed by Andrew Theurer). [But practical
>> worstcase may be around 2x,3x overcommits as indicated by the results
>> for the patch series]
>>
>> So if p is the probability of finding rq length one on a particular cpu,
>> and if we do n tries, then probability of exiting ple handler is:
>>
>>   p^(n+1) [ because we would have come across one source with rq length
>> 1 and n target cpu rqs  with length 1 ]
>>
>> so
>> num tries:         probability of aborting ple handler (1.5x overcommit)
>>   1                 1/4
>>   2                 1/8
>>   3                 1/16
>>
>> We can increase this probability with more tries, but the problem is
>> the overhead.
>> Also, If we have tried three times that means we would have iterated
>> over 3 good eligible vcpus along with many non-eligible candidates. In
>> worst case if we iterate all the vcpus, we reduce 1x performance and
>> overcommit performance get hit. [ as in results ].
>>
>> I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
>> concluded 3 is enough.
>>
>> Infact I have also run kernbench and hackbench which are giving 5-20%
>> improvement.
>>
>> [ As a side note , I also thought how about having num_tries = f(n) =
>> ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
>> overhead and also there is no point in probably making it dependent on
>> online cpus ]
>>
>> Please let me know if you are happy with this rationale/ or correct me
>> if you foresee some problem. (Infact Avi, Rik's concern about false
>> exiting made me arrive at 'try' logic which I did not have earlier).
>>
>> I am currently trying out the result for 1.5x overcommit will post the
>> result.
>
> Raghavendra
>
> Makes sense to me. Thanks.
>

Hi Marcelo, thanks for looking into the patches.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-12-03 19:56       ` Marcelo Tosatti
  2012-12-04 17:49         ` Raghavendra K T
@ 2012-12-06  6:59         ` Raghavendra K T
  2012-12-08  0:49           ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Raghavendra K T @ 2012-12-06  6:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On 12/04/2012 01:26 AM, Marcelo Tosatti wrote:
> On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
>> On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
>>>
>>> Don't understand the reasoning behind why 3 is a good choice.
>>
>> Here is where I came from. (explaining from scratch for
>> completeness, forgive me :))
>> In moderate overcommits, we can falsely exit from ple handler even when
>> we have preempted task of same VM waiting on other cpus. To reduce this
>> problem, we try few times before exiting.
>> The problem boils down to:
>> what is the probability that we exit ple handler even when we have more
>> than 1 task in other cpus. Theoretical worst case should be around 1.5x
>> overcommit (As also pointed by Andrew Theurer). [But practical
>> worstcase may be around 2x,3x overcommits as indicated by the results
>> for the patch series]
>>
>> So if p is the probability of finding rq length one on a particular cpu,
>> and if we do n tries, then probability of exiting ple handler is:
>>
>>   p^(n+1) [ because we would have come across one source with rq length
>> 1 and n target cpu rqs  with length 1 ]
>>
>> so
>> num tries:         probability of aborting ple handler (1.5x overcommit)
>>   1                 1/4
>>   2                 1/8
>>   3                 1/16
>>
>> We can increase this probability with more tries, but the problem is
>> the overhead.
>> Also, If we have tried three times that means we would have iterated
>> over 3 good eligible vcpus along with many non-eligible candidates. In
>> worst case if we iterate all the vcpus, we reduce 1x performance and
>> overcommit performance get hit. [ as in results ].
>>
>> I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
>> concluded 3 is enough.
>>
>> Infact I have also run kernbench and hackbench which are giving 5-20%
>> improvement.
>>
>> [ As a side note , I also thought how about having num_tries = f(n) =
>> ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
>> overhead and also there is no point in probably making it dependent on
>> online cpus ]
>>
>> Please let me know if you are happy with this rationale/ or correct me
>> if you foresee some problem. (Infact Avi, Rik's concern about false
>> exiting made me arrive at 'try' logic which I did not have earlier).
>>
>> I am currently trying out the result for 1.5x overcommit will post the
>> result.
>
> Raghavendra
>
> Makes sense to me. Thanks.
>

Marcelo,
Do you think this can be considered for the next merge window, or are
you expecting anything else on this patchset?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case
  2012-12-06  6:59         ` Raghavendra K T
@ 2012-12-08  0:49           ` Marcelo Tosatti
  0 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2012-12-08  0:49 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Peter Zijlstra, H. Peter Anvin, Avi Kivity, Gleb Natapov,
	Ingo Molnar, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

On Thu, Dec 06, 2012 at 12:29:02PM +0530, Raghavendra K T wrote:
> On 12/04/2012 01:26 AM, Marcelo Tosatti wrote:
> >On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
> >>On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
> >>>
> >>>Don't understand the reasoning behind why 3 is a good choice.
> >>
> >>Here is where I came from. (explaining from scratch for
> >>completeness, forgive me :))
> >>In moderate overcommits, we can falsely exit from ple handler even when
> >>we have preempted task of same VM waiting on other cpus. To reduce this
> >>problem, we try few times before exiting.
> >>The problem boils down to:
> >>what is the probability that we exit ple handler even when we have more
> >>than 1 task in other cpus. Theoretical worst case should be around 1.5x
> >>overcommit (As also pointed by Andrew Theurer). [But practical
> >>worstcase may be around 2x,3x overcommits as indicated by the results
> >>for the patch series]
> >>
> >>So if p is the probability of finding rq length one on a particular cpu,
> >>and if we do n tries, then probability of exiting ple handler is:
> >>
> >>  p^(n+1) [ because we would have come across one source with rq length
> >>1 and n target cpu rqs  with length 1 ]
> >>
> >>so
> >>num tries:         probability of aborting ple handler (1.5x overcommit)
> >>  1                 1/4
> >>  2                 1/8
> >>  3                 1/16
> >>
> >>We can increase this probability with more tries, but the problem is
> >>the overhead.
> >>Also, If we have tried three times that means we would have iterated
> >>over 3 good eligible vcpus along with many non-eligible candidates. In
> >>worst case if we iterate all the vcpus, we reduce 1x performance and
> >>overcommit performance get hit. [ as in results ].
> >>
> >>I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
> >>concluded 3 is enough.
> >>
> >>Infact I have also run kernbench and hackbench which are giving 5-20%
> >>improvement.
> >>
> >>[ As a side note , I also thought how about having num_tries = f(n) =
> >>ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
> >>overhead and also there is no point in probably making it dependent on
> >>online cpus ]
> >>
> >>Please let me know if you are happy with this rationale/ or correct me
> >>if you foresee some problem. (Infact Avi, Rik's concern about false
> >>exiting made me arrive at 'try' logic which I did not have earlier).
> >>
> >>I am currently trying out the result for 1.5x overcommit will post the
> >>result.
> >
> >Raghavendra
> >
> >Makes sense to me. Thanks.
> >
> 
> Marcelo,
> Do you think this can be considered for next merge window? or you are
> expecting anything else on this patchset.

Nope, not expecting anything else. As for the merge window, it depends
on upstream.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-11-26 12:07 ` [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
  2012-11-26 13:35   ` Andrew Jones
@ 2012-12-14  0:29   ` Marcelo Tosatti
  2012-12-14 15:40     ` Raghavendra K T
  1 sibling, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2012-12-14  0:29 UTC (permalink / raw)
  To: Raghavendra K T, Ingo Molnar
  Cc: Peter Zijlstra, H. Peter Anvin, Gleb Natapov, Ingo Molnar,
	Avi Kivity, Rik van Riel, Srikar, Nikunj A. Dadhania, KVM,
	Jiannan Ouyang, Chegu Vinod, Andrew M. Theurer, LKML,
	Srivatsa Vaddagiri, Andrew Jones

Raghavendra,

Please get this integrated through the x86 tree (Ingo CC'ed).

On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> In case of undercomitted scenarios, especially in large guests
> yield_to overhead is significantly high. when run queue length of
> source and target is one, take an opportunity to bail out and return
> -ESRCH. This return condition can be further exploited to quickly come
> out of PLE handler.
> 
> (History: Raghavendra initially worked on break out of kvm ple handler upon
>  seeing source runqueue length = 1, but it had to export rq length).
>  Peter came up with the elegant idea of return -ESRCH in scheduler core.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
> 
>  kernel/sched/core.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8927f..fc219a5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *	true (>0) if we indeed boosted the target task.
> + *	false (0) if we failed to boost the target.
> + *	-ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  
>  again:
>  	p_rq = task_rq(p);
> +	/*
> +	 * If we're the only runnable task on the rq and target rq also
> +	 * has only one task, there's absolutely no point in yielding.
> +	 */
> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}
> +
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
> @@ -4310,13 +4322,13 @@ again:
>  	}
>  
>  	if (!curr->sched_class->yield_to_task)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (curr->sched_class != p->sched_class)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +		goto out_unlock;
>  
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4329,11 +4341,12 @@ again:
>  			resched_task(p_rq->curr);
>  	}
>  
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
>  
> -	if (yielded)
> +	if (yielded > 0)
>  		schedule();
>  
>  	return yielded;
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-12-14  0:29   ` Marcelo Tosatti
@ 2012-12-14 15:40     ` Raghavendra K T
  2012-12-19  5:35       ` Raghavendra K T
  0 siblings, 1 reply; 27+ messages in thread
From: Raghavendra K T @ 2012-12-14 15:40 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Ingo Molnar, Peter Zijlstra, H. Peter Anvin, Gleb Natapov,
	Ingo Molnar, Avi Kivity, Rik van Riel, Srikar,
	Nikunj A. Dadhania, KVM, Jiannan Ouyang, Chegu Vinod,
	Andrew M. Theurer, LKML, Srivatsa Vaddagiri, Andrew Jones

Hi Ingo,

Could you please take this into the x86 tree?

Thanks,
On 12/14/2012 05:59 AM, Marcelo Tosatti wrote:
> Raghavendra,
>
> Please get this integrate through x86 tree (Ingo CC'ed).
>
> On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> In case of undercomitted scenarios, especially in large guests
>> yield_to overhead is significantly high. when run queue length of
>> source and target is one, take an opportunity to bail out and return
>> -ESRCH. This return condition can be further exploited to quickly come
>> out of PLE handler.
>>
>> (History: Raghavendra initially worked on break out of kvm ple handler upon
>>   seeing source runqueue length = 1, but it had to export rq length).
>>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
>>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>
>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2d8927f..fc219a5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>    * It's the caller's job to ensure that the target task struct
>>    * can't go away on us before we can do any checks.
>>    *
>> - * Returns true if we indeed boosted the target task.
>> + * Returns:
>> + *	true (>0) if we indeed boosted the target task.
>> + *	false (0) if we failed to boost the target.
>> + *	-ESRCH if there's no task to yield to.
>>    */
>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>   {
>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>
>>   again:
>>   	p_rq = task_rq(p);
>> +	/*
>> +	 * If we're the only runnable task on the rq and target rq also
>> +	 * has only one task, there's absolutely no point in yielding.
>> +	 */
>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>> +		yielded = -ESRCH;
>> +		goto out_irq;
>> +	}
>> +
>>   	double_rq_lock(rq, p_rq);
>>   	while (task_rq(p) != p_rq) {
>>   		double_rq_unlock(rq, p_rq);
>> @@ -4310,13 +4322,13 @@ again:
>>   	}
>>
>>   	if (!curr->sched_class->yield_to_task)
>> -		goto out;
>> +		goto out_unlock;
>>
>>   	if (curr->sched_class != p->sched_class)
>> -		goto out;
>> +		goto out_unlock;
>>
>>   	if (task_running(p_rq, p) || p->state)
>> -		goto out;
>> +		goto out_unlock;
>>
>>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>>   	if (yielded) {
>> @@ -4329,11 +4341,12 @@ again:
>>   			resched_task(p_rq->curr);
>>   	}
>>
>> -out:
>> +out_unlock:
>>   	double_rq_unlock(rq, p_rq);
>> +out_irq:
>>   	local_irq_restore(flags);
>>
>> -	if (yielded)
>> +	if (yielded > 0)
>>   		schedule();
>>
>>   	return yielded;
>>
>> --

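To make the new three-way contract concrete (greater than zero: target
boosted; zero: boost failed; -ESRCH: no task to yield to), here is a
minimal userspace C sketch of how a PLE-style caller can bail out early
on -ESRCH. It is a toy model with invented names such as
fake_yield_to(), not the kernel or kvm code:

#include <errno.h>
#include <stdio.h>

/*
 * Toy stand-in for yield_to() with the patched return contract:
 * >0 boosted the target, 0 failed to boost, -ESRCH no task to yield to.
 */
static int fake_yield_to(int src_nr_running, int dst_nr_running)
{
	if (src_nr_running == 1 && dst_nr_running == 1)
		return -ESRCH;		/* undercommitted: nothing to do */
	return dst_nr_running > 1;	/* pretend the boost succeeds */
}

int main(void)
{
	/* {source rq length, target rq length} for three attempts */
	int rqs[][2] = { {2, 3}, {1, 2}, {1, 1} };

	for (unsigned int i = 0; i < sizeof(rqs) / sizeof(rqs[0]); i++) {
		int ret = fake_yield_to(rqs[i][0], rqs[i][1]);

		if (ret == -ESRCH) {
			/* A PLE-style loop can stop scanning vcpus here. */
			printf("attempt %u: -ESRCH, bail out early\n", i);
			break;
		}
		printf("attempt %u: %s\n", i, ret > 0 ? "boosted" : "failed");
	}
	return 0;
}

The sketch only illustrates the control flow: a plain boolean return
cannot distinguish "boost failed, keep trying" from "no task to yield
to, stop scanning", and the -ESRCH case is what lets the caller quit
early.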

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
  2012-12-14 15:40     ` Raghavendra K T
@ 2012-12-19  5:35       ` Raghavendra K T
  0 siblings, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2012-12-19  5:35 UTC (permalink / raw)
  To: Marcelo Tosatti, Ingo Molnar, Ingo Molnar
  Cc: Peter Zijlstra, H. Peter Anvin, Gleb Natapov, Avi Kivity,
	Rik van Riel, Srikar, Nikunj A. Dadhania, KVM, Jiannan Ouyang,
	Chegu Vinod, Andrew M. Theurer, LKML, Srivatsa Vaddagiri,
	Andrew Jones

[I forgot to put Ingo on the To: line last time]

Ingo,
  Could you please take this into the x86 tree?
It already has:
Acked-by: Andrew Jones <drjones@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>

Marcelo, do you want to add your Acked-by/Reviewed-by?

On 12/14/2012 09:10 PM, Raghavendra K T wrote:
> Hi Ingo,
>
> Could you please take this into the x86 tree?
>
> Thanks,
> On 12/14/2012 05:59 AM, Marcelo Tosatti wrote:
>> Raghavendra,
>>
>> Please get this integrated through the x86 tree (Ingo CC'ed).
>>
>> On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> In undercommitted scenarios, especially with large guests, the
>>> yield_to overhead is significantly high. When the run queue length of
>>> both source and target is one, take the opportunity to bail out and
>>> return -ESRCH. This return condition can be further exploited to
>>> quickly come out of the PLE handler.
>>>
>>> (History: Raghavendra initially worked on breaking out of the kvm PLE
>>> handler upon seeing source runqueue length = 1, but that required
>>> exporting the rq length. Peter came up with the elegant idea of
>>> returning -ESRCH in the scheduler core.)
>>>
>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>> Raghavendra: added the check on the target vcpu's rq length
>>> (thanks Avi).
>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> ---
>>>
>>>   kernel/sched/core.c |   25 +++++++++++++++++++------
>>>   1 file changed, 19 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 2d8927f..fc219a5 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
>>>    * It's the caller's job to ensure that the target task struct
>>>    * can't go away on us before we can do any checks.
>>>    *
>>> - * Returns true if we indeed boosted the target task.
>>> + * Returns:
>>> + *	true (>0) if we indeed boosted the target task.
>>> + *	false (0) if we failed to boost the target.
>>> + *	-ESRCH if there's no task to yield to.
>>>    */
>>>   bool __sched yield_to(struct task_struct *p, bool preempt)
>>>   {
>>> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>>>
>>>   again:
>>>   	p_rq = task_rq(p);
>>> +	/*
>>> +	 * If we're the only runnable task on the rq and target rq also
>>> +	 * has only one task, there's absolutely no point in yielding.
>>> +	 */
>>> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>> +		yielded = -ESRCH;
>>> +		goto out_irq;
>>> +	}
>>> +
>>>   	double_rq_lock(rq, p_rq);
>>>   	while (task_rq(p) != p_rq) {
>>>   		double_rq_unlock(rq, p_rq);
>>> @@ -4310,13 +4322,13 @@ again:
>>>   	}
>>>
>>>   	if (!curr->sched_class->yield_to_task)
>>> -		goto out;
>>> +		goto out_unlock;
>>>
>>>   	if (curr->sched_class != p->sched_class)
>>> -		goto out;
>>> +		goto out_unlock;
>>>
>>>   	if (task_running(p_rq, p) || p->state)
>>> -		goto out;
>>> +		goto out_unlock;
>>>
>>>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>>>   	if (yielded) {
>>> @@ -4329,11 +4341,12 @@ again:
>>>   			resched_task(p_rq->curr);
>>>   	}
>>>
>>> -out:
>>> +out_unlock:
>>>   	double_rq_unlock(rq, p_rq);
>>> +out_irq:
>>>   	local_irq_restore(flags);
>>>
>>> -	if (yielded)
>>> +	if (yielded > 0)
>>>   		schedule();
>>>
>>>   	return yielded;
>>>
>>> --


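One more detail worth noting in the diff above: the nr_running test runs
before double_rq_lock(), so it is an optimistic, unlocked read. A stale
value is benign here; the worst case is one unnecessary locked pass or
one skipped yield attempt, never a correctness problem. Below is a small
userspace C sketch of that general pattern (cheap racy check first, real
lock only on the slow path), with invented names; it illustrates the
shape of the optimization, not the scheduler code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int nr_running = 1;	/* toy stand-in for rq->nr_running */

/* Returns 0 if the locked slow path ran, -1 if the cheap check bailed. */
static int maybe_do_expensive_work(void)
{
	/*
	 * Optimistic unlocked read, like the nr_running == 1 test in the
	 * patch: racy, but a stale answer only costs one redundant slow
	 * pass or one skipped attempt that would likely have failed.
	 */
	if (atomic_load_explicit(&nr_running, memory_order_relaxed) == 1)
		return -1;

	pthread_mutex_lock(&slow_lock);	/* slow path: take the real lock */
	/* ... expensive work guarded by the lock would go here ... */
	pthread_mutex_unlock(&slow_lock);
	return 0;
}

int main(void)
{
	printf("first try:  %d\n", maybe_do_expensive_work());
	atomic_store(&nr_running, 3);
	printf("second try: %d\n", maybe_do_expensive_work());
	return 0;
}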
^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-12-19  5:37 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-26 12:07 [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Raghavendra K T
2012-11-26 12:07 ` [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task Raghavendra K T
2012-11-26 13:35   ` Andrew Jones
2012-11-27 10:30     ` Raghavendra K T
2012-11-27 14:04       ` Andrew Theurer
2012-11-28  7:03         ` Raghavendra K T
2012-11-27 14:23       ` Chegu Vinod
     [not found]         ` <50B68F94.3080907@hp.com>
2012-11-29  2:00           ` Andrew Theurer
     [not found]         ` <50B6B5B5.5060108@hp.com>
2012-11-29  2:20           ` Chegu Vinod
2012-12-14  0:29   ` Marcelo Tosatti
2012-12-14 15:40     ` Raghavendra K T
2012-12-19  5:35       ` Raghavendra K T
2012-11-26 12:08 ` [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case Raghavendra K T
2012-11-26 13:43   ` Andrew Jones
2012-11-26 14:06     ` Andrew Jones
2012-11-27 10:27     ` Raghavendra K T
2012-11-27 13:22       ` Andrew Jones
2012-11-28  1:12   ` Marcelo Tosatti
2012-11-28  5:10     ` Raghavendra K T
2012-11-29 12:16       ` Gleb Natapov
2012-11-30  5:04         ` Raghavendra K T
2012-12-03 19:56       ` Marcelo Tosatti
2012-12-04 17:49         ` Raghavendra K T
2012-12-06  6:59         ` Raghavendra K T
2012-12-08  0:49           ` Marcelo Tosatti
2012-11-29  2:07 ` [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios Chegu Vinod
2012-11-29  9:49   ` Raghavendra K T
