* [PATCH] sched: Give idle_balance() a break when it does not move tasks.
From: Jason Low @ 2013-08-12  8:42 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Jason Low
  Cc: LKML, Linus Torvalds, Mike Galbraith, Thomas Gleixner,
	Paul Turner, Alex Shi, Preeti U Murthy, Vincent Guittot,
	Morten Rasmussen, Namhyung Kim, Andrew Morton, Kees Cook,
	Mel Gorman, Rik van Riel, aswin, scott.norton, chegu_vinod,
	Srikar Dronamraju, Bui, Tuan, Waiman Long, Makphaibulchoke,
	Thavatchai, Bueso, Davidlohr

When running benchmarks on an 8 socket (80 core) machine, newidle load
balance was attempted frequently, but most of the time no tasks were
moved. The most common reason was that find_busiest_group() returned
NULL (for example, because groups were already balanced within their
domains). Another common reason was that move_tasks() was attempted but
could not find a suitable task to move.

In situations where idle balance does not need to, or is unable to,
move tasks, attempting it frequently can lower performance by consuming
kernel time while accomplishing nothing. In these situations, perf
reports functions such as update_sd_lb_stats() and idle_cpu(), along
with the rq lock acquisition, among the hottest kernel paths.

With this patch, idle balance attempts for a CPU are blocked for a
brief duration (10 ms) after a newidle load balance on that CPU fails
to move any tasks. The idea is that if a CPU has just attempted a
newidle load balance and found either that no balancing is needed or
that no task can be moved over for any of its sched domains, then it
should avoid re-attempting the balance on that same CPU too soon.
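
Condensed, the gating this patch adds to idle_balance() looks roughly
like the following (a sketch only; see the diff below for the actual
change):

	/* Skip newidle balance while we are still inside the block window. */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    time_before(jiffies, this_rq->next_newidle_balance))
		return;

	/* ... load_balance() is attempted for each SD_BALANCE_NEWIDLE domain ... */

	/* A balance was attempted but nothing was pulled: block for ~10 ms. */
	if (load_balance_attempted && !pulled_task)
		this_rq->next_newidle_balance = jiffies + (HZ / 100);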

This patch provides good performance boosts for many AIM7 workloads on
an 8 socket (80 core) machine, while preserving the performance of a
few other workloads I have been running that benefit from idle
balancing.

The table below compares the average jobs per minute at 10-100,
200-1000, and 1100-2000 users between the vanilla 3.11-rc4 kernel and
the 3.11-rc4 kernel with this patch, with Hyperthreading enabled.

----------------------------------------------------------------
workload     | % improvement   | % improvement   | % improvement
             | with patch      | with patch      | with patch
             | 1100-2000 users | 200-1000 users  | 10-100 users
----------------------------------------------------------------
alltests     | +13.9%          | +5.6%          | +0.9%
----------------------------------------------------------------
custom       | +20.3%          | +20.7%         | +12.2%
----------------------------------------------------------------
disk         | +22.3%          | +18.7%         | +18.3%
----------------------------------------------------------------
fserver      | +62.6%          | +27.3%         | -0.9%
----------------------------------------------------------------
high_systime | +17.7%          | +10.7%         | +0.6%
----------------------------------------------------------------
new_fserver  | +53.1%          | +20.2%         | -0.2%
----------------------------------------------------------------
shared       | +10.3%          | +10.8%         | +10.7%
----------------------------------------------------------------


Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |   10 +++++++++-
 kernel/sched/sched.h |    5 +++++
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7c32cb..b1fcf6d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6458,6 +6458,7 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->next_newidle_balance = jiffies;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9565645..70d96d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5277,10 +5277,12 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
+	bool load_balance_attempted = false;
 
 	this_rq->idle_stamp = rq_clock(this_rq);
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost)
+	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
+	    time_before(jiffies, this_rq->next_newidle_balance))
 		return;
 
 	/*
@@ -5298,6 +5300,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 			continue;
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			load_balance_attempted = true;
+
 			/* If we've pulled tasks over stop searching: */
 			pulled_task = load_balance(this_cpu, this_rq,
 						   sd, CPU_NEWLY_IDLE, &balance);
@@ -5322,6 +5326,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 		 */
 		this_rq->next_balance = next_balance;
 	}
+
+	/* Give idle balance on this CPU a break when it isn't moving tasks */
+	if (load_balance_attempted && !pulled_task)
+		this_rq->next_newidle_balance = jiffies + (HZ / 100);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..6a70be9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -524,6 +524,11 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+
+#ifdef CONFIG_SMP
+	/* set when we want to block idle balance */
+	unsigned long next_newidle_balance;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.1





* Re: [PATCH] sched: Give idle_balance() a break when it does not move tasks.
From: Srikar Dronamraju @ 2013-08-12 11:00 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Linus Torvalds,
	Mike Galbraith, Thomas Gleixner, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Andrew Morton, Kees Cook, Mel Gorman, Rik van Riel, aswin,
	scott.norton, chegu_vinod, Bui, Tuan, Waiman Long,
	Makphaibulchoke, Thavatchai, Bueso, Davidlohr

>  	/*
> @@ -5298,6 +5300,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
>  			continue;
> 
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
> +			load_balance_attempted = true;
> +
>  			/* If we've pulled tasks over stop searching: */
>  			pulled_task = load_balance(this_cpu, this_rq,
>  						   sd, CPU_NEWLY_IDLE, &balance);
> @@ -5322,6 +5326,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
>  		 */
>  		this_rq->next_balance = next_balance;
>  	}
> +
> +	/* Give idle balance on this CPU a break when it isn't moving tasks */
> +	if (load_balance_attempted && !pulled_task)
> +		this_rq->next_newidle_balance = jiffies + (HZ / 100);
>  }

Looks reasonable. However, should we do this per sd rather than per rq,
i.e. move next_newidle_balance into sched_domain? Then, if a
load_balance in the newly-idle context was unsuccessful, we would skip
load_balance for that sd on the next newly-idle balance.
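
For illustration only, a rough sketch of that per-sd variant (the field
name and exact placement are hypothetical):

	/* Hypothetical field added to struct sched_domain: */
	unsigned long next_newidle_balance;

	/* In idle_balance(), inside the for_each_domain() loop: */
	if (sd->flags & SD_BALANCE_NEWIDLE) {
		/* Skip this sd while it is still blocked after a failed attempt. */
		if (time_before(jiffies, sd->next_newidle_balance))
			continue;

		pulled_task = load_balance(this_cpu, this_rq,
					   sd, CPU_NEWLY_IDLE, &balance);

		if (!pulled_task)
			sd->next_newidle_balance = jiffies + (HZ / 100);
	}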

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched: Give idle_balance() a break when it does not move tasks.
From: Jason Low @ 2013-08-12 20:19 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Linus Torvalds,
	Mike Galbraith, Thomas Gleixner, Paul Turner, Alex Shi,
	Preeti U Murthy, Vincent Guittot, Morten Rasmussen, Namhyung Kim,
	Andrew Morton, Kees Cook, Mel Gorman, Rik van Riel, aswin,
	scott.norton, chegu_vinod, Bui, Tuan, Waiman Long,
	Makphaibulchoke, Thavatchai, Bueso, Davidlohr

On Mon, 2013-08-12 at 16:30 +0530, Srikar Dronamraju wrote:
> >  	/*
> > @@ -5298,6 +5300,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> >  			continue;
> > 
> >  		if (sd->flags & SD_BALANCE_NEWIDLE) {
> > +			load_balance_attempted = true;
> > +
> >  			/* If we've pulled tasks over stop searching: */
> >  			pulled_task = load_balance(this_cpu, this_rq,
> >  						   sd, CPU_NEWLY_IDLE, &balance);
> > @@ -5322,6 +5326,10 @@ void idle_balance(int this_cpu, struct rq *this_rq)
> >  		 */
> >  		this_rq->next_balance = next_balance;
> >  	}
> > +
> > +	/* Give idle balance on this CPU a break when it isn't moving tasks */
> > +	if (load_balance_attempted && !pulled_task)
> > +		this_rq->next_newidle_balance = jiffies + (HZ / 100);
> >  }
> 
> Looks reasonable. However should we do this per sd and not per rq. i.e
> move the next_newidle_balance to sched_domain. So if we find a
> load_balance in newly_idle context that wasn't successful, we skip
> load_balance for that sd in the next newly idle balance.

I wonder: if we skip newidle balance for a domain after a newidle
balance attempt for a CPU did not move tasks, would that potentially
cause some "unfairness" for all the other CPUs within the domain?

Perhaps we could reduce the duration that idle balance is blocked, from
10 ms down to something much smaller, if we were to block on a
per-domain basis.

Peter, any thoughts on which method is preferable?

Thanks for the suggestion,
Jason



