* [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
@ 2012-06-19 12:13 Prashanth Nageshappa
  2012-06-20 17:38 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Prashanth Nageshappa @ 2012-06-19 12:13 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, LKML, roland, Srivatsa Vaddagiri, efault,
	Ingo Molnar

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

Current load balance scheme requires only one cpu in a sched_group (balance_cpu)
to look at other peer sched_groups for imbalance and pull tasks towards
itself from a busy cpu. Tasks thus pulled by balance_cpu could later get
picked up by cpus that are in the same sched_group as that of balance_cpu.
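
(Conceptually -- a rough sketch, not the code the kernel actually runs: for a
given domain, the first idle cpu of the local group, or the group's first cpu
if none is idle, acts as balance_cpu; the hypothetical helper below only
illustrates that convention using real cpumask primitives.)

static bool acts_as_balance_cpu(int cpu, const struct cpumask *group_mask)
{
	int candidate = nr_cpu_ids, i;

	for_each_cpu(i, group_mask) {
		if (idle_cpu(i)) {
			candidate = i;		/* first idle cpu wins */
			break;
		}
		if (candidate == nr_cpu_ids)
			candidate = i;		/* else first cpu of the group */
	}
	return cpu == candidate;		/* everyone else skips this level */
}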

This scheme however fails to pull tasks that are not allowed to run on
balance_cpu (but are allowed to run on other cpus in its sched_group).
That can affect fairness and in some worst case scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as below:

          Core0            Core1
         /     \          /     \
	C0     C1	 C2     C3
        |      |         |      |
        v      v         v      v
	F0     T1        F1     [idle]
			 T2

F0 = SCHED_FIFO task (pinned to C0)
F1 = SCHED_FIFO task (pinned to C2)
T1 = SCHED_OTHER task (pinned to C1)
T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls it. Between C0
and C1 however, C0 is required to look for imbalance between cores, which will
fail to pull T2 towards Core0. T2 will starve eternally in this case. The same
scenario can arise in presence of non-rt tasks as well (say we replace F1 with
high irq load or with a very high priority SCHED_OTHER task that can't move out
of C2).
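
(For illustration only: a minimal userspace sketch of the F1/T2 half of the
above setup. Cpu numbers assume the 4-cpu layout shown, pin_to() is a made-up
helper, SCHED_FIFO needs root, and F0/T1 are left out since F1 vs T2 is where
the problem shows up.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void pin_to(const int *cpus, int n)
{
	cpu_set_t set;
	int i;

	CPU_ZERO(&set);
	for (i = 0; i < n; i++)
		CPU_SET(cpus[i], &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };
	int c2 = 2, t2_cpus[] = { 1, 2 };

	if (fork() == 0) {			/* F1: FIFO hog pinned to C2 */
		pin_to(&c2, 1);
		if (sched_setscheduler(0, SCHED_FIFO, &sp))
			perror("sched_setscheduler");
		for (;;)
			;
	}
	if (fork() == 0) {			/* T2: SCHED_OTHER on C1 and C2 */
		pin_to(t2_cpus, 2);
		for (;;)
			;
	}
	pause();				/* watch T2's cpu time stall */
	return 0;
}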

We tackle this problem by having balance_cpu move pinned tasks to one of its
sibling cpus (where they can run). We first check if load balance goal can be
met by ignoring pinned tasks, failing which we retry move_tasks() with a new
env->dst_cpu.
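
(In outline, the retry this adds to load_balance() is as below -- a condensed
sketch; the exact code is in the diff further down.)

more_balance:
	cur_ld_moved = move_tasks(&env);	/* can_migrate_task() may set
						 * LBF_SOME_PINNED and record
						 * env.new_dst_cpu */
	ld_moved += cur_ld_moved;

	if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0 &&
	    lb_iterations++ < max_lb_iterations) {
		/* retry the same src_cpu, but with a sibling as destination */
		env.dst_rq	 = cpu_rq(env.new_dst_cpu);
		env.dst_cpu	 = env.new_dst_cpu;
		env.flags	&= ~LBF_SOME_PINNED;
		env.loop	 = 0;
		env.loop_break	 = sched_nr_migrate_break;
		goto more_balance;
	}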

This patch modifies load balance semantics on who can move load towards a given
cpu in a given sched_domain. Before this patch, a given_cpu or a ilb_cpu acting
on behalf of an idle given_cpu is responsible for moving load to given_cpu.
With this patch applied, balance_cpu can in addition decide on moving some load
to a given_cpu. There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding *independently*
and at *same* time to move some load to a given_cpu). However we should see
less of such conflicting decisions in practice and moreover subsequent load
balance cycles should correct the excess load moved to given_cpu.


Changes since v2 (https://lkml.org/lkml/2012/6/6/252):
 - updated change log to make it more concise
 - modified comments in the code
 - modified code to remember only one new_dst_cpu per attempt
 - modified code to limit the number of load balance attempts to the number of
    cpus in the sched_group

Changes since v1 (https://lkml.org/lkml/2012/6/4/52):

 - updated change log to describe the problem in a more generic sense and
    the different solutions considered
 - used cur_ld_moved instead of old_ld_moved
 - modified comments in the code
 - reset env.loop_break before retrying

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 939fd63..cff3659 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_SOME_PINNED 0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3108,6 +3109,8 @@ struct lb_env {
 	int			dst_cpu;
 	struct rq		*dst_rq;
 
+	struct cpumask		*dst_grpmask;
+	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
 	unsigned int		flags;
@@ -3198,9 +3201,31 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) are cache-hot on their current CPU.
 	 */
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+		int new_dst_cpu;
+
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
+		/*
+		 * Remember if this task can be migrated to any other cpu in
+		 * our sched_group. We may want to revisit it if we couldn't
+		 * meet load balance goals by pulling other tasks on src_cpu.
+		 *
+		 * Also avoid computing new_dst_cpu if we have already computed
+		 * one in current iteration.
+		 */
+		if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+			return 0;
+
+		new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+						tsk_cpus_allowed(p));
+		if (new_dst_cpu < nr_cpu_ids) {
+			env->flags |= LBF_SOME_PINNED;
+			env->new_dst_cpu = new_dst_cpu;
+		}
 		return 0;
 	}
+
+	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
 	if (task_running(env->src_rq, p)) {
@@ -4440,7 +4465,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *balance)
 {
-	int ld_moved, active_balance = 0;
+	int ld_moved, cur_ld_moved, active_balance = 0;
+	int lb_iterations, max_lb_iterations;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -4450,12 +4476,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.sd		    = sd,
 		.dst_cpu	    = this_cpu,
 		.dst_rq		    = this_rq,
+		.dst_grpmask	    = sched_group_cpus(sd->groups),
 		.idle		    = idle,
 		.loop_break	    = sched_nr_migrate_break,
 		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
+	max_lb_iterations = cpumask_weight(env.dst_grpmask);
 
 	schedstat_inc(sd, lb_count[idle]);
 
@@ -4483,6 +4511,7 @@ redo:
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
 	ld_moved = 0;
+	lb_iterations = 1;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -4502,7 +4531,13 @@ more_balance:
 		double_rq_lock(this_rq, busiest);
 		if (!env.loop)
 			update_h_load(env.src_cpu);
-		ld_moved += move_tasks(&env);
+
+		/*
+		 * cur_ld_moved = load moved in current iteration
+		 * ld_moved = cumulative load moved across iterations
+		 */
+		cur_ld_moved = move_tasks(&env);
+		ld_moved += cur_ld_moved;
 		double_rq_unlock(this_rq, busiest);
 		local_irq_restore(flags);
 
@@ -4514,8 +4549,42 @@ more_balance:
 		/*
 		 * some other cpu did the load balance for us.
 		 */
-		if (ld_moved && this_cpu != smp_processor_id())
-			resched_cpu(this_cpu);
+		if (cur_ld_moved && env.dst_cpu != smp_processor_id())
+			resched_cpu(env.dst_cpu);
+
+		/*
+		 * Revisit (pinned) tasks on src_cpu that couldn't be moved to
+		 * us and move them to an alternate dst_cpu in our sched_group
+		 * where they can run. The upper limit on how many times we
+		 * iterate on same src_cpu is dependent on number of cpus in our
+		 * sched_group.
+		 *
+		 * This changes load balance semantics a bit on who can move
+		 * load to a given_cpu. In addition to the given_cpu itself
+		 * (or a ilb_cpu acting on its behalf where given_cpu is
+		 * nohz-idle), we now have balance_cpu in a position to move
+		 * load to given_cpu. In rare situations, this may cause
+		 * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
+		 * _independently_ and at _same_ time to move some load to
+		 * given_cpu) causing exceess load to be moved to given_cpu.
+		 * This however should not happen so much in practice and
+		 * moreover subsequent load balance cycles should correct the
+		 * excess load moved.
+		 */
+		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0 &&
+				lb_iterations++ < max_lb_iterations) {
+			this_rq = cpu_rq(env.new_dst_cpu);
+			env.dst_rq = this_rq;
+			env.dst_cpu = env.new_dst_cpu;
+			env.flags &= ~LBF_SOME_PINNED;
+			env.loop = 0;
+			env.loop_break = sched_nr_migrate_break;
+			/*
+			 * Go back to "more_balance" rather than "redo" since we
+			 * need to continue with same src_cpu.
+			 */
+			goto more_balance;
+		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {



* Re: [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
  2012-06-19 12:13 [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task Prashanth Nageshappa
@ 2012-06-20 17:38 ` Peter Zijlstra
  2012-06-21  3:40   ` Srivatsa Vaddagiri
  2012-06-20 17:40 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2012-06-20 17:38 UTC (permalink / raw)
  To: Prashanth Nageshappa
  Cc: mingo, LKML, roland, Srivatsa Vaddagiri, efault, Ingo Molnar

On Tue, 2012-06-19 at 17:43 +0530, Prashanth Nageshappa wrote:
> T2 will starve eternally in this case. The same
> scenario can arise in presence of non-rt tasks as well (say we replace F1 with
> high irq load or with a very high priority SCHED_OTHER task that can't move out
> of C2). 

Uhm, no. In the case where both F1 and T2 are SCHED_OTHER starvation is
impossible.

What can happen with pure SCHED_OTHER affinities is being less fair than
desired.

Anyway, I took the patch with a few minor edits -- ie. we don't need to
reset loop_break, it's never changed (same for the ALL_PINNED patch you
sent).




* Re: [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
  2012-06-19 12:13 [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task Prashanth Nageshappa
  2012-06-20 17:38 ` Peter Zijlstra
@ 2012-06-20 17:40 ` Peter Zijlstra
  2012-07-06  6:23 ` [tip:sched/core] sched: Improve balance_cpu() " tip-bot for Srivatsa Vaddagiri
  2012-07-24 14:20 ` tip-bot for Srivatsa Vaddagiri
  3 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2012-06-20 17:40 UTC (permalink / raw)
  To: Prashanth Nageshappa
  Cc: mingo, LKML, roland, Srivatsa Vaddagiri, efault, Ingo Molnar

On Tue, 2012-06-19 at 17:43 +0530, Prashanth Nageshappa wrote:
> +               new_dst_cpu = cpumask_first_and(env->dst_grpmask,
> +                                               tsk_cpus_allowed(p));


> +               if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0 &&
> +                               lb_iterations++ < max_lb_iterations) { 

Hmm, too bad dst_grpmask isn't an actual mask, otherwise we could simply
clear the cpu we just had, just like the ALL_PINNED does with cpus.
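
For illustration, roughly what that would look like if load_balance() gave the
env a locally owned copy of the group mask (say a per-cpu scratch mask
'dst_cpus' set up next to the existing 'cpus' copy) -- hypothetical, not what
the patch does:

	if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
		/*
		 * Drop the dst_cpu we just tried so it is never picked
		 * again, mirroring what the ALL_PINNED path does with
		 * cpumask_clear_cpu(cpu_of(busiest), cpus).
		 */
		cpumask_clear_cpu(env.dst_cpu, dst_cpus);
		env.dst_cpu	 = env.new_dst_cpu;
		env.dst_rq	 = cpu_rq(env.dst_cpu);
		env.flags	&= ~LBF_SOME_PINNED;
		goto more_balance;
	}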


* Re: [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
  2012-06-20 17:38 ` Peter Zijlstra
@ 2012-06-21  3:40   ` Srivatsa Vaddagiri
  2012-06-21  4:23     ` Srivatsa Vaddagiri
  2012-06-21  8:34     ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2012-06-21  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Prashanth Nageshappa, mingo, LKML, roland, efault, Ingo Molnar

* Peter Zijlstra <peterz@infradead.org> [2012-06-20 19:38:21]:

> On Tue, 2012-06-19 at 17:43 +0530, Prashanth Nageshappa wrote:
> > T2 will starve eternally in this case. The same
> > scenario can arise in presence of non-rt tasks as well (say we replace F1 with
> > high irq load or with a very high priority SCHED_OTHER task that can't move out
> > of C2). 
> 
> Uhm, no. In the case where both F1 and T2 are SCHED_OTHER starvation is
> impossible.
> 
> What can happen with pure SCHED_OTHER affinities is being less fair than
> desired.

Right.. sorry, I should have made that more explicit in the description.

> Anyway, I took the patch with a few minor edits

Thanks!

> -- ie. we don't need to
> reset loop_break, it's never changed (same for the ALL_PINNED patch you
> sent).

Hmm ..I can see loop_break being incremented here:

                /* take a breather every nr_migrate tasks */
                if (env->loop > env->loop_break) {
                        env->loop_break += sched_nr_migrate_break;
                        env->flags |= LBF_NEED_BREAK;
                        goto out;
                }

As a result, when we redo with a different src_cpu, both loop and
loop_break could be at non-default values.  Am I missing something here?

- vatsa



* Re: [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
  2012-06-21  3:40   ` Srivatsa Vaddagiri
@ 2012-06-21  4:23     ` Srivatsa Vaddagiri
  2012-06-21  5:24       ` Prashanth Nageshappa
  2012-06-21  8:34     ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2012-06-21  4:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Prashanth Nageshappa, mingo, LKML, roland, efault, Ingo Molnar

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-06-21 09:10:59]:

> Hmm ..I can see loop_break being incremented here:
> 
>                 /* take a breather every nr_migrate tasks */
>                 if (env->loop > env->loop_break) {
>                         env->loop_break += sched_nr_migrate_break;

One possibility is to reset env->loop here, rather than loop_break, in
which case loop_break can remain constant across "redos" and
"more_balances".


>                         env->flags |= LBF_NEED_BREAK;
>                         goto out;
>                 }

- vatsa



* Re: [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
  2012-06-21  4:23     ` Srivatsa Vaddagiri
@ 2012-06-21  5:24       ` Prashanth Nageshappa
  0 siblings, 0 replies; 9+ messages in thread
From: Prashanth Nageshappa @ 2012-06-21  5:24 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, mingo, LKML, roland, efault, Ingo Molnar

On 06/21/2012 09:53 AM, Srivatsa Vaddagiri wrote:

> * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-06-21 09:10:59]:
> 
>> Hmm ..I can see loop_break being incremented here:
>>
>>                 /* take a breather every nr_migrate tasks */
>>                 if (env->loop > env->loop_break) {
>>                         env->loop_break += sched_nr_migrate_break;
> 
> One possibility is to reset env->loop here, rather than loop_break, in
> which case loop_break can remain constant across "redos" and
> "more_balances"

We cannot do this as env->loop is compared with env->loop_max. So, I
believe resetting env->loop_break along with env->loop before "redos"
and "more_balances" is the right thing to do.

- Prashanth



* Re: [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task
  2012-06-21  3:40   ` Srivatsa Vaddagiri
  2012-06-21  4:23     ` Srivatsa Vaddagiri
@ 2012-06-21  8:34     ` Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2012-06-21  8:34 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Prashanth Nageshappa, mingo, LKML, roland, efault, Ingo Molnar

On Thu, 2012-06-21 at 09:10 +0530, Srivatsa Vaddagiri wrote:

> Hmm ..I can see loop_break being incremented here:
> 
>                 /* take a breather every nr_migrate tasks */
>                 if (env->loop > env->loop_break) {
>                         env->loop_break += sched_nr_migrate_break;
>                         env->flags |= LBF_NEED_BREAK;
>                         goto out;
>                 }
> 
> As a result, when we redo with a different src_cpu, both loop and
> loop_break could be at non-default values.  Am I missing something here?

D'0h right you are.. /me quickly edits patch again.

I thought I removed all that, guess I didn't.


* [tip:sched/core] sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
  2012-06-19 12:13 [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task Prashanth Nageshappa
  2012-06-20 17:38 ` Peter Zijlstra
  2012-06-20 17:40 ` Peter Zijlstra
@ 2012-07-06  6:23 ` tip-bot for Srivatsa Vaddagiri
  2012-07-24 14:20 ` tip-bot for Srivatsa Vaddagiri
  3 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Srivatsa Vaddagiri @ 2012-07-06  6:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, vatsa, hpa, mingo, a.p.zijlstra, tglx, prashanth

Commit-ID:  9b523ea8f7dcfebd954200d276cbdc203913ba30
Gitweb:     http://git.kernel.org/tip/9b523ea8f7dcfebd954200d276cbdc203913ba30
Author:     Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
AuthorDate: Tue, 19 Jun 2012 17:43:15 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jul 2012 21:09:11 +0200

sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task

Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.

This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as
below:

          Core0            Core1
         /     \          /     \
	C0     C1	 C2     C3
        |      |         |      |
        v      v         v      v
	F0     T1        F1     [idle]
			 T2

 F0 = SCHED_FIFO task (pinned to C0)
 F1 = SCHED_FIFO task (pinned to C2)
 T1 = SCHED_OTHER task (pinned to C1)
 T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).

We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.

This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.

Before this patch, a given_cpu or a ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.

With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.

There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f94840..77f1ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3084,6 +3084,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_SOME_PINNED 0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3094,6 +3095,8 @@ struct lb_env {
 	int			dst_cpu;
 	struct rq		*dst_rq;
 
+	struct cpumask		*dst_grpmask;
+	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
 
@@ -3184,9 +3187,31 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) are cache-hot on their current CPU.
 	 */
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+		int new_dst_cpu;
+
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
+		/*
+		 * Remember if this task can be migrated to any other cpu in
+		 * our sched_group. We may want to revisit it if we couldn't
+		 * meet load balance goals by pulling other tasks on src_cpu.
+		 *
+		 * Also avoid computing new_dst_cpu if we have already computed
+		 * one in current iteration.
+		 */
+		if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+			return 0;
+
+		new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+						tsk_cpus_allowed(p));
+		if (new_dst_cpu < nr_cpu_ids) {
+			env->flags |= LBF_SOME_PINNED;
+			env->new_dst_cpu = new_dst_cpu;
+		}
 		return 0;
 	}
+
+	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
 	if (task_running(env->src_rq, p)) {
@@ -4417,7 +4442,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *balance)
 {
-	int ld_moved, active_balance = 0;
+	int ld_moved, cur_ld_moved, active_balance = 0;
+	int lb_iterations, max_lb_iterations;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -4427,12 +4453,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.sd		    = sd,
 		.dst_cpu	    = this_cpu,
 		.dst_rq		    = this_rq,
+		.dst_grpmask	    = sched_group_cpus(sd->groups),
 		.idle		    = idle,
 		.loop_break	    = sched_nr_migrate_break,
 		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
+	max_lb_iterations = cpumask_weight(env.dst_grpmask);
 
 	schedstat_inc(sd, lb_count[idle]);
 
@@ -4460,6 +4488,7 @@ redo:
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
 	ld_moved = 0;
+	lb_iterations = 1;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -4479,7 +4508,13 @@ more_balance:
 		double_rq_lock(this_rq, busiest);
 		if (!env.loop)
 			update_h_load(env.src_cpu);
-		ld_moved += move_tasks(&env);
+
+		/*
+		 * cur_ld_moved - load moved in current iteration
+		 * ld_moved     - cumulative load moved across iterations
+		 */
+		cur_ld_moved = move_tasks(&env);
+		ld_moved += cur_ld_moved;
 		double_rq_unlock(this_rq, busiest);
 		local_irq_restore(flags);
 
@@ -4491,8 +4526,43 @@ more_balance:
 		/*
 		 * some other cpu did the load balance for us.
 		 */
-		if (ld_moved && this_cpu != smp_processor_id())
-			resched_cpu(this_cpu);
+		if (cur_ld_moved && env.dst_cpu != smp_processor_id())
+			resched_cpu(env.dst_cpu);
+
+		/*
+		 * Revisit (affine) tasks on src_cpu that couldn't be moved to
+		 * us and move them to an alternate dst_cpu in our sched_group
+		 * where they can run. The upper limit on how many times we
+		 * iterate on same src_cpu is dependent on number of cpus in our
+		 * sched_group.
+		 *
+		 * This changes load balance semantics a bit on who can move
+		 * load to a given_cpu. In addition to the given_cpu itself
+		 * (or a ilb_cpu acting on its behalf where given_cpu is
+		 * nohz-idle), we now have balance_cpu in a position to move
+		 * load to given_cpu. In rare situations, this may cause
+		 * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
+		 * _independently_ and at _same_ time to move some load to
+		 * given_cpu) causing exceess load to be moved to given_cpu.
+		 * This however should not happen so much in practice and
+		 * moreover subsequent load balance cycles should correct the
+		 * excess load moved.
+		 */
+		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0 &&
+				lb_iterations++ < max_lb_iterations) {
+
+			this_rq		 = cpu_rq(env.new_dst_cpu);
+			env.dst_rq	 = this_rq;
+			env.dst_cpu	 = env.new_dst_cpu;
+			env.flags	&= ~LBF_SOME_PINNED;
+			env.loop	 = 0;
+			env.loop_break	 = sched_nr_migrate_break;
+			/*
+			 * Go back to "more_balance" rather than "redo" since we
+			 * need to continue with same src_cpu.
+			 */
+			goto more_balance;
+		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {


* [tip:sched/core] sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
  2012-06-19 12:13 [PATCH v3] sched: balance_cpu to consider other cpus in its group as target of (pinned) task Prashanth Nageshappa
                   ` (2 preceding siblings ...)
  2012-07-06  6:23 ` [tip:sched/core] sched: Improve balance_cpu() " tip-bot for Srivatsa Vaddagiri
@ 2012-07-24 14:20 ` tip-bot for Srivatsa Vaddagiri
  3 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Srivatsa Vaddagiri @ 2012-07-24 14:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, vatsa, hpa, mingo, a.p.zijlstra, tglx, prashanth

Commit-ID:  88b8dac0a14c511ff41486b83a8c3d688936eec0
Gitweb:     http://git.kernel.org/tip/88b8dac0a14c511ff41486b83a8c3d688936eec0
Author:     Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
AuthorDate: Tue, 19 Jun 2012 17:43:15 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jul 2012 13:58:06 +0200

sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task

Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.

This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as
below:

          Core0            Core1
         /     \          /     \
	C0     C1	 C2     C3
        |      |         |      |
        v      v         v      v
	F0     T1        F1     [idle]
			 T2

 F0 = SCHED_FIFO task (pinned to C0)
 F1 = SCHED_FIFO task (pinned to C2)
 T1 = SCHED_OTHER task (pinned to C1)
 T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).

We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.

This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.

Before this patch, a given_cpu or a ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.

With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.

There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9f9aa0..22321db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3054,6 +3054,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_SOME_PINNED 0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3064,6 +3065,8 @@ struct lb_env {
 	int			dst_cpu;
 	struct rq		*dst_rq;
 
+	struct cpumask		*dst_grpmask;
+	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
 	unsigned int		flags;
@@ -3131,9 +3134,31 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) are cache-hot on their current CPU.
 	 */
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+		int new_dst_cpu;
+
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
+		/*
+		 * Remember if this task can be migrated to any other cpu in
+		 * our sched_group. We may want to revisit it if we couldn't
+		 * meet load balance goals by pulling other tasks on src_cpu.
+		 *
+		 * Also avoid computing new_dst_cpu if we have already computed
+		 * one in current iteration.
+		 */
+		if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+			return 0;
+
+		new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+						tsk_cpus_allowed(p));
+		if (new_dst_cpu < nr_cpu_ids) {
+			env->flags |= LBF_SOME_PINNED;
+			env->new_dst_cpu = new_dst_cpu;
+		}
 		return 0;
 	}
+
+	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
 	if (task_running(env->src_rq, p)) {
@@ -4213,7 +4238,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *balance)
 {
-	int ld_moved, active_balance = 0;
+	int ld_moved, cur_ld_moved, active_balance = 0;
+	int lb_iterations, max_lb_iterations;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -4223,11 +4249,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.sd		= sd,
 		.dst_cpu	= this_cpu,
 		.dst_rq		= this_rq,
+		.dst_grpmask    = sched_group_cpus(sd->groups),
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
+	max_lb_iterations = cpumask_weight(env.dst_grpmask);
 
 	schedstat_inc(sd, lb_count[idle]);
 
@@ -4253,6 +4281,7 @@ redo:
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
 	ld_moved = 0;
+	lb_iterations = 1;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -4270,7 +4299,13 @@ more_balance:
 		double_rq_lock(this_rq, busiest);
 		if (!env.loop)
 			update_h_load(env.src_cpu);
-		ld_moved += move_tasks(&env);
+
+		/*
+		 * cur_ld_moved - load moved in current iteration
+		 * ld_moved     - cumulative load moved across iterations
+		 */
+		cur_ld_moved = move_tasks(&env);
+		ld_moved += cur_ld_moved;
 		double_rq_unlock(this_rq, busiest);
 		local_irq_restore(flags);
 
@@ -4282,8 +4317,43 @@ more_balance:
 		/*
 		 * some other cpu did the load balance for us.
 		 */
-		if (ld_moved && this_cpu != smp_processor_id())
-			resched_cpu(this_cpu);
+		if (cur_ld_moved && env.dst_cpu != smp_processor_id())
+			resched_cpu(env.dst_cpu);
+
+		/*
+		 * Revisit (affine) tasks on src_cpu that couldn't be moved to
+		 * us and move them to an alternate dst_cpu in our sched_group
+		 * where they can run. The upper limit on how many times we
+		 * iterate on same src_cpu is dependent on number of cpus in our
+		 * sched_group.
+		 *
+		 * This changes load balance semantics a bit on who can move
+		 * load to a given_cpu. In addition to the given_cpu itself
+		 * (or a ilb_cpu acting on its behalf where given_cpu is
+		 * nohz-idle), we now have balance_cpu in a position to move
+		 * load to given_cpu. In rare situations, this may cause
+		 * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
+		 * _independently_ and at _same_ time to move some load to
+		 * given_cpu) causing exceess load to be moved to given_cpu.
+		 * This however should not happen so much in practice and
+		 * moreover subsequent load balance cycles should correct the
+		 * excess load moved.
+		 */
+		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0 &&
+				lb_iterations++ < max_lb_iterations) {
+
+			this_rq		 = cpu_rq(env.new_dst_cpu);
+			env.dst_rq	 = this_rq;
+			env.dst_cpu	 = env.new_dst_cpu;
+			env.flags	&= ~LBF_SOME_PINNED;
+			env.loop	 = 0;
+			env.loop_break	 = sched_nr_migrate_break;
+			/*
+			 * Go back to "more_balance" rather than "redo" since we
+			 * need to continue with same src_cpu.
+			 */
+			goto more_balance;
+		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {
