* load balancing regression since commit 367456c7
@ 2012-04-11  1:06 Tim Chen
  2012-04-17 11:43 ` Peter Zijlstra
  2012-04-17 12:09 ` Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: Tim Chen @ 2012-04-11  1:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, Alex Shi, Huang, Ying, linux-kernel

Peter,

We noticed in a hackbench test (./hackbench 100 process 2000)
on a Sandy Bridge 2-socket server that there has been a slowdown
by a factor of 4 since commit 367456c7 was applied
("sched: Ditch per cgroup task lists for load-balancing").

Commit 5d6523e ("sched: Fix load-balance wreckage") did
not fix the regression.

The profile shows heavy spin lock contention in the load_balance path
on 3.4-rc2, whereas it accounted for less than 0.003% of CPU time
before commit 367456c7.

When we looked at /proc/schedstat for 3.4-rc2 over the run duration,
schedule was called 13x more often on cpu0, and the number of schedule
calls that left the processor idle was 530x as high.

There was also a big increase in the remote try_to_wake_up
(sd->ttwu_wake_remote) count:

	increase in sd->ttwu_wake_remote for cpu0
	domain 0	 540%
	domain 1	7570%
	domain 2	4426%

I wonder if there is unnecessary load balancing to remote cpus?

Tim


profile for 3.4-rc2

     7.16%     hackbench  [kernel.kallsyms]      [k] _raw_spin_lock
               |
               --- _raw_spin_lock
                  |
                  |--56.52%-- load_balance
                  |          idle_balance
                  |          __schedule
                  |          schedule
                  |          |
                  |          |--98.73%-- schedule_timeout
                  |          |          |
                  |          |          |--97.80%-- unix_stream_recvmsg
                  |          |          |          sock_aio_read.part.7
                  |          |          |          sock_aio_read
                  |          |          |          do_sync_read
                  |          |          |          vfs_read
                  |          |          |          sys_read
                  |          |          |          system_call
                  |          |          |          __read_nocancel
                  |          |          |          create_worker
                  |          |          |          group
                  |          |          |          main
                  |          |          |          __libc_start_main
                  |          |          |








* Re: load balancing regression since commit 367456c7
  2012-04-11  1:06 load balancing regression since commit 367456c7 Tim Chen
@ 2012-04-17 11:43 ` Peter Zijlstra
  2012-04-17 12:09 ` Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-17 11:43 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Huang, Ying, linux-kernel

On Tue, 2012-04-10 at 18:06 -0700, Tim Chen wrote:
> Peter,
> 
> We noticed in a hackbench test (./hackbench 100 process 2000)
> on a Sandy Bridge 2-socket server that there has been a slowdown
> by a factor of 4 since commit 367456c7 was applied
> ("sched: Ditch per cgroup task lists for load-balancing").
> 
> Commit 5d6523e ("sched: Fix load-balance wreckage") did
> not fix the regression.
> 
> The profile shows heavy spin lock contention in the load_balance path
> on 3.4-rc2, whereas it accounted for less than 0.003% of CPU time
> before commit 367456c7.

I can't actually reproduce this, but does the below help?

If not, can you shoot your .config over?

---
 kernel/sched/fair.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d97ebd..e1da5c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3215,6 +3215,8 @@ static int move_one_task(struct lb_env *env)
 
 static unsigned long task_h_load(struct task_struct *p);
 
+static const unsigned int sched_nr_migrate_break = IS_ENABLED(CONFIG_PREEMPT) ? 8 : 32;
+
 /*
  * move_tasks tries to move up to load_move weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -3242,7 +3244,7 @@ static int move_tasks(struct lb_env *env)
 
 		/* take a breather every nr_migrate tasks */
 		if (env->loop > env->loop_break) {
-			env->loop_break += sysctl_sched_nr_migrate;
+			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
 			break;
 		}
@@ -4407,7 +4409,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.dst_cpu	= this_cpu,
 		.dst_rq		= this_rq,
 		.idle		= idle,
-		.loop_break	= sysctl_sched_nr_migrate,
+		.loop_break	= sched_nr_migrate_break,
+		.loop_max	= sysctl_sched_nr_migrate,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4448,7 +4451,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		env.load_move = imbalance;
 		env.src_cpu = busiest->cpu;
 		env.src_rq = busiest;
-		env.loop_max = busiest->nr_running;
 
 more_balance:
 		local_irq_save(flags);




* Re: load balancing regression since commit 367456c7
  2012-04-11  1:06 load balancing regression since commit 367456c7 Tim Chen
  2012-04-17 11:43 ` Peter Zijlstra
@ 2012-04-17 12:09 ` Peter Zijlstra
  2012-04-17 16:44   ` Tim Chen
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-17 12:09 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Tue, 2012-04-10 at 18:06 -0700, Tim Chen wrote:
>                   |--56.52%-- load_balance
>                   |          idle_balance
>                   |          __schedule
>                   |          schedule 

Ahh, I know why I didn't see it: I have a CONFIG_PREEMPT kernel, and
idle balancing stops once it's gotten a single task over instead of
achieving proper balance.

And since hackbench generates insanely long runqueues, and the patch that
caused your regression 'fixed' the lock-breaking, it will now iterate the
entire runqueue if needed to achieve balance, which hurts.

I think the patch I sent ought to work; let me try disabling
CONFIG_PREEMPT.
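
For reference, the move_tasks() iteration being discussed looks roughly
like this (a simplified sketch of the 3.4-rc code, not the verbatim
source; details abbreviated):

	while (!list_empty(tasks)) {
		p = list_first_entry(tasks, struct task_struct, se.group_node);

		env->loop++;
		/* We've more or less seen every task there is, call it quits */
		if (env->loop > env->loop_max)
			break;

		/* take a breather every nr_migrate tasks */
		if (env->loop > env->loop_break) {
			env->loop_break += sched_nr_migrate_break;
			env->flags |= LBF_NEED_BREAK;	/* drop the locks, then retry */
			break;
		}

		/* ... check can_migrate_task(), compute task_h_load(p), move it ... */
	}

With loop_max set straight to busiest->nr_running, a hackbench run with
thousands of runnable tasks means thousands of iterations per balance
pass; the patch in the previous mail caps both the breather interval
(sched_nr_migrate_break) and loop_max (sysctl_sched_nr_migrate), so a
single pass stays bounded.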


* Re: load balancing regression since commit 367456c7
  2012-04-17 12:09 ` Peter Zijlstra
@ 2012-04-17 16:44   ` Tim Chen
  2012-04-20 14:00     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Tim Chen @ 2012-04-17 16:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Tue, 2012-04-17 at 14:09 +0200, Peter Zijlstra wrote:
> On Tue, 2012-04-10 at 18:06 -0700, Tim Chen wrote:
> >                   |--56.52%-- load_balance
> >                   |          idle_balance
> >                   |          __schedule
> >                   |          schedule 
> 
> Ahh, I know why I didn't see it: I have a CONFIG_PREEMPT kernel, and
> idle balancing stops once it's gotten a single task over instead of
> achieving proper balance.
> 
> And since hackbench generates insanely long runqueues, and the patch that
> caused your regression 'fixed' the lock-breaking, it will now iterate the
> entire runqueue if needed to achieve balance, which hurts.
> 
> I think the patch I sent ought to work; let me try disabling
> CONFIG_PREEMPT.
> --

Yes, CONFIG_PREEMPT is turned off on my side.  With the patch that you
sent, the slowdown went from a factor of 4 down to a factor of 2.

So the run time is now twice as long, instead of four times as long,
compared to the v3.3 kernel.

Tim



* Re: load balancing regression since commit 367456c7
  2012-04-17 16:44   ` Tim Chen
@ 2012-04-20 14:00     ` Peter Zijlstra
  2012-04-20 16:40       ` Tim Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-20 14:00 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Tue, 2012-04-17 at 09:44 -0700, Tim Chen wrote:
> On Tue, 2012-04-17 at 14:09 +0200, Peter Zijlstra wrote:
> > On Tue, 2012-04-10 at 18:06 -0700, Tim Chen wrote:
> > >                   |--56.52%-- load_balance
> > >                   |          idle_balance
> > >                   |          __schedule
> > >                   |          schedule 
> > 
> > Ahh, I know why I didn't see it: I have a CONFIG_PREEMPT kernel, and
> > idle balancing stops once it's gotten a single task over instead of
> > achieving proper balance.
> > 
> > And since hackbench generates insanely long runqueues, and the patch that
> > caused your regression 'fixed' the lock-breaking, it will now iterate the
> > entire runqueue if needed to achieve balance, which hurts.
> > 
> > I think the patch I sent ought to work; let me try disabling
> > CONFIG_PREEMPT.
> > --
> 
> Yes, CONFIG_PREEMPT is turned off on my side.  With the patch that you
> sent, the slowdown went from a factor of 4 down to a factor of 2.
> 
> So the run time is now twice as long, instead of four times as long,
> compared to the v3.3 kernel.

OK, so I can't reproduce this on my WSM-EP.. even !PREEMPT kernels give
consistent hackbench times with or without that patch.

Can you still send your full .config? Also, do you have cpu-cgroup muck
enabled and are you using that systemd shite?

What does the below patch (on top of the previous) do?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,7 @@ account_entity_enqueue(struct cfs_rq *cf
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
 	if (entity_is_task(se))
-		list_add_tail(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
 #endif
 	cfs_rq->nr_running++;
 }




* Re: load balancing regression since commit 367456c7
  2012-04-20 14:00     ` Peter Zijlstra
@ 2012-04-20 16:40       ` Tim Chen
  2012-04-20 16:53         ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Tim Chen @ 2012-04-20 16:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Fri, 2012-04-20 at 16:00 +0200, Peter Zijlstra wrote:

> 
> OK, so I can't reproduce this on my WSM-EP.. even !PREEMPT kernels give
> consistent hackbench times with or without that patch.
> 
> Can you still send your full .config? Also, do you have cpu-cgroup muck
> enabled and are you using that systemd shite?
> 
> What does the below patch (on top of the previous) do?
> 

There is a slight 10% to 15% improvement with the patch.  However,
the change in performance is difficult to quantify precisely, as the
hackbench runtime has had large variations (up to 50%) on our Sandy
Bridge EP server machines since commit 367456c7.

We also do not see regression for hackbench on WSM-EP, but only 
on machines with Sandy-Bridge EP (2 socket, 8 cores/socket, HT enabled).  
We are not running hackbench in cgroup for this test.

The Sandy Bridge EP machines have FC16 installed, so I think they use
systemd.  Our WSM-EP machine has FC15, which also has systemd.

I'm sending you the .config in a separate mail.  Thanks for taking a look.
The .config aligns closely with standard Fedora settings.

Tim




* Re: load balancing regression since commit 367456c7
  2012-04-20 16:40       ` Tim Chen
@ 2012-04-20 16:53         ` Peter Zijlstra
  2012-04-20 17:13           ` Tim Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-20 16:53 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Fri, 2012-04-20 at 09:40 -0700, Tim Chen wrote:
> We also do not see regression for hackbench on WSM-EP, but only 
> on machines with Sandy-Bridge EP (2 socket, 8 cores/socket, HT enabled).  

Argh, Suresh, any idea what could be different and relevant to this
issue between WSM and SNB -EP ?

Tim, can you see the problem on the desktop SNB part? That's the only
SNB I have available.

> We are not running hackbench in cgroup for this test.
> 
> The Sandy Bridge EP machines have FC16 installed, so I think they use
> systemd.  Our WSM-EP machine has FC15, which also has systemd.

Ah, but your .config has:

CONFIG_CGROUP_SCHED=y
CONFIG_SCHED_AUTOGROUP=y

and systemd, when cpu-cgroup is available, will automagically use it.

Could you disable those two CONFIG knobs and see if it persists?
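
That is, roughly this in the .config (illustrative; the exact set of
dependent options that get switched off may differ):

# CONFIG_CGROUP_SCHED is not set
# CONFIG_SCHED_AUTOGROUP is not set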

If it goes away, it's the cgroup balancing; if it's still there, it's
elsewhere.



* Re: load balancing regression since commit 367456c7
  2012-04-20 16:53         ` Peter Zijlstra
@ 2012-04-20 17:13           ` Tim Chen
  2012-04-20 17:33             ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Tim Chen @ 2012-04-20 17:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Fri, 2012-04-20 at 18:53 +0200, Peter Zijlstra wrote:
> On Fri, 2012-04-20 at 09:40 -0700, Tim Chen wrote:
> > We also do not see regression for hackbench on WSM-EP, but only 
> > on machines with Sandy-Bridge EP (2 socket, 8 cores/socket, HT enabled).  
> 
> Argh, Suresh, any idea what could be different and relevant to this
> issue between WSM and SNB -EP ?
> 
> Tim, can you see the problem on the desktop SNB part? That's the only
> SNB I have available.
> 
> > We are not running hackbench in cgroup for this test.
> > 
> > The Sandy Bridge EP machines have FC16 installed, so I think they use
> > systemd.  Our WSM-EP machine has FC15, which also has systemd.
> 
> Ah, but your .config has:
> 
> CONFIG_CGROUP_SCHED=y
> CONFIG_SCHED_AUTOGROUP=y
> 
> and systemd, when cpu-cgroup is available, will automagically use it.
> 
> Could you disable those two CONFIG knobs and see if it persists?
> 

Turning those two off did make the regression go away.

Tim



* Re: load balancing regression since commit 367456c7
  2012-04-20 17:13           ` Tim Chen
@ 2012-04-20 17:33             ` Peter Zijlstra
  2012-04-25 14:56               ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-20 17:33 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Fri, 2012-04-20 at 10:13 -0700, Tim Chen wrote:
> On Fri, 2012-04-20 at 18:53 +0200, Peter Zijlstra wrote:
> > On Fri, 2012-04-20 at 09:40 -0700, Tim Chen wrote:
> > > We also do not see regression for hackbench on WSM-EP, but only 
> > > on machines with Sandy-Bridge EP (2 socket, 8 cores/socket, HT enabled).  
> > 
> > Argh, Suresh, any idea what could be different and relevant to this
> > issue between WSM and SNB -EP ?
> > 
> > Tim, can you see the problem on the desktop SNB part? That's the only
> > SNB I have available.
> > 
> > > We are not running hackbench in cgroup for this test.
> > > 
> > > The Sandy Bridge EP machines have FC16 installed, so I think they use
> > > systemd.  Our WSM-EP machine has FC15, which also has systemd.
> > 
> > Ah, but your .config has:
> > 
> > CONFIG_CGROUP_SCHED=y
> > CONFIG_SCHED_AUTOGROUP=y
> > 
> > and systemd, when cpu-cgroup is available, will automagically use it.
> > 
> > Could you disable those two CONFIG knobs and see if it persists?
> > 
> 
> Turning those two off did make the regression go away.

OK, I'll go stare at the cgroup part then.. Thanks!



* Re: load balancing regression since commit 367456c7
  2012-04-20 17:33             ` Peter Zijlstra
@ 2012-04-25 14:56               ` Peter Zijlstra
  2012-04-25 17:38                 ` Tim Chen
  2012-04-26 11:56                 ` [tip:sched/urgent] sched: Fix more load-balancing fallout tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-25 14:56 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Fri, 2012-04-20 at 19:33 +0200, Peter Zijlstra wrote:
> 
> OK, I'll go stare at the cgroup part then.. Thanks!
> 
OK, I could reproduce it when using cgroups; the below fixes it for me,
can you confirm?

---
Subject: sched: Fix more load-balance fallout
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Apr 17 13:38:40 CEST 2012

Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
left some more wreckage.

By setting loop_max unconditionally to ->nr_running, load-balancing
could take a lot of time on very long runqueues (hackbench!). So keep
the sysctl as the max limit on the number of tasks we'll iterate.

Furthermore, the min load filter for migration completely fails with
cgroups since inequality in per-cpu state can easily lead to such
small loads :/

Furthermore the change to add new tasks to the tail of the queue
instead of the head seems to have some effect.. not quite sure I
understand why.

Combined, these fixes solve the huge hackbench regression reported by
Tim when hackbench is run in a cgroup.

Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/fair.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,7 @@ account_entity_enqueue(struct cfs_rq *cf
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
 	if (entity_is_task(se))
-		list_add_tail(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
 #endif
 	cfs_rq->nr_running++;
 }
@@ -3215,6 +3215,14 @@ static int move_one_task(struct lb_env *
 
 static unsigned long task_h_load(struct task_struct *p);
 
+static const unsigned int sched_nr_migrate_break =
+#ifdef CONFIG_PREEMPT
+	8
+#else
+	32
+#endif
+	;
+
 /*
  * move_tasks tries to move up to load_move weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -3242,7 +3250,7 @@ static int move_tasks(struct lb_env *env
 
 		/* take a breather every nr_migrate tasks */
 		if (env->loop > env->loop_break) {
-			env->loop_break += sysctl_sched_nr_migrate;
+			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
 			break;
 		}
@@ -3252,7 +3260,7 @@ static int move_tasks(struct lb_env *env
 
 		load = task_h_load(p);
 
-		if (load < 16 && !env->sd->nr_balance_failed)
+		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
 
 		if ((load / 2) > env->load_move)
@@ -4407,7 +4415,7 @@ static int load_balance(int this_cpu, st
 		.dst_cpu	= this_cpu,
 		.dst_rq		= this_rq,
 		.idle		= idle,
-		.loop_break	= sysctl_sched_nr_migrate,
+		.loop_break	= sched_nr_migrate_break,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4448,7 +4456,8 @@ static int load_balance(int this_cpu, st
 		env.load_move = imbalance;
 		env.src_cpu = busiest->cpu;
 		env.src_rq = busiest;
-		env.loop_max = busiest->nr_running;
+		env.loop_max = min_t(unsigned long,
+				sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
 		local_irq_save(flags);
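
As a rough illustration of why the load < 16 filter misbehaves with
cgroups (numbers purely illustrative): a task group with the default
1024 shares spread over a 32-CPU box gets only about 1024/32 = 32 of
weight per CPU, and with two or more of its tasks on one runqueue each
task's task_h_load() easily ends up below 16, so without the LB_MIN
gate above the filter would skip essentially every task in such a group.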



* Re: load balancing regression since commit 367456c7
  2012-04-25 14:56               ` Peter Zijlstra
@ 2012-04-25 17:38                 ` Tim Chen
  2012-04-25 17:43                   ` Peter Zijlstra
  2012-04-26 11:56                 ` [tip:sched/urgent] sched: Fix more load-balancing fallout tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Tim Chen @ 2012-04-25 17:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Wed, 2012-04-25 at 16:56 +0200, Peter Zijlstra wrote:
> On Fri, 2012-04-20 at 19:33 +0200, Peter Zijlstra wrote:
> > 
> > OK, I'll go stare at the cgroup part then.. Thanks!
> > 
> OK, I could reproduce it when using cgroups; the below fixes it for me,
> can you confirm?
> 
> ---

Mmm... I got the following error when I tried to compile the patch on the latest kernel tip:

kernel/sched/fair.c:3263:3: error: implicit declaration of function ‘static_branch_LB_MIN’ [-Werror=implicit-function-declaration]
kernel/sched/fair.c:3263:46: error: ‘__SCHED_FEAT_LB_MIN’ undeclared (first use in this function)
kernel/sched/fair.c:3263:46: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/fair.o] Error 1
make[1]: *** [kernel/sched] Error 2
make: *** [kernel] Error 2

Tim



* Re: load balancing regression since commit 367456c7
  2012-04-25 17:38                 ` Tim Chen
@ 2012-04-25 17:43                   ` Peter Zijlstra
  2012-04-25 17:58                     ` Tim Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-04-25 17:43 UTC (permalink / raw)
  To: Tim Chen; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel


Gargh.. lost the change to kernel/sched/features.h, now included.

Sorry for that.
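
For context: each SCHED_FEAT(name, enabled) line in
kernel/sched/features.h generates a __SCHED_FEAT_<name> enum value and,
on SCHED_DEBUG builds with jump labels, a static_branch_<name>()
helper, which is exactly what the errors above complain about.
Roughly, simplified from kernel/sched/sched.h of that era (a sketch,
not the verbatim code):

	#define SCHED_FEAT(name, enabled)	__SCHED_FEAT_##name ,
	enum {
	#include "features.h"
		__SCHED_FEAT_NR,
	};
	#undef SCHED_FEAT

	#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
	/* features.h also expands to one static_branch_<name>() wrapper each */
	#define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
	#else
	#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
	#endif

So sched_feat(LB_MIN) cannot build until LB_MIN is declared there.
With CONFIG_SCHED_DEBUG the bit can then be toggled at runtime by
writing LB_MIN or NO_LB_MIN to /sys/kernel/debug/sched_features.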

---
Subject: sched: Fix more load-balance fallout
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Apr 17 13:38:40 CEST 2012

Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
left some more wreckage.

By setting loop_max unconditionally to ->nr_running, load-balancing
could take a lot of time on very long runqueues (hackbench!). So keep
the sysctl as the max limit on the number of tasks we'll iterate.

Furthermore, the min load filter for migration completely fails with
cgroups since inequality in per-cpu state can easily lead to such
small loads :/

Furthermore the change to add new tasks to the tail of the queue
instead of the head seems to have some effect.. not quite sure I
understand why.

Combined, these fixes solve the huge hackbench regression reported by
Tim when hackbench is run in a cgroup.

Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
---
 kernel/sched/fair.c     |   19 ++++++++++++++-----
 kernel/sched/features.h |    1 +
 2 files changed, 15 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,7 @@ account_entity_enqueue(struct cfs_rq *cf
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
 	if (entity_is_task(se))
-		list_add_tail(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
 #endif
 	cfs_rq->nr_running++;
 }
@@ -3215,6 +3215,14 @@ static int move_one_task(struct lb_env *
 
 static unsigned long task_h_load(struct task_struct *p);
 
+static const unsigned int sched_nr_migrate_break =
+#ifdef CONFIG_PREEMPT
+	8
+#else
+	32
+#endif
+	;
+
 /*
  * move_tasks tries to move up to load_move weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -3242,7 +3250,7 @@ static int move_tasks(struct lb_env *env
 
 		/* take a breather every nr_migrate tasks */
 		if (env->loop > env->loop_break) {
-			env->loop_break += sysctl_sched_nr_migrate;
+			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
 			break;
 		}
@@ -3252,7 +3260,7 @@ static int move_tasks(struct lb_env *env
 
 		load = task_h_load(p);
 
-		if (load < 16 && !env->sd->nr_balance_failed)
+		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
 
 		if ((load / 2) > env->load_move)
@@ -4407,7 +4415,7 @@ static int load_balance(int this_cpu, st
 		.dst_cpu	= this_cpu,
 		.dst_rq		= this_rq,
 		.idle		= idle,
-		.loop_break	= sysctl_sched_nr_migrate,
+		.loop_break	= sched_nr_migrate_break,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4448,7 +4456,8 @@ static int load_balance(int this_cpu, st
 		env.load_move = imbalance;
 		env.src_cpu = busiest->cpu;
 		env.src_rq = busiest;
-		env.loop_max = busiest->nr_running;
+		env.loop_max = min_t(unsigned long,
+				sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
 		local_irq_save(flags);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -68,3 +68,4 @@ SCHED_FEAT(TTWU_QUEUE, true)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
+SCHED_FEAT(LB_MIN, false)



* Re: load balancing regression since commit 367456c7
  2012-04-25 17:43                   ` Peter Zijlstra
@ 2012-04-25 17:58                     ` Tim Chen
  0 siblings, 0 replies; 14+ messages in thread
From: Tim Chen @ 2012-04-25 17:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, Alex Shi, Ying, linux-kernel

On Wed, 2012-04-25 at 19:43 +0200, Peter Zijlstra wrote:
> Gargh.. lost the change to kernel/sched/features.h, now included.
> 
> Sorry for that.
> 
> ---
> Subject: sched: Fix more load-balance fallout
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue Apr 17 13:38:40 CEST 2012
> 
> Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
> load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
> left some more wreckage.
> 
> By setting loop_max unconditionally to ->nr_running, load-balancing
> could take a lot of time on very long runqueues (hackbench!). So keep
> the sysctl as the max limit on the number of tasks we'll iterate.
> 
> Furthermore, the min load filter for migration completely fails with
> cgroups since inequality in per-cpu state can easily lead to such
> small loads :/
> 
> Furthermore the change to add new tasks to the tail of the queue
> instead of the head seems to have some effect.. not quite sure I
> understand why.
> 
> Combined, these fixes solve the huge hackbench regression reported by
> Tim when hackbench is run in a cgroup.
> 
> Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins

The patch fixed the regression for me.  

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>




* [tip:sched/urgent] sched: Fix more load-balancing fallout
  2012-04-25 14:56               ` Peter Zijlstra
  2012-04-25 17:38                 ` Tim Chen
@ 2012-04-26 11:56                 ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-04-26 11:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, tim.c.chen, akpm, tglx

Commit-ID:  eb95308ee2a69403909e111837b9068c64cfc349
Gitweb:     http://git.kernel.org/tip/eb95308ee2a69403909e111837b9068c64cfc349
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 17 Apr 2012 13:38:40 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 26 Apr 2012 12:54:52 +0200

sched: Fix more load-balancing fallout

Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
left some more wreckage.

By setting loop_max unconditionally to ->nr_running, load-balancing
could take a lot of time on very long runqueues (hackbench!). So keep
the sysctl as the max limit on the number of tasks we'll iterate.

Furthermore, the min load filter for migration completely fails with
cgroups since inequality in per-cpu state can easily lead to such
small loads :/

Furthermore the change to add new tasks to the tail of the queue
instead of the head seems to have some effect.. not quite sure I
understand why.

Combined, these fixes solve the huge hackbench regression reported by
Tim when hackbench is run in a cgroup.

Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
[ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     |   18 ++++++++++--------
 kernel/sched/features.h |    1 +
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d97ebd..e955364 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
 	if (entity_is_task(se))
-		list_add_tail(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
 #endif
 	cfs_rq->nr_running++;
 }
@@ -3215,6 +3215,8 @@ static int move_one_task(struct lb_env *env)
 
 static unsigned long task_h_load(struct task_struct *p);
 
+static const unsigned int sched_nr_migrate_break = 32;
+
 /*
  * move_tasks tries to move up to load_move weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -3242,7 +3244,7 @@ static int move_tasks(struct lb_env *env)
 
 		/* take a breather every nr_migrate tasks */
 		if (env->loop > env->loop_break) {
-			env->loop_break += sysctl_sched_nr_migrate;
+			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
 			break;
 		}
@@ -3252,7 +3254,7 @@ static int move_tasks(struct lb_env *env)
 
 		load = task_h_load(p);
 
-		if (load < 16 && !env->sd->nr_balance_failed)
+		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
 
 		if ((load / 2) > env->load_move)
@@ -4407,7 +4409,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.dst_cpu	= this_cpu,
 		.dst_rq		= this_rq,
 		.idle		= idle,
-		.loop_break	= sysctl_sched_nr_migrate,
+		.loop_break	= sched_nr_migrate_break,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4445,10 +4447,10 @@ redo:
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.load_move = imbalance;
-		env.src_cpu = busiest->cpu;
-		env.src_rq = busiest;
-		env.loop_max = busiest->nr_running;
+		env.load_move	= imbalance;
+		env.src_cpu	= busiest->cpu;
+		env.src_rq	= busiest;
+		env.loop_max	= min_t(unsigned long, sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
 		local_irq_save(flags);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e61fd73..de00a48 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -68,3 +68,4 @@ SCHED_FEAT(TTWU_QUEUE, true)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
+SCHED_FEAT(LB_MIN, false)

