linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] scheduler fixes
@ 2011-04-02 10:31 Ingo Molnar
  2011-04-04 15:45 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2011-04-02 10:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Andrew Morton

Linus,

Please pull the latest sched-fixes-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus

 Thanks,

	Ingo

------------------>
Borislav Petkov (1):
      sched, doc: Beef up load balancing description

Dario Faggioli (1):
      sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks

Sisir Koppaka (1):
      sched: Fix rebalance interval calculation


 Documentation/scheduler/sched-domains.txt |   32 ++++++++++++++++++++--------
 kernel/sched.c                            |   11 ++++++++++
 kernel/sched_fair.c                       |    5 ++-
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt
index 373ceac..b7ee379 100644
--- a/Documentation/scheduler/sched-domains.txt
+++ b/Documentation/scheduler/sched-domains.txt
@@ -1,8 +1,7 @@
-Each CPU has a "base" scheduling domain (struct sched_domain). These are
-accessed via cpu_sched_domain(i) and this_sched_domain() macros. The domain
+Each CPU has a "base" scheduling domain (struct sched_domain). The domain
 hierarchy is built from these base domains via the ->parent pointer. ->parent
-MUST be NULL terminated, and domain structures should be per-CPU as they
-are locklessly updated.
+MUST be NULL terminated, and domain structures should be per-CPU as they are
+locklessly updated.
 
 Each scheduling domain spans a number of CPUs (stored in the ->span field).
 A domain's span MUST be a superset of it child's span (this restriction could
@@ -26,11 +25,26 @@ is treated as one entity. The load of a group is defined as the sum of the
 load of each of its member CPUs, and only when the load of a group becomes
 out of balance are tasks moved between groups.
 
-In kernel/sched.c, rebalance_tick is run periodically on each CPU. This
-function takes its CPU's base sched domain and checks to see if has reached
-its rebalance interval. If so, then it will run load_balance on that domain.
-rebalance_tick then checks the parent sched_domain (if it exists), and the
-parent of the parent and so forth.
+In kernel/sched.c, trigger_load_balance() is run periodically on each CPU
+through scheduler_tick(). It raises a softirq after the next regularly scheduled
+rebalancing event for the current runqueue has arrived. The actual load
+balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
+in softirq context (SCHED_SOFTIRQ).
+
+The latter function takes two arguments: the current CPU and whether it was idle
+at the time the scheduler_tick() happened and iterates over all sched domains
+our CPU is on, starting from its base domain and going up the ->parent chain.
+While doing that, it checks to see if the current domain has exhausted its
+rebalance interval. If so, it runs load_balance() on that domain. It then checks
+the parent sched_domain (if it exists), and the parent of the parent and so
+forth.
+
+Initially, load_balance() finds the busiest group in the current sched domain.
+If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
+that group. If it manages to find such a runqueue, it locks both our initial
+CPU's runqueue and the newly found busiest one and starts moving tasks from it
+to our runqueue. The exact number of tasks amounts to an imbalance previously
+computed while iterating over this sched domain's groups.
 
 *** Implementing sched domains ***
 The "base" domain will "span" the first level of the hierarchy. In the case
diff --git a/kernel/sched.c b/kernel/sched.c
index f592ce6..a884551 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5011,6 +5011,17 @@ recheck:
 		return -EINVAL;
 	}
 
+	/*
+	 * If not changing anything there's no need to proceed further:
+	 */
+	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
+			param->sched_priority == p->rt_priority))) {
+
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return 0;
+	}
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
 		/*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3f7ec9e..c7ec5c8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -22,6 +22,7 @@
 
 #include <linux/latencytop.h>
 #include <linux/sched.h>
+#include <linux/cpumask.h>
 
 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -3850,8 +3851,8 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 		interval = msecs_to_jiffies(interval);
 		if (unlikely(!interval))
 			interval = 1;
-		if (interval > HZ*NR_CPUS/10)
-			interval = HZ*NR_CPUS/10;
+		if (interval > HZ*num_online_cpus()/10)
+			interval = HZ*num_online_cpus()/10;
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [GIT PULL] scheduler fixes
  2011-04-02 10:31 [GIT PULL] scheduler fixes Ingo Molnar
@ 2011-04-04 15:45 ` Linus Torvalds
  2011-04-04 16:08   ` Sisir Koppaka
                     ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Linus Torvalds @ 2011-04-04 15:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Andrew Morton

So I pulled this, but I think this:

On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> -               if (interval > HZ*NR_CPUS/10)
> -                       interval = HZ*NR_CPUS/10;
> +               if (interval > HZ*num_online_cpus()/10)
> +                       interval = HZ*num_online_cpus()/10;

is a horrible patch.

Think about what that expands to. It's going to potentially be two
function calls. And the code is just ugly.

So please, when doing search-and-replace changes like this, just clean
up the code at the same time. Back when it was about pure constants,
there was only a typing/ugly overhead from duplicating the constant,
but the compiler would see a single constant.

Now it's _possible_ that the compiler could do the analysis and fold
it all back to a single thing. But it's unlikely to happen except for
configurations that end up making it all trivial.

So just add something like a

   int max_interval = HZ*num_online_cpus()/10;

possibly even with a comment about _why_ that is the max interval allowed.  Ok?

                        Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL] scheduler fixes
  2011-04-04 15:45 ` Linus Torvalds
@ 2011-04-04 16:08   ` Sisir Koppaka
  2011-04-04 16:16   ` Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Sisir Koppaka @ 2011-04-04 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton

On Mon, Apr 4, 2011 at 9:15 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I pulled this, but I think this:
>
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <mingo@elte.hu> wrote:
>>
>> -               if (interval > HZ*NR_CPUS/10)
>> -                       interval = HZ*NR_CPUS/10;
>> +               if (interval > HZ*num_online_cpus()/10)
>> +                       interval = HZ*num_online_cpus()/10;
>
> is a horrible patch.
>
> Think about what that expands to. It's going to potentially be two
> function calls. And the code is just ugly.
>
> So please, when doing search-and-replace changes like this, just clean
> up the code at the same time. Back when it was about pure constants,
> there was only a typing/ugly overhead from duplicating the constant,
> but the compiler would see a single constant.
>
> Now it's _possible_ that the compiler could do the analysis and fold
> it all back to a single thing. But it's unlikely to happen except for
> configurations that end up making it all trivial.
>
> So just add something like a
>
>   int max_interval = HZ*num_online_cpus()/10;
>
> possibly even with a comment about _why_ that is the max interval allowed.  Ok?
>
>                        Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

Ok sure. You're right, I'll resend the patch.
Sisir

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL] scheduler fixes
  2011-04-04 15:45 ` Linus Torvalds
  2011-04-04 16:08   ` Sisir Koppaka
@ 2011-04-04 16:16   ` Andrew Morton
  2011-04-04 19:19   ` Ingo Molnar
  2011-04-05  8:14   ` Peter Zijlstra
  3 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2011-04-04 16:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner

On Mon, 4 Apr 2011 08:45:34 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I pulled this, but I think this:
> 
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > -               if (interval > HZ*NR_CPUS/10)
> > -                       interval = HZ*NR_CPUS/10;
> > +               if (interval > HZ*num_online_cpus()/10)
> > +                       interval = HZ*num_online_cpus()/10;
> 
> is a horrible patch.
> 
> Think about what that expands to. It's going to potentially be two
> function calls. And the code is just ugly.
> 
> So please, when doing search-and-replace changes like this, just clean
> up the code at the same time. Back when it was about pure constants,
> there was only a typing/ugly overhead from duplicating the constant,
> but the compiler would see a single constant.
> 
> Now it's _possible_ that the compiler could do the analysis and fold
> it all back to a single thing. But it's unlikely to happen except for
> configurations that end up making it all trivial.
> 
> So just add something like a
> 
>    int max_interval = HZ*num_online_cpus()/10;
> 
> possibly even with a comment about _why_ that is the max interval allowed.  Ok?
> 

max() conveniently avoids the double-evaluation:

#define max(x, y) ({				\
	typeof(x) _max1 = (x);			\
	typeof(y) _max2 = (y);			\
	(void) (&_max1 == &_max2);		\
	_max1 > _max2 ? _max1 : _max2; })


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL] scheduler fixes
  2011-04-04 15:45 ` Linus Torvalds
  2011-04-04 16:08   ` Sisir Koppaka
  2011-04-04 16:16   ` Andrew Morton
@ 2011-04-04 19:19   ` Ingo Molnar
  2011-04-05  8:14   ` Peter Zijlstra
  3 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2011-04-04 19:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I pulled this, but I think this:
> 
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > -               if (interval > HZ*NR_CPUS/10)
> > -                       interval = HZ*NR_CPUS/10;
> > +               if (interval > HZ*num_online_cpus()/10)
> > +                       interval = HZ*num_online_cpus()/10;
> 
> is a horrible patch.

Indeed, will fix it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL] scheduler fixes
  2011-04-04 15:45 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2011-04-04 19:19   ` Ingo Molnar
@ 2011-04-05  8:14   ` Peter Zijlstra
  2011-04-05  8:48     ` [tip:sched/urgent] sched: Clean up rebalance_domains() load-balance interval calculation tip-bot for Peter Zijlstra
  3 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2011-04-05  8:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Andrew Morton

On Mon, 2011-04-04 at 08:45 -0700, Linus Torvalds wrote:
> So I pulled this, but I think this:
> 
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > -               if (interval > HZ*NR_CPUS/10)
> > -                       interval = HZ*NR_CPUS/10;
> > +               if (interval > HZ*num_online_cpus()/10)
> > +                       interval = HZ*num_online_cpus()/10;
> 
> is a horrible patch.
> 
> Think about what that expands to. It's going to potentially be two
> function calls. And the code is just ugly.
> 
> So please, when doing search-and-replace changes like this, just clean
> up the code at the same time. Back when it was about pure constants,
> there was only a typing/ugly overhead from duplicating the constant,
> but the compiler would see a single constant.
> 
> Now it's _possible_ that the compiler could do the analysis and fold
> it all back to a single thing. But it's unlikely to happen except for
> configurations that end up making it all trivial.
> 
> So just add something like a
> 
>    int max_interval = HZ*num_online_cpus()/10;
> 
> possibly even with a comment about _why_ that is the max interval allowed.  Ok?

How about something like the below?

---
Subject: sched: Clean up load-balance interval calculation

Instead of the possible multiple-evaluation of num_online_cpus() avoid
it all together in the normal case since its implemented with a hamming
weight function over a cpu bitmask which can be darn expensive for those
with big iron.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    3 +++
 kernel/sched_fair.c |   16 ++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a884551..17b4d22 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6331,6 +6331,9 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 #endif
 	}
+
+	update_max_interval();
+
 	return NOTIFY_OK;
 }
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c7ec5c8..a14a04e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3820,6 +3820,17 @@ void select_nohz_load_balancer(int stop_tick)
 
 static DEFINE_SPINLOCK(balancing);
 
+static unsigned long max_load_balance_interval = HZ/10;
+
+/*
+ * Scale the max load_balance interval with the number of CPUs in the system.
+ * This trades load-balance latency on larger machines for less cross talk.
+ */
+static void update_max_interval(void)
+{
+	max_load_balance_interval = HZ*num_online_cpus()/10;
+}
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -3849,10 +3860,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 
 		/* scale ms to jiffies */
 		interval = msecs_to_jiffies(interval);
-		if (unlikely(!interval))
-			interval = 1;
-		if (interval > HZ*num_online_cpus()/10)
-			interval = HZ*num_online_cpus()/10;
+		interval = clamp(interval, 1UL, max_load_balance_interval);
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [tip:sched/urgent] sched: Clean up rebalance_domains() load-balance interval calculation
  2011-04-05  8:14   ` Peter Zijlstra
@ 2011-04-05  8:48     ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-04-05  8:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, tglx, mingo

Commit-ID:  49c022e657fbe661460d191fbe776a387132e2b3
Gitweb:     http://git.kernel.org/tip/49c022e657fbe661460d191fbe776a387132e2b3
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 5 Apr 2011 10:14:25 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 5 Apr 2011 10:29:36 +0200

sched: Clean up rebalance_domains() load-balance interval calculation

Instead of the possible multiple-evaluation of num_online_cpus()
in rebalance_domains() that Linus reported, avoid it altogether
in the normal case since it's implemented with a Hamming weight
function over a cpu bitmask which can be darn expensive for those
with big iron.

This also makes it cleaner, smaller and documents the code.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1301991265.2225.12.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    3 +++
 kernel/sched_fair.c |   16 ++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a884551..17b4d22 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6331,6 +6331,9 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 #endif
 	}
+
+	update_max_interval();
+
 	return NOTIFY_OK;
 }
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c7ec5c8..80ecd09 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3820,6 +3820,17 @@ void select_nohz_load_balancer(int stop_tick)
 
 static DEFINE_SPINLOCK(balancing);
 
+static unsigned long __read_mostly max_load_balance_interval = HZ/10;
+
+/*
+ * Scale the max load_balance interval with the number of CPUs in the system.
+ * This trades load-balance latency on larger machines for less cross talk.
+ */
+static void update_max_interval(void)
+{
+	max_load_balance_interval = HZ*num_online_cpus()/10;
+}
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -3849,10 +3860,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 
 		/* scale ms to jiffies */
 		interval = msecs_to_jiffies(interval);
-		if (unlikely(!interval))
-			interval = 1;
-		if (interval > HZ*num_online_cpus()/10)
-			interval = HZ*num_online_cpus()/10;
+		interval = clamp(interval, 1UL, max_load_balance_interval);
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-04-05  8:49 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-02 10:31 [GIT PULL] scheduler fixes Ingo Molnar
2011-04-04 15:45 ` Linus Torvalds
2011-04-04 16:08   ` Sisir Koppaka
2011-04-04 16:16   ` Andrew Morton
2011-04-04 19:19   ` Ingo Molnar
2011-04-05  8:14   ` Peter Zijlstra
2011-04-05  8:48     ` [tip:sched/urgent] sched: Clean up rebalance_domains() load-balance interval calculation tip-bot for Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).