All of lore.kernel.org
* [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain
@ 2009-10-24 19:58 Arjan van de Ven
  2009-10-24 20:04 ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Arjan van de Ven
  2009-10-25  8:03 ` [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-24 19:58 UTC (permalink / raw)
  To: Peter Zijlstra, mingo

Subject: sched: Enable wake balancing for the SMT/HT domain
From: Arjan van de Ven <arjan@linux.intel.com>

Logical CPUs that are part of a hyperthreading/SMT set are equivalent
in terms of where to execute a task; after all they share pretty much
all resources including the L1 cache.

This means that if task A wakes up task B, we should really consider
all logical CPUs in the SMT/HT set to run task B, not just the CPU that
task A is running on; in case task A keeps running, task B now gets to 
execute with no latency. In the case where task A then immediately goes
to wait for a response from task B, nothing is lost due to the aforementioned
equivalency.

This patch turns on "balance on wakeup" and turns off "affine wakeups"
for the SMT/HT scheduler domain to get this lower-latency behavior.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index fc0bf3e..3665dc2 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,8 +95,8 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
+				| 1*SD_BALANCE_WAKE			\
+				| 0*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
 				| 0*SD_SHARE_PKG_RESOURCES		\


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* [PATCH 2/3] sched: Add aggressive load balancing for certain situations
  2009-10-24 19:58 [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Arjan van de Ven
@ 2009-10-24 20:04 ` Arjan van de Ven
  2009-10-24 20:07   ` [PATCH 3/3] sched: Disable affine wakeups by default Arjan van de Ven
  2009-10-25  8:01   ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Peter Zijlstra
  2009-10-25  8:03 ` [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-24 20:04 UTC (permalink / raw)
  To: Arjan van de Ven, Peter Zijlstra; +Cc: mingo, linux-kernel

Subject: sched: Add aggressive load balancing for certain situations
From: Arjan van de Ven <arjan@linux.intel.com>

The scheduler, in its "find idlest group" function, currently has an unconditional
threshold for an imbalance before it will consider moving a task.

However, there are situations where this is undesirable, and we want to opt in to a
more aggressive load balancing algorithm to minimize latencies.

This patch adds the infrastructure for this and also adds two cases for which
we select the aggressive approach:
1) From interrupt context. Events that happen in irq context are very likely,
   as a heuristic, to show latency-sensitive behavior
2) When doing a wake_up() and the scheduler domain we're investigating has the
   flag set that opts in to load balancing during wake_up()
   (for example the SMT/HT domain)


Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..fe9b95b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1246,7 +1246,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu, int load_idx, int agressive)
 {
 	struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
@@ -1290,7 +1290,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest)
+		return NULL;
+	if (!agressive &&  100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
 }
@@ -1412,6 +1414,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 		int weight;
+		int agressive;
 
 		if (!(sd->flags & sd_flag)) {
 			sd = sd->child;
@@ -1421,7 +1424,13 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
 
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		agressive = 0;
+		if (in_irq())
+			agressive = 1;
+		if (sd_flag & SD_BALANCE_WAKE)
+			agressive = 1;
+
+		group = find_idlest_group(sd, p, cpu, load_idx, agressive);
 		if (!group) {
 			sd = sd->child;
 			continue;



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-24 20:04 ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Arjan van de Ven
@ 2009-10-24 20:07   ` Arjan van de Ven
  2009-10-25  6:55     ` Mike Galbraith
  2009-10-25  8:01     ` Peter Zijlstra
  2009-10-25  8:01   ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-24 20:07 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

Subject: sched: Disable affine wakeups by default
From: Arjan van de Ven <arjan@linux.intel.com>

The global affine wakeup scheduler feature sounds nice, but there is a problem
with this: This is ALSO a per scheduler domain feature already.
By having the global scheduler feature enabled by default, the scheduler domains
no longer have the option to opt out.

There are domains (for example the HT/SMT domain) that have good reason to want
to opt out of this feature.

With this patch they can opt out, while all other domains currently default to
the affine setting anyway.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 0d94083..58c2ea7 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -72,7 +72,7 @@ SCHED_FEAT(SYNC_WAKEUPS, 1)
  * improve cache locality. Typically used with SYNC wakeups as
  * generated by pipes and the like, see also SYNC_WAKEUPS.
  */
-SCHED_FEAT(AFFINE_WAKEUPS, 1)
+SCHED_FEAT(AFFINE_WAKEUPS, 0)
 
 /*
  * Weaken SYNC hint based on overlap



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-24 20:07   ` [PATCH 3/3] sched: Disable affine wakeups by default Arjan van de Ven
@ 2009-10-25  6:55     ` Mike Galbraith
  2009-10-25 16:51       ` Arjan van de Ven
  2009-10-25  8:01     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-25  6:55 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sat, 2009-10-24 at 13:07 -0700, Arjan van de Ven wrote:
> Subject: sched: Disable affine wakeups by default
> From: Arjan van de Ven <arjan@linux.intel.com>
> 
> The global affine wakeup scheduler feature sounds nice, but there is a problem
> with this: This is ALSO a per scheduler domain feature already.
> By having the global scheduler feature enabled by default, the scheduler domains
> no longer have the option to opt out.

? The affine decision is qualified by SD_WAKE_AFFINE.

                if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
                    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {

                        affine_sd = tmp;
                        want_affine = 0;
                }

> There are domains (for example the HT/SMT domain) that have good reason to want
> to opt out of this feature.

Even if you're sharing a cache, there are reasons to wake affine.  If
the wakee can preempt the waker while it's still eligible to run, wakee
not only eats toasty warm data, it can hand the cpu back to the waker so
it can make more and repeat this procedure for a while without someone
else getting in between, and trashing cache.  Also, for a task which
wakes another, then checks to see if it has more work, sleeps if not,
this preemption can keep that task running, saving wakeups.  If you put
the wakee on a runqueue where it may have to wait even a tiny bit, buddy
goes to sleep, so that benefit is gone.  These things have a HUGE effect
on scalability, as you can see below.

There are times when not waking affine is good, e.g. immediately after
fork() it's _generally_ a good idea to not wake affine, because there
may be more on the way, a work generator like make, for example, doing
its thing, and fork() also frequently means an exec is on the way.
That's not usually a producer/consumer situation.

At low load, with producer/consumer, iff you can hit a shared cache,
it's a good idea to not wake affine; any waker/wakee overlap is pure
performance loss in that case.  On my Q6600, there's a 1-in-3 chance of
hitting it if left to random chance.  You can see that case happening in
the pgsql+oltp numbers below.  That wants further examination.

> With this patch they can opt out, while all other domains currently default to
> the affine setting anyway.

The patch globally disables affine wakeups.  Not good.

Oh, btw, wrt affinity vs interrupt, a long time ago, I tried disabling
affine wakeups in hard/soft and both contexts.  In all cases, it was a
losing proposition here.

One thing that would be nice for some mixed loads, including the desktop
is, if a cpu is doing high frequency sync/affine wakeups, try to keep
other things away from that cpu by considering synchronous tasks to
count as two instead of one load balancing wise.

(damn, i'm rambling.. time to shut up;)

Sorry for verbosity, numbers probably would have sufficed.  I've been
overdosing on boring affinity/scalability testing ;-)

tip v2.6.32-rc5-1691-g9a8523b

tbench 4
tip           936.314 MB/sec 8 procs
tip+patches   869.153 MB/sec 8 procs
                 .928

vmark
tip           125307 messages per second
tip+patches   103743 messages per second
                .827
              
mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10013.90   18526.84   34900.38   34420.14   33069.83   32083.40   30578.30   28010.71   25605.47
tip+patches   8436.34   17826.34   34524.32   31471.92   29188.59   27896.10   26036.43   23774.57   19524.33
                 .842       .962       .989       .914       .882       .869       .851       .848       .762

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          13907.85   27135.87   52951.98   52514.04   51742.52   50705.43   49947.97   48374.19   46227.94
tip+patches  15277.63   23050.99   51943.13   51937.16   42246.60   38397.86   34998.71   31154.21   26335.68
                1.098       .849       .980       .989       .816       .757       .700       .644       .569




* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-24 20:07   ` [PATCH 3/3] sched: Disable affine wakeups by default Arjan van de Ven
  2009-10-25  6:55     ` Mike Galbraith
@ 2009-10-25  8:01     ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2009-10-25  8:01 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: mingo, linux-kernel

On Sat, 2009-10-24 at 13:07 -0700, Arjan van de Ven wrote:
> Subject: sched: Disable affine wakeups by default
> From: Arjan van de Ven <arjan@linux.intel.com>
> 
> The global affine wakeup scheduler feature sounds nice, but there is a problem
> with this: This is ALSO a per scheduler domain feature already.
> By having the global scheduler feature enabled by default, the scheduler domains
> no longer have the option to opt out.
> 
> There are domains (for example the HT/SMT domain) that have good reason to want
> to opt out of this feature.
> 
> With this patch they can opt out, while all other domains currently default to
> the affine setting anyway.
> 
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> 

Hell no, that'll destroy many workloads. What you could possibly do is
disable it for sched domains that are known to share cache, maybe.



* Re: [PATCH 2/3] sched: Add aggressive load balancing for certain situations
  2009-10-24 20:04 ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Arjan van de Ven
  2009-10-24 20:07   ` [PATCH 3/3] sched: Disable affine wakeups by default Arjan van de Ven
@ 2009-10-25  8:01   ` Peter Zijlstra
  2009-10-25 11:48     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2009-10-25  8:01 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: mingo, linux-kernel

On Sat, 2009-10-24 at 13:04 -0700, Arjan van de Ven wrote:
> Subject: sched: Add aggressive load balancing for certain situations
> From: Arjan van de Ven <arjan@linux.intel.com>
> 
> The scheduler, in its "find idlest group" function currently has an unconditional
> threshold for an imbalance, before it will consider moving a task.
> 
> However, there are situations where this is undesirable, and we want to opt in to a
> more aggressive load balancing algorithm to minimize latencies.
> 
> This patch adds the infrastructure for this and also adds two cases for which
> we select the aggressive approach
> 1) From interrupt context. Events that happen in irq context are very likely,
>    as a heuristic, to show latency sensitive behavior
> 2) When doing a wake_up() and the scheduler domain we're investigating has the
>    flag set that opts in to load balancing during wake_up()
>    (for example the SMT/HT domain)
> 
> 
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>



> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 4e777b4..fe9b95b 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1246,7 +1246,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>   */
>  static struct sched_group *
>  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> -		  int this_cpu, int load_idx)
> +		  int this_cpu, int load_idx, int agressive)
>  {

can't we fold that into load_idx? like -1 or something?



* Re: [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain
  2009-10-24 19:58 [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Arjan van de Ven
  2009-10-24 20:04 ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Arjan van de Ven
@ 2009-10-25  8:03 ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2009-10-25  8:03 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: mingo, lkml, Mike Galbraith, suresh.b.siddha

On Sat, 2009-10-24 at 12:58 -0700, Arjan van de Ven wrote:
> Subject: sched: Enable wake balancing for the SMT/HT domain
> From: Arjan van de Ven <arjan@linux.intel.com>
> 
> Logical CPUs that are part of a hyperthreading/SMT set are equivalent
> in terms of where to execute a task; after all they share pretty much
> all resources including the L1 cache.
> 
> This means that if task A wakes up task B, we should really consider
> all logical CPUs in the SMT/HT set to run task B, not just the CPU that
> task A is running on; in case task A keeps running, task B now gets to 
> execute with no latency. In the case where task A then immediately goes
> to wait for a response from task B, nothing is lost due to the aforementioned
> equivalency.
> 
> This patch turns on "balance on wakeup" and turns off "affine wakeups"
> for the SMT/HT scheduler domain to get this lower-latency behavior.
> 
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index fc0bf3e..3665dc2 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -95,8 +95,8 @@ int arch_update_cpu_topology(void);
>  				| 1*SD_BALANCE_NEWIDLE			\
>  				| 1*SD_BALANCE_EXEC			\
>  				| 1*SD_BALANCE_FORK			\
> -				| 0*SD_BALANCE_WAKE			\
> -				| 1*SD_WAKE_AFFINE			\
> +				| 1*SD_BALANCE_WAKE			\
> +				| 0*SD_WAKE_AFFINE			\
>  				| 1*SD_SHARE_CPUPOWER			\
>  				| 0*SD_POWERSAVINGS_BALANCE		\
>  				| 0*SD_SHARE_PKG_RESOURCES		\
> 

So you're poking at SD_SIBLING_INIT, right?

That seems to make sense. Now doing the same for a cache level domain
(MC is almost that) might also make sense.



* Re: [PATCH 2/3] sched: Add aggressive load balancing for certain situations
  2009-10-25  8:01   ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Peter Zijlstra
@ 2009-10-25 11:48     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2009-10-25 11:48 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: mingo, linux-kernel

On Sun, 2009-10-25 at 09:01 +0100, Peter Zijlstra wrote:
> 
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 4e777b4..fe9b95b 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -1246,7 +1246,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >   */
> >  static struct sched_group *
> >  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > -               int this_cpu, int load_idx)
> > +               int this_cpu, int load_idx, int agressive)
> >  {
> 
> can't we fold that into load_idx? like -1 or something?

A better alternative might be passing imbalance along instead.



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-25  6:55     ` Mike Galbraith
@ 2009-10-25 16:51       ` Arjan van de Ven
  2009-10-25 17:38         ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-25 16:51 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 25 Oct 2009 07:55:25 +0100
Mike Galbraith <efault@gmx.de> wrote:
> Even if you're sharing a cache, there are reasons to wake affine.  If
> the wakee can preempt the waker while it's still eligible to run,
> wakee not only eats toasty warm data, it can hand the cpu back to the
> waker so it can make more and repeat this procedure for a while
> without someone else getting in between, and trashing cache. 

and on the flipside, and this is the workload I'm looking at,
this is halving your performance roughly due to one core being totally
busy while the other one is idle.

My workload is a relatively simple situation: firefox is starting up
and talking to X. I suspect this is representative for many X using
applications in the field. The application sends commands to X, but is
not (yet) going to wait for a response, it has more work to do. 
In this case the affine behavior not only causes latency, it also
eats throughput performance.

This is due to a few things that compound, but a key one is this code:

        if (sd_flag & SD_BALANCE_WAKE) {
                if (sched_feat(AFFINE_WAKEUPS) &&
                    cpumask_test_cpu(cpu, &p->cpus_allowed))
                        want_affine = 1;
                new_cpu = prev_cpu;
        }

the problem is that 

        if (affine_sd && wake_affine(affine_sd, p, sync)) {
                new_cpu = cpu;
                goto out;
        }

this then will trigger later, as long as there is any domain that has
SD_WAKE_AFFINE set ;(

(part of that problem is that the code that sets affine_sd is done
before the
               if (!(tmp->flags & sd_flag))
                        continue;
test)


The numbers you posted are for a database, and only measure throughput.
There's more to the world than just databases / throughput-only
computing, and I'm trying to find low impact ways to reduce the latency
aspect of things. One obvious candidate is hyperthreading/SMT where it
IS basically free to switch to a sibling, so wake-affine does not
really make sense there.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-25 16:51       ` Arjan van de Ven
@ 2009-10-25 17:38         ` Mike Galbraith
  2009-10-25 19:33           ` Arjan van de Ven
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-25 17:38 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 2009-10-25 at 09:51 -0700, Arjan van de Ven wrote:
> On Sun, 25 Oct 2009 07:55:25 +0100
> Mike Galbraith <efault@gmx.de> wrote:
> > Even if you're sharing a cache, there are reasons to wake affine.  If
> > the wakee can preempt the waker while it's still eligible to run,
> > wakee not only eats toasty warm data, it can hand the cpu back to the
> > waker so it can make more and repeat this procedure for a while
> > without someone else getting in between, and trashing cache. 
> 
> and on the flipside, and this is the workload I'm looking at,
> this is halving your performance roughly due to one core being totally
> busy while the other one is idle.

Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
problem really well.  If you can hit an idle shared cache at low load,
go for it every time.  The rest of the numbers just show how big the
penalty is if you solve affinity problems with an 8" howitzer :)

> My workload is a relatively simple situation: firefox is starting up
> and talking to X. I suspect this is representative for many X using
> applications in the field. The application sends commands to X, but is
> not (yet) going to wait for a response, it has more work to do. 
> In this case the affine behavior not only causes latency, it also
> eats throughput performance.

Yeah.  Damned if you do, damned if you don't.

> This is due to a few things that compound, but a key one is this code:
> 
>         if (sd_flag & SD_BALANCE_WAKE) {
>                 if (sched_feat(AFFINE_WAKEUPS) &&
>                     cpumask_test_cpu(cpu, &p->cpus_allowed))
>                         want_affine = 1;
>                 new_cpu = prev_cpu;
>         }
> 
> the problem is that 
> 
>         if (affine_sd && wake_affine(affine_sd, p, sync)) {
>                 new_cpu = cpu;
>                 goto out;
>         }
> 
> this then will trigger later, as long as there is any domain that has
> SD_WAKE_AFFINE set ;(

And the task looks like a synchronous task.

> (part of that problem is that the code that sets affine_sd is done
> before the
>                if (!(tmp->flags & sd_flag))
>                         continue;
> test)

Hm.  That looks like a bug, but after any task has scheduled a few
times, if it looks like a synchronous task, it'll glue itself to it's
waker's runqueue regardless.  Initial wakeup may disperse, but it will
come back if it's not overlapping.

> The numbers you posted are for a database, and only measure throughput.
> There's more to the world than just databases / throughput-only
> computing, and I'm trying to find low impact ways to reduce the latency
> aspect of things. One obvious candidate is hyperthreading/SMT where it
> IS basically free to switch to a sibling, so wake-affine does not
> really make sense there.

It's also almost free on my Q6600 if we aimed for idle shared cache.

I agree fully that affinity decisions could be more perfect than they
are.  Getting it wrong is very expensive either way.

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-25 17:38         ` Mike Galbraith
@ 2009-10-25 19:33           ` Arjan van de Ven
  2009-10-25 22:04             ` Mike Galbraith
  2009-10-26  5:21             ` [PATCH 3/3] sched: Disable affine wakeups by default Mike Galbraith
  0 siblings, 2 replies; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-25 19:33 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 25 Oct 2009 18:38:09 +0100
Mike Galbraith <efault@gmx.de> wrote:
> > > Even if you're sharing a cache, there are reasons to wake
> > > affine.  If the wakee can preempt the waker while it's still
> > > eligible to run, wakee not only eats toasty warm data, it can
> > > hand the cpu back to the waker so it can make more and repeat
> > > this procedure for a while without someone else getting in
> > > between, and trashing cache. 
> > 
> > and on the flipside, and this is the workload I'm looking at,
> > this is halving your performance roughly due to one core being
> > totally busy while the other one is idle.
> 
> Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
> problem really well.  If you can hit an idle shared cache at low load,
> go for it every time. 

sadly the current code does not do this ;(
my patch might be too big an axe for it, but it does solve this part ;)

I'll keep digging to see if we can do a more micro-incursion.

> Hm.  That looks like a bug, but after any task has scheduled a few
> times, if it looks like a synchronous task, it'll glue itself to its
> waker's runqueue regardless.  Initial wakeup may disperse, but it will
> come back if it's not overlapping.

the problem is the "synchronous to WHAT" question.
It may be synchronous to the disk for example; in the testcase I'm
looking at, we get "send message to X. do some more code. hit a page
cache miss and do IO" quite a bit.

> > The numbers you posted are for a database, and only measure
> > throughput. There's more to the world than just databases /
> > throughput-only computing, and I'm trying to find low impact ways
> > to reduce the latency aspect of things. One obvious candidate is
> > hyperthreading/SMT where it IS basically free to switch to a
> > sibling, so wake-affine does not really make sense there.
> 
> It's also almost free on my Q6600 if we aimed for idle shared cache.

yeah multicore with shared cache falls for me in the same bucket.
 
> I agree fully that affinity decisions could be more perfect than they
> are.  Getting it wrong is very expensive either way.

Looks like we agree on a key principle:
If there is a free cpu "close enough" (SMT or MC basically), the
wakee should just run on that. 

we may not agree on what to do if there's no completely free logical
cpu, but a much lighter loaded one instead.
but first we need to let code speak ;)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-25 19:33           ` Arjan van de Ven
@ 2009-10-25 22:04             ` Mike Galbraith
  2009-10-26  1:53               ` Peter Zijlstra
  2009-10-26  5:21             ` [PATCH 3/3] sched: Disable affine wakeups by default Mike Galbraith
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-25 22:04 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 2009-10-25 at 12:33 -0700, Arjan van de Ven wrote:
> On Sun, 25 Oct 2009 18:38:09 +0100
> Mike Galbraith <efault@gmx.de> wrote:
> > > > Even if you're sharing a cache, there are reasons to wake
> > > > affine.  If the wakee can preempt the waker while it's still
> > > > eligible to run, wakee not only eats toasty warm data, it can
> > > > hand the cpu back to the waker so it can make more and repeat
> > > > this procedure for a while without someone else getting in
> > > > between, and trashing cache. 
> > > 
> > > and on the flipside, and this is the workload I'm looking at,
> > > this is halving your performance roughly due to one core being
> > > totally busy while the other one is idle.
> > 
> > Yeah, the "one pgsql+oltp pair" in the numbers I posted show that
> > problem really well.  If you can hit an idle shared cache at low load,
> > go for it every time. 
> 
> sadly the current code does not do this ;(
> my patch might be too big an axe for it, but it does solve this part ;)

The below fixed up the pgsql+oltp low end, but has a negative effect on
the high end.  Must be some stuttering going on.

> I'll keep digging to see if we can do a more micro-incursion.
> 
> > Hm.  That looks like a bug, but after any task has scheduled a few
> > times, if it looks like a synchronous task, it'll glue itself to its
> > waker's runqueue regardless.  Initial wakeup may disperse, but it will
> > come back if it's not overlapping.
> 
> the problem is the "synchronous to WHAT" question.
> It may be synchronous to the disk for example; in the testcase I'm
> looking at, we get "send message to X. do some more code. hit a page
> cache miss and do IO" quite a bit.

Hm.  Yes, disk could be problematic.  It's going to be exactly what the
affinity code looks for: you wake somebody and almost immediately go to
sleep.  OTOH, even housekeeper threads make warm data.

> > > The numbers you posted are for a database, and only measure
> > > throughput. There's more to the world than just databases /
> > > throughput-only computing, and I'm trying to find low impact ways
> > > to reduce the latency aspect of things. One obvious candidate is
> > > hyperthreading/SMT where it IS basically free to switch to a
> > > sibling, so wake-affine does not really make sense there.
> > 
> > It's also almost free on my Q6600 if we aimed for idle shared cache.
> 
> yeah multicore with shared cache falls for me in the same bucket.

Anyone with a non-shared cache multicore would be most unhappy with my
little test hack.
 
> > I agree fully that affinity decisions could be more perfect than they
> > are.  Getting it wrong is very expensive either way.
> 
> Looks like we agree on a key principle:
> If there is a free cpu "close enough" (SMT or MC basically), the
> wakee should just run on that. 
> 
> we may not agree on what to do if there's no completely free logical
> cpu, but a much lighter loaded one instead.
> but first we need to let code speak ;)

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10013.90   18526.84   34900.38   34420.14   33069.83   32083.40   30578.30   28010.71   25605.47 3x avg
tip+         10071.16   18498.33   34697.17   34275.20   32761.96   31657.10   30223.70   27363.50   24698.71
              9971.57   18290.17   34632.46   34204.59   32588.94   31513.19   30081.51   27504.66   24832.24
              9884.04   18502.26   34650.08   34250.13   32707.81   31566.86   29954.19   27417.09   24811.75
             

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          13907.85   27135.87   52951.98   52514.04   51742.52   50705.43   49947.97   48374.19   46227.94 3x avg
tip+         15163.56   28882.70   52374.32   52469.79   51739.79   50602.02   49827.18   48029.84   46191.90
             15258.65   28778.77   52716.46   52405.32   51434.21   50440.66   49718.89   48082.22   46124.56
             15278.02   28178.55   52815.82   52609.98   51729.17   50652.10   49800.19   48126.95   46286.58


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37087a7..fa534f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1374,6 +1374,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
+		int level = tmp->level;
+
 		/*
 		 * If power savings logic is enabled for a domain, see if we
 		 * are not overloaded, if so, don't balance wider.
@@ -1398,11 +1400,28 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 				want_sd = 0;
 		}
 
+		/*
+		 * look for an idle shared cache before looking at last CPU.
+		 */
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+				(level == SD_LV_SIBLING || level == SD_LV_MC)) {
+			int i;
 
+			for_each_cpu(i, sched_domain_span(tmp)) {
+				if (!cpu_rq(i)->cfs.nr_running) {
+					affine_sd = tmp;
+					want_affine = 0;
+					cpu = i;
+				}
+			}
+		} else if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			affine_sd = tmp;
 			want_affine = 0;
+
+			if ((level == SD_LV_SIBLING || level == SD_LV_MC) &&
+					!cpu_rq(prev_cpu)->cfs.nr_running)
+				cpu = prev_cpu;
 		}
 
 		if (!want_sd && !want_affine)



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-25 22:04             ` Mike Galbraith
@ 2009-10-26  1:53               ` Peter Zijlstra
  2009-10-26  4:38                 ` Mike Galbraith
  2009-10-27 14:35                 ` Mike Galbraith
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2009-10-26  1:53 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Arjan van de Ven, mingo, linux-kernel

On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
>                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> -                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> +                               (level == SD_LV_SIBLING || level == SD_LV_MC)) {

quick comment without actually having looked at the patch, we should
really get rid of sd->level and encode properties of the sched domains
in sd->flags.



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  1:53               ` Peter Zijlstra
@ 2009-10-26  4:38                 ` Mike Galbraith
  2009-10-26  4:52                   ` Arjan van de Ven
  2009-10-27 14:35                 ` Mike Galbraith
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-26  4:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Arjan van de Ven, mingo, linux-kernel

On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> >                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > -                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > +                               (level == SD_LV_SIBLING || level == SD_LV_MC)) {
> 
> quick comment without actually having looked at the patch, we should
> really get rid of sd->level and encode properties of the sched domains
> in sd->flags.

Yeah, sounds right; while writing that, it looked kinda ugly.  I suppose
arch land needs to encode the cache property somehow if I really want to
be able to target cache on multicore.  Booting becomes.. exciting when I
tinker down there.

While tinkering with this, I noticed that when mysql+oltp starts
tripping over itself, if you move to any momentarily idle cpu, it helps
get the load moving again, the tail improves.  Not hugely, but quite
measurable.  There seems to be benefit to be had throughout the load
spectrum, just gotta figure out how to retrieve it without losing
anything.

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  4:38                 ` Mike Galbraith
@ 2009-10-26  4:52                   ` Arjan van de Ven
  2009-10-26  5:08                     ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-26  4:52 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, mingo, linux-kernel

On Mon, 26 Oct 2009 05:38:27 +0100
Mike Galbraith <efault@gmx.de> wrote:

> On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > >                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE)
> > > &&
> > > -                   cpumask_test_cpu(prev_cpu,
> > > sched_domain_span(tmp))) {
> > > +                               (level == SD_LV_SIBLING || level
> > > == SD_LV_MC)) {
> > 
> > quick comment without actually having looked at the patch, we should
> > really get rid of sd->level and encode properties of the sched
> > domains in sd->flags.
> 
> Yeah, sounds right, while writing that, it looked kinda ugly.  I
> suppose arch land needs to encode cache property somehow if I really
> want to be able to target cache on multicore.  Booting becomes..
> exciting when I tinker down there.

or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  4:52                   ` Arjan van de Ven
@ 2009-10-26  5:08                     ` Mike Galbraith
  2009-10-26  5:36                       ` Arjan van de Ven
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-26  5:08 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 2009-10-25 at 21:52 -0700, Arjan van de Ven wrote:
> On Mon, 26 Oct 2009 05:38:27 +0100
> Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > > >                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE)
> > > > &&
> > > > -                   cpumask_test_cpu(prev_cpu,
> > > > sched_domain_span(tmp))) {
> > > > +                               (level == SD_LV_SIBLING || level
> > > > == SD_LV_MC)) {
> > > 
> > > quick comment without actually having looked at the patch, we should
> > > really get rid of sd->level and encode properties of the sched
> > > domains in sd->flags.
> > 
> > Yeah, sounds right, while writing that, it looked kinda ugly.  I
> > suppose arch land needs to encode cache property somehow if I really
> > want to be able to target cache on multicore.  Booting becomes..
> > exciting when I tinker down there.
> 
> or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...

I don't see how.  Oh, you mean another domain level, top level being
cache property, and turn off when degenerating?  That looks like it'd be
a problem, but adding SD_CACHE_SIBLING or whatnot should work.  Problem
is how to gain knowledge of whether multicores share a cache or not.

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-25 19:33           ` Arjan van de Ven
  2009-10-25 22:04             ` Mike Galbraith
@ 2009-10-26  5:21             ` Mike Galbraith
  1 sibling, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2009-10-26  5:21 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 2009-10-25 at 12:33 -0700, Arjan van de Ven wrote:

> we may not agree on what to do if there's no completely free logical
> cpu, but only a much more lightly loaded one instead.
> but first we need to let code speak ;)

BTW, we agree here too, it's just that cache traffic generated by
ripping high frequency buddies apart can be hugely expensive, so much
caution required.  Someone needs to get off his duff, and invent ram
that doesn't suck.

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  5:08                     ` Mike Galbraith
@ 2009-10-26  5:36                       ` Arjan van de Ven
  2009-10-26  5:47                         ` Mike Galbraith
  2009-11-10 21:59                         ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-26  5:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, mingo, linux-kernel

On Mon, 26 Oct 2009 06:08:54 +0100
Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2009-10-25 at 21:52 -0700, Arjan van de Ven wrote:
> > On Mon, 26 Oct 2009 05:38:27 +0100
> > Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > > > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > > > >                 if (want_affine && (tmp->flags &
> > > > > SD_WAKE_AFFINE) &&
> > > > > -                   cpumask_test_cpu(prev_cpu,
> > > > > sched_domain_span(tmp))) {
> > > > > +                               (level == SD_LV_SIBLING ||
> > > > > level == SD_LV_MC)) {
> > > > 
> > > > quick comment without actually having looked at the patch, we
> > > > should really get rid of sd->level and encode properties of the
> > > > sched domains in sd->flags.
> > > 
> > > Yeah, sounds right, while writing that, it looked kinda ugly.  I
> > > suppose arch land needs to encode cache property somehow if I
> > > really want to be able to target cache on multicore.  Booting
> > > becomes.. exciting when I tinker down there.
> > 
> > or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...
> 
> I don't see how.  Oh, you mean another domain level, top level being
> cache property, and turn off when degenerating?  That looks like it'd
> be a problem, but adding SD_CACHE_SIBLING or whatnot should work.
> Problem is how to gain knowledge of whether multicores share a cache
> or not.

Actually I meant setting the SD_BALANCE_WAKE flag for the SMT and MC
domains (and then making sure that "MC" really means "shares LLC" in
the arch code), and then using this as an indication in the sched code..

if you're a multicore domain you better have a shared cache.. that's
what it should mean. If it does not we should fix that.



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  5:36                       ` Arjan van de Ven
@ 2009-10-26  5:47                         ` Mike Galbraith
  2009-10-26  5:57                           ` Mike Galbraith
  2009-11-10 21:59                         ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-26  5:47 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> On Mon, 26 Oct 2009 06:08:54 +0100
> Mike Galbraith <efault@gmx.de> wrote:

> > > 
> > > or we just use SD_WAKE_AFFINE / SD_BALANCE_WAKE for this...
> > 
> > I don't see how.  Oh, you mean another domain level, top level being
> > cache property, and turn off when degenerating?  That looks like it'd
> > be a problem, but adding SD_CACHE_SIBLING or whatnot should work.
> > Problem is how to gain knowledge of whether multicores share a cache
> > or not.
> 
> Actually I meant setting the SD_BALANCE_WAKE flag for the SMT and MC
> domains (and then making sure that "MC" really means "shares LLC" in
> the arch code), and then using this as indication in the sched code..

I don't think we can do that, because SD_WAKE_BALANCE already has a
different meaning.  SD_WAKE_AFFINE could be used though, affine wakeups
have always been a cache thing, and for trying to keep things affine to
a package or whatnot, we have SD_PREFER_LOCAL.  Sounds clean to me.

> if you're a multicore domain you better have a shared cache.. that's
> what it should mean. If it does not we should fix that.

Sounds reasonable to me.  I'll go make explosions.

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  5:47                         ` Mike Galbraith
@ 2009-10-26  5:57                           ` Mike Galbraith
  2009-10-26  7:01                             ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-26  5:57 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Peter Zijlstra, mingo, linux-kernel

On Mon, 2009-10-26 at 06:47 +0100, Mike Galbraith wrote:
> On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:

> > if you're a multicore domain you better have a shared cache.. that's
> > what it should mean. If it does not we should fix that.
> 
> Sounds reasonable to me.  I'll go make explosions.

(Actually, if multicore and sibling do indeed mean shared cache, no
arch tinkering should be necessary; just resetting SD_WAKE_AFFINE when
degenerating should work fine.  Only thing is multicore with siblings..
and test test test)

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  5:57                           ` Mike Galbraith
@ 2009-10-26  7:01                             ` Ingo Molnar
  2009-10-26  7:05                               ` Arjan van de Ven
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2009-10-26  7:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Arjan van de Ven, Peter Zijlstra, linux-kernel


* Mike Galbraith <efault@gmx.de> wrote:

> On Mon, 2009-10-26 at 06:47 +0100, Mike Galbraith wrote:
> > On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> 
> > > if you're a multicore domain you better have a shared cache.. 
> > > that's what it should mean. If it does not we should fix that.
> > 
> > Sounds reasonable to me.  I'll go make explosions.
> 
> (Actually, if multicore and sibling does indeed mean shared cache, no 
> arch tinkering should be necessary, just reset SD_WAKE_AFFINE when 
> degenerating should work fine.  Only thing is multicore with 
> siblings.. and test test test)

Correct.  There are a few CPUs where multicore means separate caches, but 
all modern CPUs have shared caches for cores, so we want to tune for 
that.

	Ingo


* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  7:01                             ` Ingo Molnar
@ 2009-10-26  7:05                               ` Arjan van de Ven
  2009-10-26 11:33                                 ` Suresh Siddha
  0 siblings, 1 reply; 35+ messages in thread
From: Arjan van de Ven @ 2009-10-26  7:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mike Galbraith, Peter Zijlstra, linux-kernel

On Mon, 26 Oct 2009 08:01:12 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Mon, 2009-10-26 at 06:47 +0100, Mike Galbraith wrote:
> > > On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> > 
> > > > if you're a multicore domain you better have a shared cache.. 
> > > > that's what it should mean. If it does not we should fix that.
> > > 
> > > Sounds reasonable to me.  I'll go make explosions.
> > 
> > (Actually, if multicore and sibling does indeed mean shared cache,
> > no arch tinkering should be necessary, just reset SD_WAKE_AFFINE
> > when degenerating should work fine.  Only thing is multicore with 
> > siblings.. and test test test)
> 
> Correct. There's a few cpus where multicore means separate caches but 
> all modern CPUs have shared caches for cores so we want to tune for 
> that.

for those cpus where mc means separate caches we should fix the arch
code to set up separate MC domains to be honest..
I can look into that in a bit..




* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  7:05                               ` Arjan van de Ven
@ 2009-10-26 11:33                                 ` Suresh Siddha
  0 siblings, 0 replies; 35+ messages in thread
From: Suresh Siddha @ 2009-10-26 11:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, linux-kernel

On Mon, 2009-10-26 at 00:05 -0700, Arjan van de Ven wrote:
> On Mon, 26 Oct 2009 08:01:12 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Correct. There's a few cpus where multicore means separate caches but 
> > all modern CPUs have shared caches for cores so we want to tune for 
> > that.
> 
> for those cpus where mc means separate caches we should fix the arch
> code to set up separate MC domains to be honest..
> I can look into that in a bit..

In the default performance mode, the multi-core domain is populated with
only the cores sharing the last-level cache.  In the case where the cores
don't share caches, we represent them in the SMP domain.

thanks,
suresh



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  1:53               ` Peter Zijlstra
  2009-10-26  4:38                 ` Mike Galbraith
@ 2009-10-27 14:35                 ` Mike Galbraith
  2009-10-28  7:25                   ` Mike Galbraith
  2009-11-04 19:33                   ` [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair() tip-bot for Mike Galbraith
  1 sibling, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2009-10-27 14:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Arjan van de Ven, mingo, linux-kernel

On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> >                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > -                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > +                               (level == SD_LV_SIBLING || level == SD_LV_MC)) {
> 
> quick comment without actually having looked at the patch, we should
> really get rid of sd->level and encode properties of the sched domains
> in sd->flags.

I used SD_PREFER_SIBLING in the below.  Did I break anything?

(wonder what it does for pgsql+oltp on beefy box with siblings)

tip v2.6.32-rc5-1724-g77a088c

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip           9999.77   18472.11   34931.60   34412.09   33006.76   32104.36   30700.47   28111.31   25535.09
             10082.75   18625.12   34928.17   34476.91   33088.70   32002.36   30695.77   28173.94   25551.05
              9949.05   18466.54   34942.66   34420.74   33092.45   32041.10   30666.43   28090.90   25467.63
tip avg      10010.52   18521.25   34934.14   34436.58   33062.63   32049.27   30687.55   28125.38   25517.92

tip+          9622.23   18297.65   34496.12   34230.85   32704.20   31796.54   30480.45   27740.20   25394.12
             10207.79   18275.83   34622.39   34222.47   32996.69   31936.48   30551.29   28144.48   25616.62
             10225.32   18515.02   34538.41   34278.06   33014.14   31965.31   30363.90   28089.41   25531.81
tip+ avg     10018.44   18362.83   34552.30   34243.79   32905.01   31899.44   30465.21   27991.36   25514.18
vs tip          1.000       .991       .989       .994       .995       .995       .992       .995       .999

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          13945.42   26973.91   52504.18   52613.32   51310.82   50442.61   49826.52   48760.62   45570.45
             13921.41   27021.48   52722.64   52565.16   51483.19   50638.83   49499.51   48621.31   46115.77
             13924.94   26961.02   52624.45   52365.49   51384.91   50499.44   49622.83   48065.03   45743.14
tip avg      13930.59   26985.47   52617.09   52514.65   51392.97   50526.96   49649.62   48482.32   45809.78

tip+         15259.79   29162.31   52609.01   52562.16   51578.48   50631.90   49537.41   48376.23   46058.95
             15156.54   29114.10   52760.02   52524.86   51412.94   50656.30   48774.34   47968.77   45905.02
             15118.64   29190.73   52929.34   52503.58   51574.34   50232.27   49599.15   48283.42   45766.74
tip+ avg     15178.32   29155.71   52766.12   52530.20   51521.92   50506.82   49303.63   48209.47   45910.23
vs tip          1.089      1.080      1.002      1.000      1.002       .999       .993       .994      1.002

sched: check for an idle shared cache in select_task_rq_fair()

When waking affine, check for an idle shared cache, and if found, wake to
that CPU/sibling instead of the waker's CPU.  This improves pgsql+oltp
ramp up by roughly 8%.  Possibly more for other loads, depending on overlap.
The trade-off is a roughly 1% peak downturn if tasks are truly synchronous.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

---
 kernel/sched_fair.c |   33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1398,11 +1398,36 @@ static int select_task_rq_fair(struct ta
 				want_sd = 0;
 		}
 
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+		if (want_affine && (tmp->flags & SD_WAKE_AFFINE)) {
+			int candidate = -1, i;
 
-			affine_sd = tmp;
-			want_affine = 0;
+			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+				candidate = cpu;
+
+			/*
+			 * Check for an idle shared cache.
+			 */
+			if (tmp->flags & SD_PREFER_SIBLING) {
+				if (candidate == cpu) {
+					if (!cpu_rq(prev_cpu)->cfs.nr_running)
+						candidate = prev_cpu;
+				}
+
+				if (candidate == -1 || candidate == cpu) {
+					for_each_cpu(i, sched_domain_span(tmp)) {
+						if (!cpu_rq(i)->cfs.nr_running) {
+							candidate = i;
+							break;
+						}
+					}
+				}
+			}
+
+			if (candidate >= 0) {
+				affine_sd = tmp;
+				want_affine = 0;
+				cpu = candidate;
+			}
 		}
 
 		if (!want_sd && !want_affine)




* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-27 14:35                 ` Mike Galbraith
@ 2009-10-28  7:25                   ` Mike Galbraith
  2009-10-28 18:36                     ` Mike Galbraith
  2009-11-04 19:33                   ` [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair() tip-bot for Mike Galbraith
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-10-28  7:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Arjan van de Ven, mingo, linux-kernel

On Tue, 2009-10-27 at 15:35 +0100, Mike Galbraith wrote:
> On Mon, 2009-10-26 at 02:53 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-10-25 at 23:04 +0100, Mike Galbraith wrote:
> > >                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > > -                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > > +                               (level == SD_LV_SIBLING || level == SD_LV_MC)) {
> > 
> > quick comment without actually having looked at the patch, we should
> > really get rid of sd->level and encode properties of the sched domains
> > in sd->flags.
> 
> I used SD_PREFER_SIBLING in the below.  Did I break anything?

Um, other than taskset.

> +				if (candidate == -1 || candidate == cpu) {
> +					for_each_cpu(i, sched_domain_span(tmp)) {

                                                if (!cpumask_test_cpu(i, &p->cpus_allowed))
                                                        continue;

(i think i'm going to need that domain flag in a few minutes anyway)

	-Mike



* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-28  7:25                   ` Mike Galbraith
@ 2009-10-28 18:36                     ` Mike Galbraith
  0 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2009-10-28 18:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Arjan van de Ven, mingo, linux-kernel

On Wed, 2009-10-28 at 08:25 +0100, Mike Galbraith wrote:

> (i think i'm going to need that domain flag in a few minutes anyway)

Dirty trick didn't work out.

As long as I let wake_affine() decide whether an affine wakeup should
succeed based on the original cpu and then take the switched out cpu
instead, it worked ok.  Try to be cleaner, and it turned to crud.

	-Mike



* [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()
  2009-10-27 14:35                 ` Mike Galbraith
  2009-10-28  7:25                   ` Mike Galbraith
@ 2009-11-04 19:33                   ` tip-bot for Mike Galbraith
  2009-11-04 20:37                     ` Mike Galbraith
  2009-11-05  9:30                     ` Ingo Molnar
  1 sibling, 2 replies; 35+ messages in thread
From: tip-bot for Mike Galbraith @ 2009-11-04 19:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, efault, arjan, stable, tglx, mingo

Commit-ID:  a1f84a3ab8e002159498814eaa7e48c33752b04b
Gitweb:     http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
Author:     Mike Galbraith <efault@gmx.de>
AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 4 Nov 2009 18:46:22 +0100

sched: Check for an idle shared cache in select_task_rq_fair()

When waking affine, check for an idle shared cache, and if
found, wake to that CPU/sibling instead of the waker's CPU.

This improves pgsql+oltp ramp up by roughly 8%. Possibly more
for other loads, depending on overlap. The trade-off is a
roughly 1% peak downturn if tasks are truly synchronous.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
LKML-Reference: <1256654138.17752.7.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   33 +++++++++++++++++++++++++++++----
 1 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..da87385 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1372,11 +1372,36 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 				want_sd = 0;
 		}
 
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+		if (want_affine && (tmp->flags & SD_WAKE_AFFINE)) {
+			int candidate = -1, i;
 
-			affine_sd = tmp;
-			want_affine = 0;
+			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+				candidate = cpu;
+
+			/*
+			 * Check for an idle shared cache.
+			 */
+			if (tmp->flags & SD_PREFER_SIBLING) {
+				if (candidate == cpu) {
+					if (!cpu_rq(prev_cpu)->cfs.nr_running)
+						candidate = prev_cpu;
+				}
+
+				if (candidate == -1 || candidate == cpu) {
+					for_each_cpu(i, sched_domain_span(tmp)) {
+						if (!cpu_rq(i)->cfs.nr_running) {
+							candidate = i;
+							break;
+						}
+					}
+				}
+			}
+
+			if (candidate >= 0) {
+				affine_sd = tmp;
+				want_affine = 0;
+				cpu = candidate;
+			}
 		}
 
 		if (!want_sd && !want_affine)


* Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()
  2009-11-04 19:33                   ` [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair() tip-bot for Mike Galbraith
@ 2009-11-04 20:37                     ` Mike Galbraith
  2009-11-04 21:41                       ` Mike Galbraith
  2009-11-05  9:30                     ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2009-11-04 20:37 UTC (permalink / raw)
  To: arjan, mingo, hpa, linux-kernel, stable, peterz, tglx, mingo
  Cc: linux-tip-commits

On Wed, 2009-11-04 at 19:33 +0000, tip-bot for Mike Galbraith wrote:
> Commit-ID:  a1f84a3ab8e002159498814eaa7e48c33752b04b
> Gitweb:     http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> Author:     Mike Galbraith <efault@gmx.de>
> AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
> 
> sched: Check for an idle shared cache in select_task_rq_fair()
> 
> When waking affine, check for an idle shared cache, and if
> found, wake to that CPU/sibling instead of the waker's CPU.

This one still wants some work.  It's not playing well with 1b9508f.

tip v2.6.32-rc6-1731-gc5bb4b1

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10447.14   19734.58   36038.18   35776.85   34662.76   33682.30   32256.22   28770.99   25323.23
             10462.61   19580.14   36050.48   35942.63   35054.84   33988.40   32423.89   29259.65   25892.24
             10501.02   19231.27   36007.03   35985.32   35060.79   33945.47   32400.42   29140.84   25716.16
tip avg      10470.25   19515.33   36031.89   35901.60   34926.13   33872.05   32360.17   29057.16   25643.87

tip v2.6.32-rc6-1731-gc5bb4b1 + 1b9508f (Rate-limit newidle)
tip+         10095.64   19989.67   36449.85   35923.98   35024.24   34026.61   32411.75   28965.77   27287.96
             10764.98   19864.03   36632.78   36338.00   35610.52   34439.88   32942.57   29889.18   27426.12
             10734.48   19572.63   36507.90   36236.36   35486.51   34443.86   32938.90   30136.39   27186.61
tip+ avg     10531.70   19808.77   36530.17   36166.11   35373.75   34303.45   32764.40   29663.78   27300.23
                1.005      1.015      1.013      1.007      1.012      1.012      1.012      1.020      1.064

v2.6.32-rc6-1796-gd995f1d (a1f84a3 not playing well with 1b9508f)
             10578.44   19358.96   36184.70   35549.19   34625.40   33658.72   31947.51   27504.11   20400.97
             10610.76   19748.84   36401.08   36021.75   34830.67   33566.52   31989.97   27622.25   21348.00
             10568.87   19600.60   36190.48   35959.96   34805.45   33641.46   32005.30   27591.48   22509.01




* Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()
  2009-11-04 20:37                     ` Mike Galbraith
@ 2009-11-04 21:41                       ` Mike Galbraith
  0 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2009-11-04 21:41 UTC (permalink / raw)
  To: arjan
  Cc: mingo, hpa, linux-kernel, stable, peterz, tglx, mingo, linux-tip-commits

On Wed, 2009-11-04 at 21:37 +0100, Mike Galbraith wrote:
> On Wed, 2009-11-04 at 19:33 +0000, tip-bot for Mike Galbraith wrote:
> > Commit-ID:  a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Gitweb:     http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Author:     Mike Galbraith <efault@gmx.de>
> > AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
> > 
> > sched: Check for an idle shared cache in select_task_rq_fair()
> > 
> > When waking affine, check for an idle shared cache, and if
> > found, wake to that CPU/sibling instead of the waker's CPU.
> 
> This one still wants some work.  It's not playing well with 1b9508f.

Still does want some work (on the tail), but those numbers were from a
failed experiment kernel, not d995f1d.  Below are the correct numbers.

tip v2.6.32-rc6-1731-gc5bb4b1

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10447.14   19734.58   36038.18   35776.85   34662.76   33682.30   32256.22   28770.99   25323.23
             10462.61   19580.14   36050.48   35942.63   35054.84   33988.40   32423.89   29259.65   25892.24
             10501.02   19231.27   36007.03   35985.32   35060.79   33945.47   32400.42   29140.84   25716.16
tip avg      10470.25   19515.33   36031.89   35901.60   34926.13   33872.05   32360.17   29057.16   25643.87

tip v2.6.32-rc6-1731-gc5bb4b1 + 1b9508f (Rate-limit newidle)
tip+         10095.64   19989.67   36449.85   35923.98   35024.24   34026.61   32411.75   28965.77   27287.96
             10764.98   19864.03   36632.78   36338.00   35610.52   34439.88   32942.57   29889.18   27426.12
             10734.48   19572.63   36507.90   36236.36   35486.51   34443.86   32938.90   30136.39   27186.61
tip+ avg     10531.70   19808.77   36530.17   36166.11   35373.75   34303.45   32764.40   29663.78   27300.23
vs tip avg      1.005      1.015      1.013      1.007      1.012      1.012      1.012      1.020      1.064

v2.6.32-rc6-1796-gd995f1d
             10745.30   19684.71   36367.25   35890.25   34598.08   33672.37   32327.99   28744.09   25393.00
             10549.41   19747.85   36778.96   36410.58   35531.90   34292.03   32603.66   29487.77   25351.09
             10678.30   19944.74   36789.44   36403.78   35523.40   34267.10   32659.60   29359.59   25751.93
avg          10657.67   19792.43   36645.21   36234.87   35217.79   34077.16   32530.41   29197.15   25498.67
vs c5bb4b1      1.017      1.014      1.017      1.009      1.008      1.006      1.005      1.004      0.994

Lost the tail gain, gained peak.  Drink one, spill one, give one away.

	-Mike



* Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()
  2009-11-04 19:33                   ` [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair() tip-bot for Mike Galbraith
  2009-11-04 20:37                     ` Mike Galbraith
@ 2009-11-05  9:30                     ` Ingo Molnar
  2009-11-05  9:57                       ` Mike Galbraith
  1 sibling, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2009-11-05  9:30 UTC (permalink / raw)
  To: tip-bot for Mike Galbraith
  Cc: linux-tip-commits, linux-kernel, hpa, mingo, peterz, arjan, stable, tglx


* tip-bot for Mike Galbraith <efault@gmx.de> wrote:

> Commit-ID:  a1f84a3ab8e002159498814eaa7e48c33752b04b
> Gitweb:     http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> Author:     Mike Galbraith <efault@gmx.de>
> AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
> 
> sched: Check for an idle shared cache in select_task_rq_fair()

-tip testing found that this causes problems:

[   26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[   26.808000] caller is vmstat_update+0x26/0x70
[   26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
[   26.816000] Call Trace:
[   26.820000]  [<c1924a24>] ? printk+0x28/0x3c
[   26.824000]  [<c13258a0>] debug_smp_processor_id+0xf0/0x110
[   26.824000] mount used greatest stack depth: 1464 bytes left
[   26.828000]  [<c111d086>] vmstat_update+0x26/0x70
[   26.832000]  [<c1086418>] worker_thread+0x188/0x310
[   26.836000]  [<c10863b7>] ? worker_thread+0x127/0x310
[   26.840000]  [<c108d310>] ? autoremove_wake_function+0x0/0x60
[   26.844000]  [<c1086290>] ? worker_thread+0x0/0x310
[   26.848000]  [<c108cf0c>] kthread+0x7c/0x90
[   26.852000]  [<c108ce90>] ? kthread+0x0/0x90
[   26.856000]  [<c100c0a7>] kernel_thread_helper+0x7/0x10
[   26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[   26.864000] caller is vmstat_update+0x3c/0x70

oh ... doesn't it break cpus_allowed?

	Ingo


* Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()
  2009-11-05  9:30                     ` Ingo Molnar
@ 2009-11-05  9:57                       ` Mike Galbraith
  2009-11-05 10:00                         ` Mike Galbraith
  2009-11-06  7:09                         ` [tip:sched/core] sched: Fix affinity logic " tip-bot for Mike Galbraith
  0 siblings, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2009-11-05  9:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-tip-commits, linux-kernel, hpa, mingo, peterz, arjan, stable, tglx

On Thu, 2009-11-05 at 10:30 +0100, Ingo Molnar wrote:
> * tip-bot for Mike Galbraith <efault@gmx.de> wrote:
> 
> > Commit-ID:  a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Gitweb:     http://git.kernel.org/tip/a1f84a3ab8e002159498814eaa7e48c33752b04b
> > Author:     Mike Galbraith <efault@gmx.de>
> > AuthorDate: Tue, 27 Oct 2009 15:35:38 +0100
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Wed, 4 Nov 2009 18:46:22 +0100
> > 
> > sched: Check for an idle shared cache in select_task_rq_fair()
> 
> -tip testing found that this causes problems:
> 
> [   26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
> [   26.808000] caller is vmstat_update+0x26/0x70
> [   26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
> [   26.816000] Call Trace:
> [   26.820000]  [<c1924a24>] ? printk+0x28/0x3c
> [   26.824000]  [<c13258a0>] debug_smp_processor_id+0xf0/0x110
> [   26.824000] mount used greatest stack depth: 1464 bytes left
> [   26.828000]  [<c111d086>] vmstat_update+0x26/0x70
> [   26.832000]  [<c1086418>] worker_thread+0x188/0x310
> [   26.836000]  [<c10863b7>] ? worker_thread+0x127/0x310
> [   26.840000]  [<c108d310>] ? autoremove_wake_function+0x0/0x60
> [   26.844000]  [<c1086290>] ? worker_thread+0x0/0x310
> [   26.848000]  [<c108cf0c>] kthread+0x7c/0x90
> [   26.852000]  [<c108ce90>] ? kthread+0x0/0x90
> [   26.856000]  [<c100c0a7>] kernel_thread_helper+0x7/0x10
> [   26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
> [   26.864000] caller is vmstat_update+0x3c/0x70
> 
> oh ... doesn't it break cpus_allowed?

Erk, indeed.  You didn't apply the follow-up fix, and I forgot as well.


sched: select_task_rq_fair(): add missing cpus_allowed check in commit a1f84a3.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 32f06ed..5488a5d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1415,6 +1415,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 
 				if (candidate == -1 || candidate == cpu) {
 					for_each_cpu(i, sched_domain_span(tmp)) {
+						if (!cpumask_test_cpu(i, &p->cpus_allowed))
+							continue;
 						if (!cpu_rq(i)->cfs.nr_running) {
 							candidate = i;
 							break;




* Re: [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair()
  2009-11-05  9:57                       ` Mike Galbraith
@ 2009-11-05 10:00                         ` Mike Galbraith
  2009-11-06  7:09                         ` [tip:sched/core] sched: Fix affinity logic " tip-bot for Mike Galbraith
  1 sibling, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2009-11-05 10:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-tip-commits, linux-kernel, hpa, mingo, peterz, arjan, stable, tglx

On Thu, 2009-11-05 at 10:57 +0100, Mike Galbraith wrote:

> Erk, indeed.  You didn't apply the follow-up fix, and I forgot as well.

But maybe you should just back it out until I've fiddled/tested a bit
more.  Dunno.

	-Mike



* [tip:sched/core] sched: Fix affinity logic in select_task_rq_fair()
  2009-11-05  9:57                       ` Mike Galbraith
  2009-11-05 10:00                         ` Mike Galbraith
@ 2009-11-06  7:09                         ` tip-bot for Mike Galbraith
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot for Mike Galbraith @ 2009-11-06  7:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, stable, efault, tglx, mingo

Commit-ID:  fd210738f6601d0fb462df9a2fe5a41896ff6a8f
Gitweb:     http://git.kernel.org/tip/fd210738f6601d0fb462df9a2fe5a41896ff6a8f
Author:     Mike Galbraith <efault@gmx.de>
AuthorDate: Thu, 5 Nov 2009 10:57:46 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 5 Nov 2009 11:01:39 +0100

sched: Fix affinity logic in select_task_rq_fair()

Ingo Molnar reported:

[   26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[   26.808000] caller is vmstat_update+0x26/0x70
[   26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
[   26.816000] Call Trace:
[   26.820000]  [<c1924a24>] ? printk+0x28/0x3c
[   26.824000]  [<c13258a0>] debug_smp_processor_id+0xf0/0x110
[   26.824000] mount used greatest stack depth: 1464 bytes left
[   26.828000]  [<c111d086>] vmstat_update+0x26/0x70
[   26.832000]  [<c1086418>] worker_thread+0x188/0x310
[   26.836000]  [<c10863b7>] ? worker_thread+0x127/0x310
[   26.840000]  [<c108d310>] ? autoremove_wake_function+0x0/0x60
[   26.844000]  [<c1086290>] ? worker_thread+0x0/0x310
[   26.848000]  [<c108cf0c>] kthread+0x7c/0x90
[   26.852000]  [<c108ce90>] ? kthread+0x0/0x90
[   26.856000]  [<c100c0a7>] kernel_thread_helper+0x7/0x10
[   26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[   26.864000] caller is vmstat_update+0x3c/0x70

Because this commit:

  a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()

broke ->cpus_allowed.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: arjan@infradead.org
Cc: <stable@kernel.org>
LKML-Reference: <1257415066.12867.1.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index da87385..e4d4483 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1389,6 +1389,8 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 
 				if (candidate == -1 || candidate == cpu) {
 					for_each_cpu(i, sched_domain_span(tmp)) {
+						if (!cpumask_test_cpu(i, &p->cpus_allowed))
+							continue;
 						if (!cpu_rq(i)->cfs.nr_running) {
 							candidate = i;
 							break;


* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-10-26  5:36                       ` Arjan van de Ven
  2009-10-26  5:47                         ` Mike Galbraith
@ 2009-11-10 21:59                         ` Peter Zijlstra
  2009-11-11  6:01                           ` Arjan van de Ven
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2009-11-10 21:59 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Mike Galbraith, mingo, lkml

On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> 
> if you're a multicore domain you better have a shared cache.. that's
> what it should mean. If it does not we should fix that.

Fully agreed.. I've been dying to do the below for ages :-)

---
 arch/x86/kernel/smpboot.c |   10 +---------
 1 files changed, 1 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 213a7a3..297b307 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -433,15 +433,7 @@ void __cpuinit set_cpu_sibling_map(int cpu)
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
-	/*
-	 * For perf, we return last level cache shared map.
-	 * And for power savings, we return cpu_core_map
-	 */
-	if ((sched_mc_power_savings || sched_smt_power_savings) &&
-	    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
-		return cpu_core_mask(cpu);
-	else
-		return c->llc_shared_map;
+	return c->llc_shared_map;
 }
 
 static void impress_friends(void)




* Re: [PATCH 3/3] sched: Disable affine wakeups by default
  2009-11-10 21:59                         ` Peter Zijlstra
@ 2009-11-11  6:01                           ` Arjan van de Ven
  0 siblings, 0 replies; 35+ messages in thread
From: Arjan van de Ven @ 2009-11-11  6:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, mingo, lkml

On Tue, 10 Nov 2009 22:59:29 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, 2009-10-25 at 22:36 -0700, Arjan van de Ven wrote:
> > 
> > if you're a multicore domain you better have a shared cache.. that's
> > what it should mean. If it does not we should fix that.
> 
> Fully agreed.. I've been dying to do the below for ages :-)
> 

completely sane.

Acked-by: Arjan van de Ven <arjan@linux.intel.com>



end of thread, other threads:[~2009-11-11  5:59 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-24 19:58 [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Arjan van de Ven
2009-10-24 20:04 ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Arjan van de Ven
2009-10-24 20:07   ` [PATCH 3/3] sched: Disable affine wakeups by default Arjan van de Ven
2009-10-25  6:55     ` Mike Galbraith
2009-10-25 16:51       ` Arjan van de Ven
2009-10-25 17:38         ` Mike Galbraith
2009-10-25 19:33           ` Arjan van de Ven
2009-10-25 22:04             ` Mike Galbraith
2009-10-26  1:53               ` Peter Zijlstra
2009-10-26  4:38                 ` Mike Galbraith
2009-10-26  4:52                   ` Arjan van de Ven
2009-10-26  5:08                     ` Mike Galbraith
2009-10-26  5:36                       ` Arjan van de Ven
2009-10-26  5:47                         ` Mike Galbraith
2009-10-26  5:57                           ` Mike Galbraith
2009-10-26  7:01                             ` Ingo Molnar
2009-10-26  7:05                               ` Arjan van de Ven
2009-10-26 11:33                                 ` Suresh Siddha
2009-11-10 21:59                         ` Peter Zijlstra
2009-11-11  6:01                           ` Arjan van de Ven
2009-10-27 14:35                 ` Mike Galbraith
2009-10-28  7:25                   ` Mike Galbraith
2009-10-28 18:36                     ` Mike Galbraith
2009-11-04 19:33                   ` [tip:sched/core] sched: Check for an idle shared cache in select_task_rq_fair() tip-bot for Mike Galbraith
2009-11-04 20:37                     ` Mike Galbraith
2009-11-04 21:41                       ` Mike Galbraith
2009-11-05  9:30                     ` Ingo Molnar
2009-11-05  9:57                       ` Mike Galbraith
2009-11-05 10:00                         ` Mike Galbraith
2009-11-06  7:09                         ` [tip:sched/core] sched: Fix affinity logic " tip-bot for Mike Galbraith
2009-10-26  5:21             ` [PATCH 3/3] sched: Disable affine wakeups by default Mike Galbraith
2009-10-25  8:01     ` Peter Zijlstra
2009-10-25  8:01   ` [PATCH 2/3] sched: Add aggressive load balancing for certain situations Peter Zijlstra
2009-10-25 11:48     ` Peter Zijlstra
2009-10-25  8:03 ` [PATCH 1/3] sched: Enable wake balancing for the SMT/HT domain Peter Zijlstra
