* Plumbers: Tweaking scheduler policy micro-conf RFP
@ 2012-05-11 16:16 Vincent Guittot
  2012-05-11 16:26 ` Steven Rostedt
                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Vincent Guittot @ 2012-05-11 16:16 UTC (permalink / raw)
  To: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, peterz, panto, mingo, paul.brett,
	pdeschrijver, pjt, efault, fweisbec, geoff, rostedt, tglx,
	amit.kucheria
  Cc: linux-kernel, linaro-sched-sig

This is a request for participation in a micro-conference during the
next Linux Plumbers Conference (29-31 Aug).
It will require a critical mass of talk submissions in the general
area of scheduler and task management.

If you're working on improving the scheduler policy used to place a
task on a CPU to suit your HW, we invite you to submit a proposal
presenting your problem (e.g. power efficiency) or a solution to said
problem that upstream developers should consider.

We've interacted with the people on the To: list before in our quest
to better understand how the scheduler works, and we hope you will all
consider participating in the micro-conf to help guide what kinds of
ideas are likely to make it upstream.

If you have ongoing work or ideas in the following areas, we're
especially interested in hearing from you:
 1. Consolidation of statistics with other frameworks (cpuidle,
cpufreq, and the scheduler all seem to track their own statistics
related to load, idleness, etc. Can this be turned into a library
usable by all?)
 2. Replacement for task consolidation on fewer CPUs, i.e. a
replacement for sched_mc
 3. Improvement in the placement of activity other than tasks: timers,
workqueues, IO, interrupts
 4. Instrumentation to calculate the compute capacity available on
active cores and its utilization by a given workload

We are thinking of organising the micro-conf as a Q & A session where
a participant would state a problem and we would then brainstorm on
whether this is indeed a problem and, if so, how to achieve a
solution. In other words, 20-30 minute slots, each Q & A covering:

1. Problem statements with specific examples of why changing the
default scheduler policy is desired
2. For each problem, if it is deemed not possible to accomplish easily
today, brainstorming on what an acceptable solution would look like
(frameworks to build upon, interfaces to use, related work in the
area, key people to involve, etc.)

Please email us if you will be attending the conference and are
interested in talking about this problem space.

Regards,
Amit & Vincent

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-11 16:16 Plumbers: Tweaking scheduler policy micro-conf RFP Vincent Guittot
@ 2012-05-11 16:26 ` Steven Rostedt
  2012-05-11 16:38   ` Vincent Guittot
  2012-05-15  0:53 ` Paul E. McKenney
  2012-05-15  8:02 ` Vincent Guittot
  2 siblings, 1 reply; 55+ messages in thread
From: Steven Rostedt @ 2012-05-11 16:26 UTC (permalink / raw)
  To: Vincent Guittot, Juri Lelli
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, peterz, panto, mingo, paul.brett,
	pdeschrijver, pjt, efault, fweisbec, geoff, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig

Not really specific to HW, but Juri Lelli has been working on a deadline
scheduler. Perhaps his work may be of interest.

-- Steve



* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-11 16:26 ` Steven Rostedt
@ 2012-05-11 16:38   ` Vincent Guittot
  2012-05-15  8:41     ` Juri Lelli
  0 siblings, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2012-05-11 16:38 UTC (permalink / raw)
  To: Steven Rostedt, Juri Lelli
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, peterz, panto, mingo, paul.brett,
	pdeschrijver, pjt, efault, fweisbec, geoff, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig

On 11 May 2012 18:26, Steven Rostedt <rostedt@goodmis.org> wrote:
> Not really specific to HW, but Juri Lelli has been working on a deadline
> scheduler. Perhaps his work may be of interest.

Yes, for sure, if Juri agrees to present his work.

Vincent

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-11 16:16 Plumbers: Tweaking scheduler policy micro-conf RFP Vincent Guittot
  2012-05-11 16:26 ` Steven Rostedt
@ 2012-05-15  0:53 ` Paul E. McKenney
  2012-05-15  8:02 ` Vincent Guittot
  2 siblings, 0 replies; 55+ messages in thread
From: Paul E. McKenney @ 2012-05-15  0:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, peterz, panto, mingo, paul.brett,
	pdeschrijver, pjt, efault, fweisbec, geoff, rostedt, tglx,
	amit.kucheria, linux-kernel, linaro-sched-sig

On Fri, May 11, 2012 at 06:16:21PM +0200, Vincent Guittot wrote:
> This is a request-for-participation in a micro conference during the
> next Linux Plumber Conference (29-31st Aug).
> It'll require critical mass measured in talk submissions in the
> general area of scheduler and task management.

Hello, Vincent,

I still cannot claim any particular scheduler expertise, but would be
happy to act as moderator, if you would like.

							Thanx, Paul


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-11 16:16 Plumbers: Tweaking scheduler policy micro-conf RFP Vincent Guittot
  2012-05-11 16:26 ` Steven Rostedt
  2012-05-15  0:53 ` Paul E. McKenney
@ 2012-05-15  8:02 ` Vincent Guittot
  2012-05-15  8:34   ` mou Chen
  2012-05-15 12:23   ` Peter Zijlstra
  2 siblings, 2 replies; 55+ messages in thread
From: Vincent Guittot @ 2012-05-15  8:02 UTC (permalink / raw)
  To: peterz
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On 11 May 2012 18:16, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> This is a request-for-participation in a micro conference during the
> next Linux Plumber Conference (29-31st Aug).
> It'll require critical mass measured in talk submissions in the
> general area of scheduler and task management.
>
>
> If you have ongoing work or ideas in the following areas we're
> especially interested in hearing from you:
>  1. Consolidation of statistics with other frameworks (cpuidle,
> cpufreq, scheduler all seem to track their own statistics related to
> load, idleness, etc. Can this be converted to a library that is
> useable by all?)
>  2. Replacement for task consolidation on fewer CPUs aka. replacement
> for sched_mc

Peter,

Would you like to present the ongoing work around the load-balance
policy and the replacement for sched_mc during the scheduler
micro-conf?

Regards,
Vincent


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15  8:02 ` Vincent Guittot
@ 2012-05-15  8:34   ` mou Chen
  2012-05-15  9:07     ` Vincent Guittot
  2012-05-15 12:23   ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: mou Chen @ 2012-05-15  8:34 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: linux-kernel, Ingo Molnar, torvalds, Peter Zijlstra

Hi Vincent

There is no need to change the scheduling policy for a SPECIFIC
workload. Changing the scheduling policy just for that reason will
hurt overall performance. People know that a desktop scheduler is for
interactivity and a server scheduler is for throughput, and we are
just designing a scheduler for these two groups of tasks.

All right, you may not agree with me at all. However, changing the
policy for a SPECIFIC workload is unnecessary. Or it would be better
to share "how to make a scheduler for yourselves" with people who
don't know how to write a scheduler. :-)

                                                      Chen


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-11 16:38   ` Vincent Guittot
@ 2012-05-15  8:41     ` Juri Lelli
  0 siblings, 0 replies; 55+ messages in thread
From: Juri Lelli @ 2012-05-15  8:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Steven Rostedt, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, peterz, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, tglx,
	amit.kucheria, linux-kernel, linaro-sched-sig

On 05/11/2012 06:38 PM, Vincent Guittot wrote:
> On 11 May 2012 18:26, Steven Rostedt<rostedt@goodmis.org>  wrote:
>> Not really specific to HW, but Juri Lelli has been working on a deadline
>> scheduler. Perhaps his work may be of interest.
>
> Yes, for sure, if Juri agrees to present his work.
>

Sure! It will be really interesting for me.
I'm setting up the proposal. I'll send it today.

Thanks and Regards,

- Juri
  

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15  8:34   ` mou Chen
@ 2012-05-15  9:07     ` Vincent Guittot
  2012-05-15  9:17       ` Pantelis Antoniou
  0 siblings, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2012-05-15  9:07 UTC (permalink / raw)
  To: mou Chen; +Cc: linux-kernel, Ingo Molnar, torvalds, Peter Zijlstra

On 15 May 2012 10:34, mou Chen <hi3766691@gmail.com> wrote:
> Hi Vincent
>
> There is no need to change the scheduling policy for a SPECIFIC
> workload. Changing the scheduling policy just for that reason will
> hurt overall performance. People know that a desktop scheduler is for
> interactivity and a server scheduler is for throughput, and we are
> just designing a scheduler for these two groups of tasks.

Hi Chen,

First of all, I'm not sure that limiting the scheduling policy to
server and desktop modes is the right solution, because Linux is used
in many more systems, such as embedded ones.

Then, some weaknesses have been pointed out around sched_mc and its
power-saving policy, and IIRC Peter was close to removing the sched_mc
stuff: http://thread.gmane.org/gmane.linux.kernel/1236846. It sounds
like sched_mc should be removed, replaced, or at least reworked into
something smarter that takes into account more inputs than just the
cluster or hyper-threading links between cores, which no longer seem
sufficient.

So LPC would be a good place to discuss this point.

Regards,
Vincent


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15  9:07     ` Vincent Guittot
@ 2012-05-15  9:17       ` Pantelis Antoniou
  2012-05-15 10:28         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Pantelis Antoniou @ 2012-05-15  9:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mou Chen, linux-kernel, Ingo Molnar, torvalds, Peter Zijlstra


On May 15, 2012, at 12:07 PM, Vincent Guittot wrote:

> On 15 May 2012 10:34, mou Chen <hi3766691@gmail.com> wrote:
>> Hi Vincent
>> 
>> There is no need to change the scheduling policy for a SPECIFIC
>> workload. Changing the scheduling policy just for that reason will
>> hurt overall performance. People know that a desktop scheduler is for
>> interactivity and a server scheduler is for throughput, and we are
>> just designing a scheduler for these two groups of tasks.

Hi Vincent

IMO this whole idea about 'server' or 'desktop' schedulers is bunk.

There should only be a single scheduler, but it should be flexible
enough to cater to the needs of quite different classes of hardware,
and their specific workload requirements.

I feel that we're in the birthing-pains period where what used to be
a simple goal for the scheduler (perform as much work as possible in
the minimum amount of time) turns into something more complex (perform
as much work as possible while staying within this power & thermal
envelope).

The LPC should be an excellent place to discuss how we can achieve this.

Regards

-- Pantelis

PS. Or have lots of beer crying over it...



* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15  9:17       ` Pantelis Antoniou
@ 2012-05-15 10:28         ` Peter Zijlstra
  2012-05-15 11:35           ` Pantelis Antoniou
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 10:28 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Vincent Guittot, mou Chen, linux-kernel, Ingo Molnar, torvalds

On Tue, 2012-05-15 at 12:17 +0300, Pantelis Antoniou wrote:
> IMO this whole idea about 'server' or 'desktop' schedulers is bunk.

Yeah, it's complete shite. Everybody cares about throughput, latency,
and power. The exact balance might differ between workloads, but those
cannot be split between desktop/server at all. Furthermore, nobody
wants any one of them at all costs to the others.


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 10:28         ` Peter Zijlstra
@ 2012-05-15 11:35           ` Pantelis Antoniou
  2012-05-15 11:58             ` Peter Zijlstra
  2012-05-15 20:26             ` valdis.kletnieks
  0 siblings, 2 replies; 55+ messages in thread
From: Pantelis Antoniou @ 2012-05-15 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mou Chen, linux-kernel, Ingo Molnar, torvalds

On May 15, 2012, at 1:28 PM, Peter Zijlstra wrote:

> On Tue, 2012-05-15 at 12:17 +0300, Pantelis Antoniou wrote:
>> IMO this whole idea about 'server' or 'desktop' schedulers is bunk.
> 
> Yeah, it's complete shite. Everybody cares about throughput, latency,
> and power. The exact balance might differ between workloads, but those
> cannot be split between desktop/server at all. Furthermore, nobody
> wants any one of them at all costs to the others.

Expanding on this a little more, the balance between the factors might
vary according to the workload you are running at the time, not some
pre-set scenario.

For example, take a server configuration: one would expect it to be
geared towards throughput with no regard for power or latency. This is
not the case today. Power savings can be considerable, and low latency
might be very desirable if you run something like a VoIP-based soft
PBX on it.

Same with a desktop: running ooffice and a browser most of the time,
but expected to handle a game or an audio editing/performing
application as well.

The smart-phone case is like juggling coals; you need to have the
minimum amount of power draw, but you had better offer minimum latency
and high throughput when pissed-off-avians is on.

Now the question is how to fit this in a scheduler policy.

We have the three that Peter mentioned:

Throughput
Latency
Power

I can think of two more; thermal management, and memory I/O.

What others can we come up with? And what units are we going to
measure them in?

For example:

Throughput: MIPS(?), bogo-mips(?), some kind of performance counter?

Latency: usecs(?)

Power: Now that's a tricky one. We can't measure power directly; it's
a function of the CPU load we run over a period of time, along with
the history of the C-states & P-states during that period. How can we
collect information about that? We also need to take peripheral device
power into account; GPUs are particularly power hungry.

Thermal management: How to distribute load to the processors in such
a way that the temperature of the die doesn't increase so much that
we have to either go to a lower OPP or shut down the core altogether.
This is in direct conflict with throughput, since we'd have better
performance if we could keep the same warmed-up CPU going.

Memory I/O: Some workloads are memory-bandwidth hungry but do not need
much CPU power. In the case of asymmetric cores it would make sense to
move the memory-bandwidth hog to a lower-performance CPU without any
impact. We'd probably need some kind of performance counter for that;
it's not going to be very generic.

Any more ideas?

Regards

-- Pantelis
  


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 11:35           ` Pantelis Antoniou
@ 2012-05-15 11:58             ` Peter Zijlstra
  2012-05-15 12:32               ` Pantelis Antoniou
  2012-05-19 14:58               ` Luming Yu
  2012-05-15 20:26             ` valdis.kletnieks
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 11:58 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Vincent Guittot, mou Chen, linux-kernel, Ingo Molnar, torvalds

On Tue, 2012-05-15 at 14:35 +0300, Pantelis Antoniou wrote:
> 
> Throughput: MIPS(?), bogo-mips(?), some kind of performance counter?

Throughput is too generic a term to put a unit on. For some people
it's txn/s, for others it's frames/s; neither is much (if at all)
related to MIPS (database transactions require lots of IO, video
encoding likes FPU/SIMD stuff, etc.).

> Latency: usecs(?)

nsec (chips are really really fast and only getting faster), but nsecs
of what :-) That is, which latency are we going to measure?

> Power: Now that's a tricky one, we can't measure power directly, it's a
> function of the cpu load we run in a period of time, along with any
> history of the cstates & pstates of that period. How can we collect
> information about that? Also we to take into account peripheral device
> power to that; GPUs are particularly power hungry. 

Intel provides some measure of CPU power drain on recent chips (IIRC),
but yeah, that doesn't include GPUs and other peripherals.

> Thermal management: How to distribute load to the processors in such
> a way that the temperature of the die doesn't increase too much that
> we have to either go to a lower OPP or shut down the core all-together.
> This is in direct conflict with throughput since we'd have better performance 
> if we could keep the same warmed-up cpu going.

Core-hopping.. yay! We have the whole sensors framework that provides
an interface to such hardware; the question is, do chips have enough
sensors spread on them to be useful?

> Memory I/O: Some workloads are memory bandwidth hungry but do not need
> much CPU power. In the case of asymmetric cores it would make sense to move
> the memory bandwidth hog to a lower performance CPU without any impact. 
> Probably need to use some kind of performance counter for that; not going
> to be very generic. 

You're assuming the slower cores have the same memory bandwidth, isn't
that a dangerous assumption?

Anyway, the 'problem' with using PMCs from within the scheduler is
that 1) they're ass-backwards slow on some chips (x86 anyone?), and 2)
some userspace gets 'upset' if it can't get at all of them.

So it has to be optional at best, and I hate knobs :-) Also, the more
information you're going to feed this load-balancer thing, the harder
all that becomes; you don't want to do the full nm! m-dimensional bin
fit.. :-)





* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15  8:02 ` Vincent Guittot
  2012-05-15  8:34   ` mou Chen
@ 2012-05-15 12:23   ` Peter Zijlstra
  2012-05-15 12:27     ` Peter Zijlstra
  2012-05-15 12:57     ` Vincent Guittot
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 12:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On Tue, 2012-05-15 at 10:02 +0200, Vincent Guittot wrote:
> 
> Would you like to present the ongoing work around the load balance
> policy and the replacement for sched_mc during the scheduler
> micro-conf ? 

Not sure there's much to say that isn't already said..

As it stands nobody cares (as evidenced by the total lack of progress
since the last time this all came up), so I've just queued the patch
below.


---
Subject: sched: Remove all power aware scheduling
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 09 Jan 2012 11:28:35 +0100

Its been broken forever and nobody cares enough to fix it proper..
remove it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   25 -
 Documentation/scheduler/sched-domains.txt          |    4 
 arch/x86/kernel/smpboot.c                          |    3 
 drivers/base/cpu.c                                 |    4 
 include/linux/cpu.h                                |    2 
 include/linux/sched.h                              |   47 ---
 include/linux/topology.h                           |    5 
 kernel/sched/core.c                                |   94 -------
 kernel/sched/fair.c                                |  278 ---------------------
 tools/power/cpupower/man/cpupower-set.1            |    9 
 tools/power/cpupower/utils/helpers/sysfs.c         |   35 --
 11 files changed, 4 insertions(+), 502 deletions(-)

--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -9,31 +9,6 @@ Contact:	Linux kernel mailing list <linu
 
 		/sys/devices/system/cpu/cpu#/
 
-What:		/sys/devices/system/cpu/sched_mc_power_savings
-		/sys/devices/system/cpu/sched_smt_power_savings
-Date:		June 2006
-Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
-Description:	Discover and adjust the kernel's multi-core scheduler support.
-
-		Possible values are:
-
-		0 - No power saving load balance (default value)
-		1 - Fill one thread/core/package first for long running threads
-		2 - Also bias task wakeups to semi-idle cpu package for power
-		    savings
-
-		sched_mc_power_savings is dependent upon SCHED_MC, which is
-		itself architecture dependent.
-
-		sched_smt_power_savings is dependent upon SCHED_SMT, which
-		is itself architecture dependent.
-
-		The two files are independent of each other. It is possible
-		that one file may be present without the other.
-
-		Introduced by git commit 5c45bf27.
-
-
 What:		/sys/devices/system/cpu/kernel_max
 		/sys/devices/system/cpu/offline
 		/sys/devices/system/cpu/online
--- a/Documentation/scheduler/sched-domains.txt
+++ b/Documentation/scheduler/sched-domains.txt
@@ -61,10 +61,6 @@ might have just one domain covering its
 struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
 the specifics and what to tune.
 
-For SMT, the architecture must define CONFIG_SCHED_SMT and provide a
-cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of
-all "i"'s siblings as well as "i" itself.
-
 Architectures may retain the regular override the default SD_*_INIT flags
 while using the generic domain builder in kernel/sched.c if they wish to
 retain the traditional SMT->SMP->NUMA topology (or some subset of that). This
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -413,8 +413,7 @@ const struct cpumask *cpu_coregroup_mask
 	 * For perf, we return last level cache shared map.
 	 * And for power savings, we return cpu_core_map
 	 */
-	if ((sched_mc_power_savings || sched_smt_power_savings) &&
-	    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
+	if (!(cpu_has(c, X86_FEATURE_AMD_DCM)))
 		return cpu_core_mask(cpu);
 	else
 		return cpu_llc_shared_mask(cpu);
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -330,8 +330,4 @@ void __init cpu_dev_init(void)
 		panic("Failed to register CPU subsystem");
 
 	cpu_dev_register_generic();
-
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
-	sched_create_sysfs_power_savings_entries(cpu_subsys.dev_root);
-#endif
 }
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -36,8 +36,6 @@ extern void cpu_remove_dev_attr(struct d
 extern int cpu_add_dev_attr_group(struct attribute_group *attrs);
 extern void cpu_remove_dev_attr_group(struct attribute_group *attrs);
 
-extern int sched_create_sysfs_power_savings_entries(struct device *dev);
-
 #ifdef CONFIG_HOTPLUG_CPU
 extern void unregister_cpu(struct cpu *cpu);
 extern ssize_t arch_cpu_probe(const char *, size_t);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -855,61 +855,14 @@ enum cpu_idle_type {
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_PREFER_LOCAL		0x0040  /* Prefer to keep tasks local to this domain */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
-#define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
-enum powersavings_balance_level {
-	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
-	POWERSAVINGS_BALANCE_BASIC,	/* Fill one thread/core/package
-					 * first for long running threads
-					 */
-	POWERSAVINGS_BALANCE_WAKEUP,	/* Also bias task wakeups to semi-idle
-					 * cpu package for power savings
-					 */
-	MAX_POWERSAVINGS_BALANCE_LEVELS
-};
-
-extern int sched_mc_power_savings, sched_smt_power_savings;
-
-static inline int sd_balance_for_mc_power(void)
-{
-	if (sched_smt_power_savings)
-		return SD_POWERSAVINGS_BALANCE;
-
-	if (!sched_mc_power_savings)
-		return SD_PREFER_SIBLING;
-
-	return 0;
-}
-
-static inline int sd_balance_for_package_power(void)
-{
-	if (sched_mc_power_savings | sched_smt_power_savings)
-		return SD_POWERSAVINGS_BALANCE;
-
-	return SD_PREFER_SIBLING;
-}
-
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
-/*
- * Optimise SD flags for power savings:
- * SD_BALANCE_NEWIDLE helps aggressive task consolidation and power savings.
- * Keep default SD flags if sched_{smt,mc}_power_saving=0
- */
-
-static inline int sd_power_saving_flags(void)
-{
-	if (sched_mc_power_savings | sched_smt_power_savings)
-		return SD_BALANCE_NEWIDLE;
-
-	return 0;
-}
-
 struct sched_group_power {
 	atomic_t ref;
 	/*
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -98,7 +98,6 @@ int arch_update_cpu_topology(void);
 				| 0*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
@@ -134,8 +133,6 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
-				| sd_balance_for_mc_power()		\
-				| sd_power_saving_flags()		\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
@@ -167,8 +164,6 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
-				| sd_balance_for_package_power()	\
-				| sd_power_saving_flags()		\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5920,8 +5920,6 @@ static const struct cpumask *cpu_cpu_mas
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
-int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
-
 struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
@@ -6313,7 +6311,6 @@ sd_numa_init(struct sched_domain_topolog
 					| 0*SD_WAKE_AFFINE
 					| 0*SD_PREFER_LOCAL
 					| 0*SD_SHARE_CPUPOWER
-					| 0*SD_POWERSAVINGS_BALANCE
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
@@ -6810,97 +6807,6 @@ void partition_sched_domains(int ndoms_n
 	mutex_unlock(&sched_domains_mutex);
 }
 
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
-static void reinit_sched_domains(void)
-{
-	get_online_cpus();
-
-	/* Destroy domains first to force the rebuild */
-	partition_sched_domains(0, NULL, NULL);
-
-	rebuild_sched_domains();
-	put_online_cpus();
-}
-
-static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
-{
-	unsigned int level = 0;
-
-	if (sscanf(buf, "%u", &level) != 1)
-		return -EINVAL;
-
-	/*
-	 * level is always be positive so don't check for
-	 * level < POWERSAVINGS_BALANCE_NONE which is 0
-	 * What happens on 0 or 1 byte write,
-	 * need to check for count as well?
-	 */
-
-	if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
-		return -EINVAL;
-
-	if (smt)
-		sched_smt_power_savings = level;
-	else
-		sched_mc_power_savings = level;
-
-	reinit_sched_domains();
-
-	return count;
-}
-
-#ifdef CONFIG_SCHED_MC
-static ssize_t sched_mc_power_savings_show(struct device *dev,
-					   struct device_attribute *attr,
-					   char *buf)
-{
-	return sprintf(buf, "%u\n", sched_mc_power_savings);
-}
-static ssize_t sched_mc_power_savings_store(struct device *dev,
-					    struct device_attribute *attr,
-					    const char *buf, size_t count)
-{
-	return sched_power_savings_store(buf, count, 0);
-}
-static DEVICE_ATTR(sched_mc_power_savings, 0644,
-		   sched_mc_power_savings_show,
-		   sched_mc_power_savings_store);
-#endif
-
-#ifdef CONFIG_SCHED_SMT
-static ssize_t sched_smt_power_savings_show(struct device *dev,
-					    struct device_attribute *attr,
-					    char *buf)
-{
-	return sprintf(buf, "%u\n", sched_smt_power_savings);
-}
-static ssize_t sched_smt_power_savings_store(struct device *dev,
-					    struct device_attribute *attr,
-					     const char *buf, size_t count)
-{
-	return sched_power_savings_store(buf, count, 1);
-}
-static DEVICE_ATTR(sched_smt_power_savings, 0644,
-		   sched_smt_power_savings_show,
-		   sched_smt_power_savings_store);
-#endif
-
-int __init sched_create_sysfs_power_savings_entries(struct device *dev)
-{
-	int err = 0;
-
-#ifdef CONFIG_SCHED_SMT
-	if (smt_capable())
-		err = device_create_file(dev, &dev_attr_sched_smt_power_savings);
-#endif
-#ifdef CONFIG_SCHED_MC
-	if (!err && mc_capable())
-		err = device_create_file(dev, &dev_attr_sched_mc_power_savings);
-#endif
-	return err;
-}
-#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-
 /*
  * Update cpusets according to cpu_active mask.  If cpusets are
  * disabled, cpuset_update_active_cpus() becomes a simple wrapper
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2721,7 +2721,7 @@ select_task_rq_fair(struct task_struct *
 		 * If power savings logic is enabled for a domain, see if we
 		 * are not overloaded, if so, don't balance wider.
 		 */
-		if (tmp->flags & (SD_POWERSAVINGS_BALANCE|SD_PREFER_LOCAL)) {
+		if (tmp->flags & (SD_PREFER_LOCAL)) {
 			unsigned long power = 0;
 			unsigned long nr_running = 0;
 			unsigned long capacity;
@@ -2734,9 +2734,6 @@ select_task_rq_fair(struct task_struct *
 
 			capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 
-			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
-				nr_running /= 2;
-
 			if (nr_running < capacity)
 				want_sd = 0;
 		}
@@ -3435,14 +3432,6 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
-	int power_savings_balance; /* Is powersave balance needed for this sd */
-	struct sched_group *group_min; /* Least loaded group in sd */
-	struct sched_group *group_leader; /* Group which relieves group_min */
-	unsigned long min_load_per_task; /* load_per_task in group_min */
-	unsigned long leader_nr_running; /* Nr running of group_leader */
-	unsigned long min_nr_running; /* Nr running of group_min */
-#endif
 };
 
 /*
@@ -3486,147 +3475,6 @@ static inline int get_sd_load_idx(struct
 	return load_idx;
 }
 
-
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
-/**
- * init_sd_power_savings_stats - Initialize power savings statistics for
- * the given sched_domain, during load balancing.
- *
- * @sd: Sched domain whose power-savings statistics are to be initialized.
- * @sds: Variable containing the statistics for sd.
- * @idle: Idle status of the CPU at which we're performing load-balancing.
- */
-static inline void init_sd_power_savings_stats(struct sched_domain *sd,
-	struct sd_lb_stats *sds, enum cpu_idle_type idle)
-{
-	/*
-	 * Busy processors will not participate in power savings
-	 * balance.
-	 */
-	if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
-		sds->power_savings_balance = 0;
-	else {
-		sds->power_savings_balance = 1;
-		sds->min_nr_running = ULONG_MAX;
-		sds->leader_nr_running = 0;
-	}
-}
-
-/**
- * update_sd_power_savings_stats - Update the power saving stats for a
- * sched_domain while performing load balancing.
- *
- * @group: sched_group belonging to the sched_domain under consideration.
- * @sds: Variable containing the statistics of the sched_domain
- * @local_group: Does group contain the CPU for which we're performing
- * 		load balancing ?
- * @sgs: Variable containing the statistics of the group.
- */
-static inline void update_sd_power_savings_stats(struct sched_group *group,
-	struct sd_lb_stats *sds, int local_group, struct sg_lb_stats *sgs)
-{
-
-	if (!sds->power_savings_balance)
-		return;
-
-	/*
-	 * If the local group is idle or completely loaded
-	 * no need to do power savings balance at this domain
-	 */
-	if (local_group && (sds->this_nr_running >= sgs->group_capacity ||
-				!sds->this_nr_running))
-		sds->power_savings_balance = 0;
-
-	/*
-	 * If a group is already running at full capacity or idle,
-	 * don't include that group in power savings calculations
-	 */
-	if (!sds->power_savings_balance ||
-		sgs->sum_nr_running >= sgs->group_capacity ||
-		!sgs->sum_nr_running)
-		return;
-
-	/*
-	 * Calculate the group which has the least non-idle load.
-	 * This is the group from where we need to pick up the load
-	 * for saving power
-	 */
-	if ((sgs->sum_nr_running < sds->min_nr_running) ||
-	    (sgs->sum_nr_running == sds->min_nr_running &&
-	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
-		sds->group_min = group;
-		sds->min_nr_running = sgs->sum_nr_running;
-		sds->min_load_per_task = sgs->sum_weighted_load /
-						sgs->sum_nr_running;
-	}
-
-	/*
-	 * Calculate the group which is almost near its
-	 * capacity but still has some space to pick up some load
-	 * from other group and save more power
-	 */
-	if (sgs->sum_nr_running + 1 > sgs->group_capacity)
-		return;
-
-	if (sgs->sum_nr_running > sds->leader_nr_running ||
-	    (sgs->sum_nr_running == sds->leader_nr_running &&
-	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
-		sds->group_leader = group;
-		sds->leader_nr_running = sgs->sum_nr_running;
-	}
-}
-
-/**
- * check_power_save_busiest_group - see if there is potential for some power-savings balance
- * @env: load balance environment
- * @sds: Variable containing the statistics of the sched_domain
- *	under consideration.
- *
- * Description:
- * Check if we have potential to perform some power-savings balance.
- * If yes, set the busiest group to be the least loaded group in the
- * sched_domain, so that it's CPUs can be put to idle.
- *
- * Returns 1 if there is potential to perform power-savings balance.
- * Else returns 0.
- */
-static inline
-int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	if (!sds->power_savings_balance)
-		return 0;
-
-	if (sds->this != sds->group_leader ||
-			sds->group_leader == sds->group_min)
-		return 0;
-
-	env->imbalance = sds->min_load_per_task;
-	sds->busiest = sds->group_min;
-
-	return 1;
-
-}
-#else /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-static inline void init_sd_power_savings_stats(struct sched_domain *sd,
-	struct sd_lb_stats *sds, enum cpu_idle_type idle)
-{
-	return;
-}
-
-static inline void update_sd_power_savings_stats(struct sched_group *group,
-	struct sd_lb_stats *sds, int local_group, struct sg_lb_stats *sgs)
-{
-	return;
-}
-
-static inline
-int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	return 0;
-}
-#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-
-
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -3932,7 +3780,6 @@ static inline void update_sd_lb_stats(st
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
-	init_sd_power_savings_stats(env->sd, sds, env->idle);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -3981,7 +3828,6 @@ static inline void update_sd_lb_stats(st
 			sds->group_imb = sgs.group_imb;
 		}
 
-		update_sd_power_savings_stats(sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4278,12 +4124,6 @@ find_busiest_group(struct lb_env *env, c
 	return sds.busiest;
 
 out_balanced:
-	/*
-	 * There is no obvious imbalance. But check if we can do some balancing
-	 * to save power.
-	 */
-	if (check_power_save_busiest_group(env, &sds))
-		return sds.busiest;
 ret:
 	env->imbalance = 0;
 	return NULL;
@@ -4361,28 +4201,6 @@ static int need_active_balance(struct lb
 		 */
 		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
 			return 1;
-
-		/*
-		 * The only task running in a non-idle cpu can be moved to this
-		 * cpu in an attempt to completely freeup the other CPU
-		 * package.
-		 *
-		 * The package power saving logic comes from
-		 * find_busiest_group(). If there are no imbalance, then
-		 * f_b_g() will return NULL. However when sched_mc={1,2} then
-		 * f_b_g() will select a group from which a running task may be
-		 * pulled to this cpu in order to make the other package idle.
-		 * If there is no opportunity to make a package idle and if
-		 * there are no imbalance, then f_b_g() will return NULL and no
-		 * action will be taken in load_balance_newidle().
-		 *
-		 * Under normal task pull operation due to imbalance, there
-		 * will be more than one task in the source run queue and
-		 * move_tasks() will succeed.  ld_moved will be true and this
-		 * active balance code will not be triggered.
-		 */
-		if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
-			return 0;
 	}
 
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
@@ -4704,104 +4522,10 @@ static struct {
 	unsigned long next_balance;     /* in jiffy units */
 } nohz ____cacheline_aligned;
 
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
-/**
- * lowest_flag_domain - Return lowest sched_domain containing flag.
- * @cpu:	The cpu whose lowest level of sched domain is to
- *		be returned.
- * @flag:	The flag to check for the lowest sched_domain
- *		for the given cpu.
- *
- * Returns the lowest sched_domain of a cpu which contains the given flag.
- */
-static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
-{
-	struct sched_domain *sd;
-
-	for_each_domain(cpu, sd)
-		if (sd->flags & flag)
-			break;
-
-	return sd;
-}
-
-/**
- * for_each_flag_domain - Iterates over sched_domains containing the flag.
- * @cpu:	The cpu whose domains we're iterating over.
- * @sd:		variable holding the value of the power_savings_sd
- *		for cpu.
- * @flag:	The flag to filter the sched_domains to be iterated.
- *
- * Iterates over all the scheduler domains for a given cpu that has the 'flag'
- * set, starting from the lowest sched_domain to the highest.
- */
-#define for_each_flag_domain(cpu, sd, flag) \
-	for (sd = lowest_flag_domain(cpu, flag); \
-		(sd && (sd->flags & flag)); sd = sd->parent)
-
-/**
- * find_new_ilb - Finds the optimum idle load balancer for nomination.
- * @cpu:	The cpu which is nominating a new idle_load_balancer.
- *
- * Returns:	Returns the id of the idle load balancer if it exists,
- *		Else, returns >= nr_cpu_ids.
- *
- * This algorithm picks the idle load balancer such that it belongs to a
- * semi-idle powersavings sched_domain. The idea is to try and avoid
- * completely idle packages/cores just for the purpose of idle load balancing
- * when there are other idle cpu's which are better suited for that job.
- */
-static int find_new_ilb(int cpu)
-{
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
-	struct sched_group *ilbg;
-	struct sched_domain *sd;
-
-	/*
-	 * Have idle load balancer selection from semi-idle packages only
-	 * when power-aware load balancing is enabled
-	 */
-	if (!(sched_smt_power_savings || sched_mc_power_savings))
-		goto out_done;
-
-	/*
-	 * Optimize for the case when we have no idle CPUs or only one
-	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
-	 */
-	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
-		goto out_done;
-
-	rcu_read_lock();
-	for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
-		ilbg = sd->groups;
-
-		do {
-			if (ilbg->group_weight !=
-				atomic_read(&ilbg->sgp->nr_busy_cpus)) {
-				ilb = cpumask_first_and(nohz.idle_cpus_mask,
-							sched_group_cpus(ilbg));
-				goto unlock;
-			}
-
-			ilbg = ilbg->next;
-
-		} while (ilbg != sd->groups);
-	}
-unlock:
-	rcu_read_unlock();
-
-out_done:
-	if (ilb < nr_cpu_ids && idle_cpu(ilb))
-		return ilb;
-
-	return nr_cpu_ids;
-}
-#else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
 static inline int find_new_ilb(int call_cpu)
 {
 	return nr_cpu_ids;
 }
-#endif
 
 /*
  * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
--- a/tools/power/cpupower/man/cpupower-set.1
+++ b/tools/power/cpupower/man/cpupower-set.1
@@ -85,15 +85,6 @@ Adjust the kernel's multi-core scheduler
 savings
 .RE
 
-sched_mc_power_savings is dependent upon SCHED_MC, which is
-itself architecture dependent.
-
-sched_smt_power_savings is dependent upon SCHED_SMT, which
-is itself architecture dependent.
-
-The two files are independent of each other. It is possible
-that one file may be present without the other.
-
 .SH "SEE ALSO"
 cpupower-info(1), cpupower-monitor(1), powertop(1)
 .PP
--- a/tools/power/cpupower/utils/helpers/sysfs.c
+++ b/tools/power/cpupower/utils/helpers/sysfs.c
@@ -362,22 +362,7 @@ char *sysfs_get_cpuidle_driver(void)
  */
 int sysfs_get_sched(const char *smt_mc)
 {
-	unsigned long value;
-	char linebuf[MAX_LINE_LEN];
-	char *endp;
-	char path[SYSFS_PATH_MAX];
-
-	if (strcmp("mc", smt_mc) && strcmp("smt", smt_mc))
-		return -EINVAL;
-
-	snprintf(path, sizeof(path),
-		PATH_TO_CPU "sched_%s_power_savings", smt_mc);
-	if (sysfs_read_file(path, linebuf, MAX_LINE_LEN) == 0)
-		return -1;
-	value = strtoul(linebuf, &endp, 0);
-	if (endp == linebuf || errno == ERANGE)
-		return -1;
-	return value;
+	return -ENODEV;
 }
 
 /*
@@ -388,21 +373,5 @@ int sysfs_get_sched(const char *smt_mc)
  */
 int sysfs_set_sched(const char *smt_mc, int val)
 {
-	char linebuf[MAX_LINE_LEN];
-	char path[SYSFS_PATH_MAX];
-	struct stat statbuf;
-
-	if (strcmp("mc", smt_mc) && strcmp("smt", smt_mc))
-		return -EINVAL;
-
-	snprintf(path, sizeof(path),
-		PATH_TO_CPU "sched_%s_power_savings", smt_mc);
-	sprintf(linebuf, "%d", val);
-
-	if (stat(path, &statbuf) != 0)
-		return -ENODEV;
-
-	if (sysfs_write_file(path, linebuf, MAX_LINE_LEN) == 0)
-		return -1;
-	return 0;
+	return -ENODEV;
 }




* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 12:23   ` Peter Zijlstra
@ 2012-05-15 12:27     ` Peter Zijlstra
  2012-05-15 12:57     ` Vincent Guittot
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 12:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: panto, smuckle, Juri Lelli, mingo, linaro-sched-sig, rostedt,
	tglx, geoff, efault, linux-kernel

On Tue, 2012-05-15 at 14:23 +0200, Peter Zijlstra wrote:
> -#else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
>  static inline int find_new_ilb(int call_cpu)
>  {
>         return nr_cpu_ids;
>  }
> -#endif 

That was missing a hunk...

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4524,6 +4524,11 @@ static struct {
 
 static inline int find_new_ilb(int call_cpu)
 {
+       int ilb = cpumask_first(nohz.idle_cpus_mask);
+
+       if (ilb < nr_cpu_ids && idle_cpu(ilb))
+               return ilb;
+
        return nr_cpu_ids;
 }




* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 11:58             ` Peter Zijlstra
@ 2012-05-15 12:32               ` Pantelis Antoniou
  2012-05-15 12:59                 ` Peter Zijlstra
  2012-05-19 14:58               ` Luming Yu
  1 sibling, 1 reply; 55+ messages in thread
From: Pantelis Antoniou @ 2012-05-15 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mou Chen, linux-kernel, Ingo Molnar, torvalds

Hi Peter,

On May 15, 2012, at 2:58 PM, Peter Zijlstra wrote:

> On Tue, 2012-05-15 at 14:35 +0300, Pantelis Antoniou wrote:
>> 
>> Throughput: MIPS(?), bogo-mips(?), some kind of performance counter?
> 
> Throughput is too generic a term to put a unit on. For some people its
> tnx/s for others its frames/s neither are much (if at all) related to
> MIPS (database tnx require lots of IO, video encoding likes FPU/SIMMD
> stuff etc..).
> 

I agree, throughput is a loaded term. However, something as simple as the
amount of time in which the CPU was not idle is easy enough to get.
To get something more detailed, like actual instruction counts
or FP/vector instruction counts, you're going to need hardware support.

One fine point would be to respect the OPP points, i.e. a time period
t where the CPU was at 50% of max freq should be roughly equivalent to
a period of t/2 where the CPU was running at 100%.
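That normalization can be sketched numerically: scale observed busy time by the ratio of the current frequency to the maximum. This is a toy model, not kernel code; the function name and units are invented here:

```c
#include <assert.h>

/* Scale observed busy time by cur_freq/max_freq, so that time t spent
 * at 50% of max frequency counts the same as t/2 spent at 100%.
 * busy_us is in microseconds, frequencies in kHz (hypothetical units). */
static unsigned long scale_busy_time(unsigned long busy_us,
				     unsigned long cur_khz,
				     unsigned long max_khz)
{
	return busy_us * cur_khz / max_khz;
}
```

With this, 100us of busy time at half the maximum frequency contributes the same load as 50us at full speed.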


>> Latency: usecs(?)
> 
> nsec (chips are really really fast and only getting faster), but nsecs
> of what :-) That is, which latency are we going to measure.
> 

At least we agree it's a time unit :) Measured from the point a task
became eligible for execution? Any other point?

>> Power: Now that's a tricky one. We can't measure power directly; it's a
>> function of the CPU load over a period of time, along with the history
>> of the C-states & P-states during that period. How can we collect
>> information about that? We also need to take peripheral device power
>> into account; GPUs are particularly power hungry. 
> 
> Intel provides some measure of CPU power drain on recent chips (iirc),
> but yeah that doesn't include GPUs and other peripherals iirc.
> 
>> Thermal management: How to distribute load to the processors in such
>> a way that the temperature of the die doesn't increase too much that
>> we have to either go to a lower OPP or shut down the core all-together.
>> This is in direct conflict with throughput since we'd have better performance 
>> if we could keep the same warmed-up cpu going.
> 
> Core-hopping.. yay! We have the whole sensors framework that provides an
> interface to such hardware, the question is, do chips have enough
> sensors spread on them to be useful?
> 

Well, not all of them do, but the ones that do are going to be pretty numerous
in the very near future :)

Combining this with the previous question, it is well known that CPUs physically
are nothing more than really efficient space heaters :) Could we use the 
readings from the sensor framework to come up with a correlation between
increased temperature X and power draw Y? If so, how?
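One way to start on that correlation, sketched offline: collect (temperature rise, power) sample pairs and fit a straight line with ordinary least squares. Everything below is hypothetical; real samples would have to come from the sensors framework plus some power-rail measurement:

```c
#include <assert.h>
#include <stddef.h>

/* Ordinary least-squares fit of y = a*x + b over n samples.
 * x: temperature rise over ambient, y: measured power draw.
 * Caller must ensure n >= 2 and that the x values are not all equal. */
static void fit_line(const double *x, const double *y, size_t n,
		     double *a, double *b)
{
	double sx = 0, sy = 0, sxx = 0, sxy = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		sx += x[i];
		sy += y[i];
		sxx += x[i] * x[i];
		sxy += x[i] * y[i];
	}
	*a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
	*b = (sy - *a * sx) / n;
}
```

The slope `a` is then a first-order estimate of watts per degree of temperature rise, which is the kind of correlation asked about above; whether a linear model is good enough per die is an open question.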

>> Memory I/O: Some workloads are memory bandwidth hungry but do not need
>> much CPU power. In the case of asymmetric cores it would make sense to move
>> the memory bandwidth hog to a lower performance CPU without any impact. 
>> Probably need to use some kind of performance counter for that; not going
>> to be very generic. 
> 
> You're assuming the slower cores have the same memory bandwidth, isn't
> that a dangerous assumption?
> 

Again, some class of hardware does provide the same bandwidth to the lower
performance cores. For some well-known cases (cough, ..roid), it is said
that it is a win.

> Anyway, so the 'problem' with using PMCs from within the scheduler is
> that, 1) they're ass backwards slow on some chips (x86 anyone?) 2) some
> userspace gets 'upset' if they can't get at all of them.
> 

We might not have to access them at every context switch; would we be able
to get some useful data if we sampled every few ms? Note, not all PMUs are
as slow as x86's.

Userspace could always get access to the kernel's collected data if needed,
or we could just disable PMU access when userspace tries to do the same.
It is a corner use case after all. 

> So it has to be optional at best, and I hate knobs :-) Also, the more
> information you're going to feed this load-balancer thing, the harder
> all that becomes, you don't want to do the full nm! m-dimensional bin
> fit.. :-)
> 
> 
> 

Is this a contest about who hates knobs more? I'm in :)

Well, I don't plan to feed the load-balancer all this crap. What I'm thinking
is to take those N metrics, form a vector according to some (yet unknown)
weighting factors, and schedule according to the vector's value and how it
'fits' into a virtual bin representing a core and its environment.
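A minimal sketch of that weighted-vector idea, to make the proposal concrete. All metric names, weights and the fit rule are invented for illustration; nothing here is the actual load balancer:

```c
#include <assert.h>
#include <stddef.h>

#define NMETRICS 3	/* e.g. cpu load, memory bandwidth, power cost */

/* Collapse a task's metric vector into one scalar via weighting factors. */
static double task_score(const double m[NMETRICS], const double w[NMETRICS])
{
	double s = 0;
	size_t i;

	for (i = 0; i < NMETRICS; i++)
		s += m[i] * w[i];
	return s;
}

/* Pick the core (bin) whose remaining capacity best absorbs the score;
 * a worst-fit rule keeps the most headroom. Returns -1 if nothing fits. */
static int best_core(const double *capacity, const double *used,
		     size_t ncores, double score)
{
	int best = -1;
	double best_room = 0;
	size_t i;

	for (i = 0; i < ncores; i++) {
		double room = capacity[i] - used[i];

		if (room >= score && room > best_room) {
			best = (int)i;
			best_room = room;
		}
	}
	return best;
}
```

This sidesteps the full m-dimensional bin packing Peter objects to by reducing each task to one number first; the open problem is of course choosing the weights.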


Regards

-- Pantelis



* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 12:23   ` Peter Zijlstra
  2012-05-15 12:27     ` Peter Zijlstra
@ 2012-05-15 12:57     ` Vincent Guittot
  2012-05-15 13:00       ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2012-05-15 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On 15 May 2012 14:23, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-05-15 at 10:02 +0200, Vincent Guittot wrote:
>>
>> Would you like to present the ongoing work around the load balance
>> policy and the replacement for sched_mc during the scheduler
>> micro-conf ?
>
> Not sure there's much to say that isn't already said..
>
> As it stands nobody cares (as evident by the total lack of progress
> since the last time this all came up), so I've just queued the below
> patch.

Not sure that nobody cares; it's more that the scheduler, load_balance
and sched_mc are sensitive enough that it's difficult to ensure a
modification will not break everything for someone else.

>
>
> ---
> Subject: sched: Remove all power aware scheduling
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon, 09 Jan 2012 11:28:35 +0100
>
> Its been broken forever and nobody cares enough to fix it proper..
> remove it.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu |   25 -
>  Documentation/scheduler/sched-domains.txt          |    4
>  arch/x86/kernel/smpboot.c                          |    3
>  drivers/base/cpu.c                                 |    4
>  include/linux/cpu.h                                |    2
>  include/linux/sched.h                              |   47 ---
>  include/linux/topology.h                           |    5
>  kernel/sched/core.c                                |   94 -------
>  kernel/sched/fair.c                                |  278 ---------------------
>  tools/power/cpupower/man/cpupower-set.1            |    9
>  tools/power/cpupower/utils/helpers/sysfs.c         |   35 --
>  11 files changed, 4 insertions(+), 502 deletions(-)
>
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -9,31 +9,6 @@ Contact:       Linux kernel mailing list <linu
>
>                /sys/devices/system/cpu/cpu#/
>
> -What:          /sys/devices/system/cpu/sched_mc_power_savings
> -               /sys/devices/system/cpu/sched_smt_power_savings
> -Date:          June 2006
> -Contact:       Linux kernel mailing list <linux-kernel@vger.kernel.org>
> -Description:   Discover and adjust the kernel's multi-core scheduler support.
> -
> -               Possible values are:
> -
> -               0 - No power saving load balance (default value)
> -               1 - Fill one thread/core/package first for long running threads
> -               2 - Also bias task wakeups to semi-idle cpu package for power
> -                   savings
> -
> -               sched_mc_power_savings is dependent upon SCHED_MC, which is
> -               itself architecture dependent.
> -
> -               sched_smt_power_savings is dependent upon SCHED_SMT, which
> -               is itself architecture dependent.
> -
> -               The two files are independent of each other. It is possible
> -               that one file may be present without the other.
> -
> -               Introduced by git commit 5c45bf27.
> -
> -
>  What:          /sys/devices/system/cpu/kernel_max
>                /sys/devices/system/cpu/offline
>                /sys/devices/system/cpu/online
> --- a/Documentation/scheduler/sched-domains.txt
> +++ b/Documentation/scheduler/sched-domains.txt
> @@ -61,10 +61,6 @@ might have just one domain covering its
>  struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
>  the specifics and what to tune.
>
> -For SMT, the architecture must define CONFIG_SCHED_SMT and provide a
> -cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of
> -all "i"'s siblings as well as "i" itself.
> -
>  Architectures may retain the regular override the default SD_*_INIT flags
>  while using the generic domain builder in kernel/sched.c if they wish to
>  retain the traditional SMT->SMP->NUMA topology (or some subset of that). This
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -413,8 +413,7 @@ const struct cpumask *cpu_coregroup_mask
>         * For perf, we return last level cache shared map.
>         * And for power savings, we return cpu_core_map
>         */
> -       if ((sched_mc_power_savings || sched_smt_power_savings) &&
> -           !(cpu_has(c, X86_FEATURE_AMD_DCM)))
> +       if (!(cpu_has(c, X86_FEATURE_AMD_DCM)))
>                return cpu_core_mask(cpu);
>        else
>                return cpu_llc_shared_mask(cpu);
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -330,8 +330,4 @@ void __init cpu_dev_init(void)
>                panic("Failed to register CPU subsystem");
>
>        cpu_dev_register_generic();
> -
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> -       sched_create_sysfs_power_savings_entries(cpu_subsys.dev_root);
> -#endif
>  }
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -36,8 +36,6 @@ extern void cpu_remove_dev_attr(struct d
>  extern int cpu_add_dev_attr_group(struct attribute_group *attrs);
>  extern void cpu_remove_dev_attr_group(struct attribute_group *attrs);
>
> -extern int sched_create_sysfs_power_savings_entries(struct device *dev);
> -
>  #ifdef CONFIG_HOTPLUG_CPU
>  extern void unregister_cpu(struct cpu *cpu);
>  extern ssize_t arch_cpu_probe(const char *, size_t);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -855,61 +855,14 @@ enum cpu_idle_type {
>  #define SD_WAKE_AFFINE         0x0020  /* Wake task to waking CPU */
>  #define SD_PREFER_LOCAL                0x0040  /* Prefer to keep tasks local to this domain */
>  #define SD_SHARE_CPUPOWER      0x0080  /* Domain members share cpu power */
> -#define SD_POWERSAVINGS_BALANCE        0x0100  /* Balance for power savings */
>  #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg resources */
>  #define SD_SERIALIZE           0x0400  /* Only a single load balancing instance */
>  #define SD_ASYM_PACKING                0x0800  /* Place busy groups earlier in the domain */
>  #define SD_PREFER_SIBLING      0x1000  /* Prefer to place tasks in a sibling domain */
>  #define SD_OVERLAP             0x2000  /* sched_domains of this level overlap */
>
> -enum powersavings_balance_level {
> -       POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
> -       POWERSAVINGS_BALANCE_BASIC,     /* Fill one thread/core/package
> -                                        * first for long running threads
> -                                        */
> -       POWERSAVINGS_BALANCE_WAKEUP,    /* Also bias task wakeups to semi-idle
> -                                        * cpu package for power savings
> -                                        */
> -       MAX_POWERSAVINGS_BALANCE_LEVELS
> -};
> -
> -extern int sched_mc_power_savings, sched_smt_power_savings;
> -
> -static inline int sd_balance_for_mc_power(void)
> -{
> -       if (sched_smt_power_savings)
> -               return SD_POWERSAVINGS_BALANCE;
> -
> -       if (!sched_mc_power_savings)
> -               return SD_PREFER_SIBLING;
> -
> -       return 0;
> -}
> -
> -static inline int sd_balance_for_package_power(void)
> -{
> -       if (sched_mc_power_savings | sched_smt_power_savings)
> -               return SD_POWERSAVINGS_BALANCE;
> -
> -       return SD_PREFER_SIBLING;
> -}
> -
>  extern int __weak arch_sd_sibiling_asym_packing(void);
>
> -/*
> - * Optimise SD flags for power savings:
> - * SD_BALANCE_NEWIDLE helps aggressive task consolidation and power savings.
> - * Keep default SD flags if sched_{smt,mc}_power_saving=0
> - */
> -
> -static inline int sd_power_saving_flags(void)
> -{
> -       if (sched_mc_power_savings | sched_smt_power_savings)
> -               return SD_BALANCE_NEWIDLE;
> -
> -       return 0;
> -}
> -
>  struct sched_group_power {
>        atomic_t ref;
>        /*
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -98,7 +98,6 @@ int arch_update_cpu_topology(void);
>                                | 0*SD_BALANCE_WAKE                     \
>                                | 1*SD_WAKE_AFFINE                      \
>                                | 1*SD_SHARE_CPUPOWER                   \
> -                               | 0*SD_POWERSAVINGS_BALANCE             \
>                                | 1*SD_SHARE_PKG_RESOURCES              \
>                                | 0*SD_SERIALIZE                        \
>                                | 0*SD_PREFER_SIBLING                   \
> @@ -134,8 +133,6 @@ int arch_update_cpu_topology(void);
>                                | 0*SD_SHARE_CPUPOWER                   \
>                                | 1*SD_SHARE_PKG_RESOURCES              \
>                                | 0*SD_SERIALIZE                        \
> -                               | sd_balance_for_mc_power()             \
> -                               | sd_power_saving_flags()               \
>                                ,                                       \
>        .last_balance           = jiffies,                              \
>        .balance_interval       = 1,                                    \
> @@ -167,8 +164,6 @@ int arch_update_cpu_topology(void);
>                                | 0*SD_SHARE_CPUPOWER                   \
>                                | 0*SD_SHARE_PKG_RESOURCES              \
>                                | 0*SD_SERIALIZE                        \
> -                               | sd_balance_for_package_power()        \
> -                               | sd_power_saving_flags()               \
>                                ,                                       \
>        .last_balance           = jiffies,                              \
>        .balance_interval       = 1,                                    \
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5920,8 +5920,6 @@ static const struct cpumask *cpu_cpu_mas
>        return cpumask_of_node(cpu_to_node(cpu));
>  }
>
> -int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
> -
>  struct sd_data {
>        struct sched_domain **__percpu sd;
>        struct sched_group **__percpu sg;
> @@ -6313,7 +6311,6 @@ sd_numa_init(struct sched_domain_topolog
>                                        | 0*SD_WAKE_AFFINE
>                                        | 0*SD_PREFER_LOCAL
>                                        | 0*SD_SHARE_CPUPOWER
> -                                       | 0*SD_POWERSAVINGS_BALANCE
>                                        | 0*SD_SHARE_PKG_RESOURCES
>                                        | 1*SD_SERIALIZE
>                                        | 0*SD_PREFER_SIBLING
> @@ -6810,97 +6807,6 @@ void partition_sched_domains(int ndoms_n
>        mutex_unlock(&sched_domains_mutex);
>  }
>
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> -static void reinit_sched_domains(void)
> -{
> -       get_online_cpus();
> -
> -       /* Destroy domains first to force the rebuild */
> -       partition_sched_domains(0, NULL, NULL);
> -
> -       rebuild_sched_domains();
> -       put_online_cpus();
> -}
> -
> -static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> -{
> -       unsigned int level = 0;
> -
> -       if (sscanf(buf, "%u", &level) != 1)
> -               return -EINVAL;
> -
> -       /*
> -        * level is always be positive so don't check for
> -        * level < POWERSAVINGS_BALANCE_NONE which is 0
> -        * What happens on 0 or 1 byte write,
> -        * need to check for count as well?
> -        */
> -
> -       if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
> -               return -EINVAL;
> -
> -       if (smt)
> -               sched_smt_power_savings = level;
> -       else
> -               sched_mc_power_savings = level;
> -
> -       reinit_sched_domains();
> -
> -       return count;
> -}
> -
> -#ifdef CONFIG_SCHED_MC
> -static ssize_t sched_mc_power_savings_show(struct device *dev,
> -                                          struct device_attribute *attr,
> -                                          char *buf)
> -{
> -       return sprintf(buf, "%u\n", sched_mc_power_savings);
> -}
> -static ssize_t sched_mc_power_savings_store(struct device *dev,
> -                                           struct device_attribute *attr,
> -                                           const char *buf, size_t count)
> -{
> -       return sched_power_savings_store(buf, count, 0);
> -}
> -static DEVICE_ATTR(sched_mc_power_savings, 0644,
> -                  sched_mc_power_savings_show,
> -                  sched_mc_power_savings_store);
> -#endif
> -
> -#ifdef CONFIG_SCHED_SMT
> -static ssize_t sched_smt_power_savings_show(struct device *dev,
> -                                           struct device_attribute *attr,
> -                                           char *buf)
> -{
> -       return sprintf(buf, "%u\n", sched_smt_power_savings);
> -}
> -static ssize_t sched_smt_power_savings_store(struct device *dev,
> -                                           struct device_attribute *attr,
> -                                            const char *buf, size_t count)
> -{
> -       return sched_power_savings_store(buf, count, 1);
> -}
> -static DEVICE_ATTR(sched_smt_power_savings, 0644,
> -                  sched_smt_power_savings_show,
> -                  sched_smt_power_savings_store);
> -#endif
> -
> -int __init sched_create_sysfs_power_savings_entries(struct device *dev)
> -{
> -       int err = 0;
> -
> -#ifdef CONFIG_SCHED_SMT
> -       if (smt_capable())
> -               err = device_create_file(dev, &dev_attr_sched_smt_power_savings);
> -#endif
> -#ifdef CONFIG_SCHED_MC
> -       if (!err && mc_capable())
> -               err = device_create_file(dev, &dev_attr_sched_mc_power_savings);
> -#endif
> -       return err;
> -}
> -#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> -
>  /*
>  * Update cpusets according to cpu_active mask.  If cpusets are
>  * disabled, cpuset_update_active_cpus() becomes a simple wrapper
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2721,7 +2721,7 @@ select_task_rq_fair(struct task_struct *
>                 * If power savings logic is enabled for a domain, see if we
>                 * are not overloaded, if so, don't balance wider.
>                 */
> -               if (tmp->flags & (SD_POWERSAVINGS_BALANCE|SD_PREFER_LOCAL)) {
> +               if (tmp->flags & (SD_PREFER_LOCAL)) {
>                        unsigned long power = 0;
>                        unsigned long nr_running = 0;
>                        unsigned long capacity;
> @@ -2734,9 +2734,6 @@ select_task_rq_fair(struct task_struct *
>
>                        capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
>
> -                       if (tmp->flags & SD_POWERSAVINGS_BALANCE)
> -                               nr_running /= 2;
> -
>                        if (nr_running < capacity)
>                                want_sd = 0;
>                }
> @@ -3435,14 +3432,6 @@ struct sd_lb_stats {
>        unsigned int  busiest_group_weight;
>
>        int group_imb; /* Is there imbalance in this sd */
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> -       int power_savings_balance; /* Is powersave balance needed for this sd */
> -       struct sched_group *group_min; /* Least loaded group in sd */
> -       struct sched_group *group_leader; /* Group which relieves group_min */
> -       unsigned long min_load_per_task; /* load_per_task in group_min */
> -       unsigned long leader_nr_running; /* Nr running of group_leader */
> -       unsigned long min_nr_running; /* Nr running of group_min */
> -#endif
>  };
>
>  /*
> @@ -3486,147 +3475,6 @@ static inline int get_sd_load_idx(struct
>        return load_idx;
>  }
>
> -
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> -/**
> - * init_sd_power_savings_stats - Initialize power savings statistics for
> - * the given sched_domain, during load balancing.
> - *
> - * @sd: Sched domain whose power-savings statistics are to be initialized.
> - * @sds: Variable containing the statistics for sd.
> - * @idle: Idle status of the CPU at which we're performing load-balancing.
> - */
> -static inline void init_sd_power_savings_stats(struct sched_domain *sd,
> -       struct sd_lb_stats *sds, enum cpu_idle_type idle)
> -{
> -       /*
> -        * Busy processors will not participate in power savings
> -        * balance.
> -        */
> -       if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
> -               sds->power_savings_balance = 0;
> -       else {
> -               sds->power_savings_balance = 1;
> -               sds->min_nr_running = ULONG_MAX;
> -               sds->leader_nr_running = 0;
> -       }
> -}
> -
> -/**
> - * update_sd_power_savings_stats - Update the power saving stats for a
> - * sched_domain while performing load balancing.
> - *
> - * @group: sched_group belonging to the sched_domain under consideration.
> - * @sds: Variable containing the statistics of the sched_domain
> - * @local_group: Does group contain the CPU for which we're performing
> - *             load balancing ?
> - * @sgs: Variable containing the statistics of the group.
> - */
> -static inline void update_sd_power_savings_stats(struct sched_group *group,
> -       struct sd_lb_stats *sds, int local_group, struct sg_lb_stats *sgs)
> -{
> -
> -       if (!sds->power_savings_balance)
> -               return;
> -
> -       /*
> -        * If the local group is idle or completely loaded
> -        * no need to do power savings balance at this domain
> -        */
> -       if (local_group && (sds->this_nr_running >= sgs->group_capacity ||
> -                               !sds->this_nr_running))
> -               sds->power_savings_balance = 0;
> -
> -       /*
> -        * If a group is already running at full capacity or idle,
> -        * don't include that group in power savings calculations
> -        */
> -       if (!sds->power_savings_balance ||
> -               sgs->sum_nr_running >= sgs->group_capacity ||
> -               !sgs->sum_nr_running)
> -               return;
> -
> -       /*
> -        * Calculate the group which has the least non-idle load.
> -        * This is the group from where we need to pick up the load
> -        * for saving power
> -        */
> -       if ((sgs->sum_nr_running < sds->min_nr_running) ||
> -           (sgs->sum_nr_running == sds->min_nr_running &&
> -            group_first_cpu(group) > group_first_cpu(sds->group_min))) {
> -               sds->group_min = group;
> -               sds->min_nr_running = sgs->sum_nr_running;
> -               sds->min_load_per_task = sgs->sum_weighted_load /
> -                                               sgs->sum_nr_running;
> -       }
> -
> -       /*
> -        * Calculate the group which is almost near its
> -        * capacity but still has some space to pick up some load
> -        * from other group and save more power
> -        */
> -       if (sgs->sum_nr_running + 1 > sgs->group_capacity)
> -               return;
> -
> -       if (sgs->sum_nr_running > sds->leader_nr_running ||
> -           (sgs->sum_nr_running == sds->leader_nr_running &&
> -            group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
> -               sds->group_leader = group;
> -               sds->leader_nr_running = sgs->sum_nr_running;
> -       }
> -}
> -
> -/**
> - * check_power_save_busiest_group - see if there is potential for some power-savings balance
> - * @env: load balance environment
> - * @sds: Variable containing the statistics of the sched_domain
> - *     under consideration.
> - *
> - * Description:
> - * Check if we have potential to perform some power-savings balance.
> - * If yes, set the busiest group to be the least loaded group in the
> - * sched_domain, so that it's CPUs can be put to idle.
> - *
> - * Returns 1 if there is potential to perform power-savings balance.
> - * Else returns 0.
> - */
> -static inline
> -int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
> -{
> -       if (!sds->power_savings_balance)
> -               return 0;
> -
> -       if (sds->this != sds->group_leader ||
> -                       sds->group_leader == sds->group_min)
> -               return 0;
> -
> -       env->imbalance = sds->min_load_per_task;
> -       sds->busiest = sds->group_min;
> -
> -       return 1;
> -
> -}
> -#else /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> -static inline void init_sd_power_savings_stats(struct sched_domain *sd,
> -       struct sd_lb_stats *sds, enum cpu_idle_type idle)
> -{
> -       return;
> -}
> -
> -static inline void update_sd_power_savings_stats(struct sched_group *group,
> -       struct sd_lb_stats *sds, int local_group, struct sg_lb_stats *sgs)
> -{
> -       return;
> -}
> -
> -static inline
> -int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
> -{
> -       return 0;
> -}
> -#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> -
> -
>  unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
>  {
>        return SCHED_POWER_SCALE;
> @@ -3932,7 +3780,6 @@ static inline void update_sd_lb_stats(st
>        if (child && child->flags & SD_PREFER_SIBLING)
>                prefer_sibling = 1;
>
> -       init_sd_power_savings_stats(env->sd, sds, env->idle);
>        load_idx = get_sd_load_idx(env->sd, env->idle);
>
>        do {
> @@ -3981,7 +3828,6 @@ static inline void update_sd_lb_stats(st
>                        sds->group_imb = sgs.group_imb;
>                }
>
> -               update_sd_power_savings_stats(sg, sds, local_group, &sgs);
>                sg = sg->next;
>        } while (sg != env->sd->groups);
>  }
> @@ -4278,12 +4124,6 @@ find_busiest_group(struct lb_env *env, c
>        return sds.busiest;
>
>  out_balanced:
> -       /*
> -        * There is no obvious imbalance. But check if we can do some balancing
> -        * to save power.
> -        */
> -       if (check_power_save_busiest_group(env, &sds))
> -               return sds.busiest;
>  ret:
>        env->imbalance = 0;
>        return NULL;
> @@ -4361,28 +4201,6 @@ static int need_active_balance(struct lb
>                 */
>                if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>                        return 1;
> -
> -               /*
> -                * The only task running in a non-idle cpu can be moved to this
> -                * cpu in an attempt to completely freeup the other CPU
> -                * package.
> -                *
> -                * The package power saving logic comes from
> -                * find_busiest_group(). If there are no imbalance, then
> -                * f_b_g() will return NULL. However when sched_mc={1,2} then
> -                * f_b_g() will select a group from which a running task may be
> -                * pulled to this cpu in order to make the other package idle.
> -                * If there is no opportunity to make a package idle and if
> -                * there are no imbalance, then f_b_g() will return NULL and no
> -                * action will be taken in load_balance_newidle().
> -                *
> -                * Under normal task pull operation due to imbalance, there
> -                * will be more than one task in the source run queue and
> -                * move_tasks() will succeed.  ld_moved will be true and this
> -                * active balance code will not be triggered.
> -                */
> -               if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
> -                       return 0;
>        }
>
>        return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
> @@ -4704,104 +4522,10 @@ static struct {
>        unsigned long next_balance;     /* in jiffy units */
>  } nohz ____cacheline_aligned;
>
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> -/**
> - * lowest_flag_domain - Return lowest sched_domain containing flag.
> - * @cpu:       The cpu whose lowest level of sched domain is to
> - *             be returned.
> - * @flag:      The flag to check for the lowest sched_domain
> - *             for the given cpu.
> - *
> - * Returns the lowest sched_domain of a cpu which contains the given flag.
> - */
> -static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> -{
> -       struct sched_domain *sd;
> -
> -       for_each_domain(cpu, sd)
> -               if (sd->flags & flag)
> -                       break;
> -
> -       return sd;
> -}
> -
> -/**
> - * for_each_flag_domain - Iterates over sched_domains containing the flag.
> - * @cpu:       The cpu whose domains we're iterating over.
> - * @sd:                variable holding the value of the power_savings_sd
> - *             for cpu.
> - * @flag:      The flag to filter the sched_domains to be iterated.
> - *
> - * Iterates over all the scheduler domains for a given cpu that has the 'flag'
> - * set, starting from the lowest sched_domain to the highest.
> - */
> -#define for_each_flag_domain(cpu, sd, flag) \
> -       for (sd = lowest_flag_domain(cpu, flag); \
> -               (sd && (sd->flags & flag)); sd = sd->parent)
> -
> -/**
> - * find_new_ilb - Finds the optimum idle load balancer for nomination.
> - * @cpu:       The cpu which is nominating a new idle_load_balancer.
> - *
> - * Returns:    Returns the id of the idle load balancer if it exists,
> - *             Else, returns >= nr_cpu_ids.
> - *
> - * This algorithm picks the idle load balancer such that it belongs to a
> - * semi-idle powersavings sched_domain. The idea is to try and avoid
> - * completely idle packages/cores just for the purpose of idle load balancing
> - * when there are other idle cpu's which are better suited for that job.
> - */
> -static int find_new_ilb(int cpu)
> -{
> -       int ilb = cpumask_first(nohz.idle_cpus_mask);
> -       struct sched_group *ilbg;
> -       struct sched_domain *sd;
> -
> -       /*
> -        * Have idle load balancer selection from semi-idle packages only
> -        * when power-aware load balancing is enabled
> -        */
> -       if (!(sched_smt_power_savings || sched_mc_power_savings))
> -               goto out_done;
> -
> -       /*
> -        * Optimize for the case when we have no idle CPUs or only one
> -        * idle CPU. Don't walk the sched_domain hierarchy in such cases
> -        */
> -       if (cpumask_weight(nohz.idle_cpus_mask) < 2)
> -               goto out_done;
> -
> -       rcu_read_lock();
> -       for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
> -               ilbg = sd->groups;
> -
> -               do {
> -                       if (ilbg->group_weight !=
> -                               atomic_read(&ilbg->sgp->nr_busy_cpus)) {
> -                               ilb = cpumask_first_and(nohz.idle_cpus_mask,
> -                                                       sched_group_cpus(ilbg));
> -                               goto unlock;
> -                       }
> -
> -                       ilbg = ilbg->next;
> -
> -               } while (ilbg != sd->groups);
> -       }
> -unlock:
> -       rcu_read_unlock();
> -
> -out_done:
> -       if (ilb < nr_cpu_ids && idle_cpu(ilb))
> -               return ilb;
> -
> -       return nr_cpu_ids;
> -}
> -#else /*  (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
>  static inline int find_new_ilb(int call_cpu)
>  {
>        return nr_cpu_ids;
>  }
> -#endif
>
>  /*
>  * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
> --- a/tools/power/cpupower/man/cpupower-set.1
> +++ b/tools/power/cpupower/man/cpupower-set.1
> @@ -85,15 +85,6 @@ Adjust the kernel's multi-core scheduler
>  savings
>  .RE
>
> -sched_mc_power_savings is dependent upon SCHED_MC, which is
> -itself architecture dependent.
> -
> -sched_smt_power_savings is dependent upon SCHED_SMT, which
> -is itself architecture dependent.
> -
> -The two files are independent of each other. It is possible
> -that one file may be present without the other.
> -
>  .SH "SEE ALSO"
>  cpupower-info(1), cpupower-monitor(1), powertop(1)
>  .PP
> --- a/tools/power/cpupower/utils/helpers/sysfs.c
> +++ b/tools/power/cpupower/utils/helpers/sysfs.c
> @@ -362,22 +362,7 @@ char *sysfs_get_cpuidle_driver(void)
>  */
>  int sysfs_get_sched(const char *smt_mc)
>  {
> -       unsigned long value;
> -       char linebuf[MAX_LINE_LEN];
> -       char *endp;
> -       char path[SYSFS_PATH_MAX];
> -
> -       if (strcmp("mc", smt_mc) && strcmp("smt", smt_mc))
> -               return -EINVAL;
> -
> -       snprintf(path, sizeof(path),
> -               PATH_TO_CPU "sched_%s_power_savings", smt_mc);
> -       if (sysfs_read_file(path, linebuf, MAX_LINE_LEN) == 0)
> -               return -1;
> -       value = strtoul(linebuf, &endp, 0);
> -       if (endp == linebuf || errno == ERANGE)
> -               return -1;
> -       return value;
> +       return -ENODEV;
>  }
>
>  /*
> @@ -388,21 +373,5 @@ int sysfs_get_sched(const char *smt_mc)
>  */
>  int sysfs_set_sched(const char *smt_mc, int val)
>  {
> -       char linebuf[MAX_LINE_LEN];
> -       char path[SYSFS_PATH_MAX];
> -       struct stat statbuf;
> -
> -       if (strcmp("mc", smt_mc) && strcmp("smt", smt_mc))
> -               return -EINVAL;
> -
> -       snprintf(path, sizeof(path),
> -               PATH_TO_CPU "sched_%s_power_savings", smt_mc);
> -       sprintf(linebuf, "%d", val);
> -
> -       if (stat(path, &statbuf) != 0)
> -               return -ENODEV;
> -
> -       if (sysfs_write_file(path, linebuf, MAX_LINE_LEN) == 0)
> -               return -1;
> -       return 0;
> +       return -ENODEV;
>  }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 12:32               ` Pantelis Antoniou
@ 2012-05-15 12:59                 ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 12:59 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Vincent Guittot, mou Chen, linux-kernel, Ingo Molnar, torvalds

On Tue, 2012-05-15 at 15:32 +0300, Pantelis Antoniou wrote:
> 
> Well, I don't plan to feed the load-balancer all this crap. What I'm thinking
> is take those N metrics, form a vector according to some (yet unknown) weighting 
> factors, and schedule according to the vector value and how it 'fits' into a
> virtual bin representing a core and its environment.

Currently we only have a bit-vector and a retry loop. I suspect you can
map the multi-value vector to a bit-vector if you can keep CPU-wide
statistics to compare against as well.

We want to avoid having to do a full sort.



* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 12:57     ` Vincent Guittot
@ 2012-05-15 13:00       ` Peter Zijlstra
  2012-05-15 15:05         ` Vincent Guittot
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 13:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> 
> It's not that nobody cares; it's more that the scheduler,
> load_balance and sched_mc are sensitive enough that it's difficult to
> ensure that a modification will not break everything for someone
> else.

Thing is, it's already broken, there's nothing else to break :-)



* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 13:00       ` Peter Zijlstra
@ 2012-05-15 15:05         ` Vincent Guittot
  2012-05-15 15:19           ` Paul E. McKenney
                             ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Vincent Guittot @ 2012-05-15 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
>>
>> It's not that nobody cares; it's more that the scheduler,
>> load_balance and sched_mc are sensitive enough that it's difficult to
>> ensure that a modification will not break everything for someone
>> else.
>
> Thing is, it's already broken, there's nothing else to break :-)
>

sched_mc is the only power-aware knob in the current scheduler. It's
far from perfect, but it seems to work on some ARM platforms at
least. You mentioned at the scheduler mini-summit that we need a
cleaner replacement, and everybody agreed on that point. Is anybody
working on it yet? And can we discuss at Plumbers what this
replacement would look like?


* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:05         ` Vincent Guittot
@ 2012-05-15 15:19           ` Paul E. McKenney
  2012-05-15 15:27             ` Vincent Guittot
  2012-05-15 15:35           ` Peter Zijlstra
  2012-05-15 16:30           ` Vaidyanathan Srinivasan
  2 siblings, 1 reply; 55+ messages in thread
From: Paul E. McKenney @ 2012-05-15 15:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Tue, May 15, 2012 at 05:05:47PM +0200, Vincent Guittot wrote:
> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> >>
> >> It's not that nobody cares; it's more that the scheduler,
> >> load_balance and sched_mc are sensitive enough that it's difficult to
> >> ensure that a modification will not break everything for someone
> >> else.
> >
> > Thing is, it's already broken, there's nothing else to break :-)
> 
> sched_mc is the only power-aware knob in the current scheduler. It's
> far from perfect, but it seems to work on some ARM platforms at
> least. You mentioned at the scheduler mini-summit that we need a
> cleaner replacement, and everybody agreed on that point. Is anybody
> working on it yet? And can we discuss at Plumbers what this
> replacement would look like?

Hello, Vincent,

If I understand the patch from a first glance, Peter is doing what
everyone agreed to at the scheduler minisummit back in February.
From the notes from that meeting at:

https://wiki.linaro.org/WorkingGroups/PowerManagement/Doc/HMPscheduling

See the line reading "Remove sched_mc. [Peter Zijlstra]".

Or am I confused about what Peter's patch does?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:19           ` Paul E. McKenney
@ 2012-05-15 15:27             ` Vincent Guittot
  0 siblings, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2012-05-15 15:27 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On 15 May 2012 17:19, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Tue, May 15, 2012 at 05:05:47PM +0200, Vincent Guittot wrote:
>> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
>> >>
>> >> Not sure that nobody cares but it's much more that scheduler,
>> >> load_balance and sched_mc are sensible enough that it's difficult to
>> >> ensure that a modification will not break everything for someone
>> >> else.
>> >
>> > Thing is, its already broken, there's nothing else to break :-)
>>
>> sched_mc is the only power-aware knob in the current scheduler. It's
>> far from being perfect but it seems to work on some ARM platform at
>> least. You mentioned at the scheduler mini-summit that we need a
>> cleaner replacement and everybody has agreed on that point. Is anybody
>> working on it yet ? and can we discuss at Plumber's what this
>> replacement would look like ?
>
> Hello, Vincent,
>
> If I understand the patch from a first glance, Peter is doing what
> everyone agreed to at the scheduler minisummit back in February.

Yes, that's why I propose discussing the replacement at the next
Plumbers conference.

> From the notes from that meeting at:
>
> https://wiki.linaro.org/WorkingGroups/PowerManagement/Doc/HMPscheduling
>
> See the line reading "Remove sched_mc. [Peter Zijlstra]".
>
> Or am I confused about what Peter's patch does?
>
>                                                        Thanx, Paul
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:05         ` Vincent Guittot
  2012-05-15 15:19           ` Paul E. McKenney
@ 2012-05-15 15:35           ` Peter Zijlstra
  2012-05-15 15:45             ` Peter Zijlstra
                               ` (4 more replies)
  2012-05-15 16:30           ` Vaidyanathan Srinivasan
  2 siblings, 5 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 15:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> >>
> >> Not sure that nobody cares but it's much more that scheduler,
> >> load_balance and sched_mc are sensible enough that it's difficult to
> >> ensure that a modification will not break everything for someone
> >> else.
> >
> > Thing is, its already broken, there's nothing else to break :-)
> >
> 
> sched_mc is the only power-aware knob in the current scheduler. It's
> far from being perfect but it seems to work on some ARM platform at
> least. You mentioned at the scheduler mini-summit that we need a
> cleaner replacement and everybody has agreed on that point. Is anybody
> working on it yet ? 

Apparently not.. 

> and can we discuss at Plumber's what this replacement would look like ?

one knob: sched_balance_policy with tri-state {performance, power, auto}

Where auto should likely look at things like are we on battery and
co-ordinate with cpufreq muck or whatever.

Per domain knobs are insane, large multi-state knobs are insane, the
existing scheme is therefore insane^2. Can you find a sysad who'd like
to explore 3^3=27 states for optimal power/perf for his workload on a
simple 2 socket hyper-threaded machine and 3^4=81 state space for 8
sockets etc..?

As to the exact policy, I think the current 2 (load-balance + wakeup) is
the sensible one..

Also, I still have this pending email from you asking about the topology
setup stuff I really need to reply to.. but people keep sending me bug
reports :/

But really short, look at kernel/sched/core.c:default_topology[]

I'd like to get rid of sd_init_* in favour of a single function like
sd_numa_init(); this would mean all an arch needs to do is provide a
simple list of ever-increasing masks that match its topology.

To aid this we can add some SDTL_flags, initially I was thinking of:

 SDTL_SHARE_CORE	-- aka SMT
 SDTL_SHARE_CACHE	-- LLC cache domain (typically multi-core)
 SDTL_SHARE_MEMORY	-- NUMA-node (typically socket)

The 'performance' policy is typically to spread over shared resources so
as to minimize contention on these.

If you want to add some power we need some extra flags, maybe something
like:

 SDTL_SHARE_POWERLINE	-- power domain (typically socket)

so you know where the boundaries are at which you can turn stuff off,
and hence what/where to pack bits.

Possibly we also add something like:

 SDTL_PERF_SPREAD	-- spread on performance mode
 SDTL_POWER_PACK	-- pack on power mode

To over-ride the defaults. But ideally I'd leave those until after we've
got the basics working and there is a clear need for them (with a
spread/pack default for perf/power aware).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:35           ` Peter Zijlstra
@ 2012-05-15 15:45             ` Peter Zijlstra
  2012-05-16 18:30             ` Peter Zijlstra
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 15:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On Tue, 2012-05-15 at 17:35 +0200, Peter Zijlstra wrote:
> To aid this we can add some SDTL_flags, initially I was thinking of:
> 
>  SDTL_SHARE_CORE        -- aka SMT
>  SDTL_SHARE_CACHE       -- LLC cache domain (typically multi-core)
>  SDTL_SHARE_MEMORY      -- NUMA-node (typically socket)
> 
> The 'performance' policy is typically to spread over shared resources so
> as to minimize contention on these.
> 
> If you want to add some power we need some extra flags, maybe something
> like:
> 
>  SDTL_SHARE_POWERLINE   -- power domain (typically socket)
> 
> so you know where the boundaries are where you can turn stuff off so you
> know what/where to pack bits. 

Similarly if someone fancies doing the core-hopping muck and can get the
sensor data etc..

SDTL_SHARE_TEMPERATURE could indicate a cpumask that shares a temperature
sensor and needs hopping when the temperature rises above some threshold.

The assumption here is that all these masks are strongly hierarchical and
don't have weird overlaps (I'd like to meet the hardware architect who puts
SMT threads in different memory domains).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:05         ` Vincent Guittot
  2012-05-15 15:19           ` Paul E. McKenney
  2012-05-15 15:35           ` Peter Zijlstra
@ 2012-05-15 16:30           ` Vaidyanathan Srinivasan
  2012-05-15 18:13             ` Vincent Guittot
  2 siblings, 1 reply; 55+ messages in thread
From: Vaidyanathan Srinivasan @ 2012-05-15 16:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

* Vincent Guittot <vincent.guittot@linaro.org> [2012-05-15 17:05:47]:

> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> >>
> >> Not sure that nobody cares but it's much more that scheduler,
> >> load_balance and sched_mc are sensible enough that it's difficult to
> >> ensure that a modification will not break everything for someone
> >> else.
> >
> > Thing is, its already broken, there's nothing else to break :-)
> >
> 
> sched_mc is the only power-aware knob in the current scheduler. It's
> far from being perfect but it seems to work on some ARM platform at
> least. You mentioned at the scheduler mini-summit that we need a
> cleaner replacement and everybody has agreed on that point. Is anybody
> working on it yet ? and can we discuss at Plumber's what this
> replacement would look like ?

Hi Vincent,

In the earlier discussion we listed down the cleanup requirements.
I made a cleanup patch that unifies the sysfs interface as a first
step.

[RFC PATCH v1 0/2] sched: unified sched_powersavings tunables
http://thread.gmane.org/gmane.linux.kernel/1239750

I need to make this scheme generically work on different topology,
which is the real problem that we need to solve.

--Vaidy


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 16:30           ` Vaidyanathan Srinivasan
@ 2012-05-15 18:13             ` Vincent Guittot
  0 siblings, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2012-05-15 18:13 UTC (permalink / raw)
  To: svaidy
  Cc: Peter Zijlstra, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On 15 May 2012 18:30, Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> * Vincent Guittot <vincent.guittot@linaro.org> [2012-05-15 17:05:47]:
>
>> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
>> >>
>> >> Not sure that nobody cares but it's much more that scheduler,
>> >> load_balance and sched_mc are sensible enough that it's difficult to
>> >> ensure that a modification will not break everything for someone
>> >> else.
>> >
>> > Thing is, its already broken, there's nothing else to break :-)
>> >
>>
>> sched_mc is the only power-aware knob in the current scheduler. It's
>> far from being perfect but it seems to work on some ARM platform at
>> least. You mentioned at the scheduler mini-summit that we need a
>> cleaner replacement and everybody has agreed on that point. Is anybody
>> working on it yet ? and can we discuss at Plumber's what this
>> replacement would look like ?
>
> Hi Vincent,
>
> In the earlier discussion we listed down the cleanup requirements.
> I made a cleanup patch that unifies the sysfs interface as a first
> step.
>
> [RFC PATCH v1 0/2] sched: unified sched_powersavings tunables
> http://thread.gmane.org/gmane.linux.kernel/1239750
>
> I need to make this scheme generically work on different topology,
> which is the real problem that we need to solve.

Hi Vaidy

Your cleanup patch should not impact current ARM platforms. IIUC, you
have merged the sched_mc and sched_smt interfaces, and there is no ARM
platform with SMT yet.

Regards,
Vincent
>
> --Vaidy
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 11:35           ` Pantelis Antoniou
  2012-05-15 11:58             ` Peter Zijlstra
@ 2012-05-15 20:26             ` valdis.kletnieks
  2012-05-15 20:33               ` Peter Zijlstra
  2012-05-16 12:08               ` Pantelis Antoniou
  1 sibling, 2 replies; 55+ messages in thread
From: valdis.kletnieks @ 2012-05-15 20:26 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Peter Zijlstra, Vincent Guittot, mou Chen, linux-kernel,
	Ingo Molnar, torvalds


On Tue, 15 May 2012 14:35:53 +0300, Pantelis Antoniou said:

> Thermal management: How to distribute load to the processors in such
> a way that the temperature of the die doesn't increase too much that
> we have to either go to a lower OPP or shut down the core all-together.
> This is in direct conflict with throughput since we'd have better performance
> if we could keep the same warmed-up cpu going.

It's not just "temperature of the die".  When you have multiple aisles of 42U
racks full of servers, you often hit "must keep average total BTU load per
server below X" constraints.  There's plenty of colo's that are only using 40%
of their floor space due to cooling constraints (you may be able to get the
power company to pull another megawatt of copper into the building, but then
you need to find someplace to put another megawatt worth of cooling).



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 20:26             ` valdis.kletnieks
@ 2012-05-15 20:33               ` Peter Zijlstra
  2012-05-16 12:08               ` Pantelis Antoniou
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-15 20:33 UTC (permalink / raw)
  To: valdis.kletnieks
  Cc: Pantelis Antoniou, Vincent Guittot, mou Chen, linux-kernel,
	Ingo Molnar, torvalds

On Tue, 2012-05-15 at 16:26 -0400, valdis.kletnieks@vt.edu wrote:
> On Tue, 15 May 2012 14:35:53 +0300, Pantelis Antoniou said:
> 
> > Thermal management: How to distribute load to the processors in such
> > a way that the temperature of the die doesn't increase too much that
> > we have to either go to a lower OPP or shut down the core all-together.
> > This is in direct conflict with throughput since we'd have better performance
> > if we could keep the same warmed-up cpu going.
> 
> It's not just "temperature of the die".  When you have multiple aisles of 42U
> racks full of servers, you often hit "must keep average total BTU load per
> server below X" constraints.  There's plenty of colo's that are only using 40%
> of their floor space due to cooling constraints (you may be able to get the
> power company to pull another megawatt of copper into the building, but then
> you need to find someplace to put another megawatt worth of cooling).

Yeah, I know.. sadly ACPI-4+ is a complete and utter trainwreck. But in
case someone wants to fix this proper I'm willing to talk.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 20:26             ` valdis.kletnieks
  2012-05-15 20:33               ` Peter Zijlstra
@ 2012-05-16 12:08               ` Pantelis Antoniou
  1 sibling, 0 replies; 55+ messages in thread
From: Pantelis Antoniou @ 2012-05-16 12:08 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Peter Zijlstra, Vincent Guittot, mou Chen, linux-kernel,
	Ingo Molnar, torvalds


On May 15, 2012, at 11:26 PM, Valdis.Kletnieks@vt.edu wrote:

> On Tue, 15 May 2012 14:35:53 +0300, Pantelis Antoniou said:
> 
>> Thermal management: How to distribute load to the processors in such
>> a way that the temperature of the die doesn't increase too much that
>> we have to either go to a lower OPP or shut down the core all-together.
>> This is in direct conflict with throughput since we'd have better performance
>> if we could keep the same warmed-up cpu going.
> 
> It's not just "temperature of the die".  When you have multiple aisles of 42U
> racks full of servers, you often hit "must keep average total BTU load per
> server below X" constraints.  There's plenty of colo's that are only using 40%
> of their floor space due to cooling constraints (you may be able to get the
> power company to pull another megawatt of copper into the building, but then
> you need to find someplace to put another megawatt worth of cooling).
> 

Interesting,

Never thought about that... 

-- Pantelis


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:35           ` Peter Zijlstra
  2012-05-15 15:45             ` Peter Zijlstra
@ 2012-05-16 18:30             ` Peter Zijlstra
  2012-05-19 17:08               ` Linus Torvalds
  2012-05-16 18:49             ` Vaidyanathan Srinivasan
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-16 18:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On Tue, 2012-05-15 at 17:35 +0200, Peter Zijlstra wrote:
> But really short, look at kernel/sched/core.c:default_topology[]

The reason I want to push this into the arch is that the current
one-size-fits-all topology really doesn't fit.

On x86 there's stuff like the Intel Core2 Quad which is two dual-core
dies glued together resulting in a socket with 2 cache domains. And the
current topology simply cannot represent this.

There's the s390 'book' domain, which is now littering generic code.

And AMD Magny-Cours which has two nodes on a socket.


But I want to very much limit the arch interface so that we don't get
the current mess of arch maintainers having to guess at things like
SD_INIT_BOOK and what all these various random numbers in there mean.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:35           ` Peter Zijlstra
  2012-05-15 15:45             ` Peter Zijlstra
  2012-05-16 18:30             ` Peter Zijlstra
@ 2012-05-16 18:49             ` Vaidyanathan Srinivasan
  2012-05-16 19:40               ` Peter Zijlstra
  2012-05-16 21:20             ` Vincent Guittot
       [not found]             ` <20120518161817.GE18312@e103034-lin.cambridge.arm.com>
  4 siblings, 1 reply; 55+ messages in thread
From: Vaidyanathan Srinivasan @ 2012-05-16 18:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

* Peter Zijlstra <peterz@infradead.org> [2012-05-15 17:35:41]:

> On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
> > On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
 
[snip]

> But really short, look at kernel/sched/core.c:default_topology[]
> 
> I'd like to get rid of sd_init_* into a single function like
> sd_numa_init(), this would mean all archs would need to do is provide a
> simple list of ever increasing masks that match their topology.

You are suggesting that the archs will provide sched/core with a list of
masks, one per sched domain level that we need to build.  The
SDTL_SHARE_XXX flags will also be passed per mask in order to decide the
SD flags for that domain.

> To aid this we can add some SDTL_flags, initially I was thinking of:
> 
>  SDTL_SHARE_CORE	-- aka SMT
>  SDTL_SHARE_CACHE	-- LLC cache domain (typically multi-core)
>  SDTL_SHARE_MEMORY	-- NUMA-node (typically socket)
> 
> The 'performance' policy is typically to spread over shared resources so
> as to minimize contention on these.
> 
> If you want to add some power we need some extra flags, maybe something
> like:
> 
>  SDTL_SHARE_POWERLINE	-- power domain (typically socket)

Let me take the case of a two-socket, quad-core, HT x86 (Nehalem):

SDTL_SHARE_POWERLINE should be passed along with a cpumask that
represents sd_init_CPU or cpu_cpu_mask today.  So the number of
domains we build per-cpu will depend on the topology and the
sched_powersavings settings. 

--Vaidy


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-16 18:49             ` Vaidyanathan Srinivasan
@ 2012-05-16 19:40               ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-16 19:40 UTC (permalink / raw)
  To: svaidy
  Cc: Vincent Guittot, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Thu, 2012-05-17 at 00:19 +0530, Vaidyanathan Srinivasan wrote:

> Let me take a case of two-socket,quad-core,HT x86 (Nehalem):
> 
> SDTL_SHARE_POWERLINE should be passed along with a cpumask that
> represents sd_init_CPU or cpu_cpu_mask today.  So the number of
> domains we build per-cpu will depend on the topology and the
> sched_powersavings settings. 

No, the topology should at all times be independent of powersavings;
current x86's topology depending on that is one of the biggest warts
ever. Also, sched_powersavings doesn't actually exist anymore.

The NHM-EP from your example should do just two levels since mc and cpu
are identical; I guess we could add a pass that merges identical masks
so you can still specify 3 levels if you want.

The NUMA stuff is done automatically based on SLIT, so you don't need to
go above the socket level.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 15:35           ` Peter Zijlstra
                               ` (2 preceding siblings ...)
  2012-05-16 18:49             ` Vaidyanathan Srinivasan
@ 2012-05-16 21:20             ` Vincent Guittot
       [not found]             ` <20120518161817.GE18312@e103034-lin.cambridge.arm.com>
  4 siblings, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2012-05-16 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, smuckle, khilman, Robin.Randhawa, suresh.b.siddha,
	thebigcorporation, venki, panto, mingo, paul.brett, pdeschrijver,
	pjt, efault, fweisbec, geoff, rostedt, tglx, amit.kucheria,
	linux-kernel, linaro-sched-sig, Morten Rasmussen, Juri Lelli

On 15 May 2012 17:35, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
>> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
>> >>
>> >> Not sure that nobody cares but it's much more that scheduler,
>> >> load_balance and sched_mc are sensible enough that it's difficult to
>> >> ensure that a modification will not break everything for someone
>> >> else.
>> >
>> > Thing is, its already broken, there's nothing else to break :-)
>> >
>>
>> sched_mc is the only power-aware knob in the current scheduler. It's
>> far from being perfect but it seems to work on some ARM platform at
>> least. You mentioned at the scheduler mini-summit that we need a
>> cleaner replacement and everybody has agreed on that point. Is anybody
>> working on it yet ?
>
> Apparently not..
>
>> and can we discuss at Plumber's what this replacement would look like ?
>
> one knob: sched_balance_policy with tri-state {performance, power, auto}
>
> Where auto should likely look at things like are we on battery and
> co-ordinate with cpufreq muck or whatever.

IIUC, performance and power will be platform- and architecture-agnostic
and will only rely on a "simple" cpu topology description, while auto
mode would exchange information with frameworks like cpufreq, which can
provide some platform-specific information such as a clock rate
dependency.

>
> Per domain knobs are insane, large multi-state knobs are insane, the
> existing scheme is therefore insane^2. Can you find a sysad who'd like
> to explore 3^3=27 states for optimal power/perf for his workload on a
> simple 2 socket hyper-threaded machine and 3^4=81 state space for 8
> sockets etc..?
>
> As to the exact policy, I think the current 2 (load-balance + wakeup) is
> the sensible one..
>
> Also, I still have this pending email from you asking about the topology
> setup stuff I really need to reply to.. but people keep sending me bugs
> reports :/
>

I'm interested in getting feedback when you have time.

> But really short, look at kernel/sched/core.c:default_topology[]
>
> I'd like to get rid of sd_init_* into a single function like
> sd_numa_init(), this would mean all archs would need to do is provide a
> simple list of ever increasing masks that match their topology.
>
> To aid this we can add some SDTL_flags, initially I was thinking of:
>
>  SDTL_SHARE_CORE        -- aka SMT
>  SDTL_SHARE_CACHE       -- LLC cache domain (typically multi-core)
>  SDTL_SHARE_MEMORY      -- NUMA-node (typically socket)
>
> The 'performance' policy is typically to spread over shared resources so
> as to minimize contention on these.
>
> If you want to add some power we need some extra flags, maybe something
> like:
>
>  SDTL_SHARE_POWERLINE   -- power domain (typically socket)
>
> so you know where the boundaries are where you can turn stuff off so you
> know what/where to pack bits.

I'm not sure I see how this flag would be used compared to the others.
The first 3 SDTL_SHARE_XXX flags about topology are exclusive and
describe different levels of CPUs, but SDTL_SHARE_POWERLINE could be
used at each level to describe whether the CPUs in the sched_domain
share the power domain or not.

>
> Possibly we also add something like:
>
>  SDTL_PERF_SPREAD       -- spread on performance mode
>  SDTL_POWER_PACK        -- pack on power mode
>
> To over-ride the defaults. But ideally I'd leave those until after we've
> got the basics working and there is a clear need for them (with a
> spread/pack default for perf/power aware).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
       [not found]             ` <20120518161817.GE18312@e103034-lin.cambridge.arm.com>
@ 2012-05-18 16:24               ` Morten Rasmussen
  2012-05-18 16:39                 ` Peter Zijlstra
  2012-05-18 16:46                 ` Pantelis Antoniou
  0 siblings, 2 replies; 55+ messages in thread
From: Morten Rasmussen @ 2012-05-18 16:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: panto, smuckle, Juri Lelli, mingo, linaro-sched-sig, rostedt,
	tglx, geoff, efault, linux-kernel

On Fri, May 18, 2012 at 05:18:17PM +0100, Morten Rasmussen wrote:
> On Tue, May 15, 2012 at 04:35:41PM +0100, Peter Zijlstra wrote:
> > On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
> > > On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
> > > >>
> > > >> Not sure that nobody cares but it's much more that scheduler,
> > > >> load_balance and sched_mc are sensible enough that it's difficult to
> > > >> ensure that a modification will not break everything for someone
> > > >> else.
> > > >
> > > > Thing is, its already broken, there's nothing else to break :-)
> > > >
> > > 
> > > sched_mc is the only power-aware knob in the current scheduler. It's
> > > far from being perfect but it seems to work on some ARM platform at
> > > least. You mentioned at the scheduler mini-summit that we need a
> > > cleaner replacement and everybody has agreed on that point. Is anybody
> > > working on it yet ? 
> > 
> > Apparently not.. 
> > 
> > > and can we discuss at Plumber's what this replacement would look like ?
> > 
> > one knob: sched_balance_policy with tri-state {performance, power, auto}
> 
> Interesting. What would the power policy look like? Would performance
> and power be the two extremes of the power/performance trade-off? In
> that case I would assume that most embedded systems would be using auto.
> 
> > 
> > Where auto should likely look at things like are we on battery and
> > co-ordinate with cpufreq muck or whatever.
> > 
> > Per domain knobs are insane, large multi-state knobs are insane, the
> > existing scheme is therefore insane^2. Can you find a sysad who'd like
> > to explore 3^3=27 states for optimal power/perf for his workload on a
> > simple 2 socket hyper-threaded machine and 3^4=81 state space for 8
> > sockets etc..?
> > 
> > As to the exact policy, I think the current 2 (load-balance + wakeup) is
> > the sensible one..
> > 
> > Also, I still have this pending email from you asking about the topology
> > setup stuff I really need to reply to.. but people keep sending me bugs
> > reports :/
> > 
> > But really short, look at kernel/sched/core.c:default_topology[]
> > 
> > I'd like to get rid of sd_init_* into a single function like
> > sd_numa_init(), this would mean all archs would need to do is provide a
> > simple list of ever increasing masks that match their topology.
> > 
> > To aid this we can add some SDTL_flags, initially I was thinking of:
> > 
> >  SDTL_SHARE_CORE	-- aka SMT
> >  SDTL_SHARE_CACHE	-- LLC cache domain (typically multi-core)
> >  SDTL_SHARE_MEMORY	-- NUMA-node (typically socket)
> > 
> > The 'performance' policy is typically to spread over shared resources so
> > as to minimize contention on these.
> >
> 
> Would it be worth extending this architecture specification to contain
> more information like CPU_POWER for each core? After having experimented
> a bit with scheduling on big.LITTLE my experience is that more
> information about the platform is needed to make proper scheduling
> decisions. So if the topology definition is going to be more generic and
> be set up by the architecture it could be worth adding all the bits of
> information that the scheduler would need to that data structure.
> 
> With such data structure, the scheduler would only need one knob to
> adjust the power/performance trade-off. Any thoughts?
>  

One more thing. I have experimented with PJT's load-tracking patchset
and found it very useful for big.LITTLE scheduling. Are there any plans
for including it?

	Morten

> > If you want to add some power we need some extra flags, maybe something
> > like:
> > 
> >  SDTL_SHARE_POWERLINE	-- power domain (typically socket)
> > 
> > so you know where the boundaries are where you can turn stuff off so you
> > know what/where to pack bits.
> > 
> > Possibly we also add something like:
> > 
> >  SDTL_PERF_SPREAD	-- spread on performance mode
> >  SDTL_POWER_PACK	-- pack on power mode
> > 
> > To over-ride the defaults. But ideally I'd leave those until after we've
> > got the basics working and there is a clear need for them (with a
> > spread/pack default for perf/power aware).
> 
> In my experience power optimized scheduling is quite tricky, especially
> if you still want some level of performance. For heterogeneous
> architecture packing might not be the best solution. Some indication of
> the power/performance profile of each core could be useful.
> 
> Best regards,
> Morten
> 
> 
> _______________________________________________
> linaro-sched-sig mailing list
> linaro-sched-sig@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-sched-sig
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-18 16:24               ` Morten Rasmussen
@ 2012-05-18 16:39                 ` Peter Zijlstra
  2012-05-18 16:46                 ` Pantelis Antoniou
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-18 16:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: panto, smuckle, Juri Lelli, mingo, linaro-sched-sig, rostedt,
	tglx, geoff, efault, linux-kernel, Paul Turner

On Fri, 2012-05-18 at 17:24 +0100, Morten Rasmussen wrote:
> One more thing. I have experimented with PJT's load-tracking patchset
> and found it very useful for big.LITTLE scheduling. Are there any plans
> for including them? 

Yes.. as soon as PJT comes out of the deep dark caves of googleplex and
posts them again (provided they pass review etc.. but since they already
had a go I'm fairly confident).

Paul you reading?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-18 16:24               ` Morten Rasmussen
  2012-05-18 16:39                 ` Peter Zijlstra
@ 2012-05-18 16:46                 ` Pantelis Antoniou
  1 sibling, 0 replies; 55+ messages in thread
From: Pantelis Antoniou @ 2012-05-18 16:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, smuckle, Juri Lelli, mingo, linaro-sched-sig,
	rostedt, tglx, geoff, efault, linux-kernel


On May 18, 2012, at 7:24 PM, Morten Rasmussen wrote:

> On Fri, May 18, 2012 at 05:18:17PM +0100, Morten Rasmussen wrote:
>> On Tue, May 15, 2012 at 04:35:41PM +0100, Peter Zijlstra wrote:
>>> On Tue, 2012-05-15 at 17:05 +0200, Vincent Guittot wrote:
>>>> On 15 May 2012 15:00, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> On Tue, 2012-05-15 at 14:57 +0200, Vincent Guittot wrote:
>>>>>> 
>>>>>> I'm not sure that nobody cares; it's more that the scheduler,
>>>>>> load_balance and sched_mc are sensitive enough that it's difficult to
>>>>>> ensure that a modification will not break everything for someone
>>>>>> else.
>>>>> 
>>>>> Thing is, its already broken, there's nothing else to break :-)
>>>>> 
>>>> 
>>>> sched_mc is the only power-aware knob in the current scheduler. It's
>>>> far from being perfect but it seems to work on some ARM platform at
>>>> least. You mentioned at the scheduler mini-summit that we need a
>>>> cleaner replacement and everybody has agreed on that point. Is anybody
>>>> working on it yet? 
>>> 
>>> Apparently not.. 
>>> 
>>>> and can we discuss at Plumber's what this replacement would look like ?
>>> 
>>> one knob: sched_balance_policy with tri-state {performance, power, auto}
>> 
>> Interesting. What would the power policy look like? Would performance
>> and power be the two extremes of the power/performance trade-off? In
>> that case I would assume that most embedded systems would be using auto.
>> 
>>> 
>>> Where auto should likely look at things like are we on battery and
>>> co-ordinate with cpufreq muck or whatever.
>>> 
>>> Per domain knobs are insane, large multi-state knobs are insane, the
>>> existing scheme is therefore insane^2. Can you find a sysad who'd like
>>> to explore 3^3=27 states for optimal power/perf for his workload on a
>>> simple 2 socket hyper-threaded machine and 3^4=81 state space for 8
>>> sockets etc..?
>>> 
>>> As to the exact policy, I think the current 2 (load-balance + wakeup) is
>>> the sensible one..
>>> 
>>> Also, I still have this pending email from you asking about the topology
>>> setup stuff I really need to reply to.. but people keep sending me bug
>>> reports :/
>>> 
>>> But really short, look at kernel/sched/core.c:default_topology[]
>>> 
>>> I'd like to get rid of sd_init_* into a single function like
>>> sd_numa_init(), this would mean all archs would need to do is provide a
>>> simple list of ever increasing masks that match their topology.
>>> 
>>> To aid this we can add some SDTL_flags, initially I was thinking of:
>>> 
>>> SDTL_SHARE_CORE	-- aka SMT
>>> SDTL_SHARE_CACHE	-- LLC cache domain (typically multi-core)
>>> SDTL_SHARE_MEMORY	-- NUMA-node (typically socket)
>>> 
>>> The 'performance' policy is typically to spread over shared resources so
>>> as to minimize contention on these.
>>> 
>> 
>> Would it be worth extending this architecture specification to contain
>> more information like CPU_POWER for each core? After having experimented
>> a bit with scheduling on big.LITTLE my experience is that more
>> information about the platform is needed to make proper scheduling
>> decisions. So if the topology definition is going to be more generic and
>> be set up by the architecture it could be worth adding all the bits of
>> information that the scheduler would need to that data structure.
>> 
>> With such data structure, the scheduler would only need one knob to
>> adjust the power/performance trade-off. Any thoughts?
>> 
> 
> One more thing. I have experimented with PJT's load-tracking patchset
> and found it very useful for big.LITTLE scheduling. Are there any plans
> for including them?
> 
> 	Morten
> 

One more vote for speedy integration of PJT's patches. They are working fine
as far as I can tell, and they are absolutely needed for the power aware
scheduler work.

-- Pantelis

>>> If you want to add some power we need some extra flags, maybe something
>>> like:
>>> 
>>> SDTL_SHARE_POWERLINE	-- power domain (typically socket)
>>> 
>>> so you know where the boundaries are where you can turn stuff off so you
>>> know what/where to pack bits.
>>> 
>>> Possibly we also add something like:
>>> 
>>> SDTL_PERF_SPREAD	-- spread on performance mode
>>> SDTL_POWER_PACK	-- pack on power mode
>>> 
>>> To over-ride the defaults. But ideally I'd leave those until after we've
>>> got the basics working and there is a clear need for them (with a
>>> spread/pack default for perf/power aware).
>> 
>> In my experience power optimized scheduling is quite tricky, especially
>> if you still want some level of performance. For heterogeneous
>> architecture packing might not be the best solution. Some indication of
>> the power/performance profile of each core could be useful.
>> 
>> Best regards,
>> Morten
>> 
>> 
>> _______________________________________________
>> linaro-sched-sig mailing list
>> linaro-sched-sig@lists.linaro.org
>> http://lists.linaro.org/mailman/listinfo/linaro-sched-sig
>> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-15 11:58             ` Peter Zijlstra
  2012-05-15 12:32               ` Pantelis Antoniou
@ 2012-05-19 14:58               ` Luming Yu
  1 sibling, 0 replies; 55+ messages in thread
From: Luming Yu @ 2012-05-19 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pantelis Antoniou, Vincent Guittot, mou Chen, linux-kernel,
	Ingo Molnar, torvalds

On Tue, May 15, 2012 at 7:58 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2012-05-15 at 14:35 +0300, Pantelis Antoniou wrote:

>> Power: Now that's a tricky one. We can't measure power directly; it's a
>> function of the cpu load we run over a period of time, along with the
>> history of the cstates & pstates in that period. How can we collect
>> information about that? We also need to take peripheral device power
>> into account; GPUs are particularly power-hungry.
>
> Intel provides some measure of CPU power drain on recent chips (iirc),
> but yeah that doesn't include GPUs and other peripherals iirc.
>
>> Thermal management: How to distribute load to the processors in such
>> a way that the die temperature doesn't rise so much that we have to
>> either go to a lower OPP or shut down the core altogether. This is in
>> direct conflict with throughput, since we'd have better performance if
>> we could keep the same warmed-up cpu going.
>
> Core-hopping.. yay! We have the whole sensors framework that provides an
> interface to such hardware, the question is, do chips have enough
> sensors spread on them to be useful?
>
>> Memory I/O: Some workloads are memory bandwidth hungry but do not need
>> much CPU power. In the case of asymmetric cores it would make sense to move
>> the memory bandwidth hog to a lower performance CPU without any impact.
>> Probably need to use some kind of performance counter for that; not going
>> to be very generic.
>
> You're assuming the slower cores have the same memory bandwidth, isn't
> that a dangerous assumption?
>
> Anyway, so the 'problem' with using PMCs from within the scheduler is
> that, 1) they're ass backwards slow on some chips (x86 anyone?) 2) some
> userspace gets 'upset' if they can't get at all of them.
>
> So it has to be optional at best, and I hate knobs :-) Also, the more
> information you're going to feed this load-balancer thing, the harder
> all that becomes, you don't want to do the full nm! m-dimensional bin
> fit.. :-)
>

Just curious: if load balance doesn't necessarily mean power/thermal
balance or memory balance, where should one hack to satisfy such needs
when they become critical? For example, people may want a fine-grained
usage plan to control how processors get used in day-to-day life, such
as idling half the processors for a few hours, while having the request
carried out in a way that creates minimal impact on the quality of
service provided by the system. That means: 1. we need to choose the
best cpus to idle; 2. soft-offlining cpus is not the right solution;
3. when really needed, they must come back into service as fast as
possible.

So my question is: can any existing scheduler help me do this?
Yes, I need knobs or interfaces :-) that could be used by another
driver, which could be thermal- or power-related. If there isn't one,
I would be very interested in helping to create one.
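One interface that exists today for this kind of CPU confinement is cgroup-v1 cpusets (cpuset.cpus, cpuset.mems, tasks). The sketch below only prints the commands (a dry run, so it touches nothing); the 'active' group name and the 0-3 CPU range are invented for the example:

```shell
# Dry-run sketch: confine all tasks to CPUs 0-1 via a cgroup-v1 cpuset,
# leaving CPUs 2-3 idle, then widen the mask again to restore them.
CPUSET=/sys/fs/cgroup/cpuset

plan_idle_half() {
    echo "mkdir -p $CPUSET/active"
    echo "echo 0-1 > $CPUSET/active/cpuset.cpus"   # confine to half the CPUs
    echo "echo 0 > $CPUSET/active/cpuset.mems"     # keep memory on node 0
    # move every task into the confined group
    echo "cat $CPUSET/tasks | xargs -n1 -I{} sh -c 'echo {} > $CPUSET/active/tasks'"
}

plan_restore() {
    echo "echo 0-3 > $CPUSET/active/cpuset.cpus"   # bring CPUs 2-3 back fast
}

plan_idle_half
plan_restore
```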

thanks.
/l

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-16 18:30             ` Peter Zijlstra
@ 2012-05-19 17:08               ` Linus Torvalds
  2012-05-19 22:55                 ` Peter Zijlstra
                                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-05-19 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Wed, May 16, 2012 at 11:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The reason I want to push this into the arch is that the current one
> size fits all topology really doesn't fit.

NAK NAK NAK.

Ingo, please don't take any of these patches if they are starting to
make NUMA scheduling be some arch-specific crap.

Peter - you're way off base. You are totally and utterly wrong for
several reasons:

 - x86 isn't that special. In fact, x86 covers almost all the possible
NUMA architectures just within x86, so making some arch-specific NUMA
crap is *idiotic*.

  Your argument that SMT, multi-core with NUMA effects within the
core, yadda yadda is somehow x86-specific is pure crap. Everybody else
already does that or will do that. More importantly, most of the other
architectures don't even *matter* enough for them to ever write their
own NUMA scheduler.

 - NUMA topology details aren't even that important!

   Christ, people. You're never going to be perfect anyway. And
outside of some benchmarks, and some trivial loads, the loads aren't
going to be well-behaved to really let you even try.

  You don't even know what the right scheduling policy should be. We
already know that even the *existing* power-aware scheduling is pure
crap and nobody believes it actually works.

Trying to redesign something from scratch when you don't even
understand it, AND THEN MAKING IT ARCH-SPECIFIC, is just f*cking
moronic. The *only* thing that will ever result in is some crap code
that handles the one or two machines you tested it on right, for the
one or two loads you tested it with.

If you cannot make it simple enough, and generic enough, that it works
reasonably well for POWER and s390 (and ARM), don't even bother.

Seriously.

If you hate the s390 'book' domain and it really causes problems, rip
it out. NUMA code has *never* worked well. And I'm not talking just
Linux. I'm talking about all the other systems that tried to do it,
and tried too effin hard.

Stop the idiocy. Make things *simpler* and *less* capable instead of
trying to cover some "real" topology. Nobody cares in real life.
Seriously. And the hardware people are going to keep on making things
different. Don't try to build up some perfect NUMA topology and then
try to see how insanely well you can match a particular machine. Make
some generic "roughly like this" topology with (say) four three of
NUMAness, and then have architectures say "this is roughly what my
machine looks like".

And if you cannot do that kind of "roughly generic NUMA", then stop
working on this crap immediately. Rather than waste everybodys time
with some crazy arch-specific random scheduling.

Make the levels be something like

 (a) "share core resources" (SMT or shared inner cache, like a shared
L2 when there is a big L3)
 (b) share socket
 (c) shared board/node

and don't try to make it any more clever. Don't try to describe just
*how* much resources are shared. Nobody cares at that level. If it's
not a difference of an order of magnitude, it's damn well the same
f*cking thing!  Don't think that SMT is all that different from "two
cores sharing the same L2". Sure, they are different, but you won't
find real-life loads that care enough for us to ever bother with the
differences.

Don't try to describe it any more than that. Seriously. Any time you
add "implementation detail"-level knowledge (like the s390 'book', or
the various forms of SMT vs "shared decoder" (Bulldozer/Trinity) vs
"true cores sharing L2" (clusters of cores sharing an L2, with a big
shared L3 within the socket)), you're just shooting yourself in the
foot. You're adding crap that shouldn't be added.

In particular, don't even try to give random "weights" to how close
things are to each other. Sure, you can parse (and generate) those
complex NUMA tables, but nobody is *ever* smart enough to really use
them. Once you move data between boards/nodes, screw the number of
hops. You are NOT going to get some scheduling decision right that
says "node X is closer to node Y than to node Z". Especially since the
load is invariably going to access non-node memory too *anyway*.

Seriously, if you think you need some complex data structures to
describe the relationship between cores, you're barking up the wrong
tree.

Make it a simple three-level tree like the above. No weights. No
*nothing*. If there isn't an order of magnitude difference in
performance and/or a *major* power domain issue, they're at the same
level. Nothing smarter than that, because it will just be f*cking
stupid, and it will be a nightmare to maintain, and nobody will
understand it anyway.

And the *only* thing that should be architecture-specific is the
architecture CPU init code that says "ok, this core is a SMT thread,
so it is in the same (a) level NUMA domain as that other core".

I'm very very serious about this. Try to make the scheduler have a
*simple* model that people can actually understand. For example, maybe
it can literally be a multi-level balancing thing, where the per-cpu
runqueues are grouped into a "shared core resources" balancer that
balances within the SMT or shared-L2 domain. And then there's an
upper-level balancer (that runs much less often) that balances
within the socket. And then there's one that balances within the
node/board. And finally one that balances across boards.

Then, at each level, just say "spread out" (unrelated loads that want
maximum throughput) vs "try to group together" (for power reasons and
to avoid cache bouncing for related loads).
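The multi-level idea above can be sketched with a trivial table; the level names follow the text, while the balance periods and the spread/pack assignments are invented for illustration:

```c
#include <assert.h>

/* The three levels sketched above, balanced at increasingly long
 * periods. The periods (in ms) and the per-level goals are invented. */
enum level { CORE_RESOURCES, SOCKET, BOARD, NR_LEVELS };
enum goal  { SPREAD, PACK };

struct balance_level {
	int period_ms;   /* how seldom this level rebalances */
	enum goal goal;  /* spread for throughput, pack for power */
};

static const struct balance_level levels[NR_LEVELS] = {
	[CORE_RESOURCES] = {   4, SPREAD },
	[SOCKET]         = {  64, SPREAD },
	[BOARD]          = { 512, PACK   },
};

/* Run every level whose period divides 'now'; inner levels fire often,
 * outer ones seldom. Returns how many levels balanced this tick. */
static int run_balance(int now_ms)
{
	int i, ran = 0;

	for (i = 0; i < NR_LEVELS; i++)
		if (now_ms % levels[i].period_ms == 0)
			ran++;
	return ran;
}
```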

I dunno what the details need to be, and the above is just some random
off-the-cuff example.

But what I *do* know that we don't want any arch-specific code. And I
*do* know that the real world simply isn't simple enough that we could
ever do a perfect job, so don't even try - instead aim for
"understandable, maintainable, and gets the main issues roughly
right".

                               Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-19 17:08               ` Linus Torvalds
@ 2012-05-19 22:55                 ` Peter Zijlstra
  2012-05-22  2:38                   ` Chen
  2012-05-19 23:13                 ` Plumbers: Tweaking scheduler policy micro-conf RFP Peter Zijlstra
  2012-05-19 23:22                 ` Peter Zijlstra
  2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-19 22:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Sat, 2012-05-19 at 10:08 -0700, Linus Torvalds wrote:

> Ingo, please don't take any of these patches if they are starting to
> make NUMA scheduling be some arch-specific crap.

I think there's a big misunderstanding here. I fully 100% agree with
you on that. And this thread in particular isn't about NUMA at all.

This thread is about modifying the arch interface of describing the
chip.

The current interface is we have 4 fixed topology domains:

 SMT
 MC
 BOOK
 CPU

(and the NUMA stuff comes on top of that and I just removed arch bits
from that, so let's leave that for now).

The first 3 domains depend on CONFIG_SCHED_{SMT,MC,BOOK} respectively,
and if an architecture selects one of those it has to provide a function
cpu_{smt,coregroup,book}_mask and optionally put a struct sched_domain
initializer in their asm/topology.h.

Now I've had quite a few complaints from arch maintainers that the
sched_domain initializer is a far too unwieldy interface to fill out and
I quite agree with them.

Now all I've meant to propose in this thread is to replace the entire
above with a simpler interface.

Instead of the above, all I'm asking archs to do is provide something
along the lines of:

struct sched_topology arch_topology[] = {
	{ cpu_smt_mask, ST_SMT },
	{ cpu_llc_mask, ST_CACHE },
	{ cpu_socket_mask, ST_SOCKET },
	{ NULL, },
};

and that's just about all an arch would need to do.
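To show how little the core would need from that, here is a self-contained sketch of consuming such a table; the struct mirrors the mail, and the toy bitmask functions model a hypothetical 4-CPU, single-socket machine with 2-way SMT:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long cpumask_t;  /* toy mask: one bit per CPU */

enum st_level { ST_SMT, ST_CACHE, ST_SOCKET };

struct sched_topology {
	cpumask_t (*mask)(int cpu);
	enum st_level level;
};

/* Hypothetical machine: CPUs {0,1} and {2,3} are SMT pairs, one LLC,
 * one socket covering all four CPUs. */
static cpumask_t cpu_smt_mask(int cpu)    { return 3UL << (cpu & ~1); }
static cpumask_t cpu_llc_mask(int cpu)    { (void)cpu; return 0xfUL; }
static cpumask_t cpu_socket_mask(int cpu) { (void)cpu; return 0xfUL; }

static const struct sched_topology arch_topology[] = {
	{ cpu_smt_mask,    ST_SMT    },
	{ cpu_llc_mask,    ST_CACHE  },
	{ cpu_socket_mask, ST_SOCKET },
	{ NULL, ST_SMT },
};

/* The core walks the table to the NULL sentinel, checking that each
 * level's mask is a superset of the previous ("ever increasing"). */
static int topology_valid(int cpu)
{
	cpumask_t prev = 1UL << cpu;
	const struct sched_topology *tl;

	for (tl = arch_topology; tl->mask; tl++) {
		cpumask_t m = tl->mask(cpu);

		if ((m & prev) != prev)
			return 0;
		prev = m;
	}
	return 1;
}
```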

That said, there are a few new things in ARM land like the big-little
stuff that have no direct relation to anything on the x86 side. And they
would very much like to have means of describing their chip topology as
well.


About power aware scheduling, yes its all a big mess and the current
stuff is horrid and broken.

That said, I do believe we can do better than nothing about it, and I'm
really not asking for anything perfect -- in fact I'm asking for pretty
much the same thing you are, something simple and understandable.

Simply packing work onto a minimum number of power-gated units instead
of spreading it out should yield some benefit. For this we'd need to know
at what granularity a chip can power-gate.
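A toy version of that packing rule, assuming a fixed number of power-gate domains with a per-domain task capacity (both numbers invented): prefer an already-awake domain with room, and only wake an idle domain when nothing else fits.

```c
#include <assert.h>

#define NR_GATE_DOMAINS 4

static int domain_capacity = 4;  /* tasks a gate domain takes before spill */

/* Pack: first non-idle domain with room; else wake an idle one;
 * else fall back to domain 0. 'load' is tasks per gate domain. */
static int pick_domain_power(const int load[NR_GATE_DOMAINS])
{
	int i;

	for (i = 0; i < NR_GATE_DOMAINS; i++)
		if (load[i] > 0 && load[i] < domain_capacity)
			return i;
	for (i = 0; i < NR_GATE_DOMAINS; i++)
		if (load[i] == 0)
			return i;
	return 0;
}

/* Helper so scenarios read as one call. */
static int pick_after(int a, int b, int c, int d)
{
	int load[NR_GATE_DOMAINS] = { a, b, c, d };

	return pick_domain_power(load);
}
```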

> I'm very very serious about this. Try to make the scheduler have a
> *simple* model that people can actually understand. For example, maybe
> it can literally be a multi-level balancing thing, where the per-cpu
> runqueues are grouped into a "shared core resources" balancer that
> balances within the SMT or shared-L2 domain. And then there's an
> upper-level balancer (that runs much more seldom) that is written to
> balances within the socket. And then one that balances within the
> node/board. And finally one that balances across boards.

That is basically how the scheduler is set up. These are the
sched_domains.

There is an awful lot of complexity in that code though, and I've been
trying to clean some of that up but its very slow going.

The purpose of this thread is to both simplify and allow people to more
easily express what they really care about. For this we need to explore
the problem space.

I know I haven't replied to all your points, and I suspect many are
related to annoyances you might have from other threads and I shall
attempt to answer them later.

I do feel bad that I've managed to annoy you to such a degree though. I
really would rather have a much simpler load-balancer too.







^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-19 17:08               ` Linus Torvalds
  2012-05-19 22:55                 ` Peter Zijlstra
@ 2012-05-19 23:13                 ` Peter Zijlstra
  2012-05-19 23:22                 ` Peter Zijlstra
  2 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-19 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Sat, 2012-05-19 at 10:08 -0700, Linus Torvalds wrote:
> Don't try to build up some perfect NUMA topology and then
> try to see how insanely well you can match a particular machine. Make
> some generic "roughly like this" topology with (say) four three of
> NUMAness, and then have architectures say "this is roughly what my
> machine looks like".

> In particular, don't even try to give random "weights" to how close
> things are to each other. Sure, you can parse (and generate) those
> complex NUMA tables, but nobody is *ever* smart enough to really use
> them. Once you move data between boards/nodes, screw the number of
> hops. You are NOT going to get some scheduling decision right that
> says "node X is closer to node Y than to node Z". Especially since the
> load is invariably going to access non-node memory too *anyway*. 

I suspect this is related to the patch I recently did that creates numa
levels from the node_distance() table.

The fact is, that patch removed arch specific code. And yes initially I
tried to use the weights for more than simply creating the balance
levels but I've already realized that was a mistake and removed that
part.

So currently all it does is create load-balance levels based on how far
apart nodes are said to be and decrease the balance rate roughly
proportional to how many cpus are in each level.
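That level-creation step can be sketched as: collect the distinct off-diagonal node_distance() values and create one balance level per value. The distance table below (two boards of two nodes each) is invented for illustration:

```c
#include <assert.h>

#define NR_NODES 4

/* Invented node_distance() table: local = 10, on-board = 20, off-board = 40. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static int node_distance(int a, int b) { return dist[a][b]; }

/* Number of distinct off-diagonal distances = number of NUMA balance
 * levels to create on top of the within-node domains. */
static int numa_levels(void)
{
	int seen[NR_NODES * NR_NODES];
	int n = 0, i, j, k, found;

	for (i = 0; i < NR_NODES; i++)
		for (j = 0; j < NR_NODES; j++) {
			if (i == j)
				continue;
			found = 0;
			for (k = 0; k < n; k++)
				if (seen[k] == node_distance(i, j))
					found = 1;
			if (!found)
				seen[n++] = node_distance(i, j);
		}
	return n;
}
```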

The node_distance() table is mostly already a fabrication of the
arch/firmware; some people do exactly what you suggested: exposing
simple groups of board-vs-rest and not bothering with fine details.

I used the node_distance() table simply because this was an existing
arch interface that provides exactly what was needed and is used for
exactly this purpose in the mm/ part of the kernel as well.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-19 17:08               ` Linus Torvalds
  2012-05-19 22:55                 ` Peter Zijlstra
  2012-05-19 23:13                 ` Plumbers: Tweaking scheduler policy micro-conf RFP Peter Zijlstra
@ 2012-05-19 23:22                 ` Peter Zijlstra
  2012-05-21  7:16                   ` Ingo Molnar
  2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-05-19 23:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, paulmck, smuckle, khilman, Robin.Randhawa,
	suresh.b.siddha, thebigcorporation, venki, panto, mingo,
	paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff, rostedt,
	tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Sat, 2012-05-19 at 10:08 -0700, Linus Torvalds wrote:
> And I
> *do* know that the real world simply isn't simple enough that we could
> ever do a perfect job, so don't even try - instead aim for
> "understandable, maintainable, and gets the main issues roughly
> right". 

I think we're in violent agreement on many points and most of this is
based on a misunderstanding. I've argued for exactly this many times.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-19 23:22                 ` Peter Zijlstra
@ 2012-05-21  7:16                   ` Ingo Molnar
  2012-05-21 16:56                     ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-05-21  7:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Vincent Guittot, paulmck, smuckle, khilman,
	Robin.Randhawa, suresh.b.siddha, thebigcorporation, venki, panto,
	mingo, paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff,
	rostedt, tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, 2012-05-19 at 10:08 -0700, Linus Torvalds wrote:
> > And I
> > *do* know that the real world simply isn't simple enough that we could
> > ever do a perfect job, so don't even try - instead aim for
> > "understandable, maintainable, and gets the main issues roughly
> > right". 
> 
> I think we're in violent agreement on many points and most of 
> this is based on a mis-understanding. I've argued for exactly 
> this many times.

it's these pending commits in tip:sched/core:

8e7fbcbc22c1 sched: Remove stale power aware scheduling remnants and dysfunctional knobs
fac536f7e492 Merge branch 'sched/urgent' into sched/core
13e099d2f77e sched/debug: Fix printing large integers on 32-bit platforms
e44bc5c5d00e sched/fair: Improve the ->group_imb logic
556061b00c9f sched/nohz: Fix rq->cpu_load[] calculations
870a0bb5d636 sched/numa: Don't scale the imbalance
04f733b4afac sched/fair: Revert sched-domain iteration breakage
316ad248307f sched/x86: Rewrite set_cpu_sibling_map()
dd7d8634e619 sched/numa: Fix the new NUMA topology bits
cb83b629bae0 sched/numa: Rewrite the CONFIG_NUMA sched domain support
bd939f45da24 sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
0ce90475dcdb sched/fair: Add some serialization to the sched_domain load-balance walk
c22402a2f76e sched/fair: Let minimally loaded cpu balance the group
c82513e51355 sched: Change rq->nr_running to unsigned int
ad7687dde878 x86/numa: Check for nonsensical topologies on real hw as well
0acbb440f063 x86/numa: Hard partition cpu topology masks on node boundaries
94c0dd3278dd x86/numa: Allow specifying node_distance() for numa=fake
19209bbb8612 x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
489a71b029cd sched: Update documentation and comments

the result of these commits is:

 24 files changed, 417 insertions(+), 975 deletions(-)

Most of the linecount win is due to the removal of the 
dysfunctional power scheduling - but even without that commit 
it's a simplification:

 15 files changed, 415 insertions(+), 481 deletions(-)

while it lifts the historic limitations of the sched-domains 
approach and makes the code a whole lot more logical.

Nevertheless I'll wait for Linus to confirm that he agrees 
violently as well, before sending these bits ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-21  7:16                   ` Ingo Molnar
@ 2012-05-21 16:56                     ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-05-21 16:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Vincent Guittot, paulmck, smuckle, khilman,
	Robin.Randhawa, suresh.b.siddha, thebigcorporation, venki, panto,
	mingo, paul.brett, pdeschrijver, pjt, efault, fweisbec, geoff,
	rostedt, tglx, amit.kucheria, linux-kernel, linaro-sched-sig,
	Morten Rasmussen, Juri Lelli

On Mon, May 21, 2012 at 12:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Nevertheless I'll wait for Linus to confirm that he agrees
> violently as well, before sending these bits ;-)

So as long as people aren't going to do some arch-specific scheduling,
I don't care. It sounds like I misread what Peter's intentions were.

And line count removal is always a good sign.

                  Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-19 22:55                 ` Peter Zijlstra
@ 2012-05-22  2:38                   ` Chen
  2012-05-22  5:14                     ` Chen
  2012-05-23 15:03                     ` Ingo Molnar
  0 siblings, 2 replies; 55+ messages in thread
From: Chen @ 2012-05-22  2:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, torvalds
  Cc: linux-kernel, mou Chen

On Sun, May 20, 2012 at 6:55 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, 2012-05-19 at 10:08 -0700, Linus Torvalds wrote:
>
>> Ingo, please don't take any of these patches if they are starting to
>> make NUMA scheduling be some arch-specific crap.
>
> I think there's a big mis-understanding here. I fully 100% agree with
> you on that. And this thread in particular isn't about NUMA at all.
>
> This thread is about modifying the arch interface of describing the
> chip.
>
> The current interface is we have 4 fixed topology domains:
>
>  SMT
>  MC
>  BOOK
>  CPU
>
> (and the NUMA stuff comes on top of that and I just removed arch bits
> from that, so lets leave that for now).
>
> The first 3 domains depend on CONFIG_SCHED_{SMT,MC,BOOK} resp. and if an
> architecture select one of those it will have to provide a function
> cpu_{smt,coregroup,book}_mask and optionally put a struct sched_domain
> initializer in their asm/topology.h.
>
> Now I've had quite a few complaints from arch maintainers that the
> sched_domain initializer is a far too unwieldy interface to fill out and
> I quite agree with them.
>
> Now all I've meant to propose in this thread is to replace the entire
> above with a simpler interface.
>
> Instead of the above all I'm asking of doing is providing something
> along the lines of:
>
> struct sched_topology arch_topology[] = {
>        { cpu_smt_mask, ST_SMT },
>        { cpu_llc_mask, ST_CACHE },
>        { cpu_socket_mask, ST_SOCKET },
>        { NULL, },
> };
>
> and that's just about all an arch would need to do.
>
> That said, there are a few new things in ARM land like the big-little
> stuff that have no direct relation to anything on the x86 side. And they
> would very much like to have means of describing their chip topology as
> well.
>
>
> About power aware scheduling, yes its all a big mess and the current
> stuff is horrid and broken.
>
> That said, I do believe we can do better than nothing about it, and I'm
> really not asking for anything perfect -- in fact I'm asking for pretty
> much the same thing you are, something simple and understandable.
>
> The simple pack stuff on a minimum amount of power-gated units instead
> of spreading it out should get some benefit. For this we'd need to know
> at what granularity a chip can power-gate.
>
>> I'm very very serious about this. Try to make the scheduler have a
>> *simple* model that people can actually understand. For example, maybe
>> it can literally be a multi-level balancing thing, where the per-cpu
>> runqueues are grouped into a "shared core resources" balancer that
>> balances within the SMT or shared-L2 domain. And then there's an
>> upper-level balancer (that runs much more seldom) that is written to
>> balance within the socket. And then one that balances within the
>> node/board. And finally one that balances across boards.
>
> That is basically how the scheduler is set up. These are the
> sched_domains.
>
> There is an awful lot of complexity in that code though, and I've been
> trying to clean some of that up, but it's very slow going.
>
> The purpose of this thread is to both simplify and allow people to more
> easily express what they really care about. For this we need to explore
> the problem space.
>
> I know I haven't replied to all your points, and I suspect many are
> related to annoyances you might have from other threads and I shall
> attempt to answer them later.
>
> I do feel bad that I've managed to annoy you to such a degree though. I
> really would rather have a much simpler load-balancer too.
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

So you are still trying to say that your code is not bloated?
Over 500K for a CPU scheduler. Laughable.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-22  2:38                   ` Chen
@ 2012-05-22  5:14                     ` Chen
  2012-05-30  7:20                       ` Ingo Molnar
  2012-05-23 15:03                     ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: Chen @ 2012-05-22  5:14 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, torvalds
  Cc: linux-kernel, mou Chen

On Tue, May 22, 2012 at 10:38 AM, Chen <hi3766691@gmail.com> wrote:
> So you are still trying to say that your code is not bloated?
> Over 500K for a CPU scheduler. Laughable.

So please stop increasing the size of the CPU scheduler's code. Users
don't benefit from that. Also, the scheduler's interactivity problem
still exists, though it has improved a lot already.

It would be better to stop the bloat, wouldn't it?

Also, I quite agree with Linus. The scheduler's current model is
complex and there is a lot of *UNNECESSARY* code. I REALLY CAN'T
BENEFIT FROM ANY OF IT. I just build my kernel with -j2 and music
playback already stutters! [Intel E7500, 2.9 GHz, two cores] That
shows how serious the interactivity problem is with the mainline CPU
scheduler. I know it is not entirely the mainline scheduler's fault,
but it is still a big interactivity problem. [I think Peter is proud
of his insane-box-supporting stuff]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-22  2:38                   ` Chen
  2012-05-22  5:14                     ` Chen
@ 2012-05-23 15:03                     ` Ingo Molnar
  2012-05-23 15:43                       ` Joe Perches
  1 sibling, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-05-23 15:03 UTC (permalink / raw)
  To: Chen; +Cc: Peter Zijlstra, Vincent Guittot, torvalds, linux-kernel


* Chen <hi3766691@gmail.com> wrote:

> So you are still trying to say that your code is not bloated?
> Over 500K for a CPU scheduler. Laughable.

Where did you get that 500K from? You are off from the truth 
almost by an order of magnitude.

Here's the scheduler size on Linus's latest tree, on 64-bit 
defconfig's:

 $ size kernel/sched/built-in.o 
   text	   data	    bss	    dec	    hex	filename
  83611	  10404	   2524	  96539	  1791b	kernel/sched/built-in.o

That's SMP+NUMA, i.e. everything included.

The !NUMA !SMP UP scheduler, if you are on a size starved 
ultra-embedded device, is even smaller, just 22K:

 $ size kernel/sched/built-in.o 
   text	   data	    bss	    dec	    hex	filename
  19882	   2218	    148	  22248	   56e8	kernel/sched/built-in.o
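[As an aside, the `dec` column reported by GNU `size` is simply text + data + bss, with `hex` the same total in hexadecimal; a quick shell check against the UP figures above:]

```shell
# dec = text + data + bss for the !NUMA !SMP scheduler figures above;
# hex is the same total printed in hexadecimal.
echo $((19882 + 2218 + 148))   # prints 22248
printf '%x\n' 22248            # prints 56e8
```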

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-23 15:03                     ` Ingo Molnar
@ 2012-05-23 15:43                       ` Joe Perches
  2012-05-23 15:50                         ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Joe Perches @ 2012-05-23 15:43 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chen, Peter Zijlstra, Vincent Guittot, torvalds, linux-kernel

On Wed, 2012-05-23 at 17:03 +0200, Ingo Molnar wrote:
> * Chen <hi3766691@gmail.com> wrote:
> 
> > So you are still trying to say that your code is not bloated?
> > Over 500K for a CPU scheduler. Laughable.
> 
> Where did you get that 500K from? You are off from the truth 
> almost by an order of magnitude.
> 
> Here's the scheduler size on Linus's latest tree, on 64-bit 
> defconfig's:
> 
>  $ size kernel/sched/built-in.o 
>    text	   data	    bss	    dec	    hex	filename
>   83611	  10404	   2524	  96539	  1791b	kernel/sched/built-in.o
> 
> That's SMP+NUMA, i.e. everything included.
> 
> The !NUMA !SMP UP scheduler, if you are on a size starved 
> ultra-embedded device, is even smaller, just 22K:
> 
>  $ size kernel/sched/built-in.o 
>    text	   data	    bss	    dec	    hex	filename
>   19882	   2218	    148	  22248	   56e8	kernel/sched/built-in.o

Here's an allyesconfig x86-32
$ size kernel/sched/built-in.o
   text	   data	    bss	    dec	    hex	filename
 213892	  10856	  65832	 290580	  46f14	kernel/sched/built-in.o

But that's not the only sched related code.

In a 1000 cpu config, there's also an extra 500+ bytes per cpu
in printk (I don't think that's particularly important btw)

kernel/printk.c adds:

static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);

Maybe #ifdefing this when !CONFIG_PRINTK would reduce size
a little in a few cases.  I've attached a trivial suggested patch.

btw: There's still the unnecessary
static DEFINE_PER_CPU(int, printk_pending);
but the code is more involved around that one.

printk.c needs some refactoring and modularization,
it's pretty ugly right now.

diff --git a/kernel/printk.c b/kernel/printk.c
index 32462d2..3bd9a11 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1734,16 +1734,21 @@ int is_console_locked(void)
 #define PRINTK_PENDING_SCHED	0x02
 
 static DEFINE_PER_CPU(int, printk_pending);
-static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
+
+#ifdef CONFIG_PRINTK
+static DEFINE_PER_CPU(char[PRINTK_BUF_SIZE], printk_sched_buf);
+#endif
 
 void printk_tick(void)
 {
 	if (__this_cpu_read(printk_pending)) {
 		int pending = __this_cpu_xchg(printk_pending, 0);
+#ifdef CONFIG_PRINTK
 		if (pending & PRINTK_PENDING_SCHED) {
 			char *buf = __get_cpu_var(printk_sched_buf);
 			printk(KERN_WARNING "[sched_delayed] %s", buf);
 		}
+#endif
 		if (pending & PRINTK_PENDING_WAKEUP)
 			wake_up_interruptible(&log_wait);
 	}



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-23 15:43                       ` Joe Perches
@ 2012-05-23 15:50                         ` Ingo Molnar
  2012-05-23 15:56                           ` Joe Perches
  2012-05-29 18:17                           ` [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK Joe Perches
  0 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-05-23 15:50 UTC (permalink / raw)
  To: Joe Perches; +Cc: Chen, Peter Zijlstra, Vincent Guittot, torvalds, linux-kernel


* Joe Perches <joe@perches.com> wrote:

> On Wed, 2012-05-23 at 17:03 +0200, Ingo Molnar wrote:
> > * Chen <hi3766691@gmail.com> wrote:
> > 
> > > So you are still trying to say that your code is not bloated?
> > > Over 500K for a CPU scheduler. Laughable.
> > 
> > Where did you get that 500K from? You are off from the truth 
> > almost by an order of magnitude.
> > 
> > Here's the scheduler size on Linus's latest tree, on 64-bit 
> > defconfig's:
> > 
> >  $ size kernel/sched/built-in.o 
> >    text	   data	    bss	    dec	    hex	filename
> >   83611	  10404	   2524	  96539	  1791b	kernel/sched/built-in.o
> > 
> > That's SMP+NUMA, i.e. everything included.
> > 
> > The !NUMA !SMP UP scheduler, if you are on a size starved 
> > ultra-embedded device, is even smaller, just 22K:
> > 
> >  $ size kernel/sched/built-in.o 
> >    text	   data	    bss	    dec	    hex	filename
> >   19882	   2218	    148	  22248	   56e8	kernel/sched/built-in.o
> 
> Here's an allyesconfig x86-32

allyesconfig includes a whole lot of debugging code so it's a 
pretty meaningless size test.

> $ size kernel/sched/built-in.o
>    text	   data	    bss	    dec	    hex	filename
>  213892	  10856	  65832	 290580	  46f14	kernel/sched/built-in.o
> 
> But that's not the only sched related code.
> 
> > In a 1000 cpu config, there's also an extra 500+ bytes per cpu
> > in printk (I don't think that's particularly important btw)

A 1000 cpu piece of hardware will have a terabyte of RAM or 
more. 0.5K per CPU is reasonable.

> kernel/printk.c adds:
> 
> static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
> 
> Maybe #ifdefing this when !CONFIG_PRINTK would reduce size
> a little in a few cases.  I've attached a trivial suggested patch.

That might make sense for the ultra-embedded.

Still 500K is an obviously nonsensical number.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-23 15:50                         ` Ingo Molnar
@ 2012-05-23 15:56                           ` Joe Perches
  2012-05-23 15:59                             ` Ingo Molnar
  2012-05-29 18:17                           ` [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK Joe Perches
  1 sibling, 1 reply; 55+ messages in thread
From: Joe Perches @ 2012-05-23 15:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chen, Peter Zijlstra, Vincent Guittot, torvalds, linux-kernel

On Wed, 2012-05-23 at 17:50 +0200, Ingo Molnar wrote:
> * Joe Perches <joe@perches.com> wrote:
> > In a 1000 cpu config, there's also an extra 500+ bytes per cpu
> > in printk (I don't think that's particularly important btw)
> A 1000 cpu piece of hardware will have a terabyte of RAM or 
> more. 0.5K per CPU is reasonable.

That's not fundamentally true, but it's not
particularly important right now either.

cheers, Joe


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-23 15:56                           ` Joe Perches
@ 2012-05-23 15:59                             ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-05-23 15:59 UTC (permalink / raw)
  To: Joe Perches; +Cc: Chen, Peter Zijlstra, Vincent Guittot, torvalds, linux-kernel


* Joe Perches <joe@perches.com> wrote:

> On Wed, 2012-05-23 at 17:50 +0200, Ingo Molnar wrote:
> > * Joe Perches <joe@perches.com> wrote:
> > >
> > > In a 1000 cpu config, there's also an extra 500+ bytes per cpu
> > > in printk (I don't think that's particularly important btw)
> >
> > A 1000 cpu piece of hardware will have a terabyte of RAM or 
> > more. 0.5K per CPU is reasonable.
> 
> That's not fundamentally true, [...]

Compared to the known alternatives it's pretty fundamentally 
true.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK
  2012-05-23 15:50                         ` Ingo Molnar
  2012-05-23 15:56                           ` Joe Perches
@ 2012-05-29 18:17                           ` Joe Perches
  2012-06-05 16:04                             ` Joe Perches
  2012-06-06  7:33                             ` Ingo Molnar
  1 sibling, 2 replies; 55+ messages in thread
From: Joe Perches @ 2012-05-29 18:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kay Sievers, LKML

The size of the per-cpu printk_sched buf is much larger
than necessary.  The maximum sched message emitted is
~80 bytes. Shrink the allocation for this printk_sched
buffer from 512 bytes to 128.

printk_sched creates an unnecessary per-cpu buffer when
CONFIG_PRINTK is not enabled.  Remove it when appropriate
so embedded uses save a bit of space too.

Signed-off-by: Joe Perches <joe@perches.com>
---
 kernel/printk.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 32462d2..61cff0b 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1726,24 +1726,30 @@ int is_console_locked(void)
 }
 
 /*
- * Delayed printk version, for scheduler-internal messages:
+ * Delayed printk version, for scheduler-internal messages.
+ * Not the normal 512 as it's a bit wasteful, sched messages are short,
+ * and 128 is more than sufficient for all current messages.
  */
-#define PRINTK_BUF_SIZE		512
+#define PRINTK_SCHED_BUF_SIZE	128
 
 #define PRINTK_PENDING_WAKEUP	0x01
 #define PRINTK_PENDING_SCHED	0x02
 
 static DEFINE_PER_CPU(int, printk_pending);
-static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
+#ifdef CONFIG_PRINTK
+static DEFINE_PER_CPU(char [PRINTK_SCHED_BUF_SIZE], printk_sched_buf);
+#endif
 
 void printk_tick(void)
 {
 	if (__this_cpu_read(printk_pending)) {
 		int pending = __this_cpu_xchg(printk_pending, 0);
+#ifdef CONFIG_PRINTK
 		if (pending & PRINTK_PENDING_SCHED) {
 			char *buf = __get_cpu_var(printk_sched_buf);
 			printk(KERN_WARNING "[sched_delayed] %s", buf);
 		}
+#endif
 		if (pending & PRINTK_PENDING_WAKEUP)
 			wake_up_interruptible(&log_wait);
 	}
@@ -2189,7 +2195,7 @@ int printk_sched(const char *fmt, ...)
 	buf = __get_cpu_var(printk_sched_buf);
 
 	va_start(args, fmt);
-	r = vsnprintf(buf, PRINTK_BUF_SIZE, fmt, args);
+	r = vsnprintf(buf, PRINTK_SCHED_BUF_SIZE, fmt, args);
 	va_end(args);
 
 	__this_cpu_or(printk_pending, PRINTK_PENDING_SCHED);



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: Plumbers: Tweaking scheduler policy micro-conf RFP
  2012-05-22  5:14                     ` Chen
@ 2012-05-30  7:20                       ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-05-30  7:20 UTC (permalink / raw)
  To: Chen; +Cc: Peter Zijlstra, Vincent Guittot, torvalds, linux-kernel


(restored the Cc:s)

* Chen <hi3766691@gmail.com> wrote:

> Oh, just count the size of the scheduler code yourself; it's
> actually 400-500K: core.c + fair.c + rt.c + idle_task.c +
> everything

Only binary code is counted in bytes, source code is counted in 
lines.

20 KLOC for a full-featured CPU scheduler that does everything 
from simple UP scheduling to thousands of CPUs NUMA scheduling, 
cgroups, real-time and more, is entirely reasonable.

As a comparison the VM is 80+ KLOCS, arch/x86/ is 260+ KLOCs, 
networking is 720+ KLOCS and the FS subsystem is over 1 million 
lines of code.

The scheduler is in fact one of the smaller subsystems.
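[The lines-vs-bytes distinction can be made concrete with a throwaway file (hypothetical content, nothing to do with the kernel tree): `wc -l` measures source, while `wc -c` or `size` on an object measures bytes:]

```shell
# Hypothetical demo of "binary code is counted in bytes, source in lines":
# three short lines of "source" occupy 18 bytes on disk.
tmp=$(mktemp)
printf 'line1\nline2\nline3\n' > "$tmp"
wc -l < "$tmp"   # line count: 3
wc -c < "$tmp"   # byte count: 18
rm -f "$tmp"
```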

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK
  2012-05-29 18:17                           ` [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK Joe Perches
@ 2012-06-05 16:04                             ` Joe Perches
  2012-06-06  7:25                               ` Ingo Molnar
  2012-06-06  7:33                             ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: Joe Perches @ 2012-06-05 16:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kay Sievers, LKML

On Tue, 2012-05-29 at 11:17 -0700, Joe Perches wrote:
> The size of the per-cpu printk_sched buf is much larger
> than necessary.  The maximum sched message emitted is
> ~80 bytes. Shrink the allocation for this printk_sched
> buffer from 512 bytes to 128.
> 
> printk_sched creates an unnecessary per-cpu buffer when
> CONFIG_PRINTK is not enabled.  Remove it when appropriate
> so embedded uses save a bit of space too.

Ingo, what's happening with this patch?



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK
  2012-06-05 16:04                             ` Joe Perches
@ 2012-06-06  7:25                               ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-06-06  7:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kay Sievers, LKML


* Joe Perches <joe@perches.com> wrote:

> On Tue, 2012-05-29 at 11:17 -0700, Joe Perches wrote:
> > The size of the per-cpu printk_sched buf is much larger
> > than necessary.  The maximum sched message emitted is
> > ~80 bytes. Shrink the allocation for this printk_sched
> > buffer from 512 bytes to 128.
> > 
> > printk_sched creates an unnecessary per-cpu buffer when
> > CONFIG_PRINTK is not enabled.  Remove it when appropriate
> > so embedded uses save a bit of space too.
> 
> Ingo, what's happening with this patch?

The merge window is a busy period for most maintainers, so 
non-fix patches sent in that period typically get delayed,
to be processed when there's more time available.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK
  2012-05-29 18:17                           ` [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK Joe Perches
  2012-06-05 16:04                             ` Joe Perches
@ 2012-06-06  7:33                             ` Ingo Molnar
  2012-06-06  7:42                               ` Joe Perches
  1 sibling, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-06-06  7:33 UTC (permalink / raw)
  To: Joe Perches
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kay Sievers, LKML


* Joe Perches <joe@perches.com> wrote:

> The size of the per-cpu printk_sched buf is much larger
> than necessary.  The maximum sched message emitted is
> ~80 bytes. Shrink the allocation for this printk_sched
> buffer from 512 bytes to 128.
> 
> printk_sched creates an unnecessary per-cpu buffer when
> CONFIG_PRINTK is not enabled.  Remove it when appropriate
> so embedded uses save a bit of space too.
> 
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  kernel/printk.c |   14 ++++++++++----
>  1 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/printk.c b/kernel/printk.c
> index 32462d2..61cff0b 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1726,24 +1726,30 @@ int is_console_locked(void)
>  }
>  
>  /*
> - * Delayed printk version, for scheduler-internal messages:
> + * Delayed printk version, for scheduler-internal messages.
> + * Not the normal 512 as it's a bit wasteful, sched messages are short,
> + * and 128 is more than sufficient for all current messages.
>   */
> -#define PRINTK_BUF_SIZE		512
> +#define PRINTK_SCHED_BUF_SIZE	128
>  
>  #define PRINTK_PENDING_WAKEUP	0x01
>  #define PRINTK_PENDING_SCHED	0x02
>  
>  static DEFINE_PER_CPU(int, printk_pending);
> -static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
> +#ifdef CONFIG_PRINTK
> +static DEFINE_PER_CPU(char [PRINTK_SCHED_BUF_SIZE], printk_sched_buf);
> +#endif
>  
>  void printk_tick(void)
>  {
>  	if (__this_cpu_read(printk_pending)) {
>  		int pending = __this_cpu_xchg(printk_pending, 0);
> +#ifdef CONFIG_PRINTK
>  		if (pending & PRINTK_PENDING_SCHED) {
>  			char *buf = __get_cpu_var(printk_sched_buf);
>  			printk(KERN_WARNING "[sched_delayed] %s", buf);
>  		}
> +#endif
>  		if (pending & PRINTK_PENDING_WAKEUP)
>  			wake_up_interruptible(&log_wait);
>  	}
> @@ -2189,7 +2195,7 @@ int printk_sched(const char *fmt, ...)
>  	buf = __get_cpu_var(printk_sched_buf);
>  
>  	va_start(args, fmt);
> -	r = vsnprintf(buf, PRINTK_BUF_SIZE, fmt, args);
> +	r = vsnprintf(buf, PRINTK_SCHED_BUF_SIZE, fmt, args);
>  	va_end(args);
>  
>  	__this_cpu_or(printk_pending, PRINTK_PENDING_SCHED);

The change makes sense but the further proliferation of #ifdefs 
is rather ugly and shows confusion: fundamentally, if we are 
going to cut out more printk functionality in the !CONFIG_PRINTK 
we might as well disable the whole thing, not just the 
printk_sched bits.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK
  2012-06-06  7:33                             ` Ingo Molnar
@ 2012-06-06  7:42                               ` Joe Perches
  0 siblings, 0 replies; 55+ messages in thread
From: Joe Perches @ 2012-06-06  7:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kay Sievers, LKML

On Wed, 2012-06-06 at 09:33 +0200, Ingo Molnar wrote:
> The change makes sense but the further proliferation of #ifdefs 
> is rather ugly and shows confusion: fundamentally, if we are 
> going to cut out more printk functionality in the !CONFIG_PRINTK 
> we might as well disable the whole thing, not just the 
> printk_sched bits.

I fundamentally agree, but until I (or another sucker) get
around to actually refactoring printk into multiple
logical components, this is the simplest way to go
because it's guaranteed to work.

When it's refactored, a lot of the CONFIG tests should
go away.


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2012-06-06  7:42 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-11 16:16 Plumbers: Tweaking scheduler policy micro-conf RFP Vincent Guittot
2012-05-11 16:26 ` Steven Rostedt
2012-05-11 16:38   ` Vincent Guittot
2012-05-15  8:41     ` Juri Lelli
2012-05-15  0:53 ` Paul E. McKenney
2012-05-15  8:02 ` Vincent Guittot
2012-05-15  8:34   ` mou Chen
2012-05-15  9:07     ` Vincent Guittot
2012-05-15  9:17       ` Pantelis Antoniou
2012-05-15 10:28         ` Peter Zijlstra
2012-05-15 11:35           ` Pantelis Antoniou
2012-05-15 11:58             ` Peter Zijlstra
2012-05-15 12:32               ` Pantelis Antoniou
2012-05-15 12:59                 ` Peter Zijlstra
2012-05-19 14:58               ` Luming Yu
2012-05-15 20:26             ` valdis.kletnieks
2012-05-15 20:33               ` Peter Zijlstra
2012-05-16 12:08               ` Pantelis Antoniou
2012-05-15 12:23   ` Peter Zijlstra
2012-05-15 12:27     ` Peter Zijlstra
2012-05-15 12:57     ` Vincent Guittot
2012-05-15 13:00       ` Peter Zijlstra
2012-05-15 15:05         ` Vincent Guittot
2012-05-15 15:19           ` Paul E. McKenney
2012-05-15 15:27             ` Vincent Guittot
2012-05-15 15:35           ` Peter Zijlstra
2012-05-15 15:45             ` Peter Zijlstra
2012-05-16 18:30             ` Peter Zijlstra
2012-05-19 17:08               ` Linus Torvalds
2012-05-19 22:55                 ` Peter Zijlstra
2012-05-22  2:38                   ` Chen
2012-05-22  5:14                     ` Chen
2012-05-30  7:20                       ` Ingo Molnar
2012-05-23 15:03                     ` Ingo Molnar
2012-05-23 15:43                       ` Joe Perches
2012-05-23 15:50                         ` Ingo Molnar
2012-05-23 15:56                           ` Joe Perches
2012-05-23 15:59                             ` Ingo Molnar
2012-05-29 18:17                           ` [PATCH] printk: Shrink printk_sched buffer size, eliminate it when !CONFIG_PRINTK Joe Perches
2012-06-05 16:04                             ` Joe Perches
2012-06-06  7:25                               ` Ingo Molnar
2012-06-06  7:33                             ` Ingo Molnar
2012-06-06  7:42                               ` Joe Perches
2012-05-19 23:13                 ` Plumbers: Tweaking scheduler policy micro-conf RFP Peter Zijlstra
2012-05-19 23:22                 ` Peter Zijlstra
2012-05-21  7:16                   ` Ingo Molnar
2012-05-21 16:56                     ` Linus Torvalds
2012-05-16 18:49             ` Vaidyanathan Srinivasan
2012-05-16 19:40               ` Peter Zijlstra
2012-05-16 21:20             ` Vincent Guittot
     [not found]             ` <20120518161817.GE18312@e103034-lin.cambridge.arm.com>
2012-05-18 16:24               ` Morten Rasmussen
2012-05-18 16:39                 ` Peter Zijlstra
2012-05-18 16:46                 ` Pantelis Antoniou
2012-05-15 16:30           ` Vaidyanathan Srinivasan
2012-05-15 18:13             ` Vincent Guittot
