* [patch] CFS scheduler, v3
@ 2007-04-18 17:50 Ingo Molnar
  2007-04-18 21:26 ` William Lee Irwin III
  2007-04-20  0:10 ` Peter Williams
  0 siblings, 2 replies; 39+ messages in thread
From: Ingo Molnar @ 2007-04-18 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett


this is the third release of the CFS patchset (against v2.6.21-rc7), and 
can be downloaded from:
 
   http://redhat.com/~mingo/cfs-scheduler/

this is a pure "fix reported regressions" release so there's much less 
churn:

   5 files changed, 71 insertions(+), 29 deletions(-)

(the added lines are mostly debug related, not genuine increase in the 
scheduler's size)

Changes since -v2:

 - bugfix: the yield() implementation was too naive and broke xine-lib.
   (this caused the Kaffeine problems. Reported by S.Çağlar Onur)

 - performance fix: increased preemption granularity from 1msec to 
   5msecs, to address the kernbench regression reported by Nick Piggin. 
   (5msecs is probably still too low. This problem has also been pointed
    out by Con Kolivas.)

 - cleanup: renamed requeue_task to yield_task. (suggested by William 
   Lee Irwin III)

 - bugfix: use constant offset factor for nice levels instead of 
   sched_granularity_ns. Thus nice levels work even if someone sets 
   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
   address the many nice level related suggestions in -v4.

 - compiler warning fix: get rid of unused variable on UP in sched.c.
   (reported by Gene Heskett)

 - debug: new /proc/sys/kernel/sysctl_sched_delayed_wakeups flag.
   Default: off. This is to debug Gene Heskett's box.

as usual, any sort of feedback, bugreports, fixes and suggestions are 
more than welcome,

	Ingo


* Re: [patch] CFS scheduler, v3
  2007-04-18 17:50 [patch] CFS scheduler, v3 Ingo Molnar
@ 2007-04-18 21:26 ` William Lee Irwin III
  2007-04-18 21:33   ` Ingo Molnar
  2007-04-20 19:24   ` Christoph Lameter
  2007-04-20  0:10 ` Peter Williams
  1 sibling, 2 replies; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-18 21:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

On Wed, Apr 18, 2007 at 07:50:17PM +0200, Ingo Molnar wrote:
> this is the third release of the CFS patchset (against v2.6.21-rc7), and 
> can be downloaded from:
>    http://redhat.com/~mingo/cfs-scheduler/
> this is a pure "fix reported regressions" release so there's much less 
> churn:
>    5 files changed, 71 insertions(+), 29 deletions(-)
> (the added lines are mostly debug related, not genuine increase in the 
> scheduler's size)

It appears to me that the following can be taken in for mainline
(or rejected for mainline) independently of the rest of the cfs patch.


-- wli

Mark the runqueues ____cacheline_aligned_in_smp to avoid false sharing.

Index: sched/kernel/sched.c
===================================================================
--- sched.orig/kernel/sched.c	2007-04-18 14:10:03.593207728 -0700
+++ sched/kernel/sched.c	2007-04-18 14:11:39.270660075 -0700
@@ -278,7 +278,7 @@
 	struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
 
 static inline int cpu_of(struct rq *rq)
 {


* Re: [patch] CFS scheduler, v3
  2007-04-18 21:26 ` William Lee Irwin III
@ 2007-04-18 21:33   ` Ingo Molnar
  2007-04-20 19:24   ` Christoph Lameter
  1 sibling, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2007-04-18 21:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett


* William Lee Irwin III <wli@holomorphy.com> wrote:

> It appears to me that the following can be taken in for mainline (or 
> rejected for mainline) independently of the rest of the cfs patch.

yeah - it's a patch written by Suresh, and this should already be in the 
for-v2.6.22 -mm queue. See:

  Subject: [patch] sched: align rq to cacheline boundary

on lkml.

	Ingo


* Re: [patch] CFS scheduler, v3
  2007-04-18 17:50 [patch] CFS scheduler, v3 Ingo Molnar
  2007-04-18 21:26 ` William Lee Irwin III
@ 2007-04-20  0:10 ` Peter Williams
  2007-04-20  4:48   ` Willy Tarreau
                     ` (3 more replies)
  1 sibling, 4 replies; 39+ messages in thread
From: Peter Williams @ 2007-04-20  0:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett

Ingo Molnar wrote:
> 
>  - bugfix: use constant offset factor for nice levels instead of 
>    sched_granularity_ns. Thus nice levels work even if someone sets 
>    sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
>    address the many nice level related suggestions in -v4.

I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand it, the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.

First suppose we have the following metrics available in addition to 
what's already provided.

rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */

where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  (Sorry for the long explanation 
here but I just wanted to make sure there was no chance that "scheduling 
cycle" would be construed as some mechanism being imposed on the scheduler.)

We can then define:

effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)

If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:

expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight

Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
shift right per scheduling cycle of the task.  An additional cost is 
that you need a shift right to get the nanosecond value from the value 
stored in the task struct. (i.e. the above code is simplified to give 
the general idea).  The average would be number of cycles based rather 
than time based and (happily) this simplifies the calculations.
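
For illustration, here is a minimal, self-contained sketch of the above
bookkeeping as userspace C (the names, the 1/8 decay factor and the
fixed-point storage are assumptions made up for the example, not proposed
kernel identifiers):

#include <stdint.h>

#define AVG_SHIFT 3	/* decay: each new sample carries a 1/8 weight */

struct task_model {
	uint64_t avg_cpu_shifted;	/* avg_cpu_per_cycle << AVG_SHIFT */
	unsigned long load_weight;	/* nice-derived weight, assumed > 0 */
};

/* Update the running average when the task comes off the CPU, given the
 * nanoseconds it just used (one subtract, one shift, one add on the
 * pre-shifted stored value). */
static void update_avg_cpu_per_cycle(struct task_model *t, uint64_t used_ns)
{
	t->avg_cpu_shifted += used_ns - (t->avg_cpu_shifted >> AVG_SHIFT);
}

/* Fair maximum wait before the task should get onto the CPU:
 * avg_cpu_per_cycle * effective_weighted_load / load_weight. */
static uint64_t expected_wait_ns(const struct task_model *t,
				 unsigned long raw_weighted_load,
				 unsigned long avg_weighted_load)
{
	unsigned long ewl = raw_weighted_load > avg_weighted_load ?
			    raw_weighted_load : avg_weighted_load;
	uint64_t avg_ns = t->avg_cpu_shifted >> AVG_SHIFT;

	return avg_ns * ewl / t->load_weight;
}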

If the expected time to get onto the CPU (i.e. expected_wait plus the 
current time) for p is earlier than the equivalent time for the 
currently running task then preemption of that task would be justified.

I appreciate that the notion of basing the expected wait on the task's 
average cpu use per scheduling cycle is counter intuitive but I believe 
that (if you think about it) you'll see that it actually makes sense.

Peter
PS Some reordering of the calculations within the expressions might be 
in order to keep them within the range of 32-bit arithmetic and so avoid 
64-bit arithmetic on 32-bit machines.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, v3
  2007-04-20  0:10 ` Peter Williams
@ 2007-04-20  4:48   ` Willy Tarreau
  2007-04-20  6:02     ` Peter Williams
  2007-04-20  6:46   ` Ingo Molnar
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2007-04-20  4:48 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Gene Heskett

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> Ingo Molnar wrote:
> >
> > - bugfix: use constant offset factor for nice levels instead of 
> >   sched_granularity_ns. Thus nice levels work even if someone sets 
> >   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
> >   address the many nice level related suggestions in -v4.
> 
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand it, the basic principle behind 
> this scheduler is to work out a time by which a task should make it onto 
> the CPU and then place it into an ordered list (based on this value) of 
> tasks waiting for the CPU.  I think that this is a great idea and my 
> suggestion is with regard to a method for working out this time that 
> takes into account both fairness and nice.
> 
> First suppose we have the following metrics available in addition to 
> what's already provided.
> 
> rq->avg_weight_load /* a running average of the weighted load on the CPU */
> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> CPU each scheduling cycle */
> 
> where a scheduling cycle for a task starts when it is placed on the 
> queue after waking or being preempted and ends when it is taken off the 
> CPU either voluntarily or after being preempted.  So 
> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> CPU each time it gets on to the CPU.  Sorry for the long explanation 
> here but I just wanted to make sure there was no chance that "scheduling 
> cycle" would be construed as some mechanism being imposed on the scheduler.)
> 
> We can then define:
> 
> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
> 
> If p is just waking (i.e. it's not on the queue and its load_weight is 
> not included in rq->raw_weighted_load) and we need to queue it, we say 
> that the maximum time (in all fairness) that p should have to wait to 
> get onto the CPU is:
> 
> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> p->load_weight
> 
> Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
> shift right per scheduling cycle of the task.  An additional cost is 
> that you need a shift right to get the nanosecond value from value 
> stored in the task struct. (i.e. the above code is simplified to give 
> the general idea).  The average would be number of cycles based rather 
> than time based and (happily) this simplifies the calculations.
> 
> If the expected time to get onto the CPU (i.e. expected_wait plus the 
> current time) for p is earlier than the equivalent time for the 
> currently running task then preemption of that task would be justified.

I 100% agree on this method because I came to nearly the same conclusion on
paper about 1 year ago. What I'd like to add is that the expected wake up time
is not the most precise criterion for fairness. The expected completion
time is better. When you have one task t1 which is expected to run for T1
nanosecs and another task t2 which is expected to run for T2, what is
important for the user for fairness is when the task completes its work. If
t1 should wake up at time W1 and t2 at W2, then the list should be ordered
by comparing W1+T1 and W2+T2.

What I like with this method is that it remains fair with nice tasks because
in order to renice a task tN, you just have to change TN, and if it
has to run shorter, it can be executed before CPU hogs and stay there for a
very short time.

Also, I found that if we want to respect interactivity, we must conserve a
credit for each task. It is a bounded amount of CPU time left to be used. When
the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not
spend T3 nsec on the CPU, but only N3<T3, then we have a credit C3 computed
like this :

  C3 = MIN(MAX_CREDIT, C3 + T3 - N3)

And if a CPU hog uses more than its assigned time slice due to scheduler
resolution, then C3 can become negative (and bounded too) :

  C3 = MAX(MIN_CREDIT, C3 + T3 - N3)

Now how is the credit used ? Simple: the credit is a part of a timeslice, so
it's systematically added to the computed timeslice when ordering the tasks.
So we indeed order tasks t1 and t2 by W1+T1+C1 and W2+T2+C2.

It means that an interactive task which has not eaten all of its timeslice will
accumulate a time credit and have a good chance of being able to run before others
if it wakes up again, and even use slightly more CPU time than others if it
has not used it before. Conversely, if a task eats too much CPU time, it will
be punished and wait longer than others, and run less, to compensate for the
advance it has taken.
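
To make the bookkeeping concrete, here is a small illustrative sketch in C
(the MAX_CREDIT/MIN_CREDIT values and all the names are invented for the
example, not a concrete proposal):

#include <stdint.h>

#define MAX_CREDIT	( 20 * 1000 * 1000LL)	/* +20 ms, arbitrary bound */
#define MIN_CREDIT	(-20 * 1000 * 1000LL)	/* -20 ms, arbitrary bound */

struct wtask {
	int64_t wakeup_ns;	/* W: when the task became runnable      */
	int64_t slice_ns;	/* T: CPU time it is entitled to consume */
	int64_t credit_ns;	/* C: bounded leftover or overdraft      */
};

/* Called when the task leaves the CPU having actually used used_ns (N):
 * C = C + T - N, clamped between MIN_CREDIT and MAX_CREDIT. */
static void update_credit(struct wtask *t, int64_t used_ns)
{
	int64_t c = t->credit_ns + t->slice_ns - used_ns;

	if (c > MAX_CREDIT)
		c = MAX_CREDIT;
	else if (c < MIN_CREDIT)
		c = MIN_CREDIT;
	t->credit_ns = c;
}

/* Ordering key for runnable tasks: expected completion time W + T + C. */
static int64_t completion_key(const struct wtask *t)
{
	return t->wakeup_ns + t->slice_ns + t->credit_ns;
}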

Also, what I like with this method is that it can correctly handle the
fork/exit storms which quickly change the per-task allocated CPU time.
Upon fork() or exit(), it should not be too hard to readjust Tx and Cx
for each task and reorder them according to their new completion time.

I've not found a way to include a variable nr_running in the tree and
order the tasks according to an external variable, hence the need to
rescan the tree upon fork/exit for maximum precision. That's where I
stopped working on those ideas. If someone knows how to order the tree
by (Wx+(Tx+Cx)/nr_running) with nr_running which can change at any time
but which is common for everyone in the tree, that would be great.

Best regards,
Willy



* Re: [patch] CFS scheduler, v3
  2007-04-20  4:48   ` Willy Tarreau
@ 2007-04-20  6:02     ` Peter Williams
  2007-04-20  6:21       ` Peter Williams
  2007-04-20  7:26       ` Willy Tarreau
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Williams @ 2007-04-20  6:02 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Gene Heskett

Willy Tarreau wrote:
> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>> Ingo Molnar wrote:
>>> - bugfix: use constant offset factor for nice levels instead of 
>>>   sched_granularity_ns. Thus nice levels work even if someone sets 
>>>   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
>>>   address the many nice level related suggestions in -v4.
>> I have a suggestion I'd like to make that addresses both nice and 
>> fairness at the same time.  As I understand it, the basic principle behind 
>> this scheduler is to work out a time by which a task should make it onto 
>> the CPU and then place it into an ordered list (based on this value) of 
>> tasks waiting for the CPU.  I think that this is a great idea and my 
>> suggestion is with regard to a method for working out this time that 
>> takes into account both fairness and nice.
>>
>> First suppose we have the following metrics available in addition to 
>> what's already provided.
>>
>> rq->avg_weight_load /* a running average of the weighted load on the CPU */
>> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
>> CPU each scheduling cycle */
>>
>> where a scheduling cycle for a task starts when it is placed on the 
>> queue after waking or being preempted and ends when it is taken off the 
>> CPU either voluntarily or after being preempted.  So 
>> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
>> CPU each time it gets on to the CPU.  Sorry for the long explanation 
>> here but I just wanted to make sure there was no chance that "scheduling 
>> cycle" would be construed as some mechanism being imposed on the scheduler.)
>>
>> We can then define:
>>
>> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
>>
>> If p is just waking (i.e. it's not on the queue and its load_weight is 
>> not included in rq->raw_weighted_load) and we need to queue it, we say 
>> that the maximum time (in all fairness) that p should have to wait to 
>> get onto the CPU is:
>>
>> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
>> p->load_weight
>>
>> Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
>> shift right per scheduling cycle of the task.  An additional cost is 
>> that you need a shift right to get the nanosecond value from value 
>> stored in the task struct. (i.e. the above code is simplified to give 
>> the general idea).  The average would be number of cycles based rather 
>> than time based and (happily) this simplifies the calculations.
>>
>> If the expected time to get onto the CPU (i.e. expected_wait plus the 
>> current time) for p is earlier than the equivalent time for the 
>> currently running task then preemption of that task would be justified.
> 
> I 100% agree on this method because I came to nearly the same conclusion on
> paper about 1 year ago. What I'd like to add is that the expected wake up time
> is not the most precise criterion for fairness.

It's not the expected wake up time being computed.  It's the expected 
time that the task is selected to run after being put on the queue 
(either as a result of waking or being pre-empted).

I think that your comments below are mainly invalid because of this 
misunderstanding.

> The expected completion
> time is better. When you have one task t1 which is expected to run for T1
> nanosecs and another task t2 which is expected to run for T2, what is
> important for the user for fairness is when the task completes its work. If
> t1 should wake up at time W1 and t2 at W2, then the list should be ordered
> by comparing W1+T1 and W2+T2.

This is a point where misunderstanding results in a wrong conclusion. 
However, the notion of expected completion time is useful.  I'd 
calculate the expected completion time for the task as it goes on to the 
CPU and if while it is running a new task is queued whose expected "on 
CPU" time is before the current task's expected end time you can think 
about preempting when the new task's "on CPU" time arrives -- how 
hard/easy this would be to implement is moot.

> 
> What I like with this method is that it remains fair with nice tasks because
> because in order to renice a task tN, you just have to change TN, and if it
> has to run shorter, it can be executed before CPU hogs and stay there for a
> very short time.
> 
> Also, I found that if we want to respect interactivity, we must conserve a
> credit for each task.

This is one point where the misunderstanding results in an invalid 
conclusion.  There is no need for credits as the use of the average time 
on CPU each cycle makes the necessary corrections that credits would be 
used to address.

> It is a bounded amount of CPU time left to be used. When
> the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not
> spend T3 nsec on the CPU, but only N3<T3, then we have a credit C3 computed
> like this :
> 
>   C3 = MIN(MAX_CREDIT, C3 + T3 - N3)
> 
> And if a CPU hog uses more than its assigned time slice due to scheduler
> resolution, then C3 can become negative (and bounded too) :
> 
>   C3 = MAX(MIN_CREDIT, C3 + T3 - N3)
> 
> Now how is the credit used ? Simple: the credit is a part of a timeslice, so
> it's systematically added to the computed timeslice when ordering the tasks.
> So we indeed order tasks t1 and t2 by W1+T1+C1 and W2+T2+C2.

I think this is overcomplicating matters.

However, some compensating credit might be in order for tasks that get 
pre-empted e.g. base their next expected "on CPU" time on the difference 
between their average on CPU time and the amount they got to use before 
they were pre-empted.

> 
> It means that an interactive task which has not eaten all of its timeslice will
> accumulate a time credit and have a good chance of being able to run before others
> if it wakes up again, and even use slightly more CPU time than others if it
> has not used it before. Conversely, if a task eats too much CPU time, it will
> be punished and wait longer than others, and run less, to compensate for the
> advance it has taken.

I think that with this model you can almost do away with the notion of a 
time slice but I need to think about that some more as the idea I have 
for doing that may have some gotchas.

> 
> Also, what I like with this method is that it can correctly handle the
> fork/exit storms which quickly change the per-task allocated CPU time.
> Upon fork() or exit(), it should not be too hard to readjust Tx and Cx
> for each task and reorder them according to their new completion time.
> 
> I've not found a way to include a variable nr_running in the tree and
> order the tasks according to an external variable, hence the need to
> rescan the tree upon fork/exit for maximum precision. That's where I
> stopped working on those ideas. If someone knows how to order the tree
> by (Wx+(Tx+Cx)/nr_running) with nr_running which can change at any time
> but which is common for everyone in the tree, that would be great.

There is only one potential exploit for this method that I can see (but 
that doesn't mean there aren't others) and that is a task that does 
virtually no CPU use when on the CPU and has very short sleeps such that 
the expected "on CPU" time is "right now" (and it's happening at high 
frequency due to the short sleeps), even when other tasks are running, 
due to the average "on CPU" time being zero. This could be defeated by 
using max(average "on CPU" time, some system-dependent minimum).

In practice, it might be prudent to use the idea of maximum expected "on 
CPU" time instead of the average (especially if you're going to use this 
for triggering pre-emption).  This would be the average plus two or 
three times the standard deviation and has the downside that you have to 
calculate the standard deviation and that means a square root (of course 
we could use some less mathematically rigorous metric to approximate the 
standard deviation).  The advantage would be that streamer type programs 
such as audio/video applications would have very small standard 
deviations and would hence get earlier expected "on CPU" times than 
other tasks with similar averages but more random distribution.
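
As a rough illustration of the kind of "less rigorous" approximation meant
here (all of this is an assumption for the example, not part of the patch),
one square-root-free stand-in is to track a running mean absolute deviation
and use the average plus a few deviations as the maximum expected value:

#include <stdint.h>

struct oncpu_stats {
	int64_t avg_ns;		/* running average of on-CPU time per cycle  */
	int64_t adev_ns;	/* running average of the absolute deviation */
};

/* Fold a new on-CPU sample into both estimates (1/8 decay weight). */
static void oncpu_stats_update(struct oncpu_stats *s, int64_t sample_ns)
{
	int64_t diff = sample_ns - s->avg_ns;
	int64_t absdiff = diff < 0 ? -diff : diff;

	s->avg_ns += diff / 8;
	s->adev_ns += (absdiff - s->adev_ns) / 8;
}

/* "Maximum expected on-CPU time": average plus ~3 deviations, no sqrt
 * needed (for roughly normal samples, stddev ~ 1.25 * mean abs dev). */
static int64_t max_expected_oncpu_ns(const struct oncpu_stats *s)
{
	return s->avg_ns + 3 * s->adev_ns;
}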

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, v3
  2007-04-20  6:02     ` Peter Williams
@ 2007-04-20  6:21       ` Peter Williams
  2007-04-20  7:26       ` Willy Tarreau
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Williams @ 2007-04-20  6:21 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Gene Heskett

Peter Williams wrote:
> Willy Tarreau wrote:
>> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>>> Ingo Molnar wrote:
>>>> - bugfix: use constant offset factor for nice levels instead of   
>>>> sched_granularity_ns. Thus nice levels work even if someone sets   
>>>> sched_granularity_ns to 0. NOTE: nice support is still naive, i'll   
>>>> address the many nice level related suggestions in -v4.
>>> I have a suggestion I'd like to make that addresses both nice and 
>>> fairness at the same time.  As I understand it, the basic principle 
>>> behind this scheduler is to work out a time by which a task should 
>>> make it onto the CPU and then place it into an ordered list (based on 
>>> this value) of tasks waiting for the CPU.  I think that this is a 
>>> great idea and my suggestion is with regard to a method for working 
>>> out this time that takes into account both fairness and nice.
>>>
>>> First suppose we have the following metrics available in addition to 
>>> what's already provided.
>>>
>>> rq->avg_weight_load /* a running average of the weighted load on the 
>>> CPU */
>>> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on 
>>> the CPU each scheduling cycle */
>>>
>>> where a scheduling cycle for a task starts when it is placed on the 
>>> queue after waking or being preempted and ends when it is taken off 
>>> the CPU either voluntarily or after being preempted.  So 
>>> p->avg_cpu_per_cycle is just the average amount of time p spends on 
>>> the CPU each time it gets on to the CPU.  Sorry for the long 
>>> explanation here but I just wanted to make sure there was no chance 
>>> that "scheduling cycle" would be construed as some mechanism being 
>>> imposed on the scheduler.)
>>>
>>> We can then define:
>>>
>>> effective_weighted_load = max(rq->raw_weighted_load, 
>>> rq->avg_weighted_load)
>>>
>>> If p is just waking (i.e. it's not on the queue and its load_weight 
>>> is not included in rq->raw_weighted_load) and we need to queue it, we 
>>> say that the maximum time (in all fairness) that p should have to 
>>> wait to get onto the CPU is:
>>>
>>> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
>>> p->load_weight
>>>
>>> Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
>>> shift right per scheduling cycle of the task.  An additional cost is 
>>> that you need a shift right to get the nanosecond value from value 
>>> stored in the task struct. (i.e. the above code is simplified to give 
>>> the general idea).  The average would be number of cycles based 
>>> rather than time based and (happily) this simplifies the calculations.
>>>
>>> If the expected time to get onto the CPU (i.e. expected_wait plus the 
>>> current time) for p is earlier than the equivalent time for the 
>>> currently running task then preemption of that task would be justified.
>>
>> I 100% agree on this method because I came to nearly the same 
>> conclusion on
>> paper about 1 year ago. What I'd like to add is that the expected wake 
>> up time
>> is not the most precise criterion for fairness.
> 
> It's not the expected wake up time being computed.  It's the expected 
> time that the task is selected to run after being put on the queue 
> (either as a result of waking or being pre-empted).
> 
> I think that your comments below are mainly invalid because of this 
> misunderstanding.
> 
>> The expected completion
>> time is better. When you have one task t1 which is expected to run for T1
>> nanosecs and another task t2 which is expected to run for T2, what is
>> important for the user for fairness is when the task completes its 
>> work. If
>> t1 should wake up at time W1 and t2 at W2, then the list should be 
>> ordered
>> by comparing W1+T1 and W2+T2.
> 
> This is a point where misunderstanding results in a wrong conclusion. 
> However, the notion of expected completion time is useful.  I'd 
> calculate the expected completion time for the task as it goes on to the 
> CPU and if while it is running a new task is queued whose expected "on 
> CPU" time is before the current task's expected end time you can think 
> about preempting when the new task's "on CPU" time arrives -- how 
> hard/easy this would be to implement is moot.
> 
>>
>> What I like with this method is that it remains fair with nice tasks 
>> because in order to renice a task tN, you just have to change TN, and 
>> if it
>> has to run shorter, it can be executed before CPU hogs and stay there 
>> for a
>> very short time.
>>
>> Also, I found that if we want to respect interactivity, we must 
>> conserve a
>> credit for each task.
> 
> This is one point where the misunderstanding results in an invalid 
> conclusion.  There is no need for credits as the use of the average time 
> on CPU each cycle makes the necessary corrections that credits would be 
> used to address.
> 
>> It is a bounded amount of CPU time left to be used. When
>> the task t3 has the right to use T3 nsecs, and wakes up at W3, if it 
>> does not
>> spend T3 nsec on the CPU, but only N3<T3, then we have a credit C3 
>> computed
>> like this :
>>
>>   C3 = MIN(MAX_CREDIT, C3 + T3 - N3)
>>
>> And if a CPU hog uses more than its assigned time slice due to scheduler
>> resolution, then C3 can become negative (and bounded too) :
>>
>>   C3 = MAX(MIN_CREDIT, C3 + T3 - N3)
>>
>> Now how is the credit used ? Simple: the credit is a part of a 
>> timeslice, so
>> it's systematically added to the computed timeslice when ordering the 
>> tasks.
>> So we indeed order tasks t1 and t2 by W1+T1+C1 and W2+T2+C2.
> 
> I think this is overcomplicating matters.
> 
> However, some compensating credit might be in order for tasks that get 
> pre-empted e.g. base their next expected "on CPU" time on the difference 
> between their average on CPU time and the amount they got to use before 
> they were pre-empted.

Oops.  I made a mistake here: it should be just the amount they got this 
time if that's less than their average, and otherwise the average.  That 
way if they're booted off early they get back sooner and if they get 
booted off late in the game they wait their normal amount.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, v3
  2007-04-20  0:10 ` Peter Williams
  2007-04-20  4:48   ` Willy Tarreau
@ 2007-04-20  6:46   ` Ingo Molnar
  2007-04-20  7:32     ` Peter Williams
  2007-04-20 13:15   ` William Lee Irwin III
  2007-04-20 14:21   ` Peter Williams
  3 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2007-04-20  6:46 UTC (permalink / raw)
  To: Peter Williams
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> > - bugfix: use constant offset factor for nice levels instead of
> >   sched_granularity_ns. Thus nice levels work even if someone sets 
> >   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
> >   address the many nice level related suggestions in -v4.
> 
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand it, the basic principle behind 
> this scheduler is to work out a time by which a task should make it 
> onto the CPU and then place it into an ordered list (based on this 
> value) of tasks waiting for the CPU. I think that this is a great idea 
> [...]

yes, that's exactly the main idea behind CFS, and thanks for the 
compliment :)

Under this concept the scheduler never really has to guess: every 
scheduler decision derives straight from the relatively simple 
one-sentence (!) scheduling concept outlined above. Everything that 
tasks 'get' is something they 'earned' before and all the scheduler does 
are micro-decisions based on math with the nanosec-granularity values. 
Both the rbtree and nanosec accounting are a straight consequence of 
this too: they are the tools that allow the implementation of this 
concept in the highest-quality way. It's certainly a very exciting 
experiment to me and the feedback 'from the field' is very promising so 
far.

> [...] and my suggestion is with regard to a method for working out 
> this time that takes into account both fairness and nice.
> 
> First suppose we have the following metrics available in addition to 
> what's already provided.
> 
> rq->avg_weight_load /* a running average of the weighted load on the 
> CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends 
> on the CPU each scheduling cycle */

yes. rq->nr_running is really just a first-level approximation of 
rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.

> I appreciate that the notion of basing the expected wait on the task's 
> average cpu use per scheduling cycle is counter intuitive but I 
> believe that (if you think about it) you'll see that it actually makes 
> sense.

hm. So far i tried to not do any statistical approach anywhere: the 
p->wait_runtime metric (which drives the task ordering) is in essence an 
absolutely precise 'integral' of the 'expected runtimes' that the task 
observes and hence is a precise "load-average as observed by the task" 
in itself. Every time we base some metric on an average value we 
introduce noise into the system.
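
As a purely illustrative toy model of such an 'integral' (this is not the
v3 code, and the real accounting is more involved): every runnable task
accrues its fair share of elapsed time, the running task also pays for the
time it actually used, and the task that is owed the most runs next.

#include <stdint.h>

struct ftask {
	int64_t wait_runtime_ns;	/* CPU time the task is "owed" */
	int on_cpu;
};

/* Account delta_ns of elapsed time across nr_running runnable tasks. */
static void account(struct ftask *tasks, int nr_running, int64_t delta_ns)
{
	int64_t fair_share = delta_ns / nr_running;

	for (int i = 0; i < nr_running; i++) {
		tasks[i].wait_runtime_ns += fair_share;		/* everyone earns  */
		if (tasks[i].on_cpu)
			tasks[i].wait_runtime_ns -= delta_ns;	/* the runner pays */
	}
}

/* Pick the task that is owed the most CPU time. */
static struct ftask *pick_next(struct ftask *tasks, int nr_running)
{
	struct ftask *best = &tasks[0];

	for (int i = 1; i < nr_running; i++)
		if (tasks[i].wait_runtime_ns > best->wait_runtime_ns)
			best = &tasks[i];
	return best;
}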

i definitely agree with your suggestion that CFS should use a 
nice-scaled metric for 'load' instead of the current rq->nr_running, but 
regarding the basic calculations i'd rather lean towards using 
rq->raw_weighted_load. Hm?

your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy period 
while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?

	Ingo


* Re: [patch] CFS scheduler, v3
  2007-04-20  6:02     ` Peter Williams
  2007-04-20  6:21       ` Peter Williams
@ 2007-04-20  7:26       ` Willy Tarreau
  1 sibling, 0 replies; 39+ messages in thread
From: Willy Tarreau @ 2007-04-20  7:26 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Gene Heskett

On Fri, Apr 20, 2007 at 04:02:41PM +1000, Peter Williams wrote:
> Willy Tarreau wrote:
> >On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> >>Ingo Molnar wrote:
> >>>- bugfix: use constant offset factor for nice levels instead of 
> >>>  sched_granularity_ns. Thus nice levels work even if someone sets 
> >>>  sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
> >>>  address the many nice level related suggestions in -v4.
> >>I have a suggestion I'd like to make that addresses both nice and 
> >>fairness at the same time.  As I understand it, the basic principle behind 
> >>this scheduler is to work out a time by which a task should make it onto 
> >>the CPU and then place it into an ordered list (based on this value) of 
> >>tasks waiting for the CPU.  I think that this is a great idea and my 
> >>suggestion is with regard to a method for working out this time that 
> >>takes into account both fairness and nice.
> >>
> >>First suppose we have the following metrics available in addition to 
> >>what's already provided.
> >>
> >>rq->avg_weight_load /* a running average of the weighted load on the CPU 
> >>*/
> >>p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> >>CPU each scheduling cycle */
> >>
> >>where a scheduling cycle for a task starts when it is placed on the 
> >>queue after waking or being preempted and ends when it is taken off the 
> >>CPU either voluntarily or after being preempted.  So 
> >>p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> >>CPU each time it gets on to the CPU.  Sorry for the long explanation 
> >>here but I just wanted to make sure there was no chance that "scheduling 
> >>cycle" would be construed as some mechanism being imposed on the 
> >>scheduler.)
> >>
> >>We can then define:
> >>
> >>effective_weighted_load = max(rq->raw_weighted_load, 
> >>rq->avg_weighted_load)
> >>
> >>If p is just waking (i.e. it's not on the queue and its load_weight is 
> >>not included in rq->raw_weighted_load) and we need to queue it, we say 
> >>that the maximum time (in all fairness) that p should have to wait to 
> >>get onto the CPU is:
> >>
> >>expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> >>p->load_weight
> >>
> >>Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
> >>shift right per scheduling cycle of the task.  An additional cost is 
> >>that you need a shift right to get the nanosecond value from value 
> >>stored in the task struct. (i.e. the above code is simplified to give 
> >>the general idea).  The average would be number of cycles based rather 
> >>than time based and (happily) this simplifies the calculations.
> >>
> >>If the expected time to get onto the CPU (i.e. expected_wait plus the 
> >>current time) for p is earlier than the equivalent time for the 
> >>currently running task then preemption of that task would be justified.
> >
> >I 100% agree on this method because I came to nearly the same conclusion on
> >paper about 1 year ago. What I'd like to add is that the expected wake up 
> >time
> >is not the most precise criterion for fairness.
> 
> It's not the expected wake up time being computed.  It's the expected 
> time that the task is selected to run after being put on the queue 
> (either as a result of waking or being pre-empted).

Sorry, maybe I've not chosen the appropriate words. When I said "wakeup time",
I meant "the date at which the task must reach the CPU". They are the same
when the task is alone, but different when other tasks are run before.

But my goal definitely was to avoid using this alone and use the expected
completion time instead, from which the "on CPU" time can be computed.

The on-CPU time should be reduced if the task ran too long on previous
iterations, and enlarged if the task did not consume all of its last
timeslice, hence the idea behind the credit. I think that it is important
for streaming processes (eg: mp3 players) which will not always consume
the same amount of CPU but are very close to an average across a
few timeslices.

[...]

> In practice, it might be prudent to use the idea of maximum expected "on 
> CPU" time instead of the average (especially if you're going to use this 
> for triggering pre-emption).

yes

> This would be the average plus two or 
> three times the standard deviation and has the downside that you have to 
> calculate the standard deviation and that means a square root (of course 
> we could use some less mathematically rigorous metric to approximate the 
> standard deviation).

You can simply skip the square root and compare the squared values
(i.e. the variances) instead.

> The advantage would be that streamer type programs 
> such as audio/video applications would have very small standard 
> deviations and would hence get earlier expected "on CPU" times than 
> other tasks with similar averages but more random distribution.

I like this idea.

Regards,
Willy



* Re: [patch] CFS scheduler, v3
  2007-04-20  6:46   ` Ingo Molnar
@ 2007-04-20  7:32     ` Peter Williams
  2007-04-20 12:28       ` Peter Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-20  7:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>>> - bugfix: use constant offset factor for nice levels instead of
>>>   sched_granularity_ns. Thus nice levels work even if someone sets 
>>>   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
>>>   address the many nice level related suggestions in -v4.
>> I have a suggestion I'd like to make that addresses both nice and 
>> fairness at the same time.  As I understand it, the basic principle behind 
>> this scheduler is to work out a time by which a task should make it 
>> onto the CPU and then place it into an ordered list (based on this 
>> value) of tasks waiting for the CPU. I think that this is a great idea 
>> [...]
> 
> yes, that's exactly the main idea behind CFS, and thanks for the 
> compliment :)
> 
> Under this concept the scheduler never really has to guess: every 
> scheduler decision derives straight from the relatively simple 
> one-sentence (!) scheduling concept outlined above. Everything that 
> tasks 'get' is something they 'earned' before and all the scheduler does 
> are micro-decisions based on math with the nanosec-granularity values. 
> Both the rbtree and nanosec accounting are a straight consequence of 
> this too: they are the tools that allow the implementation of this 
> concept in the highest-quality way. It's certainly a very exciting 
> experiment to me and the feedback 'from the field' is very promising so 
> far.
> 
>> [...] and my suggestion is with regard to a method for working out 
>> this time that takes into account both fairness and nice.
>>
>> First suppose we have the following metrics available in addition to 
>> what's already provided.
>>
>> rq->avg_weight_load /* a running average of the weighted load on the 
>> CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends 
>> on the CPU each scheduling cycle */
> 
> yes. rq->nr_running is really just a first-level approximation of 
> rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.
> 
>> I appreciate that the notion of basing the expected wait on the task's 
>> average cpu use per scheduling cycle is counter intuitive but I 
>> believe that (if you think about it) you'll see that it actually makes 
>> sense.
> 
> hm. So far i tried to not do any statistical approach anywhere: the 
> p->wait_runtime metric (which drives the task ordering) is in essence an 
> absolutely precise 'integral' of the 'expected runtimes' that the task 
> observes and hence is a precise "load-average as observed by the task"

To me this is statistics :-)

> in itself. Every time we base some metric on an average value we 
> introduce noise into the system.
> 
> i definitely agree with your suggestion that CFS should use a 
> nice-scaled metric for 'load' instead of the current rq->nr_running, but 
> regarding the basic calculations i'd rather lean towards using 
> rq->raw_weighted_load. Hm?

This can result in jerkiness (in my experience) but using the smoothed 
version is certainly something that can be tried later rather than 
sooner.  Perhaps just something to bear in mind as a solution to 
"jerkiness" if it manifests.

> 
> your suggestion concentrates on the following scenario: if a task 
> happens to schedule in an 'unlucky' way and happens to hit a busy period 
> while there are many idle periods. Unless i misunderstood your 
> suggestion, that is the main intention behind it, correct?

You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that it's 
entitled to.  In reality, sometimes it will get more and sometimes less 
but on average it should get what it deserves. E.g. If you had two tasks 
with equal nice and both had demands of 90% of a CPU you'd expect them 
each to get about half of the CPU bandwidth.  Now suppose that one of 
them uses 5ms of CPU each time it gets onto the CPU and the other uses 
10ms.  If these two tasks just round robin with each other the likely 
outcome is that the one with the 10ms bursts will get twice as much CPU 
as the other, but my proposed method should prevent this and cause them to get 
roughly the same amount of CPU.  (I believe this was a scenario that 
caused problems with O(1) and required a fix at some stage?)

BTW this has the advantage that the decay rate used in calculating the 
task's statistics can be used to control how quickly the scheduler 
reacts to changes in the task's behaviour.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, v3
  2007-04-20  7:32     ` Peter Williams
@ 2007-04-20 12:28       ` Peter Williams
  2007-04-21  8:07         ` Peter Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-20 12:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett

Peter Williams wrote:
> Ingo Molnar wrote:
>> * Peter Williams <pwil3058@bigpond.net.au> wrote:
>>
>>>> - bugfix: use constant offset factor for nice levels instead of
>>>>   sched_granularity_ns. Thus nice levels work even if someone sets   
>>>> sched_granularity_ns to 0. NOTE: nice support is still naive, i'll   
>>>> address the many nice level related suggestions in -v4.
>>> I have a suggestion I'd like to make that addresses both nice and 
>>> fairness at the same time.  As I understand it, the basic principle 
>>> behind this scheduler is to work out a time by which a task should 
>>> make it onto the CPU and then place it into an ordered list (based on 
>>> this value) of tasks waiting for the CPU. I think that this is a 
>>> great idea [...]
>>
>> yes, that's exactly the main idea behind CFS, and thanks for the 
>> compliment :)
>>
>> Under this concept the scheduler never really has to guess: every 
>> scheduler decision derives straight from the relatively simple 
>> one-sentence (!) scheduling concept outlined above. Everything that 
>> tasks 'get' is something they 'earned' before and all the scheduler 
>> does are micro-decisions based on math with the nanosec-granularity 
>> values. Both the rbtree and nanosec accounting are a straight 
>> consequence of this too: they are the tools that allow the 
>> implementation of this concept in the highest-quality way. It's 
>> certainly a very exciting experiment to me and the feedback 'from the 
>> field' is very promising so far.
>>
>>> [...] and my suggestion is with regard to a method for working out 
>>> this time that takes into account both fairness and nice.
>>>
>>> First suppose we have the following metrics available in addition to 
>>> what's already provided.
>>>
>>> rq->avg_weight_load /* a running average of the weighted load on the 
>>> CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p 
>>> spends on the CPU each scheduling cycle */
>>
>> yes. rq->nr_running is really just a first-level approximation of 
>> rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.
>>
>>> I appreciate that the notion of basing the expected wait on the 
>>> task's average cpu use per scheduling cycle is counter intuitive but 
>>> I believe that (if you think about it) you'll see that it actually 
>>> makes sense.
>>
>> hm. So far i tried to not do any statistical approach anywhere: the 
>> p->wait_runtime metric (which drives the task ordering) is in essence 
>> an absolutely precise 'integral' of the 'expected runtimes' that the 
>> task observes and hence is a precise "load-average as observed by the 
>> task"
> 
> To me this is statistics :-)
> 
>> in itself. Every time we base some metric on an average value we 
>> introduce noise into the system.
>>
>> i definitely agree with your suggestion that CFS should use a 
>> nice-scaled metric for 'load' instead of the current rq->nr_running, 
>> but regarding the basic calculations i'd rather lean towards using 
>> rq->raw_weighted_load. Hm?
> 
> This can result in jerkiness (in my experience) but using the smoothed 
> version is certainly something that can be tried later rather than 
> sooner.  Perhaps just something to bear in mind as a solution to 
> "jerkiness" if it manifests.
> 
>>
>> your suggestion concentrates on the following scenario: if a task 
>> happens to schedule in an 'unlucky' way and happens to hit a busy 
>> period while there are many idle periods. Unless i misunderstood your 
>> suggestion, that is the main intention behind it, correct?
> 
> You misunderstand (that's one of my other schedulers :-)).  This one's 
> based on the premise that if everything happens as the task expects it 
> will get the amount of CPU bandwidth (over this short period) that it's 
> entitled to.  In reality, sometimes it will get more and sometimes less 
> but on average it should get what it deserves. E.g. If you had two tasks 
> with equal nice and both had demands of 90% of a CPU you'd expect them 
> each to get about half of the CPU bandwidth.  Now suppose that one of 
> them uses 5ms of CPU each time it gets onto the CPU and the other uses 
> 10ms.  If these two tasks just round robin with each other the likely 
> outcome is that the one with the 10ms bursts will get twice as much CPU 
> as the other, but my proposed method should prevent this and cause them to get 
> roughly the same amount of CPU.  (I believe this was a scenario that 
> caused problems with O(1) and required a fix at some stage?)
> 
> BTW this has the advantage that the decay rate used in calculating the 
> task's statistics can be used to control how quickly the scheduler 
> reacts to changes in the task's behaviour.

I think that, with this model, if the current task hasn't surrendered 
the CPU when the next task on the queue's "on CPU" time arrives, the 
current task should be pre-empted in favour of that task.  I'm not sure 
what would be the best way to implement this.
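
For what it's worth, one minimal way the check could be expressed
(hypothetical names, a sketch of the idea rather than a concrete
proposal): at each scheduler tick, compare the current time against the
expected "on CPU" time of the first waiting task and ask for a reschedule
once it has passed.

#include <stdbool.h>
#include <stdint.h>

struct queued_task {
	int64_t expected_oncpu_ns;	/* when, in fairness, it should run */
};

/* Called from the periodic tick: should the running task give way to the
 * earliest waiting task?  first_waiting may be NULL if nothing waits. */
static bool should_preempt_current(const struct queued_task *first_waiting,
				   int64_t now_ns)
{
	return first_waiting && now_ns >= first_waiting->expected_oncpu_ns;
}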

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, v3
  2007-04-20  0:10 ` Peter Williams
  2007-04-20  4:48   ` Willy Tarreau
  2007-04-20  6:46   ` Ingo Molnar
@ 2007-04-20 13:15   ` William Lee Irwin III
  2007-04-21  0:23     ` Peter Williams
  2007-04-20 14:21   ` Peter Williams
  3 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-20 13:15 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand it, the basic principle behind 
> this scheduler is to work out a time by which a task should make it onto 
> the CPU and then place it into an ordered list (based on this value) of 
> tasks waiting for the CPU.  I think that this is a great idea and my 
> suggestion is with regard to a method for working out this time that 
> takes into account both fairness and nice.

Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> First suppose we have the following metrics available in addition to 
> what's already provided.
> rq->avg_weight_load /* a running average of the weighted load on the CPU */
> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> CPU each scheduling cycle */

I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> where a scheduling cycle for a task starts when it is placed on the 
> queue after waking or being preempted and ends when it is taken off the 
> CPU either voluntarily or after being preempted.  So 
> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> CPU each time it gets on to the CPU.  Sorry for the long explanation 
> here but I just wanted to make sure there was no chance that "scheduling 
> cycle" would be construed as some mechanism being imposed on the scheduler.)

I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> We can then define:
> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
> If p is just waking (i.e. it's not on the queue and its load_weight is 
> not included in rq->raw_weighted_load) and we need to queue it, we say 
> that the maximum time (in all fairness) that p should have to wait to 
> get onto the CPU is:
> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> p->load_weight

This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.
	p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue (whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
	p->avg_cpu_per_cycle*(effective_weighted_load/p->load_weight - 1).

Neither represents the initial delay between entering the runqueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.

If you want to choose a "quasi-inter-arrival time" to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.

In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:
	last_sleep = now - p->last_went_to_sleep;
	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
	wait = wait > last_sleep ? wait - last_sleep : 0;

By and large the scheduler can deterministically choose waits on the
runqueue but it doesn't actually do that. Some additional corrections
for delays beyond those decided up-front while on the runqueue or
getting scheduled early on the runqueue may also help, though I'm not
as interested in them as I am the one for sleep:
	last_wait = now - p->last_taken_off_the_cpu;
	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
	if (wait > last_wait) /* didn't wait long enough? */
		wait += wait - last_wait; /* wait the difference longer */
	else if (wait < last_wait) /* waited too long? */
		/* wait less to make up the difference */
		wait -= wait > last_wait - wait ? last_wait - wait : wait;
Where ->last_taken_off_the_cpu represents the time the task was last
taken off the CPU for whatever reason (this may need to be done
differently to handle time variables).

In order to do better, longer-term history is required, but it can't be
a truly infinite history. To attempt to maintain an infinite history of
bandwidth underutilization to be credited too far in the future would
enable potentially long-term overutilization when such credits are
cashed en masse for a sustained period of time. At some point you have
to say "use it or lose it;" over a shorter period of time some smoothing
is still admissible and even desirable.

One might also consider debits owing to non-preemptibility or other
sorts of cpu bandwidth overutilization with similar caveats. A half-
life to an otherwise infinite history of credits and/or debits
specified in absolute time may be appropriate, though it's not quite as
computationally cheap as the above. The accounting for these credits
and debits would take the place of ->last_taken_off_the_cpu above.
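
A balance decayed that way can be kept cheaply by halving whatever
remains once per half-life; a rough sketch (the 100ms constant is pure
invention):

#include <stdint.h>

/* Illustrative half-life, in nanoseconds; not from any patch. */
#define CREDIT_HALF_LIFE_NS     (100ULL * 1000 * 1000)

/*
 * Decay an accumulated credit (positive) or debit (negative) balance by
 * the number of whole half-lives that have elapsed, so old history fades
 * instead of being banked forever.
 */
static int64_t decay_balance(int64_t balance, uint64_t age_ns)
{
        uint64_t halvings = age_ns / CREDIT_HALF_LIFE_NS;

        if (halvings >= 62)
                return 0;
        /* signed division rounds toward zero for both credits and debits */
        return balance / (int64_t)(1ULL << halvings);
}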

Another attack on the problem of CPU bandwidth misallocation would be
to further credit or debit the task according to the ratio of actual
CPU bandwidth to allocated CPU bandwidth in order to compensate
for prior failures to enforce the allocation. Unfortunately, the result
of scaling the artificial inter-arrival time in the obvious way is
degenerate (independent of priority) so it can't be done that simply.
Perhaps it could be used solely to adjust the credit and debit history
contribution to the quasi-inter-arrival time or the difference used
somehow, but I don't care to make more complex proposals than the
first-order correction above (if it's even worthy of being called a
proposal or a correction) for either this or a time-decaying credit
and debit history or even debit accounting at all.

I'd be happy enough to see the correction term to subtract out sleeping
time (i.e. the code snippet with last_sleep). The rest and/or stronger
attempts to factor in sleeping time or bandwidth misallocation I don't
consider so significant, at least not without some demonstration there
is CPU bandwidth misallocation happening.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> I appreciate that the notion of basing the expected wait on the task's 
> average cpu use per scheduling cycle is counter intuitive but I believe 
> that (if you think about it) you'll see that it actually makes sense.

I don't know. It sounds rather standard to me.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20  0:10 ` Peter Williams
                     ` (2 preceding siblings ...)
  2007-04-20 13:15   ` William Lee Irwin III
@ 2007-04-20 14:21   ` Peter Williams
  2007-04-20 14:33     ` Ingo Molnar
  3 siblings, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-20 14:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett

Peter Williams wrote:
> Ingo Molnar wrote:
>>
>>  - bugfix: use constant offset factor for nice levels instead of    
>> sched_granularity_ns. Thus nice levels work even if someone sets    
>> sched_granularity_ns to 0. NOTE: nice support is still naive, i'll    
>> address the many nice level related suggestions in -v4.
> 
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand the basic principle behind 
> this scheduler it to work out a time by which a task should make it onto 
> the CPU and then place it into an ordered list (based on this value) of 
> tasks waiting for the CPU.  I think that this is a great idea and my 
> suggestion is with regard to a method for working out this time that 
> takes into account both fairness and nice.
> 
> First suppose we have the following metrics available in addition to 
> what's already provided.
> 
> rq->avg_weight_load /* a running average of the weighted load on the CPU */
> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> CPU each scheduling cycle */
> 
> where a scheduling cycle for a task starts when it is placed on the 
> queue after waking or being preempted and ends when it is taken off the 
> CPU either voluntarily or after being preempted.  So 
> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> CPU each time it gets on to the CPU.  Sorry for the long explanation 
> here but I just wanted to make sure there was no chance that "scheduling 
> cycle" would be construed as some mechanism being imposed on the 
> scheduler.)
> 
> We can then define:
> 
> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
> 
> If p is just waking (i.e. it's not on the queue and its load_weight is 
> not included in rq->raw_weighted_load) and we need to queue it, we say 
> that the maximum time (in all fairness) that p should have to wait to 
> get onto the CPU is:
> 
> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> p->load_weight

I just realized that this is wrong for the case where p is being woken. 
  In that case the length of the last sleep should be subtracted from 
the above value and if the result is negative pre-empt straight away. 
So expected_wait becomes:

expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight - p->length_of_last_sleep

As the length of the last sleep will be calculated during the wake 
process anyway, there's no real need for it to be a task field and a local 
variable could be used instead.

For a task being requeued during pre-emption the equation would be:

expected_wait = time_just_spent_on_the_cpu * effective_weighted_load / 
p->load_weight

as there was zero sleep since last time on CPU.  Using the actual time 
on the CPU so far this time (instead of the average) will compensate the 
task for being pre-empted.
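
Put together, the two cases work out to roughly the following sketch
(illustrative names, nanosecond integer arithmetic, no overflow
handling):

#include <stdint.h>

struct task_sketch {
        uint64_t avg_cpu_per_cycle;     /* average on-CPU time per cycle */
        uint64_t load_weight;
};

/* Waking task: charge its average burst, minus the time it just slept. */
static uint64_t expected_wait_on_wakeup(const struct task_sketch *p,
                                        uint64_t effective_weighted_load,
                                        uint64_t length_of_last_sleep)
{
        uint64_t wait = p->avg_cpu_per_cycle * effective_weighted_load
                        / p->load_weight;

        /* slept at least its fair wait: pre-empt straight away */
        if (wait <= length_of_last_sleep)
                return 0;
        return wait - length_of_last_sleep;
}

/* Requeue after pre-emption: charge the CPU time it actually just used. */
static uint64_t expected_wait_on_preempt(const struct task_sketch *p,
                                         uint64_t effective_weighted_load,
                                         uint64_t time_just_spent_on_cpu)
{
        return time_just_spent_on_cpu * effective_weighted_load
                        / p->load_weight;
}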

> 
> Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
> shift right per scheduling cycle of the task.  An additional cost is 
> that you need a shift right to get the nanosecond value from value 
> stored in the task struct. (i.e. the above code is simplified to give 
> the general idea).  The average would be number of cycles based rather 
> than time based and (happily) this simplifies the calculations.

If you don't like using the average CPU time per cycle, you could just 
use the length of time on the CPU for the last time the task was on the 
CPU.  This would simplify things a lot.
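
For reference, the cheap cycle-based average costed above could be kept
in pre-shifted form, roughly like this (the shift value is a made-up
tunable):

#include <stdint.h>

#define AVG_SHIFT       3       /* hypothetical weighting of history */

struct avg_sketch {
        uint64_t avg_scaled;    /* average on-CPU time << AVG_SHIFT */
};

/* One update per scheduling cycle: a subtract, a shift and an add. */
static void avg_update(struct avg_sketch *a, uint64_t on_cpu_ns)
{
        a->avg_scaled -= a->avg_scaled >> AVG_SHIFT;
        a->avg_scaled += on_cpu_ns;
}

/* Reading the nanosecond value costs the extra shift right mentioned above. */
static uint64_t avg_read_ns(const struct avg_sketch *a)
{
        return a->avg_scaled >> AVG_SHIFT;
}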

I've been thinking about the "jerkiness" (or lack of "smoothness") that 
I said would occur if smoothed averages weren't used and I've realised 
that what I was talking about was observed jerkiness in the dynamic 
priorities of the tasks.  As this scheduler has dispensed with dynamic 
priorities (as far as I can see) the jerkiness probably won't be apparent.

BTW Given that I'm right and dynamic priorities have been dispensed with 
what do you intend exporting (in their place) to user space for display 
by top and similar?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 14:21   ` Peter Williams
@ 2007-04-20 14:33     ` Ingo Molnar
  0 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2007-04-20 14:33 UTC (permalink / raw)
  To: Peter Williams
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> BTW Given that I'm right and dynamic priorities have been dispensed 
> with what do you intend exporting (in their place) to user space for 
> display by top and similar?

well i thought of only displaying static ones, i.e. like the current 
patch does.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-18 21:26 ` William Lee Irwin III
  2007-04-18 21:33   ` Ingo Molnar
@ 2007-04-20 19:24   ` Christoph Lameter
  2007-04-20 19:26     ` Siddha, Suresh B
  2007-04-20 19:29     ` William Lee Irwin III
  1 sibling, 2 replies; 39+ messages in thread
From: Christoph Lameter @ 2007-04-20 19:24 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

On Wed, 18 Apr 2007, William Lee Irwin III wrote:

> 
> Mark the runqueues ____cacheline_aligned_in_smp to avoid false sharing.

False sharing for a per cpu data structure? Are we updating that 
structure from other processors?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 19:24   ` Christoph Lameter
@ 2007-04-20 19:26     ` Siddha, Suresh B
  2007-04-20 19:29     ` William Lee Irwin III
  1 sibling, 0 replies; 39+ messages in thread
From: Siddha, Suresh B @ 2007-04-20 19:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett

On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> On Wed, 18 Apr 2007, William Lee Irwin III wrote:
> 
> > 
> > Mark the runqueues ____cacheline_aligned_in_smp to avoid false sharing.
> 
> False sharing for a per cpu data structure? Are we updating that 
> structure from other processors?

yes. during a remote process wakeup.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 19:24   ` Christoph Lameter
  2007-04-20 19:26     ` Siddha, Suresh B
@ 2007-04-20 19:29     ` William Lee Irwin III
  2007-04-20 19:33       ` Christoph Lameter
  1 sibling, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-20 19:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

On Wed, 18 Apr 2007, William Lee Irwin III wrote:
>> 
>> Mark the runqueues ____cacheline_aligned_in_smp to avoid false sharing.

On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> False sharing for a per cpu data structure? Are we updating that 
> structure from other processors?

Primarily in the load balancer, but also in wakeups.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 19:29     ` William Lee Irwin III
@ 2007-04-20 19:33       ` Christoph Lameter
  2007-04-20 19:38         ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2007-04-20 19:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

On Fri, 20 Apr 2007, William Lee Irwin III wrote:

> On Wed, 18 Apr 2007, William Lee Irwin III wrote:
> >> 
> >> Mark the runqueues ____cacheline_aligned_in_smp to avoid false sharing.
> 
> On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> > False sharing for a per cpu data structure? Are we updating that 
> > structure from other processors?
> 
> Primarily in the load balancer, but also in wakeups.

That is fairly rare I think. What other variables that are also written 
frequently would cause false sharing?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 19:33       ` Christoph Lameter
@ 2007-04-20 19:38         ` William Lee Irwin III
  2007-04-20 19:44           ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-20 19:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
>>> False sharing for a per cpu data structure? Are we updating that 
>>> structure from other processors?

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> Primarily in the load balancer, but also in wakeups.

On Fri, Apr 20, 2007 at 12:33:13PM -0700, Christoph Lameter wrote:
> That is fairly rare I think. What other variables that are also writtten 
> frequently would cause false sharing?

I wrote that backward, sorry. Cross-CPU wakeups' frequency depends
heavily on the workload. Probably the only other case I can think of
is io_schedule() but that's not really significant.

I'm not really convinced it's all that worthwhile of an optimization,
essentially for the same reasons as you, but presumably there's a
benchmark result somewhere that says it matters. I've just not seen it.
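
For reference, the kind of alignment being discussed looks like this in
generic terms (the 64-byte line size is an assumption; the real
structure is of course the per-CPU runqueue):

/*
 * Toy per-CPU structure; without the attribute, neighbouring elements
 * can share a cache line, so a remote wakeup writing one CPU's element
 * also bounces the line another CPU is reading.
 */
#define TOY_CACHE_LINE  64

struct toy_rq {
        unsigned long nr_running;
        unsigned long raw_weighted_load;
} __attribute__((__aligned__(TOY_CACHE_LINE)));

static struct toy_rq toy_runqueues[4];  /* each starts on its own line */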


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 19:38         ` William Lee Irwin III
@ 2007-04-20 19:44           ` Christoph Lameter
  2007-04-20 20:03             ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2007-04-20 19:44 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

On Fri, 20 Apr 2007, William Lee Irwin III wrote:

> I'm not really convinced it's all that worthwhile of an optimization,
> essentially for the same reasons as you, but presumably there's a
> benchmark result somewhere that says it matters. I've just not seen it.

If it is true that we frequently remotely write the per cpu runqueue 
data then we may have a NUMA scalability issue.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 19:44           ` Christoph Lameter
@ 2007-04-20 20:03             ` William Lee Irwin III
  2007-04-20 20:11               ` Siddha, Suresh B
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-20 20:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> I'm not really convinced it's all that worthwhile of an optimization,
>> essentially for the same reasons as you, but presumably there's a
>> benchmark result somewhere that says it matters. I've just not seen it.

On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote:
> If it is true that we frequently remotely write the per cpu runqueue 
> data then we may have a NUMA scalability issue.

From the discussion on Suresh's thread, it appears to have sped up a
database benchmark 0.5%.

Last I checked it was workload-dependent, but there were things that
hammer it. I mostly know of the remote wakeup issue, but there could
be other things besides wakeups that do it, too.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 20:03             ` William Lee Irwin III
@ 2007-04-20 20:11               ` Siddha, Suresh B
  2007-04-24 17:39                 ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Siddha, Suresh B @ 2007-04-20 20:11 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christoph Lameter, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett

On Fri, Apr 20, 2007 at 01:03:22PM -0700, William Lee Irwin III wrote:
> On Fri, 20 Apr 2007, William Lee Irwin III wrote:
> >> I'm not really convinced it's all that worthwhile of an optimization,
> >> essentially for the same reasons as you, but presumably there's a
> >> benchmark result somewhere that says it matters. I've just not seen it.
> 
> On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote:
> > If it is true that we frequently remotely write the per cpu runqueue 
> > data then we may have a NUMA scalability issue.
> 
> From the discussion on Suresh's thread, it appears to have sped up a
> database benchmark 0.5%.
> 
> Last I checked it was workload-dependent, but there were things that
> hammer it. I mostly know of the remote wakeup issue, but there could
> be other things besides wakeups that do it, too.

remote wakeup was the main issue and the 0.5% improvement was seen
on a two node platform. Aligning it reduces the number of remote
cachelines that need to be touched as part of this wakeup.

thanks,
suresh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 13:15   ` William Lee Irwin III
@ 2007-04-21  0:23     ` Peter Williams
  2007-04-21  5:07       ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-21  0:23 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett

William Lee Irwin III wrote:
> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>> I have a suggestion I'd like to make that addresses both nice and 
>> fairness at the same time.  As I understand the basic principle behind 
>> this scheduler it to work out a time by which a task should make it onto 
>> the CPU and then place it into an ordered list (based on this value) of 
>> tasks waiting for the CPU.  I think that this is a great idea and my 
>> suggestion is with regard to a method for working out this time that 
>> takes into account both fairness and nice.
> 
> Hmm. Let me take a look...
> 
> 
> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>> First suppose we have the following metrics available in addition to 
>> what's already provided.
>> rq->avg_weight_load /* a running average of the weighted load on the CPU */
>> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
>> CPU each scheduling cycle */
> 
> I'm suspicious of mean service times not paired with mean inter-arrival
> times.
> 
> 
> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>> where a scheduling cycle for a task starts when it is placed on the 
>> queue after waking or being preempted and ends when it is taken off the 
>> CPU either voluntarily or after being preempted.  So 
>> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
>> CPU each time it gets on to the CPU.  Sorry for the long explanation 
>> here but I just wanted to make sure there was no chance that "scheduling 
>> cycle" would be construed as some mechanism being imposed on the scheduler.)
> 
> I went and looked up priority queueing queueing theory garbage and
> re-derived various things I needed. The basics check out. Probably no
> one cares that I checked.
> 
> 
> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>> We can then define:
>> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
>> If p is just waking (i.e. it's not on the queue and its load_weight is 
>> not included in rq->raw_weighted_load) and we need to queue it, we say 
>> that the maximum time (in all fairness) that p should have to wait to 
>> get onto the CPU is:
>> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
>> p->load_weight

You're right.  The time that the task spent sleeping before being woken 
should be subtracted from this value.  If the answer is less than or 
equal to zero pre-emption should occur.

> 
> This doesn't look right, probably because the scaling factor of
> p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
> the ->avg_weight_load as opposed to a direct estimate of its initial
> delay or waiting time before completing its current requested service.
> 
> p->load_weight/effective_weighted_load more properly represents an
> entitlement to CPU bandwidth.

Yes.  But expected_wait isn't entitlement; it's its inverse.

> 	p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
> would be more like the expected time spent on the runqueue

When I went to school that would be just another way of expressing the 
equation that I expressed.

> (whether
> waiting to run or actually running) for a time interval spent in a
> runnable state and the expected time runnable and waiting to run in such
> an interval would be
> 	p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),
> 
> Neither represents the initial delay between entering the runqeueue and
> first acquiring the CPU, but that's a bit hard to figure out without
> deciding the scheduling policy up-front anyway.
> 
> This essentially doesn't look correct because while you want to enforce
> the CPU bandwidth allocation, this doesn't have much to do with that
> apart from the CPU bandwidth appearing as a term. It's more properly
> a rate of service as opposed to a time at which anything should happen
> or a number useful for predicting such. When service should begin more
> properly depends on the other tasks in the system and a number of other
> decisions that are part of the scheduling policy.

This model takes all of those into consideration.  The idea is not just 
to predict but to use the calculated time to decide when to boot the 
current process off the CPU (if it doesn't leave voluntarily) and put 
this one on.  This more or less removes the need to give each task a 
predetermined chunk of CPU when they go on to the CPU.  This should, in 
general, reduce the number of context switches as tasks get to run until 
they've finished what they're doing or another task becomes higher 
priority rather than being booted off after an arbitrary time interval. 
  (If this ever gets tried it will be interesting to see if this 
prediction comes true.)

BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
patch (way in the future after Ingo's changes are settled) to try it out 
myself.

> 
> If you want to choose a "quasi-inter-arrival time" to achieve the
> specified CPU bandwidth allocation, this would be it, but to use that
> to actually enforce the CPU bandwidth allocation, you would need to
> take into account the genuine inter-arrival time to choose an actual
> time for service to begin. In other words, this should be a quota for
> the task to have waited. If it's not waited long enough, then it should
> be delayed by the difference to achieve the inter-arrival time you're
> trying to enforce. If it's waited longer, it should be executed
> sooner modulo other constraints, and perhaps even credited for future
> scheduling cycles.

The idea isn't to enforce the bandwidth entitlement to the extent of 
throttling tasks if they exceed their entitlement and there's no other 
tasks ready to use the CPU.  This is mainly because the bandwidth 
entitlement isn't fixed -- it's changing constantly as the number and 
type of runnable tasks changes.

> 
> In order to partially avoid underallocating CPU bandwidth to p, one
> should track the time last spent sleeping and do the following:

Yes I made a mistake in omitting to take into account sleep interval. 
See another e-mail to Ingo correcting this problem.

> 	last_sleep = now - p->last_went_to_sleep;
> 	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
> 	wait = wait > last_sleep ? wait - last_sleep : 0;
> 
> By and large the scheduler can deterministically choose waits on the
> runqueue but it doesn't actually do that. Some additional corrections
> for delays beyond those decided up-front while on the runqueue or
> getting scheduled early on the runqueue may also help, though I'm not
> as interested in them as I am the one for sleep:
> 	last_wait = now - p->last_taken_off_the_cpu;
> 	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
> 	if (wait > last_wait) /* didn't wait long enough? */
> 		wait += wait - last_wait; /* wait the difference longer */
> 	else if (wait < last_wait) /* waited too long? */
> 		/* wait less to make up the difference */
> 		wait -= wait > last_wait - wait ? last_wait - wait : wait;
> Where ->last_taken_off_the_cpu represents the time the task was last
> taken off the CPU for whatever reason (this may need to be done
> differently to handle time variables).
> 
> In order to do better, longer-term history is required,

The half life of the Kalman filter (roughly equivalent to a running 
average) used to calculate the averages determines how much history is 
taken into account.  It could be made configurable (at least, until 
enough real life experience was available to decide on the best value to 
use).

> but it can't be
> a truly infinite history.

I agree.

> To attempt to maintain an infinite history of
> bandwidth underutilization to be credited too far in the future would
> enable potentially long-term overutilization when such credits are
> cashed en masse for a sustained period of time. At some point you have
> to say "use it or lose it;" over a shorter period of time some smoothing
> is still admissible and even desirable.

Yes, that's why I suggest a running average over the last few scheduling 
cycles for the task.  But thinking about it some more I'm now not so 
sure.  The lack of apparent "smoothness" when I've done this sort of 
thing with raw rather than smooth data (in modifications to the current 
dynamic priority based scheduler model) is usually noticed by running 
top and seeing wildly fluctuating dynamic priorities.  I'm not sure that 
the actual responsiveness of the system reflects this.  So I'm now 
willing to reserve my judgement on this issue.

> 
> One might also consider debits owing to non-preemptibility or other
> sorts of cpu bandwidth overutilization with similar caveats. A half-
> life to an otherwise infinite history of credits and/or debits
> specified in absolute time may be appropriate, though it's not quite as
> computationally cheap as the above. The accounting for these credits
> and debits would take the place of ->last_taken_off_the_cpu above.

Credits shouldn't be necessary with this model if, instead of using the 
average "on CPU" time per cycle for a pre-empted task when requeuing it, 
you use the time period that it was "on CPU" before it was booted off, as 
this should compensate it correctly.

> 
> Another attack on the problem of CPU bandwidth misallocation would be
> to further credit or debit the task according to the ratio of actual
> CPU bandwidth to allocated CPU bandwidth in order to compensate
> for prior failures to enforce the allocation.

The fact that the allocated CPU bandwidth is changing continuously makes 
this not a good idea.  On a system where allocations were made directly 
rather than being a side effect of "nice" and the number of runnable 
processes it might be the way to go.

> Unfortunately, the result
> of scaling the artificial inter-arrival time in the obvious way is
> degenerate (independent of priority) so it can't be done that simply.
> Perhaps it could be used solely to adjust the credit and debit history
> contribution to the quasi-inter-arrival time or the difference used
> somehow, but I don't care to make more complex proposals than the
> first-order correction above (if it's even worthy of being called a
> proposal or a correction) for either this or a time-decaying credit
> and debit history or even debit accounting at all.
> 
> I'd be happy enough to see the correction term to subtract out sleeping
> time (i.e. the code snippet with last_sleep).

Yes, this is THE important deficiency in my proposal.

> The rest and/or stronger
> attempts to factor in sleeping time or bandwidth misallocation I don't
> consider so significant, at least not without some demonstration there
> is CPU bandwidth misallocation happening.
> 
> 
> On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
>> I appreciate that the notion of basing the expected wait on the task's 
>> average cpu use per scheduling cycle is counter intuitive but I believe 
>> that (if you think about it) you'll see that it actually makes sense.
> 
> I don't know. It sounds rather standard to me.

That's good to hear.

I also think (as Ingo would probably prefer) that this mechanism 
might work without using averages, i.e. just use the last "on CPU" time. 
  If it does work it would have the advantage of being impervious to the 
possible bad effects of CPU speed changes on those systems where this is 
a feature.  When using averages, I think that special consideration 
would need to be made for variable cpu speeds -- probably not that 
difficult but additional overhead anyway.

I've been giving some thought into a mechanism for pre-empting the 
current task when another task's expected "on CPU" time arrives.  I 
think that it would be sufficient for scheduler_tick() to compare the 
current time (via sched_clock() or whatever) with the expected "on CPU" 
time for the next task on the queue and initiate pre-emption if it's later than 
that time.  This increases the granularity a little but is compensated 
for by the fact it will help with the situation where more than one task 
on the queue has the same expected "on CPU" time in that each of them 
would get at least a tick (if they needed it).
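
In sketch form the check could be as simple as this (names illustrative,
not from the patch):

#include <stdint.h>

struct queued_task {
        uint64_t expected_on_cpu_time;  /* when this task is due on the CPU */
};

/*
 * Called from something like scheduler_tick(): if the leftmost queued
 * task's expected "on CPU" time has already passed, ask for a reschedule.
 */
static void tick_preempt_check(const struct queued_task *next, uint64_t now,
                               void (*resched)(void))
{
        if (next && now >= next->expected_on_cpu_time)
                resched();      /* boot the current task off the CPU */
}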

If some form of precise timer was used (instead) to trigger pre-emption 
then, where there is more than one task with the same expected "on CPU" 
time, only the last would get any CPU (and that's only the case if some 
infinite loop doesn't get created).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  0:23     ` Peter Williams
@ 2007-04-21  5:07       ` William Lee Irwin III
  2007-04-21  5:38         ` Peter Williams
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-21  5:07 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

William Lee Irwin III wrote:
>> This essentially doesn't look correct because while you want to enforce
>> the CPU bandwidth allocation, this doesn't have much to do with that
>> apart from the CPU bandwidth appearing as a term. It's more properly
>> a rate of service as opposed to a time at which anything should happen
>> or a number useful for predicting such. When service should begin more
>> properly depends on the other tasks in the system and a number of other
>> decisions that are part of the scheduling policy.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> This model takes all of those into consideration.  The idea is not just 
> to predict but to use the calculated time to decide when to boot the 
> current process of the CPU (if it doesn't leave voluntarily) and put 
> this one on.  This more or less removes the need to give each task a 
> predetermined chunk of CPU when they go on to the CPU.  This should, in 
> general, reduce the number context switches as tasks get to run until 
> they've finished what they're doing or another task becomes higher 
> priority rather than being booted off after an arbitrary time interval. 
>  (If this ever gets tried it will be interesting to see if this 
> prediction comes true.)
> BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
> patch (way in the future after Ingo's changes are settled) to try it out 
> myself.

I think I smoked out what you were doing.


William Lee Irwin III wrote:
>> If you want to choose a "quasi-inter-arrival time" to achieve the
>> specified CPU bandwidth allocation, this would be it, but to use that
>> to actually enforce the CPU bandwidth allocation, you would need to
>> take into account the genuine inter-arrival time to choose an actual
>> time for service to begin. In other words, this should be a quota for
>> the task to have waited. If it's not waited long enough, then it should
>> be delayed by the difference to achieve the inter-arrival time you're
>> trying to enforce. If it's waited longer, it should be executed
>> sooner modulo other constraints, and perhaps even credited for future
>> scheduling cycles.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> The idea isn't to enforce the bandwidth entitlement to the extent of 
> throttling tasks if they exceed their entitlement and there's no other 
> tasks ready to use the CPU.  This is mainly because the bandwidth 
> entitlement isn't fixed -- it's changing constantly as the number and 
> type of runnable tasks changes.

Well, a little hysteresis will end up throttling in such a manner
anyway as a side-effect, or you'll get anomalies. Say two tasks with
equal entitlements compete, where one sleeps for 1/3 of the time and
the other is fully CPU-bound. If only the times when they're in direct
competition are split 50/50, then the CPU-bound task gets 2/3 and the
sleeper 1/3, which is not the intended effect. I don't believe this
model will be very vulnerable to it, though.
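
The arithmetic behind that example, as a toy calculation (task A always
runnable, task B asleep one unit in every three, 50/50 split only while
both are runnable):

#include <stdio.h>

int main(void)
{
        double a = 0.0, b = 0.0;        /* CPU time of the two tasks */
        int period;

        for (period = 0; period < 3000; period++) {
                if (period % 3 == 2)
                        a += 1.0;       /* B asleep: A runs alone */
                else {
                        a += 0.5;       /* both runnable: split 50/50 */
                        b += 0.5;
                }
        }
        /* prints roughly A: 0.667, B: 0.333 */
        printf("A: %.3f  B: %.3f of total CPU\n", a / (a + b), b / (a + b));
        return 0;
}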


William Lee Irwin III wrote:
>> In order to partially avoid underallocating CPU bandwidth to p, one
>> should track the time last spent sleeping and do the following:

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> Yes I made a mistake in omitting to take into account sleep interval. 
> See another e-mail to Ingo correcting this problem.

I took it to be less trivial of an error than it was. No big deal.


William Lee Irwin III wrote:
>> In order to do better, longer-term history is required,

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> The half life of the Kalman filter (roughly equivalent to a running 
> average) used to calculate the averages determines how much history is 
> taken into account.  It could be made configurable (at least, until 
> enough real life experience was available to decide on the best value to 
> use).

A Kalman filter would do better than a running average. I'm all for it.


William Lee Irwin III wrote:
>> To attempt to maintain an infinite history of
>> bandwidth underutilization to be credited too far in the future would
>> enable potentially long-term overutilization when such credits are
>> cashed en masse for a sustained period of time. At some point you have
>> to say "use it or lose it;" over a shorter period of time some smoothing
>> is still admissible and even desirable.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> Yes, that's why I suggest a running average over the last few scheduling 
> cycles for the task.  But thinking about it some more I'm now not so 
> sure.  The lack of apparent "smoothness" when I've done this sort of 
> thing with raw rather than smooth data (in modifications to the current 
> dynamic priority based scheduler model) is usually noticed by running 
> top and seeing wildly fluctuating dynamic priorities.  I'm not sure that 
> the actual responsiveness of the system reflects this.  So I'm now 
> willing to reserve my judgement on this issue.

I'm thinking smoothing should be over a relatively short period of time,
probably shorter than human reflexes can discern.


William Lee Irwin III wrote:
>> One might also consider debits owing to non-preemptibility or other
>> sorts of cpu bandwidth overutilization with similar caveats. A half-
>> life to an otherwise infinite history of credits and/or debits
>> specified in absolute time may be appropriate, though it's not quite as
>> computationally cheap as the above. The accounting for these credits
>> and debits would take the place of ->last_taken_off_the_cpu above.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> Credits shouldn't be necessary with this model if, instead of using the 
> average "on CPU" time per cycle for a pre-empted task when requeuing it, 
> you use the time period that it was "an CPU" before it was booted off as 
> this should compensate it correctly.

That was all about trying to do more in the vein of CPU bandwidth
allocation enforcement. It's not the only attack on the problem, and
may not even be a desirable one.


William Lee Irwin III wrote:
>> Another attack on the problem of CPU bandwidth misallocation would be
>> to further credit or debit the task according to the ratio of actual
>> CPU bandwidth to allocated CPU bandwidth in order to compensate
>> for prior failures to enforce the allocation.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> The fact that the allocated CPU bandwidth is changing continuously makes 
> this not a good idea.  On a system where allocations were made directly 
> rather being a side effect of "nice" and the number of runnable 
> processes it might be the way to go.

Where it changes continuously the intended effect is some hysteresis.
I'm thinking of the CPU-bound task vs. the task that sleeps 1/3 of the
time and similar such cases. Interestingly, cfs seems to get the
intended allocation by maintaining an effectively infinite history.
This is another reason I think it's important to strive for the CPU
bandwidth allocation "correctness;" it's already been accomplished in
some respects, albeit without relative prioritization.

Maybe a good test for whether an extension for relative prioritization
works would be to make the load weights for each nice level configurable
via /proc/ and then see if the algorithm successfully allocates shares
of CPU bandwidth in various cases accordingly.
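
The target figures for such a test are easy to generate: each task's
long-run share should simply be its weight over the sum of the competing
weights.  A throwaway calculation (the weights are made up):

#include <stdio.h>

int main(void)
{
        const unsigned long weight[] = { 1024, 820, 655 }; /* hypothetical */
        const int n = sizeof(weight) / sizeof(weight[0]);
        unsigned long sum = 0;
        int i;

        for (i = 0; i < n; i++)
                sum += weight[i];
        for (i = 0; i < n; i++)
                printf("task %d: expected share %.1f%%\n",
                       i, 100.0 * weight[i] / sum);
        return 0;
}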


William Lee Irwin III wrote:
>> The rest and/or stronger
>> attempts to factor in sleeping time or bandwidth misallocation I don't
>> consider so significant, at least not without some demonstration there
>> is CPU bandwidth misallocation happening.

Right here I sort of downplayed trying too hard to enforce things, at
least computationally.


William Lee Irwin III wrote:
>> I don't know. It sounds rather standard to me.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> That's good to hear.
> I also think that (as Ingo would probably prefer) that this mechanism 
> might work without using averages i.e. just use the last "on CPU" time. 
>  If it does work it would have the advantage of being impervious to the 
> possible bad effects of CPU speed changes on those systems where this is 
> a feature.  When using averages, I think that special consideration 
> would need to be made for variable cpu speeds -- probably not that 
> difficult but additional overhead anyway.

Timekeeping needs adjustments for all that, so the scheduler should
just be able to call the timekeeping functions that need to be written
for it anyway.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> I've been giving some thought into a mechanism for pre-empting the 
> current task when another task's expected "on CPU" time arrives.  I 
> think that it would be sufficient for scheduler_tick() to compare the 
> current time (via sched_clock() or whatever) with the "on CPU" for the 
> next task on the queue and initiation pre-emption if it's later than 
> that time.  This increases the granularity a little but is compensated 
> for by the fact it will help with the situation where more than one task 
> on the queue has the same expected "on CPU" time in that each of them 
> would get at least a tick (if they needed it).

Another nicety is that you can pretty much predict when you're going to
need to preempt, so you can nuke the periodic scheduling interrupt and
use a one-shot timer for whenever the next preemption is meant to occur.
This nets power savings and less interrupt overhead for userspace.
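
As a userspace analogy of the one-shot idea (the kernel would use its
own high-resolution timer machinery; the timer id is assumed to have
been set up elsewhere with timer_create()):

#include <stdint.h>
#include <string.h>
#include <time.h>

/*
 * Arm a single expiry for the moment the next queued task is due on the
 * CPU; no periodic interval, so the timer is re-armed on each reschedule.
 */
static void arm_next_preemption(timer_t timerid, uint64_t now_ns,
                                uint64_t due_ns)
{
        struct itimerspec its;
        uint64_t delta = due_ns > now_ns ? due_ns - now_ns : 1;

        memset(&its, 0, sizeof(its));   /* it_interval stays zero: one-shot */
        its.it_value.tv_sec  = delta / 1000000000ULL;
        its.it_value.tv_nsec = delta % 1000000000ULL;
        timer_settime(timerid, 0, &its, NULL);
}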


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> If some form of precise timer was used (instead) to trigger pre-emption 
> then, where there is more than one task with the same expected "on CPU" 
> time, only the last would get any CPU (and that's only the case if some 
> infinite loop doesn't get created).

I think you can check for this and adjust their on-cpu times in various
ways and choose what order to run them in. I think it'd end up something
of a missed deadline policy.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  5:07       ` William Lee Irwin III
@ 2007-04-21  5:38         ` Peter Williams
  2007-04-21  7:32           ` Peter Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-21  5:38 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> This essentially doesn't look correct because while you want to enforce
>>> the CPU bandwidth allocation, this doesn't have much to do with that
>>> apart from the CPU bandwidth appearing as a term. It's more properly
>>> a rate of service as opposed to a time at which anything should happen
>>> or a number useful for predicting such. When service should begin more
>>> properly depends on the other tasks in the system and a number of other
>>> decisions that are part of the scheduling policy.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> This model takes all of those into consideration.  The idea is not just 
>> to predict but to use the calculated time to decide when to boot the 
>> current process of the CPU (if it doesn't leave voluntarily) and put 
>> this one on.  This more or less removes the need to give each task a 
>> predetermined chunk of CPU when they go on to the CPU.  This should, in 
>> general, reduce the number context switches as tasks get to run until 
>> they've finished what they're doing or another task becomes higher 
>> priority rather than being booted off after an arbitrary time interval. 
>>  (If this ever gets tried it will be interesting to see if this 
>> prediction comes true.)
>> BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
>> patch (way in the future after Ingo's changes are settled) to try it out 
>> myself.
> 
> I think I smoked out what you were doing.
> 
> 
> William Lee Irwin III wrote:
>>> If you want to choose a "quasi-inter-arrival time" to achieve the
>>> specified CPU bandwidth allocation, this would be it, but to use that
>>> to actually enforce the CPU bandwidth allocation, you would need to
>>> take into account the genuine inter-arrival time to choose an actual
>>> time for service to begin. In other words, this should be a quota for
>>> the task to have waited. If it's not waited long enough, then it should
>>> be delayed by the difference to achieve the inter-arrival time you're
>>> trying to enforce. If it's waited longer, it should be executed
>>> sooner modulo other constraints, and perhaps even credited for future
>>> scheduling cycles.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> The idea isn't to enforce the bandwidth entitlement to the extent of 
>> throttling tasks if they exceed their entitlement and there's no other 
>> tasks ready to use the CPU.  This is mainly because the bandwidth 
>> entitlement isn't fixed -- it's changing constantly as the number and 
>> type of runnable tasks changes.
> 
> Well, a little hysteresis will end up throttling in such a manner
> anyway as a side-effect,

Think of this as a calming influence :-)

> or you'll get anomalies. Say two tasks with
> equal entitlements compete, where one sleeps for 1/3 of the time and
> the other is fully CPU-bound. If only the times when they're in direct
> competition are split 50/50, then the CPU-bound task gets 2/3 and the
> sleeper 1/3, which is not the intended effect. I don't believe this
> model will be very vulnerable to it, though.

Nor me.

> 
> 
> William Lee Irwin III wrote:
>>> In order to partially avoid underallocating CPU bandwidth to p, one
>>> should track the time last spent sleeping and do the following:
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> Yes I made a mistake in omitting to take into account sleep interval. 
>> See another e-mail to Ingo correcting this problem.
> 
> I took it to be less trivial of an error than it was. No big deal.

No, you were right; it was definitely a NON trivial error.

> 
> 
> William Lee Irwin III wrote:
>>> In order to do better, longer-term history is required,
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> The half life of the Kalman filter (roughly equivalent to a running 
>> average) used to calculate the averages determines how much history is 
>> taken into account.  It could be made configurable (at least, until 
>> enough real life experience was available to decide on the best value to 
>> use).
> 
> A Kalman filter would do better than a running average. I'm all for it.

As a long time user of Kalman filters I tend to think of them as the 
same thing.  I use the term running average when talking about the idea 
behind a scheduler because I think that more people will understand what 
the general idea is.  When it comes to implementation I always replace 
the idea of "running average" with a roughly equivalent Kalman filter.

> 
> 
> William Lee Irwin III wrote:
>>> To attempt to maintain an infinite history of
>>> bandwidth underutilization to be credited too far in the future would
>>> enable potentially long-term overutilization when such credits are
>>> cashed en masse for a sustained period of time. At some point you have
>>> to say "use it or lose it;" over a shorter period of time some smoothing
>>> is still admissible and even desirable.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> Yes, that's why I suggest a running average over the last few scheduling 
>> cycles for the task.  But thinking about it some more I'm now not so 
>> sure.  The lack of apparent "smoothness" when I've done this sort of 
>> thing with raw rather than smooth data (in modifications to the current 
>> dynamic priority based scheduler model) is usually noticed by running 
>> top and seeing wildly fluctuating dynamic priorities.  I'm not sure that 
>> the actual responsiveness of the system reflects this.  So I'm now 
>> willing to reserve my judgement on this issue.
> 
> I'm thinking smoothing should be over a relatively short period of time,
> probably shorter than human reflexes can discern.
> 
> 
> William Lee Irwin III wrote:
>>> One might also consider debits owing to non-preemptibility or other
>>> sorts of cpu bandwidth overutilization with similar caveats. A half-
>>> life to an otherwise infinite history of credits and/or debits
>>> specified in absolute time may be appropriate, though it's not quite as
>>> computationally cheap as the above. The accounting for these credits
>>> and debits would take the place of ->last_taken_off_the_cpu above.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> Credits shouldn't be necessary with this model if, instead of using the 
>> average "on CPU" time per cycle for a pre-empted task when requeuing it, 
>> you use the time period that it was "an CPU" before it was booted off as 
>> this should compensate it correctly.
> 
> That was all about trying to do more in the vein of CPU bandwidth
> allocation enforcement. It's not the only attack on the problem, and
> may not even be a desirable one.
> 
> 
> William Lee Irwin III wrote:
>>> Another attack on the problem of CPU bandwidth misallocation would be
>>> to further credit or debit the task according to the ratio of actual
>>> CPU bandwidth to allocated CPU bandwidth in order to compensate
>>> for prior failures to enforce the allocation.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> The fact that the allocated CPU bandwidth is changing continuously makes 
>> this not a good idea.  On a system where allocations were made directly 
>> rather being a side effect of "nice" and the number of runnable 
>> processes it might be the way to go.
> 
> Where it changes continuously the intended effect is some hysteresis.
> I'm thinking of the CPU-bound task vs. the task that sleeps 1/3 of the
> time and similar such cases. Interestingly, cfs seems to get the
> intended allocation by maintaining an effectively infinite history.
> This is another reason I think it's important to strive for the CPU
> bandwidth allocation "correctness;" it's already been accomplished in
> some respects, albeit without relative prioritization.
> 
> Maybe a good test for whether an extension for relative prioritization
> works would be to make the load weights for each nice level configurable
> via /proc/ and then see if the algorithm successfully allocates shares
> of CPU bandwidth in various cases accordingly.

Yes.  I think that we can fiddle fairly freely with these (subject to 
the fact that their values currently cache the multiplication by 
SCHED_LOAD_SCALE that the load balancer's average load calculation needs). 
  From the POV of the load balancer's smpnice code it just wants the 
values to reflect the system's interpretation of nice.  This used to be 
done by tying the values to the time slice allocations associated with 
nice but that is now history so we're free to redesign it.  As you say, 
having the ability to easily try different mappings would make it easier 
to experiment and find the best mapping.
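
For instance, a purely hypothetical mapping of the kind that could be
exposed that way, with each nice step scaling the weight by a fixed
ratio around a nice-0 value of SCHED_LOAD_SCALE (the 1.25 ratio and the
numbers it produces are not from any patch):

#include <math.h>
#include <stdio.h>

#define TOY_LOAD_SCALE  1024.0  /* stand-in for SCHED_LOAD_SCALE */

int main(void)
{
        int nice;

        /* one made-up weight per nice level, ~25% change per step */
        for (nice = -20; nice <= 19; nice++)
                printf("nice %3d -> weight %6.0f\n",
                       nice, TOY_LOAD_SCALE * pow(1.25, -nice));
        return 0;
}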

> 
> 
> William Lee Irwin III wrote:
>>> The rest and/or stronger
>>> attempts to factor in sleeping time or bandwidth misallocation I don't
>>> consider so significant, at least not without some demonstration there
>>> is CPU bandwidth misallocation happening.
> 
> Right here I sort of downplayed trying too hard to enforce things, at
> least computationally.

I agree with that approach.  There's no sense spending too much effort 
being extremely precise when the input you use is probably invalid by 
the time you apply the output.

> 
> 
> William Lee Irwin III wrote:
>>> I don't know. It sounds rather standard to me.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> That's good to hear.
>> I also think that (as Ingo would probably prefer) that this mechanism 
>> might work without using averages i.e. just use the last "on CPU" time. 
>>  If it does work it would have the advantage of being impervious to the 
>> possible bad effects of CPU speed changes on those systems where this is 
>> a feature.  When using averages, I think that special consideration 
>> would need to be made for variable cpu speeds -- probably not that 
>> difficult but additional overhead anyway.
> 
> Timekeeping needs adjustments for all that, so the scheduler should
> just be able to call the timekeeping functions that need to be written
> for it anyway.

Unfortunately not.  Things like sleep and wait time need to be real time 
and only "on CPU" time needs to be "equivalent full speed on CPU time" 
in metrics, with a conversion back the other way when working out the 
expected "on CPU" time.

> 
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> I've been giving some thought into a mechanism for pre-empting the 
>> current task when another task's expected "on CPU" time arrives.  I 
>> think that it would be sufficient for scheduler_tick() to compare the 
>> current time (via sched_clock() or whatever) with the "on CPU" for the 
>> next task on the queue and initiation pre-emption if it's later than 
>> that time.  This increases the granularity a little but is compensated 
>> for by the fact it will help with the situation where more than one task 
>> on the queue has the same expected "on CPU" time in that each of them 
>> would get at least a tick (if they needed it).
> 
> Another nicety is that you can pretty much predict when you're going to
> need to preempt, so you can nuke the periodic scheduling interrupt and
> use a one-shot timer for whenever the next preemption is meant to occur.
> This nets power savings and less interrupt overhead for userspace.

I think the one shot timer is dangerous as it can possibly lead to 
infinite pre-emption loops if several tasks end up with the same "on 
CPU" time, and that the periodic interrupt is therefore better.  Like you I feel 
sad about this as I hoped that scheduler_tick() would disappear.

> 
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> If some form of precise timer was used (instead) to trigger pre-emption 
>> then, where there is more than one task with the same expected "on CPU" 
>> time, only the last would get any CPU (and that's only the case if some 
>> infinite loop doesn't get created).
> 
> I think you can check for this and adjust their on-cpu times in various
> ways and choose what order to run them in. I think it'd end up something
> of a missed deadline policy.

Arguably the latest one should go first as its estimate is more likely 
to be correct, being based on more recent info, i.e. the others' estimates 
would be based on fewer runnable tasks and so be optimistic. This would 
involve pushing them down the queue and pushing items already on the 
queue downwards is likely to be expensive.  Which is why I prefer the 
less elegant solution of the periodical interrupt.

Of course, one solution might be to just add the adjustment to the key 
time on all tasks on the queue?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  5:38         ` Peter Williams
@ 2007-04-21  7:32           ` Peter Williams
  2007-04-21  7:54             ` Ingo Molnar
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-21  7:32 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett

Peter Williams wrote:
> William Lee Irwin III wrote:
>> William Lee Irwin III wrote:
>> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>>> If some form of precise timer was used (instead) to trigger 
>>> pre-emption then, where there is more than one task with the same 
>>> expected "on CPU" time, only the last would get any CPU (and that's 
>>> only the case if some infinite loop doesn't get created).
>>
>> I think you can check for this and adjust their on-cpu times in various
>> ways and choose what order to run them in. I think it'd end up something
>> of a missed deadline policy.
> 
> Arguably the latest one should go first as its estimate is more likely 
> to be correct being based on more recent info i.e. the other's estimates 
> would be based on less runnable tasks and be optimistic. This would 
> involve pushing them down the queue and pushing items already on the 
> queue downwards is likely to be expensive.  Which is why I prefer the 
> less elegant solution of the periodical interrupt.
> 
> Of course, one solution might be to just add the adjustment to the key 
> time on all tasks on the queue?

I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the queue 
having their "on CPU" forever postponed and we all know that even the 
smallest possibility of starvation will eventually cause problems.

I think there should be a rule: Once a task is on the queue its "on CPU" 
time is immutable.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  7:32           ` Peter Williams
@ 2007-04-21  7:54             ` Ingo Molnar
  2007-04-21  8:33               ` William Lee Irwin III
  2007-04-21 10:37               ` Peter Williams
  0 siblings, 2 replies; 39+ messages in thread
From: Ingo Molnar @ 2007-04-21  7:54 UTC (permalink / raw)
  To: Peter Williams
  Cc: William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> I retract this suggestion as it's a very bad idea.  It introduces the 
> possibility of starvation via the poor sods at the bottom of the queue 
> having their "on CPU" forever postponed and we all know that even the 
> smallest possibility of starvation will eventually cause problems.
> 
> I think there should be a rule: Once a task is on the queue its "on 
> CPU" time is immutable.

Yeah, fully agreed. Currently i'm using the simple method of 
p->nice_offset, which plainly just moves the per nice level areas of the 
tree far enough apart (by a constant offset) so that lower nice levels 
rarely interact with higher nice levels. Lower nice levels never truly 
starve because rq->fair_clock increases deterministically and currently 
the fair_key values are indeed 'immutable' as you suggest.
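
In very rough code the idea looks something like this -- a userspace toy 
model for illustration only, not the patch itself; the granularity 
constant and the types below are made up:

#include <stdio.h>
#include <stdint.h>

#define NICE_OFFSET_GRANULARITY	1000000ULL	/* arbitrary spacing */

struct toy_task {
	int	nice;		/* -20 .. 19 */
	int64_t	wait_runtime;	/* CPU time the task is still 'owed' */
};

/* the key is computed once, at enqueue time, and never changed later */
static uint64_t toy_fair_key(uint64_t fair_clock, const struct toy_task *t)
{
	uint64_t key = fair_clock +
		(uint64_t)(t->nice + 20) * NICE_OFFSET_GRANULARITY;

	/* tasks that are owed more CPU sort earlier in the tree */
	return key - (uint64_t)t->wait_runtime;
}

int main(void)
{
	struct toy_task nice0  = { .nice =  0, .wait_runtime = 0 };
	struct toy_task nice19 = { .nice = 19, .wait_runtime = 0 };

	/* the nice 19 task lands 19 granularity units further right */
	printf("nice  0 key: %llu\n",
	       (unsigned long long)toy_fair_key(0, &nice0));
	printf("nice 19 key: %llu\n",
	       (unsigned long long)toy_fair_key(0, &nice19));
	return 0;
}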

In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
then we could 'share' its p->wait_runtime with that nice 19 task and 
copy the signal sender's nice_offset. This would in essence pass the 
right to execute over to the killed task, so that it can tear itself 
down.

This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.

In any case, it is clear that rq->raw_cpu_load should be used instead of 
rq->nr_running, when calculating the fair clock, but i begin to like the 
nice_offset solution too in addition to this: it's effective in practice 
and starvation-free in theory, and most importantly, it's very simple. 
We could even make the nice offset granularity tunable, just in case 
anyone wants to weaken (or strengthen) the effectiveness of nice levels. 
What do you think, can you see any obvious (or less obvious) 
showstoppers with this approach?

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 12:28       ` Peter Williams
@ 2007-04-21  8:07         ` Peter Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Williams @ 2007-04-21  8:07 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

Peter Williams wrote:
> Peter Williams wrote:
>> Ingo Molnar wrote:
>>> your suggestion concentrates on the following scenario: if a task 
>>> happens to schedule in an 'unlucky' way and happens to hit a busy 
>>> period while there are many idle periods. Unless i misunderstood your 
>>> suggestion, that is the main intention behind it, correct?
>>
>> You misunderstand (that's one of my other schedulers :-)).  This one's 
>> based on the premise that if everything happens as the task expects it 
>> will get the amount of CPU bandwidth (over this short period) that 
>> it's entitled to.  In reality, sometimes it will get more and 
>> sometimes less but on average it should get what it deserves. E.g. If 
>> you had two tasks with equal nice and both had demands of 90% of a CPU 
>> you'd expect them each to get about half of the CPU bandwidth.  Now 
>> suppose that one of them uses 5ms of CPU each time it got onto the CPU 
>> and the other uses 10ms.  If these two tasks just round robin with 
>> each other the likely outcome is that the one with the 10ms bursts 
>> will get twice as much CPU as the other but my proposed method should 
>> prevent and cause them to get roughly the same amount of CPU.  (I 
>> believe this was a scenario that caused problems with O(1) and 
>> required a fix at some stage?)

Another advantage of this mechanism is that, all else being equal, it 
will tend to run tasks that use short bursts of CPU ahead of those that 
use long bursts, and this tends to reduce the overall time spent waiting 
for CPU by all tasks on the system, which is good for throughput.  I.e. 
in general, a task that tends to use short bursts of CPU will make other 
tasks wait less time than one that tends to use long bursts.
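
To put rough numbers on the 5ms/10ms example quoted above (a toy 
calculation only, not scheduler code):

#include <stdio.h>

int main(void)
{
	double burst_a = 5.0;	/* ms per visit to the CPU */
	double burst_b = 10.0;

	/* strict round robin: one burst each per cycle */
	printf("round robin:     A %4.1f%%, B %4.1f%%\n",
	       100.0 * burst_a / (burst_a + burst_b),
	       100.0 * burst_b / (burst_a + burst_b));

	/* pick the task with the least CPU so far: A runs twice per run of B */
	printf("least CPU first: A %4.1f%%, B %4.1f%%\n",
	       100.0 * 2.0 * burst_a / (2.0 * burst_a + burst_b),
	       100.0 * burst_b / (2.0 * burst_a + burst_b));
	return 0;
}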

So this means that you were right and it is good in the scenario that 
you suggested even though that wasn't the motivation behind the design.

This means that this scheduler should be good for improving latency on 
servers that aren't fully loaded as well as providing good fairness and 
responsiveness when the system is fully loaded.  (Fingers crossed.)

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  7:54             ` Ingo Molnar
@ 2007-04-21  8:33               ` William Lee Irwin III
  2007-04-21  8:57                 ` Ingo Molnar
  2007-04-21 10:37               ` Peter Williams
  1 sibling, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-21  8:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
> In practice they can starve a bit when one renices thousands of tasks, 
> so i was thinking about the following special-case: to at least make 
> them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
> then we could 'share' its p->wait_runtime with that nice 19 task and 
> copy the signal sender's nice_offset. This would in essence pass the 
> right to execute over to the killed task, so that it can tear itself 
> down.
> This cannot be used to gain an 'unfair advantage' because the signal 
> sender spends its own 'right to execute on the CPU', and because the 
> target task cannot execute any user code anymore when it gets a SIGKILL.

I suppose this is a special case of the dreaded priority inversion. What
of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice
-19 tasks are waiting to acquire? Perhaps rt_mutex should be the default
mutex implementation.


On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
> In any case, it is clear that rq->raw_cpu_load should be used instead of 
> rq->nr_running, when calculating the fair clock, but i begin to like the 
> nice_offset solution too in addition of this: it's effective in practice 
> and starvation-free in theory, and most importantly, it's very simple. 
> We could even make the nice offset granularity tunable, just in case 
> anyone wants to weaken (or strengthen) the effectivity of nice levels. 
> What do you think, can you see any obvious (or less obvious) 
> showstoppers with this approach?

->nice_offset's semantics are not meaningful to the end user, regardless
of whether it's effective. If there is something to be tuned, it should
be relative shares of CPU bandwidth (load_weight) corresponding to each
nice level or something else directly observable. The implementation
could be ->nice_offset, if it suffices.

Suppose a table of nice weights like the following is tuned via /proc/:

-20	21			 0	1
-19	20			 1	0.5
-18	19			 2	0.3333
-17	18			 3	0.25
-16	17			 4	0.2
-15	16			 5	0.1666
-14	15			 6	0.1429
-13	14			 7	0.125
-12	13			 8	0.1111
-11	12			 9	0.1
-10	11			10	0.0909
 -9	10			11	0.0833
 -8	9			12	0.0769
 -7	8			13	0.0714
 -6	7			14	0.0666
 -5	6			15	0.0625
 -4	5			16	0.0588
 -3	4			17	0.0555
 -2	3			18	0.0526
 -1	2			19	0.0476

Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

Now suppose two CPU hogs, one being nice 1 and the other nice -1,
compete against each other. The nice 1 hog should receive 0.5/(2+0.5)
= 20% of the CPU, and the nice -1 hog should receive 80% of the CPU.
A nice -5 hog vs. a nice 0 hog should have the nice 0 hog receive
1/(6+1) = 14.29% of the CPU and the nice -5 hog receive 85.71% of the
CPU. Three hogs, one nice -1, one nice 0, and one nice 1 should have 
the nice -1 hog receive 2/(2+1+0.5) = 57.14% of the CPU, the nice 0
hog receive 1/(2+1+0.5) = 28.57% of the CPU, and the nice 1 hog
receive 0.5/(2+1+0.5) = 14.29% of the CPU. And so on. (Note that these
are very strong nice semantics, probably too strong for real use.) For
five hogs with nice numbers ranging from -2 to 2, we'd get:
-2	43.90%
-1	29.27%
 0	14.63%
 1	 7.32%
 2	 4.88%
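
For what it's worth, these percentages fall straight out of the weights; 
a trivial program to reproduce them (illustration only):

#include <stdio.h>

/* weight per the table above: 1/(n+1) for n >= 0, 1-n for n < 0 */
static double nice_weight(int nice)
{
	return nice >= 0 ? 1.0 / (nice + 1) : 1.0 - nice;
}

int main(void)
{
	int hogs[] = { -2, -1, 0, 1, 2 };
	int n = sizeof(hogs) / sizeof(hogs[0]);
	double total = 0.0;
	int i;

	for (i = 0; i < n; i++)
		total += nice_weight(hogs[i]);

	/* each hog's CPU share is its weight over the sum of the weights */
	for (i = 0; i < n; i++)
		printf("nice %2d: %6.2f%%\n",
		       hogs[i], 100.0 * nice_weight(hogs[i]) / total);
	return 0;
}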

However this is implemented doesn't matter so long as the semantics
are honored. I wouldn't specify too much beyond competing tasks with
equivalent sleep/wakeup behavior; one may wish to apply bonuses and
penalties to differing sleep/wakeup behavior in the future, even if
such is not now done.

Basically, the effect of nice numbers should be predictable somehow,
and frankly, making it specifiable by the user resolves the issue
of conflicting requirements.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  8:33               ` William Lee Irwin III
@ 2007-04-21  8:57                 ` Ingo Molnar
  2007-04-21 16:23                   ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2007-04-21  8:57 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett


* William Lee Irwin III <wli@holomorphy.com> wrote:

> I suppose this is a special case of the dreaded priority inversion. 
> What of, say, nice 19 tasks holding fs semaphores and/or mutexes that 
> nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the 
> default mutex implementation.

while i agree that it could be an issue, lock inversion is nothing 
really new, so i'd not go _that_ drastic and convert all mutexes to 
rtmutexes. (i've taken my -rt/PREEMPT_RT hat off)

For example, reiser3-based systems get pretty laggy under significant 
reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is 
enabled: reiser3 holds the BKL for extended periods of time, so a "make 
-j50" workload can starve it significantly, and the tty layer's BKL use 
makes any sort of keyboard input (even over ssh) laggy.

Other locks though are not held this frequently and the mutex 
implementation is pretty fair for waiters anyway. (the semaphore 
implementation is not nearly as fair, and the Big Kernel Semaphore 
is still struct semaphore based) So i'd really wait for specific 
workloads to trigger problems, and _maybe_ convert certain mutexes to 
rtmutexes, on an as-needed basis.

> > In any case, it is clear that rq->raw_cpu_load should be used instead of 
> > rq->nr_running, when calculating the fair clock, but i begin to like the 
> > nice_offset solution too in addition of this: it's effective in practice 
> > and starvation-free in theory, and most importantly, it's very simple. 
> > We could even make the nice offset granularity tunable, just in case 
> > anyone wants to weaken (or strengthen) the effectivity of nice levels. 
> > What do you think, can you see any obvious (or less obvious) 
> > showstoppers with this approach?
> 
> ->nice_offset's semantics are not meaningful to the end user, 
> regardless of whether it's effective. [...]

yeah, agreed. That's one reason why i didn't make it tunable; it's pretty 
meaningless to the user.

> [...] If there is something to be tuned, it should be relative shares 
> of CPU bandwidth (load_weight) corresponding to each nice level or 
> something else directly observable. The implementation could be 
> ->nice_offset, if it suffices.
> 
> Suppose a table of nice weights like the following is tuned via 
> /proc/:
> 
> -20	21			 0	1
>  -1	2			19	0.0476

> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

ok, thanks for thinking about it. I have changed the nice weight in 
CVSv5-to-be so that it defaults to something pretty close to your 
suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
set to about 2%. (This is something that users have requested for some 
time; the default ~5% is a tad high when running reniced SETI jobs, etc.)

the actual percentage scales almost directly with the nice offset 
granularity value, but if this should be exposed to users at all, i 
agree that it would be better to directly expose this as some sort of 
'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
fine-grained metric. Percentile is too coarse i think, and using 0.1% 
units isn't intuitive enough either. The sysctl handler would then 
transform that 'human readable' sysctl value into the appropriate 
internal nice-offset-granularity value (or whatever mechanism the 
implementation ends up using).

I'd not do this as a per-nice-level thing but as a single value that 
rescales the whole nice level range at once. That's a lot less easy to 
misconfigure and we've got enough nice levels for users to pick from 
almost arbitrarily, as long as they have the ability to influence the 
max.

does this sound mostly OK to you?

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  7:54             ` Ingo Molnar
  2007-04-21  8:33               ` William Lee Irwin III
@ 2007-04-21 10:37               ` Peter Williams
  2007-04-21 12:21                 ` Peter Williams
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Williams @ 2007-04-21 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>> I retract this suggestion as it's a very bad idea.  It introduces the 
>> possibility of starvation via the poor sods at the bottom of the queue 
>> having their "on CPU" forever postponed and we all know that even the 
>> smallest possibility of starvation will eventually cause problems.
>>
>> I think there should be a rule: Once a task is on the queue its "on 
>> CPU" time is immutable.
> 
> Yeah, fully agreed. Currently i'm using the simple method of 
> p->nice_offset, which plainly just moves the per nice level areas of the 
> tree far enough apart (by a constant offset) so that lower nice levels 
> rarely interact with higher nice levels. Lower nice levels never truly 
> starve because rq->fair_clock increases deterministically and currently 
> the fair_key values are indeed 'immutable' as you suggest.
> 
> In practice they can starve a bit when one renices thousands of tasks, 
> so i was thinking about the following special-case: to at least make 
> them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
> then we could 'share' its p->wait_runtime with that nice 19 task and 
> copy the signal sender's nice_offset. This would in essence pass the 
> right to execute over to the killed task, so that it can tear itself 
> down.
> 
> This cannot be used to gain an 'unfair advantage' because the signal 
> sender spends its own 'right to execute on the CPU', and because the 
> target task cannot execute any user code anymore when it gets a SIGKILL.
> 
> In any case, it is clear that rq->raw_cpu_load should be used instead of 
> rq->nr_running, when calculating the fair clock, but i begin to like the 
> nice_offset solution too in addition of this: it's effective in practice 
> and starvation-free in theory, and most importantly, it's very simple. 
> We could even make the nice offset granularity tunable, just in case 
> anyone wants to weaken (or strengthen) the effectivity of nice levels. 
> What do you think, can you see any obvious (or less obvious) 
> showstoppers with this approach?

I haven't had a close look at it but from the above description it 
sounds an order of magnitude more complex than I thought it would be. 
The idea of different nice levels sounds like a recipe for starvation to 
me (if it works the way it sounds like it works).

I guess I'll have to spend more time reading the code because I don't 
seem to be able to make sense of the above description in any way that 
doesn't say "starvation here we come".

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21 10:37               ` Peter Williams
@ 2007-04-21 12:21                 ` Peter Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Williams @ 2007-04-21 12:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, William Lee Irwin III, linux-kernel,
	Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett

Peter Williams wrote:
> Ingo Molnar wrote:
>> * Peter Williams <pwil3058@bigpond.net.au> wrote:
>>
>>> I retract this suggestion as it's a very bad idea.  It introduces the 
>>> possibility of starvation via the poor sods at the bottom of the 
>>> queue having their "on CPU" forever postponed and we all know that 
>>> even the smallest possibility of starvation will eventually cause 
>>> problems.
>>>
>>> I think there should be a rule: Once a task is on the queue its "on 
>>> CPU" time is immutable.
>>
>> Yeah, fully agreed. Currently i'm using the simple method of 
>> p->nice_offset, which plainly just moves the per nice level areas of 
>> the tree far enough apart (by a constant offset) so that lower nice 
>> levels rarely interact with higher nice levels. Lower nice levels 
>> never truly starve because rq->fair_clock increases deterministically 
>> and currently the fair_key values are indeed 'immutable' as you suggest.
>>
>> In practice they can starve a bit when one renices thousands of tasks, 
>> so i was thinking about the following special-case: to at least make 
>> them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 
>> task then we could 'share' its p->wait_runtime with that nice 19 task 
>> and copy the signal sender's nice_offset. This would in essence pass 
>> the right to execute over to the killed task, so that it can tear 
>> itself down.
>>
>> This cannot be used to gain an 'unfair advantage' because the signal 
>> sender spends its own 'right to execute on the CPU', and because the 
>> target task cannot execute any user code anymore when it gets a SIGKILL.
>>
>> In any case, it is clear that rq->raw_cpu_load should be used instead 
>> of rq->nr_running, when calculating the fair clock, but i begin to 
>> like the nice_offset solution too in addition of this: it's effective 
>> in practice and starvation-free in theory, and most importantly, it's 
>> very simple. We could even make the nice offset granularity tunable, 
>> just in case anyone wants to weaken (or strengthen) the effectivity of 
>> nice levels. What do you think, can you see any obvious (or less 
>> obvious) showstoppers with this approach?
> 
> I haven't had a close look at it but from the above description it 
> sounds an order of magnitude more complex than I thought it would be. 
> The idea of different nice levels sounds like a recipe for starvation to 
> me (if it works the way it sounds like it works).
> 
> I guess I'll have to spend more time reading the code because I don't 
> seem to be able to make sense of the above description in any way that 
> doesn't say "starvation here we come".

I'm finding it hard to figure out the underlying principle behind the 
way you're queuing things by reading the code (that's the trouble with 
reading the code: it just tells you what's being done, not why -- and 
sometimes it's even hard to figure out what's being done when there's a 
lot of indirection).  sched-design-CFS.txt isn't much help in this area 
either.

Any chance of a brief description of how it's supposed to work?

Key questions are:

How do you decide the key value for a task's position in the queue?

Is it an absolute time or an offset from the current time?

How do you decide when to boot the current task off the queue?  Both at 
wake-up of another task and in general play?

Peter
PS I think that you're trying to do too much in one patch.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-21  8:57                 ` Ingo Molnar
@ 2007-04-21 16:23                   ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2007-04-21 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Suppose a table of nice weights like the following is tuned via 
>> /proc/:
>> -20	21			 0	1
>>  -1	2			19	0.0476
> > Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> ok, thanks for thinking about it. I have changed the nice weight in 
> CVSv5-to-be so that it defaults to something pretty close to your 
> suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
> set to about 2%. (This something that users requested for some time, the 
> default ~5% is a tad high when running reniced SETI jobs, etc.)

Okay. Maybe what I suggested is too weak vs. too strong. I didn't
actually have it in mind as a proposal for general use, but maybe it is
good for such. I had more in mind tunability in general, but it's all
good. I'd think some curve gentler in intermediate nice levels and
stronger at the tails might be better.


On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> the actual percentage scales almost directly with the nice offset 
> granularity value, but if this should be exposed to users at all, i 
> agree that it would be better to directly expose this as some sort of 
> 'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
> finegrained metric. Percentile is too coarse i think, and using 0.1% 
> units isnt intuitive enough i think. The sysctl handler would then 
> transform that 'human readable' sysctl value into the appropriate 
> internal nice-offset-granularity value (or whatever mechanism the 
> implementation ends up using).

I vaguely liked specifying the full table, but maybe it's too much
for a real user interface.

4-digit or 5-digit fixed point decimal sounds reasonable.


On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> I'd not do this as a per-nice-level thing but as a single value that 
> rescales the whole nice level range at once. That's alot less easy to 
> misconfigure and we've got enough nice levels for users to pick from 
> almost arbitrarily, as long as they have the ability to influence the 
> max.
> does this sound mostly OK to you?

For the most part, yes. I've been mostly looking at how effectively
the prioritization algorithms work. I'll be wrapping up writing a
testcase to measure all this soon. The basic idea is to take the
weights as inputs somehow and then check to see that they're honored.
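
Roughly along these lines (a sketch only, not the actual testcase; pin 
it to one CPU, e.g. with taskset, so the hogs actually compete):

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>

int main(void)
{
	int nices[] = { -2, -1, 0, 1, 2 };
	int n = sizeof(nices) / sizeof(nices[0]);
	pid_t pids[5];
	int i;

	for (i = 0; i < n; i++) {
		pids[i] = fork();
		if (pids[i] == 0) {
			nice(nices[i]);	/* negative values need privileges */
			for (;;)
				;	/* burn CPU */
		}
	}

	sleep(30);	/* let the hogs compete for a while */

	for (i = 0; i < n; i++) {
		struct rusage ru;

		kill(pids[i], SIGKILL);
		wait4(pids[i], NULL, 0, &ru);
		printf("nice %2d: %ld.%02ld seconds of CPU\n", nices[i],
		       (long)ru.ru_utime.tv_sec,
		       (long)(ru.ru_utime.tv_usec / 10000));
	}
	return 0;
}

Comparing those per-hog CPU times against the expected shares from 
whatever weight table is in effect is then a simple post-processing step.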

What's appropriate for end-users is a very different thing from what
might be appropriate for me. I won't have trouble fiddling with the
code, so please do design around what the best interface for end-users
might be.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-20 20:11               ` Siddha, Suresh B
@ 2007-04-24 17:39                 ` Christoph Lameter
  2007-04-24 17:42                   ` Siddha, Suresh B
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2007-04-24 17:39 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: William Lee Irwin III, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett

On Fri, 20 Apr 2007, Siddha, Suresh B wrote:

> > Last I checked it was workload-dependent, but there were things that
> > hammer it. I mostly know of the remote wakeup issue, but there could
> > be other things besides wakeups that do it, too.
> 
> remote wakeup was the main issue and the 0.5% improvement was seen
> on a two node platform. Aligning it reduces the number of remote
> cachelines that needs to be touched as part of this wakeup.

.5% is usually in the noise ratio. Are you consistently seeing an 
improvement or is that sporadic?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-24 17:39                 ` Christoph Lameter
@ 2007-04-24 17:42                   ` Siddha, Suresh B
  2007-04-24 17:47                     ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Siddha, Suresh B @ 2007-04-24 17:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, William Lee Irwin III, Ingo Molnar,
	linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

On Tue, Apr 24, 2007 at 10:39:48AM -0700, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Siddha, Suresh B wrote:
> 
> > > Last I checked it was workload-dependent, but there were things that
> > > hammer it. I mostly know of the remote wakeup issue, but there could
> > > be other things besides wakeups that do it, too.
> > 
> > remote wakeup was the main issue and the 0.5% improvement was seen
> > on a two node platform. Aligning it reduces the number of remote
> > cachelines that needs to be touched as part of this wakeup.
> 
> .5% is usually in the noise ratio. Are you consistently seeing an 
> improvement or is that sporadic?

No. This is consistent. I am waiting for the perf data on a much much bigger
NUMA box.

Anyhow, this is a straightforward optimization and needs to be done. Do you
have any specific concerns?

thanks,
suresh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-24 17:42                   ` Siddha, Suresh B
@ 2007-04-24 17:47                     ` Christoph Lameter
  2007-04-24 17:50                       ` Siddha, Suresh B
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2007-04-24 17:47 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: William Lee Irwin III, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett

On Tue, 24 Apr 2007, Siddha, Suresh B wrote:

> > .5% is usually in the noise ratio. Are you consistently seeing an 
> > improvement or is that sporadic?
> 
> No. This is consistent. I am waiting for the perf data on a much much bigger
> NUMA box.
> 
> Anyhow, this is a straight forward optimization and needs to be done. Do you
> have any specific concerns?

Yes, there should not be contention on per-cpu data in principle. The 
point of per-cpu data is for the cpu to have access to contention-free 
cachelines.

If the data is contended then it should be moved out of the per-cpu data 
and properly placed to minimize contention. Otherwise we will get into 
cacheline aliases (__read_mostly in per cpu??) etc. in the per-cpu areas.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-24 17:47                     ` Christoph Lameter
@ 2007-04-24 17:50                       ` Siddha, Suresh B
  2007-04-24 17:55                         ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Siddha, Suresh B @ 2007-04-24 17:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, William Lee Irwin III, Ingo Molnar,
	linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > Anyhow, this is a straight forward optimization and needs to be done. Do you
> > have any specific concerns?
> 
> Yes there should not be contention on per cpu data in principle. The 
> point of per cpu data is for the cpu to have access to contention free 
> cachelines.
> 
> If the data is contented then it should be moved out of per cpu data and properly 
> placed to minimize contention. Otherwise we will get into cacheline 
> aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.

yes, we were planning to move this to a different percpu section, where
all the elements in this new section will be cacheline aligned (both
at the start, as well as at the end).
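
Something like the following, roughly (just to illustrate the shape -- 
the macro name and the section name below are not final):

/*
 * Sketch only, modelled on the existing DEFINE_PER_CPU: per-cpu
 * variables whose cachelines are also touched by other cpus go into
 * their own section, with every entry starting on its own cacheline
 * so that neighbouring entries never share one.
 */
#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)			\
	__attribute__((__section__(".data.percpu.shared_aligned")))	\
	__typeof__(type) per_cpu__##name				\
	____cacheline_aligned_in_smp

/* the runqueues from the patch at the top of the thread would become: */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);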

thanks,
suresh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-24 17:50                       ` Siddha, Suresh B
@ 2007-04-24 17:55                         ` Christoph Lameter
  2007-04-24 18:06                           ` Siddha, Suresh B
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2007-04-24 17:55 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: William Lee Irwin III, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett

On Tue, 24 Apr 2007, Siddha, Suresh B wrote:

> On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> > On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > > Anyhow, this is a straight forward optimization and needs to be done. Do you
> > > have any specific concerns?
> > 
> > Yes there should not be contention on per cpu data in principle. The 
> > point of per cpu data is for the cpu to have access to contention free 
> > cachelines.
> > 
> > If the data is contented then it should be moved out of per cpu data and properly 
> > placed to minimize contention. Otherwise we will get into cacheline 
> > aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.
> 
> yes, we were planning to move this to a different percpu section, where
> all the elements in this new section will be cacheline aligned(both
> at the start, aswell as end)

I would not call this a per-cpu area. It is used by multiple cpus, it 
seems. But for 0.5%? On what benchmark? Is it really worth it?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [patch] CFS scheduler, v3
  2007-04-24 17:55                         ` Christoph Lameter
@ 2007-04-24 18:06                           ` Siddha, Suresh B
  0 siblings, 0 replies; 39+ messages in thread
From: Siddha, Suresh B @ 2007-04-24 18:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, William Lee Irwin III, Ingo Molnar,
	linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett

On Tue, Apr 24, 2007 at 10:55:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > yes, we were planning to move this to a different percpu section, where
> > all the elements in this new section will be cacheline aligned(both
> > at the start, aswell as end)
> 
> I would not call this a per cpu area. It is used by multiple cpus it 
> seems. 

not decided on the name yet, but the area is allocated per cpu and yes,
the data can be referenced by other cpus.

> But for 0.5%? on what benchmark? Is is really worth it?

famous database workload :)

I don't think the new section will be added just for this 0.5%. This is 
a straightforward optimization, and anyone can plug into this new section 
in the future.

thanks,
suresh

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2007-04-24 18:08 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-18 17:50 [patch] CFS scheduler, v3 Ingo Molnar
2007-04-18 21:26 ` William Lee Irwin III
2007-04-18 21:33   ` Ingo Molnar
2007-04-20 19:24   ` Christoph Lameter
2007-04-20 19:26     ` Siddha, Suresh B
2007-04-20 19:29     ` William Lee Irwin III
2007-04-20 19:33       ` Christoph Lameter
2007-04-20 19:38         ` William Lee Irwin III
2007-04-20 19:44           ` Christoph Lameter
2007-04-20 20:03             ` William Lee Irwin III
2007-04-20 20:11               ` Siddha, Suresh B
2007-04-24 17:39                 ` Christoph Lameter
2007-04-24 17:42                   ` Siddha, Suresh B
2007-04-24 17:47                     ` Christoph Lameter
2007-04-24 17:50                       ` Siddha, Suresh B
2007-04-24 17:55                         ` Christoph Lameter
2007-04-24 18:06                           ` Siddha, Suresh B
2007-04-20  0:10 ` Peter Williams
2007-04-20  4:48   ` Willy Tarreau
2007-04-20  6:02     ` Peter Williams
2007-04-20  6:21       ` Peter Williams
2007-04-20  7:26       ` Willy Tarreau
2007-04-20  6:46   ` Ingo Molnar
2007-04-20  7:32     ` Peter Williams
2007-04-20 12:28       ` Peter Williams
2007-04-21  8:07         ` Peter Williams
2007-04-20 13:15   ` William Lee Irwin III
2007-04-21  0:23     ` Peter Williams
2007-04-21  5:07       ` William Lee Irwin III
2007-04-21  5:38         ` Peter Williams
2007-04-21  7:32           ` Peter Williams
2007-04-21  7:54             ` Ingo Molnar
2007-04-21  8:33               ` William Lee Irwin III
2007-04-21  8:57                 ` Ingo Molnar
2007-04-21 16:23                   ` William Lee Irwin III
2007-04-21 10:37               ` Peter Williams
2007-04-21 12:21                 ` Peter Williams
2007-04-20 14:21   ` Peter Williams
2007-04-20 14:33     ` Ingo Molnar
