linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [ANNOUNCE] RSDL completely fair starvation free interactive  cpu scheduler
@ 2007-03-05  9:45 Nicolas Mailhot
  2007-03-05  9:53 ` Gene Heskett
  0 siblings, 1 reply; 69+ messages in thread
From: Nicolas Mailhot @ 2007-03-05  9:45 UTC (permalink / raw)
  To: linux-kernel

This looks like -mm stuff if you want it in 2.6.22

-- 
Nicolas Mailhot


^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu   scheduler
@ 2007-03-11 22:29 bert hubert
  2007-03-11 22:57 ` Con Kolivas
  0 siblings, 1 reply; 69+ messages in thread
From: bert hubert @ 2007-03-11 22:29 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel

Con,

Recent kernel versions have real problems for me on the interactivity front,
with even a simple 'make' of my C++ program (PowerDNS) causing Firefox to
slow down to a crawl.

RSDL fixed all that, the system is noticeably snappier.

As a case in point, I used to notice when a compile was done because the
system stopped being sluggish.

Today, a few times, I only noticed 'make' was done because the fans of my
computer slowed down.

Thanks for the good work! I'm on 2.6.21-rc3-rsdl-0.29.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
@ 2007-03-08 14:27 Tim Tassonis
  0 siblings, 0 replies; 69+ messages in thread
From: Tim Tassonis @ 2007-03-08 14:27 UTC (permalink / raw)
  To: linux-kernel

Hi Con

Just also wanted to throw in my less than two cents: I applied the patch 
and also have the very strong subjective impression that my system 
"feels" much more responsive than with stock 2.6.20.

Thanks for the great work.

Bye
Tim

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
@ 2007-03-06  4:57 Shawn Starr
  0 siblings, 0 replies; 69+ messages in thread
From: Shawn Starr @ 2007-03-06  4:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Con Kolivas, Willy Tarreau

On Monday 05 March 2007 10:13, Willy Tarreau wrote:
> Con,
>
> I've now given it a try with HZ=250 on my dual-athlon. It works
> beautifully. I also quickly checked that playing mp3 doesn't skip during
> make -j4, and that gears runs fairly smoothly, since those are the
> references people often use.
>
> But with real work, it's excellent too. When I saturate my CPUs by
> injecting HTTP traffic on haproxy, the load is stable and the command line
> perfectly responsive, while in the past the load would oscillate and the
> command line sometimes stopped to respond for a few seconds.
>
> I've also launched my scheddos program (you may remember, the one we did a
> few experiments with). I could not cause any freeze at all. Plain 2.6.20
> had already improved a lot in this area, but above 4 processes/CPU,
> occasional short freezes did still occur. This time, even at 100 processes,
> the system was rather slow (of course!) but just as expected, and nothing
> more.
>
> I also tried the good old "dd if=/dev/zero bs=1|...|dd bs=1 of=/dev/null"
> and it did not cause any trouble.
>
> I will boot 2.6 slightly more often to test the code under various
> conditions, and I will recommend it to a few people I know who tend to
> switch back to 2.4 after one day full of 2.6 jerkiness.
>
> Overall, you have done a great job !
>
> I hope that more people will give it a try, first to help find possible
> remaining bugs, and to pronounce in favour of its inclusion in mainline.

Hi Con/Willy, 

Just to chime on this thread, I've been testing Con's scheduler patches since 
he mentioned them to me. I have to say for desktop performance all processes 
running remain responsive (although with more resource hungry processes 
things get slower but remain responsive, but expected).  I have also been 
testing his scheduler on my work power workstation running 4 VMs with 4GB of 
ram on a dual-core EM64T box, again RSDL performs rather well balancing VM 
resources even while I have xmms running, xchat, beryl and a bunch of other 
stuff.  There is total response to each process.

Thanks, 
Shawn.

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
@ 2007-03-04 20:35 Al Boldi
  2007-03-04 21:49 ` Con Kolivas
  2007-03-06  8:42 ` [ck] " Xavier Bestel
  0 siblings, 2 replies; 69+ messages in thread
From: Al Boldi @ 2007-03-04 20:35 UTC (permalink / raw)
  To: linux-kernel

Con Kolivas wrote:
> > >> >> >This message is to announce the first general public release of
> > >> >> > the "Rotating Staircase DeadLine" cpu scheduler.

Thanks a lot!

> Just to make it clear. The purpose of this scheduler is at all costs to
> maintain absolute fairness no matter what type of load it is put under.

Great!

> This means that if you heavily load up your machine without the use of
> 'nice' then your interactive tasks _will_ slow down proportionately to the
> amount of cpu you use. So doing make -j4 for example will make any other
> task started in taht presence run precisely 1/5th speed, but they will
> still be responsive, have low latency (and audio shouldn't skip for
> example).

That's just what it did, but when you "nice make -j4", things (gears) start 
to stutter.  Is that due to the staircase?


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 69+ messages in thread
* [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
@ 2007-03-04  7:00 Con Kolivas
  2007-03-04  7:45 ` [ck] " Con Kolivas
                   ` (4 more replies)
  0 siblings, 5 replies; 69+ messages in thread
From: Con Kolivas @ 2007-03-04  7:00 UTC (permalink / raw)
  To: linux kernel mailing list, ck list

This message is to announce the first general public release of the "Rotating 
Staircase DeadLine" cpu scheduler. 

Based on previous work from the staircase cpu scheduler I set out to design, 
from scratch, a new scheduling policy design which satisfies every 
requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task management.

Available for download are:

 A full rollup of the patch for 2.6.20:
http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch

 Split patches for 2.6.20(which will follow this email):
http://ck.kolivas.org/patches/staircase-deadline/split-out/

 The readme (which will also constitute the rest of this email):
http://ck.kolivas.org/patches/staircase-deadline/rsdl_scheduler.readme


The following readme is also included as documentation in 
Documentation/sched-design.txt


Rotating Staircase Deadline cpu scheduler policy
================================================

Design summary
==============

A novel design which incorporates a foreground-background descending priority
system (the staircase) with runqueue managed minor and major epochs (rotation
and deadline).


Features
========

A starvation free, strict fairness O(1) scalable design with interactivity
as good as the above restrictions can provide. There is no interactivity
estimator, no sleep/run measurements and only simple fixed accounting.
The design has strict enough a design and accounting that task behaviour
can be modelled and maximum scheduling latencies can be predicted by
the virtual deadline mechanism that manages runqueues. The prime concern
in this design is to maintain fairness at all costs determined by nice level,
yet to maintain as good interactivity as can be allowed within the
constraints of strict fairness.


Design description
==================

RSDL works off the principle of providing each task a quota of runtime that
it is allowed to run at each priority level equal to its static priority
(ie. its nice level) and every priority below that. When each task is queued,
the cpu that it is queued onto also keeps a record of that quota. If the
task uses up its quota it is decremented one priority level. Also, if the cpu
notices a quota full has been used for that priority level, it pushes
everything remaining at that priority level to the next lowest priority
level. Once every runtime quota has been consumed of every priority level,
a task is queued on the "expired" array. When no other tasks exist with
quota, the expired array is activated and fresh quotas are handed out. This
is all done in O(1).


Design details
==============

Each cpu has its own runqueue which micromanages its own epochs, and each
task keeps a record of its own entitlement of cpu time. Most of the rest
of these details apply to non-realtime tasks as rt task management is
straight forward.

Each runqueue keeps a record of what major epoch it is up to in the
rq->prio_rotation field which is incremented on each major epoch. It also
keeps a record of quota available to each priority value valid for that
major epoch in rq->prio_quota[].

Each task keeps a record of what major runqueue epoch it was last running
on in p->rotation. It also keeps a record of what priority levels it has
already been allocated quota from during this epoch in a bitmap p->bitmap.

The only tunable that determines all other details is the RR_INTERVAL. This
is set to 6ms (minimum on 1000HZ, higher at different HZ values).

All tasks are initially given a quota based on RR_INTERVAL. This is equal to
RR_INTERVAL between nice values of 0 and 19, and progressively larger for
nice values from -1 to -20. This is assigned to p->quota and only changes
with changes in nice level.

As a task is first queued, it checks in recalc_task_prio to see if it has
run at this runqueue's current priority rotation. If it has not, it will
have its p->prio level set to equal its p->static_prio (nice level) and will
be given a p->time_slice equal to the p->quota, and has its allocation
bitmap bit set in p->bitmap for its static priority (nice value). This
quota is then also added to the current runqueue's rq->prio_quota[p->prio].
It is then queued on the current active priority array.

If a task has already been running during this major epoch, if it has
p->time_slice left and the rq->prio_quota for the task's p->prio still
has quota, it will be placed back on the active array, but no more quota
will be added to either the task or the runqueue quota.

If a task has been running during this major epoch, but does not have
p->time_slice left or the runqueue's prio_quota for this task's p->prio
does not have quota, it will find the next lowest priority in its bitmap
that it has not been allocated quota from. It then gets the a full quota
in p->time_slice and adds that to the quota value for the relevant priority
rq->prio_quota. It is then queued on the current active priority array at
the newly determined lower priority.

If a task has been running during this major epoch, and does not have
any entitlement left in p->bitmap and no time_slice left, it will have its
bitmap cleared, and be queued at its p->static_prio again, but on the expired
priority array. No quota will be allocated until this task is scheduled.

When a task is queued, it has its static_prio bit set in the current
runqueue's rq->static_bitmap, and the relevant bit in the rq->dyn_bitmap.
In order to minimise the number of bitmap lookups, the bitmap of queued
tasks on the expired array is at the end of the same bitmap as the active
array. The number of tasks queued at the current static_prio is kept in
rq->prio_queued[].

During a scheduler_tick where a task is running, the p->time_slice is
decremented, and if it reaches zero then the recalc_task_prio is readjusted
and the task rescheduled.

During a task running tick, the runqueue prio_quota is also decremented. If
it empties then a priority rotation occurs (a major or minor epoch). If the
current runqueue's priority level is better than that of nice 19 tasks, a
minor rotation is performed, otherwise a major rotation will occur.

A minor rotation takes the remaining tasks at this priority level queue and
merges them with a list_splice_tail with the queue from the next lowest
priority level. At this time, any tasks that have been merged will now
have invalid values in p->prio so this must be considered when dequeueing
the task, and for testing for preemption.

A major rotation takes the remaining tasks at this priority level queue and
merges them with a list_splice_tail with the best priority task running on
the expired array, and swaps the priority arrays. The priority quotas are
reset at this time. Any tasks that have been merged will now have invalid
values in p->array and possibly p->prio so this must be considered. The
rq->prio_rotation is incremented at this time.

When a task is dequeued, the dyn_bitmap bit is unset only after testing
that the relevant queue is actually empty since p->prio may be inaccurate
and no hard accounting of the number of tasks at that level is possible.

When selecting a new task for scheduling, after the first dynamic bit is
found on the dyn_bitmap, it is checked to see that a task is really queued
at that priority or if it is a false positive due to the task being
dequeued at a time when its p->prio does not match which queue it is on
after some form of priority rotation. This is a rare occurrence as it tends
to only occur if a task that is already waiting on a runqueue gets dequeued.
If the bitmap value is in the expired array range, a major priority rotation
is performed. If the chosen task has not been running during this major or
minor rotation it has new quota allocated at this time, and added to the
runqueue's quota.


Modelling deadline behaviour
============================

As the accounting in this design is hard and not modified by sleep average
calculations or interactivity modifiers, it is possible to accurately
predict the maximum latency that a task may experience under different
conditions. This is a virtual deadline mechanism enforced by mandatory
runqueue epochs, and not by trying to keep complicated accounting of each
task.

The maximum duration a task can run during one major epoch is determined
by its nice value. Nice 0 tasks can run at 19 different priority levels
for RR_INTERVAL duration during each epoch (the equivalent of nice 0 to nice
19). Nice 10 tasks can run at 9 priority levels for each epoch, and so on.

Therefore the maximum duration a runqueue epoch can take is determined by
the number of tasks running, and their nice level. After that, the maximum
duration it can take before a task can wait before it get scheduled is
determined by the difference between its nice value and the nice value of
the highest priority task queued.

In the following examples, these are _worst case scenarios_ and would rarely
occur, but can be modelled nonetheless to determine the maximum possible
latency.

So for example, if two nice 0 tasks are running, and one has just expired as
another is activated for the first time receiving a full quota for this
runqueue rotation, the first task will wait:

nr_tasks * max_duration + nice_difference * rr_interval
1 * 19 * RR_INTERVAL + 0 = 114ms

In the presence of a nice 10 task, a nice 0 task would wait a maximum of
1 * 10 * RR_INTERVAL + 0 = 60ms

In the presence of a nice 0 task, a nice 10 task would wait a maximum of
1 * 19 * RR_INTERVAL + 9 * RR_INTERVAL = 168ms

Using a more complicated example, if there are 4 tasks running fully cpu
bound, one each at nice -20, nice 0, nice 10 and nice 19, we can calculate
the maximum latency possible for the nice 10 task. Note that -20 tasks are
heavily biased for so this will be a long time, but can be modelled.

The nice -20 task has quota = RR_INTERVAL + 20*RR_INTERVAL = 21*RR_INTERVAL.
It can run at 39 priority levels so its maximum duration =
39 * 21 * RR_INTERVAL.
The nice 0 task works out to
19 * RR_INTERVAL
The nice 19 task works out to
RR_INTERVAL.

So major epoch can take up a maximum of
39 * 21 * RR_INTERVAL + 19 * RR_INTERVAL + RR_INTERVAL = 1229 * RR_INTERVAL;

Then before the nice 10 task will run, the nice -20 and nice 0 task will
run for 28 * 21 * RR_INTERVAL and 9 * RR_INTERVAL respectively for a total
of 597 * RR_INTERVAL.

This means the maximum duration a nice 10 task can wait in the presence of
these other tasks is 1826*RR_INTERVAL. This is a long time of course and is
heavily penalised by the presence of nice -20 tasks which would not be part
of a normal environment.

While this section describes the maximum latency a task can have, this size
latencies will only be seen by fully cpu bound tasks.


Achieving interactivity
=======================

A requirement of this scheduler design was to achieve good interactivity
despite being a completely fair deadline based design. The disadvantage of
designs that try to achieve interactivity is that they usually do so at
the expense of maintaining fairness. As cpu speeds increase, the requirement
for some sort of metered unfairness towards interactive tasks becomes a less
desirable phenomenon, but low latency and fairness remains mandatory to
good interactive performance.

This design relies on the fact that interactive tasks, by their nature,
sleep often. Most fair scheduling designs end up penalising such tasks
indirectly giving them less than their fair possible share because of the
sleep, and have to use a mechanism of bonusing their priority to offset
this based on the duration they sleep. This becomes increasingly inaccurate
as the number of running tasks rises and more tasks spend time waiting on
runqueues rather than sleeping, and it is impossible to tell whether the
task that's waiting on a runqueue only intends to run for a short period and
then sleep again after than runqueue wait. Furthermore, all such designs rely
on a period of time to pass to accumulate some form of statistic on the task
before deciding on how much to give them preference. The shorter this period,
the more rapidly bursts of cpu ruin the interactive tasks behaviour. The
longer this period, the longer it takes for interactive tasks to get low
scheduling latencies and fair cpu.

This design does not measure sleep time at all. Interactive tasks that sleep
often will wake up having consumed very little if any of their quota for
the current major priority rotation. The longer they have slept, the less
likely they are to even be on the current major priority rotation. Once
woken up, though, they get to use up a their full quota for that epoch,
whether part of a quota remains or a full quota. Overall, however, they
can still only run as much cpu time for that epoch as any other task of the
same nice level. This means that two tasks behaving completely differently
from fully cpu bound to waking/sleeping extremely frequently will still
get the same quota of cpu, but the latter will be using its quota for that
epoch in bursts rather than continuously. This guarantees that interactive
tasks get the same amount of cpu as cpu bound ones.

The other requirement of interactive tasks is also to obtain low latencies
for when they are scheduled. Unlike fully cpu bound tasks and the maximum
latencies possible described in the modelling deadline behaviour section
above, tasks that sleep will wake up with quota available usually at the
current runqueue's priority_level or better. This means that the most latency
they are likely to see is one RR_INTERVAL, and often they will preempt the
current task if it is not of a sleeping nature. This then guarantees very
low latency for interactive tasks, and the lowest latencies for the least
cpu bound tasks.

Sunday, 4th March 2007
Con Kolivas

-- 
-ck

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2007-03-18  1:29 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-05  9:45 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Nicolas Mailhot
2007-03-05  9:53 ` Gene Heskett
2007-03-05 10:00   ` Nicolas Mailhot
2007-03-05 15:22   ` Paolo Ciarrocchi
2007-03-05 18:37     ` Gene Heskett
2007-03-05 18:20   ` Lee Revell
2007-03-05 19:19     ` Gene Heskett
2007-03-05 22:40       ` Andrew Morton
2007-03-05 23:19         ` Gene Heskett
2007-03-06  2:23       ` Ed Tomlinson
2007-03-06  2:54         ` Linus Torvalds
2007-03-06  3:36           ` Gene Heskett
2007-03-09  4:04           ` Bill Davidsen
2007-03-09  6:31             ` Linus Torvalds
2007-03-09  7:04               ` Bill Huey
2007-03-09 10:54               ` William Lee Irwin III
2007-03-09 14:54               ` Bill Davidsen
2007-03-09 18:11                 ` Linus Torvalds
2007-03-06 17:50   ` Bill Davidsen
2007-03-06 20:06     ` Con Kolivas
2007-03-09  4:21       ` Bill Davidsen
  -- strict thread matches above, loose matches on Subject: below --
2007-03-11 22:29 bert hubert
2007-03-11 22:57 ` Con Kolivas
2007-03-08 14:27 Tim Tassonis
2007-03-06  4:57 Shawn Starr
2007-03-04 20:35 Al Boldi
2007-03-04 21:49 ` Con Kolivas
     [not found]   ` <45EB45F7.3050208@simon.arlott.org.uk>
2007-03-04 22:27     ` Con Kolivas
2007-03-05 18:29       ` Simon Arlott
2007-03-05 21:36         ` Con Kolivas
2007-03-04 23:13   ` Willy Tarreau
2007-03-04 23:58     ` Con Kolivas
2007-03-05  1:09     ` Gene Heskett
2007-03-06  8:42 ` [ck] " Xavier Bestel
2007-03-06 15:15   ` Al Boldi
2007-03-11 18:11     ` Al Boldi
2007-03-11 21:52       ` Con Kolivas
2007-03-11 22:12         ` Con Kolivas
2007-03-12  4:42           ` Al Boldi
2007-03-12  4:53             ` Con Kolivas
2007-03-12 11:26               ` Al Boldi
2007-03-12 12:52                 ` Con Kolivas
2007-03-12 14:14                   ` Al Boldi
2007-03-12 14:58                     ` [ck] " jos poortvliet
2007-03-12 17:41                       ` Al Boldi
2007-03-12 18:05                     ` Con Kolivas
2007-03-18  1:30                   ` Bill Davidsen
2007-03-04  7:00 Con Kolivas
2007-03-04  7:45 ` [ck] " Con Kolivas
2007-03-04 14:04   ` Con Kolivas
2007-03-04 11:08 ` Gene Heskett
2007-03-04 11:47   ` Con Kolivas
2007-03-04 12:24     ` Gene Heskett
2007-03-04 12:46       ` Con Kolivas
2007-03-04 13:25         ` Gene Heskett
2007-03-04 13:49           ` Con Kolivas
2007-03-04 14:11             ` Gene Heskett
2007-03-05  2:31               ` Zwane Mwaikambo
2007-03-05  3:16                 ` Gene Heskett
2007-03-04 14:36             ` Willy Tarreau
2007-03-04 16:08               ` [ck] " jos poortvliet
2007-03-05 23:05                 ` Bill Davidsen
2007-03-06  0:18                   ` Con Kolivas
2007-03-06  4:41                     ` Willy Tarreau
2007-03-06  5:39                       ` Nicholas Miell
2007-03-06 19:04                       ` jos poortvliet
2007-03-06 21:37                       ` Bill Davidsen
2007-03-06 21:54                         ` Willy Tarreau
2007-03-05 21:52 ` Con Kolivas
2007-03-08  8:53 ` Ingo Molnar
2007-03-08 10:07   ` Con Kolivas
2007-03-08 20:25 ` Fabio Comolli
2007-03-08 20:57   ` Con Kolivas
2007-03-08 21:31     ` Fabio Comolli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).