linux-kernel.vger.kernel.org archive mirror
* RT task scheduling
@ 2006-04-06  3:25 Darren Hart
  2006-04-06  4:19 ` Peter Williams
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Darren Hart @ 2006-04-06  3:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin

My last mail specifically addresses preempt-rt, but I'd like to know people's 
thoughts regarding this issue in the mainline kernel.  Please see my previous 
post "realtime-preempt scheduling - rt_overload behavior" for a testcase that 
produces unpredictable scheduling results.

Part of the issue here is to define what we consider "correct behavior" for 
SCHED_FIFO realtime tasks.  Do we (A) need to strive for "strict realtime 
priority scheduling" where the NR_CPUS highest priority runnable SCHED_FIFO 
tasks are _always_ running?  Or do we (B) take the best-effort approach with 
an upper limit on RT priority imbalances, where an imbalance may occur (say at 
wakeup or exit) but will be remedied within 1 tick?  The smpnice patches 
improve load balancing, but don't provide (A).

More details in the previous mail...

Thanks,

--Darren

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-06  3:25 RT task scheduling Darren Hart
@ 2006-04-06  4:19 ` Peter Williams
  2006-04-06 17:24   ` Darren Hart
  2006-04-06  7:37 ` Ingo Molnar
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Peter Williams @ 2006-04-06  4:19 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Stultz, John, Siddha,
	Suresh B, Nick Piggin

Darren Hart wrote:
> My last mail specifically addresses preempt-rt, but I'd like to know people's 
> thoughts regarding this issue in the mainline kernel.  Please see my previous 
> post "realtime-preempt scheduling - rt_overload behavior" for a testcase that 
> produces unpredictable scheduling results.
> 
> Part of the issue here is to define what we consider "correct behavior" for 
> SCHED_FIFO realtime tasks.  Do we (A) need to strive for "strict realtime 
> priority scheduling" where the NR_CPUS highest priority runnable SCHED_FIFO 
> tasks are _always_ running?  Or do we (B) take the best effort approach with 
> an upper limit RT priority imbalances, where an imbalance may occur (say at 
> wakeup or exit) but will be remedied within 1 tick.  The smpnice patches 
> improve load balancing, but don't provide (A).
> 
> More details in the previous mail...

I'm currently researching some ideas to improve smpnice that may help in 
this situation.  The basic idea is that as well as trying to equally 
distribute the weighted load among the groups/queues we should also try 
to achieve equal "average load per task" for each group/queue.  (As well 
as helping with problems such as yours, this will help to restore the 
"equal distribution of nr_running" amongst groups/queues aim that is 
implicit without smpnice due to the fact that load is just a smoothed 
version of nr_running.)

In find_busiest_group(), I think that load balancing in the case where 
*imbalance is greater than busiest_load_per_task will tend towards this 
result and also when *imbalance is less than busiest_load_per_task AND 
busiest_load_per_task is less than this_load_per_task.  However, in the 
case where *imbalance is less than busiest_load_per_task AND 
busiest_load_per_task is greater than this_load_per_task this will not 
be the case as the amount of load moved from "busiest" to "this" will be 
less than or equal to busiest_load_per_task and this will actually 
increase the value of busiest_load_per_task.  So, although it will 
achieve the aim of equally distributing the weighted load, it won't help 
the second aim of equal "average load per task" values for groups/queues.

The obvious way to fix this problem is to alter the code so that more 
than busiest_load_per_task is moved from "busiest" to "this" in these 
cases while at the same time ensuring that the imbalance between their 
loads doesn't get any bigger.  I'm working on a patch along these lines.

Changes to find_idlest_group() and try_to_wake_up() taking into account 
the "average load per task" on the candidate queues/groups as well as 
their weighted loads may also help and I'll be looking at them as well. 
  It's not immediately obvious to me how this can be done so any ideas 
would be welcome.  It will likely involve taking the load weight of the 
waking task into account as well.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: RT task scheduling
  2006-04-06  3:25 RT task scheduling Darren Hart
  2006-04-06  4:19 ` Peter Williams
@ 2006-04-06  7:37 ` Ingo Molnar
  2006-04-06 14:55   ` Darren Hart
                     ` (4 more replies)
  2006-04-07  9:23 ` Bill Huey
  2006-04-09 13:16 ` Ingo Molnar
  3 siblings, 5 replies; 37+ messages in thread
From: Ingo Molnar @ 2006-04-06  7:37 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin


* Darren Hart <darren@dvhart.com> wrote:

> My last mail specifically addresses preempt-rt, but I'd like to know 
> people's thoughts regarding this issue in the mainline kernel.  Please 
> see my previous post "realtime-preempt scheduling - rt_overload 
> behavior" for a testcase that produces unpredictable scheduling 
> results.

the rt_overload feature i intend to push upstream-wards too, i just 
didnt separate it out of -rt yet.

"RT overload scheduling" is a totally orthogonal mechanism to the SMP 
load-balancer (and this includes smpnice too) that is more or less 
equivalent to having a 'global runqueue' for real-time tasks, without 
the SMP overhead associated with that. If there is no "RT overload" [the 
common case even on Linux systems that _do_ make use of RT tasks 
occasionally], the new mechanism is totally inactive and there's no 
overhead. But once there are more RT tasks than CPUs, the scheduler will 
do "global" decisions for what RT tasks to run on which CPU. To put even 
less overhead on the mainstream kernel, i plan to introduce a new 
SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
make much sense to extend SCHED_RR in that direction.]

my gut feeling is that it would be wrong to integrate this feature into 
smpnice: SCHED_FIFO is about determinism, and smpnice is a fundamentally 
statistical approach. Also, smpnice doesnt have to try as hard to pick 
the right task as rt_overload does, so there would be constant 
'friction' between "overhead" optimizations (dont be over-eager) and 
"latency" optimizations (dont be _under_-eager). So i'm quite sure we 
want this feature separate. [nevertheless i'd be happy to be proven wrong 
via some good and working smpnice based solution]

in any case, i'll check your -rt testcase to see why it fails.

	Ingo


* Re: RT task scheduling
  2006-04-06  7:37 ` Ingo Molnar
@ 2006-04-06 14:55   ` Darren Hart
  2006-04-06 18:16   ` Darren Hart
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Darren Hart @ 2006-04-06 14:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin

On Thursday 06 April 2006 00:37, Ingo Molnar wrote:
> * Darren Hart <darren@dvhart.com> wrote:
> > My last mail specifically addresses preempt-rt, but I'd like to know
> > people's thoughts regarding this issue in the mainline kernel.  Please
> > see my previous post "realtime-preempt scheduling - rt_overload
> > behavior" for a testcase that produces unpredictable scheduling
> > results.
>
> the rt_overload feature i intend to push upstream-wards too, i just
> didnt separate it out of -rt yet.

Great news!

>
> "RT overload scheduling" is a totally orthogonal mechanism to the SMP
> load-balancer (and this includes smpnice too) that is more or less
> equivalent to having a 'global runqueue' for real-time tasks, without
> the SMP overhead associated with that. If there is no "RT overload" [the
> common case even on Linux systems that _do_ make use of RT tasks
> occasionally], the new mechanism is totally inactive and there's no
> overhead. 

Agreed.  smpnice is geared toward load balancing (which implies an imbalance 
already exists).  In order to achieve "system wide strict realtime priority 
scheduling" we need to avoid the priority imbalance altogether.

> But once there are more RT tasks than CPUs, the scheduler will
> do "global" decisions for what RT tasks to run on which CPU. To put even
> less overhead on the mainstream kernel, i plan to introduce a new
> SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt
> make much sense to extend SCHED_RR in that direction.]

I agree that SCHED_RR doesn't need to be included here.  I'm not sure about 
another scheduling policy though.  As you said, the existing mechanism is 
inactive when nr_rt_tasks <= NR_CPUS; applications using more than that (with 
SCHED_FIFO) will likely want the rt_overload feature, since it's about 
predictability and determinism.  As it is now we are using POSIX standard 
scheduling policies - do we want to add a non-standard one?  I don't see the 
benefit.

> my gut feeling is that it would be wrong to integrate this feature into
> smpnice: SCHED_FIFO is about determinism, and smpnice is a fundamentally
> statistical approach. Also, smpnice doesnt have to try as hard to pick
> the right task as rt_overload does, so there would be constant
> 'friction' between "overhead" optimizations (dont be over-eager) and
> "latency" optimizations (dont be _under_-eager). So i'm quite sure we
> want this feature separate. [nevertheless i'd happy to be proven wrong
> via some good and working smpnice based solution]
>
> in any case, i'll check your -rt testcase to see why it fails.

Just so I am clear, is the goal then to achieve "system wide strict realtime 
priority scheduling", as opposed to a best effort?  I think this makes the 
most sense and it seems to me that the rt_overload mechanism is intended for 
just that.


Thanks,

--Darren


* Re: RT task scheduling
  2006-04-06  4:19 ` Peter Williams
@ 2006-04-06 17:24   ` Darren Hart
  2006-04-06 23:02     ` Peter Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Darren Hart @ 2006-04-06 17:24 UTC (permalink / raw)
  To: Peter Williams
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Stultz, John, Siddha,
	Suresh B, Nick Piggin

On Wednesday 05 April 2006 21:19, Peter Williams wrote:
> Darren Hart wrote:
> > My last mail specifically addresses preempt-rt, but I'd like to know
> > people's thoughts regarding this issue in the mainline kernel.  Please
> > see my previous post "realtime-preempt scheduling - rt_overload behavior"
> > for a testcase that produces unpredictable scheduling results.
> >
> > Part of the issue here is to define what we consider "correct behavior"
> > for SCHED_FIFO realtime tasks.  Do we (A) need to strive for "strict
> > realtime priority scheduling" where the NR_CPUS highest priority runnable
> > SCHED_FIFO tasks are _always_ running?  Or do we (B) take the best effort
> > approach with an upper limit RT priority imbalances, where an imbalance
> > may occur (say at wakeup or exit) but will be remedied within 1 tick. 
> > The smpnice patches improve load balancing, but don't provide (A).
> >
> > More details in the previous mail...
>
> I'm currently researching some ideas to improve smpnice that may help in
> this situation.  The basic idea is that as well as trying to equally
> distribute the weighted load among the groups/queues we should also try
> to achieve equal "average load per task" for each group/queue.  (As well
> as helping with problems such as yours, this will help to restore the
> "equal distribution of nr_running" amongst groups/queues aim that is
> implicit without smpnice due to the fact that load is just a smoothed
> version of nr_running.)

Can you elaborate on what you mean by "average load per task" ?  

Also, since smpnice is (correct me if I am wrong) load balancing, I don't 
think it will prevent the problem from happening, but rather fix it after it 
does.  If we want to prevent it from happening, I think we need to do 
something like the rt_overload code from the RT patchset.

>
> In find_busiest_group(), I think that load balancing in the case where
> *imbalance is greater than busiest_load_per_task will tend towards this
> result and also when *imbalance is less than busiest_load_per_task AND
> busiest_load_per_task is less than this_load_per_task.  However, in the
> case where *imbalance is less than busiest_load_per_task AND
> busiest_load_per_task is greater than this_load_per_task this will not
> be the case as the amount of load moved from "busiest" to "this" will be
> less than or equal to busiest_load_per_task and this will actually
> increase the value of busiest_load_per_task.  So, although it will
> achieve the aim of equally distributing the weighted load, it won't help
> the second aim of equal "average load per task" values for groups/queues.
>
> The obvious way to fix this problem is to alter the code so that more
> than busiest_load_per_task is moved from "busiest" to "this" in these
> cases while at the same time ensuring that the imbalance between their
> loads doesn't get any bigger.  I'm working on a patch along these lines.
>
> Changes to find_idlest_group() and try_to_wake_up() taking into account
> the "average load per task" on the candidate queues/groups as well as
> their weighted loads may also help and I'll be looking at them as well.
>   It's not immediately obvious to me how this can be done so any ideas
> would be welcome.  It will likely involve taking the load weight of the
> waking task into account as well.
>
> Peter


* Re: RT task scheduling
  2006-04-06  7:37 ` Ingo Molnar
  2006-04-06 14:55   ` Darren Hart
@ 2006-04-06 18:16   ` Darren Hart
  2006-04-06 22:35     ` Darren Hart
  2006-04-06 23:06   ` Peter Williams
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Darren Hart @ 2006-04-06 18:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin

On Thursday 06 April 2006 00:37, Ingo Molnar wrote:
> * Darren Hart <darren@dvhart.com> wrote:
> > My last mail specifically addresses preempt-rt, but I'd like to know
> > people's thoughts regarding this issue in the mainline kernel.  Please
> > see my previous post "realtime-preempt scheduling - rt_overload
> > behavior" for a testcase that produces unpredictable scheduling
> > results.
>
> the rt_overload feature i intend to push upstream-wards too, i just
> didnt separate it out of -rt yet.
>
> "RT overload scheduling" is a totally orthogonal mechanism to the SMP
> load-balancer (and this includes smpnice too) that is more or less
> equivalent to having a 'global runqueue' for real-time tasks, without
> the SMP overhead associated with that. If there is no "RT overload" [the
> common case even on Linux systems that _do_ make use of RT tasks
> occasionally], the new mechanism is totally inactive and there's no
> overhead. But once there are more RT tasks than CPUs, the scheduler will
> do "global" decisions for what RT tasks to run on which CPU. To put even
> less overhead on the mainstream kernel, i plan to introduce a new
> SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt
> make much sense to extend SCHED_RR in that direction.]
>
> my gut feeling is that it would be wrong to integrate this feature into
> smpnice: SCHED_FIFO is about determinism, and smpnice is a fundamentally
> statistical approach. Also, smpnice doesnt have to try as hard to pick
> the right task as rt_overload does, so there would be constant
> 'friction' between "overhead" optimizations (dont be over-eager) and
> "latency" optimizations (dont be _under_-eager). So i'm quite sure we
> want this feature separate. [nevertheless i'd happy to be proven wrong
> via some good and working smpnice based solution]
>
> in any case, i'll check your -rt testcase to see why it fails.

Just as an example, here is the output of a failing test case on a 4-way 
machine running 2.6.16-rt13 (a successful run would have a final ball 
position of 0).

[root@box sched_football]# ./sched_football 4 10
Starting 4 offense threads at priority 15
Starting 4 defense threads at priority 30
Starting referee thread
Game On (10 seconds)!
Game Over!
Final ball position: 5
[root@box sched_football]#

--Darren

>
> 	Ingo


* Re: RT task scheduling
  2006-04-06 18:16   ` Darren Hart
@ 2006-04-06 22:35     ` Darren Hart
  2006-04-07 22:58       ` Vernon Mauery
  0 siblings, 1 reply; 37+ messages in thread
From: Darren Hart @ 2006-04-06 22:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin

On Thursday 06 April 2006 11:16, you wrote:
> On Thursday 06 April 2006 00:37, Ingo Molnar wrote:
> > * Darren Hart <darren@dvhart.com> wrote:
> > > My last mail specifically addresses preempt-rt, but I'd like to know
> > > people's thoughts regarding this issue in the mainline kernel.  Please
> > > see my previous post "realtime-preempt scheduling - rt_overload
> > > behavior" for a testcase that produces unpredictable scheduling
> > > results.
> >
...

> > in any case, i'll check your -rt testcase to see why it fails.
>
> Just as an example, here is the output a failing test case on a 4way
> machine running 2.6.16-rt13 (a successful run would have a final ball
> position of 0).
>
> [root@box sched_football]# ./sched_football 4 10
> Starting 4 offense threads at priority 15
> Starting 4 defense threads at priority 30
> Starting referee thread
> Game On (10 seconds)!
> Game Over!
> Final ball position: 5
> [root@box sched_football]#
>
> --Darren


On a related note, I tried observing the rt stats in /proc/stat while running 
sched_football on 2.6.16-rt13.  The entire log is nearly a MB, so I placed it 
on my website for reference 
(http://www.dvhart.com/~dvhart/sched_football_stats.log); an excerpt follows:

---------------------

The following is the output of sched_football run with 1 thread for offense 
and 1 thread for defense on a 4 way machine.  The ball position is irrelevant 
in this case since there are more CPUs than threads (they should all be able 
to run).  What is disturbing is that over the entire run, I never see RT 
tasks on every CPU.  Even though there are usually 5 total runnable threads, 
we constantly see groupings of 2 and 3 on the runqueues while the others have 
no running rt tasks.

Looking back, I should have added a sleep to the loop - oops - still, I think 
the data is interesting and suggests a problem with scheduling RT tasks across 
all available CPUs.  Does this seem like a valid test to everyone?  Is there 
perhaps some explanation as to why this would be expected (e.g. when the cat 
process gets to read the proc information)?

# ./sched_football 1 60
Starting 1 offense threads at priority 15
Starting 1 defense threads at priority 30
Starting referee thread
Game On (60 seconds)!
Game Over!
Final ball position: 20359767

# while true; do clear; cat /proc/stat | grep rt >> sched_football_stats.log; 
done

sched_football_stats.log
------------------------------
rt_overload_schedule: 57768
rt_overload_wakeup:   157501
rt_overload_pulled:   13722934
rt_nr_running(0): 0
rt_nr_running(1): 0
rt_nr_running(2): 0
rt_nr_running(3): 0
rt_overload: 0
rt_overload_schedule: 57769
rt_overload_wakeup:   157514
rt_overload_pulled:   13722937
rt_nr_running(0): 0
rt_nr_running(1): 2
rt_nr_running(2): 3
rt_nr_running(3): 0
rt_overload: 2
...
rt_overload_schedule: 57774
rt_overload_wakeup:   157738
rt_overload_pulled:   13722941
rt_nr_running(0): 0
rt_nr_running(1): 2
rt_nr_running(2): 4
rt_nr_running(3): 0
rt_overload: 2
...
rt_overload_schedule: 57791
rt_overload_wakeup:   158650
rt_overload_pulled:   13722964
rt_nr_running(0): 0
rt_nr_running(1): 2
rt_nr_running(2): 0
rt_nr_running(3): 3
rt_overload: 2
...
rt_overload_schedule: 57808
rt_overload_wakeup:   166924
rt_overload_pulled:   13722973
rt_nr_running(0): 0
rt_nr_running(1): 0
rt_nr_running(2): 0
rt_nr_running(3): 2
rt_overload: 1
rt_overload_schedule: 57808
rt_overload_wakeup:   166927
rt_overload_pulled:   13722973
rt_nr_running(0): 0
rt_nr_running(1): 0
rt_nr_running(2): 0
rt_nr_running(3): 0
rt_overload: 0

---------------------


* Re: RT task scheduling
  2006-04-06 17:24   ` Darren Hart
@ 2006-04-06 23:02     ` Peter Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2006-04-06 23:02 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Stultz, John, Siddha,
	Suresh B, Nick Piggin

Darren Hart wrote:
> On Wednesday 05 April 2006 21:19, Peter Williams wrote:
>> Darren Hart wrote:
>>> My last mail specifically addresses preempt-rt, but I'd like to know
>>> people's thoughts regarding this issue in the mainline kernel.  Please
>>> see my previous post "realtime-preempt scheduling - rt_overload behavior"
>>> for a testcase that produces unpredictable scheduling results.
>>>
>>> Part of the issue here is to define what we consider "correct behavior"
>>> for SCHED_FIFO realtime tasks.  Do we (A) need to strive for "strict
>>> realtime priority scheduling" where the NR_CPUS highest priority runnable
>>> SCHED_FIFO tasks are _always_ running?  Or do we (B) take the best effort
>>> approach with an upper limit RT priority imbalances, where an imbalance
>>> may occur (say at wakeup or exit) but will be remedied within 1 tick. 
>>> The smpnice patches improve load balancing, but don't provide (A).
>>>
>>> More details in the previous mail...
>> I'm currently researching some ideas to improve smpnice that may help in
>> this situation.  The basic idea is that as well as trying to equally
>> distribute the weighted load among the groups/queues we should also try
>> to achieve equal "average load per task" for each group/queue.  (As well
>> as helping with problems such as yours, this will help to restore the
>> "equal distribution of nr_running" amongst groups/queues aim that is
>> implicit without smpnice due to the fact that load is just a smoothed
>> version of nr_running.)
> 
> Can you elaborate on what you mean by "average load per task" ?  

It's the total weighted load on a run group/queue divided by the 
nr_running for that group/queue.  If this is equal for all groups/queues 
and their total weighted loads are also equal, then the distribution of 
priorities in each group/queue should be similar, giving a high 
probability that (for an N CPU system) the N highest priority tasks will 
be on different CPUs and hence each will be the highest priority task on 
its CPU.  But these are just tendencies, not guarantees, as it's a 
statistical process rather than a deterministic one.

> 
> Also, since smpnice is (correct me if I am wrong) load_balancing, I don't 
> think it will prevent the problem from happening, but rather fix it when it 
> does.  If we want to prevent it from happening, I think we need to do 
> something like the rt_overload code from the RT patchset.

I agree.  Changes to smpnice (as described above) would help with this 
problem (i.e. they'll make the distribution of tasks TEND towards the 
desired state) but would not provide the necessary determinism.  I think 
special measures (such as rt_overload) are required if you want determinism.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: RT task scheduling
  2006-04-06  7:37 ` Ingo Molnar
  2006-04-06 14:55   ` Darren Hart
  2006-04-06 18:16   ` Darren Hart
@ 2006-04-06 23:06   ` Peter Williams
  2006-04-07  3:07   ` Bill Huey
  2006-04-08  0:11   ` Peter Williams
  4 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2006-04-06 23:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John, Siddha,
	Suresh B, Nick Piggin

Ingo Molnar wrote:
> * Darren Hart <darren@dvhart.com> wrote:
> 
>> My last mail specifically addresses preempt-rt, but I'd like to know 
>> people's thoughts regarding this issue in the mainline kernel.  Please 
>> see my previous post "realtime-preempt scheduling - rt_overload 
>> behavior" for a testcase that produces unpredictable scheduling 
>> results.
> 
> the rt_overload feature i intend to push upstream-wards too, i just 
> didnt separate it out of -rt yet.
> 
> "RT overload scheduling" is a totally orthogonal mechanism to the SMP 
> load-balancer (and this includes smpnice too) that is more or less 
> equivalent to having a 'global runqueue' for real-time tasks, without 
> the SMP overhead associated with that. If there is no "RT overload" [the 
> common case even on Linux systems that _do_ make use of RT tasks 
> occasionally], the new mechanism is totally inactive and there's no 
> overhead. But once there are more RT tasks than CPUs, the scheduler will 
> do "global" decisions for what RT tasks to run on which CPU. To put even 
> less overhead on the mainstream kernel, i plan to introduce a new 
> SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
> make much sense to extend SCHED_RR in that direction.]
> 
> my gut feeling is that it would be wrong to integrate this feature into 
> smpnice: SCHED_FIFO is about determinism, and smpnice is a fundamentally 
> statistical approach. Also, smpnice doesnt have to try as hard to pick 
> the right task as rt_overload does, so there would be constant 
> 'friction' between "overhead" optimizations (dont be over-eager) and 
> "latency" optimizations (dont be _under_-eager). So i'm quite sure we 
> want this feature separate. [nevertheless i'd happy to be proven wrong 
> via some good and working smpnice based solution]

I was thinking about this overnight and came to similar conclusions, 
i.e. for RT tasks it's really a problem of selecting the right CPU at 
wake-up time rather than a general load balancing problem.  The solution 
I thought of was different, though, and involved modifying wake_idle() 
so that, when the woken task is high priority, it looks for idle CPUs 
while also tracking the CPU whose current task has the lowest priority; 
if it can't find an idle CPU it returns that one.  The aim was to 
maximize the probability that the newly woken task goes straight onto a 
CPU (either by finding an idle one or by preemption).

Although aimed at this specific problem, this solution would also help 
smpnice to attain equal "average load per task" values for groups/queues 
which I think is a desirable secondary aim to equatable distribution of 
weighted load.  If both of these aims are met I think a natural outcome 
would be that the highest priority tasks are well distributed among the 
CPUs (but, as you imply, this would be a statistical trend rather than 
an deterministic).

In summary, I think that smpnice can be modified in ways that will help 
with this problem but if you need determinism then special measures are 
probably necessary.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: RT task scheduling
  2006-04-06  7:37 ` Ingo Molnar
                     ` (2 preceding siblings ...)
  2006-04-06 23:06   ` Peter Williams
@ 2006-04-07  3:07   ` Bill Huey
  2006-04-07  7:11     ` Ingo Molnar
  2006-04-08  0:11   ` Peter Williams
  4 siblings, 1 reply; 37+ messages in thread
From: Bill Huey @ 2006-04-07  3:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin

On Thu, Apr 06, 2006 at 09:37:53AM +0200, Ingo Molnar wrote:
> do "global" decisions for what RT tasks to run on which CPU. To put even 
> less overhead on the mainstream kernel, i plan to introduce a new 
> SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
> make much sense to extend SCHED_RR in that direction.]

Ingo,

You should consider allowing the binding of a thread to a CPU to 
determine the behavior of a SCHED_FIFO class task instead of creating a 
new run category.  Trying to anticipate the behavior of a thread via a 
kernel semantic moves control away from the applications using them and 
into the kernel, which is often wrong about those assumptions.

Con might have comments on the matter that would be useful.

The current IPI method you use to balance RT is fine.  I don't think it 
should be changed for the general SCHED_FIFO case, but I do think that 
binding a thread to a certain CPU, or set of CPUs, would simplify various 
control cases, where the default case is to rebalance immediately via an 
IPI.  Having a specific task run only on a specific CPU or CPU set 
should be the only mechanism to either "isolate" a CPU or to create a 
condition that's more deterministic than the global policy that -rt 
implements now.
 
This is consistent with how I would use it in an app and wouldn't create 
clutter in scheduling code that is already under minor refactoring.

> my gut feeling is that it would be wrong to integrate this feature into 
> smpnice: SCHED_FIFO is about determinism, and smpnice is a fundamentally 
> statistical approach. Also, smpnice doesnt have to try as hard to pick 
> the right task as rt_overload does, so there would be constant 
> 'friction' between "overhead" optimizations (dont be over-eager) and 
> "latency" optimizations (dont be _under_-eager). So i'm quite sure we 
> want this feature separate. [nevertheless i'd happy to be proven wrong 
> via some good and working smpnice based solution]

Please consider the ideas I've mentioned above and research how other
RTOSes deal with these critical issues.

Thanks

bill



* Re: RT task scheduling
  2006-04-07  3:07   ` Bill Huey
@ 2006-04-07  7:11     ` Ingo Molnar
  2006-04-07  8:39       ` Bill Huey
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-07  7:11 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> On Thu, Apr 06, 2006 at 09:37:53AM +0200, Ingo Molnar wrote:
> > do "global" decisions for what RT tasks to run on which CPU. To put even 
> > less overhead on the mainstream kernel, i plan to introduce a new 
> > SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
> > make much sense to extend SCHED_RR in that direction.]
> 
> You should consider for a moment to allow for the binding of a thread 
> to a CPU to determine the behavior of a SCHED_FIFO class task instead 
> of creating a new run category. [...]

That is already possible and has been possible for years.

	Ingo


* Re: RT task scheduling
  2006-04-07  7:11     ` Ingo Molnar
@ 2006-04-07  8:39       ` Bill Huey
  2006-04-07  9:11         ` Bill Huey
  2006-04-07  9:19         ` Ingo Molnar
  0 siblings, 2 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-07  8:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin

On Fri, Apr 07, 2006 at 09:11:25AM +0200, Ingo Molnar wrote:
> * Bill Huey <billh@gnuppy.monkey.org> wrote:
> 
> > On Thu, Apr 06, 2006 at 09:37:53AM +0200, Ingo Molnar wrote:
> > > do "global" decisions for what RT tasks to run on which CPU. To put even 
> > > less overhead on the mainstream kernel, i plan to introduce a new 
> > > SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
> > > make much sense to extend SCHED_RR in that direction.]
> > 
> > You should consider for a moment to allow for the binding of a thread 
> > to a CPU to determine the behavior of a SCHED_FIFO class task instead 
> > of creating a new run category. [...]
> 
> That is already possible and has been possible for years.

I know that this is already the case. What I'm saying is that the creation
of a new globally scheduled run class isn't necessary if you have a robust
thread-to-CPU binding mechanism; the key here is "robust". I'm suggesting
that you and the gang think in terms of that, in a first-class manner, for
app development instead of creating a new run category that would almost
certainly be abused by naive developers. IMO, the discussion should be about
that as well as setting aside a CPU for a dedicated task, popularly termed
"CPU isolation", that excludes any other task from running on it. This is
something that was used fairly heavily under IRIX and is highly useful in
RT development.

The RT rebalancing discussion should be oriented toward manual techniques
for dealing with this on a per-app basis, not automatic load balancing
or anything like that. IMO, going in this direction is basically
trying to solve the problem with the wrong tool set.

bill



* Re: RT task scheduling
  2006-04-07  8:39       ` Bill Huey
@ 2006-04-07  9:11         ` Bill Huey
  2006-04-07  9:19         ` Ingo Molnar
  1 sibling, 0 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-07  9:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Fri, Apr 07, 2006 at 01:39:31AM -0700, Bill Huey wrote:
[Me and Ingo's comments about creation of a new run class and thread
binding...]
> The RT rebalancing discussion should be oriented toward manual techniques
> for dealing with this on an app basis and not automatic load balancing
> stuff or anything like that. IMO, going down this direction is basically
> trying to solve a problem with the wrong tool set.

If some kind of automatic load balancing is the focus, then extending the
notion of a CPU package for multicore processors sharing the same cache
controller and memory would be a better track to take. I saw this in the
-mm tree over a year ago and I haven't looked at the scheduler code recently
to see if it made it into the mainline. If RT load balancing is to be
extended, it should take into consideration whether the migration of a
thread should go to a core that's closer or farther away in terms of
memory hierarchy, instead of just grabbing the first CPU running a non-RT
task and hijacking it to run that RT task.

bill



* Re: RT task scheduling
  2006-04-07  8:39       ` Bill Huey
  2006-04-07  9:11         ` Bill Huey
@ 2006-04-07  9:19         ` Ingo Molnar
  2006-04-07 10:39           ` Bill Huey
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-07  9:19 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> On Fri, Apr 07, 2006 at 09:11:25AM +0200, Ingo Molnar wrote:
> > * Bill Huey <billh@gnuppy.monkey.org> wrote:
> > 
> > > On Thu, Apr 06, 2006 at 09:37:53AM +0200, Ingo Molnar wrote:
> > > > do "global" decisions for what RT tasks to run on which CPU. To put even 
> > > > less overhead on the mainstream kernel, i plan to introduce a new 
> > > > SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
> > > > make much sense to extend SCHED_RR in that direction.]
> > > 
> > > You should consider for a moment to allow for the binding of a thread 
> > > to a CPU to determine the behavior of a SCHED_FIFO class task instead 
> > > of creating a new run category. [...]
> > 
> > That is already possible and has been possible for years.
> 
> I know that this is already the case. What I'm saying is that the 
> creation of new globally scheduled run case isn't necessarly if you 
> have a robust thread to CPU binding mechanism, [...]

-ENOPARSE. CPU binding brings with itself obvious disadvantages that 
some applications are not ready to pay. CPU binding restricts the 
scheduler from achieving best resource utilization. That may be fine for 
some applications, but is not good enough for a good number of 
applications. So in no way can any 'CPU binding mechanism' (which 
already exists in multiple forms) replace the need and desire for a 
globally scheduled class of RT tasks.

> [...] the key here is "robust". [...]

-ENOPARSE. CPU binding is CPU binding. Could you outline an example of a 
"non-robust" CPU binding solution?

	Ingo


* Re: RT task scheduling
  2006-04-06  3:25 RT task scheduling Darren Hart
  2006-04-06  4:19 ` Peter Williams
  2006-04-06  7:37 ` Ingo Molnar
@ 2006-04-07  9:23 ` Bill Huey
  2006-04-09 13:16 ` Ingo Molnar
  3 siblings, 0 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-07  9:23 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Wed, Apr 05, 2006 at 08:25:04PM -0700, Darren Hart wrote:
> Part of the issue here is to define what we consider "correct behavior" for 
> SCHED_FIFO realtime tasks.  Do we (A) need to strive for "strict realtime 
> priority scheduling" where the NR_CPUS highest priority runnable SCHED_FIFO 
> tasks are _always_ running?  Or do we (B) take the best effort approach with 
> an upper limit RT priority imbalances, where an imbalance may occur (say at 
> wakeup or exit) but will be remedied within 1 tick.  The smpnice patches 
> improve load balancing, but don't provide (A).

I regret getting into this discussion late, but it should always be (A)
if you're building a kernel for strict RT usage. (B) is for a system that's
more general purpose. It's not a "one policy fits all" kind of problem.

The search costs of (A) could be significant and may degrade system
performance. Optimizations for that case are for another discussion.

bill



* Re: RT task scheduling
  2006-04-07  9:19         ` Ingo Molnar
@ 2006-04-07 10:39           ` Bill Huey
  2006-04-07 10:51             ` Ingo Molnar
  2006-04-07 14:56             ` Darren Hart
  0 siblings, 2 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-07 10:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Fri, Apr 07, 2006 at 11:19:46AM +0200, Ingo Molnar wrote:
> -ENOPARSE. CPU binding brings with itself obvious disadvantages that 
> some applications are not ready to pay. CPU binding restricts the 
> scheduler from achieving best resource utilization. That may be fine for 
> some applications, but is not good enough for a good number of 
> applications. So in no way can any 'CPU binding mechanism' (which 
> already exists in multiple forms) replace the need and desire for a 
> globally scheduled class of RT tasks.

You're discussing a different problem than the one I'm talking about. Some
of it probably is terminology; some of it is actually multiple problems
that need to be separated out and discussed individually. I'm open to
any help on this matter in communicating my view on this topic.

I'm talking about two scenarios: (1) high-performance, strict RT usage
of SCHED_FIFO and friends, versus (2) a kernel with a more
general-purpose orientation that would be looser about rebalancing,
permitting an RT task of lower priority to run ahead of a higher-priority
one for performance's sake, whatever...

> > [...] the key here is "robust". [...]
> 
> -ENOPARSE. CPU binding is CPU binding. Could you outline an example of a 
> "non-robust" CPU binding solution?

We're having a terminology problem here... let's pull back a bit.

Any solution that depends on a general-purpose algorithm, heuristic or
anything of that nature (that is, any kind of load rebalancing)
may interfere with an RT app of type (1).

RT applications tend to want explicit control over the scheduling of
the system with as little interference from the kernel as possible. The
general-purpose policies (RT rebalancing) of the Linux kernel can impede
RT apps from getting at CPU time directly. If you have a SCHED_FIFO
task, it's by default globally rebalanced, right? Well, the run queue
search and IPI are going to add to the maximum deterministic response time
of that thread.

How do you avoid that? Well, give the RT app designer the option to
hot-wire a thread to a CPU so that the rebalancing logic never gets
triggered. That'll improve latency at the expense of rebalancing. This
is up to the discretion of the RT app designer, and that person is best
suited to decide where and when that RT thread will run. Current Linux
kernel policies meet at least part of this requirement fairly well.
This does not address the general case (2), which is a different problem.

Here I've thought of it in terms of letting a per-CPU binding suggest
to the scheduler what kind of rebalancing policy I want, versus letting
a run class determine it. In my opinion, the creation of SCHED_FIFO_GLOBAL
is unneeded because of it. What I meant by "robust" was that folks
should consider fully whether binding addresses the problem before
introducing a new run category. I'm asking "does this address the
problem entirely?"; if not, *then* consider a new run category, only after
binding fails. That's what I'm saying.

As a side note, consider extending this notion of thread binding so that
it excludes any other thread from running on a CPU that might make the
cache cold for that bound thread. That'll give maximum latency
performance (sub-microsecond). That's more thinking in terms of CPU
binding.

Does this help with the communication ?

bill



* Re: RT task scheduling
  2006-04-07 10:39           ` Bill Huey
@ 2006-04-07 10:51             ` Ingo Molnar
  2006-04-07 11:14               ` Bill Huey
  2006-04-07 14:56             ` Darren Hart
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-07 10:51 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> On Fri, Apr 07, 2006 at 11:19:46AM +0200, Ingo Molnar wrote:
> > -ENOPARSE. CPU binding brings with itself obvious disadvantages that 
> > some applications are not ready to pay. CPU binding restricts the 
> > scheduler from achieving best resource utilization. That may be fine for 
> > some applications, but is not good enough for a good number of 
> > applications. So in no way can any 'CPU binding mechanism' (which 
> > already exists in multiple forms) replace the need and desire for a 
> > globally scheduled class of RT tasks.
> 
> You're discussing a different problem than what I'm talking about. 
> [...]

no, i'm discussing precisely the point you raised:

>>> You should consider for a moment to allow for the binding of a 
>>> thread to a CPU to determine the behavior of a SCHED_FIFO class task 
>>> instead of creating a new run category. [...]

with the observation that 1) binding is already possible [so your 
suggestion is apparently knocking on open doors] 2) binding is a 
separate mechanism (not adequate for all workloads) and it is thus 
orthogonal to what i'm trying to achieve with the "RT overload" stuff.  
Really simple and straightforward observations i think.

	Ingo


* Re: RT task scheduling
  2006-04-07 10:51             ` Ingo Molnar
@ 2006-04-07 11:14               ` Bill Huey
  2006-04-07 11:29                 ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Bill Huey @ 2006-04-07 11:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Fri, Apr 07, 2006 at 12:51:41PM +0200, Ingo Molnar wrote:
> no, i'm discussing precisely the point you raised:

Oh boy.

> >>> You should consider for a moment to allow for the binding of a 
> >>> thread to a CPU to determine the behavior of a SCHED_FIFO class task 
> >>> instead of creating a new run category. [...]
> 
> with the observation that 1) binding is already possible [so your 
> suggestion is apparently knocking on open doors] 2) binding is a 
> separate mechanism (not adequate for all workloads) and it is thus 
> orthogonal to what i'm trying to achieve with the "RT overload" stuff.  
> Really simple and straightforward observations i think.

This is going to take some time to get the terminology right. It's
late now and I'll have to continue this tomorrow.

First things first: SCHED_FIFO_GLOBAL, as you want it in the mainline,
is the same thing as SCHED_FIFO in -rt, right?

bill



* Re: RT task scheduling
  2006-04-07 11:14               ` Bill Huey
@ 2006-04-07 11:29                 ` Ingo Molnar
  2006-04-07 22:18                   ` Bill Huey
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-07 11:29 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> > >>> You should consider for a moment to allow for the binding of a 
> > >>> thread to a CPU to determine the behavior of a SCHED_FIFO class task 
> > >>> instead of creating a new run category. [...]
> > 
> > with the observation that 1) binding is already possible [so your 
> > suggestion is apparently knocking on open doors] 2) binding is a 
> > separate mechanism (not adequate for all workloads) and it is thus 
> > orthogonal to what i'm trying to achieve with the "RT overload" stuff.  
> > Really simple and straightforward observations i think.

> First thing's first, SCHED_FIFO_GLOBAL for what you want in the main 
> line is the same thing as SCHED_FIFO in -rt, right ?

yes.

	Ingo


* Re: RT task scheduling
  2006-04-07 10:39           ` Bill Huey
  2006-04-07 10:51             ` Ingo Molnar
@ 2006-04-07 14:56             ` Darren Hart
  2006-04-07 21:06               ` Bill Huey
  1 sibling, 1 reply; 37+ messages in thread
From: Darren Hart @ 2006-04-07 14:56 UTC (permalink / raw)
  To: Bill Huey
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin

On Friday 07 April 2006 03:39, Bill Huey wrote:
> On Fri, Apr 07, 2006 at 11:19:46AM +0200, Ingo Molnar wrote:
> > -ENOPARSE. CPU binding brings with itself obvious disadvantages that
> > some applications are not ready to pay. CPU binding restricts the
> > scheduler from achieving best resource utilization. That may be fine for
> > some applications, but is not good enough for a good number of
> > applications. So in no way can any 'CPU binding mechanism' (which
> > already exists in multiple forms) replace the need and desire for a
> > globally scheduled class of RT tasks.
>
> You're discussing a different problem than what I'm talking about. Some
> of it probably is terminology, some it is actually multipule problems
> that need to be seperated out and discussed individually. I'm open for
> any help on this matter to communicate my view on this topic.
>
> I'm talking about two scenarios, (1) a high performance strict RT usage
> of SCHED_FIFO and friends verses (2) a kernel that might have a more
> general purpose orientation that would be looser about rebalancing,
> permitting some RT task of lower priority to run above a higher priority
> for performance sake, what ever...
>
> > > [...] the key here is "robust". [...]
> >
> > -ENOPARSE. CPU binding is CPU binding. Could you outline an example of a
> > "non-robust" CPU binding solution?
>
> We're having a terminology problem here... let's pull back a bit.
>
> Any solution that depends on a general purpose algorithm, heuristic or
> any of thing of that nature, that means any kind of load rebalancing,
> may interfere with RT app of type (1).

The rt-overload mechanism is distinct from load balancing.  RT overload 
attempts to make sure the NR_CPUS highest priority runnable tasks are running 
across the available CPUs.  This isn't load balancing; this is "system 
wide strict realtime priority scheduling" (SWSRPS).  This scheduling should 
take place even if there are 1000 non-RT tasks on CPU 0 and none on all the 
others.  The load balancer would be responsible for distributing those 1000 
non-RT tasks across all the CPUs.

>
> RT applications tend to want explicit control over the scheduling of
> the system with as little interference from the kernel as possible. The
> general purpose policies (RT rebalancing) of the Linux kernel can impede
> RT apps from getting at CPU time more directly. 

I don't feel that SWSRPS in any way interferes with realtime applications.  If 
an application does not explicitly set a CPU affinity, then the kernel should 
assume the task can run on any CPU and should make SWSRPS decisions 
accordingly.  In fact, in my experience, applications expect this type of 
scheduling and don't consider it an interference.

> If you have a SCHED_FIFO
> task, it's by default globally rebalanced, right ? Well, the run queue
> search and IPI is going to add to the maximum deterministic response time
> of that thread

Actually, SWSRPS is what makes the scheduling deterministic.  That 
determinism comes at a cost, but without it, it doesn't exist at all on an SMP 
machine.  So saying it "adds to the maximum deterministic response time" 
doesn't really make sense.

> . 
>
> How do you avoid that ? Well, give the RT app designer the decision to
> hot wire a thread to a CPU so that the rebalancing logic never gets
> triggered. That'll improve latency and the expense of rebalancing. This
> is up to the discretion of the RT app designer and that person is best
> suited to decide where and when that RT thread will run. Current Linux
> kernel policies are sufficient to meet at least part of this requirement
> fairly well. This does not address the general case (2) and is a
> different problem.

It's my impression (RT folks, please pipe up if it's wrong) that if you don't 
care about priority scheduling (i.e. it's OK for a lower-priority task to run 
while one with a higher priority sits waiting on another queue), then you 
don't use SCHED_FIFO.  So I don't think case (2) is really valid.

>
> Here I've though of it in terms of letting a per CPU binding suggest
> to the scheduler what kind of rebalancing policy I want verses letting
> a run class determine it. In my opinion, creation of SCHED_FIFO_GLOBAL
> is unneeded because of it. What I meant by "robust" was that folks
> should consider "fully" if binding address the solution or not before
> introducing a new run category. I'm asking "does this address the
> problem entirely ?" if not, *then* consider a new run category only after
> this fails. That's what I'm saying.
>
> As a side note, consider extended this notion of thread binding so that
> it excludes any other thread from running on a CPU that might make the
> cache cold for that bounded thread. That'll give maximum latency
> performance (sub-microsecond). That's more thinking in terms of CPU
> binding.

You've made a lot of references to binding tasks to CPUs (or forbidding them, 
which is essentially a bind to the non-forbidden CPUs).  While application 
developers can certainly take this approach, the Linux kernel should schedule 
realtime tasks globally according to priority in the generic case.

--Darren



* Re: RT task scheduling
  2006-04-07 14:56             ` Darren Hart
@ 2006-04-07 21:06               ` Bill Huey
  2006-04-07 22:37                 ` Darren Hart
  2006-04-08  7:16                 ` Ingo Molnar
  0 siblings, 2 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-07 21:06 UTC (permalink / raw)
  To: Darren Hart
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Fri, Apr 07, 2006 at 07:56:20AM -0700, Darren Hart wrote:
> The rt-overload mechanism is distinct from load balancing.  RT overload 
> attempts to make sure the NR_CPUS highest priority runnable tasks are running 
> on each of the available CPUs.  This isn't load balancing, this is "system 
> wide strict realtime priority scheduling" (SWSRPS).  This scheduling should 
> take place even if there are 1000 non RT tasks on CPU 0 and none on all the 
> others.  The load balancer would be responsible for distributing those 1000 
> non rt tasks to all the CPUs.

I'm quite aware of what you're saying, as well as much of the contents of
the -rt patch. Please don't assume that I'm not aware of this. The -rt patch
doesn't do SWSRPS, but it could with more expensive checking. I'm not saying
anything against it. In fact, I'm clearly for it, as my other posts show.
Please read my other posts.

> > RT applications tend to want explicit control over the scheduling of
> > the system with as little interference from the kernel as possible. The
> > general purpose policies (RT rebalancing) of the Linux kernel can impede
> > RT apps from getting at CPU time more directly. 
> 
> I don't feel that SWSRPS in anyway interferes with realtime applications.  If 
> an application does not explicitly set a cpu affinity, then the kernel should 
> assume the task can run on any CPU and should make SWSRPS decisions 
> accordingly.  In fact, in my experience, applications expect this type of 
> scheduling - and don't consider it an interference.

It will, with the IPI forcing the rescheduling. I've seen up to 20 usec hits
for the IPI on P3 hardware. Some RT apps might like tighter guarantees. That
extends the maximum deterministic latency by the time it takes to do the IPI
along with the response.

> Actually the SWSRPS is what makes the scheduling deterministic.  That 
> determinism comes at a cost, but without it it doesn't exist at all on an SMP 
> machine.  So saying it "adds to the maximum deterministic response time" 
> doesn't really make any sense.

Above comments...

> It's my impression (RT folks please pipe up if it's wrong) that if you don't 
> care about priority scheduling (i.e. it's OK for a lower priority task to run 
> while one with a higher priority sits waiting on another queue), then you 
> don't use SCHED_FIFO.  So I don't think case 2 is really valid.

Yeah, this is not good. There's a serious communication disconnect here and
it's not going to be easily bridged. Please read what I said and think about
the usage scenarios I've mentioned more carefully. The -rt patch already
has this kind of mechanism, whether you're aware of it or not (unless something's
changed since I last looked). It's not as aggressive about SWSRPS as what
you're describing, but it serves things nicely for now.

Do you understand this? The -rt patch has yet to be targeted at doing strict
RT work, and your suggestions/patches might be one step closer to that goal. The
counter-effect is that you're going to affect the general-case latency
of SCHED_FIFO via that rescheduling hit "all of the time", with IPIs and
such. I'm 'ok' with that. In fact, that's what I want. I'm not ok with Ingo
creating a new scheduling class to get around that aggressive rebalancing
policy while preserving the current SCHED_FIFO behavior in the mainline. It's a hack
IMO, and I'll bet you that nobody cared about that behavior in the first place.

I'd much rather see some kind of SWSRPS go into the mainline kernel (RT
scheduling is pretty goofy already, so folks probably won't mind SWSRPS replacing
it) and let thread binding to a CPU restore any loss of latency by bypassing
the SWSRPS logic. Are you tracking me? My concerns here are the latency
and overhead of SWSRPS, and they are definitely significant.

> You've made a lot of references to binding tasks to CPUs (or forbidding them, 
> which is essentially a bind to non-forbidden CPUs).  While application 
> developers can certainly take this approach, the linux kernel should schedule 
> realtime tasks globally according to priority in the generic case.

For a default RT-oriented kernel, yes, I agree, but that's not what's in -rt,
and it serves the purpose for now. Looser scheduling constraints aren't really
affecting the current crop of RT applications, and that's ok since it's a bit
faster than a pure SWSRPS solution. For those apps, having a more general,
looser semantic doesn't affect things dramatically and might even be useful.

bill



* Re: RT task scheduling
  2006-04-07 11:29                 ` Ingo Molnar
@ 2006-04-07 22:18                   ` Bill Huey
  0 siblings, 0 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-07 22:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Fri, Apr 07, 2006 at 01:29:56PM +0200, Ingo Molnar wrote:
> * Bill Huey <billh@gnuppy.monkey.org> wrote:
> > First thing's first, SCHED_FIFO_GLOBAL for what you want in the main 
> > line is the same thing as SCHED_FIFO in -rt, right ?
> 
> yes.

Ok, good, we're getting somewhere. IMO, SCHED_FIFO_GLOBAL doesn't add
anything to existing Linux scheduling policies that makes it really
distinct from SCHED_FIFO. I'd much rather see that feature collapsed
into SCHED_FIFO intrinsically. In fact, that's the way it should be in
a solid RTOS. Creating run categories that are so similar doesn't
really add to the "meaning" of the system; since SCHED_FIFO was kind
of hammered in the first place, just make SCHED_FIFO do that strict
priority stuff across all processors as a default property.

Whether this belongs in the mainline or not is questionable. My guess
is probably not. But I definitely think it should go into -rt, and it
would be much more warmly received by folks developing on that kernel
patch to know that SCHED_FIFO has this strict behavior. It's actually
needed in that patch, IMO. Following?

bill



* Re: RT task scheduling
  2006-04-07 21:06               ` Bill Huey
@ 2006-04-07 22:37                 ` Darren Hart
  2006-04-07 23:36                   ` Bill Huey
  2006-04-08  7:16                 ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Darren Hart @ 2006-04-07 22:37 UTC (permalink / raw)
  To: Bill Huey
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin

On Friday 07 April 2006 14:06, Bill Huey wrote:
> On Fri, Apr 07, 2006 at 07:56:20AM -0700, Darren Hart wrote:
> > The rt-overload mechanism is distinct from load balancing.  RT overload
> > attempts to make sure the NR_CPUS highest priority runnable tasks are
> > running on each of the available CPUs.  This isn't load balancing, this
> > is "system wide strict realtime priority scheduling" (SWSRPS).  This
> > scheduling should take place even if there are 1000 non RT tasks on CPU 0
> > and none on all the others.  The load balancer would be responsible for
> > distributing those 1000 non rt tasks to all the CPUs.
>
> I'm quite aware of what you're saying as well as a much of the contents of
> the -rt patch. Please don't assume that I'm not aware of this. The -rt
> patch doesn't do SWSRPS, but it could with more expensive checking. I'm not
> saying anything against it. In fact, I'm clearly for it if you read the
> other posts. Please read my other posts.


OK, first off: let's assume we both read each other's posts before 
commenting :-)  I'll try to address some of the areas where we seem to be 
talking past each other; please let me know if we agree on the following:

First, when I refer to "System Wide Strict Realtime Priority 
Scheduling" (SWSRPS), I mean it in the absolute sense.  There is no "nearly" 
or "almost"; it is either SWSRPS or it is not.  Everything else is just load 
balancing (relaxed, aggressive, whatever).  Since it's my term, I get to 
define it ;-) (and if anyone can come up with a better term, please do; 
SWSRPS is admittedly really lame)

My first question to the community was: where do we want to end up?  RT 
scheduling on SMP is relatively new, and some RT specs don't even address it 
(http://www.rtsj.org for example; see 
http://www.rtsj.org/specjavadoc/sched_overview-summary.html, section 
"Semantics and Requirements Governing the Base Scheduler").  It is my belief 
(and I think you agree?) that because realtime tasks expect deterministic 
behavior, there should be support for SWSRPS in the kernel.

There will be some overhead involved in the SWSRPS implementation; we want 
to minimize that.  The current attempts use an IPI, which is higher overhead 
than would be ideal.  You mentioned 20 us; less than 10 us would be better 
(subject to hardware limitations, of course).

CPU binding can be used by the application developer to fine-tune a complex 
realtime application and avoid some of the overhead involved with SWSRPS.  
When I read your comments, it sounds like you are presenting this as a new 
feature, which, as you know, it isn't, so can you rephrase what you mean by 
this?


>
> > > RT applications tend to want explicit control over the scheduling of
> > > the system with as little interference from the kernel as possible. The
> > > general purpose policies (RT rebalancing) of the Linux kernel can
> > > impede RT apps from getting at CPU time more directly.
> >
> > I don't feel that SWSRPS in anyway interferes with realtime applications.
> >  If an application does not explicitly set a cpu affinity, then the
> > kernel should assume the task can run on any CPU and should make SWSRPS
> > decisions accordingly.  In fact, in my experience, applications expect
> > this type of scheduling - and don't consider it an interference.
>
> It will with the IPI forcing the rescheduling. I've seen up to 20 usec hits
> for the IPI on p3 hardware. Some RT might like tighter guarantees. That
> extends the maxium deterministic latency by the time that it does IPI an
> along with the response.

I'm not arguing that SWSRPS won't impose some overhead; it definitely will.  
I'd like to see better than 20 us as well.  My point was that without it, we 
don't have determinism, and that's a "bad thing".  (We don't have SWSRPS now, 
I understand that, but there is work towards it.)

>
> > Actually the SWSRPS is what makes the scheduling deterministic.  That
> > determinism comes at a cost, but without it it doesn't exist at all on an
> > SMP machine.  So saying it "adds to the maximum deterministic response
> > time" doesn't really make any sense.
>
> Above comments...
>
> > It's my impression (RT folks please pipe up if it's wrong) that if you
> > don't care about priority scheduling (i.e. it's OK for a lower priority
> > task to run while one with a higher priority sits waiting on another
> > queue), then you don't use SCHED_FIFO.  So I don't think case 2 is really
> > valid.
>
> Yeah, this is not good. There's a serious communication disconnect here and
> it's not going to be easily bridged. Please read what I said and think
> about the usage scenarios that I've mentioned more carefully. The -rt patch
> already has this kind of mechanism whether you're aware of this or not
> (unless something's changed since I last looked). It's not as aggressively
> doing SWSRPS as what you're saying but it serves things nicely for now.

Which mechanism are you referring to?  Do you mean rt_overload?  or the 
load_balancer maybe?  Either way, they do not achieve SWSRPS, but I think the 
rt_overload code is close.

>
> Do you understand this ? The -rt patch has yet to be target for a doing
> strict RT work and your suggestion/patches might be one step closer to that
> goal. 

I think I've addressed that above...

> The counter effect to that is that you're going to effect the general 
> case latency case with SCHED_FIFO via that rescheduling hit "all of the
> time" with IPIs and stuff. I'm 'ok' with that. In fact, that's what I want.

Well, the rt_overload code doesn't kick in unless one or more runqueues have 
two or more runnable RT tasks.  So it won't be "all the time" - but certainly 
most of the time on any SMP machine running a serious RT application.  And we 
agree, it will impose some overhead - but we want to minimize that.  Perhaps 
some kind of a runtime switch to enable the rt_overload code would be 
appropriate?
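To make that trigger condition concrete, here is a toy userspace model in plain C - made-up struct and field names, not the actual -rt implementation:

```c
#include <assert.h>

#define NR_CPUS 4

/* Toy model of a per-CPU runqueue: only the RT bookkeeping we need. */
struct toy_rq {
	int rt_nr_running;	/* runnable SCHED_FIFO tasks on this CPU */
	int highest_prio;	/* priority of the best of them, -1 if none */
};

/* A runqueue is "RT overloaded" once it holds two or more runnable RT
 * tasks: one of them is running, so at least one is waiting behind it. */
static int rt_overloaded(const struct toy_rq *rq)
{
	return rq->rt_nr_running >= 2;
}

/* A rebalance is worthwhile when some queue is overloaded while another
 * CPU is running no RT task at all. */
static int should_rebalance(const struct toy_rq rqs[NR_CPUS])
{
	int overloaded = 0, rt_idle = 0;

	for (int i = 0; i < NR_CPUS; i++) {
		if (rt_overloaded(&rqs[i]))
			overloaded = 1;
		if (rqs[i].rt_nr_running == 0)
			rt_idle = 1;
	}
	return overloaded && rt_idle;
}
```

This is also the shape visible in the /proc/stat excerpts elsewhere in the thread, where rt_overload reads 2 while two runqueues hold groups of RT tasks and two hold none.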

> I'm not ok with Ingo creating a new scheduling class to get around that
> aggressive rebalancing policy preserving the current SCHED_FIFO behavior in
> the main line. It's a hack IMO and I'll bet you that nobody cared about
> that behavior in the first place.

I'd prefer not to add another scheduling policy.  Although some might argue 
that a runtime switch for rt_overload is effectively the same thing... I 
might even argue that :-)

>
> I'd much rather see that some kind of SWSRPS go into the main line kernel
> (RT scheduling is pretty goofy already so folks probably won't mind SWSRPS
> replacing it) 

I agree, and I think Ingo does too; he mentioned wanting to push rt_overload to 
mainline.  Ingo?

> and let thread binding to a CPU restore any loss of latency 
> by bypassing the SWSRPS logic. Are you tracking me ? My concerns here are
> the latency and overhead of SWSRPS and they are definitely significant.

So if an application binds an RT task to a CPU it needs to be excluded from 
some of the SWSRPS logic in order to reduce overhead?  It sounds nice, but 
I'm not sure it's possible unless all the tasks are bound to CPUs.  Take a 4 
CPU system.  Task A is bound to CPU0.  3 other RT tasks are about to become 
runnable; the SWSRPS logic will still have to account for the RT task on CPU0 
so it doesn't get bumped.  What did you have in mind?

>
> > You've made a lot of references to binding tasks to CPUs (or forbidding
> > them, which is essentially a bind to non-forbidden CPUs).  While
> > application developers can certainly take this approach, the linux kernel
> > should schedule realtime tasks globally according to priority in the
> > generic case.
>
> For a default RT oriented kernel, yes, I agree, but that's not what's in
> -rt and it serves the purpose for now. Looser scheduling constraints aren't
> really affecting the current crop of RT applications and that's ok since
> it's a bit faster than a pure SWSRPS solution. For those apps having a more
> general, looser, case semantic doesn't affect things dramatically and might
> even be useful.

OK so you said earlier that:

	> The counter effect to that is that you're going to effect the general 
	> case latency case with SCHED_FIFO via that rescheduling hit "all of 
	> the time" with IPIs and stuff. I'm 'ok' with that. In fact, that's 
	> what I want. 

These two paragraphs appear contradictory to me; it's unclear what you are 
trying to accomplish.  Would you like to see SWSRPS in the mainline 
kernel or not?

Has this cleared some things up?  If not, let me know what else needs 
clarification.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Realtime Linux Team

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-06 22:35     ` Darren Hart
@ 2006-04-07 22:58       ` Vernon Mauery
  0 siblings, 0 replies; 37+ messages in thread
From: Vernon Mauery @ 2006-04-07 22:58 UTC (permalink / raw)
  To: Darren Hart
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin

On Thursday 06 April 2006 15:35, you wrote:
> On Thursday 06 April 2006 11:16, you wrote:
> > On Thursday 06 April 2006 00:37, Ingo Molnar wrote:
> > > * Darren Hart <darren@dvhart.com> wrote:

[snip]

> On a related note, I tried observing the rt stats in /proc/stats while
> running sched_football on 2.6.16-rt13.  The entire log is nearly a MB so I
> placed it on my website for reference
> (http://www.dvhart.com/~dvhart/sched_football_stats.log), an excerpt
> follows:
>
> ---------------------
>
> The following is the output of sched_football run with 1 thread for offense
> and 1 thread for defense on a 4 way machine.  The ball position is
> irrelevant in this case since there are more CPUs than threads (they should
> all be able to run).  What is disturbing is that over the entire run, I
> never see RT tasks on every CPU.  Even though there are usually 5 total
> runnnable threads, we constantly see groupings of 2 and 3 on the runqueues
> while the others have no running rt tasks.
>
> Looking back, I should have added a sleep to the loop - oops - still, I
> think the data is interesting and suggests a problem with scheduling RT
> tasks across all available CPUs.  Does this seem like a valid test to
> everyone?  Is there perhaps some explanation as to why this would be
> expected (when the cat process gets to read the proc information or
> something)?

A better way to do the logging loop would be to write a program that simply 
reads the contents of /proc/stat, filters for the rt lines, and writes them 
out to your log, and to run that program at a high realtime priority.  That 
way, the logging task will be 
(more) sure to be capturing the data for the other 3 CPUs constantly and get 
a clearer picture of what is happening.  Naturally, the monitoring task would 
interfere with the execution of the rest of the football game, but it should 
just hog the single CPU while the rest of the game could play out on the 
remaining three.  And from the logs below, it seemed to only use two CPUs 
part of the time anyway.

Having a constantly running logger would also avoid all the forky-execy 
overhead the bash loop has.

--Vernon

> # ./sched_football 1 60
> Starting 1 offense threads at priority 15
> Starting 1 defense threads at priority 30
> Starting referee thread
> Game On (60 seconds)!
> Game Over!
> Final ball position: 20359767
>
> # while true; do clear; cat /proc/stat | grep rt >>
> sched_football_stats.log; done
>
> sched_football_stats.log
> ------------------------------
> rt_overload_schedule: 57768
> rt_overload_wakeup:   157501
> rt_overload_pulled:   13722934
> rt_nr_running(0): 0
> rt_nr_running(1): 0
> rt_nr_running(2): 0
> rt_nr_running(3): 0
> rt_overload: 0
> rt_overload_schedule: 57769
> rt_overload_wakeup:   157514
> rt_overload_pulled:   13722937
> rt_nr_running(0): 0
> rt_nr_running(1): 2
> rt_nr_running(2): 3
> rt_nr_running(3): 0
> rt_overload: 2
> ...
> rt_overload_schedule: 57774
> rt_overload_wakeup:   157738
> rt_overload_pulled:   13722941
> rt_nr_running(0): 0
> rt_nr_running(1): 2
> rt_nr_running(2): 4
> rt_nr_running(3): 0
> rt_overload: 2
> ...
> rt_overload_schedule: 57791
> rt_overload_wakeup:   158650
> rt_overload_pulled:   13722964
> rt_nr_running(0): 0
> rt_nr_running(1): 2
> rt_nr_running(2): 0
> rt_nr_running(3): 3
> rt_overload: 2
> ...
> rt_overload_schedule: 57808
> rt_overload_wakeup:   166924
> rt_overload_pulled:   13722973
> rt_nr_running(0): 0
> rt_nr_running(1): 0
> rt_nr_running(2): 0
> rt_nr_running(3): 2
> rt_overload: 1
> rt_overload_schedule: 57808
> rt_overload_wakeup:   166927
> rt_overload_pulled:   13722973
> rt_nr_running(0): 0
> rt_nr_running(1): 0
> rt_nr_running(2): 0
> rt_nr_running(3): 0
> rt_overload: 0
>
> ---------------------
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-07 22:37                 ` Darren Hart
@ 2006-04-07 23:36                   ` Bill Huey
  2006-04-08  3:01                     ` Steven Rostedt
  0 siblings, 1 reply; 37+ messages in thread
From: Bill Huey @ 2006-04-07 23:36 UTC (permalink / raw)
  To: Darren Hart
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Fri, Apr 07, 2006 at 03:37:37PM -0700, Darren Hart wrote:
> OK, first off - let's assume we both read each other's posts before 
> commenting :-)  I'll try to address some of the areas where we seem to be 
> talking past each other; please let me know if we agree on the following:

Right, let's try that, please.

> First, when I refer to "System Wide Strict Realtime Priority 
> Scheduling" (SWSRPS) I mean it in the absolute sense.  There is no "nearly" 
> or "almost", it is either SWSRPS or it is not.  Everything else is just load 
> balancing (relaxed, aggressive, whatever).  Since it's my term, I get to 
> define it ;-) (and if anyone can come up with a better term, please do - 
> SWSRPS is admittedly really lame)

I agree with that terminology. I would have just said "strict priority
across all processors". RTOS systems don't typically have interactive
scheduling policies so that stuff is never even in consideration. In fact,
I'm completely ignoring that in my posts. It doesn't exist in my mental
landscape. I don't care about the load balancer at all, really. RT
scheduling policies are quite different and serve apps in very different
ways, like bandwidth schedulers, etc...

That's why sched-plugin is potentially so important for -rt, since the
current crop of -rt scheduling policies are very basic. It's also
important because I don't want somebody's research scheduler included
in the main line either, or to have to deal with churn that forces
time-consuming, prohibitive forward porting of scheduler patches.
Different topic though.

> My first question to the community was where do we want to end up?  RT 
> scheduling on SMP is relatively new and some RT specs don't even address it 
> (http://www.rtsj.org for example, see 
> http://www.rtsj.org/specjavadoc/sched_overview-summary.html section 
> "Semantics and Requirements Governing the Base Scheduler").  It is my belief 
> (and I think you agree?) that because Realtime tasks expect deterministic 
> behavior, there should be support for SWSRPS in the kernel.

Not in the main line kernel, but definitely in -rt. The main line kernel
is very far off from an RTOS and really is a place that doesn't respect
this strict priority policy. It's needed in -rt, arguably badly.

> There will be some overhead involved with the SWSRPS implementation, and we 
> want to minimize it.  The current attempts use IPIs, which impose more overhead 
> than would be ideal.  You mentioned 20 us; less than 10 us would be better 
> (subject to hardware limitations, of course).

Better to bypass it completely for 0 nanoseconds for some special app cases.

> CPU binding can be used by the application developer to fine tune a complex 
> realtime application and avoid some of the overhead involved with SWSRPS.  
> When I read your comments it sounds like you are mentioning this as a new 
> feature, which as you know it isn't - so can you rephrase what you mean by 
> this?

I'm saying stop using a general purpose mechanism for a situation that
isn't general purpose. The high end of RT apps will do all sorts of tricks
to get that extra performance. What I'm saying to you "KERNEL FOLKS" is to
understand the programming issues behind these RT apps. It's not hard.

I'm having trouble getting folks to understand this problem in a
comprehensive manner. It's not one magic technique, it's a series of
them working together for a particular end goal. Let me find some
papers...linuxdevices.com. The mentality for this discussion is
completely off which is why I'm getting frustrated with it.

> I'm not arguing that SWSRPS won't impose some overhead; it definitely will.  
> I'd like to see better than 20us as well.  My point was that without it, we don't 
> have determinism, and that's a "bad thing".  (We don't have SWSRPS now, I 
> understand that, but there is work towards it.)

Yes, I agree, but the only one ready for it is the -rt patch set.

> > Yeah, this is not good. There's a serious communication disconnect here and
> > it's not going to be easily bridged. Please read what I said and think
> > about the usage scenarios that I've mentioned more carefully. The -rt patch
> > already has this kind of mechanism whether you're aware of this or not
> > (unless something's changed since I last looked). It's not as aggressively
> > doing SWSRPS as what you're saying but it serves things nicely for now.
> 
> Which mechanism are you referring to?  Do you mean rt_overload?  or the 
> load_balancer maybe?  Either way, they do not achieve SWSRPS, but I think the 
> rt_overload code is close.

Yeah, rt_overload or whatever Ingo calls it. It's "ok" for now. That
means it makes some decisions about distributing the RT load across
CPUs, which at least allows it to run, but it isn't strictly obeying
priority across processors.

> Well, the rt_overload code doesn't kick in unless one or more runqueues have 
> two or more runnable RT tasks.  So it won't be "all the time" - but certainly 
> most of the time on any SMP machine running a serious RT application.  And we 
> agree, it will impose some overhead - but we want to minimize that.  Perhaps 
> some kind of a runtime switch to enable the rt_overload code would be 
> appropriate?

Listen to me, "maximum deterministic latency". Do you understand what that
is ? This is what I'm talking about. RTOS folks care about "maximum deterministic
latency" times; do you? :)

> I'd prefer not to add another scheduling policy.  Although some might argue 
> that a runtime switch for rt_overload is effectively the same thing... I 
> might even argue that :-)
> 
> > I'd much rather see that some kind of SWSRPS go into the main line kernel
> > (RT scheduling is pretty goofy already so folks probably won't mind SWSRPS
> > replacing it) 
> 
> I agree, and I think Ingo does to, he mentioned wanting to push rt_overload to 
> mainline.  Ingo?

It really should go into -rt. It's needed there and that variant of the
kernel can deliver a time granularity that can respect a strict priority.
Please think in terms of -rt. -rt rules.
 
> > and let thread binding to a CPU restore any loss of latency 
> > by bypassing the SWSRPS logic. Are you tracking me ? My concerns here are
> > the latency and overhead of SWSRPS and they are definitely significant.
> 
> So if an application binds an RT task to a CPU it needs to be excluded from 
> some of the SWSRPS logic in order to reduce overhead?  It sounds nice, but 
> I'm not sure it's possible unless all the tasks are bound to CPUs.  Take a 4 
> CPU system.  Task A is bound to CPU0.  3 other RT tasks are about to become 
> runnable, the SWSRPS logic will still have to account for an RT task on CPU0 
> so it doesn't get bumped.  What did you have in mind?

Let the RT app dudes determine that. Separate the general case from the
specific RT app case usage. They are completely different things; don't
mix them up in your head. They have to be addressed differently. I'm
repeating myself, again and again and again... but I'm ok, whatever works. :)

> > For a default RT oriented kernel, yes, I agree, but that's not what's in
> > -rt and it serves the purpose for now. Looser scheduling constraints aren't
> > really affecting the current crop of RT applications and that's ok since
> > it's a bit faster than a pure SWSRPS solution. For those apps having a more
> > general, looser, case semantic doesn't affect things dramatically and might
> > even be useful.
> 
> OK so you said earlier that:
> 
> 	> The counter effect to that is that you're going to effect the general 
> 	> case latency case with SCHED_FIFO via that rescheduling hit "all of 
> 	> the time" with IPIs and stuff. I'm 'ok' with that. In fact, that's 
> 	> what I want. 
> 
> These two paragraphs appear contradictory to me, it's unclear to me what you 
> are trying to accomplish.  Would you like to see SWSRPS in the mainline 
> kernel or not?

No, but I like to see this in -rt definitely. Main line is just too wacked
out to really utilize this properly. Plus, anybody seriously using SCHED_FIFO
on a 2.6 kernel is going to be using -rt anyways.

> Has this cleared some things up?  If not, let me know what else needs 
> clarification.

Yes, but you should really work to clarify terminology. Is this better ?

bill


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-06  7:37 ` Ingo Molnar
                     ` (3 preceding siblings ...)
  2006-04-07  3:07   ` Bill Huey
@ 2006-04-08  0:11   ` Peter Williams
  4 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2006-04-08  0:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John, Siddha,
	Suresh B, Nick Piggin

Ingo Molnar wrote:
> * Darren Hart <darren@dvhart.com> wrote:
> 
>> My last mail specifically addresses preempt-rt, but I'd like to know 
>> people's thoughts regarding this issue in the mainline kernel.  Please 
>> see my previous post "realtime-preempt scheduling - rt_overload 
>> behavior" for a testcase that produces unpredictable scheduling 
>> results.
> 
> the rt_overload feature i intend to push upstream-wards too, i just 
> didnt separate it out of -rt yet.
> 
> "RT overload scheduling" is a totally orthogonal mechanism to the SMP 
> load-balancer (and this includes smpnice too) that is more or less 
> equivalent to having a 'global runqueue' for real-time tasks, without 
> the SMP overhead associated with that. If there is no "RT overload" [the 
> common case even on Linux systems that _do_ make use of RT tasks 
> occasionally], the new mechanism is totally inactive and there's no 
> overhead. But once there are more RT tasks than CPUs, the scheduler will 
> do "global" decisions for what RT tasks to run on which CPU.

Is this good enough?  Isn't it possible with the current 
try_to_wake_up() implementation (with or without smpnice) for two RT 
tasks to end up on the same CPU even when there are fewer RT tasks than CPUs?

> To put even 
> less overhead on the mainstream kernel, i plan to introduce a new 
> SCHED_FIFO_GLOBAL scheduling policy to trigger this behavior. [it doesnt 
> make much sense to extend SCHED_RR in that direction.]

I wouldn't have thought a new policy was necessary.  Why not just do 
this for all SCHED_FIFO (or even all RT) tasks?

BTW, does anybody in the real world actually use SCHED_RR?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-07 23:36                   ` Bill Huey
@ 2006-04-08  3:01                     ` Steven Rostedt
  2006-04-08  4:28                       ` Vernon Mauery
  0 siblings, 1 reply; 37+ messages in thread
From: Steven Rostedt @ 2006-04-08  3:01 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, Ingo Molnar, linux-kernel, Thomas Gleixner, Stultz,
	John, Peter Williams, Siddha, Suresh B, Nick Piggin

Hi Bill,

I'm just catching up on this thread.  Is your main concern that a high
prio task is going to be unnecessarily delayed because there's a lower RT
task on the same CPU and time is needed to push it off to another CPU?
It's late, so forgive me if this is a stupid question ;)


On Fri, 2006-04-07 at 16:36 -0700, Bill Huey wrote:

> > Has this cleared some things up?  If not, let me know what else needs 
> > clarification.
> 
> Yes, but you should really work to clarify terminology. Is this better ?

Goes both ways :)

-- Steve

PS. It's really good to see you back on LKML.  I've missed your posts.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-08  3:01                     ` Steven Rostedt
@ 2006-04-08  4:28                       ` Vernon Mauery
  2006-04-08  4:45                         ` Steven Rostedt
  0 siblings, 1 reply; 37+ messages in thread
From: Vernon Mauery @ 2006-04-08  4:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Bill Huey, Darren Hart, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Stultz, John, Peter Williams, Siddha, Suresh B,
	Nick Piggin

On Friday 07 April 2006 20:01, Steven Rostedt wrote:
> Hi Bill,
>
> I'm just catching up on this thread.  Is your main concern that a high
> prio task is going to be unnecessarily delayed because there's a lower RT
> task on the same CPU and time is needed to push it off to another CPU?
> It's late, so forgive me if this is a stupid question ;)

What I have gathered from this thread is that there are two important (and 
partially conflicting) requirements for better real-time support.

1) Deterministic scheduling algorithms (SWSRPS).  Basically, with uniprocessor 
systems (or smp with a global run queue), it was really easy to say, run the 
highest priority task in the queue.  But when there are several queues that 
are independent of each other, it is difficult.  According to SWSRPS, nr_cpus 
highest priority runnable tasks should _always_ be running (regardless of 
which queue they are on).  This might mean that there are longer latencies a) 
to determine the nr_cpus highest priority tasks and b) because of cache 
issues.

2) Maximum deterministic latency.  A task should be able to say that if it 
relinquishes the processor for now, MAX_LATENCY nanoseconds (or ticks or 
whatever you want to measure time in) later, it will be back in time to meet 
a deadline.
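Point (1) can be stated as a checkable invariant. A toy sketch, assuming distinct priorities where a higher number means higher priority (the function and limits are made up for illustration, not kernel code):

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4

/* qsort comparator: descending order of priority. */
static int cmp_desc(const void *a, const void *b)
{
	return *(const int *)b - *(const int *)a;
}

/*
 * SWSRPS invariant: of all runnable RT task priorities, the NR_CPUS
 * highest must be exactly the ones running on the CPUs (0 = no RT task
 * running there).  Assumes distinct priorities and at most 64 tasks.
 */
static int swsrps_holds(const int *runnable, int nr_runnable,
			const int running[NR_CPUS])
{
	int want = nr_runnable < NR_CPUS ? nr_runnable : NR_CPUS;
	int sorted[64], on_cpu[NR_CPUS];

	for (int i = 0; i < nr_runnable; i++)
		sorted[i] = runnable[i];
	qsort(sorted, nr_runnable, sizeof(int), cmp_desc);

	for (int i = 0; i < NR_CPUS; i++)
		on_cpu[i] = running[i];
	qsort(on_cpu, NR_CPUS, sizeof(int), cmp_desc);

	for (int i = 0; i < want; i++)
		if (on_cpu[i] != sorted[i])
			return 0;	/* a higher-prio task sits waiting */
	return 1;
}
```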

As I understand it, real time is all about determinism.  But there are several 
places where we have to focus on determinism to make it all behave as it 
should.

Priority A > B > C
If a lower priority task C gets run just because it is the highest in that 
CPU's run queue while a higher priority task B is sleeping while A 
runs (on a 2 proc system), this is WRONG.  But then again, we need to make 
sure that we can determine the maximum latency to preempt C to run B and try 
to minimize that.

Poof!  More smoke in the air.  I hope that clears it up.

--Vernon

> On Fri, 2006-04-07 at 16:36 -0700, Bill Huey wrote:
> > > Has this cleared some things up?  If not, let me know what else needs
> > > clarification.
> >
> > Yes, but you should really work to clarify terminology. Is this better ?
>
> Goes both ways :)
>
> -- Steve
>
> PS. It's really good to see you back on LKML.  I've missed your posts.
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-08  4:28                       ` Vernon Mauery
@ 2006-04-08  4:45                         ` Steven Rostedt
  0 siblings, 0 replies; 37+ messages in thread
From: Steven Rostedt @ 2006-04-08  4:45 UTC (permalink / raw)
  To: Vernon Mauery
  Cc: Bill Huey, Darren Hart, Ingo Molnar, linux-kernel,
	Thomas Gleixner, Stultz, John, Peter Williams, Siddha, Suresh B,
	Nick Piggin

Hi Vernon,


On Fri, 2006-04-07 at 21:28 -0700, Vernon Mauery wrote:

> 1) Deterministic scheduling algorithms (SWSRPS).  Basically, with uniprocessor 
> systems (or smp with a global run queue), it was really easy to say, run the 
> highest priority task in the queue.  But when there are several queues that 
> are independent of each other, it is difficult.  According to SWSRPS, nr_cpus 
> highest priority runnable tasks should _always_ be running (regardless of 
> which queue they are on).  This might mean that there are longer latencies a) 
> to determine the nr_cpus highest priority tasks and b) because of cache 
> issues.

Yep, and task CPU dancing.  Every time a high prio task preempts a lower
prio RT task, that RT task might be pushed to another CPU.

> 
> 2) Maximum deterministic latency.  A task should be able to say that if it 
> relinquishes the processor for now, MAX_LATENCY nanoseconds (or ticks or 
> whatever you want to measure time in) later, it will be back in time to meet 
> a deadline.

Yep, but more important than latency is making your deadline.
Sometimes people forget that and just concentrate on latency.
But that's another story.


> 
> As I understand it, real time is all about determinism.  But there are several 
> places where we have to focus on determinism to make it all behave as it 
> should.
> 
> Priority A > B > C
> If a lower priority task C gets run just because it is the highest in that 
> CPU's run queue while a higher priority task B is sleeping while A
> runs (on a 2 proc system), this is WRONG.

Argh, terminology is killing us all.  For this to be wrong, B isn't
"sleeping" it's "waiting" while in the run state.  "Sleeping" means that
it's not on the run queue and is just waiting for some event.  Which
would be OK for C to run then.  But if B is on the run queue and in the
the TASK_RUNNING state, it would be wrong for C to be running somewhere
where B could be running.

>   But then again, we need to make 
> sure that we can determine the maximum latency to preempt C to run B and try 
> to minimize that.

And here I don't know of another way besides an IPI to preempt C.  If C
is in userspace, how would you preempt C right away if B suddenly wakes
up on the runqueue of A?

> 
> Poof!  More smoke in the air.  I hope that clears it up.

It's as clear as my face was in High School ;)

-- Steve



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-07 21:06               ` Bill Huey
  2006-04-07 22:37                 ` Darren Hart
@ 2006-04-08  7:16                 ` Ingo Molnar
  2006-04-08  7:25                   ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-08  7:16 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> On Fri, Apr 07, 2006 at 07:56:20AM -0700, Darren Hart wrote:
> > The rt-overload mechanism is distinct from load balancing.  RT overload 
> > attempts to make sure the NR_CPUS highest priority runnable tasks are running 
> > on each of the available CPUs.  This isn't load balancing, this is "system 
> > wide strict realtime priority scheduling" (SWSRPS).  This scheduling should 
> > take place even if there are 1000 non RT tasks on CPU 0 and none on all the 
> > others.  The load balancer would be responsible for distributing those 1000 
> > non rt tasks to all the CPUs.
> 
> I'm quite aware of what you're saying as well as much of the 
> contents of the -rt patch. Please don't assume that I'm not aware of 
> this. The -rt patch doesn't do SWSRPS, [...]

to the contrary, the "RT overload" code in the -rt tree does strict, 
system-wide RT priority scheduling. That's the whole point of it.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-08  7:16                 ` Ingo Molnar
@ 2006-04-08  7:25                   ` Ingo Molnar
  2006-04-08  7:54                     ` Bill Huey
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-08  7:25 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Ingo Molnar <mingo@elte.hu> wrote:

> * Bill Huey <billh@gnuppy.monkey.org> wrote:
> 
> > I'm quite aware of what you're saying as well as much of the 
> > contents of the -rt patch. Please don't assume that I'm not aware of 
> > this. The -rt patch doesn't do SWSRPS, [...]
> 
> to the contrary, the "RT overload" code in the -rt tree does strict, 
> system-wide RT priority scheduling. That's the whole point of it.

so after this "clarification of terminology" i hope you are in picture 
now, so could you please explain to me what you meant by:

> > You should consider for a moment to allow for the binding of a 
> > thread to a CPU to determine the behavior of a SCHED_FIFO class task 
> > instead of creating a new run category. [...]

to me it still makes no sense, and much of the followups were based on 
this. Or were you simply confused about what the scheduling code in -rt 
does precisely? Did that get clarified now?

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-08  7:25                   ` Ingo Molnar
@ 2006-04-08  7:54                     ` Bill Huey
  2006-04-08  8:03                       ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Bill Huey @ 2006-04-08  7:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Sat, Apr 08, 2006 at 09:25:30AM +0200, Ingo Molnar wrote:
> > to the contrary, the "RT overload" code in the -rt tree does strict, 
> > system-wide RT priority scheduling. That's the whole point of it.
> 
> so after this "clarification of terminology" i hope you are in picture 
> now, so could you please explain to me what you meant by:

> > > You should consider for a moment to allow for the binding of a 
> > > thread to a CPU to determine the behavior of a SCHED_FIFO class task 
> > > instead of creating a new run category. [...]
> 
> to me it still makes no sense, and much of the followups were based on 
> this. Or were you simply confused about what the scheduling code in -rt 
> does precisely? Did that get clarified now?

The last time I looked at it I thought it did something pretty simplistic
in that it just dumped any RT thread to another CPU but didn't do it in
a strict manner with regard to priority. Maybe that's changed or else I
didn't pay attention to it as carefully as I thought.

As far as CPU binding goes, I'm wanting a method of getting around the
latency of the rt overload logic in certain cases at the expense of
rebalancing. That's what I meant by it.

bill


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: RT task scheduling
  2006-04-08  7:54                     ` Bill Huey
@ 2006-04-08  8:03                       ` Ingo Molnar
  2006-04-08 10:02                         ` Bill Huey
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-08  8:03 UTC (permalink / raw)
  To: Bill Huey
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> The last time I looked at it I thought it did something pretty 
> simplistic in that it just dumped any RT thread to another CPU but 
> didn't do it in a strict manner with regard to priority. Maybe that's 
> changed or else I didn't pay attention to it as carefully as I 
> thought.

well as Darren's testcase shows, it might still have some bug - but the 
mechanism is intended to be strict. (the implementation had a couple of 
strictness bugs (they show up as long latencies on SMP) but those were 
ironed out months ago.)

> As far as CPU binding goes, I'm wanting a method of getting around the 
> latency of the rt overload logic in certain cases at the expense of 
> rebalancing. That's what I meant by it.

yeah, that certainly makes sense, and it's one reason why i'm thinking 
about the separate SCHED_FIFO_GLOBAL policy for 'globally scheduled' RT 
tasks, while still keeping the current lightweight non-global RT 
scheduling. Global scheduling either means a global lock, or as in the 
-rt implementation means a "global IPI", but there's always a nontrivial 
"global" cost involved.

	Ingo


* Re: RT task scheduling
  2006-04-08  8:03                       ` Ingo Molnar
@ 2006-04-08 10:02                         ` Bill Huey
  0 siblings, 0 replies; 37+ messages in thread
From: Bill Huey @ 2006-04-08 10:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Stultz, John,
	Peter Williams, Siddha, Suresh B, Nick Piggin, Bill Huey (hui)

On Sat, Apr 08, 2006 at 10:03:49AM +0200, Ingo Molnar wrote:
> * Bill Huey <billh@gnuppy.monkey.org> wrote:
> > As far as CPU binding goes, I'm wanting a method of getting around the 
> > latency of the rt overload logic in certain cases at the expense of 
> > rebalancing. That's what I meant by it.
> 
> yeah, that certainly makes sense, and it's one reason why i'm thinking 
> about the separate SCHED_FIFO_GLOBAL policy for 'globally scheduled' RT 
> tasks, while still keeping the current lightweight non-global RT 
> scheduling. Global scheduling either means a global lock, or as in the 
> -rt implementation means a "global IPI", but there's always a nontrivial 
> "global" cost involved.

This is an extremely tricky problem which is why I lean toward manual
techniques to resolve the issue. All automatic policies seem to fail for
one reason or another. Significant thought needs to be put into this 
problem, and you might not be able to effectively address all parts of it.

It goes far beyond the conventions of SCHED_FIFO itself and really
forces one to think about what "priority" really is in the context of
the -rt patch, whether the priority range needs to be extended to a much
larger range (0-512 or larger) and other issues regarding that. I see
-rt SCHED_FIFO as a basic building block for other policies (done by
researchers) that can be faked in userspace (like scheduler activations),
but policies vary greatly given a typical load characteristic or demands 
of a -rt app. No single policy can fulfill those needs and it's still 
largely a research topic in many areas. Rebalancing in allocation
schedulers is still voodoo in SMP environments for example.

Please take this into consideration when thinking about SCHED_FIFO_GLOBAL
and related topics.

bill



* Re: RT task scheduling
  2006-04-06  3:25 RT task scheduling Darren Hart
                   ` (2 preceding siblings ...)
  2006-04-07  9:23 ` Bill Huey
@ 2006-04-09 13:16 ` Ingo Molnar
  2006-04-09 17:25   ` Darren Hart
  3 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2006-04-09 13:16 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin


* Darren Hart <darren@dvhart.com> wrote:

> My last mail specifically addresses preempt-rt, but I'd like to know 
> people's thoughts regarding this issue in the mainline kernel.  Please 
> see my previous post "realtime-preempt scheduling - rt_overload 
> behavior" for a testcase that produces unpredictable scheduling 
> results.

thanks for the testcase! It indeed triggered a bug in the -rt tree's 
"RT-overload" balancing feature. The nature of the bug made it trigger 
much less likely on 2-way boxes (where i do most of my -rt testing), 
probably that's why it didn't get discovered before. I've uploaded the 
-rt14 tree with this bug fixed - does it fix the failures for you?

	Ingo


* Re: RT task scheduling
  2006-04-09 13:16 ` Ingo Molnar
@ 2006-04-09 17:25   ` Darren Hart
  2006-04-09 18:31     ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Darren Hart @ 2006-04-09 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin

On Sunday 09 April 2006 06:16, Ingo Molnar wrote:
> * Darren Hart <darren@dvhart.com> wrote:
> > My last mail specifically addresses preempt-rt, but I'd like to know
> > people's thoughts regarding this issue in the mainline kernel.  Please
> > see my previous post "realtime-preempt scheduling - rt_overload
> > behavior" for a testcase that produces unpredictable scheduling
> > results.
>
> thanks for the testcase! It indeed triggered a bug in the -rt tree's
> "RT-overload" balancing feature. The nature of the bug made it trigger
> much less likely on 2-way boxes (where i do most of my -rt testing),
> probably that's why it didn't get discovered before. I've uploaded the
> -rt14 tree with this bug fixed - does it fix the failures for you?

I ran the test 100 times, no failures!  Looks good to me.

# for ((i=0;i<100;i++)); do ./sched_football 4 10 | grep "Final ball position" 
| tee sched_football_ball.log; sleep 1; done
Final ball position: 0
...
Final ball position: 0

Looking at the patch, it looks like the problem was a race on the runqueue 
lock - when we called double_runqueue_lock(a,b) we risked dropping the lock 
on b, giving another CPU opportunity to grab it and pull our next task.  The 
API change to double_runqueue_lock() and checking the new return code in 
balance_rt_tasks() is what fixed the issue.  Is that accurate?

I was doing some testing to see why the RT tasks don't appear to be evenly 
distributed across the CPUs (in my earlier post using the output 
of /proc/stat).  I was wondering if the load_balancing code might be 
interfering with the balance_rt_tasks() code.  Should the normal 
load_balancer even touch RT tasks in the presence of balance_rt_tasks() ?  
I'm thinking not.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Realtime Linux Team
Phone: 503 578 3185
  T/L: 775 3185


* Re: RT task scheduling
  2006-04-09 17:25   ` Darren Hart
@ 2006-04-09 18:31     ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2006-04-09 18:31 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Stultz, John, Peter Williams,
	Siddha, Suresh B, Nick Piggin


* Darren Hart <dvhltc@us.ibm.com> wrote:

> > -rt14 tree with this bug fixed - does it fix the failures for you?
> 
> I ran the test 100 times, no failures!  Looks good to me.
> 
> # for ((i=0;i<100;i++)); do ./sched_football 4 10 | grep "Final ball position" 
> | tee sched_football_ball.log; sleep 1; done
> Final ball position: 0
> ...
> Final ball position: 0

cool!

> Looking at the patch, it looks like the problem was a race on the 
> runqueue lock - when we called double_runqueue_lock(a,b) we risked 
> dropping the lock on b, giving another CPU opportunity to grab it and 
> pull our next task.  The API change to double_runqueue_lock() and 
> checking the new return code in balance_rt_tasks() is what fixed the 
> issue.  Is that accurate?

this was one problem, yes. There was another problem too: the code 
checked against rq->curr, while it has to consider the 'next' task (the 
current task might just be about to leave the CPU at the point we do the 
rebalancing). (A third problem was an efficiency issue: the code 
indiscriminately pulled all RT tasks it found eligible - while it should 
only have pulled ones that preempt the next CPU.)

> I was doing some testing to see why the RT tasks don't appear to be 
> evenly distributed across the CPUs (in my earlier post using the 
> output of /proc/stat). [...]

could you explain this in a bit more detailed way? [what is the behavior 
you observe, and what would be your expectation.]

> [...] I was wondering if the load_balancing code might be interfering 
> with the balance_rt_tasks() code.  Should the normal load_balancer 
> even touch RT tasks in the presence of balance_rt_tasks() ?  I'm 
> thinking not.

if there is RT overload then running the highest-prio RT tasks trumps 
any other consideration - including load_balance(). Also, same-prio 
SCHED_FIFO tasks can starve each other indefinitely, so there's not much 
the load-balancer can do.

	Ingo


end of thread  [~2006-04-09 18:33 UTC | newest]

Thread overview: 37+ messages
2006-04-06  3:25 RT task scheduling Darren Hart
2006-04-06  4:19 ` Peter Williams
2006-04-06 17:24   ` Darren Hart
2006-04-06 23:02     ` Peter Williams
2006-04-06  7:37 ` Ingo Molnar
2006-04-06 14:55   ` Darren Hart
2006-04-06 18:16   ` Darren Hart
2006-04-06 22:35     ` Darren Hart
2006-04-07 22:58       ` Vernon Mauery
2006-04-06 23:06   ` Peter Williams
2006-04-07  3:07   ` Bill Huey
2006-04-07  7:11     ` Ingo Molnar
2006-04-07  8:39       ` Bill Huey
2006-04-07  9:11         ` Bill Huey
2006-04-07  9:19         ` Ingo Molnar
2006-04-07 10:39           ` Bill Huey
2006-04-07 10:51             ` Ingo Molnar
2006-04-07 11:14               ` Bill Huey
2006-04-07 11:29                 ` Ingo Molnar
2006-04-07 22:18                   ` Bill Huey
2006-04-07 14:56             ` Darren Hart
2006-04-07 21:06               ` Bill Huey
2006-04-07 22:37                 ` Darren Hart
2006-04-07 23:36                   ` Bill Huey
2006-04-08  3:01                     ` Steven Rostedt
2006-04-08  4:28                       ` Vernon Mauery
2006-04-08  4:45                         ` Steven Rostedt
2006-04-08  7:16                 ` Ingo Molnar
2006-04-08  7:25                   ` Ingo Molnar
2006-04-08  7:54                     ` Bill Huey
2006-04-08  8:03                       ` Ingo Molnar
2006-04-08 10:02                         ` Bill Huey
2006-04-08  0:11   ` Peter Williams
2006-04-07  9:23 ` Bill Huey
2006-04-09 13:16 ` Ingo Molnar
2006-04-09 17:25   ` Darren Hart
2006-04-09 18:31     ` Ingo Molnar
