linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Re: OSDL Bug 3770
@ 2004-12-20 17:02 Loic Domaigne
  2004-12-21 10:09 ` Ingo Molnar
  2004-12-21 11:09 ` Nick Piggin
  0 siblings, 2 replies; 8+ messages in thread
From: Loic Domaigne @ 2004-12-20 17:02 UTC (permalink / raw)
  To: piggin; +Cc: nptl, Linux-Kernel, mingo

Hello Nick,

Thanks for your reply! 
 
L = Loic 
N = Nick 

N> lkml: We're discussing the fact that on SMP machines, our realtime 
N> scheduling policies are per-CPU only. This caused a problem where a 
N> high priority task on one CPU caused all lower priority tasks on that 
N> CPU to be starved, while tasks on another CPU with the same low 
N> priority were able to run.

That summary should readily motivate you to make a patch ;-) 

But thing are a bit worse actually. It is easily to build an example 
where a lower priority thread is executing while a higer priority thread
is waiting. For instance, something like: 

CPU0:
Thread with prio 30 gets the CPU.
Thread with prio 25 is waiting.

CPU1:
Thread with prio 20 gets the CPU.
Thread with prio 15 is waiting.

This is a corollary of the OSDL Bug #3770. A practical example has been 
built by Sebastien Decugis. 


L> Although POSIX legally permits such implementation for realtime 
L> policy on SMP machines, this implementation is clearly *NOT* 
L> REASONABLE.
N>
N> Well I haven't done much in the realtime area... but nobody has 
N> complained till now.

But now you have some complains ;-) 


L> The reason is extremely simple: the application *CANNOT* necessarily 
L> known that it gets stuck behind a higher-priority thread (though it 
L> could had run on another CPU if the scheduler had decided otherwise). 
L> That's *NOT* doable to program in a deterministic fashion in such 
L> "realtime"-environement
N>
N>
N> You could use CPU binding. I'd argue that this may be nearly a
N> requirement for any realtime system of significant complexity on 
N> an SMP system.

Agree. Real-world system will likely want to have a control on which 
CPU the threads runs on SMP machine. 

Does Linux tolerate hard CPU binding? By hard CPU binding, I mean 
that the application tells the scheduler "I want to run there", 
and the scheduler schedules the thread(s) "there" regardless if it 
makes sense or not ( The decision is left to the application). 

With such hard CPU binding, it seems to me that our "unfortunate 
behavior" isn't problematic anymore. Because the application can gain 
control again over the scheduler (so to speak). 

On the other hand, if the scheduler might ignore the CPU binding 
(thus, not hard binding, but rather CPU affinity), then I am afraid 
that the issue might remain problematic.  


N> *But*, notice that the program in question did not run on UP and
N> randomly fail on SMP, rather it would not work on single processor 
N> AT ALL.

The test is of course "broken" in some sense, and does nothing useful... 
But the principal goal of this test is showing a potential issue with 
a few lines of code. The fact that the test doesn't work on UP machine 
is _irrelevant_ here. 

Relevant is: if I have compute bound threads of various priority, some 
high priority threads might stuck behind some low priority threads. 
That's not really realtime friendly, isn't it? 


N> The driver really needs to sleep, use a mutex, use a lower priority,
N> or  something in order for it to work.
N>
L>
L> NO! It is not the responsability of the application to fix that 
L> behavior! We can in our case because 'we know', but some 
L> applications don't!!!
L>
N>
N> That's a bit hand-wavy ;) but I don't dismiss it out of hand because 
N> as I said, I'm not so familiar with this area. I would be interested 
N> in an example of some application where this matters, and which 
N> absolutely can't use any synchronisation primitives.

Perhaps it would help to consider a real-world example. Here is an 
example inspired from some Air Traffic Control Systems. 

(*) You have one thread that computes the position of an airplane or 
    "track". This thread has the highest priority because all other 
    tasks like collision avoidance computation etc. depends of 
    airplanes track.

    This thread is essentially compute bound. It tells other 
    interested threads when new data for the airplanes tracks are 
    available (using condvars and co) and goahead with the next tracks. 
    
(*) You have another bunch of threads that computes waiting for relevant 
    stuff for air traffic control, for instance if two trajectories 
    shall interesect. 

    These threads are essentially compute bound too. They are first 
    waiting for the new airplaines positions (from the previous thread) 
    and then proceeds with their computations.

Interesting is that the computation of the airplaines tracks and 
collision avoidance overlap. But, when required, the tracks computation 
has precedence over other tasks. 

The collision avoidance computation has nice scalability property. 
You compute the possible collision pairwise between 2 airplanes. All 
pairs can be computed independantly. 

Now it is not usual that the computations to be quite CPU intensive, and 
so SMP machines are used. With the current Linux scheduling decision, 
some threads responsible for the collision avoidance might get stuck, 
or do I miss something? 

[
Note aside- Of course, it is a simplified version of real systems. 
The Truth is, as usual, more awful. 

And yes! Linux has a niche in Air Traffic Control business
]  


N> If it were that easy, we might have a single queue for _all_ tasks.
N> 
N> The main problem is the cost of synchronisation and cacheline 
N> sharing. A secondary problem is that of CPU affinities - moving a 
N> task to another CPU nearly always has some non zero cost in terms 
N> of cache (and in case of NUMA, memory) efficiency.
N> 
N> Our global queue scheduler was basically crap for more than 4 CPUs. 
N> We could give RT tasks a global queue with little impact to non-RT 
N> workloads (in fact, I think early iterations of the 2.6 scheduler 
N> trialed this)... but let's not cripple the RT apps that do the right 
N> thing (and need scalablility).
 
I do not have the necessary knowledge on the Linux scheduler. And I 
do not pretend that implementing an efficient scheduler on a lot 
number of target platforms as Linux does is an easy thing ( actually, 
I quite admire the work done here! ).  

Perhaps the best alternative would to offer hard CPU binding (if 
not already provided). Hard CPU binding is likely needed by the 
HPC guys too (likely working, on a bunch of SMP machines connected 
together by a high-speed network). 


N> another problem is that scheduling may not be O(1) anymore, 
N> if you have CPU affinity bindings in place.

Designing an O(p) scheduler, where @p is the number of processors is 
theoritically 'easy'. When a thread has been picked-up and assigned to 
a CPU with the O(1) scheduling algorithm, re-iterate the same algorithm 
on the remaining threads / unbound CPU. Since usually p is constant, 
the scheduling algorithm remains hence O(1) for the SMP machine. 
However it doesn't scale with the number of processors. Which can make 
the algorith not really practical...


N> To summaries, I believe that if per-CPU RT queues is allowed within 
N> POSIX, then we want to go with the sanest possible implementation, 
N> and force any broken apps to fix themselves.... 

POSIX is rather vague concerning the scheduling on SMP machine, because 
no real consensus could be reached between the Posix.1b and the Posix.1c 
guys. As a result, POSIX says timidely: 

"For application threads with scheduling allocation domains of size 
greater than one, the rules defined for SCHED_FIFO, SCHED_RR, and 
SCHED_SPORADIC shall be used in an implementation-defined manner."

With such loophole, the term "broken apps" is a kind of fuzzy. 
Minds you, but scheduling the highest prio thread on a CPU and 
blocking the other threads (ignoring hence the SMP feature, and 
having the same scheduling as on an UP) is a perfectly legal 
POSIX implementation... 
With such implementation, you get suddently a lot of 'broken apps' (!) 

In order words, POSIX allows you to put vinegar, strawberry jam, 
red-hot chili pepper and mayo on your hambuger. Whether it really 
tastes is another matter ;-) 


Cheers!
Loic.

-- 
--
// Sender address goes to /dev/null (!!) 
// Use my 32/64 bits, ANSI C89, compliant email-address instead:   

unsigned y[]=
{0,34432,26811,16721,41866,63119,61007,48155,26147,10986};
void x(z){putchar(z);}; unsigned t; 
main(i){if(i<10){t=(y[i]*47560)%65521;x(t>>8);x(t&255);main(++i);}}

Psssst! Mit GMX Handyrechnung senken: http://www.gmx.net/de/go/mail
100 FreeSMS/Monat (GMX TopMail), 50 (GMX ProMail), 10 (GMX FreeMail)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: OSDL Bug 3770
  2004-12-20 17:02 Re: OSDL Bug 3770 Loic Domaigne
@ 2004-12-21 10:09 ` Ingo Molnar
  2004-12-21 10:20   ` Loic Domaigne
  2004-12-21 10:22   ` [nptl] " Sebastien Decugis
  2004-12-21 11:09 ` Nick Piggin
  1 sibling, 2 replies; 8+ messages in thread
From: Ingo Molnar @ 2004-12-21 10:09 UTC (permalink / raw)
  To: Loic Domaigne; +Cc: piggin, nptl, Linux-Kernel


* Loic Domaigne <loic-dev@gmx.net> wrote:

> N> lkml: We're discussing the fact that on SMP machines, our realtime 
> N> scheduling policies are per-CPU only. This caused a problem where a 
> N> high priority task on one CPU caused all lower priority tasks on that 
> N> CPU to be starved, while tasks on another CPU with the same low 
> N> priority were able to run.
> 
> That summary should readily motivate you to make a patch ;-)

note that my -RT patchset includes scheduler changes that implement
"global RT scheduling" on SMP systems. Give it a go, it's at:

   http://redhat.com/~mingo/realtime-preempt/

you have to enable CONFIG_PREEMPT_RT to active this feature. I've
designed this code to not hurt non-RT scheduling, and i've optimized
performance for the 'lightly loaded case' (which is the most common to
occur on mainline-using systems).

A very short description of the design: there's a global 'RT overload
counter' - which is zero and causes no overhead if there is at most 1 RT
task in every runqueue.  (i.e. at most 2 RT tasks on a 2-way system, at
most 4 RT tasks on a 4-way system, etc.) If the system gets into 'RT
overload' mode (e.g. the third RT task gets activated on a 2-way box),
then the scheduler starts to balance the RT tasks agressively. Also,
whenever an RT task is preempted on a CPU, or is woken up but cannot
preempt a higher-prio RT task on a given CPU, then it's 'pushed' to
other CPUs if possible. This design avoids global locking (it avoids a
global runqueue), which simplifies things immensely. (I first tried a
global runqueue for RT tasks but the complexity impact was much bigger.)

(note that these scheduler changes are resonably self-contained and do
not depend on other parts of PREEMPT_RT, so in theory they could be
added to mainline too, after some time - given lots of testing and broad
agreement.)

anyway, the first step would be to try this scheduler and give feedback
of how well it works for you :-)

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: OSDL Bug 3770
  2004-12-21 10:09 ` Ingo Molnar
@ 2004-12-21 10:20   ` Loic Domaigne
  2004-12-21 10:22   ` [nptl] " Sebastien Decugis
  1 sibling, 0 replies; 8+ messages in thread
From: Loic Domaigne @ 2004-12-21 10:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: piggin, nptl, Linux-Kernel

Hi Ingo, 

> note that my -RT patchset includes scheduler changes that implement
> "global RT scheduling" on SMP systems. Give it a go, it's at:
> 
>    http://redhat.com/~mingo/realtime-preempt/

That looks good, at least from a theoritical standpoint.  
We shall give a go with Seb and post the results. 

Thanks! 
Loic. 

-- 
--
// Sender address goes to /dev/null (!!) 
// Use my 32/64 bits, ANSI C89, compliant email-address instead:   

unsigned y[]=
{0,34432,26811,16721,41866,63119,61007,48155,26147,10986};
void x(z){putchar(z);}; unsigned t; 
main(i){if(i<10){t=(y[i]*47560)%65521;x(t>>8);x(t&255);main(++i);}}

Psssst! Mit GMX Handyrechnung senken: http://www.gmx.net/de/go/mail
100 FreeSMS/Monat (GMX TopMail), 50 (GMX ProMail), 10 (GMX FreeMail)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [nptl] Re: OSDL Bug 3770
  2004-12-21 10:09 ` Ingo Molnar
  2004-12-21 10:20   ` Loic Domaigne
@ 2004-12-21 10:22   ` Sebastien Decugis
  2004-12-24 15:00     ` Sebastien Decugis
  1 sibling, 1 reply; 8+ messages in thread
From: Sebastien Decugis @ 2004-12-21 10:22 UTC (permalink / raw)
  To: nptl; +Cc: Loic Domaigne, piggin, Linux-Kernel

> note that my -RT patchset includes scheduler changes that implement
> "global RT scheduling" on SMP systems. Give it a go, it's at:
> 
>    http://redhat.com/~mingo/realtime-preempt/
> 
> you have to enable CONFIG_PREEMPT_RT to active this feature. 

[...]
> 
> anyway, the first step would be to try this scheduler and give feedback
> of how well it works for you :-)

Thanks Ingo, this sounds very nice!

I'll try this hopefully before Christmas and give a status with this
patch applied.

Best regards,
Seb.

-- 
-------------------------------
Sebastien DECUGIS
NPTL Test & Trace Project
http://nptl.bullopensource.org/

"You may fail if you try.
You -will- fail if you don't."


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: OSDL Bug 3770
  2004-12-20 17:02 Re: OSDL Bug 3770 Loic Domaigne
  2004-12-21 10:09 ` Ingo Molnar
@ 2004-12-21 11:09 ` Nick Piggin
  2004-12-21 12:06   ` Loic Domaigne
  1 sibling, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2004-12-21 11:09 UTC (permalink / raw)
  To: Loic Domaigne; +Cc: nptl, Linux-Kernel, mingo



Loic Domaigne wrote:

>Hello Nick,
>
>Thanks for your reply! 
> 
>L = Loic 
>N = Nick 
>
>N> lkml: We're discussing the fact that on SMP machines, our realtime 
>N> scheduling policies are per-CPU only. This caused a problem where a 
>N> high priority task on one CPU caused all lower priority tasks on that 
>N> CPU to be starved, while tasks on another CPU with the same low 
>N> priority were able to run.
>
>That summary should readily motivate you to make a patch ;-) 
>
>But thing are a bit worse actually. It is easily to build an example 
>where a lower priority thread is executing while a higer priority thread
>is waiting. For instance, something like: 
>
>CPU0:
>Thread with prio 30 gets the CPU.
>Thread with prio 25 is waiting.
>
>CPU1:
>Thread with prio 20 gets the CPU.
>Thread with prio 15 is waiting.
>
>

Yep.


[snip]

>L> The reason is extremely simple: the application *CANNOT* necessarily 
>L> known that it gets stuck behind a higher-priority thread (though it 
>L> could had run on another CPU if the scheduler had decided otherwise). 
>L> That's *NOT* doable to program in a deterministic fashion in such 
>L> "realtime"-environement
>N>
>N>
>N> You could use CPU binding. I'd argue that this may be nearly a
>N> requirement for any realtime system of significant complexity on 
>N> an SMP system.
>
>Agree. Real-world system will likely want to have a control on which 
>CPU the threads runs on SMP machine. 
>
>Does Linux tolerate hard CPU binding? By hard CPU binding, I mean 
>that the application tells the scheduler "I want to run there", 
>and the scheduler schedules the thread(s) "there" regardless if it 
>makes sense or not ( The decision is left to the application). 
>
>With such hard CPU binding, it seems to me that our "unfortunate 
>behavior" isn't problematic anymore. Because the application can gain 
>control again over the scheduler (so to speak). 
>
>On the other hand, if the scheduler might ignore the CPU binding 
>(thus, not hard binding, but rather CPU affinity), then I am afraid 
>that the issue might remain problematic.  
>
>

Yes, it does support hard CPU binding - sched_setaffinity

[snip interesting dialogue]

Thanks for your detailed comments, they were interesting.

I hope that the fact we have hard CPU binding is a sufficient
solution to the problem.

Thanks
Nick


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: OSDL Bug 3770
  2004-12-21 11:09 ` Nick Piggin
@ 2004-12-21 12:06   ` Loic Domaigne
  2004-12-21 13:32     ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Domaigne @ 2004-12-21 12:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: nptl, Linux-Kernel, mingo

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="us-ascii", Size: 1079 bytes --]

Hello Nick, 

> >Does Linux tolerate hard CPU binding? By hard CPU binding, I mean 
> >that the application tells the scheduler "I want to run there", 
> >and the scheduler schedules the thread(s) "there" regardless if it 
> >makes sense or not ( The decision is left to the application). 
> 
> Yes, it does support hard CPU binding - sched_setaffinity

Yes, I believe that /sched_setaffinity()/ offers a practical solution to the
problem we are faced. 

But I am eager to try the RT-patchset of Ingo. 


> [snip interesting dialogue]
> 
> Thanks for your detailed comments, they were interesting.

... Glad to hear. You're welcome!  



Cheers,  
Loic. 

-- 
--
// Sender address goes to /dev/null (!!) 
// Use my 32/64 bits, ANSI C89, compliant email-address instead:   

unsigned y[]=
{0,34432,26811,16721,41866,63119,61007,48155,26147,10986};
void x(z){putchar(z);}; unsigned t; 
main(i){if(i<10){t=(y[i]*47560)%65521;x(t>>8);x(t&255);main(++i);}}

+++ Sparen Sie mit GMX DSL +++ http://www.gmx.net/de/go/dsl
AKTION für Wechsler: DSL-Tarife ab 3,99 EUR/Monat + Startguthaben

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: OSDL Bug 3770
  2004-12-21 12:06   ` Loic Domaigne
@ 2004-12-21 13:32     ` Ingo Molnar
  0 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2004-12-21 13:32 UTC (permalink / raw)
  To: Loic Domaigne; +Cc: Nick Piggin, nptl, Linux-Kernel


* Loic Domaigne <loic-dev@gmx.net> wrote:

> > Yes, it does support hard CPU binding - sched_setaffinity
> 
> Yes, I believe that /sched_setaffinity()/ offers a practical solution
> to the problem we are faced. 

that's the short-term workaround. Another model for CPU-bound RT tasks
is the use of isolcpus. (see Documentation/kernel-parameters.txt)

but that's the thinking behind current RT scheduling: no global sorting
of priorities is done on SMP, but if you know the priorities and the
workload in advance you can manually bind them to specific CPUs.

> But I am eager to try the RT-patchset of Ingo. 

this is obviously more experimental stuff, and feedback is welcome. It
is the current 'playground' for RT related scheduling features.

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [nptl] Re: OSDL Bug 3770
  2004-12-21 10:22   ` [nptl] " Sebastien Decugis
@ 2004-12-24 15:00     ` Sebastien Decugis
  0 siblings, 0 replies; 8+ messages in thread
From: Sebastien Decugis @ 2004-12-24 15:00 UTC (permalink / raw)
  To: nptl; +Cc: Loic Domaigne, piggin, Linux-Kernel

> > note that my -RT patchset includes scheduler changes that implement
> > "global RT scheduling" on SMP systems. Give it a go, it's at:
> > 
> >    http://redhat.com/~mingo/realtime-preempt/
> > 
> > you have to enable CONFIG_PREEMPT_RT to active this feature. 
> 
> [...]
> > 
> > anyway, the first step would be to try this scheduler and give feedback
> > of how well it works for you :-)

Well, I finally was able to test it. But... the test case hangs the
machine with the kernel patch applied :/

No problem applying the patch, nor compiling the kernel.

Then, it randomly fails booting the new kernel, hanging on "hardware
detection" in init process (looks like an issue with the IDE controler
where only the floppy and dvdrom are plugged) but this may be due to
rc3, not to your patch.

When the kernel boots, it behaves consistently, but when I run the test
case it never returns, and even the magic-SysRq does not work anymore.
I've also tested in single-user mode, the behavior is the same.

Sorry I can't tell more about the reason why it hangs, I don't know how
to debug the kernel :/ But I can provide you with the test case binary
in case the compiled source code does not hang for you (as for Nick).

Best regards, and Joyeux Noel!
Seb.

-- 
-------------------------------
Sebastien DECUGIS
NPTL Test & Trace Project
http://nptl.bullopensource.org/

"You may fail if you try.
You -will- fail if you don't."


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2004-12-24 14:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-20 17:02 Re: OSDL Bug 3770 Loic Domaigne
2004-12-21 10:09 ` Ingo Molnar
2004-12-21 10:20   ` Loic Domaigne
2004-12-21 10:22   ` [nptl] " Sebastien Decugis
2004-12-24 15:00     ` Sebastien Decugis
2004-12-21 11:09 ` Nick Piggin
2004-12-21 12:06   ` Loic Domaigne
2004-12-21 13:32     ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).