Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 18:10 Mark_H_Johnson
  2004-12-09 19:40 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 18:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
>
>> >But you do have set your reference irq (soundcard) to the highest prio
>> >in the PREEMPT_RT case? I just ask to make sure.
>>
>> Yes, but then I have ALL the IRQ's at the highest priority (plus a
>> couple other /0 and /1 tasks). [...]
>
>that is the fundamental problem i believe: your 'CPU loop' gets delayed
>by them.

They should not get delayed by them any more than in the PREEMPT_DESKTOP
configuration (other than the threading overhead which we've separately
said should be modest). They should be delayed by them less since we CAN
migrate the RT task away from the IRQ task (at least until I get the case
where multiple concurrent IRQ or /# threads keep both CPU's busy).

>> [...] Please note, I only use latencytest (an audio application) to
>> get an idea of RT performance on a desktop machine before I consider
>> using the kernel for my real application.
>
>but you never want your real application be delayed by things like IDE
>processing or networking workloads, correct?

For the most part, that I/O workload IS because I have the RT application
running. That was one of my points. I cannot reliably starve any of
those activities. The disk reads in my real application simulate a disk
read from a real world device. That data is needed for RT processing
in the simulated system. Some of the network traffic is also RT since
we generate a data stream that is interpreted in real time by other
systems.

>The only thing that should
>have higher priority than your application is the event thread that
>handles the hardware from which you get events. I.e. the soundcard IRQ
>in your case (plus the timer IRQ thread, because your task is also
>timing out).

For the test at my desktop I CAN do that but CHOOSE to not do that
since the real application has to handle the additional overhead.
Again, the set up I have is more of an apples to apples comparison.

>i'm not sure what the primary event source for your application is, but
>i bet it's not the IDE irq thread, nor the network IRQ thread.

I said previously the primary time source is from the shared memory
interface on the PCI bus for the specific application I described.
I could make that higher priority than the rest.

Actually we do use network messages to synchronize with a system that
is not in the cluster. At 20 Hz, we send a network message that
basically means "start execution" to that other system. It cannot
be delayed much either.

>so you are seeing the _inverse_ of advances in the -RT kernel: it's
>getting better and better at preempting your prio 30 CPU loop with the
>higher-prio RT tasks. I.e. the lower-prio CPU loop gets worse and worse
>latencies.

As I stated before (and I think you agree) the overhead of the setup
I have now for PREEMPT_RT should be comparable to that for PREEMPT_DESKTOP.
Neither should have a great advantage / disadvantage over the other.
The overhead for threading is certainly present in _RT but should
be offset to some extent by the improved migration opportunities.
The measurements however, do not seem to confirm that assessment.
Either the measurements are broken or the system is, and in either case
it should be fixed.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 18:10 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Mark_H_Johnson
@ 2004-12-09 19:40 ` Ingo Molnar
  2004-12-09 19:58 ` Ingo Molnar
  2004-12-10 23:42 ` Steven Rostedt
  2 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 19:40 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> >> Yes, but then I have ALL the IRQ's at the highest priority (plus a
> >> couple other /0 and /1 tasks). [...]
> >
> > that is the fundamental problem i believe: your 'CPU loop' gets 
> > delayed by them.
> 
> They should not get delayed by them any more than in the
> PREEMPT_DESKTOP configuration [...]

just to make sure we are talking about the same thing. Do you mean
PREEMPT_DESKTOP with IRQ threading disabled?

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 18:10 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Mark_H_Johnson
  2004-12-09 19:40 ` Ingo Molnar
@ 2004-12-09 19:58 ` Ingo Molnar
  2004-12-10 23:42 ` Steven Rostedt
  2 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 19:58 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

on SMP, latencytest + all IRQ threads (and ksoftirqd) at prio 99 +
PREEMPT_RT is not comparable to PREEMPT_DESKTOP (with no IRQ threading).

The -RT kernel will 'split' hardirq and softirq workloads and migrate
them to different CPUs - giving them a higher total throughput. Also, on
PREEMPT_DESKTOP the IRQs will most likely go to one CPU only, and most
softirq processing will be concentrated on that CPU too. Furthermore, 
the -RT kernel will aggressively distribute highprio RT tasks.

latencytest under your priority setup measures an _inverse_ scenario. (a
CPU hog executing at a lower priority than all IRQ traffic) I'd not be
surprised at all if it had higher latencies under -RT than under
PREEMPT_DESKTOP. It's not clear-cut which one 'wins' though: because
even this inverse scenario will have benefits in the -RT case: due to
SCHED_OTHER workloads not interfering with this lower-prio RT task as
much. But i'd expect there to be a constant moving of the 'benchmark
result' forward and backwards, even if -RT only improves things - this
is the nature of such an inverse priority setup.

so this setup generates two conflicting parameters which are inverse to
each other, and the 'sum' of these two parameters ends up fluctuating
wildly. Pretty much like the results you are getting. The two parameters
are: latency of the prio 30 task, and latency of the highprio tasks. The
better the -RT kernel gets, the better the prio 30 task's priorities
get relative to SCHED_OTHER tasks - but the worse they also get, due to
the better handling of higher-prio tasks. Where the sum ends, whether
it's a "win" or a "loss" depends on the workload, how much highprio
activity the lowprio threads generate, etc.

if you really want to put all IRQ traffic on the same priority level
then a fairer comparison would be to bind all IRQ (via smp_affinity) and
ksoftirqd (via taskset) threads to CPU#0, and to bind latencytest's
CPU-loop to CPU#1. (and do the same in the PREEMPT_DESKTOP case)
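
A minimal sketch of the binding described above, assuming a Linux system
with fewer than 32 CPUs (so the affinity mask is a single hex word) and
root privileges for the actual writes. The helper names are hypothetical;
only the /proc/irq/<irq>/smp_affinity file and the sched_setaffinity call
are real interfaces.

```python
import os


def cpu_mask_hex(cpus):
    """Return the hex bitmask (no 0x prefix) selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def bind_irq(irq, cpus):
    """Write the CPU mask to /proc/irq/<irq>/smp_affinity (needs root)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask_hex(cpus))


def bind_task(pid, cpus):
    """Pin a task to the given CPUs, as taskset would (pid 0 is the caller)."""
    os.sched_setaffinity(pid, set(cpus))
```

For the setup suggested here one would call bind_irq(irq, [0]) for each
IRQ of interest, bind_task(pid, [0]) for each ksoftirqd thread, and
bind_task(pid, [1]) for latencytest's CPU loop; the IRQ numbers and pids
depend on the machine and are not given in this thread.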

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 18:10 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Mark_H_Johnson
  2004-12-09 19:40 ` Ingo Molnar
  2004-12-09 19:58 ` Ingo Molnar
@ 2004-12-10 23:42 ` Steven Rostedt
  2004-12-11 16:59   ` john cooper
                     ` (2 more replies)
  2 siblings, 3 replies; 72+ messages in thread
From: Steven Rostedt @ 2004-12-10 23:42 UTC (permalink / raw)
  To: Mark Johnson
  Cc: Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, LKML, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

On Thu, 2004-12-09 at 12:10 -0600, Mark_H_Johnson@raytheon.com wrote:
> >but you never want your real application be delayed by things like IDE
> >processing or networking workloads, correct?
> For the most part, that I/O workload IS because I have the RT application
> running. That was one of my points. I cannot reliably starve any of
> those activities. The disk reads in my real application simulate a disk
> read from a real world device. That data is needed for RT processing
> in the simulated system. Some of the network traffic is also RT since
> we generate a data stream that is interpreted in real time by other
> systems.

[RFC]  Has there been previously any thought of adding priority
inheriting wait queues? With the IRQs that run as threads, have hooks in
the code that allow a driver or socket layer to attach a thread to a
wait queue, and when a RT priority task waits on the queue, a function
is called to increase (if needed) the priority of the attached thread. I
know that this would take some work, and would make the normal kernel
and RT diverge more, but it would really help to solve the problem of a
high priority process waiting for an interrupt that can be starved by
other high priority processes.

Just a thought.

-- Steve



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10 23:42 ` Steven Rostedt
@ 2004-12-11 16:59   ` john cooper
  2004-12-12  4:36     ` Steven Rostedt
  2004-12-11 17:59   ` Esben Nielsen
  2004-12-13 22:31   ` Ingo Molnar
  2 siblings, 1 reply; 72+ messages in thread
From: john cooper @ 2004-12-11 16:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt, john cooper

Steven Rostedt wrote:

> [RFC]  Has there been previously any thought of adding priority
> inheriting wait queues? With the IRQs that run as threads, have hooks in
> the code that allow a driver or socket layer to attach a thread to a
> wait queue, and when a RT priority task waits on the queue, a function
> is called to increase (if needed) the priority of the attached thread. I
> know that this would take some work, and would make the normal kernel
> and RT diverge more, but it would really help to solve the problem of a
> high priority process waiting for an interrupt that can be starved by
> other high priority processes.

I think there are two issues here.  One being as above which
addresses allowing the server thread to compete for CPU time
at a priority equal to its highest waiting client.  Essentially
the server needs no inherent priority of its own, rather its
priority is automatically inherited.  The semantics seem
straightforward even in the general case of servers themselves
becoming clients of other servers.

Another issue is the fact the server thread is effectively
non-preemptive.  Otherwise a newly arrived waiter of priority
higher than a client currently being serviced would receive
immediate attention.  One problem to be solved here is how to
save/restore client context when a "context switch" is required.

-john

-- 
john.cooper@timesys.com


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-11 16:59   ` john cooper
@ 2004-12-12  4:36     ` Steven Rostedt
  2004-12-13 23:45       ` john cooper
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-12  4:36 UTC (permalink / raw)
  To: john cooper
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

On Sat, 2004-12-11 at 11:59 -0500, john cooper wrote:
> Steven Rostedt wrote:
> 
> > [RFC]  Has there been previously any thought of adding priority
> > inheriting wait queues? With the IRQs that run as threads, have hooks in
> > the code that allow a driver or socket layer to attach a thread to a
> > wait queue, and when a RT priority task waits on the queue, a function
> > is called to increase (if needed) the priority of the attached thread. I
> > know that this would take some work, and would make the normal kernel
> > and RT diverge more, but it would really help to solve the problem of a
> > high priority process waiting for an interrupt that can be starved by
> > other high priority processes.
> 
> I think there are two issues here.  One being as above which
> addresses allowing the server thread to compete for CPU time
> at a priority equal to its highest waiting client.  Essentially
> the server needs no inherent priority of its own, rather its
> priority is automatically inherited.  The semantics seem
> straightforward even in the general case of servers themselves
> becoming clients of other servers.
> 

I agree with you on this.

> Another issue is the fact the server thread is effectively
> non-preemptive.  Otherwise a newly arrived waiter of priority
> higher than a client currently being serviced would receive
> immediate attention.  One problem to be solved here is how to
> save/restore client context when a "context switch" is required.

I don't quite understand your point here. 

Say you have process A at prio 20 that waits on a queue with server S. S
becomes prio 20 and starts to run. Then it is preempted by process B at
prio 30 which then comes to wait on the server's queue. Server S becomes
prio 30 and finishes process A's work, then checks the queue again and
finds process B and starts working on process B's work still at prio 30.
The time of process B is still bounded (predictable).

So it's similar to a mutex and priority inheritance. We can look at
process A taking lock L and then when process B blocks on lock L,
process A inherits process B's priority (B being greater prio than A).
The difference is that the work is being done within a mutex as opposed
to a server. The work to keep track of what priorities are being
inherited is even easier than mutexes, since you have a process (the
server) to just point to which process it has inherited, and a wait
queue to store which process needs to be inherited next when the server
wakes up the currently inherited process.
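
The A/B/S scenario above can be modelled in a few lines. This is only a
user-space sketch of the bookkeeping, not kernel code: the class name,
method names and the base priority of 0 are assumptions made for
illustration. The server finishes the request it has already started and
runs at the highest priority among its current client and its waiters.

```python
import heapq
import itertools


class InheritingServer:
    """Toy model of a server thread behind a priority-inheriting wait queue."""

    def __init__(self, base_prio=0):
        self.base_prio = base_prio
        self._queue = []                 # max-heap via negated priority
        self._seq = itertools.count()    # FIFO order among equal priorities
        self.current = None              # (prio, name) of the client in service

    def effective_prio(self):
        """Priority the server runs at: max of its own, current and waiters."""
        prios = [self.base_prio]
        if self.current is not None:
            prios.append(self.current[0])
        prios.extend(-neg for neg, _, _ in self._queue)
        return max(prios)

    def enqueue(self, name, prio):
        """A client goes to sleep on the server's wait queue."""
        heapq.heappush(self._queue, (-prio, next(self._seq), name))
        if self.current is None:
            self._start_next()

    def _start_next(self):
        if self._queue:
            neg, _, name = heapq.heappop(self._queue)
            self.current = (-neg, name)
        else:
            self.current = None

    def finish_current(self):
        """Complete the in-progress request, then pick the best waiter."""
        if self.current is None:
            raise RuntimeError("server is idle")
        done = self.current[1]
        self._start_next()
        return done
```

With A enqueued at prio 20 and then B at prio 30, the server keeps working
on A but its effective priority rises to 30, and it moves on to B as soon
as A's work is finished.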

-- Steve


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-12  4:36     ` Steven Rostedt
@ 2004-12-13 23:45       ` john cooper
  2004-12-14 13:01         ` Steven Rostedt
  0 siblings, 1 reply; 72+ messages in thread
From: john cooper @ 2004-12-13 23:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt, john cooper

Steven Rostedt wrote:

>>Another issue is the fact the server thread is effectively
>>non-preemptive.  Otherwise a newly arrived waiter of priority
>>higher than a client currently being serviced would receive
>>immediate attention.  One problem to be solved here is how to
>>save/restore client context when a "context switch" is required.
> 
> 
> I don't quite understand your point here. 
> 
> Say you have process A at prio 20 that waits on a queue with server S. S
> becomes prio 20 and starts to run. Then it is preempted by process B at
> prio 30 which then comes to wait on the server's queue. Server S becomes
> prio 30 and finishes process A's work, then checks the queue again and
> finds process B and starts working on process B's work still at prio 30.
> The time of process B is still bounded (predictable).

My point was the server thread in the above scenario is
non-preemptable.  Otherwise upon B soliciting service from
S, A's work by S would be preempted and attention would be
given immediately to B.

This may very well be a concession to simplicity in the
design.  The server context on behalf of client A would need
to be saved [somewhere] when B caused the preemption and
restored when A's priority warranted doing so.

For a mutex, the priority promotion of 'anything of lower
priority in my way' to logical completion is needed to
preserve the semantics of a mutex, ie: mutex ownership cannot
be preempted.  However in general this doesn't hold for the
server thread model.  We could redirect the server
immediately to a different client at the cost of additional
context switching -- a compromise to consider.

Again this is the general case.  It is likely for critical
sections to exist in the server thread where preemption must
be disabled analogous to the kernel/cpu preemption model.

> ...The work to keep track of what priorities are being
> inherited is even easier than mutexes...

The dependency chain does exist here as for mutexes if we
allow servers to wait on other servers.  Note in this usage
a preemptive server model favors preemption over priority
propagation unless the target server is itself blocked.

Note here it is more obvious [at least to me] circular
dependencies are to be disallowed.  With mutexes, especially
of the reader/writer variety, circular ownership
dependencies can go unnoticed which will confound the
priority promotion logic.

-john

-- 
john.cooper@timesys.com


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-13 23:45       ` john cooper
@ 2004-12-14 13:01         ` Steven Rostedt
  2004-12-14 14:28           ` john cooper
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-14 13:01 UTC (permalink / raw)
  To: john cooper
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

On Mon, 2004-12-13 at 18:45 -0500, john cooper wrote:
> Steven Rostedt wrote:
> > 
> > I don't quite understand your point here. 
> > 
> > Say you have process A at prio 20 that waits on a queue with server S. S
> > becomes prio 20 and starts to run. Then it is preempted by process B at
> > prio 30 which then comes to wait on the server's queue. Server S becomes
> > prio 30 and finishes process A's work, then checks the queue again and
> > finds process B and starts working on process B's work still at prio 30.
> > The time of process B is still bounded (predictable).
> 
> My point was the server thread in the above scenario is
> non-preemptable.  Otherwise upon B soliciting service from
> S, A's work by S would be preempted and attention would be
> given immediately to B.
> 

Why must the server be non-preemptable?  Have you written code for a
server that can immediately switch to another client's request? I've
tried, and it's not easy.  The work of the server would only process one
client at a time. In that regard, the server is "non-preemptable", but
in services to be done, not in context switching. B would preempt server
S just because B is a higher priority. But when B puts itself to sleep
on S's wait queue, S would then inherit B's priority. But S would still
be finishing A's work. When S finished A's work, it would go directly on
to B's work.  This is just like a mutex. Think of the code within a
mutex as a service, and the Server is just the thread that happens to be
doing the work. So Process A would go into the mutex, become Sa, then
when B wanted to go into the mutex, Sa would inherit B's priority, and
when it's finished with A's work, it would become Sb.

> This may very well be a concession to simplicity in the
> design.  The server context on behalf of client A would need
> to be saved [somewhere] when B caused the preemption and
> restored when A's priority warranted doing so.
> 
Server S would not know that B is on the wait queue, except that B has
increased S's priority. S would still work on A's request, so the only
saving for S would be on S's stack when B preempted it.

> For a mutex, the priority promotion of 'anything of lower
> priority in my way' to logical completion is needed to
> preserve the semantics of a mutex, ie: mutex ownership cannot
> be preempted.  However in general this doesn't hold for the
> server thread model.  We could redirect the server
> immediately to a different client at the cost of additional
> context switching -- a compromise to consider.
> 

How would you redirect the Server?  Suppose server S is working on A's work
(let's make it easy and use the example of a web server): A sends S a
request to serve page P, S goes to retrieve P, then B comes along and
requests page Q. How would you write the code to know to stop working on
getting P and start getting Q? S is a single thread doing the work, not
multiple instances.

> Again this is the general case.  It is likely for critical
> sections to exist in the server thread where preemption must
> be disabled analogous to the kernel/cpu preemption model.
> 
> > ...The work to keep track of what priorities are being
> > inherited is even easier than mutexes...
> 
> The dependency chain does exist here as for mutexes if we
> allow servers to wait on other servers.  Note in this usage
> a preemptive server model favors preemption over priority
> propagation unless the target server is itself blocked.
> 

If you have a single thread working as the server, how do you go about
writing code that can have that thread stop a task in the middle and
start doing something else? There may not be a need to do certain
things non-preemptively, but a server (should I say server thread)
only does one task at a time, giving it the same functionality as
a mutex.

> Note here it is more obvious [at least to me] circular
> dependencies are to be disallowed.  With mutexes, especially
> of the reader/writer variety, circular ownership
> dependencies can go unnoticed which will confound the
> priority promotion logic.
> 

I agree with you here, since a process can only have one server working
on it at a time. But this can become a problem, if you have one server
working for another server. If server S needs something from server X
then X needs something from server S, and they both are waiting. But
that would already have shown up in the kernel.

The whole point I'm trying to make is that today, when a high priority
process goes onto a wait queue, the process that will serve that RT
process may be of a lower priority than other processes that are lower
than the original RT process that is waiting.  So you have a case of
priority inversion within processes serving other processes.
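
To make the inversion concrete, here is a small sketch under strict
priority scheduling. The priorities (waiter 50, server 10, unrelated task
30) and the function names are hypothetical values chosen for
illustration; they do not come from this thread.

```python
def pick_next(runnable):
    """Strict priority scheduling: name of the highest-priority runnable task."""
    return max(runnable, key=runnable.get)


def runnable_tasks(inherit):
    """Runnable set while an RT waiter (prio 50) sleeps on the server's queue."""
    waiter_prio = 50                        # blocked, so not in the runnable set
    tasks = {"server": 10, "other": 30}     # "other" is unrelated to the waiter
    if inherit:
        # With an inheriting wait queue the server is boosted to the waiter.
        tasks["server"] = max(tasks["server"], waiter_prio)
    return tasks
```

Without inheritance pick_next() selects "other", so the prio 50 waiter is
effectively held up by a prio 30 task; with inheritance the server runs
first.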

-- Steve


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-14 13:01         ` Steven Rostedt
@ 2004-12-14 14:28           ` john cooper
  2004-12-14 16:53             ` Steven Rostedt
  0 siblings, 1 reply; 72+ messages in thread
From: john cooper @ 2004-12-14 14:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt, john cooper

Steven Rostedt wrote:

> Why must the server be non-preemptable?  Have you written code for a
> server that can immediately switch to another client's request? I've
> tried, and it's not easy.

Yes I have and agree it is not trivial.  What is required is
for the server's context in the process of servicing client A
to be saved with A and context of the new client to be loaded.

One way this can be accomplished is to save the server context
on the stack of the client which it is preempting.  This is
quite similar to the model of a CPU (server) and how task
(client) preemption is effected in the kernel.

> The work of the server would only process one
> client at a time. In that regard, the server is "non-preemptable", but
> in services to be done, not in context switching. B would preempt server
> S just because B is a higher priority. But when B puts itself to sleep
> on S's wait queue, S would then inherit B's priority.

In this scenario B is not preempting A's service, but rather elevating
S to the priority of B in hope S can compete for system resource at
B's priority.  In a preemptive scenario B would not sleep on S as its
priority would redirect S from A to B.  The only time an arriving high
priority client would promote/block over preemption would be if S
itself was blocked (unavailable).

> But S would still
> be finishing A's work. When S finished A's work, it would go directly on
> to B's work.  This is just like a mutex...

Yes the above is just like a mutex.  However the model of a server
thread offers opportunity for greater server (resource) availability
while retaining correct semantics vs. that of a mutex (resource).
The server's service of a client in general may be preempted while
the ownership of a mutex may not.

> Server S would not know that B is on the wait queue, except that B has
> increased S's priority. S would still work on A's request, so the only
> saving for S would be on S's stack when B preempted it.

True in the non-preemptive case as there is no reason to
notify S.  In the preemptive case S would receive the
equivalent of an interrupt upon B's arrival.

> How would you redirect the Server?  Suppose server S is working on A's work
> (let's make it easy and use the example of a web server): A sends S a
> request to serve page P, S goes to retrieve P, then B comes along and
> requests page Q. How would you write the code to know to stop working on
> getting P and start getting Q? S is a single thread doing the work, not
> multiple instances.

True.  However abstractly S would be preempted and save its
current context 'with' A.  S then would restore (load) B's
(initial) context and begin service.

A userspace example is probably not the easiest place to start.
Such saving and restoration of context is much less of a logistical
issue for in-kernel mechanisms.

That said, I've implemented similar server models in userspace, where
the available server interruptive mechanism boils down to sending
a signal.  The enforcement of critical sections in the
server is effected by blocking signal delivery (preemption).

> If you have a single thread working as the server, how do you go about
> writing code that can have that thread stop a task in the middle and
> start doing something else? There may not be a need to do certain
> things non-preemptively, but a server (should I say server thread)
> only does one task at a time, giving it the same functionality as
> a mutex.

The server in the preemptive model virtualizes a CPU.  We can
preempt the service of the CPU during execution in a task (client)
by interrupting the CPU.  This results in saving of the CPU context
during the service of the client on the stack of the preempted client.
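
One way to picture this preemptive model in user space is with
generators: each client's request is a generator, and its suspended frame
plays the role of the server context "saved with" that client. This is
only an illustrative sketch; the names are hypothetical, preemption
happens at step boundaries rather than at arbitrary instructions, and it
is not the signal-based implementation mentioned above.

```python
import heapq
import itertools


def request(name, steps, log):
    """A client's work, split into steps; each yield is a preemption point."""
    for i in range(steps):
        log.append(f"{name}{i}")
        yield


class PreemptiveServer:
    """Single server thread that always runs the highest-priority request."""

    def __init__(self):
        self._ready = []                 # max-heap via negated priority
        self._seq = itertools.count()    # tie-breaker, keeps tuples comparable

    def submit(self, prio, work):
        heapq.heappush(self._ready, (-prio, next(self._seq), work))

    def step(self):
        """Advance the best pending request by one step; False when idle."""
        if not self._ready:
            return False
        work = self._ready[0][2]
        try:
            next(work)                   # resume from the saved context
        except StopIteration:
            heapq.heappop(self._ready)   # request complete, drop it
        return True
```

If A (prio 20) has run one step when B (prio 30) arrives, the server
switches to B at the next step, runs B to completion, and then resumes A
exactly where it left off.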

> I agree with you here, since a process can only have one server working
> on it at a time. But this can become a problem, if you have one server
> working for another server. If server S needs something from server X
> then X needs something from server S, and they both are waiting. But
> that would already have shown up in the kernel.

Yes, circular dependencies are illegal while strictly ordered
dependencies are allowable.  This is similar to enforcement of
a mutex/lock acquisition hierarchy.

> The whole point I'm trying to make is that today, when a high priority
> process goes onto a wait queue, the process that will serve that RT
> process may be of a lower priority than other processes that are lower
> than the original RT process that is waiting.  So you have a case of
> priority inversion within processes serving other processes.

Agreed.  And I do think this mechanism has merit irrespective
of the preemption model -- I wouldn't expect a preemptive
approach to be available in the first prototype.

I'd hazard other likely sources of battle history dealing with
client/server/preemption issues to be found in papers dealing with
microkernel [who?] design of about a decade and a half ago.

-john

-- 
john.cooper@timesys.com


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-14 14:28           ` john cooper
@ 2004-12-14 16:53             ` Steven Rostedt
  0 siblings, 0 replies; 72+ messages in thread
From: Steven Rostedt @ 2004-12-14 16:53 UTC (permalink / raw)
  To: john cooper
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

On Tue, 2004-12-14 at 09:28 -0500, john cooper wrote:

> Agreed.  And I do think this mechanism has merit irrespective
> of the preemption model -- I wouldn't expect a preemptive
> approach to be available in the first prototype.
> 
> I'd hazard other likely sources of battle history dealing with
> client/server/preemption issues to be found in papers dealing with
> microkernel [who?] design of about a decade and a half ago.

OK, I understand what you're saying. Oh, and tell Scott I said "Hi".

-- Steve



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10 23:42 ` Steven Rostedt
  2004-12-11 16:59   ` john cooper
@ 2004-12-11 17:59   ` Esben Nielsen
  2004-12-11 18:59     ` Steven Rostedt
  2004-12-13 22:31   ` Ingo Molnar
  2 siblings, 1 reply; 72+ messages in thread
From: Esben Nielsen @ 2004-12-11 17:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt


On Fri, 10 Dec 2004, Steven Rostedt wrote:

> On Thu, 2004-12-09 at 12:10 -0600, Mark_H_Johnson@raytheon.com wrote:
> > >but you never want your real application be delayed by things like IDE
> > >processing or networking workloads, correct?
> > For the most part, that I/O workload IS because I have the RT application
> > running. That was one of my points. I cannot reliably starve any of
> > those activities. The disk reads in my real application simulate a disk
> > read from a real world device. That data is needed for RT processing
> > in the simulated system. Some of the network traffic is also RT since
> > we generate a data stream that is interpreted in real time by other
> > systems.
> 
> [RFC]  Has there been previously any thought of adding priority
> inheriting wait queues? With the IRQs that run as threads, have hooks in
> the code that allow a driver or socket layer to attach a thread to a
> wait queue, and when a RT priority task waits on the queue, a function
> is called to increase (if needed) the priority of the attached thread. I
> know that this would take some work, and would make the normal kernel
> and RT diverge more, but it would really help to solve the problem of a
> high priority process waiting for an interrupt that can be starved by
> other high priority processes.
> 
> Just a thought.
>
I am not sure I understand you correctly.

If it is a general method of priority sorting on wait-queues: Yes,
certainly! The highest priority task nearly always ought to be woken
first.
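
A minimal sketch of such priority-sorted wakeup, in Python for brevity
rather than kernel C. The class name PrioWaitQueue and its methods are
hypothetical and only illustrate the ordering; they are not an existing
kernel interface.

```python
import heapq
import itertools


class PrioWaitQueue:
    """Toy wait queue that wakes the highest-priority waiter first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority

    def wait(self, task, prio):
        # heapq is a min-heap, so negate prio (larger number = more urgent).
        heapq.heappush(self._heap, (-prio, next(self._seq), task))

    def wake_one(self):
        if not self._heap:
            return None
        _, _, task = heapq.heappop(self._heap)
        return task


wq = PrioWaitQueue()
wq.wait("logger", prio=1)
wq.wait("rt_control", prio=90)
wq.wait("audio", prio=50)
wq.wait("rt_control_2", prio=90)

order = [wq.wake_one() for _ in range(4)]
print(order)
```

Waiters of equal priority are woken in arrival order, which is one
reasonable choice among several.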

But in a lot of cases you send messages from high to low and vice versa
without wanting to move their priorities by doing so. If, for instance, you
want an IRQ thread to be increased in priority when an RT task listens to
packets from that device, I think it is a bad idea. The developer should
set the priorities right himself. The device might use a lot of CPU in
some cases. By increasing its priority you might destroy the RT
properties of all the tasks in between. In general you don't know.
 
> -- Steve
> 
Esben


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-11 17:59   ` Esben Nielsen
@ 2004-12-11 18:59     ` Steven Rostedt
  2004-12-11 19:50       ` Esben Nielsen
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-11 18:59 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Sat, 2004-12-11 at 18:59 +0100, Esben Nielsen wrote:
> On Fri, 10 Dec 2004, Steven Rostedt wrote:

> I am not sure I understand you correctly.
> 
> If it is a general method of priority sorting on wait-queues: Yes,
> certainly! The highest priority task nearly always ought to be woken
> first.
> 
> But in a lot of cases you send messages from high to low and vice versa
> without wanting to move their priorities by doing so. If, for instance, you
> want an IRQ thread to be increased in priority when an RT task listens to
> packets from that device, I think it is a bad idea. The developer should
> set the priorities right himself. The device might use a lot of CPU in
> some cases. By increasing its priority you might destroy the RT
> properties of all the tasks in between. In general you don't know.
>  

Actually, I was thinking of something more configurable (and so, more
complex).  The main problem I've seen in general is differentiating
services for RT tasks from services for others. So if an RT task is waiting
for some disk activity while other RT tasks are running, the IRQ thread (or
whatever will service the disk) may be starved. I agree that this is
really more of a design issue, but I thought that there may be ways to
facilitate the RT design by setting flags in a task before it reads from
disk, so that if the RT task blocks waiting for a disk read, the disk
serving thread would inherit the priority of that task. One could argue
that the task could simply increase the service provider's priority
before doing the read, but then it may not block, and the boost would be
wasted.
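
A rough sketch of that inheritance idea, in Python rather than kernel C.
Task, block_on and wake are made-up names for illustration, and the
priorities are arbitrary: the serving thread is boosted only while an RT
task is actually blocked on it, and drops back when the request
completes.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    base_prio: int                                # larger number = more urgent
    waiters: list = field(default_factory=list)   # tasks blocked on this one

    @property
    def prio(self):
        # Effective priority: own priority or the highest waiter's.
        return max([self.base_prio] + [w.prio for w in self.waiters])


def block_on(waiter, server):
    """Waiter blocks on a request served by `server`; server inherits."""
    server.waiters.append(waiter)


def wake(waiter, server):
    """Request completed: the inherited priority is dropped again."""
    server.waiters.remove(waiter)


disk_irq = Task("disk_irq", base_prio=10)
rt_reader = Task("rt_reader", base_prio=80)
rt_cruncher = Task("rt_cruncher", base_prio=60)   # CPU-bound RT task

before = disk_irq.prio            # below rt_cruncher: the disk thread starves
block_on(rt_reader, disk_irq)
boosted = disk_irq.prio           # now above rt_cruncher
wake(rt_reader, disk_irq)
after = disk_irq.prio             # back to the base priority
print(before, boosted, after)
```

Note that while boosted, the disk thread also preempts rt_cruncher, which
is exactly the effect on "the tasks in between" that was objected to
earlier in the thread.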

I guess servicing in general is very hard to predict, so an RT task must
have all its information read and stored somewhere it can be retrieved
in a predictable amount of time, and not on disk or someplace that needs
another task to do the request while also handling other tasks (thus
complicating the priority scheme).  As for sockets, I did my Master's
thesis on setting up RT sockets that are handled separately from other
sockets, with a protocol that lets incoming packets be quickly
identified as RT packets so they can go right to where they are
needed.

I just wanted to bring up this discussion, I guess a general approach is
too difficult and not worth the effort.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-11 18:59     ` Steven Rostedt
@ 2004-12-11 19:50       ` Esben Nielsen
  2004-12-11 22:34         ` Steven Rostedt
  0 siblings, 1 reply; 72+ messages in thread
From: Esben Nielsen @ 2004-12-11 19:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Sat, 11 Dec 2004, Steven Rostedt wrote:

> On Sat, 2004-12-11 at 18:59 +0100, Esben Nielsen wrote:
> > On Fri, 10 Dec 2004, Steven Rostedt wrote:
> 
> > I am not sure I understand you correctly.
> > 
> > If it is a general method of priority sorting on wait-queues: Yes,
> > certainly! The highest priority task nearly always ought to be woken
> > first.
> > 
> > But in a lot of cases you send messages from high to low and vice versa
> > without wanting to move their priorities by doing so. If, for instance, you
> > want an IRQ thread to be increased in priority when an RT task listens to
> > packets from that device, I think it is a bad idea. The developer should
> > set the priorities right himself. The device might use a lot of CPU in
> > some cases. By increasing its priority you might destroy the RT
> > properties of all the tasks in between. In general you don't know.
> >  
> 
> Actually, I was thinking of something more configurable (and so, more
> complex).  The main problem I've seen in general is differentiating
> services for RT tasks from services for others. So if an RT task is waiting
> for some disk activity while other RT tasks are running, the IRQ thread (or
> whatever will service the disk) may be starved. I agree that this is
> really more of a design issue, but I thought that there may be ways to
> facilitate the RT design by setting flags in a task before it reads from
> disk, so that if the RT task blocks waiting for a disk read, the disk
> serving thread would inherit the priority of that task. One could argue
> that the task could simply increase the service provider's priority
> before doing the read, but then it may not block, and the boost would be
> wasted.

Disk access - at least on top of a filesystem - is not real-time. But we
can say it is some other device.

I would take the following approach:
1) Ensure the IRQ handler isn't in any way using too much CPU and
increase its priority statically.
2) Reconsider my overall design: Apparently the device isn't suitable for
real-time.

> 
> I guess servicing in general is very hard to predict, so an RT task must
> have all its information read and stored somewhere it can be retrieved
> in a predictable amount of time, and not on disk or someplace that needs
> another task to do the request while also handling other tasks (thus
> complicating the priority scheme).  As for sockets, I did my Master's
> thesis on setting up RT sockets that are handled separately from other
> sockets, with a protocol that lets incoming packets be quickly
> identified as RT packets so they can go right to where they are
> needed.

Linux relies on soft IRQ for delivering packets to the listening
protocol stacks. That is a problem because you can't just boost the
priority of soft-IRQ without boosting a lot of things.

With IRQ-threading the design could be changed such that the IRQ thread does
the job directly. But that will make the whole IRQ thread drive the
protocol stack as well :-(

It all depends on what your requirements are. Maybe you can handle
"driving" the whole IP stack before handling the RT packet - maybe not.

How did you handle it in your thesis?


> 
> I just wanted to bring up this discussion, I guess a general approach is
> too difficult and not worth the effort.
>

If you can think up something there is no harm in trying it :-)
 
> Thanks,
> 
> -- Steve
> 
Esben



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-11 19:50       ` Esben Nielsen
@ 2004-12-11 22:34         ` Steven Rostedt
  2004-12-13 21:55           ` Bill Huey
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-11 22:34 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Sat, 2004-12-11 at 20:50 +0100, Esben Nielsen wrote:
> Linux relies on soft IRQ for delivering packets to the listening
> protocol stacks. That is a problem because you can't just boost the
> priority of soft-IRQ without boosting a lot of things.
> 
> With IRQ-threading the design could be changed such that the IRQ thread does
> the job directly. But that will make the whole IRQ thread drive the
> protocol stack as well :-(
> 
> It all depends on what your requirements are. Maybe you can handle
> "driving" the whole IP stack before handling the RT packet - maybe not.
> 
> How did you handle it in your thesis?
> 

I had an irq threaded kernel, and all softirqs were handled by the
softirqd thread. I created two more threads that would handle the
sending and receiving of the packets.  Here's how it worked: 

Each packet had an IP option added that stated the priority of the
packet. (Of course, every connected machine must implement this
protocol, and priorities must mean the same thing on each.)

When a packet arrived, the interrupt (in interrupt context, not a thread)
would check whether it was an RT packet. If it was, it placed it on an RT
receive queue and woke up the receive thread. If needed it would raise
the priority of that thread. If the packet was not RT, it went the
normal route (placed on the queue for the softirq to handle).
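
The receive-side split can be sketched roughly as follows. This is a toy
model in Python, not the thesis code: RxPath, the "rt_prio" key (standing
in for the IP option) and the numeric priorities are all assumptions made
for illustration.

```python
import heapq


class RxPath:
    """Toy model of the receive-side split between RT and normal packets."""

    def __init__(self):
        self.rt_queue = []        # heap of (-prio, seq, packet)
        self.softirq_queue = []   # FIFO for everything else
        self.rx_thread_prio = 0   # current priority of the receive thread
        self._seq = 0

    def on_interrupt(self, packet):
        """Classify one packet; return True if the receive thread was woken."""
        prio = packet.get("rt_prio")   # stands in for the IP option
        if prio is None:
            self.softirq_queue.append(packet)   # normal route
            return False
        self._seq += 1
        heapq.heappush(self.rt_queue, (-prio, self._seq, packet))
        # Raise the receive thread only if this packet is more urgent.
        self.rx_thread_prio = max(self.rx_thread_prio, prio)
        return True


rx = RxPath()
woken = [rx.on_interrupt(p) for p in (
    {"data": "bulk"},
    {"data": "sensor", "rt_prio": 40},
    {"data": "control", "rt_prio": 70},
)]
print(woken, rx.rx_thread_prio, rx.rt_queue[0][2]["data"])
```

The sketch does not model lowering the receive thread's priority again
once the urgent packets have been delivered.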

The packet queue was a heap queue sorted by priority. The TCP/IP stack
was broken up into sections. The receive thread would only
process the packet on top of the queue. At the end of each section, it
would check whether the queue had changed and, if a higher-priority packet
had come in meanwhile, start processing that packet instead.  So the packets
on the queue had a state attached to them.  When the packet eventually
made it to the waiting process, it was then handled by that process. So
if a process was waiting, the process would have been woken up and it
would handle the rest of the processing. Otherwise the receive thread
would do it up to where it can drop it off to the processes. I set the
packet's priority to one less than that of the process it was sent from
and the one it was going to.
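
A small Python sketch of the sectioned processing described above,
under stated assumptions: the three stage names and the StagedRxQueue
class are invented for illustration, and the real sections of the stack
were surely different. Each queued packet carries its own stage index, so
a packet that loses the top of the heap resumes later where it left off.

```python
import heapq

STAGES = ["ip", "tcp", "socket"]   # hypothetical section boundaries


class StagedRxQueue:
    """Priority heap of packets, each remembering its next stack section."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def push(self, name, prio):
        self._seq += 1
        # Entry layout: [-prio, seq, name, next_stage_index]
        heapq.heappush(self._heap, [-prio, self._seq, name, 0])

    def run_one_section(self):
        """Run one section of the top packet; return (name, stage) or None."""
        if not self._heap:
            return None
        entry = self._heap[0]              # re-read the top after every section
        name, stage = entry[2], STAGES[entry[3]]
        entry[3] += 1                      # heap order depends only on prio/seq
        if entry[3] == len(STAGES):
            heapq.heappop(self._heap)      # fully processed
        return name, stage


q = StagedRxQueue()
q.push("low", prio=10)
trace = [q.run_one_section()]              # "low" gets through its first section
q.push("high", prio=50)                    # a more urgent packet arrives
while (step := q.run_one_section()) is not None:
    trace.append(step)
print(trace)
```

The urgent packet runs all of its sections before the earlier packet
resumes at its second section.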

The sending was done mostly by the process, but if it had to wait for
some reason, the sending thread would take over.

This was mostly academic in nature, but was a lot of fun and interesting
to see how results changed with different methods.

> 
> > 
> > I just wanted to bring up this discussion, I guess a general approach is
> > too difficult and not worth the effort.
> >
> 
> If you can think up something there is no harm in trying it :-)
>  

If I ever think of something, I would not hesitate on implementing
it ;-)

> > Thanks,
> > 
> > -- Steve
> > 
> Esben
> 
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-11 22:34         ` Steven Rostedt
@ 2004-12-13 21:55           ` Bill Huey
  2004-12-13 22:15             ` Steven Rostedt
  0 siblings, 1 reply; 72+ messages in thread
From: Bill Huey @ 2004-12-13 21:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Esben Nielsen, Mark Johnson, Ingo Molnar, Amit Shah,
	Karsten Wiese, Bill Huey, Adam Heath, emann, Gunther Persoons,
	K.R. Foley, LKML, Florian Schmidt, Fernando Pablo Lopez-Lezcano,
	Lee Revell, Rui Nuno Capela, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt

On Sat, Dec 11, 2004 at 05:34:40PM -0500, Steven Rostedt wrote:
> On Sat, 2004-12-11 at 20:50 +0100, Esben Nielsen wrote:
> > How did you handle it in your thesis?

I'd like to see the code even if it's not ready for inclusion or anything
along those lines just to see what other kind of problems you ran into.

> I had an irq threaded kernel, and all softirqs were handled by the
> softirqd thread. I created two more threads that would handle the
> sending and receiving of the packets.  Here's how it worked: 

[priority tags packets...]

> When a packet arrived, the interrupt (in interrupt context, not a thread)
> would check whether it was an RT packet. If it was, it placed it on an RT
> receive queue and woke up the receive thread. If needed it would raise
> the priority of that thread. If the packet was not RT, it went the
> normal route (placed on the queue for the softirq to handle).

A generalized system to do this is pretty important for folks doing things
like QoS over things like Firewire, SCSI, USB, and other high speed busses
of that nature including networking layers. Folks doing things with clusters
would love to have something like that if the subsystem layer above the
driver responsible for handling the protocol was modified to have this
ability.

One thing that I noticed in this thread is that even though you were talking
about the mechanisms to support these features, it really needs some
consideration as to how it's going to affect the stock kernel since you're
really introducing a first-class threading object/concept into the system.
That means changes to the scheduler, how QoS fits into this, etc...
IMO, it's ultimately about QoS and that alone is a hot button since it's
so invasive throughout the kernel.

Creating a special threaded server object (thinking out loud) might be a
good idea in that it could be attached to any arbitrary subsystem at will,
assuming that particular subsystem's logic permits this easily.

It's not a light topic and can certainly use more folks pushing it. I'm
very interested in getting something like this into Linux, but stability,
latency regularity, and contention are things that still need a lot of work.

> The packet queue was a heap queue sorted by priority. The TCP/IP stack
> was broken up into sections. The receive thread would only
> process the packet on top of the queue. At the end of each section, it
> would check whether the queue had changed and, if a higher-priority packet
> had come in meanwhile, start processing that packet instead.  So the packets
> on the queue had a state attached to them.  When the packet eventually
> made it to the waiting process, it was then handled by that process. So
> if a process was waiting, the process would have been woken up and it
> would handle the rest of the processing. Otherwise the receive thread
> would do it up to where it can drop it off to the processes. I set the
> packet's priority to one less than that of the process it was sent from
> and the one it was going to.
> 
> The sending was done mostly by the process, but if it had to wait for
> some reason, the sending thread would take over.
> 
> This was mostly academic in nature, but was a lot of fun and interesting
> to see how results changed with different methods.

This is a good track to research casually since not that many people have
done so, and so that the problem space is mapped in this particular kernel.
With things like VoIP and relatives becoming popular, this is becoming
more and more essential over time.

It's up to you, but I think this is a great track to pursue.. That's because
if you don't do it, somebody else will... :)

bill

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-13 21:55           ` Bill Huey
@ 2004-12-13 22:15             ` Steven Rostedt
  2004-12-13 22:20               ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-13 22:15 UTC (permalink / raw)
  To: Bill Huey
  Cc: Esben Nielsen, Mark Johnson, Ingo Molnar, Amit Shah,
	Karsten Wiese, Adam Heath, emann, Gunther Persoons, K.R. Foley,
	LKML, Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Mon, 2004-12-13 at 13:55 -0800, Bill Huey wrote:

> 
> One thing that I noticed in this thread is that even though you were talking
> about the mechanisms to support these features, it really needs some
> consideration as to how it's going to affect the stock kernel since you're
> really introducing a first-class threading object/concept into the system.
> That means changes to the scheduler, how QoS fits into this, etc...
> IMO, it's ultimately about QoS and that alone is a hot button since it's
> so invasive throughout the kernel.
> 

Is there any talk about Ingo's patch getting into the mainstream kernel?

> Creating a special threaded server object (thinking out loud) might be a
> good idea in that it could be attached to any arbitrary subsystem at will,
> assuming that particular subsystem's logic permits this easily.
> 
> It's not a light topic and can certainly use more folks pushing it. I'm
> very interested in getting something like this into Linux, but stability,
> latency regularity, and contention are things that still need a lot of work.
>  
> > The packet queue was a heap queue sorted by priority. The TCP/IP stack
> > was broken up into sections. The receive thread would only
> > process the packet on top of the queue. At the end of each section, it
> > would check whether the queue had changed and, if a higher-priority packet
> > had come in meanwhile, start processing that packet instead.  So the packets
> > on the queue had a state attached to them.  When the packet eventually
> > made it to the waiting process, it was then handled by that process. So
> > if a process was waiting, the process would have been woken up and it
> > would handle the rest of the processing. Otherwise the receive thread
> > would do it up to where it can drop it off to the processes. I set the
> > packet's priority to one less than that of the process it was sent from
> > and the one it was going to.
> > 
> > The sending was done mostly by the process, but if it had to wait for
> > some reason, the sending thread would take over.
> > 
> > This was mostly academic in nature, but was a lot of fun and interesting
> > to see how results changed with different methods.
> 
> This is a good track to research casually since not that many people have
> done so, and so that the problem space is mapped in this particular kernel.
> With things like VoIP and relatives becoming popular, this is becoming
> more and more essential over time.
> 
> It's up to you, but I think this is a great track to pursue.. That's because
> if you don't do it, somebody else will... :)
> 

I'd love to keep up on it, but now I'm working on a contract that's
taking all of my time. I did this some time back using the TimeSys GPL
kernel.  Of course I didn't have the priority inheritance (it's a
proprietary module), but it was good for my needs.

The work I'm now doing may swing back into this field, and we'll see
what happens.  As I said earlier, this was very much academic and needs
lots of work. I did notice that the processors today make the TCP/IP
stack very fast, but the big improvement was the separate queue for
packets coming in and seeing right away that they need to be processed
ahead of other packets, as well as other processes.

> bill
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-13 22:15             ` Steven Rostedt
@ 2004-12-13 22:20               ` Ingo Molnar
  0 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-13 22:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Bill Huey, Esben Nielsen, Mark Johnson, Amit Shah, Karsten Wiese,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > One thing that I noticed in this thread is that even though you were talking
> > about the mechanisms to support these features, it really needs some
> > consideration as to how it's going to affect the stock kernel since you're
> > really introducing a first-class threading object/concept into the system.
> > That means changes to the scheduler, how QoS fits into this, etc...
> > IMO, it's ultimately about QoS and that alone is a hot button since it's
> > so invasive throughout the kernel.
> 
> Is there any talk about Ingo's patch getting into the mainstream
> kernel?

a good number of generic bits (generic irq subsystem, preemption
fixes/enhancements, lock initializer cleanups, and tons of fixes found
in -RT) are upstream or in -mm already, but the core PREEMPT_RT stuff is
still under development and thus not ready for upstream. I'm constantly
sending independent bits (fixes or orthogonal improvements) that show up
in -RT towards upstream as well. [-RT would be a 1MB unmaintainable
patch otherwise.]

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10 23:42 ` Steven Rostedt
  2004-12-11 16:59   ` john cooper
  2004-12-11 17:59   ` Esben Nielsen
@ 2004-12-13 22:31   ` Ingo Molnar
  2 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-13 22:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, LKML, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Steven Rostedt <rostedt@goodmis.org> wrote:

> [RFC] Has there been previously any thought of adding priority
> inheriting wait queues. [...]

this will make sense at a certain point.

> [...] it would really help to solve the problem of a high priority
> process waiting for an interrupt that can be starved by other high
> priority processes.

the primary use i think would be kernel-internal task <-> task
waitqueues such as the futex queues, to transport the effects of RT
priorities across waitqueues as well. IRQ related waitqueues are a nice
'side-effect'.

another next step would be to transport PI effects to userspace code,
for user-controlled synchronization objects such as futexes or e.g. SysV
semaphores.

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-13 14:10 Mark_H_Johnson
  0 siblings, 0 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-13 14:10 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, LKML, Ingo Molnar, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Steven Rostedt, Shane Shrybman, Thomas Gleixner, Michal Schmidt

>Disk access - at least on top of a filesystem - is not real-time. But we
>can say it is some other device.
I am not quite sure you should make such a general statement. There are
a number of "real time" processes that access disk drives. Things that
come to mind include:
 - paging for a visual display system (think a high end flight simulator)
 - streaming data acquisition
 - several multimedia applications (video / audio)
The application I mentioned (simulating a real world system that uses
a disk drive) certainly falls within the real time range as well.

You certainly have to manage the application carefully. But with
preallocated (preferably contiguous) files, you can do quite a lot with
a disk in a real time system. The rates may not be as high as needed
for some applications, but the overall concept is certainly valid.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 21:58 Mark_H_Johnson
  2004-12-09 22:55 ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 21:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>on SMP, latencytest + all IRQ threads (and ksoftirqd) at prio 99 +
>PREEMPT_RT is not comparable to PREEMPT_DESKTOP (with no IRQ threading).
Of course they are comparable. You may not consider it a FAIR
comparison, but they are comparable. I maintain the comparison shows
the increased overhead of IRQ threading - you maintain it is an
"inverse scenario" which I don't buy.

>The -RT kernel will 'split' hardirq and softirq workloads and migrate
>them to different CPUs - giving them a higher total throughput. Also, on
>PREEMPT_DESKTOP the IRQs will most likely go to one CPU only, and most
>softirq processing will be concentrated on that CPU too. Furthermore,
>the -RT kernel will aggressively distribute highprio RT tasks.
Now wait a minute, how does -RT do that IRQ splitting? From what I recall
(I don't have RT up right now) taskset indicated all the IRQ tasks were
wired to CPU 0 and the only opportunity for splitting was with
ksoftirqd/0 and /1. I confirmed that by looking at the "last CPU" in top.

When I tried to change the CPU affinity of those IRQ tasks (to use both
CPU's), I got error messages. One of the responses you made at that
time was...
>> If setting it to 3 is REALLY BAD, perhaps we should prevent it.
>
>it's just like setting ksoftirqd's affinity. I agree that it's nasty,
>but there's no easy way right now.
Has this behavior changed in the last three weeks?

For a CONFIG_DESKTOP data point, let's take a look at the latency traces
I just made from -12PK.
  CPU 0 - 10, 14, 16 - 38, 40 - 43,
  CPU 1 - 00 - 09, 11 - 13, 15, 39, 44
let's see - 45 total traces, 29 for CPU 0 and 16 for CPU 1. Not quite
evenly balanced but not all on one CPU either (the data IS bursty though).
The common_interrupt trace appears to show up only on CPU 0, but the
latency traces are definitely on both CPU's.
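
The per-CPU counts can be checked from the trace number lists with a few
lines of Python (the helper name expand is arbitrary; the two lists are
copied from above):

```python
def expand(spec):
    """Expand a list such as '10, 14, 16 - 38' into individual trace numbers."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue                      # tolerate a trailing comma
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            ids.extend(range(lo, hi + 1))
        else:
            ids.append(int(part))
    return ids


cpu0 = expand("10, 14, 16 - 38, 40 - 43,")
cpu1 = expand("00 - 09, 11 - 13, 15, 39, 44")
print(len(cpu0), len(cpu1), len(cpu0) + len(cpu1))
```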

>latencytest under your priority setup measures an _inverse_ scenario. (a
>CPU hog executing at a lower priority than all IRQ traffic) I'd not be
>surprised at all if it had higher latencies under -RT than under
>PREEMPT_DESKTOP.
Why "higher latencies"? And do you mean
 - more short latencies (I'm counting a lot of just over 100 usec delays)
OR
 - longer overall latencies (which I am not expecting but seeing)
OR
something else?

Let's look at the possible scenarios:
[PK refers to "Preemptible Kernel - PREEMPT_DESKTOP" w/o IRQ threading]
[RT refers to PREEMPT_RT with the IRQ # and ksoftirqd/# threads at RT 99]
 [1] A single interrupt comes in and latencytest is NOT on the CPU
that services the interrupt. In the case of PK, latencytest is
unaffected. In the case of RT, latencytest is affected ONLY if the
IRQ # thread or ksoftirqd/# thread is on the CPU with latencytest.
In that case, latencytest is pushed to the other CPU. That switch
takes some TBD amount of time and is counted by latencytest only if
it exceeds 100 usec.
 [2] A single interrupt comes in and latencytest is on the CPU that
services the interrupt. In the case of PK, latencytest is preempted
for the duration of the interrupt and resumes. In the case of RT,
latencytest is rescheduled on the other CPU (or not) once we reach the
place where we are ready to thread the IRQ. I would think RT should do
better in this case but am not sure.
 [3] A series of interrupts comes in. In PK what I see is several
sequential delays up to 1/2 msec or so (and have traces that show that
behavior). In RT I would expect a shorter latency period (if both CPU's
are busy with IRQ's or not) than PK [though I don't have traces for
this since if I cross CPU's the trace doesn't get recorded].

I don't see how RT should have worse numbers in these scenarios
unless the overhead is more (or I'm counting more trivial latencies)
than in PK. I would expect to see in the RT case a shorter maximum
delay (which alas I do NOT see).

>It's not clear-cut which one 'wins' though: because
>even this inverse scenario will have benefits in the -RT case: due to
>SCHED_OTHER workloads not interfering with this lower-prio RT task as
>much. But i'd expect there to be a constant moving of the 'benchmark
>result' forward and backwards, even if -RT only improves things - this
>is the nature of such an inverse priority setup.

Not quite sure what you mean by this.

>so this setup generates two conflicting parameters which are inverse to
>each other, and the 'sum' of these two parameters ends up fluctuating
>wildly. Pretty much like the results you are getting. The two parameters
>are: latency of the prio 30 task, and latency of the highprio tasks. The
>better the -RT kernel gets, the better the prio 30 tasks's priorities
>get relative to SCHED_OTHER tasks - but the worse they also get, due to
>the better handling of higher-prio tasks. Where the sum ends, whether
>it's a "win" or a "loss" depends on the workload, how much highprio
>activity the lowprio threads generate, etc.
I don't see how this rationale is relevant - the amount of work for IRQ
activities that is generated by each workload should be similar. It's
one of the reasons I run the same tests over and over again.

If I create a 750 Mbyte file (one of the stress test cases), I should be
doing a series of disk writes and interrupts. Both RT and PK should do
about the same work to create that file. So the overhead on latencytest
should be about the same for both RT and PK. If the overhead is
not the same, something is wrong.

If I look at the max latency:
  RT 3.90
  PK 1.91  (both cases nominal is 1.16 msec)
From the scenarios I described above, I don't see why this result should
have occurred. Certainly nothing that should cause a delay of over
two msec on a roughly one msec task.

If I look at the % within 100 usec measure:
  RT 87% within 100 usec, 97% within 200 usec (360 seconds elapsed)
  PK 67% within 100 usec, 96% within 200 usec (57 seconds elapsed)
[note 250,000 samples in 360 seconds is 694 samples per second]
From a percentage point of view, this looks bad for PK but if I
factor in the elapsed time I get...
 - PK interrupted latencytest about 13000 times
 - RT interrupted latencytest about 32000 times
I am not sure how much of this is due to the workload (disk writes)
or due to the elapsed time aspects.
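
One reading of these figures that reproduces the two counts, written out
as a short Python calculation. It assumes that both runs sample at the
same rate and that "interrupted" means a sample falling outside the
100 usec band; that is an interpretation of the numbers above, not
necessarily how they were derived.

```python
rt_samples = 250_000      # samples taken during the RT run
rt_elapsed_s = 360        # elapsed seconds, RT
pk_elapsed_s = 57         # elapsed seconds, PK

rate = rt_samples / rt_elapsed_s          # samples per second
pk_samples = rate * pk_elapsed_s          # assumed: same sampling rate for PK

rt_late = rt_samples * (1 - 0.87)         # RT: 87% within 100 usec
pk_late = pk_samples * (1 - 0.67)         # PK: 67% within 100 usec

print(f"{rate:.0f} samples/s")
print(f"RT: about {rt_late:.0f} late samples, PK: about {pk_late:.0f}")
```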

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 21:58 Mark_H_Johnson
@ 2004-12-09 22:55 ` Ingo Molnar
  0 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 22:55 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> When I tried to change the CPU affinity of those IRQ tasks (to use both
> CPU's), I got error messages. One of the responses you made at that
> time was...
> >> If setting it to 3 is REALLY BAD, perhaps we should prevent it.
> >
> >it's just like setting ksoftirqd's affinity. I agree that it's nasty,
> >but there's no easy way right now.
> Has this behavior changed in the last three weeks?

nope, you are right, PK and RT should be more or less comparable.

the moment we get a proper trace of one such 3msec delay we ought to see
what's happening.

> For a CONFIG_DESKTOP data point, let's take a look at the latency traces
> I just made from -12PK.
>   CPU 0 - 10, 14, 16 - 38, 40 - 43,
>   CPU 1 - 00 - 09, 11 - 13, 15, 39, 44
> let's see - 45 total traces, 29 for CPU 0 and 16 for CPU 1. Not quite
> evenly balanced but not all on one CPU either (the data IS bursty though).
> The common_interrupt trace appears to show up only on CPU 0, but the
> latency traces are definitely on both CPU's.
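
The per-CPU tally quoted above can be checked mechanically. This is a
small sketch, assuming the two lists use inclusive "first - last"
ranges separated by commas; the helper name expand is made up for the
illustration.

```python
def expand(spec):
    """Expand a spec such as '10, 14, 16 - 38' into a list of trace numbers."""
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate the trailing comma in the CPU 0 line
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            numbers.extend(range(first, last + 1))  # ranges are inclusive
        else:
            numbers.append(int(part))
    return numbers

cpu0 = expand("10, 14, 16 - 38, 40 - 43,")
cpu1 = expand("00 - 09, 11 - 13, 15, 39, 44")

print(len(cpu0), len(cpu1), len(cpu0) + len(cpu1))
```

With those inputs the counts come out as 29 traces for CPU 0 and 16 for
CPU 1, 45 in total, with every trace number from 00 to 44 assigned to
exactly one CPU.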

havent gotten these yet, but from your description i'd guess that the
CPU#1 latencies would be softirq processing latencies (those occur on
both CPUs).

> > latencytest under your priority setup measures an _inverse_ 
> > scenario. (a CPU hog executing at a lower priority than all IRQ 
> > traffic) I'd not be surprised at all if it had higher latencies 
> > under -RT than under PREEMPT_DESKTOP.
>
> Why "higher latencies"? And do you mean
>  - more short latencies (I'm counting a lot of just over 100 usec delays)
> OR
>  - longer overall latencies (which I am not expecting but seeing)
> OR
> something else?

i meant "longer overall latencies" - but this was based on the mistaken
theory of IRQ threads wandering between CPUs, which they dont do.

> Let's look at the possible scenarios:
> [PK refers to "Preemptible Kernel - PREEMPT_DESKTOP" w/o IRQ threading]
> [RT refers to PREEMPT_RT with the IRQ # and ksoftirqd/# threads at RT 99]

>  [1] A single interrupt comes in and latencytest is NOT on the CPU
> that services the interrupt. In the case of PK, latencytest is
> unaffected. In the case of RT, latencytest is affected ONLY if the
> IRQ # thread or ksoftirqd/# thread is on the CPU with latencytest.
> In that case, latencytest is pushed to the other CPU. That switch
> takes some TBD amount of time and is counted by latencytest only if
> it exceeds 100 usec.

i'd say that such a bounce doesnt happen on RT either, because, as
you've found out, all IRQ threads are bound to CPU#0.

>  [2] A single interrupt comes in and latencytest is on the CPU that
> services the interrupt. In the case of PK, latencytest is preempted
> for the duration of the interrupt and resumes. In the case of RT,
> latencytest is rescheduled on the other CPU (or not) once we reach the
> place where we are ready to thread the IRQ. I would think RT should do
> better in this case but am not sure.

yes, in the RT case latencytest should be pushed to the other CPU most 
of the time. (unless a higher-prio [ksoftirqd] task is running on the 
other CPU)

>  [3] A series of interrupts comes in. In PK what I see is several
> sequential delays up to 1/2 msec or so (and have traces that show that
> behavior). In RT I would expect a shorter latency period (if both CPU's
> are busy with IRQ's or not) than PK [though I don't have traces for
> this since if I cross CPU's the trace doesn't get recorded].

wrt. the 'trace doesnt get recorded' issue, it ought to work fine if you
have wakeup_timing enabled. (even when using user-triggered tracing.) 
I.e. your user task should be traced across migrations too. (if not then
it's a tracer bug.)

> I don't see how RT should have worse numbers in these scenarios unless
> the overhead is more (or I'm counting more trivial latencies) than in
> PK. I would expect to see in the RT case a shorter maximum delay
> (which alas I do NOT see).

yep, i'd expect this too.

what i was thinking about wrt. migrations was this: the total throughput
of interrupts could be higher on -RT, because of the better distribution
of RT tasks between CPUs. Higher IRQ throughput means less CPU time left
for the CPU-loop. (and also, consequently, bigger latencies measured in
the CPU-loop.) But since all IRQ threads are in essence bound to CPU#0, 
this scenario cannot occur.

if the CPU overhead of -RT is dramatically higher (especially due to
debugging code that only triggers in the -RT kernels) then we could see
a similar effect: the same amount of SCHED_OTHER processing generates a
higher amount of prio-99 activities than it does under the -PK kernel,
and hence the CPU time left for the CPU loop is lower as well. (and
also, bigger latencies are generated in the CPU loop.)

> If I look at the max latency:
>   RT 3.90
>   PK 1.91  (both cases nominal is 1.16 msec)
>
> From the scenarios I described above, I don't see why this result
> should have occurred. Certainly nothing that should cause a delay of
> over two msec on a roughly one msec task.

well, if IRQ threads and ksoftirqd come in at the wrong moment, they're
prio 99 and could keep running for a long time. But no, i'd not expect
such a big difference either, it's the same workload after all.

> If I look at the % within 100 usec measure:
>   RT 87% within 100 usec, 97% within 200 usec (360 seconds elapsed)
>   PK 67% within 100 usec, 96% within 200 usec (57 seconds elapsed)

(this is the elapsed time of the prio ~30 CPU-loop, right?)

this smells too. There's one aspect of -RT that could starve lower-prio
RT tasks: if a high-prio RT task blocks on a mutex/semaphore then it
boosts whatever lowprio task is using that mutex currently. But this
means that it's boosted to prio 99 - preempting the prio 30 task. So
this means that depending on the level of contention, roughly the same
amount of time spent

in the PK case the prio 99 task would simply block, and the prio 30 task
could run, and you dont count this in your metrics, you only count the
'bad' effect: that the prio 30 task runs worse, you dont count the
'good' effect: that the prio 99 task runs better. This is why i think
it's unfair to only measure the 'middle priority layer', while not
counting improvements to the 'high priority layer'.
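
The boosting effect described above can be illustrated with a toy
model. This is only a sketch under simplifying assumptions: a single
CPU, larger numbers meaning higher priority, and a lock holder that
starts at priority 0 to stand in for a SCHED_OTHER task. None of the
names below come from the kernel.

```python
def effective_priority(base_prio, waiter_prios, priority_inheritance):
    """Priority a lock holder runs at, given the tasks blocked on its lock."""
    if priority_inheritance and waiter_prios:
        return max(base_prio, max(waiter_prios))
    return base_prio

def runs_next(tasks):
    """Pick the runnable task with the highest effective priority."""
    return max(tasks, key=lambda task: task[1])[0]

CPU_LOOP_PRIO = 30   # latencytest CPU loop
WAITER_PRIO = 99     # high-prio task blocked on the lock
HOLDER_BASE = 0      # assumption: stand-in value for a SCHED_OTHER holder

# With priority inheritance the holder is boosted above the CPU loop.
with_pi = runs_next([
    ("holder", effective_priority(HOLDER_BASE, [WAITER_PRIO], True)),
    ("cpu_loop", CPU_LOOP_PRIO),
])

# Without it the waiter simply blocks and the CPU loop keeps running.
without_pi = runs_next([
    ("holder", effective_priority(HOLDER_BASE, [WAITER_PRIO], False)),
    ("cpu_loop", CPU_LOOP_PRIO),
])

print(with_pi, without_pi)
```

In this model the prio 30 loop loses the CPU only in the boosted case,
which is the cost side of the trade-off; the benefit to the prio 99
waiter is not visible in a metric that only watches the prio 30 task.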

this theory is still a bit weak though to be the sole explanation: if
this were the case then we should see a decrease in total elapsed time
of the SCHED_OTHER workloads, right?

but 360 seconds vs. 57 seconds still sounds like a lot... Perhaps we
should add 'CPU usage per priority level' statistics fields to
/proc/stat, or something like that? Perhaps even a 'CPU time spent while
boosted' field, to find out how the effective priority levels shift due
to PI.

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 20:49 Mark_H_Johnson
  2004-12-09 21:56 ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 20:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>well, i think this measurement issue needs resolving before jumping to
>any generic conclusions. Not getting a single trace is extremely suspect. The
>userspace timestamps are rdtsc based, or gettimeofday() based?
rdtsc. It's actually code you sent me a while ago :-) when you
suspected a measurement problem before.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 20:49 Mark_H_Johnson
@ 2004-12-09 21:56 ` Ingo Molnar
  0 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 21:56 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> >well, i think this measurement issue needs resolving before jumping to
> >any generic conclusions. Not getting a single trace is extremely suspect. The
> >userspace timestamps are rdtsc based, or gettimeofday() based?

> rdtsc. It's actually code you sent me a while ago :-) when you
> suspected a measurement problem before.

could you try to put a few deliberate delays into the code - does the
kernel based tracing method pick the latency up correctly? (attaching to
the thread via gdb and then 'cont'-ing it ought to be enough i think.)
It's very weird.
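
The userspace side of such a deliberate-delay check could look roughly
like the sketch below. It is not the latencytest code: it uses Python's
monotonic clock rather than rdtsc, and the function names, the nominal
loop time and the threshold are all assumptions for illustration.

```python
import time

def find_delays(timestamps_ns, nominal_ns, threshold_ns):
    """Return (index, excess_ns) for each gap exceeding nominal by > threshold."""
    delays = []
    for i in range(1, len(timestamps_ns)):
        excess = (timestamps_ns[i] - timestamps_ns[i - 1]) - nominal_ns
        if excess > threshold_ns:
            delays.append((i, excess))
    return delays

def sample_loop(iterations, work, delay_at=None, delay_s=0.0):
    """Timestamp each iteration, optionally injecting one deliberate delay."""
    stamps = [time.monotonic_ns()]
    for i in range(iterations):
        work()
        if i == delay_at:
            time.sleep(delay_s)  # the deliberate delay under test
        stamps.append(time.monotonic_ns())
    return stamps

# Synthetic timestamps: a 1.16 msec nominal loop with one 3.90 msec pass.
synthetic = [0, 1_160_000, 2_320_000, 6_220_000, 7_380_000]
print(find_delays(synthetic, nominal_ns=1_160_000, threshold_ns=100_000))
```

If the userspace check flags the injected delay while the kernel tracer
records nothing for the same interval, that would point at the tracer
rather than at the timestamps.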

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 20:38 Mark_H_Johnson
  0 siblings, 0 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 20:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
>
>> I don't expect turning the debugging off will make that much of a
>> difference but I can try it tomorrow. [...]
>
>so basically this is your setup:
>- prio 99: all IRQ threads and ksoftirqd threads
Plus events/0 and /1 at RT FIFO 99.

> - prio 30: 'CPU loop' from latencytest, generating ~80% CPU load
That is the nominal case. It may be a little higher in some of the
runs (where the audio loop is consistently "fast") but never over
100% of a CPU unless you ask for a periodic sync [which I don't].

> - SCHED_OTHER: workload generators
Two primary tasks as SCHED_OTHER:
 - cpu_burn (nice w/ default, according to manpage it's 10)
 - whatever workload generator is active (not nice)
I tend to also run with one or more "data collectors" which are
shell scripts that I run like this...
  chrt -f 1 ./get_ltrace.sh 250
They do sleeps of various durations (seconds) before looking at
/proc for data.

>and the metric is "delays in the prio 30 CPU loop", correct?
The % within 100 usec is always in the prio 30 CPU loop. The max
latency I sometimes mention is for that CPU loop as well
(80% of nominal audio duration). For the max latency, I try to
mention if it's the delta or total time. (but sometimes forget)

The elapsed time is for the workload generator / RT application,
whichever gets done first. That is because the script starts both
(latencytest in background) and there is a killall after the
workload generator gets finished (which latencytest traps & dumps
its data to the output files). latencytest will automatically
stop after about 250000 samples - hence the upper limit of about
6 minutes for the test time.
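
The six-minute upper limit follows from the sample count and the audio
period. The sketch below assumes the 1.16 msec nominal CPU-loop time
mentioned elsewhere in this thread is the "80% of nominal audio
duration" figure, so the audio period is 1.16 / 0.80 msec; the variable
names are illustrative.

```python
CPU_LOOP_MS = 1.16        # nominal CPU loop time reported in the thread
CPU_LOOP_FRACTION = 0.80  # CPU loop is 80% of the nominal audio duration
MAX_SAMPLES = 250_000     # latencytest stops after about this many samples

period_ms = CPU_LOOP_MS / CPU_LOOP_FRACTION   # audio period per sample
run_s = MAX_SAMPLES * period_ms / 1000.0      # full-length run in seconds
samples_per_s = 1000.0 / period_ms

print(f"period {period_ms:.2f} msec, run {run_s:.1f} s "
      f"({run_s / 60:.2f} min), {samples_per_s:.0f} samples/s")
```

That gives a period of about 1.45 msec and a run of roughly 362 seconds,
which is consistent with both the 6 minute limit and the 360 second RT
run reported earlier.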

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 19:54 Mark_H_Johnson
  0 siblings, 0 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 19:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>just to make sure we are talking about the same thing. Do you mean
>PREEMPT_DESKTOP with IRQ threading disabled?
Yes.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 19:23 Mark_H_Johnson
  2004-12-09 20:04 ` Ingo Molnar
  2004-12-10  5:01 ` Bill Huey
  0 siblings, 2 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 19:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

I may take this "off line" if it goes on too much longer. A little
"view of the customer" is good for the whole group, but if it
gets too much into my specific application, I don't see the benefit.

>* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
>
>> CPU load is pretty steady at up to 20% for any of the two CPU nodes in
>> the cluster. The upper bound for OS overhead (latency) I need is about
>> 1 msec (out of a 12.5 msec / 80 Hz frame). I do have some long CPU
>> runs / PCI shared memory traffic in the 80 Hz task at a one per second
>> rate that might take up to 10 msec of the 12.5 msec frame.
>
>so the 1 msec latency is needed by this 80 Hz task? I'd thus make this
>task prio 90 (higher than most IRQ handlers), and make the 80 Hz
>timesource's [timer IRQ? RTC? special driver?] IRQ thread prio 91. All
>other IRQ threads should be below prio 90. Whatever else this task
>triggers will be handled either by PI handling, or is started enough in
>advance (such as disk IO or network IO) to be completed by the time the
>80 Hz task needs it.
If I could do it over again, I may agree with you. However, there are a
few constraints you are not aware of:
 - the run time library I use (GNAT Ada) has only 31 priorities plus
a few it reserves for itself. These are mapped to 1-32 on Linux.
 - the framework we wrote (many years ago for another OS) uses almost all
of these priorities. For example, the 80 Hz task I referred to runs at
priority 24. The 1 Hz task runs at 4. We basically use every other
priority.
 - a task can request to run before / after a specific rate so the
odd priorities can be used as well.
 - we also have a "synchronizer" that runs at 29 and a couple other
special tasks that can run at 28.
So without rewriting the Ada run time, I don't have any free priority
levels to work with. Also note that I do get acceptable performance with
2.4 preempt + lowlat which does not have threaded IRQ's. I ought to get
acceptable performance with a 2.6 system (or else, why step up?).

>> I could set the IRQ priority of the shared memory interface to be the
>> highest (since I do task scheduling based on it) but after that there
>> is also no preset assignment of priority to I/O activity.
>
>but if this is the task that needs to do its work within 1 msec when
>signalled, it should be the highest prio one nevertheless, and no IRQ
>(except the signal IRQ) must be allowed to preempt it.
[I think we violently agree on this one]

>(The other tasks can 'feed' this master task with whatever scheduling
>pattern, as long as the 'master task' provides frames with a precise 80
>Hz frequency. Any jitter to the execution of these other threads is
>handled by buffering enough stuff in advance.)
We do not necessarily send signals at 80 Hz. Our framework has non
harmonic rates like...
  100, 80, 60, 50, 40, 30, 25, 20, 10, 5, 2, and 1
so the minimum frequency that all of those divide into evenly is 1200 Hz.
Our 2.4 kernel has HZ=2400. If the "master task" (or in our system the
synchronizer) gets behind, the software is built to take care of that
(basically a best effort) to try to prevent missed frames.
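
The 1200 Hz figure is the least common multiple of the listed rates,
which is easy to verify. The sketch below also derives how many ticks
each frame would span at HZ=2400; the names are illustrative and the
tick table is just arithmetic, not a description of the framework.

```python
from functools import reduce
from math import gcd

RATES_HZ = [100, 80, 60, 50, 40, 30, 25, 20, 10, 5, 2, 1]

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

# Smallest base frequency that every rate divides into evenly.
base_hz = reduce(lcm, RATES_HZ)

# Timer ticks per frame for each rate when the kernel runs at HZ=2400.
HZ = 2400
ticks_per_frame = {rate: HZ // rate for rate in RATES_HZ}

print(base_hz)
print(ticks_per_frame)
```

Since 2400 is a multiple of 1200, every rate in the list maps to a
whole number of ticks; the 80 Hz frame, for example, spans 30 ticks.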

>> Some form of priority inheritance may be "better" but I understand
>> that is not likely to be implemented (nor worth the effort).
>
>the master task's priority will be inherited across most of the
>dependencies that might happen at the kernel level. [ If it doesnt then
>it should show up in traces and i'm most interested in fixing it ... ]
I was referring to the priorities of the IRQ's being inherited from
the priority of the RT task making the I/O request. Then I could make
the priorities of all the IRQ's less than my highest RT task & they
would get boosted as needed. [but then I might need more buffering
for I/O since the RT tasks are starving them...]

>> By setting the IRQ threads to RT FIFO 99, I also get something closer
>> to PREEMPT_DESKTOP w/o IRQ threading (or for that matter, closer to
>> the 2.4 kernel I use today). It shows more clearly the overhead of
>> adding the threads.
>
>i believe this is the wrong model for this workload.
I stand by the statement I made. It is closer to the model of
PREEMPT_DESKTOP and shows the thread overhead more clearly. The user
can certainly optimize for a specific workload but that masks the
overhead added by threading.

>> [...] As Ingo noted in a private message
>>   "IRQ-threading will always be more expensive than direct IRQs,
>>    but it should be a fixed overhead not some drastic degradation."
>>
>> I agree the overhead should be modest but somehow the test cases I run
>> don't show that (yet). There is certainly more work to be done to fix
>> that.
>
>have you tried it with all debugging turned off? I'd like to fix any
>performance problems related to IRQ/softirq threading. (If you mean the
>'lost pings' problem, that one looks to be more of a priority
>inversion problem than a real performance issue.)

I don't expect turning the debugging off will make that much of a
difference but I can try it tomorrow. The charts look MUCH worse
in _RT than _PK right now and both have the same level of debugging
enabled (and _PK is close to the 2.4 performance). I'll tar up the
html directories and send those separately so you can see the difference
between -5PK and -5RT at the application level. I'll send the 2.4
charts for a baseline comparison as well.

The lost pings go away by boosting the priority of ksoftirqd/0 and /1.
But even with all the IRQ's at 99 and those two tasks at 99, the ping
response time under _RT is about 2x to 3x the response time of the
non threaded IRQs of _PK.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 19:23 Mark_H_Johnson
@ 2004-12-09 20:04 ` Ingo Molnar
  2004-12-10  5:01 ` Bill Huey
  1 sibling, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 20:04 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> I don't expect turning the debugging off will make that much of a
> difference but I can try it tomorrow. [...]

so basically this is your setup:

 - prio 99: all IRQ threads and ksoftirqd threads

 - prio 30: 'CPU loop' from latencytest, generating ~80% CPU load

 - SCHED_OTHER: workload generators

and the metric is "delays in the prio 30 CPU loop", correct?

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 19:23 Mark_H_Johnson
  2004-12-09 20:04 ` Ingo Molnar
@ 2004-12-10  5:01 ` Bill Huey
  2004-12-10  5:14   ` Steven Rostedt
  1 sibling, 1 reply; 72+ messages in thread
From: Bill Huey @ 2004-12-10  5:01 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

On Thu, Dec 09, 2004 at 01:23:38PM -0600, Mark_H_Johnson@raytheon.com wrote:
> I may take this "off line" if it goes on too much longer. A little
> "view of the customer" is good for the whole group, but if it
> gets too much into my specific application, I don't see the benefit.

Taking this offline would cut the rest of the developers off from having
any empirical data to work with. It's a bad idea. The entire point
of the RT kernel and app is to characterize the behavior of the system
so that fringe events surface and can be tracked down and
eventually solved. Continue on IMO. :)

bill



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10  5:01 ` Bill Huey
@ 2004-12-10  5:14   ` Steven Rostedt
  2004-12-10  5:58     ` Bill Huey
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-10  5:14 UTC (permalink / raw)
  To: Bill Huey
  Cc: Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, LKML, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

On Thu, 2004-12-09 at 21:01 -0800, Bill Huey wrote:
> On Thu, Dec 09, 2004 at 01:23:38PM -0600, Mark_H_Johnson@raytheon.com wrote:
> > I may take this "off line" if it goes on too much longer. A little
> > "view of the customer" is good for the whole group, but if it
> > gets too much into my specific application, I don't see the benefit.
> 
> Taking this offline would cut the rest of the developers off from having
> any empirical data to work with. It's a bad idea. The entire point
> of the RT kernel and app is to characterize the behavior of the system
> so that fringe events surface and can be tracked down and
> eventually solved. Continue on IMO. :)

I second the motion. It's a fun read ;-)

(just my 0.02 cents)

-- Steve


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10  5:14   ` Steven Rostedt
@ 2004-12-10  5:58     ` Bill Huey
  0 siblings, 0 replies; 72+ messages in thread
From: Bill Huey @ 2004-12-10  5:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Bill Huey, Mark Johnson, Ingo Molnar, Amit Shah, Karsten Wiese,
	Adam Heath, emann, Gunther Persoons, K.R. Foley, LKML,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

On Fri, Dec 10, 2004 at 12:14:16AM -0500, Steven Rostedt wrote:
> On Thu, 2004-12-09 at 21:01 -0800, Bill Huey wrote:
> > On Thu, Dec 09, 2004 at 01:23:38PM -0600, Mark_H_Johnson@raytheon.com wrote:
> > > I may take this "off line" if it goes on too much longer. A little
> > > "view of the customer" is good for the whole group, but if it
> > > gets too much into my specific application, I don't see the benefit.
> > 
> > Taking this offline would cut the rest of the developers off from having
> > any empirical data to work with. It's a bad idea. The entire point
> > of the RT kernel and app is to characterize the behavior of the system
> > so that fringe events surface and can be tracked down and
> > eventually solved. Continue on IMO. :)
> 
> I second the motion. It's a fun read ;-)

Like your SLAB adventures. I thought it was a bit bizarre that it was
made fully preemptable, and any time you get another developer able
to hammer on this, like you, it is always an encouraging sign for the rest
of us on this project. :)

Unfortunately, jackd is only one program and what's needed is a broader
set of apps that can push the system much harder, along with jackd, to
see where things blow up. SMP is a likely trigger for all of this stuff.
In particular, shared-exclusive lock semantics under high contention
situations, vma access, etc... We'll see.

bill



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 18:15 Mark_H_Johnson
  2004-12-09 20:11 ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 18:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
>
>> >also, i'd like to take a look at latency traces, if you have them for
>> >this run.
>>
>> I could if I had any. The _RT run had NO latency traces > 250 usec
>> (the limit I had set for the test). The equivalent _PK run had 37 of
>> those traces. I can rerun the test with a smaller limit to get some if
>> it is really important. My build of -12 is almost done and we can see
>> what kind of repeatability / results from the all_cpus trace shows.
>
>/me is puzzled.
>
>so all the CPU-loop delays within the -RT kernel are below 250 usecs? I
>guess i dont understand what this means then:

There were no cases where /proc/sys/kernel/preempt_max_latency went
over 250 usec in the RT stress test that I did (for the same test, _PK
had over 30 such traces).

>| The max CPU latencies in RT are worse than PK as well. The values for
>| RT range from 3.00 msec to 5.43 msec and on PK range from 1.45 msec to
>| 2.24 msec.
>
>these come from userspace timestamping? So where userspace detects a
>delay the kernel tracer doesnt measure any?
Yes. That is correct. Very puzzling to me too.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 18:15 Mark_H_Johnson
@ 2004-12-09 20:11 ` Ingo Molnar
  0 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 20:11 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> >| The max CPU latencies in RT are worse than PK as well. The values for
> >| RT range from 3.00 msec to 5.43 msec and on PK range from 1.45 msec to
> >| 2.24 msec.
> >
> >
> >these come from userspace timestamping? So where userspace detects a
> >delay the kernel tracer doesnt measure any?
>
> Yes. That is correct. Very puzzling to me too.

well, i think this measurement issue needs resolving before jumping to
any generic conclusions. Not getting a single trace is extremely suspect. The
userspace timestamps are rdtsc based, or gettimeofday() based? In
theory, as long as no trace is triggered, there should not be any huge
overhead from tracing itself (when a trace is reported and saved then,
if the trace is large, it can be quite expensive in a way the tracer wont
report as a latency - but this isnt the case here).

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 17:22 Mark_H_Johnson
  2004-12-09 17:31 ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 17:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>also, i'd like to take a look at latency traces, if you have them for
>this run.

I could if I had any. The _RT run had NO latency traces > 250 usec (the
limit I had set for the test). The equivalent _PK run had 37 of those
traces. I can rerun the test with a smaller limit to get some if it is
really important. My build of -12 is almost done and we can see what kind
of repeatability / results from the all_cpus trace shows.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 17:22 Mark_H_Johnson
@ 2004-12-09 17:31 ` Ingo Molnar
  2004-12-09 20:34   ` K.R. Foley
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 17:31 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> >also, i'd like to take a look at latency traces, if you have them for
> >this run.
> 
> I could if I had any. The _RT run had NO latency traces > 250 usec
> (the limit I had set for the test). The equivalent _PK run had 37 of
> those traces. I can rerun the test with a smaller limit to get some if
> it is really important. My build of -12 is almost done and we can see
> what kind of repeatability / results from the all_cpus trace shows.

/me is puzzled.

so all the CPU-loop delays within the -RT kernel are below 250 usecs? I
guess i dont understand what this means then:

| The max CPU latencies in RT are worse than PK as well. The values for
| RT range from 3.00 msec to 5.43 msec and on PK range from 1.45 msec to
| 2.24 msec.

these come from userspace timestamping? So where userspace detects a
delay the kernel tracer doesnt measure any?

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 17:31 ` Ingo Molnar
@ 2004-12-09 20:34   ` K.R. Foley
  2004-12-09 22:06     ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: K.R. Foley @ 2004-12-09 20:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

Ingo Molnar wrote:
> * Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
> 
> 
>>>also, i'd like to take a look at latency traces, if you have them for
>>>this run.
>>
>>I could if I had any. The _RT run had NO latency traces > 250 usec
>>(the limit I had set for the test). The equivalent _PK run had 37 of
>>those traces. I can rerun the test with a smaller limit to get some if
>>it is really important. My build of -12 is almost done and we can see
>>what kind of repeatability / results from the all_cpus trace shows.
> 
> 
> /me is puzzled.
> 
> so all the CPU-loop delays within the -RT kernel are below 250 usecs? I
> guess i dont understand what this means then:
> 
> | The max CPU latencies in RT are worse than PK as well. The values for
> | RT range from 3.00 msec to 5.43 msec and on PK range from 1.45 msec to
> | 2.24 msec.
> 
> these come from userspace timestamping? So where userspace detects a
> delay the kernel tracer doesnt measure any?
> 
> 	Ingo
> 

Ingo,

I see something similar here also:

running realfeel with rtc histogram generates > 100 usec entries in the 
histogram but none of these are ever caught by the wakeup tracing.

IRQ 8 = 99
realfeel = 98
IRQ 0 = 97

-realfeel sets rtc up to 1024 Hz and does blocking read on rtc
-IRQ 8 hits and rtc_interrupt runs code from rtc_wake_event which sets 
last_interrupt_time then calls wake_up_interruptible which as you know 
eventually calls try_to_wake_up because it's the default_wake_function
-the blocked read then restarts after the schedule() call in rtc_read, 
right?
-then realfeel in rtc_read runs code in rtc_read_event which sets now, 
then generates histogram entry from the diff between now and 
last_interrupt_time

No wakeup latency generated from this.
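
The histogram step described above can be mimicked in userspace. This
is a sketch only, not the rtc histogram code: it assumes each sample is
a pair of microsecond timestamps (interrupt time, time the blocked read
resumed), and the bucket width and helper names are made up.

```python
from collections import Counter

def latency_histogram(pairs_usec, bucket_usec=10):
    """Bucket (interrupt_time, read_time) pairs by their difference."""
    hist = Counter()
    for interrupt_time, read_time in pairs_usec:
        latency = read_time - interrupt_time
        hist[(latency // bucket_usec) * bucket_usec] += 1
    return hist

def count_from(hist, limit_usec):
    """Count samples in buckets that start at or above limit_usec."""
    return sum(n for bucket, n in hist.items() if bucket >= limit_usec)

# Synthetic samples at roughly 1024 Hz (about 977 usec apart).
samples = [(0, 12), (977, 995), (1953, 2080), (2930, 2941)]
hist = latency_histogram(samples)

print(sorted(hist.items()))
print("entries at or above 100 usec:", count_from(hist, 100))
```

Any entry at or above 100 usec in such a histogram is one that the
wakeup tracer would be expected to catch as well, which is what makes
the missing traces notable.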

I think I know why we don't get traces from this. TIF_NEED_RESCHED is 
not set for IRQ 8 at the time that it wakes up realfeel so _need_resched 
fails and trace_start_sched_wakeup doesn't actually call 
__trace_start_sched_wakeup(p)???

kr



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 20:34   ` K.R. Foley
@ 2004-12-09 22:06     ` Ingo Molnar
  2004-12-09 23:16       ` K.R. Foley
  2004-12-10  4:26       ` K.R. Foley
  0 siblings, 2 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 22:06 UTC (permalink / raw)
  To: K.R. Foley
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* K.R. Foley <kr@cybsft.com> wrote:

> running realfeel with rtc histogram generates > 100 usec entries in
> the histogram but none of these are ever caught by the wakeup tracing.

can you reproduce this with rtc_wakeup:

  http://www.affenbande.org/~tapas/wiki/index.php?rtc_wakeup

?

> I think I know why we don't get traces from this. TIF_NEED_RESCHED is
> not set for IRQ 8 at the time that it wakes up realfeel so
> _need_resched fails and trace_start_sched_wakeup doesn't actually call
> __trace_start_sched_wakeup(p)???

here's the code:

+static inline void trace_start_sched_wakeup(task_t *p, runqueue_t *rq)
+{
+       if (TASK_PREEMPTS_CURR(p, rq) && (p != rq->curr) && _need_resched())
+               __trace_start_sched_wakeup(p);
+}

indeed this only triggers if the woken up task has a higher priority
than the waker... hm. Could you try to reverse the priorities of 
realfeel and IRQ8, does that produce traces?
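
A userspace model of the condition quoted above may make the effect of
the priorities easier to see. It is a sketch, not kernel code:
`struct task`, `struct runqueue` and `wakeup_trace_would_start` are
hypothetical stand-ins, need_resched is passed in as a plain flag, and
the model assumes the kernel-internal convention that a lower prio value
means a higher priority.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's task_t and runqueue_t. */
struct task { int prio; };              /* lower value = higher priority */
struct runqueue { struct task *curr; }; /* curr = task running on that CPU */

/* Same three-part test as the quoted patch, with need_resched passed in. */
bool wakeup_trace_would_start(const struct task *p,
                              const struct runqueue *rq,
                              bool need_resched)
{
    /* models TASK_PREEMPTS_CURR(p, rq) */
    bool preempts_curr = p->prio < rq->curr->prio;

    return preempts_curr && p != rq->curr && need_resched;
}
```

With the IRQ thread as the running task and at a higher priority than
the task it wakes, the first term is already false, so the tracer is
never armed regardless of the flag.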

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 22:06     ` Ingo Molnar
@ 2004-12-09 23:16       ` K.R. Foley
  2004-12-10  4:26       ` K.R. Foley
  1 sibling, 0 replies; 72+ messages in thread
From: K.R. Foley @ 2004-12-09 23:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

[-- Attachment #1: Type: text/plain, Size: 1621 bytes --]

Ingo Molnar wrote:
> * K.R. Foley <kr@cybsft.com> wrote:
> 
> 
>>running realfeel with rtc histogram generates > 100 usec entries in
>>the histogram but none of these are ever caught by the wakeup tracing.
> 
> 
> can you reproduce this with rtc_wakeup:
> 
>   http://www.affenbande.org/~tapas/wiki/index.php?rtc_wakeup
> 
> ?

Yes. See attached files. When I ran rtc_wakeup the priorities were
IRQ 8= 97
IRQ 0= 96
Dropping IRQ 8 (down to 86) below rtc_wakeup kept rtc_wakeup from 
completing any runs.

> 
> 
>>I think I know why we don't get traces from this. TIF_NEED_RESCHED is
>>not set for IRQ 8 at the time that it wakes up realfeel so
>>_need_resched fails and trace_start_sched_wakeup doesn't actually call
>>__trace_start_sched_wakeup(p)???
> 
> 
> here's the code:
> 
> +static inline void trace_start_sched_wakeup(task_t *p, runqueue_t *rq)
> +{
> +       if (TASK_PREEMPTS_CURR(p, rq) && (p != rq->curr) && _need_resched())
> +               __trace_start_sched_wakeup(p);
> +}

I know. I MUST KEEP MY MOUTH SHUT. I MUST KEEP MY MOUTH SHUT. I just
didn't see how it was possible that either of the other two conditions
could ever be false in this case, and I missed the call to resched_task.
> 
> indeed this only triggers if the woken up task has a higher priority
> than the waker... hm. Could you try to reverse the priorities of 
> realfeel and IRQ8, does that produce traces?

I did this and latencies in the histogram dropped drastically. The 
highest latency in the histogram is 33 usecs and thus never gets high 
enough to trigger the tracing???

IRQ 8 = 97
IRQ 0 = 96
realfeel = 98

> 
> 	Ingo
> 

[-- Attachment #2: latency_trace.out1 --]
[-- Type: text/plain, Size: 5638 bytes --]

preemption latency trace v1.1.4 on 2.6.10-rc2-mm3-V0.7.32-12
--------------------------------------------------------------------
 latency: 46 us, #83/83 | (M:rt VP:0, KP:1, SP:1 HP:1 #P:2)
    -----------------
    | task: su-5646 (uid:500 nice:0 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> hardirq         
               || / _---=> softirq         
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
    bash-5650  0-h.2    0µs : __trace_start_sched_wakeup (try_to_wake_up)
    bash-5650  0-h.2    0µs : _raw_spin_unlock (try_to_wake_up)
    bash-5650  0-h.1    0µs : preempt_schedule (try_to_wake_up)
    bash-5650  0        1µs : __wake_up_common <su-5646> (74 75): 
    bash-5650  0-h.1    1µs : try_to_wake_up (__wake_up_common)
    bash-5650  0-h.1    1µs : _raw_spin_unlock (try_to_wake_up)
    bash-5650  0-h..    2µs : preempt_schedule (try_to_wake_up)
    bash-5650  0.h..    2µs : _spin_unlock_irqrestore (__wake_up_sync)
    bash-5650  0.h..    2µs : up_mutex (__wake_up_sync)
    bash-5650  0.h..    2µs : __up_mutex (up_mutex)
    bash-5650  0-h..    3µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.1    3µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.2    3µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.3    3µs : mutex_getprio (__up_mutex)
    bash-5650  0        4µs : __up_mutex <bash-5650> (75 75): 
    bash-5650  0-h.3    4µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h.2    4µs : preempt_schedule (__up_mutex)
    bash-5650  0-h.2    4µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h.1    5µs : preempt_schedule (__up_mutex)
    bash-5650  0-h.1    5µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h..    5µs : preempt_schedule (__up_mutex)
    bash-5650  0.h..    5µs : next_thread (__wake_up_parent)
    bash-5650  0.h..    6µs : _spin_is_locked (next_thread)
    bash-5650  0.h..    6µs : rt_mutex_is_locked (next_thread)
    bash-5650  0.h..    6µs : _spin_unlock_irqrestore (do_notify_parent)
    bash-5650  0.h..    6µs : up_mutex (do_notify_parent)
    bash-5650  0.h..    7µs : __up_mutex (up_mutex)
    bash-5650  0-h..    7µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.1    7µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.2    7µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.3    8µs : mutex_getprio (__up_mutex)
    bash-5650  0        8µs : __up_mutex <bash-5650> (75 75): 
    bash-5650  0-h.3    8µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h.2    8µs : preempt_schedule (__up_mutex)
    bash-5650  0-h.2    9µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h.1    9µs : preempt_schedule (__up_mutex)
    bash-5650  0-h.1    9µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h..   10µs : preempt_schedule (__up_mutex)
    bash-5650  0.h..   10µs : _write_unlock_irq (exit_notify)
    bash-5650  0.h..   10µs : up_write_mutex (exit_notify)
    bash-5650  0.h..   11µs : __up_mutex (up_write_mutex)
    bash-5650  0-h..   11µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.1   11µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.2   12µs : _raw_spin_lock (__up_mutex)
    bash-5650  0-h.3   12µs : mutex_getprio (__up_mutex)
    bash-5650  0       12µs : __up_mutex <bash-5650> (75 75): 
    bash-5650  0-h.3   12µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h.2   13µs : preempt_schedule (__up_mutex)
    bash-5650  0-h.2   13µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h.1   13µs : preempt_schedule (__up_mutex)
    bash-5650  0-h.1   14µs : _raw_spin_unlock (__up_mutex)
    bash-5650  0-h..   14µs : preempt_schedule (__up_mutex)
    bash-5650  0.h..   14µs : check_no_held_locks (do_exit)
    bash-5650  0-h..   15µs+: _raw_spin_lock (check_no_held_locks)
    bash-5650  0-h.1   22µs : _raw_spin_lock (check_no_held_locks)
    bash-5650  0-h.2   23µs : _raw_spin_unlock (check_no_held_locks)
    bash-5650  0-h.1   23µs : preempt_schedule (check_no_held_locks)
    bash-5650  0-h.1   24µs : _raw_spin_unlock (check_no_held_locks)
    bash-5650  0-h..   24µs : preempt_schedule (check_no_held_locks)
    bash-5650  0.h..   24µs : preempt_schedule (do_exit)
    bash-5650  0-h..   25µs : __schedule (preempt_schedule)
    bash-5650  0-h.1   25µs : sched_clock (__schedule)
    bash-5650  0-h.1   26µs : _raw_spin_lock_irq (__schedule)
    bash-5650  0-h.1   26µs : _raw_spin_lock_irqsave (__schedule)
    bash-5650  0-h.2   27µs : dequeue_task (__schedule)
    bash-5650  0-h.2   27µs : recalc_task_prio (__schedule)
    bash-5650  0-h.2   28µs : effective_prio (recalc_task_prio)
    bash-5650  0-h.2   28µs : enqueue_task (__schedule)
    bash-5650  0-..2   29µs+: trace_array (__schedule)
    bash-5650  0       33µs : __schedule <su-5646> (74 78): 
    bash-5650  0       33µs : __schedule <bash-5650> (75 78): 
    bash-5650  0-..2   34µs+: trace_array (__schedule)
      su-5646  0-..2   41µs : __switch_to (__schedule)
      su-5646  0       42µs : schedule <bash-5650> (75 74): 
      su-5646  0-..2   42µs : finish_task_switch (__schedule)
      su-5646  0-..2   43µs : _raw_spin_unlock (finish_task_switch)
      su-5646  0-..1   43µs : trace_stop_sched_switched (finish_task_switch)
      su-5646  0       44µs : finish_task_switch <su-5646> (74 0): 
      su-5646  0-..1   44µs : _raw_spin_lock_irqsave (trace_stop_sched_switched)
      su-5646  0-..2   45µs : trace_stop_sched_switched (finish_task_switch)

vim:ft=help

[-- Attachment #3: log.out1 --]
[-- Type: text/plain, Size: 2846 bytes --]

`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=0
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt
bug in rtc_read(): called in state S_IDLE!
`IRQ 8'[677] is being piggy. need_resched=0, cpu=1
Read missed before next interrupt

rtc latency histogram of {rtc_wakeup/5775, 320346 samples}:
10 179572
11 98009
12 11054
13 19555
14 5171
15 928
16 1257
17 610
18 638
19 317
20 255
21 585
22 1108
23 375
24 220
25 138
26 113
27 201
28 92
29 19
30 21
31 4
32 3
33 3
34 1
36 2
40 2
41 3
42 2
44 1
45 1
46 1
48 14
49 16
50 9
51 5
52 4
53 1
54 2
56 1
58 1
60 2
62 1
65 1
67 1
147 1
157 1
158 5
159 3
160 6
161 1
162 2
165 1
167 2
168 1
169 1
170 1
172 1
174 1

[-- Attachment #4: rtc.out1 --]
[-- Type: text/plain, Size: 940 bytes --]

rtc_wakeup - press ctrl-c to stop - use -h to get help
freq:             8192
max # of irqs:    0 (run until stopped)
jitter threshold: 100000% (122070 usec)
output filename:  /dev/null
rt priority:      90(91)
aquiring rt privs
getting cpu speed
929730325.422 Hz (929.730 MHz)
# of cycles for "perfect" period: 113492 (122 usec)
setting up ringbuffer
setting up consumer thread
setting up /dev/rtc
locking memory
turning irq on
beginning measurement
missed 1 irq(s) - not timing last period
new max. jitter: 1.3% (1 usec)
new max. jitter: 2.7% (3 usec)
new max. jitter: 6.4% (7 usec)
new max. jitter: 8.0% (9 usec)
new max. jitter: 9.8% (11 usec)
new max. jitter: 18.8% (22 usec)
new max. jitter: 46.2% (56 usec)
new max. jitter: 68.4% (83 usec)
new max. jitter: 91.0% (111 usec)
new max. jitter: 102.2% (124 usec)
done.
total # of irqs:      320362
missed irqs:          1
threshold violations: 0
max jitter:           102.2% (124 usec)
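
The period and threshold figures in the output above can be reproduced
from the reported CPU speed and RTC frequency. The helpers below are a
hypothetical sketch, not rtc_wakeup's source, and they assume the tool
truncates rather than rounds, which the printed numbers are consistent
with but do not prove.

```c
/* CPU cycles in one RTC period, truncated to a whole cycle count. */
long cycles_per_period(double cpu_hz, unsigned rtc_hz)
{
    return (long)(cpu_hz / rtc_hz);
}

/* Length of one RTC period in microseconds. */
double period_usec(unsigned rtc_hz)
{
    return 1e6 / rtc_hz;
}

/* A jitter percentage of the period converted to whole microseconds. */
long jitter_usec(double percent, unsigned rtc_hz)
{
    return (long)(percent / 100.0 * period_usec(rtc_hz));
}
```

For the run above, 929730325.422 Hz divided by 8192 gives 113492 whole
cycles, one period is about 122.07 usec, and the 100000% threshold
corresponds to 122070 usec.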


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 22:06     ` Ingo Molnar
  2004-12-09 23:16       ` K.R. Foley
@ 2004-12-10  4:26       ` K.R. Foley
  2004-12-10 11:22         ` Ingo Molnar
  1 sibling, 1 reply; 72+ messages in thread
From: K.R. Foley @ 2004-12-10  4:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

Ingo Molnar wrote:
> * K.R. Foley <kr@cybsft.com> wrote:
> 
> 
>>running realfeel with rtc histogram generates > 100 usec entries in
>>the histogram but none of these are ever caught by the wakeup tracing.
> 
> 
> can you reproduce this with rtc_wakeup:
> 
>   http://www.affenbande.org/~tapas/wiki/index.php?rtc_wakeup
> 
> ?
> 
> 
>>I think I know why we don't get traces from this. TIF_NEED_RESCHED is
>>not set for IRQ 8 at the time that it wakes up realfeel so
>>_need_resched fails and trace_start_sched_wakeup doesn't actually call
>>__trace_start_sched_wakeup(p)???
> 
> 
> here's the code:
> 
> +static inline void trace_start_sched_wakeup(task_t *p, runqueue_t *rq)
> +{
> +       if (TASK_PREEMPTS_CURR(p, rq) && (p != rq->curr) && _need_resched())
> +               __trace_start_sched_wakeup(p);
> +}
> 
> indeed this only triggers if the woken up task has a higher priority
> than the waker... hm. Could you try to reverse the priorities of 
> realfeel and IRQ8, does that produce traces?

I guess I really am slow. You laid it all out for me above and I still
didn't get it until I looked at it again. I still haven't been able to
capture an actual trace from any of these programs, but thanks to your
addition of logging all of the max latencies in 32-14 I can see that the
traces were there until another trace pushed them out.
> 
> 	Ingo
> 



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10  4:26       ` K.R. Foley
@ 2004-12-10 11:22         ` Ingo Molnar
  2004-12-10 15:26           ` K.R. Foley
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-10 11:22 UTC (permalink / raw)
  To: K.R. Foley
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

* K.R. Foley <kr@cybsft.com> wrote:

> >here's the code:
> >
> >+static inline void trace_start_sched_wakeup(task_t *p, runqueue_t *rq)
> >+{
> >+       if (TASK_PREEMPTS_CURR(p, rq) && (p != rq->curr) && 
> >_need_resched())
> >+               __trace_start_sched_wakeup(p);
> >+}
> >
> >indeed this only triggers if the woken up task has a higher priority
> >than the waker... hm. Could you try to reverse the priorities of 
> >realfeel and IRQ8, does that produce traces?
> 
> I guess I really am slow. You laid it all out for me above and I still
> didn't get it until I looked at it again. I still haven't been able to
> capture an actual trace from any of these programs, but thanks to your
> addition of logging all of the max latencies in 32-14 I can see that
> the traces were there until another trace pushed them out.

wakeup tracing can only work reliably if it's a higher-prio task that is
being woken up (to which the currently executing task is obliged to
preempt). Otherwise the currently executing task (the one which wakes up
the other task) could continue to execute indefinitely, making the
wakeup latency trace much less useful. Hence the priority check and the
need_resched() check: 'is the wakee higher-prio', and 'does the current
task really have to preempt right now'.

(hm, i think the _need_resched() check is in fact buggy, especially on
SMP systems: if there's a cross-CPU wakeup then _need_resched() wont be
set for the current task! Is this perhaps what you wanted to point out? 
I've uploaded the -32-17 kernel which has the _need_resched() check
removed.)
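
The cross-CPU case can be illustrated with a small userspace model. It
is a sketch under assumptions: all names are hypothetical, a lower prio
value means a higher priority, and the second check only reflects what
this message says about -32-17 (the flag test dropped), not the actual
patch.

```c
#include <stdbool.h>

struct task { int prio; bool need_resched; };
struct runqueue { struct task *curr; };   /* curr = task running on that CPU */

/* Model of a wakeup: the flag lands on the task running on the wakee's CPU. */
void mark_resched_if_preempting(const struct task *wakee,
                                struct runqueue *target_rq)
{
    if (wakee->prio < target_rq->curr->prio)
        target_rq->curr->need_resched = true;
}

/* Check as quoted in the thread: also consults the waker's own flag. */
bool trace_starts_with_flag_check(const struct task *wakee,
                                  const struct runqueue *target_rq,
                                  const struct task *waker)
{
    return wakee->prio < target_rq->curr->prio &&
           wakee != target_rq->curr &&
           waker->need_resched;
}

/* The same check with the flag test dropped. */
bool trace_starts_without_flag_check(const struct task *wakee,
                                     const struct runqueue *target_rq)
{
    return wakee->prio < target_rq->curr->prio &&
           wakee != target_rq->curr;
}
```

When the waker runs on one CPU and the wakee is queued on another, the
flag is set on the remote CPU's current task, so a test of the waker's
own flag fails even though a preemption is pending.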

unfortunately this issue seems to hit realfeel/rtc_wakeup too: there the
common wakeup is done from the IRQ thread, which is higher-prio than
realfeel/rtc_wakeup! So wakeup tracing/timing doesnt get activated at
all for those types of wakeups.

a solution/workaround to this would be to 'reverse' the priorities of
the tasks: i.e. to make the IRQ thread prio 80, and to make realfeel
prio 90, and to look at the results. In theory realfeel shouldnt be
running when the next IRQ arrives, so this should produce meaningful
traces.
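
For reference, a priority change like this can be made from a small
helper built on the standard sched_setscheduler() call. This is a
minimal sketch: `set_fifo_priority` is a hypothetical name, and it
assumes the IRQ threads can be re-prioritised by pid like any other
task, which the priorities reported in this thread suggest but which is
not spelled out here.

```c
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/*
 * Put 'pid' (0 = calling process) under SCHED_FIFO at 'prio'.
 * Returns 0 on success or an errno value such as EINVAL or EPERM.
 */
int set_fifo_priority(pid_t pid, int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    /* Reject values outside the range the system reports for SCHED_FIFO. */
    if (prio < sched_get_priority_min(SCHED_FIFO) ||
        prio > sched_get_priority_max(SCHED_FIFO))
        return EINVAL;

    if (sched_setscheduler(pid, SCHED_FIFO, &sp) == -1)
        return errno;
    return 0;
}
```

The reversed setup would then be something like
set_fifo_priority(irq_thread_pid, 80) and set_fifo_priority(0, 90) from
within the measuring task, run with sufficient privileges.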

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-10 11:22         ` Ingo Molnar
@ 2004-12-10 15:26           ` K.R. Foley
  0 siblings, 0 replies; 72+ messages in thread
From: K.R. Foley @ 2004-12-10 15:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

Ingo Molnar wrote:
> * K.R. Foley <kr@cybsft.com> wrote:
> 
> 
>>>here's the code:
>>>
>>>+static inline void trace_start_sched_wakeup(task_t *p, runqueue_t *rq)
>>>+{
>>>+       if (TASK_PREEMPTS_CURR(p, rq) && (p != rq->curr) && 
>>>_need_resched())
>>>+               __trace_start_sched_wakeup(p);
>>>+}
>>>
>>>indeed this only triggers if the woken up task has a higher priority
>>>than the waker... hm. Could you try to reverse the priorities of 
>>>realfeel and IRQ8, does that produce traces?
>>
>>I guess I really am slow. You laid it all out for me above and I still
>>didn't get it until I looked at it again. I still haven't been able to
>>capture an actual trace from any of these programs, but thanks to your
>>addition of logging all of the max latencies in 32-14 I can see that
>>the traces were there until another trace pushed them out.
> 
> 
> wakeup tracing can only work reliably if it's a higher-prio task that is
> being woken up (to which the currently executing task is obliged to
> preempt). Otherwise the currently executing task (the one which wakes up
> the other task) could continue to execute indefinitely, making the
> wakeup latency trace much less useful. Hence the priority check and the
> need_resched() check: 'is the wakee higher-prio', and 'does the current
> task really have to preempt right now'.
> 
> (hm, i think the _need_resched() check is in fact buggy, especially on
> SMP systems: if there's a cross-CPU wakeup then _need_resched() wont be
> set for the current task! Is this perhaps what you wanted to point out? 
> I've uploaded the -32-17 kernel which has the _need_resched() check
> removed.)
> 
> unfortunately this issue seems to hit realfeel/rtc_wakeup too: there the
> common wakeup is done from the IRQ thread, which is higher-prio than
> realfeel/rtc_wakeup! So wakeup tracing/timing doesnt get activated at
> all for those types of wakeups.

Also worth noting, unless I have my head up my rear again: if the waker
is higher prio than the wakee (IRQ 8 is higher than rtc_wakeup), the
wakee doesn't preempt the waker during the wakeup. It gets put into the
runqueue but doesn't run until schedule gets called later (assuming
there isn't another higher prio task queued). Whereas if the wakee is
higher prio than the waker, it looks like it will preempt it in most
cases, which I think is probably why there is such a difference being
reported in the rtc histogram when the two priorities are switched.
> 
> a solution/workaround to this would be to 'reverse' the priorities of
> the tasks: i.e. to make the IRQ thread prio 80, and to make realfeel
> prio 90, and to look at the results. In theory realfeel shouldnt be
> running when the next IRQ arrives, so this should produce meaningful
> traces.

This seems to work quite well for realfeel because it doesn't burn the
CPU while it's waiting on data. If an app doesn't just sleep or block
waiting on data, though, couldn't it end up interfering with the IRQ thread?
> 
> 	Ingo
> 



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 16:56 Mark_H_Johnson
  2004-12-09 17:28 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 16:56 UTC (permalink / raw)
  To: Florian Schmidt
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Ingo Molnar,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

Don't take this message the wrong way. I strongly support what
Ingo is doing with the 2.6 kernel. It's just that sometimes the
measurements don't seem to show the improvements everyone wants to see.

>But you do have set your reference irq (soundcard) to the highest prio
>in the PREEMPT_RT case? I just ask to make sure.
Yes, but then I have ALL the IRQ's at the highest priority (plus a couple
other /0 and /1 tasks). Please note, I only use latencytest (an audio
application) to get an idea of RT performance on a desktop machine
before I consider using the kernel for my real application.

In my "real" application (a large real time simulation running on a
cluster) I cannot necessarily assign one batch of IRQ's higher than
any others (nor above / below the main RT tasks). The character of
my RT application is something like this:
 - an interrupt is delivered on a periodic basis across the
PCI bus / shared memory interface to synchronize operations
across the cluster
 - one or more active RT tasks doing compute (mix of logical and
floating point operations)
 - bursts of PCI activity on a shared memory interface between
cluster nodes (think message passing)
 - bursts of network activity (primarily on the head node)
 - occasional bursts of disk I/O, primarily reads but some writes
to preallocated files
 - non RT monitoring (plus a FEW daemons)
CPU load is pretty steady at up to 20% for any of the two CPU nodes
in the cluster. The upper bound for OS overhead (latency) I need is
about 1 msec (out of a 12.5 msec / 80 Hz frame). I do have some
long CPU runs / PCI shared memory traffic in the 80 Hz task at
a one per second rate that might take up to 10 msec of the 12.5
msec frame.
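
The timing budget described above works out as follows. This is only an
arithmetic sketch with hypothetical helper names; the inputs are the
figures quoted in this message (80 Hz frame, up to 10 msec of work in
the long frames, about 1 msec of allowed OS latency).

```c
/* Frame length in microseconds for a periodic task running at 'hz'. */
long frame_usec(long hz)
{
    return 1000000L / hz;
}

/*
 * Slack left in one frame after worst-case compute time and OS latency.
 * A negative result means the frame would overrun.
 */
long frame_slack_usec(long hz, long compute_usec, long latency_usec)
{
    return frame_usec(hz) - compute_usec - latency_usec;
}
```

With these inputs, frame_slack_usec(80, 10000, 1000) leaves 1500 usec
in the once-per-second long frames, while a purely hypothetical 3 msec
delay in such a frame would overrun it by 500 usec.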

I could set the IRQ priority of the shared memory interface to be
the highest (since I do task scheduling based on it) but after
that there is also no preset assignment of priority to I/O activity.
Some form of priority inheritance may be "better" but I understand
that is not likely to be implemented (nor worth the effort).

By setting the IRQ threads to RT FIFO 99, I also get something closer
to PREEMPT_DESKTOP w/o IRQ threading (or for that matter, closer to
the 2.4 kernel I use today). It shows more clearly the overhead
of adding the threads.

The other place where PREEMPT_RT shows the overhead is in simple
activities like a ping. The average time to respond to a ping on a
stock kernel (or PREEMPT_DESKTOP) on my hardware is about 150 usec
and over twice that on PREEMPT_RT.

>Also, the PK results
>can probably even be improved by having all irq handlers threaded except
>for the soundcard irq.
Again, I don't really see a benefit for my real application.

Don't get me wrong. I see a lot of benefit for what Ingo is doing
to the 2.6 kernel. If I see a fix to the non RT process starvation
problem, I don't see any serious problems preventing me from
deploying a 2.6 PREEMPT_DESKTOP kernel. It will be the first 2.6 kernel
that works better for my application than the 2.4 kernel I use today.

It would be better to have PREEMPT_RT at that same point. It solves
some knotty problems (like how to avoid chains of hard / soft IRQ's
from preempting a real time task) but the threading overhead and
related application performance impacts that get introduced at this
point seem pretty significant to me. As Ingo noted in a private
message
  "IRQ-threading will always be more expensive than direct IRQs,
   but it should be a fixed overhead not some drastic degradation."
I agree the overhead should be modest but somehow the test cases I
run don't show that (yet). There is certainly more work to be done
to fix that.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 16:56 Mark_H_Johnson
@ 2004-12-09 17:28 ` Ingo Molnar
  2004-12-09 17:41 ` Ingo Molnar
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 17:28 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Florian Schmidt, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> >But you do have set your reference irq (soundcard) to the highest prio
> >in the PREEMPT_RT case? I just ask to make sure.
>
> Yes, but then I have ALL the IRQ's at the highest priority (plus a couple
> other /0 and /1 tasks). [...]

that is the fundamental problem i believe: your 'CPU loop' gets delayed
by them.

> [...] Please note, I only use latencytest (an audio application) to
> get an idea of RT performance on a desktop machine before I consider
> using the kernel for my real application.

but you never want your real application to be delayed by things like IDE
processing or networking workloads, correct? The only thing that should
have higher priority than your application is the event thread that
handles the hardware from which you get events. I.e. the soundcard IRQ
in your case (plus the timer IRQ thread, because your task is also
timing out).

i'm not sure what the primary event source for your application is, but
i bet it's not the IDE irq thread, nor the network IRQ thread.

so you are seeing the _inverse_ of advances in the -RT kernel: it's
getting better and better at preempting your prio 30 CPU loop with the
higher-prio RT tasks. I.e. the lower-prio CPU loop gets worse and worse
latencies.

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 16:56 Mark_H_Johnson
  2004-12-09 17:28 ` Ingo Molnar
@ 2004-12-09 17:41 ` Ingo Molnar
  2004-12-09 18:26 ` Ingo Molnar
  2004-12-09 19:04 ` Esben Nielsen
  3 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 17:41 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Florian Schmidt, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> Don't take this message the wrong way. I strongly support what Ingo is
> doing with the 2.6 kernel. It's just that sometimes the measurements don't
> seem to show the improvements everyone wants to see.

just in case it wasnt obvious ... your feedback is really useful, no
matter in what direction it goes. You have one of the most complex test
setups, so i pretty much expect your setup to trigger the most problems
as well (and it is also the hardest to analyze). I think we are fine as
long as constant progress is made (which i believe we are making, even
if seemingly not for your particular workload =B-).

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 16:56 Mark_H_Johnson
  2004-12-09 17:28 ` Ingo Molnar
  2004-12-09 17:41 ` Ingo Molnar
@ 2004-12-09 18:26 ` Ingo Molnar
  2004-12-09 19:04 ` Esben Nielsen
  3 siblings, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 18:26 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Florian Schmidt, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> In my "real" application (a large real time simulation running on a
> cluster) I cannot necessarily assign one batch of IRQ's higher than
> any others (nor above / below the main RT tasks). The character of
> my RT application is something like this:
> [...]

> CPU load is pretty steady at up to 20% for any of the two CPU nodes in
> the cluster. The upper bound for OS overhead (latency) I need is about
> 1 msec (out of a 12.5 msec / 80 Hz frame). I do have some long CPU
> runs / PCI shared memory traffic in the 80 Hz task at a one per second
> rate that might take up to 10 msec of the 12.5 msec frame.

so the 1 msec latency is needed by this 80 Hz task? I'd thus make this
task prio 90 (higher than most IRQ handlers), and make the 80 Hz
timesource's [timer IRQ? RTC? special driver?] IRQ thread prio 91. All
other IRQ threads should be below prio 90. Whatever else this task
triggers will be handled either by PI handling, or is started enough in
advance (such as disk IO or network IO) to be completed by the time the
80 Hz task needs it.
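
The layout suggested above can be written down as a simple consistency
check. This is a sketch with a hypothetical helper name; priorities use
the userspace convention in this thread, where a higher number means a
higher priority.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * True if the event-source IRQ thread sits above the RT task and every
 * other IRQ thread sits below it.
 */
bool priority_layout_ok(int event_irq_prio, int task_prio,
                        const int *other_irq_prios, size_t n)
{
    if (event_irq_prio <= task_prio)
        return false;

    for (size_t i = 0; i < n; i++)
        if (other_irq_prios[i] >= task_prio)
            return false;

    return true;
}
```

For example, an event IRQ thread at 91 and the task at 90 with the
remaining IRQ threads at 50 and 49 passes, whereas a setup with every
IRQ thread at 99 fails because other IRQ threads can then preempt the
task.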

> I could set the IRQ priority of the shared memory interface to be the
> highest (since I do task scheduling based on it) but after that there
> is also no preset assignment of priority to I/O activity.

but if this is the task that needs to do its work within 1 msec when
signalled, it should be the highest prio one nevertheless, and no IRQ
(except the signal IRQ) must be allowed to preempt it.

(The other tasks can 'feed' this master task with whatever scheduling
pattern, as long as the 'master task' provides frames with a precise 80
Hz frequency. Any jitter to the execution of these other threads is
handled by buffering enough stuff in advance.)

> Some form of priority inheritance may be "better" but I understand
> that is not likely to be implemented (nor worth the effort).

the master task's priority will be inherited across most of the
dependencies that might happen at the kernel level. [ If it doesnt then
it should show up in traces and i'm most interested in fixing it ... ]

> By setting the IRQ threads to RT FIFO 99, I also get something closer
> to PREEMPT_DESKTOP w/o IRQ threading (or for that matter, closer to
> the 2.4 kernel I use today). It shows more clearly the overhead of
> adding the threads.

i believe this is the wrong model for this workload.

> [...] As Ingo noted in a private message
>   "IRQ-threading will always be more expensive than direct IRQs,
>    but it should be a fixed overhead not some drastic degradation."
>
> I agree the overhead should be modest but somehow the test cases I run
> don't show that (yet). There is certainly more work to be done to fix
> that.

have you tried it with all debugging turned off? I'd like to fix any
performance problems related to IRQ/softirq threading. (If you mean the
'lost pings' problem, that one looks to be more of a priority
inversion problem than a real performance issue.)

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 16:56 Mark_H_Johnson
                   ` (2 preceding siblings ...)
  2004-12-09 18:26 ` Ingo Molnar
@ 2004-12-09 19:04 ` Esben Nielsen
  2004-12-09 19:58   ` john cooper
  2004-12-09 20:16   ` Lee Revell
  3 siblings, 2 replies; 72+ messages in thread
From: Esben Nielsen @ 2004-12-09 19:04 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Florian Schmidt, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel, Ingo Molnar,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Thu, 9 Dec 2004 Mark_H_Johnson@raytheon.com wrote:

> Don't take this message the wrong way. I strongly support what
> Ingo is doing with the 2.6 kernel. Its just sometimes the measurements
> don't seem to show the improvements everyone wants to see.

It all depends on what "everyone" wants to see!  You can tune for
quite different things. Even within real-time you can tune for low
interrupt latency, low task latency, or predictability.

If you for instance want low interrupt latency, you avoid doing almost
anything with interrupts disabled, and actions like waking a task are
deferred from the interrupt routines to some post-interrupt handling. But
that is of course an overhead because you add yet another state to the
whole structure. That would most likely mean lower task latency.

Priority inheritance improves the predictability but it doesn't improve 
the raw interrupt or task latency.

In general: If you want really low latencies you have to do stuff which
hurts the overall performance because you have to turn on preemption at a
really low level. On the other hand, if your latency requirement is not
that low you can do much better by locking for long periods (but shorter
than your required latency, of course) and getting done with the job at
hand without worrying about fine-grained locking.

> [...]
>   "IRQ-threading will always be more expensive than direct IRQs,
>    but it should be a fixed overhead not some drastic degradation."
> I agree the overhead should be modest but somehow the test cases I
> run don't show that (yet). There is certainly more work to be done
> to fix that.
> 

IRQ threading makes the system more predictable but for many, many
devices it is very expensive. I am predicting that many interrupt routines 
have to be turned back to running in interrupt context.

At work I deal with a few drivers in a RTOS. We run ArcNet and ethernet
drivers in task context because the bus is so slow that reading/writing
packets from/to the controllers will block interrupts and therefore all 
tasks for too long. But all the fast interrupts (serial, CAN, timers...)
we handle in interrupt context. 

On a general-purpose OS like Linux, different "users" (RT developers
in this case) have different needs on different systems. Therefore I think
it ought to be configurable, driver by driver. It will be a hard job to
go through them, but Ingo has certainly laid out the framework. What is
needed is to add CONFIG_MY_DRIVER_THREADED and decide the threading and
lock types of the locks used in the driver from that option. Code can
probably be made to do most of the conversions and additions to the configs
automatically :-)
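
[A user-space sketch of what such a per-driver switch could look like. CONFIG_MY_DRIVER_THREADED is the option name proposed above; everything else (drv_lock_t, struct my_driver, the helpers) is hypothetical, and a pthread mutex and a C11 atomic_flag stand in for the kernel's sleeping mutex and raw spinlock.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Uncomment to build the "threaded" variant that uses a sleeping lock. */
/* #define CONFIG_MY_DRIVER_THREADED 1 */

#ifdef CONFIG_MY_DRIVER_THREADED
/* Handler runs in a thread: a sleeping mutex is acceptable. */
typedef pthread_mutex_t drv_lock_t;
#define drv_lock_init(l) pthread_mutex_init((l), NULL)
#define drv_lock(l)      pthread_mutex_lock(l)
#define drv_unlock(l)    pthread_mutex_unlock(l)
#else
/* Handler runs in interrupt context: only a spinning lock will do. */
typedef atomic_flag drv_lock_t;
#define drv_lock_init(l) (atomic_flag_clear(l), 0)
#define drv_lock(l) \
    do { \
        while (atomic_flag_test_and_set_explicit((l), memory_order_acquire)) \
            ; /* busy-wait */ \
    } while (0)
#define drv_unlock(l)    atomic_flag_clear_explicit((l), memory_order_release)
#endif

struct my_driver {
    drv_lock_t lock;
    unsigned long packets;
};

/* Returns 0 on success, like pthread_mutex_init(). */
int my_driver_init(struct my_driver *d)
{
    assert(d != NULL);
    d->packets = 0;
    return drv_lock_init(&d->lock);
}

/* The driver body is written once; only the lock type changes. */
void my_driver_rx(struct my_driver *d)
{
    drv_lock(&d->lock);
    d->packets++;
    drv_unlock(&d->lock);
}
```

[The point of the wrapper macros is that the driver code itself does not change between the two configurations.]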

Mutexes are also an overhead. There must be a lot of locks in the system
which can safely be converted back to raw spinlocks, as their hold time
is of the same order as the locking time internal to a mutex. There is
no point in using a mutex instead of a raw spinlock if the region being
locked is shorter than or about the same as the job of handling the mutex
internals and rescheduling (twice)!

Finally I suggest a very dirty compromise:
Use the internal spinlock in a mutex to lock the user's region when that
region is really small. I.e. instead of (the most common case):
 lock mutex's spinlock
 set mutex's owner current
 unlock mutex's spinlock
 do the stuff
 lock mutex's spinlock
 set mutex's owner NULL
 unlock mutex's spinlock

do
 lock mutex's spinlock
 check owner==NULL
 do the stuff
 unlock mutex's spinlock

Of course, if owner!=NULL this will have to fall back to the very slow case
of sleeping. Once it is seen that every locking of a given mutex is done
this way, it can safely be made into a raw spinlock.
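
[A rough user-space model of the compromise above, assuming a mutex that consists of an internal spinlock plus an owner pointer. The names (struct toy_mutex, toy_mutex_run_short, bump_counter) are made up for illustration and this is not the -RT mutex implementation. The slow path (sleeping when owner != NULL) is left to the caller.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mutex {
    atomic_flag wait_lock;   /* the mutex's internal spinlock */
    void *owner;             /* NULL when free; protected by wait_lock */
};

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Run the short region fn(arg) under the mutex's internal spinlock.
 * Returns true if the region ran. Returns false if the mutex already
 * has an owner, in which case the caller must take the slow path.
 */
bool toy_mutex_run_short(struct toy_mutex *m, void (*fn)(void *), void *arg)
{
    bool ran = false;

    assert(fn != NULL);
    spin_lock(&m->wait_lock);
    if (m->owner == NULL) {      /* check owner==NULL */
        fn(arg);                 /* do the stuff */
        ran = true;
    }
    spin_unlock(&m->wait_lock);
    return ran;
}

/* Example of a "really small" region. */
void bump_counter(void *arg)
{
    ++*(int *)arg;
}
```

[Compared with the common case, this takes the internal spinlock once instead of twice and never writes the owner field.]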

Esben

> --Mark H Johnson
>   <mailto:Mark_H_Johnson@raytheon.com>
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 19:04 ` Esben Nielsen
@ 2004-12-09 19:58   ` john cooper
  2004-12-09 20:16   ` Lee Revell
  1 sibling, 0 replies; 72+ messages in thread
From: john cooper @ 2004-12-09 19:58 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Mark_H_Johnson, Florian Schmidt, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, K.R. Foley,
	linux-kernel, Ingo Molnar, Fernando Pablo Lopez-Lezcano,
	Lee Revell, Rui Nuno Capela, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt, john cooper

Esben Nielsen wrote:

> Muteces are also an overhead. There must be a lot of locks in the system
> which can safely be transfered back to raw spinlocks as the locking time
> is in the same order of the locking time internally in a mutex. There is
> no perpose of using a mutex instead of a raw spinlock if the region being
> locked is shorter or about the same as the job of handling the mutex
> internals and rescheduling (twice)!

That will certainly be the case in some scenarios.  It seems
useful for the mutex user to have a means to advise of the
anticipated usage (hold time).

The other [perhaps additional] means of adaptation would be
Solaris-style where a failed mutex acquisition attempt would
spin rather than block the caller if the mutex owner is
currently running on some other cpu.  The rationale being the
spin wait time is less overhead compared with two context
switches.  Though I'd expect this idea has been batted around
here before.
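
[A toy model of that adaptive policy, assuming each lock owner exposes an "on_cpu" flag. All names are hypothetical, and the spin budget is an added assumption to keep the sketch bounded; it is not part of the description above. A real implementation also has to make sure the owner structure cannot be freed while a waiter is looking at it.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct owner_info {
    atomic_bool on_cpu;      /* true while the owner is running on some CPU */
};

struct adaptive_mutex {
    _Atomic(struct owner_info *) owner;   /* NULL when the mutex is free */
};

enum acquire_result { ACQUIRED, MUST_BLOCK };

/*
 * Try to take the mutex for 'self'. If it is held and the holder is
 * running, spin (up to spin_budget retries); if the holder is not
 * running, give up at once so the caller can block instead.
 */
enum acquire_result adaptive_lock(struct adaptive_mutex *m,
                                  struct owner_info *self,
                                  unsigned spin_budget,
                                  unsigned *spins_used)
{
    assert(self != NULL);
    *spins_used = 0;
    for (;;) {
        struct owner_info *cur = NULL;

        if (atomic_compare_exchange_strong(&m->owner, &cur, self))
            return ACQUIRED;
        /* The exchange failed, so cur now points at the current owner. */
        if (!atomic_load(&cur->on_cpu) || *spins_used >= spin_budget)
            return MUST_BLOCK;
        ++*spins_used;
    }
}
```

[Spinning only pays off when the expected remaining hold time is shorter than the two context switches that blocking would cost.]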

-john

-- 
john.cooper@timesys.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 19:04 ` Esben Nielsen
  2004-12-09 19:58   ` john cooper
@ 2004-12-09 20:16   ` Lee Revell
  1 sibling, 0 replies; 72+ messages in thread
From: Lee Revell @ 2004-12-09 20:16 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Mark_H_Johnson, Florian Schmidt, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, K.R. Foley,
	linux-kernel, Ingo Molnar, Fernando Pablo Lopez-Lezcano,
	Rui Nuno Capela, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Thu, 2004-12-09 at 20:04 +0100, Esben Nielsen wrote:
> IRQ threading makes the system more predictable but for many, many
> devices it is very expensive. I am predicting that many interrupt routines 
> have to be turned back to running in interrupt context.

It's important to keep in mind that for the type of applications that
would want PREEMPT_DESKTOP, the IRQ threading is only necessary because
of the amount of work the IDE subsystem does in hardirq context.  There
was some discussion a while back and Jens posted a patch to move the IDE
IO completion to a softirq.  IIRC there was not a lot of comment on it.
But, it seems to me that this approach would give the most favorable
balance of performance and low latency for many uses.  My tests show
that with softirq preemption this should allow jackd to run at 64 frames
or so without IRQ threading.

Lee

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 15:16 Mark_H_Johnson
  2004-12-09 16:17 ` Florian Schmidt
  2004-12-09 17:13 ` Ingo Molnar
  0 siblings, 2 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 15:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

Comparison of .32-5RT and .32-5PK results
RT has PREEMPT_RT,
PK has PREEMPT_DESKTOP and no threaded IRQ's.
2.4 has lowlat + preempt patches applied

      within 100 usec
       CPU loop (%)   Elapsed Time (sec)    2.4
Test   RT     PK        RT      PK   |   CPU  Elapsed
X     90.46  99.88      67 *    63+  |  97.20   70
top   83.03  99.87      35 *    33+  |  97.48   29
neto  91.69  97.57     360 *   320+* |  96.23   36
neti  88.37  97.79     360 *   300+* |  95.86   41
diskw 87.73  67.41     360 *    57+* |  77.64   29
diskc 86.35  99.39     360 *   320+* |  84.12   77
diskr 81.57  99.89     360 *   320+* |  90.66   86
total                 1902    1413   |         368
 [higher is better]  [lower is better]
* wide variation in audio duration
+ long stretch of audio duration "too fast"
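
[The "total" row can be cross-checked from the per-test elapsed times. The sketch below just encodes the three elapsed-time columns of the table; the struct and function names are illustrative.]

```c
#include <assert.h>
#include <stddef.h>

/* Elapsed time (sec) per test, copied from the table above. */
struct test_row {
    const char *name;
    int rt_sec;    /* PREEMPT_RT */
    int pk_sec;    /* PREEMPT_DESKTOP, no threaded IRQs */
    int k24_sec;   /* 2.4 with lowlat + preempt patches */
};

static const struct test_row rows[] = {
    { "X",      67,  63, 70 },
    { "top",    35,  33, 29 },
    { "neto",  360, 320, 36 },
    { "neti",  360, 300, 41 },
    { "diskw", 360,  57, 29 },
    { "diskc", 360, 320, 77 },
    { "diskr", 360, 320, 86 },
};

enum kernel { KERNEL_RT, KERNEL_PK, KERNEL_24 };

/* Sum one elapsed-time column over all tests. */
int total_elapsed(enum kernel k)
{
    int sum = 0;

    for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
        switch (k) {
        case KERNEL_RT: sum += rows[i].rt_sec;  break;
        case KERNEL_PK: sum += rows[i].pk_sec;  break;
        case KERNEL_24: sum += rows[i].k24_sec; break;
        }
    }
    return sum;
}
```

[The sums come out to 1902, 1413 and 368 seconds, matching the totals in the table.]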

The max CPU latencies in RT are worse than PK as well. The
values for RT range from 3.00 msec to 5.43 msec and on
PK range from 1.45 msec to 2.24 msec.

This is the first set of charts I have seen where _RT is
basically worse than _PK in all the application measures.

To contrast, there were plenty of > 250 usec latency traces
in the _PK run and none during _RT. The PK run also had
three of the "starvation" periods where the 5 second sleep
took 212, 70, and 248 seconds and the RT run had one
starvation period of 11 seconds.

Not quite sure why these measures are so inconsistent..

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 15:16 Mark_H_Johnson
@ 2004-12-09 16:17 ` Florian Schmidt
  2004-12-09 17:13 ` Ingo Molnar
  1 sibling, 0 replies; 72+ messages in thread
From: Florian Schmidt @ 2004-12-09 16:17 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Ingo Molnar, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

On Thu, 9 Dec 2004 09:16:49 -0600
Mark_H_Johnson@raytheon.com wrote:

> Comparison of .32-5RT and .32-5PK results
> RT has PREEMPT_RT,
> PK has PREEMPT_DESKTOP and no threaded IRQ's.
> 2.4 has lowlat + preempt patches applied

But you did set your reference irq (soundcard) to the highest prio
in the PREEMPT_RT case? I'm just asking to make sure. Also, the PK results
can probably even be improved by having all irq handlers threaded except
for the soundcard irq.

> 
>       within 100 usec
>        CPU loop (%)   Elapsed Time (sec)    2.4
> Test   RT     PK        RT      PK   |   CPU  Elapsed
> X     90.46  99.88      67 *    63+  |  97.20   70
> top   83.03  99.87      35 *    33+  |  97.48   29
> neto  91.69  97.57     360 *   320+* |  96.23   36
> neti  88.37  97.79     360 *   300+* |  95.86   41
> diskw 87.73  67.41     360 *    57+* |  77.64   29
> diskc 86.35  99.39     360 *   320+* |  84.12   77
> diskr 81.57  99.89     360 *   320+* |  90.66   86
> total                 1902    1413   |         368
>  [higher is better]  [lower is better]
> * wide variation in audio duration
> + long stretch of audio duration "too fast"
> 

Flo

-- 
Palimm Palimm!
http://affenbande.org/~tapas/

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 15:16 Mark_H_Johnson
  2004-12-09 16:17 ` Florian Schmidt
@ 2004-12-09 17:13 ` Ingo Molnar
  1 sibling, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 17:13 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> Comparison of .32-5RT and .32-5PK results
> RT has PREEMPT_RT,
> PK has PREEMPT_DESKTOP and no threaded IRQ's.
> 2.4 has lowlat + preempt patches applied
> 
>       within 100 usec
>        CPU loop (%)   Elapsed Time (sec)    2.4
> Test   RT     PK        RT      PK   |   CPU  Elapsed
> X     90.46  99.88      67 *    63+  |  97.20   70
> top   83.03  99.87      35 *    33+  |  97.48   29
> neto  91.69  97.57     360 *   320+* |  96.23   36
> neti  88.37  97.79     360 *   300+* |  95.86   41
> diskw 87.73  67.41     360 *    57+* |  77.64   29
> diskc 86.35  99.39     360 *   320+* |  84.12   77
> diskr 81.57  99.89     360 *   320+* |  90.66   86
> total                 1902    1413   |         368
>  [higher is better]  [lower is better]
> * wide variation in audio duration
> + long stretch of audio duration "too fast"

i think this could be the effect of the "CPU loop" being at a lower
priority (prio 30?) than all of the IRQ threads. The SMP scheduler is
now better at distributing high-prio RT tasks, i.e. the IRQ threads, all
of which are higher prio than the CPU loop.

could you do one run with the CPU loop being prio 90, soundcard IRQ
being prio 91 and timer IRQ being prio 92 - so that we can see what the
RT kernel could be capable of, if the IRQ threads didnt interfere?

also, i'd like to take a look at latency traces, if you have them for
this run.

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 14:46 Mark_H_Johnson
  0 siblings, 0 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 14:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

>interactive tasks do get thrown back, but they wont ever preempt RT
>tasks. RT tasks themselves can starve any lower-prio process
>indefinitely.
Definitely the behavior I want to see.

> Interactive tasks can starve other tasks up to a certain
>limit, which is defined via STARVATION_LIMIT, at which point we empty
>the active array and perform an array switch. (also see
>EXPIRED_STARVING())
Could this somehow be the cause of the relatively poor performance
I am seeing with the following combination on a 2 CPU system:
 (a) one RT task with nominal 80% CPU usage / output to audio
 (b) one non RT, nice task at 100% CPU usage (cpu_burn)
 (c) one non RT, not nice task doing lots of I/O
 (d) a hundred non RT tasks, relatively idle
The elapsed time of (c) goes from under 40 seconds to over
300 seconds (basically does little to no work while the RT task is
active).

I should have only 1 CPU's worth of work as RT and based on what
the comments in sched.c indicate the nice job should get preempted
by the not nice job on a regular basis (but somehow that doesn't
seem to happen).
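
[For reference, a simplified user-space model of the array-switch test mentioned above. It is modelled on the EXPIRED_STARVING() macro as it appears in 2.6-era kernel/sched.c, and should be checked against the exact tree in use. struct toy_rq and the function name are made up; 'now' stands in for jiffies and 'starvation_limit' for STARVATION_LIMIT.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the runqueue fields the check looks at. */
struct toy_rq {
    unsigned long expired_timestamp;  /* when the first task expired; 0 = none */
    unsigned long nr_running;         /* runnable tasks on this runqueue */
    int curr_static_prio;             /* static prio of the running task */
    int best_expired_prio;            /* best static prio in the expired array */
};

/*
 * True when an interactive task should no longer be requeued on the
 * active array: either the expired tasks have waited too long, or a
 * task with a better (numerically lower) static priority than the
 * current one is already waiting in the expired array.
 */
bool expired_starving(const struct toy_rq *rq, unsigned long now,
                      unsigned long starvation_limit)
{
    bool waited_too_long;
    bool better_task_expired;

    assert(rq != NULL);
    waited_too_long = starvation_limit && rq->expired_timestamp &&
        (now - rq->expired_timestamp >=
         starvation_limit * rq->nr_running + 1);
    better_task_expired = rq->curr_static_prio > rq->best_expired_prio;
    return waited_too_long || better_task_expired;
}
```

[Note that the time limit scales with nr_running, so with a hundred mostly idle tasks that happen to be runnable the first clause can take a long time to trigger. The second clause is the one that should favour a not-nice task over a nice one.]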

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-09 14:14 Mark_H_Johnson
  0 siblings, 0 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-09 14:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt

Another odd crash, this time with PREEMPT_RT and 32-5.

Was trying to download 32-12 using mozilla and saw the following:
 - the download window came up with ?? of ?? downloaded
 - at this point, mozilla was not responsive, could move windows
with the window manager but no updates to the window contents.
 - top showed no CPU usage for mozilla-bin
Tried alt-sysrq-L (was then going to do -D) and got the following
messages on the serial console...

SysRq : (          IRQ 1-278  |#0): new 2304 us maximum-latency critical section.
[stack dump shown]
(          IRQ 1-278  |#0): new 374313 us maximum-latency critical section.
[stack dump shown]
(          IRQ 1-278  |#0): new 374868 us maximum-latency critical section.
[stack dump shown]
(          IRQ 1-278  |#0): new 374923 us maximum-latency critical section.
[stack dump shown]

At this point, the system is non responsive. Network operations had
stopped, no mouse / display updates, no response to keyboard commands
like Alt-SysRq keys. Never saw the output of Alt-SysRq-L on the serial
console. The system log did not have anything either, its last message
was the one noting that I had logged in for the day.

Let me know if you need the serial console output.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
@ 2004-12-07 21:41 Mark_H_Johnson
  0 siblings, 0 replies; 72+ messages in thread
From: Mark_H_Johnson @ 2004-12-07 21:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	K.R. Foley, Bill Huey, Adam Heath, Florian Schmidt,
	Thomas Gleixner, Michal Schmidt, Fernando Pablo Lopez-Lezcano,
	Karsten Wiese, Gunther Persoons, emann, Shane Shrybman,
	Amit Shah, Esben Nielsen

>i have released the -V0.7.32-6 Real-Time Preemption patch, which can be
>downloaded from the usual place:
>
>   http://redhat.com/~mingo/realtime-preempt/
>

When building V0.7.32-5, (using a -2 kernel) I had another failure
when doing the second mkinitrd. Let me know if you need the trace
for that (since I could not reproduce it with -5).

Some preliminary results for -5 (with PREEMPT_DESKTOP)....

[1] I have a FEW cases where the cpu_delay program triggers
a user trace, but my data collection script does not get any
data to report:
  /proc/sys/kernel/preempt_max_latency
does not change.

For example, the following sequence of activated / triggered messages

Trace activated with 0.000300 second delay.
Trace triggered with 0.000399 second delay. [not recorded]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000383 second delay. [00]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000439 second delay. [01]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000351 second delay. [02]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000521 second delay. [03]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000470 second delay. [04]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000302 second delay. [05]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000313 second delay. [06]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000325 second delay. [07]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000326 second delay. [08]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000434 second delay. [09]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000306 second delay. [10]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000645 second delay. [not recorded]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000324 second delay. [not recorded]
Trace activated with 0.000300 second delay.
Trace triggered with 0.000396 second delay. [11]

and the collected data...

lt.00: latency: 385 us, entries: 454 (454)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.01: latency: 439 us, entries: 434 (434)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.02: latency: 351 us, entries: 885 (885)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.03: latency: 523 us, entries: 592 (592)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.04: latency: 470 us, entries: 571 (571)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.05: latency: 303 us, entries: 235 (235)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.06: latency: 314 us, entries: 5 (5)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.07: latency: 325 us, entries: 868 (868)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.08: latency: 327 us, entries: 226 (226)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.09: latency: 437 us, entries: 508 (508)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.10: latency: 308 us, entries: 411 (411)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]
lt.11: latency: 396 us, entries: 435 (435)   |   [VP:0 KP:1 SP:0 HP:0 #CPUS:2]

None of the ones I collected cross CPU's so I assume that's
why I get a few triggered conditions that don't get traced.
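
[The source of cpu_delay is not shown in this thread. As a sketch only, the "activated / triggered" decision in the log above amounts to comparing the gap between two timestamps with the 0.000300 second threshold. The names below are hypothetical; in a real loop the timestamps would come from a monotonic clock.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define TRACE_THRESHOLD_NS 300000LL   /* 0.000300 s, as in the log above */

/* Nanoseconds elapsed from *a to *b. */
long long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    assert(a != NULL && b != NULL);
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL +
           (b->tv_nsec - a->tv_nsec);
}

/* True when the gap between two samples exceeds the trace threshold. */
bool should_trigger(const struct timespec *prev, const struct timespec *now)
{
    return elapsed_ns(prev, now) > TRACE_THRESHOLD_NS;
}
```

[A delay measured in user space and the latency recorded by the kernel tracer need not match exactly, which fits the small differences between the two lists above.]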

[2] The non RT process starvation continues. I'll send a few
profile logs separately to see if something is obvious. The script
that sleeps for 5 seconds was delayed over 200 seconds at one point.
It may be related to heavy disk activity.

[3] The charts from latencytest had the following results. I'll repeat
the tests to see if the results are consistent.

  a. Max CPU duration was 2.24 msec (vs. 1.16 nominal). That is better
than any of the 2.4 kernel results I have. Duration of the CPU task is
generally pretty good (percentage wise).

  b. Odd variations in audio duration. Varies between
   - consistently "too fast" (should be 1.45 msec, appears to be 1.25 msec)
   - "wide variation" (should be 1.45 msec, varies up to about 2 msec)
   - a FEW huge audio delays (up to 896 msec)

  c. A few periods (up to 10 seconds long) where the CPU duration is
about 100 usec longer than nominal. Primarily during the disk write
stress test but appears for shorter durations in a few of the other tests.
May be hardware related but I find it a little odd that EVERY frame in
these periods is delayed by 100 usec (out of 1160 usec nominal). The
other possible cause I'm thinking of is I got on the same CPU as the
clock interrupt for a long period [not sure why that would happen either].

[4] Of the latency traces, I see the following patterns:

  a. modprobe (don't do that during RT...)
  b. IRQ chaining (e.g., hard IRQ for disk followed by soft IRQ for
network)
  c. timer / signal processing
  d. a FEW odd examples where I get a BIG chunk of time in one line (up
to 400 usec).
  e. long series of rt_check_expire (up to 300 usec) traces
  f. a FEW cases where it appears my serial console may be causing
some long delays (over 200 msec). Of course, if the crashes go away I
can go back to dmesg -n 0 (now using dmesg -n 8).

Will send latency traces separately (as well as the profile outputs).

The max CPU duration numbers during the stress tests are REALLY
encouraging.

  --Mark

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-1
@ 2004-11-16 13:09 Ingo Molnar
  2004-11-16 13:40 ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-3 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-16 13:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah


i have released the -V0.7.27-1 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

this quick update fixes a couple of build bugs.

Changes since a -V0.7.27-0:

 - fix iptables compilation error

 - fix selinux compilation error

to create a -V0.7.27-1 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm1/2.6.10-rc2-mm1.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm1-V0.7.27-1

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-3
  2004-11-16 13:09 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-1 Ingo Molnar
@ 2004-11-16 13:40 ` Ingo Molnar
  2004-11-17 12:42   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-0 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-16 13:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah


i have released the -V0.7.27-3 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

this is another quick update to fix a couple of bugs. Sorry about the
fast pace of updates but these fixes are worth having ASAP:

Changes since a -V0.7.27-1:

 - fix module-put BKL count bug - this could explain/fix the lockups
   reported by Rui Nuno Capela.

 - fixed a netfilter related networking deadlock reported by Mark H. 
   Johnson two weeks ago, it triggered on my testbox today. This (rare)
   bug could potentially explain some of the other lockup reports that
   are still open.

 - fix load average constant +1.0 offset when PREEMPT_RT is enabled. 
   This was an artifact of the IRQ-threading of the timer interrupt.

to create a -V0.7.27-3 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm1/2.6.10-rc2-mm1.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm1-V0.7.27-3

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-0
  2004-11-16 13:40 ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-3 Ingo Molnar
@ 2004-11-17 12:42   ` Ingo Molnar
  2004-11-18 12:35     ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-1 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-17 12:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah


i have released the -V0.7.28-0 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

this is a fixes & latency-reduction release.

Changes since a -V0.7.27-3:

 - made the UP-ioapic code a bit more conservative again - maybe some of
   the lockups are related?

 - removed the BKL from the sound code in a cleaner way and
   removed the quite fragile 'negative ->lock_depth' code. Much less
   intrusive than i originally thought, and much cleaner as well.

 - more fixes to the wakeup-timing logic, 4 false positives fixed in
   total, mostly related to new-task-wakeup not accurately starting the
   tracer.

 - fixed the mmx-memcpy related latency reported by Florian Schmidt and 
   others. Also turned off the MMX/SSE ops in the RAID code, which 
   can introduce similar latencies.

 - kgdb fix from Bill Huey

 - knfsd shutdown with-BKL-held fix

 - highmem compilation fix

 - profiling related crash fix

 - implemented 'direct-path' rescheduling to further reduce scheduling
   latency: the kernel will now in most cases go from try_to_wake_up()
   into the scheduler directly without re-enabling interrupts ever again
   (and thus not giving irq handlers a window to increase latency). This
   is also the final fix for irq nesting and irq-stack recursion.

 - turn off sync wakeups on PREEMPT_RT -> they are latency generators

to create a -V0.7.28-0 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm1/2.6.10-rc2-mm1.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm1-V0.7.28-0

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-1
  2004-11-17 12:42   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-0 Ingo Molnar
@ 2004-11-18 12:35     ` Ingo Molnar
  2004-11-18 16:46       ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.29-0 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-18 12:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah

i have released the -V0.7.28-1 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

this should fix the lockup bug reported by Florian Schmidt.

there's a generic PREEMPT bug in the upstream kernel: there exists a
single-instruction race window in __flush_tlb(), if the kernel preempted
exactly there in a lazy-TLB thread and certain other, rare scheduling
and MM properties were true as well (a certain constellation of threads
and lazy-TLB kernel threads occurred), and the lazy-TLB task then got
another user TLB to inherit, and switched to a task from which it
inherited that new TLB, thus the wrong cr3 was loaded and inherited by
this next, non-lazy-TLB next task; then (and only then) this scenario
would typically manifest itself in the form of an infinite pagefault
lockup occurring much after the fact, upon the next userspace access (to
the joy of a totally baffled kernel developer). I suspect from the
description you can guess how much fun it was to debug it =B-)

the bug is even more rare in the generic kernel, because there most (but
not all) TLB flush points are in a critical section.

this fix could resolve some of the other 'my box just locked up'
reports.

Changes since a -V0.7.28-0:

 - reverted the UP-ioapic change - it was unrelated to the lockup and it
   is known to cause problems on certain IDE/soundcard combinations.

 - fixed and improved the trace_print_on_crash tracing feature - it was
   highly needed to find the TLB bug ...

to create a -V0.7.28-1 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm1/2.6.10-rc2-mm1.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm1-V0.7.28-1

	Ingo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.29-0
  2004-11-18 12:35     ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-1 Ingo Molnar
@ 2004-11-18 16:46       ` Ingo Molnar
  2004-11-22  0:54         ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-2 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-18 16:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah


i have released the -V0.7.29-0 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

this is a pure merge of -V0.7.28-2 to 2.6.10-rc2-mm2. -rc2-mm2 itself is
a fixes-only release.

to create a -V0.7.29-0 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm2/2.6.10-rc2-mm2.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm2-V0.7.29-0

	Ingo


* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-2
  2004-11-18 16:46       ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.29-0 Ingo Molnar
@ 2004-11-22  0:54         ` Ingo Molnar
  2004-11-23 17:58           ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-9 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-22  0:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


i have released the -V0.7.30-2 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

the biggest changes in this release are fixes for priority-inheritance
bugs uncovered by Esben Nielsen's pi_test suite. These bugs could explain
some of the jackd-under-load latencies reported.

Changes since -V0.7.29-0:

 - priority inheritance handling fixes:

    - sort the RT wakees at wakeup time, not at block-time: an RT task
      might have gotten boosted while it slept.

    - fix priority-restoration bug at mutex-release time

    - use rt_task() not p->policy to determine whether a task needs
      PI handling - a SCHED_OTHER task might be boosted to RT prio.

    - fix mutex_setprio() bug: queue now-RT tasks to the active array, 
      otherwise expired SCHED_OTHER tasks will not be properly boosted.

 - went back to the mask-and-delay method of handling hardirqs on 
   UP-IOAPIC as well. Due to APIC prioritization hardirqs can get
   delayed by another, unacked hardirq, so the quick method needs more 
   work before it can be used.

 - added Thomas Gleixner's semaphore -> completion changes for 
   drv->unload_sem. This fixes the module unload crashes reported by 
   Rui Nuno Capela and Shane Shrybman.

 - dvb mutex updates for RT, this fixes the bug reported by Christian 
   Meder.

 - e100 fix from K.R. Foley - this should fix the boot-time e100
   enable_irq warning.

 - NFS lockd mutex RT fixes from Thomas Gleixner - this could fix some
   of the bugs reported by Bill Huey.

 - PREEMPT_VOLUNTARY fixes - this could fix the boot-time hang reported 
   by Lee Revell.

 - wake up irq thread upon creation - this solves the 'irq thread only 
   changes priority after first interrupt arrives' anomaly reported.
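
The rt_task()-versus-p->policy point in the PI fixes above can be
pictured with a small userspace model. This is only a sketch under
stated assumptions: the struct, the field names and the helper
functions are hypothetical and are not taken from the patch; the only
thing borrowed from the kernel is the convention that internal
priorities below MAX_RT_PRIO (100) are real-time and that a lower
value is more urgent.

```c
#include <assert.h>

/* Toy userspace model; struct and helper names are illustrative only. */
#define MAX_RT_PRIO 100          /* internal prios below this are RT */

enum policy { POLICY_OTHER, POLICY_FIFO, POLICY_RR };

struct task {
	enum policy policy;      /* what the task asked for */
	int normal_prio;         /* unboosted prio; lower value = more urgent */
	int prio;                /* effective prio, may be boosted by PI */
};

/* Policy-based test: misses SCHED_OTHER tasks boosted into the RT range. */
static int needs_pi_by_policy(const struct task *p)
{
	return p->policy != POLICY_OTHER;
}

/* Priority-based test, in the spirit of rt_task(): uses the effective prio. */
static int needs_pi_by_prio(const struct task *p)
{
	return p->prio < MAX_RT_PRIO;
}

/* Boost a lock owner to a waiter's prio if the waiter is more urgent. */
static void pi_boost(struct task *owner, const struct task *waiter)
{
	if (waiter->prio < owner->prio)
		owner->prio = waiter->prio;
}
```

In this model a SCHED_OTHER lock owner that gets boosted by a
SCHED_FIFO waiter is invisible to the policy-based test but is caught
by the priority-based one, which is the situation the changelog entry
describes.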

to create a -V0.7.30-2 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm2/2.6.10-rc2-mm2.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm2-V0.7.30-2

	Ingo


* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-9
  2004-11-22  0:54         ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-2 Ingo Molnar
@ 2004-11-23 17:58           ` Ingo Molnar
  2004-11-24 10:16             ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-10 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-23 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


i have released the -V0.7.30-9 Real-Time Preemption patch, which can be
downloaded from the usual place:

    http://redhat.com/~mingo/realtime-preempt/

this is a fixes-only release.

most importantly it includes a JACK related latency fix. With Florian
Schmidt's great detective work we homed in on a big latency source
within JACK: the use of named pipes (fifos) on journalled filesystems.

This issue has been empirically identified before (and is mentioned in
the JACK howto) but has never been given high enough prominence. It
turns out that the atime updates done while read()ing or write()ing
named pipes cause the delays - they may under certain circumstances call
out into the journalling code. They may block even on non-journalled
filesystems.

To work around this issue, when PREEMPT_RT is enabled the -30-9 kernel
skips atime updates on named fifos. (they are pretty pointless anyway.)
Alternative userspace workarounds are to put the fifos on tmpfs/ramfs,
or to mount the filesystem with noatime,nodiratime.
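
For the userspace workaround, an application could check whether the
directory it is about to create its fifos in is memory-backed. The
following is a minimal sketch, not something JACK is known to do: the
function names are hypothetical, statfs() is the standard Linux call,
and the two magic numbers are the tmpfs and ramfs values published in
<linux/magic.h> (redefined locally so the block stands alone).

```c
#include <assert.h>
#include <sys/vfs.h>

/* Magic numbers as published in <linux/magic.h>. */
#define FIFO_TMPFS_MAGIC 0x01021994UL
#define FIFO_RAMFS_MAGIC 0x858458f6UL

/* Pure helper: is this filesystem type memory-backed (no journal)? */
static int fs_type_is_memory_backed(unsigned long f_type)
{
	f_type &= 0xffffffffUL;   /* ignore sign extension on 64-bit */
	return f_type == FIFO_TMPFS_MAGIC || f_type == FIFO_RAMFS_MAGIC;
}

/* Returns 1 if dir lives on tmpfs/ramfs, 0 if not, -1 if statfs() fails. */
static int dir_is_memory_backed(const char *dir)
{
	struct statfs st;

	if (statfs(dir, &st) != 0)
		return -1;
	return fs_type_is_memory_backed((unsigned long)st.f_type);
}
```

A caller would run dir_is_memory_backed() on the intended fifo
directory before calling mkfifo() and fall back to another directory
(or warn) when the result is 0.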

those experiencing xruns under JACK should definitely try the -30-9
kernel.

Changes since -V0.7.30-2:

 - named fifo reads/writes are now atomic, whenever possible

 - fixed pi_lock related SMP & CRITICAL_IRQSOFF_TIMING lockups, this 
   could resolve the lockups reported by Mark H. Johnson.

 - fixed one more PI buglet: wake up the new owner _after_ restoring
   the priority of the old owner.

 - made the NMI oopser more robust - it should print out some message 
   in pretty much any locking scenario.

 - added the blocker device used by Esben Nielsen's pi_test suite.

 - added user-triggerable ALSA xrun tracing to the patch: if a 
   sound IO channel has xrun_debug enabled in /proc then 
   user_trace_stop() will be called before printing the xrun message,
   and the current trace will be saved to /proc/latency_trace. This is a
   'one-shot' tracing method for now. It can be activated via:

     echo 1 > /proc/asound/card0/pcm0p/xrun_debug

     echo 1 > /proc/sys/kernel/trace_user_triggered
     echo 1 > /proc/sys/kernel/trace_freerunning
     echo 0 > /proc/sys/kernel/preempt_max_latency
     echo 0 > /proc/sys/kernel/preempt_thresh
     echo 0 > /proc/sys/kernel/preempt_wakeup_timing

     ./gettimeofday 0 1

  gettimeofday.c is attached below. The JACK fifo xrun source was found
  via this tracing facility.
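
The PI ordering buglet listed above (restore the old owner's priority
first, wake the new owner afterwards) can be modelled in a few lines.
This is a sketch under assumptions: every name below is hypothetical,
there is no real locking or scheduling involved, and a step counter
stands in for "what happened first".

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; all names here are hypothetical, not from the patch. */
struct task {
	int normal_prio;   /* prio without boosting; lower = more urgent */
	int prio;          /* effective prio */
	int woken_at;      /* step counter value when woken, 0 = never */
	int restored_at;   /* step counter value when prio was restored */
};

struct pi_mutex {
	struct task *owner;
	struct task *top_waiter;   /* most urgent waiter, or NULL */
};

static int step;   /* global event counter to record ordering */

static void pi_mutex_unlock(struct pi_mutex *m)
{
	struct task *old = m->owner;
	struct task *next = m->top_waiter;

	/* 1. drop the boost of the old owner first ... */
	old->prio = old->normal_prio;
	old->restored_at = ++step;

	/* 2. ... then hand the lock over and wake the new owner */
	m->owner = next;
	m->top_waiter = NULL;
	if (next)
		next->woken_at = ++step;
}
```

With this ordering the releasing task is already back at its normal
priority by the time the woken task becomes runnable, so the two never
compete at the same boosted level.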

to create a -V0.7.30-9 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm2/2.6.10-rc2-mm2.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm2-V0.7.30-9

	Ingo


-- gettimeofday.c:

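#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Invoke the raw gettimeofday syscall with two integer arguments instead
 * of the usual pointers; on a kernel carrying this patch the call is
 * meant to act as the user-space trace trigger ("0 1" as shown above).
 */
int main (int argc, char **argv)
{
	if (argc != 3) {
		printf("usage: gettimeofday <val1> <val2>\n");
		exit(0);
	}
	/* syscall() avoids the type clash with libc's gettimeofday() prototype */
	syscall(SYS_gettimeofday, atol(argv[1]), atol(argv[2]));

	return 0;
}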



* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-10
  2004-11-23 17:58           ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-9 Ingo Molnar
@ 2004-11-24 10:16             ` Ingo Molnar
  2004-12-03 20:58               ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.32-0 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-11-24 10:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


i have released the -V0.7.30-10 Real-Time Preemption patch, which can be
downloaded from the usual place:

    http://redhat.com/~mingo/realtime-preempt/

this is a fixes-only release.

the most important fixes are the ones to the priority inheritance logic
(affecting the latency of RT tasks), discovered and reported by Esben
Nielsen. I also found two more PI bugs running the new pi_test2 code
from Esben.

Changes since -V0.7.30-9:

 - PI fixes:

   - the waiter->prio field caused wrong priority settings upon unlock, 
     resulting in PI bugs in the nested-locking case.

   - use rt_task() when determining PI tasks, not p->policy.

   - in the blocking-on-blocked-task nesting case both promote now-RT
     tasks to the pi_waiters list and queue them to the head of the wait
     list, and demote now-non-RT tasks from the pi_waiters list and 
     queue them to the tail of the wait list.

 - PI-debugging blocker device update from Esben Nielsen

 - module build fix: export user_trace_stop symbol, this fixes the error 
   reported by Florian Schmidt

 - tracer fix: in the default !freerunning tracing mode, if the trace
   buffer overflowed (this is relatively rare, but can happen) then the
   tracer overwrote kernel memory, leading to lockups/kernel crashes.
   Maybe this bug was also the source of the truncated trace bug
   reported by Mark H. Johnson?

 - reduce tracing overhead within schedule() when !tracing_enabled.
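
The class of bug behind the tracer fix above is an unchecked write
index. A minimal sketch of a trace buffer that handles both modes
safely is shown below; the structure, the field names and the tiny
buffer size are assumptions made for illustration and do not mirror
the tracer's actual data structures.

```c
#include <assert.h>

#define TRACE_ENTRIES 4   /* tiny on purpose; the real size is not known here */

struct trace_buf {
	unsigned long entry[TRACE_ENTRIES];
	unsigned long idx;       /* events stored (or offered, if freerunning) */
	int freerunning;         /* 1: wrap around, 0: stop when full */
	int overflowed;          /* set when an event had to be dropped */
};

static void trace_add(struct trace_buf *tb, unsigned long ev)
{
	if (tb->freerunning) {
		/* wrap: the newest events overwrite the oldest ones */
		tb->entry[tb->idx % TRACE_ENTRIES] = ev;
		tb->idx++;
		return;
	}
	if (tb->idx >= TRACE_ENTRIES) {
		/* bounds check: drop the event instead of writing past the end */
		tb->overflowed = 1;
		return;
	}
	tb->entry[tb->idx++] = ev;
}
```

Without the bounds check in the non-freerunning branch, the fifth
event in this model would be written past the end of entry[], which is
the userspace analogue of the memory corruption described above.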

to create a -V0.7.30-10 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm2/2.6.10-rc2-mm2.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm2-V0.7.30-10

	Ingo


* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.32-0
  2004-11-24 10:16             ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-10 Ingo Molnar
@ 2004-12-03 20:58               ` Ingo Molnar
  2004-12-07 13:29                 ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-4 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-03 20:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

i have released the -V0.7.32-0 Real-Time Preemption patch, which can be
downloaded from the usual place:

	http://redhat.com/~mingo/realtime-preempt/

this is a fixes-mostly release with one new feature:

implemented global RT-task balancing on SMP systems, which improves the
latency of RT tasks on SMP systems. The basic problem was that the 2.6
kernel has per-CPU runqueues. In the current design there is no
guarantee that if an RT task starves another, lower-prio (or same-prio)
RT task in a given local runqueue, that the starved task will ever be
migrated to another CPU: it has to wait for the higher-prio task to
finish. In short, task migration on SMP is fundamentally non-RT and
priority-insensitive. In particular the workloads and latencies reported
by Mark H. Johnson reflect such SMP scheduling artifacts.

the new global RT-task balancing feature solves this problem by tracking
the 'RT overload' situation (when there is more than one RT task on a
CPU) and making other CPUs 'pull' RT tasks (and only RT tasks)
immediately when such a situation occurs.

To give an example, in the following scheduling scenario:

  CPU#0					CPU#1

  task-A SCHED_FIFO prio 30		task-C SCHED_FIFO prio 30 [curr]
  task-B SCHED_FIFO prio 40 [curr]

task-B is the currently executing task on CPU#0, task-C is the currently
executing task on CPU#1. Now on the vanilla 2.6 kernel, if task-C
blocks, there's no guarantee that task-A will be run there - if there's
a SCHED_NORMAL task on CPU#1's runqueue then it will run indefinitely. 
With global RT-balancing task-A will be scheduled on CPU#1 immediately
after task-C leaves it.
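
The 'pull' step in this scenario can be sketched as a toy userspace
model. Everything here is an assumption made for illustration: the
structures, the function names and the linear scans are hypothetical
and are not the patch's code, and priorities use the user-visible
scale from the example (higher = more urgent, 0 = SCHED_NORMAL).

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS   2
#define MAX_TASKS 8

/* Toy model; rt_prio: higher = more urgent, 0 = non-RT task. */
struct task {
	const char *name;
	int rt_prio;
	int running;     /* currently executing on its CPU */
};

struct runqueue {
	struct task *tasks[MAX_TASKS];
	int nr;
};

static int nr_rt(const struct runqueue *rq)
{
	int i, n = 0;

	for (i = 0; i < rq->nr; i++)
		if (rq->tasks[i]->rt_prio > 0)
			n++;
	return n;
}

static int top_rt_prio(const struct runqueue *rq)
{
	int i, top = 0;

	for (i = 0; i < rq->nr; i++)
		if (rq->tasks[i]->rt_prio > top)
			top = rq->tasks[i]->rt_prio;
	return top;
}

/* Called when this_cpu needs a new task: pull the most urgent waiting RT
 * task from any overloaded CPU (more than one RT task queued), provided it
 * beats everything already queued locally. Returns the task or NULL. */
static struct task *pull_rt_task(struct runqueue rqs[], int this_cpu)
{
	struct task *best = NULL;
	int local_top = top_rt_prio(&rqs[this_cpu]);
	int cpu, i, src = -1, slot = -1;

	if (rqs[this_cpu].nr >= MAX_TASKS)
		return NULL;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == this_cpu || nr_rt(&rqs[cpu]) <= 1)
			continue;
		for (i = 0; i < rqs[cpu].nr; i++) {
			struct task *t = rqs[cpu].tasks[i];

			if (t->running || t->rt_prio <= local_top)
				continue;
			if (!best || t->rt_prio > best->rt_prio) {
				best = t;
				src = cpu;
				slot = i;
			}
		}
	}
	if (!best)
		return NULL;
	/* migrate: remove from the source queue, append to the local one */
	rqs[src].tasks[slot] = rqs[src].tasks[--rqs[src].nr];
	rqs[this_cpu].tasks[rqs[this_cpu].nr++] = best;
	return best;
}
```

Set up as in the example (task-A and the running task-B on CPU#0, only
a SCHED_NORMAL task left on CPU#1 after task-C blocked), a call to
pull_rt_task(rqs, 1) moves task-A over to CPU#1.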

furthermore, if in the same scenario e.g. an RT-prio 35 task-D is
woken up on CPU#0, the vanilla 2.6 scheduler will not move it to CPU#1,
even though it could preempt the prio 30 task-C there. With global
RT-balancing this will happen: task-C will be preempted immediately
and task-D will run.
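
The wakeup case can be modelled as a placement decision. This is a
sketch of the resulting behaviour only, under the assumption that it
is enough to compare the wakee against the task currently running on
each CPU; the function name and the array are hypothetical, and the
patch may well reach the same outcome by other means (e.g. by having
the other CPU pull the task).

```c
#include <assert.h>

#define NR_CPUS 2

/* curr_rt_prio[cpu]: user-visible RT prio of the task now running on cpu
 * (higher = more urgent, 0 = non-RT). Hypothetical helper. */
static int pick_cpu_for_rt_wakeup(const int curr_rt_prio[], int wake_cpu,
				  int wakee_rt_prio)
{
	int cpu, target = -1, lowest = wakee_rt_prio;

	/* Stay local if the wakee can preempt right where it woke up. */
	if (curr_rt_prio[wake_cpu] < wakee_rt_prio)
		return wake_cpu;

	/* Otherwise look for the CPU running the least urgent task that
	 * the wakee would still preempt. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (curr_rt_prio[cpu] < lowest) {
			lowest = curr_rt_prio[cpu];
			target = cpu;
		}
	}
	/* Nobody to preempt anywhere: queue behind the local task. */
	return target >= 0 ? target : wake_cpu;
}
```

With CPU#0 running prio 40 and CPU#1 running prio 30, a prio 35 wakee
on CPU#0 is sent to CPU#1, while a prio 25 wakee stays queued on
CPU#0 because it cannot preempt either running task.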

on low RT load (the common case) the scheduler behaves like the stock
scheduler - the new logic only kicks in if a CPU runqueue has 2 or more
RT tasks running at once.

anyway, while the feature is stable on my SMP testboxes, this is still a
nontrivial ~200-lines delta in the scheduler so there might be problems. 
UP should not be affected by this.

other changes since -V0.7.32-20:

 - local-APIC shutdown fix: this should solve some of the 'reboot hangs' 
   reports.

 - more tracing fixes - might fix the 'truncated traces' problems.

 - reduce the NMI watchdog frequency from 10 kHz to 1 kHz.

 - don't report futex reschedules as atomicity violations

to create a -V0.7.32-0 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm2/2.6.10-rc2-mm2.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm2-V0.7.32-0

	Ingo


* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-4
  2004-12-03 20:58               ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.32-0 Ingo Molnar
@ 2004-12-07 13:29                 ` Ingo Molnar
  2004-12-07 14:11                   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-07 13:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


i have released the -V0.7.32-4 Real-Time Preemption patch, which can be
downloaded from the usual place:

    http://redhat.com/~mingo/realtime-preempt/

this is a fixes-only release.

Changes since -V0.7.32-2:

 - fixed a seqlock related xtime_lock lockup scenario - this could 
   explain the SMP lockups reported by Mark H. Johnson.
 
 - fixed a small buglet in the new SMP RT-balancing code, which could 
   lead to bad balancing in certain rare cases.

to create a -V0.7.32-4 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm3/2.6.10-rc2-mm3.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm3-V0.7.32-4

	Ingo


* [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-07 13:29                 ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-4 Ingo Molnar
@ 2004-12-07 14:11                   ` Ingo Molnar
  2004-12-08  4:31                     ` K.R. Foley
  2004-12-08 17:13                     ` Steven Rostedt
  0 siblings, 2 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-07 14:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


i have released the -V0.7.32-6 Real-Time Preemption patch, which can be
downloaded from the usual place:

   http://redhat.com/~mingo/realtime-preempt/

this too is a fixes-only release.

Changes since -V0.7.32-4:

 - fixed a lock_kernel()-re-enables-interrupts bug reported by Daniel 
   Walker. The fix is to allow down() from irqs-off sections (and 
   save/restore irq flags) as long as there's no real contention on the
   semaphore.

 - fixed a /proc/latency_trace formatting bug reported by Mark H. 
   Johnson.

to create a -V0.7.32-6 tree from scratch, the patching order is:

  http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2
  http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.10-rc2.bz2
  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc2/2.6.10-rc2-mm3/2.6.10-rc2-mm3.bz2
  http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.10-rc2-mm3-V0.7.32-6

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-07 14:11                   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Ingo Molnar
@ 2004-12-08  4:31                     ` K.R. Foley
  2004-12-08  8:34                       ` Ingo Molnar
  2004-12-08 17:13                     ` Steven Rostedt
  1 sibling, 1 reply; 72+ messages in thread
From: K.R. Foley @ 2004-12-08  4:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

[-- Attachment #1: Type: text/plain, Size: 397 bytes --]

Ingo Molnar wrote:
> i have released the -V0.7.32-6 Real-Time Preemption patch, which can be
> downloaded from the usual place:
> 

Ingo,

Could you explain what the attached trace means. It looks to me like the 
trace starts in try_to_wake_up when we are trying to wake amlat, but 
then before we finish we get a hit on IRQ 8 and run the IRQ handler??? 
Or do I somehow have it backwards? :)

kr

[-- Attachment #2: trace --]
[-- Type: text/plain, Size: 3211 bytes --]

preemption latency trace v1.1.3 on 2.6.10-rc2-mm3-V0.7.32-9
--------------------------------------------------------------------
 latency: 39 us, #42/42 | (M:rt VP:0, KP:1, SP:1 HP:1 #P:2)
    -----------------
    | task: IRQ 8-677 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> hardirq         
               || / _---=> softirq         
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
   amlat-4973  0-h.3    0µs : __trace_start_sched_wakeup (try_to_wake_up)
   amlat-4973  0-h.3    1µs : _raw_spin_unlock (try_to_wake_up)
   amlat-4973  0-h.2    1µs : preempt_schedule (try_to_wake_up)
   amlat-4973  0        2µs : wake_up_process <IRQ 8-677> (0 1): 
   amlat-4973  0-h.2    2µs : try_to_wake_up (wake_up_process)
   amlat-4973  0-h.2    2µs : _raw_spin_unlock (try_to_wake_up)
   amlat-4973  0-h.1    3µs : preempt_schedule (try_to_wake_up)
   amlat-4973  0-h.1    3µs : wake_up_process (redirect_hardirq)
   amlat-4973  0-h.1    4µs : _raw_spin_unlock (__do_IRQ)
   amlat-4973  0-h..    4µs : preempt_schedule (__do_IRQ)
   amlat-4973  0-h..    4µs : irq_exit (do_IRQ)
   amlat-4973  0-h.1    5µs : do_softirq (irq_exit)
   amlat-4973  0-h.1    5µs : __do_softirq (do_softirq)
   amlat-4973  0-h..    6µs : preempt_schedule_irq (need_resched)
   amlat-4973  0-h..    6µs : __schedule (preempt_schedule_irq)
   amlat-4973  0-h.1    7µs : sched_clock (__schedule)
   amlat-4973  0-h.1    8µs : _raw_spin_lock_irq (__schedule)
   amlat-4973  0-h.1    8µs : _raw_spin_lock_irqsave (__schedule)
   amlat-4973  0-h.2   10µs : pull_rt_tasks (__schedule)
   amlat-4973  0-h.2   10µs : find_next_bit (pull_rt_tasks)
   amlat-4973  0-h.2   11µs+: find_next_bit (pull_rt_tasks)
   amlat-4973  0-..2   13µs : trace_array (__schedule)
   amlat-4973  0       14µs : __schedule <IRQ 8-677> (0 1): 
   amlat-4973  0       14µs+: __schedule <amlat-4973> (1 2): 
   amlat-4973  0       18µs+: __schedule <<unknown-792> (39 3a): 
   amlat-4973  0       21µs : __schedule <<unknown-4> (69 6e): 
   amlat-4973  0       21µs : __schedule <<unknown-4854> (73 78): 
   amlat-4973  0-..2   22µs+: trace_array (__schedule)
   IRQ 8-677   0-..2   31µs : __switch_to (__schedule)
   IRQ 8-677   0       32µs : schedule <amlat-4973> (1 0): 
   IRQ 8-677   0-..2   32µs : finish_task_switch (__schedule)
   IRQ 8-677   0-..2   33µs : smp_send_reschedule_allbutself (finish_task_switch)
   IRQ 8-677   0-..2   33µs : __bitmap_weight (smp_send_reschedule_allbutself)
   IRQ 8-677   0-..2   34µs : __send_IPI_shortcut (smp_send_reschedule_allbutself)
   IRQ 8-677   0-..2   35µs : _raw_spin_unlock (finish_task_switch)
   IRQ 8-677   0-..1   35µs : trace_stop_sched_switched (finish_task_switch)
   IRQ 8-677   0       36µs : finish_task_switch <IRQ 8-677> (0 0): 
   IRQ 8-677   0-..1   36µs+: _raw_spin_lock_irqsave (trace_stop_sched_switched)
   IRQ 8-677   0-..1   43µs : trace_stop_sched_switched (finish_task_switch)


vim:ft=help


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08  4:31                     ` K.R. Foley
@ 2004-12-08  8:34                       ` Ingo Molnar
  2004-12-08 16:07                         ` K.R. Foley
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-08  8:34 UTC (permalink / raw)
  To: K.R. Foley
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


* K.R. Foley <kr@cybsft.com> wrote:

> Could you explain what the attached trace means. It looks to me like
> the trace starts in try_to_wake_up when we are trying to wake amlat,
> but then before we finish we get a hit on IRQ 8 and run the IRQ
> handler???  Or do I somehow have it backwards? :)

>    amlat-4973  0-h.3    0µs : __trace_start_sched_wakeup (try_to_wake_up)
>    amlat-4973  0-h.3    1µs : _raw_spin_unlock (try_to_wake_up)
>    amlat-4973  0-h.2    1µs : preempt_schedule (try_to_wake_up)
>    amlat-4973  0        2µs : wake_up_process <IRQ 8-677> (0 1): 

this portion shows that amlat-4973 woke up IRQ_8-677. Subsequently the 
scheduler picked it from a list of 5 tasks:

>    amlat-4973  0-..2   13µs : trace_array (__schedule)
>    amlat-4973  0       14µs : __schedule <IRQ 8-677> (0 1): 
>    amlat-4973  0       14µs+: __schedule <amlat-4973> (1 2): 
>    amlat-4973  0       18µs+: __schedule <<unknown-792> (39 3a): 
>    amlat-4973  0       21µs : __schedule <<unknown-4> (69 6e): 
>    amlat-4973  0       21µs : __schedule <<unknown-4854> (73 78): 
>    amlat-4973  0-..2   22µs+: trace_array (__schedule)
>    IRQ 8-677   0-..2   31µs : __switch_to (__schedule)

IRQ_8's RT priority was 1, amlat's priority was 2, so IRQ-8 got
selected. (there were also other, SCHED_NORMAL tasks with pid 792, 4 and
4854 in the queue but they did not get selected) [ Note that in reality
the O(1) scheduler only considered IRQ_8 when picking the next task,
it's the tracer that listed all runnable tasks, to make it easier to
validate scheduler logic. This 'list all runnable tasks at schedule()
time' tracing is only done if both tracing and rw-deadlock detection are
enabled.]

in this trace you can see the new RT global balancing at work as
well:

>    IRQ 8-677   0       32µs : schedule <amlat-4973> (1 0): 
>    IRQ 8-677   0-..2   32µs : finish_task_switch (__schedule)
>    IRQ 8-677   0-..2   33µs : smp_send_reschedule_allbutself (finish_task_switch)
>    IRQ 8-677   0-..2   33µs : __bitmap_weight (smp_send_reschedule_allbutself)
>    IRQ 8-677   0-..2   34µs : __send_IPI_shortcut (smp_send_reschedule_allbutself)

here the scheduler noticed that a higher-prio RT task (IRQ_8) preempted
a lower-prio but still RT task (amlat), and sent an IPI (inter-processor
interrupt) to another CPU in the system so that amlat can run on the
other CPU.

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08  8:34                       ` Ingo Molnar
@ 2004-12-08 16:07                         ` K.R. Foley
  2004-12-08 16:18                           ` Lee Revell
  2004-12-09  2:45                           ` K.R. Foley
  0 siblings, 2 replies; 72+ messages in thread
From: K.R. Foley @ 2004-12-08 16:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

Ingo Molnar wrote:
> * K.R. Foley <kr@cybsft.com> wrote:
> 
> 
>>Could you explain what the attached trace means. It looks to me like
>>the trace starts in try_to_wake_up when we are trying to wake amlat,
>>but then before we finish we get a hit on IRQ 8 and run the IRQ
>>handler???  Or do I somehow have it backwards? :)
> 

Thank you. I really did have it backwards. The thing that confused me 
was that trace_start... gets called with the task that we are trying to 
wake up. I didn't follow the trace code far enough to realize that it 
later starts getting task info from current instead of p. :) This all 
makes more sense now.

I am still confused about one thing, unrelated to this. If RT tasks 
never expire and thus are never moved to the expired array??? Does that 
imply that we never switch the active and expired arrays? If so how do 
tasks that do expire get moved back into the active array?

> 
>>   amlat-4973  0-h.3    0µs : __trace_start_sched_wakeup (try_to_wake_up)
>>   amlat-4973  0-h.3    1µs : _raw_spin_unlock (try_to_wake_up)
>>   amlat-4973  0-h.2    1µs : preempt_schedule (try_to_wake_up)
>>   amlat-4973  0        2µs : wake_up_process <IRQ 8-677> (0 1): 
> 
> 
> this portion shows that amlat-4973 woke up IRQ_8-677. Subsequently the 
> scheduler picked it from a list of 5 tasks:
> 
> 
>>   amlat-4973  0-..2   13µs : trace_array (__schedule)
>>   amlat-4973  0       14µs : __schedule <IRQ 8-677> (0 1): 
>>   amlat-4973  0       14µs+: __schedule <amlat-4973> (1 2): 
>>   amlat-4973  0       18µs+: __schedule <<unknown-792> (39 3a): 
>>   amlat-4973  0       21µs : __schedule <<unknown-4> (69 6e): 
>>   amlat-4973  0       21µs : __schedule <<unknown-4854> (73 78): 
>>   amlat-4973  0-..2   22µs+: trace_array (__schedule)
>>   IRQ 8-677   0-..2   31µs : __switch_to (__schedule)
> 
> 
> IRQ_8's RT priority was 1, amlat's priority was 2, so IRQ-8 got
> selected. (there were also other, SCHED_NORMAL tasks with pid 792, 4 and
> 4854 in the queue but they did not get selected) [ Note that in reality
> the O(1) scheduler only considered IRQ_8 when picking the next task,
> it's the tracer that listed all runnable tasks, to make it easier to
> validate scheduler logic. This 'list all runnable tasks at schedule()
> time' tracing is only done if both tracing and rw-deadlock detection are
> enabled.]
> 
> in this trace you can see the new RT global balancing at work as
> well:
> 
> 
>>   IRQ 8-677   0       32µs : schedule <amlat-4973> (1 0): 
>>   IRQ 8-677   0-..2   32µs : finish_task_switch (__schedule)
>>   IRQ 8-677   0-..2   33µs : smp_send_reschedule_allbutself (finish_task_switch)
>>   IRQ 8-677   0-..2   33µs : __bitmap_weight (smp_send_reschedule_allbutself)
>>   IRQ 8-677   0-..2   34µs : __send_IPI_shortcut (smp_send_reschedule_allbutself)
> 
> 
> here the scheduler noticed that a higher-prio RT task (IRQ_8) preempted
> a lower-prio but still RT task (amlat), and sent an IPI (inter-processor
> interrupt) to another CPU in the system so that amlat can run on the
> other CPU.
> 
> 	Ingo
> 



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 16:07                         ` K.R. Foley
@ 2004-12-08 16:18                           ` Lee Revell
  2004-12-08 16:52                             ` K.R. Foley
  2004-12-09  2:45                           ` K.R. Foley
  1 sibling, 1 reply; 72+ messages in thread
From: Lee Revell @ 2004-12-08 16:18 UTC (permalink / raw)
  To: K.R. Foley
  Cc: Ingo Molnar, linux-kernel, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

On Wed, 2004-12-08 at 10:07 -0600, K.R. Foley wrote:
> I am still confused about one thing, unrelated to this. If RT tasks 
> never expire and thus are never moved to the expired array??? Does that 
> imply that we never switch the active and expired arrays? If so how do 
> tasks that do expire get moved back into the active array?

I think that RT tasks use a completely different scheduling mechanism
that bypasses the active/expired array.

Lee



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 16:18                           ` Lee Revell
@ 2004-12-08 16:52                             ` K.R. Foley
  2004-12-08 16:58                               ` Lee Revell
  2004-12-09  9:02                               ` Ingo Molnar
  0 siblings, 2 replies; 72+ messages in thread
From: K.R. Foley @ 2004-12-08 16:52 UTC (permalink / raw)
  To: Lee Revell
  Cc: Ingo Molnar, linux-kernel, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

Lee Revell wrote:
> On Wed, 2004-12-08 at 10:07 -0600, K.R. Foley wrote:
> 
>>I am still confused about one thing, unrelated to this. If RT tasks 
>>never expire and thus are never moved to the expired array??? Does that 
>>imply that we never switch the active and expired arrays? If so how do 
>>tasks that do expire get moved back into the active array?
> 
> 
> I think that RT tasks use a completely different scheduling mechanism
> that bypasses the active/expired array.
> 
> Lee
> 
> 
Please don't misunderstand. I am not arguing with you because obviously 
I am not really intimate with this code, but if the above statement is 
true then I am even more confused than I thought. I don't see any such 
distinctions in the scheduler code. In fact it looks to me like the 
whole scheduler is built on the premise of allowing RT tasks to be just 
like other tasks with a few exceptions, one of which is that RT tasks 
never hit the expired task array.
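
That reading of the scheduler can be written down as a toy model. It
is only a sketch under assumptions: it keeps per-priority counters
instead of task lists, the function names are hypothetical, and the
only things taken from the 2.6 scheduler are the two priority arrays
per runqueue, the 140 priority levels with the first 100 reserved for
RT, and the array switch when the active array runs empty.

```c
#include <assert.h>
#include <string.h>

#define MAX_PRIO    140
#define MAX_RT_PRIO 100   /* prios below this are RT; lower = more urgent */

struct prio_array {
	int nr_active;
	int count[MAX_PRIO];   /* number of queued tasks per priority */
};

struct runqueue {
	struct prio_array arrays[2];
	struct prio_array *active, *expired;
};

static void rq_init(struct runqueue *rq)
{
	memset(rq, 0, sizeof(*rq));
	rq->active = &rq->arrays[0];
	rq->expired = &rq->arrays[1];
}

static void enqueue(struct prio_array *a, int prio)
{
	a->count[prio]++;
	a->nr_active++;
}

static void dequeue(struct prio_array *a, int prio)
{
	a->count[prio]--;
	a->nr_active--;
}

/* The timeslice of a task queued at 'prio' has run out. */
static void timeslice_expired(struct runqueue *rq, int prio)
{
	dequeue(rq->active, prio);
	if (prio < MAX_RT_PRIO)
		enqueue(rq->active, prio);    /* RT: stays in the active array */
	else
		enqueue(rq->expired, prio);   /* others: wait for the next switch */
}

/* Pick the most urgent queued priority, switching arrays if the active
 * one is empty. Returns -1 when nothing is runnable. */
static int pick_next_prio(struct runqueue *rq)
{
	int prio;

	if (rq->active->nr_active == 0) {
		struct prio_array *tmp = rq->active;

		rq->active = rq->expired;
		rq->expired = tmp;
	}
	for (prio = 0; prio < MAX_PRIO; prio++)
		if (rq->active->count[prio])
			return prio;
	return -1;
}
```

In this model the arrays are still switched, but only once the active
array has drained; as long as a runnable RT task sits in the active
array the switch cannot happen, so expired SCHED_NORMAL tasks wait
until the RT tasks block.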

kr


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 16:52                             ` K.R. Foley
@ 2004-12-08 16:58                               ` Lee Revell
  2004-12-09  9:02                               ` Ingo Molnar
  1 sibling, 0 replies; 72+ messages in thread
From: Lee Revell @ 2004-12-08 16:58 UTC (permalink / raw)
  To: K.R. Foley
  Cc: Ingo Molnar, linux-kernel, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

On Wed, 2004-12-08 at 10:52 -0600, K.R. Foley wrote:
> Lee Revell wrote:
> > On Wed, 2004-12-08 at 10:07 -0600, K.R. Foley wrote:
> > 
> >>I am still confused about one thing, unrelated to this. If RT tasks 
> >>never expire and thus are never moved to the expired array??? Does that 
> >>imply that we never switch the active and expired arrays? If so how do 
> >>tasks that do expire get moved back into the active array?
> > 
> > 
> > I think that RT tasks use a completely different scheduling mechanism
> > that bypasses the active/expired array.
> > 
> > Lee
> > 
> > 
> Please don't misunderstand. I am not arguing with you because obviously 
> I am not really intimate with this code, but if the above statement is 
> true then I am even more confused than I thought. I don't see any such 
> distinctions in the scheduler code. In fact it looks to me like the 
> whole scheduler is built on the premise of allowing RT tasks to be just 
> like other tasks with a few exceptions, one of which is that RT tasks 
> never hit the expired task array.

No, you are probably right, I am the one who is confused.

Lee



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 16:52                             ` K.R. Foley
  2004-12-08 16:58                               ` Lee Revell
@ 2004-12-09  9:02                               ` Ingo Molnar
  1 sibling, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09  9:02 UTC (permalink / raw)
  To: K.R. Foley
  Cc: Lee Revell, linux-kernel, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

* K.R. Foley <kr@cybsft.com> wrote:

> [...] it looks to me like the whole scheduler is built on the premise
> of allowing RT tasks to be just like other tasks with a few
> exceptions, one of which is that RT tasks never hit the expired task
> array.

this is more or less correct, and we are trying to keep RT scheduling
'integrated' into the SCHED_NORMAL scheduler as long as it's practical.

but Lee is correct too in that the scheduling behavior of RT tasks and
SCHED_NORMAL tasks is completely different. But on the implementational
level the distinction is less stark and boils down to a few branches
here and there, while 90% of the scheduling code is shared.

to answer your question: it is true that if there is an RT task active
then we never switch the arrays. That's how RT tasks work: they run for
as long as they want. That's why it needs privileges to use RT scheduling,
and that's why a buggy RT task can lock up the system. The 'array
switching' mechanism is part of the 10% of code that is only used by
non-RT tasks. SCHED_FIFO tasks dont have any timeslices, they run until
they deschedule voluntarily. SCHED_RR tasks have a notion of timeslices
but they only yield to RR tasks on their own priority level, which is
implemented without an array-switch. [the array-switch implements fair
scheduling between different-priority SCHED_NORMAL tasks - this is a
fundamentally harder problem which necessitates more work from the
scheduler.]
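
A minimal sketch of the behaviour described above, written as a
user-space C model rather than kernel code. All names (rq_init,
timeslice_expired, maybe_switch_arrays, the POLICY_* constants) are
hypothetical, and the queues are reduced to plain task counts. The only
point it illustrates is that SCHED_NORMAL tasks are the only ones that
ever move to the expired array, so they are the only ones that can
empty the active array and trigger an array switch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each array only tracks how many tasks it holds. */
enum policy { POLICY_NORMAL, POLICY_FIFO, POLICY_RR };

struct prio_array { int nr_active; };

struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active;
    struct prio_array *expired;
    int array_switches;
};

static void rq_init(struct runqueue *rq, int nr_active_tasks)
{
    assert(nr_active_tasks >= 0);
    rq->arrays[0].nr_active = nr_active_tasks;
    rq->arrays[1].nr_active = 0;
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
    rq->array_switches = 0;
}

/*
 * Timer-tick handling for the running task.  Returns true if the task
 * was moved to the expired array.
 */
static bool timeslice_expired(struct runqueue *rq, enum policy p)
{
    assert(rq->active->nr_active > 0);
    if (p == POLICY_FIFO)
        return false;   /* no timeslice at all: keeps running */
    if (p == POLICY_RR)
        return false;   /* requeued behind same-priority RR tasks, stays active */
    rq->active->nr_active--;    /* SCHED_NORMAL: out of the active array */
    rq->expired->nr_active++;
    return true;
}

/* Swap the arrays only once the active array has fully drained. */
static void maybe_switch_arrays(struct runqueue *rq)
{
    if (rq->active->nr_active == 0 && rq->expired->nr_active > 0) {
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
        rq->array_switches++;
    }
}
```

In this model a runnable RT task keeps nr_active above zero, so
maybe_switch_arrays() never swaps while it stays runnable; once only
SCHED_NORMAL tasks remain and all of them have expired, the swap happens.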

(btw., the 'global RT balancing' SMP code i recently added, and the
priority inheritance scheduler code increases the 10% of non-shared
scheduling code to perhaps 15% or so. Not always is it possible to share
code.)

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 16:07                         ` K.R. Foley
  2004-12-08 16:18                           ` Lee Revell
@ 2004-12-09  2:45                           ` K.R. Foley
  2004-12-09 12:11                             ` Ingo Molnar
  1 sibling, 1 reply; 72+ messages in thread
From: K.R. Foley @ 2004-12-09  2:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

K.R. Foley wrote:
<snip>

> 
> I am still confused about one thing, unrelated to this. If RT tasks 
> never expire and thus are never moved to the expired array??? Does that 
> imply that we never switch the active and expired arrays? If so how do 
> tasks that do expire get moved back into the active array?
> 
OK dumb question. I am going out to get my own personal brown paper bag, 
since I seem to be wearing it so often. I forgot tasks get removed from 
the runqueue when they are sleeping, etc. so the active array should 
empty most of the time. However, with more RT tasks and interactive 
tasks being thrown back into the active queue I could see this POSSIBLY 
occasionally starving a few processes???

kr
<snip>


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09  2:45                           ` K.R. Foley
@ 2004-12-09 12:11                             ` Ingo Molnar
  2004-12-09 14:50                               ` K.R. Foley
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09 12:11 UTC (permalink / raw)
  To: K.R. Foley
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


* K.R. Foley <kr@cybsft.com> wrote:

> OK dumb question. I am going out to get my own personal brown paper
> bag, since I seem to be wearing it so often. I forgot tasks get
> removed from the runqueue when they are sleeping, etc. so the active
> array should empty most of the time. However, with more RT tasks and
> interactive tasks being thrown back into the active queue I could see
> this POSSIBLY occasionally starving a few processes???

interactive tasks do get thrown back, but they wont ever preempt RT
tasks. RT tasks themselves can starve any lower-prio process
indefinitely. Interactive tasks can starve other tasks up to a certain
limit, which is defined via STARVATION_LIMIT, at which point we empty
the active array and perform an array switch. (also see
EXPIRED_STARVING())
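
A rough sketch of the starvation check, as a user-space C model. It is
loosely patterned on how I recall the 2.6 scheduler's
EXPIRED_STARVING() test, but the struct, the field names and the exact
threshold (limit scaled by the number of runnable tasks) are
assumptions made for illustration, not a copy of sched.c. The idea
shown: an interactive task is put back into the active array only while
the expired tasks have not been waiting too long.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few runqueue fields the check needs. */
struct rq_model {
    unsigned long expired_timestamp; /* tick of first expiry this cycle, 0 = none */
    unsigned long nr_running;        /* runnable tasks on this runqueue */
};

/* True once expired tasks have waited longer than the scaled limit. */
static bool expired_starving(const struct rq_model *rq, unsigned long now,
                             unsigned long starvation_limit)
{
    assert(rq);
    if (!starvation_limit || !rq->expired_timestamp)
        return false;
    return now - rq->expired_timestamp >=
           starvation_limit * rq->nr_running + 1;
}

/*
 * Where does a task go when its timeslice ends?  Interactive tasks are
 * reinserted into the active array unless the expired array is starving;
 * everything else goes to the expired array.
 */
static bool requeue_to_active(const struct rq_model *rq, bool interactive,
                              unsigned long now, unsigned long starvation_limit)
{
    return interactive && !expired_starving(rq, now, starvation_limit);
}
```

Once requeue_to_active() starts returning false for interactive tasks
too, the active array drains and the normal array switch can happen.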

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09 12:11                             ` Ingo Molnar
@ 2004-12-09 14:50                               ` K.R. Foley
  0 siblings, 0 replies; 72+ messages in thread
From: K.R. Foley @ 2004-12-09 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

Ingo Molnar wrote:
> * K.R. Foley <kr@cybsft.com> wrote:
> 
> 
>>OK dumb question. I am going out to get my own personal brown paper
>>bag, since I seem to be wearing it so often. I forgot tasks get
>>removed from the runqueue when they are sleeping, etc. so the active
>>array should empty most of the time. However, with more RT tasks and
>>interactive tasks being thrown back into the active queue I could see
>>this POSSIBLY occasionally starving a few processes???
> 
> 
> interactive tasks do get thrown back, but they wont ever preempt RT
> tasks. RT tasks themselves can starve any lower-prio process
> indefinitely. Interactive tasks can starve other tasks up to a certain
> limit, which is defined via STARVATION_LIMIT, at which point we empty
> the active array and perform an array switch. (also see
> EXPIRED_STARVING())
> 
> 	Ingo
> 
Understood. BTW, I wouldn't consider some possible starvation of lower 
priority, non-realtime tasks to be incorrect behavior for a realtime 
system. The comments in the above email as well as previous emails were 
not intended as complaints or questions of correctness. They were more 
just thoughts generated while thinking about some of the reports of 
non-realtime tasks being starved.

kr


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-07 14:11                   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Ingo Molnar
  2004-12-08  4:31                     ` K.R. Foley
@ 2004-12-08 17:13                     ` Steven Rostedt
  2004-12-08 18:14                       ` Rui Nuno Capela
  2004-12-09  9:06                       ` Ingo Molnar
  1 sibling, 2 replies; 72+ messages in thread
From: Steven Rostedt @ 2004-12-08 17:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Lee Revell, Rui Nuno Capela, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

Hi Ingo,

I found a race condition in slab.c, but I'm still trying to figure out
exactly how it's playing out.  This has to do with dynamic loading and
unloading of caches. I have a small test case that simulates the problem
at http://home.stny.rr.com/rostedt/tests/sillycaches.tgz

This was done on:

# uname -r
2.6.10-rc2-mm3-V0.7.32-9

I have a module that creates a cache to allocate objects from. When you
unload the module, it deallocates the objects and then destroys the
cache.  But with your patched kernel I get the following output, and the
system then goes into an unstable state. That is, the system will crash
at a later time, usually when dealing with caches.

Here's the output:

slab error in kmem_cache_destroy(): cache `silly_stuff': Can't free all objects
 [<c0103953>] dump_stack+0x23/0x30 (20)
 [<c014929f>] kmem_cache_destroy+0xff/0x1a0 (28)
 [<d081e10d>] mkcache_cleanup+0x1d/0x21 [sillymod] (12)
 [<c013a711>] sys_delete_module+0x161/0x1a0 (100)
 [<c0102a00>] syscall_call+0x7/0xb (-8124)
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<c01383ed>] .... print_traces+0x1d/0x60
.....[<c0103953>] ..   ( <= dump_stack+0x23/0x30)

I've done some extra testing and found that if I wait between the frees
and the destroying of the cache, everything works fine.  This problem
happens because it seems that the objects in the slab are being freed in
a batch style and they don't get freed on the destroy. I put prints in
to see more information and found that in kmem_cache_destroy, it calls
__cache_shrink, which calls drain_cpu_caches (obvious from the code), but
my prints show that by the time it gets down to drain_array_locked (it
does enter the function) ac->avail is zero.  I need to read more into
the details of how the slab works, but you can take a look too.

By the way, 2.6.10-rc2-mm3 does not have a problem with this.

Thanks,

-- Steve


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 17:13                     ` Steven Rostedt
@ 2004-12-08 18:14                       ` Rui Nuno Capela
  2004-12-08 19:03                         ` Steven Rostedt
  2004-12-09  9:06                       ` Ingo Molnar
  1 sibling, 1 reply; 72+ messages in thread
From: Rui Nuno Capela @ 2004-12-08 18:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, LKML, Lee Revell, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

Steven Rostedt wrote:
>
> I found a race condition in slab.c, but I'm still trying to figure out
> exactly how it's playing out.  This has to do with dynamic loading and
> unloading of caches. I have a small test case that simulates the problem
> at http://home.stny.rr.com/rostedt/tests/sillycaches.tgz
>
> This was done on:
>
> # uname -r
> 2.6.10-rc2-mm3-V0.7.32-9
>
> I have a module that creates a cache to allocate objects from. When you
> unload the module, it deallocates the objects and then destroys the
> cache.  But with your patched kernel I get the following output, and the
> system then goes into an unstable state. That is, the system will crash
> at a later time, usually when dealing with caches.
>
> Here's the output:
>
> slab error in kmem_cache_destroy(): cache `silly_stuff': Can't free all
> objects
>  [<c0103953>] dump_stack+0x23/0x30 (20)
>  [<c014929f>] kmem_cache_destroy+0xff/0x1a0 (28)
>  [<d081e10d>] mkcache_cleanup+0x1d/0x21 [sillymod] (12)
>  [<c013a711>] sys_delete_module+0x161/0x1a0 (100)
>  [<c0102a00>] syscall_call+0x7/0xb (-8124)
> ---------------------------
> | preempt count: 00000001 ]
> | 1-level deep critical section nesting:
> ----------------------------------------
> .. [<c01383ed>] .... print_traces+0x1d/0x60
> .....[<c0103953>] ..   ( <= dump_stack+0x23/0x30)
>
>
> I've done some extra testing and found that if I wait between the frees
> and the destroying of the cache, everything works fine.  This problem
> happens because it seems that the objects in the slab are being freed in
> a batch style and they don't get freed on the destroy. I put prints in
> to see more information and found that in kmem_cache_destroy, it calls
> __cache_shrink, which calls drain_cpu_caches (obvious from code), but
> what my prints show, is that when it gets down to drain_array_locked (it
> gets in the function) that ac->avail is zero.  I need to read more into
> the details of how the slab works, but you can take a look too.
>
> By the way, 2.6.10-rc2-mm3 does not have a problem with this.
>

AFAICS this seems to be exactly the bug I reported recently, which shows up
when a usb-storage flashram stick is unplugged for the first time.

Good show Steven :) Hope it helps.

Cheers.
-- 
rncbc aka Rui Nuno Capela
rncbc@rncbc.org



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 18:14                       ` Rui Nuno Capela
@ 2004-12-08 19:03                         ` Steven Rostedt
  2004-12-08 21:39                           ` Rui Nuno Capela
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-08 19:03 UTC (permalink / raw)
  To: Rui Nuno Capela
  Cc: Ingo Molnar, LKML, Lee Revell, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

On Wed, 2004-12-08 at 18:14 +0000, Rui Nuno Capela wrote:
> Steven Rostedt wrote:
> >
> > I found a race condition in slab.c, but I'm still trying to figure out
> > exactly how it's playing out.  This has to do with dynamic loading and
> > unloading of caches. I have a small test case that simulates the problem
> > at http://home.stny.rr.com/rostedt/tests/sillycaches.tgz
> >
> > This was done on:
> >
> > # uname -r
> > 2.6.10-rc2-mm3-V0.7.32-9
> >

<snip>


Found the culprit!!! I did a diff of 2.6.10-rc2-mm3 to
2.6.10-rc2-mm3-V0.7.32-9 and found this in slab.c:
----------------------------
+#ifndef CONFIG_PREEMPT_RT
+/*
+ * Executes in an IRQ context:
+ */
 static void do_drain(void *arg)
 {
        kmem_cache_t *cachep = (kmem_cache_t*)arg;
        struct array_cache *ac;
+       int cpu = smp_processor_id();
         check_irq_off();
-       ac = ac_data(cachep);
+       ac = ac_data(cachep, cpu);
        spin_lock(&cachep->spinlock);
        free_block(cachep, &ac_entry(ac)[0], ac->avail);
        spin_unlock(&cachep->spinlock);
        ac->avail = 0;
 }
+#endif

 static void drain_cpu_caches(kmem_cache_t *cachep)
 {
+#ifndef CONFIG_PREEMPT_RT
        smp_call_function_all_cpus(do_drain, cachep);
+#endif
        check_irq_on();

--------------------------------
(I have CONFIG_PREEMPT_RT defined :-)

I then put in 

 static void drain_cpu_caches(kmem_cache_t *cachep)
 {
 #ifndef CONFIG_PREEMPT_RT
        smp_call_function_all_cpus(do_drain, cachep);
 #endif
        check_irq_on();
        spin_lock_irq(&cachep->spinlock);
+       {
+               struct array_cache *ac;
+               ac = ac_data(cachep, smp_processor_id());
+               free_block(cachep, &ac_entry(ac)[0], ac->avail);
+               ac->avail = 0;
+       }

to see what would happen, and this indeed fixed the problem. At least,
the problem did not appear again after a few tests.

Obviously, this is not the right answer, and Ingo, since I don't know
exactly what you are accomplishing with the added cpu changes, I think
you are probably better at writing a patch than I.  

Which brings up another point.

In places like kmem_cache_create you have cpu = _smp_processor_id(); and
way down near the bottom, you use it.  Since this is a preemptible kernel,
can't the process jump CPUs in the meantime? So isn't that in itself a race
condition?

Thanks,

-- Steve

Rui,

Try adding the following in slab.c

--- slab.c      2004-12-08 09:27:10.000000000 -0500
+++ slab.c.new  2004-12-08 13:58:40.000000000 -0500
@@ -1533,6 +1533,12 @@
 #ifndef CONFIG_PREEMPT_RT
        smp_call_function_all_cpus(do_drain, cachep);
 #endif
+       {
+               struct array_cache *ac;
+               ac = ac_data(cachep, smp_processor_id());
+               free_block(cachep, &ac_entry(ac)[0], ac->avail);
+               ac->avail = 0;
+       }
        check_irq_on();
        spin_lock_irq(&cachep->spinlock);
        if (cachep->lists.shared)


and see if this fixes your usb problems.  I would say that this is not a
proper fix, especially for an SMP system. But if it fixes your problem
then you know this is the solution.



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 19:03                         ` Steven Rostedt
@ 2004-12-08 21:39                           ` Rui Nuno Capela
  2004-12-08 22:11                             ` Steven Rostedt
  0 siblings, 1 reply; 72+ messages in thread
From: Rui Nuno Capela @ 2004-12-08 21:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, LKML, Lee Revell, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

Steven Rostedt wrote:
>> Steven Rostedt wrote:
>> >
>> > I found a race condition in slab.c, but I'm still trying to figure
>> > out exactly how it's playing out.  This has to do with dynamic
>> > loading and unloading of caches. I have a small test case that
>> > simulates the problem at
>> > http://home.stny.rr.com/rostedt/tests/sillycaches.tgz
>> >
>> > This was done on:
>> >
>> > # uname -r
>> > 2.6.10-rc2-mm3-V0.7.32-9
>> >

>
> Rui,
>
> Try adding the following in slab.c
>
> --- slab.c      2004-12-08 09:27:10.000000000 -0500
> +++ slab.c.new  2004-12-08 13:58:40.000000000 -0500
> @@ -1533,6 +1533,12 @@
>  #ifndef CONFIG_PREEMPT_RT
>         smp_call_function_all_cpus(do_drain, cachep);
>  #endif
> +       {
> +               struct array_cache *ac;
> +               ac = ac_data(cachep, smp_processor_id());
> +               free_block(cachep, &ac_entry(ac)[0], ac->avail);
> +               ac->avail = 0;
> +       }
>         check_irq_on();
>         spin_lock_irq(&cachep->spinlock);
>         if (cachep->lists.shared)
>
>
> and see if this fixes your usb problems.  I would say that this is not a
> proper fix and especially for a SMP system. But if it fixes your problem
> then you know this is the solution.
>

Almost there, perhaps not...

It doesn't solve the problem completely, if at all. What was kind of a
deterministic failure now seems probabilistic: the fault still occurs on
unplugging the usb-storage stick, but not every time as before.

Did try several times, reboot included, and now it fails after unplugging
a second or a third time. Never needed to replug/unplug more than 3 times
for it to show up, however.

Here is one sample, taken on the patched RT-V0.7.32-9:

 usb 4-5: USB disconnect, address 7
 slab error in kmem_cache_destroy(): cache `scsi_cmd_cache': Can't free all objects
  [<c010361f>] dump_stack+0x23/0x25 (20)
  [<c014669f>] kmem_cache_destroy+0x103/0x1aa (28)
  [<c026e61a>] scsi_destroy_command_freelist+0x97/0xa8 (28)
  [<c026f451>] scsi_host_dev_release+0x37/0xe1 (104)
  [<c023c569>] device_release+0x7a/0x7c (32)
  [<c01efc50>] kobject_cleanup+0x87/0x89 (28)
  [<c01f06ab>] kref_put+0x52/0xef (40)
  [<c01efc8c>] kobject_put+0x25/0x27 (16)
  [<f8cf1843>] usb_stor_release_resources+0x66/0xca [usb_storage] (16)
  [<f8cf1d93>] storage_disconnect+0x8e/0x9b [usb_storage] (16)
  [<f89ca117>] usb_unbind_interface+0x84/0x86 [usbcore] (28)
  [<c023d7d5>] device_release_driver+0x75/0x77 (28)
  [<c023d9d8>] bus_remove_device+0x53/0x82 (20)
  [<c023c9a1>] device_del+0x4b/0x9b (20)
  [<f89d142a>] usb_disable_device+0xf5/0x10a [usbcore] (28)
  [<f89cc61c>] usb_disconnect+0xc8/0x164 [usbcore] (40)
  [<f89cd77e>] hub_port_connect_change+0x2ef/0x426 [usbcore] (56)
  [<f89cda7b>] hub_events+0x1c6/0x39d [usbcore] (56)
  [<f89cdc89>] hub_thread+0x37/0x109 [usbcore] (96)
  [<c01009b1>] kernel_thread_helper+0x5/0xb (150118420)

Bye now.
-- 
rncbc aka Rui Nuno Capela
rncbc@rncbc.org



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 21:39                           ` Rui Nuno Capela
@ 2004-12-08 22:11                             ` Steven Rostedt
  2004-12-09  9:32                               ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread
From: Steven Rostedt @ 2004-12-08 22:11 UTC (permalink / raw)
  To: Rui Nuno Capela
  Cc: Ingo Molnar, LKML, Lee Revell, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

On Wed, 2004-12-08 at 21:39 +0000, Rui Nuno Capela wrote:
> 
> Almost there, perhaps not...
> 
> It doesn't solve the problem completely, if at all. What was kind of a
> deterministic failure now seems probabilistic: the fault still occurs on
> unplugging the usb-storage stick, but not every time as before.
> 

OK, so I would say that this is part of a fix, but there are others.
There are lots of changes done to the slab.c file by Ingo.  The change I
made (and that is just a quick patch, it needs real work), was only in a
place that was obvious that there could be problems. 

Are you running an SMP machine? If so, then the patch I gave you is
definitely not enough. 

Ingo really scares me with all the removing of local_irq_disables in the
rt mode. I'm not sure exactly what is going on there, and why they can,
or should be removed. Ingo?

> Did try several times, reboot included, and now it fails after unplugging
> a second or a third time. Never needed to replug/unplug more than 3 times
> for it to show up, however.
> 
> Here is one sample, taken on the patched RT-V0.7.32-9:
> 
>  usb 4-5: USB disconnect, address 7
>  slab error in kmem_cache_destroy(): cache `scsi_cmd_cache': Can't free
> all objects
>   [<c010361f>] dump_stack+0x23/0x25 (20)
>   [<c014669f>] kmem_cache_destroy+0x103/0x1aa (28)
>   [<c026e61a>] scsi_destroy_command_freelist+0x97/0xa8 (28)
>   [<c026f451>] scsi_host_dev_release+0x37/0xe1 (104)
>   [<c023c569>] device_release+0x7a/0x7c (32)
>   [<c01efc50>] kobject_cleanup+0x87/0x89 (28)
>   [<c01f06ab>] kref_put+0x52/0xef (40)
>   [<c01efc8c>] kobject_put+0x25/0x27 (16)
>   [<f8cf1843>] usb_stor_release_resources+0x66/0xca [usb_storage] (16)
>   [<f8cf1d93>] storage_disconnect+0x8e/0x9b [usb_storage] (16)
>   [<f89ca117>] usb_unbind_interface+0x84/0x86 [usbcore] (28)
>   [<c023d7d5>] device_release_driver+0x75/0x77 (28)
>   [<c023d9d8>] bus_remove_device+0x53/0x82 (20)
>   [<c023c9a1>] device_del+0x4b/0x9b (20)
>   [<f89d142a>] usb_disable_device+0xf5/0x10a [usbcore] (28)
>   [<f89cc61c>] usb_disconnect+0xc8/0x164 [usbcore] (40)
>   [<f89cd77e>] hub_port_connect_change+0x2ef/0x426 [usbcore] (56)
>   [<f89cda7b>] hub_events+0x1c6/0x39d [usbcore] (56)
>   [<f89cdc89>] hub_thread+0x37/0x109 [usbcore] (96)
>   [<c01009b1>] kernel_thread_helper+0x5/0xb (150118420)
> 

Unfortunately this really doesn't help. The problem occurs earlier than
this. What happened is that there are still slabs out there that need
to be freed, whose freeing was postponed to a later time instead of being
done in the kmem_cache_free call.  So this dump only tells you that. The
backtrace unfortunately doesn't give us any more clues.  It's all in the
slab.c code.
(Of course if the driver really did not free the objects then it's not
slab.c's fault, which may be why Ingo did not think it was related to RT)

Tschuess,

-- Steve



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 22:11                             ` Steven Rostedt
@ 2004-12-09  9:32                               ` Ingo Molnar
  2004-12-09 13:36                                 ` Steven Rostedt
  0 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09  9:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rui Nuno Capela, LKML, Lee Revell, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 2004-12-08 at 21:39 +0000, Rui Nuno Capela wrote:
> > 
> > Almost there, perhaps not...
> > 
> > It doesn't solve the problem completely, if at all. What was kind of a
> > deterministic failure now seems probabilistic: the fault still occurs on
> > unplugging the usb-storage stick, but not every time as before.
> > 
> 
> OK, so I would say that this is part of a fix, but there are others.
> There are lots of changes done to the slab.c file by Ingo.  The change I
> made (and that is just a quick patch, it needs real work), was only in a
> place that was obvious that there could be problems. 
> 
> Are you running an SMP machine? If so, then the patch I gave you is
> definitely not enough. 

one of Rui's boxes is an SMP system - which would explain why the bug
goes from an 'always crash' to 'spurious crash'. (if Rui's laptop
triggers this problem too then there must be something else going on as
well.)

> Ingo really scares me with all the removing of local_irq_disables in
> the rt mode. I'm not sure exactly what is going on there, and why they
> can, or should be removed. Ingo?

it is done so that the SLAB code can be fully preempted too. The SLAB
code is of central importance to the -RT project, if it's not fully
preemptible then that has a ripple effect on other subsystems (timer,
signal code, file handling, etc.).

So while making it fully preemptible was quite challenging (==dangerous,
scary), i couldnt just keep the SLAB using raw spinlocks, due to the
locking dependencies. (nor did i have any true inner desire to keep it
non-preemptible - the point of PREEMPT_RT is to have everything
preemptible. I want to see how much preemption the Linux kernel can take
=B-) It has held up surprisingly well i have to say.)

to make the SLAB code fully preemptible, there were two main aspects
that i had to fix:

 1) irq context execution
 2) process preemption

in the -RT kernel all IRQ contexts execute in a separate process
context, so the SLAB code is never called from a true IRQ context -
hence problem #1 is solved. As far as #1 is concerned, the
local_irq_disable()s are not needed anymore.

the other aspect is process<->process preemption - which can still occur
in the -RT kernel (and is the whole point of the PREEMPT_RT feature). 
This means that the per-CPU assumptions within slab.c break.

To solve this i've turned the unlocked per-CPU SLAB code to be
controlled by the cachep->spinlock. (on RT only - on non-RT kernels the
SLAB code should be largely unmodified - this is why all that _rt and
_nort API trickery is done.) Since the SLAB code is thus locked by
cachep->spinlock on PREEMPT_RT, other tasks cannot interfere with the
internal data structures.

Finally, there was still the problem of the use of smp_processor_id() -
the non-RT SLAB code (rightfully) assumes that smp_processor_id() is
constant, but this is not true for the RT code - which can be preempted
anytime (still holding the spinlock of course) and can be migrated to
another CPU.

To solve this problem i am saving smp_processor_id() once, before we use
any per-CPU data structure for the first time, and this constant CPU ID
value is cached and used throughout the whole SLAB processing pass.

[ Since in the RT case we lock the cachep exclusively, it's not a
problem if the 'old' CPU's ID is used as an index - as long as the index
is consistent. Most of the time the current CPU's ID will be used so we
preserve most of the performance advantages (==cache-hotness) of per-CPU
SLABs on SMP systems too. (except for the locking, which is serialized
on RT.) ]
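
The scheme above can be sketched in user-space C. This is only a model
under stated assumptions: current_cpu stands in for smp_processor_id(),
the locked flag stands in for cachep->spinlock, and every *_model name
is hypothetical. It shows the CPU ID being sampled once and then used
consistently, even if the task migrates in the middle of the pass.

```c
#include <assert.h>

#define NR_CPUS_MODEL 2
#define AC_LIMIT 4

/* Hypothetical per-CPU array cache: a small stack of free objects. */
struct array_cache_model {
    int avail;
    void *entry[AC_LIMIT];
};

struct cache_model {
    int locked;   /* stands in for cachep->spinlock */
    struct array_cache_model cpu[NR_CPUS_MODEL];
};

/* Stands in for smp_processor_id(); changes when the task "migrates". */
static int current_cpu;

static void migrate_to_other_cpu(void)
{
    current_cpu = (current_cpu + 1) % NR_CPUS_MODEL;
}

/* Put obj into a per-CPU array.  Returns 1 if stored, 0 if the array is full. */
static int cache_free_model(struct cache_model *c, void *obj, int migrate_midway)
{
    int cpu = current_cpu;   /* sampled once, before any per-CPU access */
    int stored = 0;

    assert(!c->locked);      /* the cache is held exclusively for the pass */
    c->locked = 1;

    struct array_cache_model *ac = &c->cpu[cpu];

    if (migrate_midway)
        migrate_to_other_cpu();   /* preemption + migration while "locked" */

    /* Still indexing with the saved CPU ID, not with current_cpu. */
    if (ac->avail < AC_LIMIT) {
        ac->entry[ac->avail++] = obj;
        stored = 1;
    }

    c->locked = 0;
    return stored;
}
```

Because the lock is held for the whole pass, using the 'old' CPU's
array after a migration is still consistent; it only costs some
cache-hotness.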

SLAB draining was an oversight - it's mainly called when there is VM
pressure (which is not a strictly necessary feature, so i disabled it),
but i forgot about the module-unload case where it's a correctness
feature. Your patch is a good starting point, i'll try to fix it on SMP
too.
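
To make the module-unload case concrete, here is a small user-space C
model of why draining matters for kmem_cache_destroy(). It is a sketch
of one possible shape for an SMP-aware drain (walk every CPU's array
while holding the cache lock), not the actual fix; the *_model names,
the outstanding counter and the locked flag are all assumptions made
for illustration.

```c
#include <assert.h>

#define NR_CPUS_MODEL 2

/* Hypothetical per-CPU array cache: only the count of parked objects. */
struct array_cache_model { int avail; };

struct cache_model {
    int locked;        /* stands in for cachep->spinlock */
    int outstanding;   /* objects not yet returned to the slab lists */
    struct array_cache_model cpu[NR_CPUS_MODEL];
};

/* Return the objects parked in every CPU's array to the slab lists. */
static void drain_cpu_caches_model(struct cache_model *c)
{
    assert(!c->locked);
    c->locked = 1;
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        c->outstanding -= c->cpu[cpu].avail;
        c->cpu[cpu].avail = 0;
    }
    c->locked = 0;
}

/*
 * Returns 0 on success, -1 for the "Can't free all objects" case.
 * With drain == 0 this mimics the behaviour where draining is skipped.
 */
static int cache_destroy_model(struct cache_model *c, int drain)
{
    if (drain)
        drain_cpu_caches_model(c);
    return c->outstanding == 0 ? 0 : -1;
}
```

Objects that the module already freed but that are still parked in a
per-CPU array look like leaked objects to the destroy path unless every
CPU's array is drained first.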

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-09  9:32                               ` Ingo Molnar
@ 2004-12-09 13:36                                 ` Steven Rostedt
  0 siblings, 0 replies; 72+ messages in thread
From: Steven Rostedt @ 2004-12-09 13:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rui Nuno Capela, LKML, Lee Revell, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen

On Thu, 2004-12-09 at 10:32 +0100, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> > Ingo really scares me with all the removing of local_irq_disables in
> > the rt mode. I'm not sure exactly what is going on there, and why they
> > can, or should be removed. Ingo?
> 
> it is done so that the SLAB code can be fully preempted too. The SLAB
> code is of central importance to the -RT project, if it's not fully
> preemptible then that has a ripple effect on other subsystems (timer,
> signal code, file handling, etc.).
> 
> So while making it fully preemptible was quite challenging (==dangerous,
> scary), i couldnt just keep the SLAB using raw spinlocks, due to the
> locking dependencies. (nor did i have any true inner desire to keep it
> non-preemptible - the point of PREEMPT_RT is to have everything
> preemptible. I want to see how much preemption the Linux kernel can take
> =B-) It has held up surprisingly well i have to say.)

<snip>


> 
> 	Ingo


Ingo,

Thanks for the write up. It really clears things up for me. Now I
understand your approach, not only for slabs, but other areas of the
kernel. Once again, thanks for the explanation.

-- Steve



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6
  2004-12-08 17:13                     ` Steven Rostedt
  2004-12-08 18:14                       ` Rui Nuno Capela
@ 2004-12-09  9:06                       ` Ingo Molnar
  1 sibling, 0 replies; 72+ messages in thread
From: Ingo Molnar @ 2004-12-09  9:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Lee Revell, Rui Nuno Capela, Mark Johnson, K.R. Foley,
	Bill Huey, Adam Heath, Florian Schmidt, Thomas Gleixner,
	Michal Schmidt, Fernando Pablo Lopez-Lezcano, Karsten Wiese,
	Gunther Persoons, emann, Shane Shrybman, Amit Shah,
	Esben Nielsen


* Steven Rostedt <rostedt@goodmis.org> wrote:

> Hi Ingo,
> 
> I found a race condition in slab.c, but I'm still trying to figure out
> exactly how it's playing out.  This has to do with dynamic loading and
> unloading of caches. I have a small test case that simulates the
> problem at http://home.stny.rr.com/rostedt/tests/sillycaches.tgz

good catch! When i converted slab.c to RT i mistakenly thought that SLAB
flushing (draining) is only an SMP optimization (which i thus generously
disabled), but i forgot about module unloading. This could indeed
explain some of the unresolved bugs in the -RT patchset.
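
A toy userspace model of the situation described above, offered only as a
sketch and not as the actual slab.c code: freed objects are parked in a
per-CPU array while the slab bookkeeping still counts them as in use, so
tearing down a cache without draining those arrays makes the cache look busy.
All names here (toy_cache, toy_alloc, toy_free, toy_drain, toy_destroy) are
hypothetical, and the assumption that an undrained cache fails destruction is
a simplification of what the real allocator does.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 2

/* Hypothetical, simplified stand-in for a slab cache. */
struct toy_cache {
    int slab_inuse;              /* objects the slab lists count as allocated */
    int cpu_avail[TOY_NR_CPUS];  /* freed objects parked in each per-CPU array */
};

/* Allocate: reuse a parked object if one exists, else take one from a slab. */
void toy_alloc(struct toy_cache *c, int cpu)
{
    if (c->cpu_avail[cpu] > 0)
        c->cpu_avail[cpu]--;
    else
        c->slab_inuse++;
}

/* Free: park the object in the per-CPU array; the slab still counts it. */
void toy_free(struct toy_cache *c, int cpu)
{
    c->cpu_avail[cpu]++;
}

/* Drain: hand every parked object back to the slab bookkeeping. */
void toy_drain(struct toy_cache *c)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        assert(c->slab_inuse >= c->cpu_avail[cpu]);
        c->slab_inuse -= c->cpu_avail[cpu];
        c->cpu_avail[cpu] = 0;
    }
}

/*
 * Destroy (as on module unload): succeeds only if the slab bookkeeping
 * shows no objects in use. With draining disabled, parked objects keep
 * the count above zero even though the caller has freed everything.
 */
bool toy_destroy(struct toy_cache *c, bool drain_enabled)
{
    if (drain_enabled)
        toy_drain(c);
    return c->slab_inuse == 0;
}
```

In this model, allocating and then freeing three objects on one CPU leaves
slab_inuse at 3, so toy_destroy() with draining disabled reports failure,
while the same call with draining enabled succeeds.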

	Ingo


end of thread, other threads:[~2004-12-14 16:54 UTC | newest]

Thread overview: 72+ messages
2004-12-09 18:10 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Mark_H_Johnson
2004-12-09 19:40 ` Ingo Molnar
2004-12-09 19:58 ` Ingo Molnar
2004-12-10 23:42 ` Steven Rostedt
2004-12-11 16:59   ` john cooper
2004-12-12  4:36     ` Steven Rostedt
2004-12-13 23:45       ` john cooper
2004-12-14 13:01         ` Steven Rostedt
2004-12-14 14:28           ` john cooper
2004-12-14 16:53             ` Steven Rostedt
2004-12-11 17:59   ` Esben Nielsen
2004-12-11 18:59     ` Steven Rostedt
2004-12-11 19:50       ` Esben Nielsen
2004-12-11 22:34         ` Steven Rostedt
2004-12-13 21:55           ` Bill Huey
2004-12-13 22:15             ` Steven Rostedt
2004-12-13 22:20               ` Ingo Molnar
2004-12-13 22:31   ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2004-12-13 14:10 Mark_H_Johnson
2004-12-09 21:58 Mark_H_Johnson
2004-12-09 22:55 ` Ingo Molnar
2004-12-09 20:49 Mark_H_Johnson
2004-12-09 21:56 ` Ingo Molnar
2004-12-09 20:38 Mark_H_Johnson
2004-12-09 19:54 Mark_H_Johnson
2004-12-09 19:23 Mark_H_Johnson
2004-12-09 20:04 ` Ingo Molnar
2004-12-10  5:01 ` Bill Huey
2004-12-10  5:14   ` Steven Rostedt
2004-12-10  5:58     ` Bill Huey
2004-12-09 18:15 Mark_H_Johnson
2004-12-09 20:11 ` Ingo Molnar
2004-12-09 17:22 Mark_H_Johnson
2004-12-09 17:31 ` Ingo Molnar
2004-12-09 20:34   ` K.R. Foley
2004-12-09 22:06     ` Ingo Molnar
2004-12-09 23:16       ` K.R. Foley
2004-12-10  4:26       ` K.R. Foley
2004-12-10 11:22         ` Ingo Molnar
2004-12-10 15:26           ` K.R. Foley
2004-12-09 16:56 Mark_H_Johnson
2004-12-09 17:28 ` Ingo Molnar
2004-12-09 17:41 ` Ingo Molnar
2004-12-09 18:26 ` Ingo Molnar
2004-12-09 19:04 ` Esben Nielsen
2004-12-09 19:58   ` john cooper
2004-12-09 20:16   ` Lee Revell
2004-12-09 15:16 Mark_H_Johnson
2004-12-09 16:17 ` Florian Schmidt
2004-12-09 17:13 ` Ingo Molnar
2004-12-09 14:46 Mark_H_Johnson
2004-12-09 14:14 Mark_H_Johnson
2004-12-07 21:41 Mark_H_Johnson
2004-11-16 13:09 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-1 Ingo Molnar
2004-11-16 13:40 ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.27-3 Ingo Molnar
2004-11-17 12:42   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-0 Ingo Molnar
2004-11-18 12:35     ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm1-V0.7.28-1 Ingo Molnar
2004-11-18 16:46       ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.29-0 Ingo Molnar
2004-11-22  0:54         ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-2 Ingo Molnar
2004-11-23 17:58           ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-9 Ingo Molnar
2004-11-24 10:16             ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.30-10 Ingo Molnar
2004-12-03 20:58               ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm2-V0.7.32-0 Ingo Molnar
2004-12-07 13:29                 ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-4 Ingo Molnar
2004-12-07 14:11                   ` [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-6 Ingo Molnar
2004-12-08  4:31                     ` K.R. Foley
2004-12-08  8:34                       ` Ingo Molnar
2004-12-08 16:07                         ` K.R. Foley
2004-12-08 16:18                           ` Lee Revell
2004-12-08 16:52                             ` K.R. Foley
2004-12-08 16:58                               ` Lee Revell
2004-12-09  9:02                               ` Ingo Molnar
2004-12-09  2:45                           ` K.R. Foley
2004-12-09 12:11                             ` Ingo Molnar
2004-12-09 14:50                               ` K.R. Foley
2004-12-08 17:13                     ` Steven Rostedt
2004-12-08 18:14                       ` Rui Nuno Capela
2004-12-08 19:03                         ` Steven Rostedt
2004-12-08 21:39                           ` Rui Nuno Capela
2004-12-08 22:11                             ` Steven Rostedt
2004-12-09  9:32                               ` Ingo Molnar
2004-12-09 13:36                                 ` Steven Rostedt
2004-12-09  9:06                       ` Ingo Molnar
