* William Weston <weston@sysex.net> wrote:

> OK.  Running on -50-25 now.  The burnP6 starvation doesn't seem to 
> affect the whole system, but comes close enough to require the reset 
> button every time.  I usually, but not always, lose network, X, the 
> keyboard, mouse, and serial console.  I'm still unable to get any sort 
> of a trace from these lockups, since it's looking more like a bunch of 
> processes starving than a kernel crash or a full lockup.
> 
> Once, with VLC (viewing a 5mbit/s mcast/udp stream) and two burnP6 
> instances running, I was able to fire up top on the serial console and 
> found out that the IRQ thread for the ns83820 nic was using 99% of one 
> CPU.

aha! that's an important clue. It seems you've got a screaming interrupt 
or some other loop in the ns83820 IRQ handling. Could you do this:

	chrt -o -p 0 `pidof 'IRQ 18'`

(assuming your ns83820 device is still on IRQ 18). To check that the 
command took effect, run:

	chrt -p `pidof 'IRQ 18'`

and you should see output like:

	pid 748's current scheduling policy: SCHED_OTHER
	pid 748's current scheduling priority: 0

i.e. the thread is not SCHED_FIFO anymore.
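
if you want to script that check, here is a minimal sketch. It assumes 
the chrt output format shown above; the helper name is_sched_other is 
made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the chrt output on stdin reports
# SCHED_OTHER as the scheduling policy (output format assumed to be
# the one shown above).
is_sched_other() {
	grep -q 'scheduling policy: SCHED_OTHER'
}

# Example use on a live system (the IRQ thread must exist):
#   chrt -p `pidof 'IRQ 18'` | is_sched_other && echo "IRQ 18 demoted"
```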

this will not fix the ns83820 problem for you, but will make it more 
debuggable - you will probably still lose networking, but the keyboard 
and the local console should keep working. When the condition triggers 
you should see the ns83820 IRQ thread looping at 99% CPU time. To debug 
the condition further, could you run:

	vmstat 1

what kind of interrupt rate ('in' field) does it report? If it's very 
high then it's a screaming interrupt; if it's low then the IRQ thread is 
looping for some other reason. (Either case would be a bug in the -RT 
kernel.)
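
to make the 'in' column easier to watch (or to log), something like the 
following sketch could be used. The helper name in_rate is made up; it 
assumes the usual vmstat layout where a header line names the columns. 
What counts as "very high" depends on the hardware and load, so no 
threshold is hardcoded:

```shell
#!/bin/sh
# Hypothetical helper: read vmstat output on stdin and print the value
# of the 'in' (interrupts per second) column for every sample line.
# The column is located by name from the header, not by position.
in_rate() {
	awk '
		# header line: remember which column is called "in"
		{ for (i = 1; i <= NF; i++) if ($i == "in") { col = i; next } }
		# sample line: print that column once the header was seen
		col && $1 ~ /^[0-9]+$/ { print $col }
	'
}

# Example use on a live system:
#   vmstat 1 | in_rate
```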

also, could you try to get a trace of what the IRQ thread is doing? I've 
attached trace-it.c; compile it and run it as root (on a 
LATENCY_TRACING-enabled -RT kernel) to get a fine-grained trace of 
what's going on in the system. Whenever the thread-spinning occurs, just 
run this utility:

	trace-it > trace.txt

and you should get a couple of milliseconds' worth of system activity. 
The trace output should look like a really long latency-trace. (Latency 
traces usually compress really well with bzip2 -9, so if you compress 
the trace you can attach it to public replies too - that way others can 
have a look as well.)
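
capturing and compressing can be done in one step. A minimal sketch, 
assuming bzip2 is installed; the wrapper name save_trace is made up for 
illustration:

```shell
#!/bin/sh
# Hypothetical wrapper: run a trace command (e.g. trace-it), save its
# output to the given file and compress that file with bzip2 -9.
# Usage: save_trace OUTFILE COMMAND [ARGS...]
save_trace() {
	out=$1
	shift
	"$@" > "$out" || return 1
	# -f overwrites a stale .bz2 left over from an earlier run
	bzip2 -9 -f "$out"
}

# Example use (as root, while the IRQ thread is spinning):
#   save_trace trace.txt trace-it
# which should leave trace.txt.bz2 in the current directory.
```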

> Once, with a normal desktop load and a yum update, this came across on 
> the serial console:
> 
> cat/2100[CPU#1]: BUG in update_out_trace at kernel/latency.c:587

on SMP this could occur if the TSCs of different CPUs are too far apart 
from each other. I'll probably put an automatic check for this into the 
/proc/latency_trace code.

	Ingo