* William Weston wrote: > OK. Running on -50-25 now. The burnP6 starvation doesn't seem to > affect the whole system, but comes close enough to require the reset > button every time. I usually, but not always, lose network, X, the > keyboard, mouse, and serial console. I'm still unable to get any sort > of a trace from these lockups, since it's looking more like a bunch of > processes starving than a kernel crash or a full lockup. > > Once, with VLC (viewing a 5mbit/s mcast/udp stream) and two burnP6 > instances running, I was able to fire up top on the serial console and > found out that the IRQ thread for the ns83820 nic was using 99% of one > CPU. aha! that's an important clue. It seems you've got a screaming interrupt or some other loop in ns83820 irq handling. Could you do this: chrt -o 0 -p `pidof 'IRQ 18'` (assuming your ns83820 device is still on IRQ18) To check the command was effective, enter the following command: chrt -o -p `pidof 'IRQ 18'` and you should see output like: pid 748's current scheduling policy: SCHED_OTHER pid 748's current scheduling priority: 0 i.e. the thread is not SCHED_FIFO anymore. this will not fix the ns83820 problem for you, but will make it more debuggable - you will still probably lose networking, but keyboard and the local console should work fine. You should see a 99% CPU-time looping ns83820 IRQ thread when the condition triggers. To debug the condition further, could you do something like: vmstat 1 what kind of interrupt rate ('in' field) does it report? If it's very high then it's a screaming interrupt, if it's low then the IRQ thread is looping for some other reason. (but both would be bugs of the -RT kernel.) also, could you try to get a trace of what the IRQ thread is doing. I've attached trace-it.c, just run it as root (on a LATENCY_TRACING-enabled -RT kernel) to get a finegrained trace of what's going on in the system. Whenever the thread-spinning occurs, just run this utility: trace-it > trace.txt and you should get a couple of milliseconds worth of system activity. The trace output should look like a really long latency-trace. (The latency traces usually compress really well with bzip2 -9, so you can attach it to public replies too, if compressed - that way others can have a look too.) > Once, with a normal desktop load and a yum update, this came across on > the serial console: > > cat/2100[CPU#1]: BUG in update_out_trace at kernel/latency.c:587 on SMP this could occur if the TSCs of different CPUs are too apart from each other. I'll probably put an automatic check for this into the /proc/latency_trace code. Ingo