Re: Need help on "Self Detected Stall on CPU"

From: Paul E. McKenney @ 2020-04-30 19:17 UTC
  To: Atul Kulkarni; +Cc: linux-kernel

On Thu, Apr 30, 2020 at 06:47:20PM +0000, Atul Kulkarni wrote:
> Dear Sir,
> 
> Hope you are doing well.  I have watched your various conference videos and have read technical papers.
> We are facing an issue with CPU stall on our systems and I felt like there is no one better who can guide us on how we can deal with it.
> 
> I have attached logs for your reference. Towards the end I ran a couple of sysrq commands and took a crash dump using sysrq, which may help provide additional information.
> Could you please guide us on how we could fix this issue or identify what is going wrong here?

Let's focus on the first few lines of your console message:

[20526.345089] INFO: rcu_preempt self-detected stall on CPU
[20526.351110]  0-...: (1051 ticks this GP) idle=1fe/140000000000002/0 softirq=146268/146268 fqs=0
[20526.360163]   (t=2101 jiffies g=96468 c=96467 q=2)
[20526.365535] rcu_preempt kthread starved for 2101 jiffies! g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0

The last line contains the hint, namely "rcu_preempt kthread starved for
2101 jiffies!"  If you don't let RCU's kernel threads run, then RCU CPU
stall warnings are expected behavior.

The "RCU_GP_WAIT_FQS(3)" means that this kthread's last act was to sleep
for a short timeout while waiting to force quiescent states (the "3" is
the numeric value of that grace-period state).  As you can see from
earlier in that same line, that was 2101 jiffies ago.  The "->state=0x402"
means that the scheduler believes that this kthread is blocked, that is,
not yet runnable.
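
If it helps when sifting through long console logs, here is a minimal
Python sketch that pulls the fields out of a "kthread starved" line like
the one above.  The helper names (parse_starved, decode_state) are made
up for illustration, and the task-state bit values in TASK_BITS are an
assumption that should be checked against include/linux/sched.h for the
kernel that produced the log.

```python
import re

# Hypothetical helper for illustration; not part of any kernel tooling.
# Matches the "kthread starved" line of an RCU CPU stall warning.
STARVED_RE = re.compile(
    r"(?P<flavor>\S+) kthread starved for (?P<jiffies>\d+) jiffies! "
    r"g(?P<g>\d+) c(?P<c>\d+) f(?P<flags>0x[0-9a-fA-F]+) "
    r"(?P<gp_state>\w+)\((?P<gp_state_num>\d+)\) "
    r"->state=(?P<state>0x[0-9a-fA-F]+) ->cpu=(?P<cpu>\d+)"
)

# Assumed task-state bits; verify against include/linux/sched.h
# for the kernel in question.
TASK_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x400: "TASK_NOLOAD",
}


def parse_starved(line):
    """Return a dict of fields from a 'kthread starved' line, or None."""
    m = STARVED_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    for key in ("jiffies", "g", "c", "gp_state_num", "cpu"):
        info[key] = int(info[key])
    for key in ("flags", "state"):
        info[key] = int(info[key], 16)
    return info


def decode_state(state):
    """Name the known bits of a ->state value; unknown bits stay as hex."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = [name for bit, name in sorted(TASK_BITS.items()) if state & bit]
    unknown = state & ~sum(TASK_BITS)
    if unknown:
        names.append(hex(unknown))
    return names
```

Feeding the last console line above to parse_starved() and passing the
resulting "state" value to decode_state() names the bits that are set in
0x402, which is consistent with a kthread that is sleeping rather than
runnable.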

The usual way this sort of thing happens is a timer problem, be it a
hardware configuration problem, a timer-driver bug, an interrupt-handling
problem, and so on.  This sort of problem is especially common when
bringing up new hardware, when modifying timer code, or when modifying
code on the interrupt/exception paths.

So the question to ask yourself is "Why is the timer wakeup not reaching
this kthread?", with special attention to changed code and new hardware.
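
One rough way to start on that question is to check whether the timer
interrupt counts in /proc/interrupts are still advancing on each CPU.
The Python sketch below compares two snapshots of that file.  The
function names and the "timer" keyword match are assumptions made for
illustration, and the names of the timer rows differ across
architectures, so adjust the keyword for your platform.  Note also that
an idle CPU on a tickless kernel can legitimately show no new timer
interrupts, so treat a hit as a lead rather than a verdict.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into (cpus, {label: (counts, desc)})."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        # Skip rows such as ERR/MIS that lack per-CPU columns.
        if len(counts) == len(cpus):
            desc = " ".join(fields[len(cpus):])
            table[label.strip()] = (counts, desc)
    return cpus, table


def stalled_timer_cpus(before, after, keyword="timer"):
    """List (label, cpu) pairs whose timer count did not advance."""
    cpus, old = parse_interrupts(before)
    _, new = parse_interrupts(after)
    stalled = []
    for label, (old_counts, desc) in old.items():
        if keyword not in desc.lower() or label not in new:
            continue
        new_counts = new[label][0]
        for cpu, o, n in zip(cpus, old_counts, new_counts):
            if n == o:
                stalled.append((label, cpu))
    return stalled


def sample_and_check(path="/proc/interrupts", delay=1.0):
    """Take two snapshots 'delay' seconds apart and compare them."""
    with open(path) as f:
        before = f.read()
    time.sleep(delay)
    with open(path) as f:
        after = f.read()
    return stalled_timer_cpus(before, after)
```

Calling sample_and_check() on the affected system while it is busy
returns the rows and CPUs whose timer counts stood still across the
sampling interval.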

							Thanx, Paul
