* Re: Need help on "Self Detected Stall on CPU"
       [not found] <BY5PR11MB4118B89CB3FAEB897AFC3819E0AA0@BY5PR11MB4118.namprd11.prod.outlook.com>
@ 2020-04-30 19:17 ` Paul E. McKenney
  2020-05-01  2:11   ` Atul Kulkarni
  0 siblings, 1 reply; 2+ messages in thread
From: Paul E. McKenney @ 2020-04-30 19:17 UTC (permalink / raw)
  To: Atul Kulkarni; +Cc: linux-kernel

On Thu, Apr 30, 2020 at 06:47:20PM +0000, Atul Kulkarni wrote:
> Dear Sir,
> 
> Hope you are doing well.  I have watched your various conference videos and have read technical papers.
> We are facing an issue with CPU stall on our systems and I felt like there is no one better who can guide us on how we can deal with it.
> 
> I have attached logs for your reference. Towards the end I ran a couple of sysrq commands and took a crash dump using sysrq, which may provide additional information.
> Could you please guide us on how we could fix this issue or identify what is going wrong here?

Let's focus on the first few lines of your console message:

[20526.345089] INFO: rcu_preempt self-detected stall on CPU
[20526.351110]  0-...: (1051 ticks this GP) idle=1fe/140000000000002/0 softirq=146268/146268 fqs=0
[20526.360163]   (t=2101 jiffies g=96468 c=96467 q=2)
[20526.365535] rcu_preempt kthread starved for 2101 jiffies! g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0

The last line contains the hint, namely "rcu_preempt kthread starved for
2101 jiffies!"  If you don't let RCU's kernel threads run, then RCU CPU
stall warnings are expected behavior.
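For anyone who wants to pull these fields out of a console log mechanically, here is a minimal sketch in Python.  The regular expression is derived only from the one line quoted above, so it is an assumption that other kernel versions print the same layout.  The function name and the "hz" parameter are hypothetical; the real tick rate depends on the kernel's CONFIG_HZ setting, which this log does not show.

```python
import re

# Layout assumed from the single "kthread starved" line quoted above;
# other kernel versions may print different fields.
STARVED_RE = re.compile(
    r"(?P<flavor>\w+) kthread starved for (?P<jiffies>\d+) jiffies! "
    r"g(?P<gp>\d+) c(?P<completed>\d+) f(?P<flags>0x[0-9a-fA-F]+) "
    r"(?P<gp_state>\w+)\((?P<gp_state_value>\d+)\) "
    r"->state=(?P<task_state>0x[0-9a-fA-F]+) ->cpu=(?P<cpu>\d+)"
)


def parse_starved_line(line, hz=None):
    """Return the fields of a 'kthread starved' line as a dict, or None."""
    m = STARVED_RE.search(line)
    if m is None:
        return None
    info = {
        "flavor": m["flavor"],
        "starved_jiffies": int(m["jiffies"]),
        "gp": int(m["gp"]),
        "completed": int(m["completed"]),
        "flags": int(m["flags"], 16),
        "gp_state": m["gp_state"],
        # The number printed in parentheses after the state name.
        "gp_state_value": int(m["gp_state_value"]),
        "task_state": int(m["task_state"], 16),
        "cpu": int(m["cpu"]),
    }
    if hz is not None:
        # hz is supplied by the caller; it is not recoverable from the line.
        info["starved_seconds"] = info["starved_jiffies"] / hz
    return info
```

Feeding it the last line of the excerpt yields starved_jiffies=2101, gp_state="RCU_GP_WAIT_FQS", task_state=0x402 and cpu=0.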

The "RCU_GP_WAIT_FQS(3)" means that this kthread's last act was to sleep
for three jiffies.  As you can see from earlier in that same line, that
was 2101 jiffies ago.  The "->state=0x402" means that the scheduler
believes that this kthread is blocked, that is, not yet runnable.
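The "->state" value is a bit mask of the task-state flags from the kernel's include/linux/sched.h.  The sketch below decodes it, assuming the values 0x1 (TASK_INTERRUPTIBLE), 0x2 (TASK_UNINTERRUPTIBLE) and 0x400 (TASK_NOLOAD).  Other bits have moved between kernel versions, so they are deliberately left out and reported as unknown.  The helper name is hypothetical.

```python
# Task-state bits assumed from include/linux/sched.h.  Only the bits needed
# for this log are listed; check the header of the kernel that produced the
# message before relying on any other bit.
TASK_STATE_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x400: "TASK_NOLOAD",
}


def decode_task_state(state):
    """Return the names of the flags set in a task_struct ->state value."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = []
    remaining = state
    for bit, name in sorted(TASK_STATE_BITS.items()):
        if remaining & bit:
            names.append(name)
            remaining &= ~bit
    if remaining:
        # Bits outside the table above: not decoded rather than guessed.
        names.append(f"unknown bits {remaining:#x}")
    return names


print(decode_task_state(0x402))
```

Under those assumptions 0x402 decodes to TASK_UNINTERRUPTIBLE plus TASK_NOLOAD, the combination the kernel defines as TASK_IDLE: a sleeping task waiting to be woken, which is consistent with a blocked kthread.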

The usual way this sort of thing happens is a timer problem, be it a
hardware configuration problem, a timer-driver bug, an interrupt-handling
problem, and so on.  This sort of problem is especially common when
bringing up new hardware or when modifying timer code or when modifying
code on the interrupt/exception paths.

So the question to ask yourself is "Why is the timer wakeup not reaching
this kthread?", with special attention to changed code and new hardware.
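One possible first check, offered only as a sketch and not as the definitive procedure, is to sample /proc/interrupts twice and see whether the timer interrupt counts advance on each CPU.  The function names are hypothetical, and the label of the timer row is an assumption that varies by architecture (for example "LOC" for local timer interrupts on x86).  A CPU that is idle with the tick stopped can legitimately show no increase, so a flat count is a hint rather than proof.

```python
import time


def parse_interrupts(text):
    """Map each row label of /proc/interrupts text to (per-CPU counts, description)."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        per_cpu = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            per_cpu.append(int(field))
        counts[label.strip()] = (per_cpu, " ".join(fields[len(per_cpu):]))
    return counts


def stalled_cpus(before, after, label):
    """Return the CPU indices whose count for `label` did not advance."""
    old, _ = before[label]
    new, _ = after[label]
    return [cpu for cpu, (a, b) in enumerate(zip(old, new)) if b <= a]


def sample(label, interval=1.0, path="/proc/interrupts"):
    """Read the file twice, `interval` seconds apart, and compare one row."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return stalled_cpus(before, after, label)
```

Calling sample("LOC") on an x86 system would list the CPUs whose local timer count stayed flat over one second.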

							Thanx, Paul


* RE: Need help on "Self Detected Stall on CPU"
  2020-04-30 19:17 ` Need help on "Self Detected Stall on CPU" Paul E. McKenney
@ 2020-05-01  2:11   ` Atul Kulkarni
  0 siblings, 0 replies; 2+ messages in thread
From: Atul Kulkarni @ 2020-05-01  2:11 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, Paul Reeves, Mikhail Shoykher

Thank you sir for your guidance and quick response.

Let me introduce my colleagues Paul and Mikhail (copied in CC). They will act on your guidance in this email and may reach out to you with further queries.

Appreciate your support and help.

Thanks,
Atul

