linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Watchdog detected hard lockup.
       [not found] <CAD6qDH4cw1BSHpHoqzFGYkcxBJcFy=AjjoYuY1-a+HAwgDDY1g@mail.gmail.com>
@ 2017-05-18  7:31 ` Sergey Senozhatsky
       [not found]   ` <CAD6qDH6e7HChVg0N8myc5g82ovWAbsqXTD5+=bT6b8ADv8MVYA@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Sergey Senozhatsky @ 2017-05-18  7:31 UTC (permalink / raw)
  To: Sumit Gemini
  Cc: Petr Mladek, Steven Rostedt, Peter Zijlstra, sergey.senozhatsky,
	sergey.senozhatsky.work, linux-kernel

Hello Sumit,

please Cc kernel mailing list.

Cc Petr, Peter, Steven, kernel-list

On (05/18/17 12:22), Sumit Gemini wrote:
>    Hi Sergey,
> 
>  I read your comment on [1]https://lkml.org/lkml/2016/9/27/279. Here you were talking about using alt_printk instead of printk.
>  I Got this crash, could this crash related to call printk recursively?
>  Could you please tell me why i got this crash? Do you see any suspicious entry here?
> 
>  [2324956.184374] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
>  [2324956.184374] Pid: 0, comm: kworker/0:0 Tainted: P            3.1.10-gb20-default #1
>  [2324956.184374] Call Trace:
>  [2324956.184374]  [<ffffffff81003add>] dump_trace+0x98/0x223
>  [2324956.184374]  [<ffffffff813dc659>] dump_stack+0x69/0x6f
>  [2324956.184374]  [<ffffffff813def08>] panic+0xa4/0x1b4
>  [2324956.184374]  [<ffffffff81094b48>] watchdog_overflow_callback+0x79/0x9e
>  [2324956.184374]  [<ffffffff810b66b5>] __perf_event_overflow+0xf9/0x185
>  [2324956.184374]  [<ffffffff81010a2c>] intel_pmu_handle_irq+0x15d/0x1be
>  [2324956.184374]  [<ffffffff813f375b>] perf_event_nmi_handler+0x3e/0x87
>  [2324956.184374]  [<ffffffff813f4c0f>] notifier_call_chain+0x2e/0x5b
>  [2324956.184374]  [<ffffffff813f4c76>] __atomic_notifier_call_chain+0x3a/0x4d
>  [2324956.184374]  [<ffffffff813f4cc1>] notify_die+0x2d/0x32
>  [2324956.184374]  [<ffffffff813f2bbb>] default_do_nmi+0x29/0x1b5
>  [2324956.184374]  [<ffffffff813f2eb7>] do_nmi+0x47/0x6f
>  [2324956.184374]  [<ffffffff813f2a30>] nmi+0x20/0x30
>  [2324956.184374]  [<ffffffff8106b725>] arch_local_irq_save+0x1/0x17
>  [2324956.184374]  [<ffffffff813f1e9b>] _raw_spin_lock_irqsave+0xf/0x39
>  [2324956.184374]  [<ffffffff8105d9b7>] down_trylock+0xb/0x29
>  [2324956.184374]  [<ffffffff810409c2>] console_trylock+0xf/0x47
>  [2324956.184374]  [<ffffffff81040dc9>] console_unlock+0x100/0x129
>  [2324956.184374]  [<ffffffff8104120b>] vprintk+0x348/0x394
>  [2324956.184374]  [<ffffffff813df06b>] printk+0x53/0x58
>  [2324956.184374]  [<ffffffff81359902>] ip_handle_martian_source+0x6d/0xf6
>  [2324956.184374]  [<ffffffff8135b980>] ip_route_input_slow+0x45d/0x462
>  [2324956.184374]  [<ffffffff8135c02f>] ip_route_input_common+0x217/0x231
>  [2324956.184374]  [<ffffffff8137e995>] arp_process+0x181/0x4ac
>  [2324956.184374]  [<ffffffff81337bf5>] __netif_receive_skb+0x317/0x36b
>  [2324956.184374]  [<ffffffff81337b44>] __netif_receive_skb+0x266/0x36b
>  [2324956.184374]  [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
>  [2324956.184374]  [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
>  [2324956.184374]  [<ffffffffa02e1dee>] igb_clean_rx_irq+0x30d/0x39e [igb]
>  [2324956.184374]  [<ffffffffa02e1ecd>] igb_poll+0x4e/0x74 [igb]
>  [2324956.184374]  [<ffffffff81339c88>] net_rx_action+0x65/0x178
>  [2324956.184374]  [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
>  [2324956.184374]  [<ffffffff813f9aac>] call_softirq+0x1c/0x30
>  [2324956.184374]  [<ffffffff81003931>] do_softirq+0x3c/0x7b
>  [2324956.184374]  [<ffffffff81045f98>] irq_exit+0x3c/0xac
>  [2324956.184374]  [<ffffffff81003655>] do_IRQ+0x82/0x98
>  [2324956.184374]  [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e
>  [2324956.184374]  [<ffffffff8123f91d>] intel_idle+0xdd/0x104
>  [2324956.184374]  [<ffffffff8130b76b>] cpuidle_idle_call+0xdf/0x181
>  [2324956.184374]  [<ffffffff81001201>] cpu_idle+0x56/0xa9
> 
>  I shell be very greatful to you.
> 

your CPU1 couldn't acquire console_sem spin_lock

	down_trylock()
		raw_spin_lock_irqsave(&sem->lock, flags);

and _probably_ looped on it with local IRQs disabled long enough to
disappoint the NMI watchdog, which eventually declared a hard lockup
condition. but I'm not entirely sure it stuck looping on sem->lock,
it could be the case that your net IRQ took too much time and by the
time NMI declared hard lockup we were just unlucky to execute
down_trylock().

on the other hand, the fact that you see

	IRQ
	 ip_handle_martian_source()
	  printk()
	    console_unlock()

is already not really good. first of all, you had hit ip_route_input slow
path; second, you hit pr_warn()/pr_err or print_hex_dump(), so the kernel
was reporting you something abnormal. printk() invoked from IRQ context
managed to acquire the console_sem and spent some time printing the messages
to the serial console, console_unlock() in the call trace. which sometimes
can be a call for troubles. printk() can't always be safely called from
atomic contexts, just because of cases like this.

I don't know how big ->hard_header_len can be, but that print_hex_dump()
can take some extra time. if you also happened to have other printk messages
coming from other CPUs in the meantime, then CPU1 had have to spent even
more time in console_unlock() printing those messages; from IRQ handler.


hm, this doesn't look to me as a printk recursion, at least after a
quick look. either,

a) something was holding sem->lock, can't tell what exactly  (unlikely)

or

b) your IRQ handler just happened to execute way too long. given that
   there was printk()->console_unlock() and probably a hex dump of
   skb, this is a more likely root cause.

	-ss

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Watchdog detected hard lockup.
       [not found]     ` <CAD6qDH6r76dv6pucTpOOmq_GCi=Yt9mW1yTOHSMLitHVf3CqpQ@mail.gmail.com>
@ 2017-05-24  8:26       ` Sergey Senozhatsky
  2017-05-24  8:39         ` Sergey Senozhatsky
  0 siblings, 1 reply; 5+ messages in thread
From: Sergey Senozhatsky @ 2017-05-24  8:26 UTC (permalink / raw)
  To: Sumit Gemini
  Cc: Sergey Senozhatsky, Petr Mladek, Steven Rostedt, Peter Zijlstra,
	Sergey Senozhatsky, linux-kernel

On (05/24/17 12:57), Sumit Gemini wrote:
>    Hi Sergey,
> 
>    Can I get solution for this issue? As i tried to stop martian source
>    packets but team did not agree on it. Please suggest me what can i do?

oh, um... hm. there is no quick solution I can suggest (assuming that
the lockup was actually caused by printk).

try to disable CONFIG_IP_ROUTE_VERBOSE?

	-ss

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Watchdog detected hard lockup.
  2017-05-24  8:26       ` Sergey Senozhatsky
@ 2017-05-24  8:39         ` Sergey Senozhatsky
       [not found]           ` <CAD6qDH6zyP535vm4VQ+R0JFRXupvXsxj2xUmDKcd3=iJAfY1Dg@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Sergey Senozhatsky @ 2017-05-24  8:39 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sumit Gemini, Petr Mladek, Steven Rostedt, Peter Zijlstra,
	Sergey Senozhatsky, linux-kernel

On (05/24/17 17:26), Sergey Senozhatsky wrote:
> On (05/24/17 12:57), Sumit Gemini wrote:
> >    Hi Sergey,
> > 
> >    Can I get solution for this issue? As i tried to stop martian source
> >    packets but team did not agree on it. Please suggest me what can i do?
> 
> oh, um... hm. there is no quick solution I can suggest (assuming that
> the lockup was actually caused by printk).
> 
> try to disable CONFIG_IP_ROUTE_VERBOSE?

unless you want to play with RFC/untested patchset:

lkml.kernel.org/r/20170509082859.854-1-sergey.senozhatsky@gmail.com

which you'll have to backport to your kernel.

	-ss

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Fwd: Watchdog detected hard lockup.
       [not found]                   ` <CAD6qDH6C2umiP-UOB2nkwawQO+1JBw8Xc+a0NFRCOZHTjnqmtQ@mail.gmail.com>
@ 2017-05-25  5:50                     ` Sergey Senozhatsky
  2017-05-25  9:56                       ` Petr Mladek
  0 siblings, 1 reply; 5+ messages in thread
From: Sergey Senozhatsky @ 2017-05-25  5:50 UTC (permalink / raw)
  To: Sumit Gemini
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Petr Mladek,
	Peter Zijlstra, Steven Rostedt, linux-kernel

On (05/25/17 11:14), Sumit Gemini wrote:
>    Thanks Sergey and i forgot to add other guys in discussion.
>     
> 
>    [1]lkml.kernel.org/r/20170509082859.854-1-sergey.senozhatsky@gmail.com
> 
>    Now I'm going for backporting the kernel with RFC, as you suggested me
>    yesterday.

or you can try to reduce the amount of data you printk from IRQ. do you
need to hex dump sbks all the time? switching to a deferred printk won't
do the trick, probably, because you gonna console_unlock() from IRQ in
the worst case anyway.

	-ss

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Fwd: Watchdog detected hard lockup.
  2017-05-25  5:50                     ` Fwd: " Sergey Senozhatsky
@ 2017-05-25  9:56                       ` Petr Mladek
  0 siblings, 0 replies; 5+ messages in thread
From: Petr Mladek @ 2017-05-25  9:56 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sumit Gemini, Sergey Senozhatsky, Peter Zijlstra, Steven Rostedt,
	linux-kernel

On Thu 2017-05-25 14:50:14, Sergey Senozhatsky wrote:
> On (05/25/17 11:14), Sumit Gemini wrote:
> >    Thanks Sergey and i forgot to add other guys in discussion.
> >     
> > 
> >    [1]lkml.kernel.org/r/20170509082859.854-1-sergey.senozhatsky@gmail.com
> > 
> >    Now I'm going for backporting the kernel with RFC, as you suggested me
> >    yesterday.
> 
> or you can try to reduce the amount of data you printk from IRQ.

In fact, you should reduce the amount of data you printk from all
contexts. There might be more sources that produce too many messages.
The system evidently has troubles to keep the speed and get them
on the console.

You might also try to reduce the console_level and show only the more
important messages on the console.


> need to hex dump sbks all the time?

I wanted to suggest using ratelimit but it seems that the hex dump
is already ratelimited, see net_ratelimit() in
ip_handle_martian_source().

This might suggest that are more sources of printk messages on the
system. Or that you modified the ratelimit setting to print too many
messages.

> switching to a deferred printk won't
> do the trick, probably, because you gonna console_unlock() from IRQ in
> the worst case anyway.

Yup.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-05-25  9:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAD6qDH4cw1BSHpHoqzFGYkcxBJcFy=AjjoYuY1-a+HAwgDDY1g@mail.gmail.com>
2017-05-18  7:31 ` Watchdog detected hard lockup Sergey Senozhatsky
     [not found]   ` <CAD6qDH6e7HChVg0N8myc5g82ovWAbsqXTD5+=bT6b8ADv8MVYA@mail.gmail.com>
     [not found]     ` <CAD6qDH6r76dv6pucTpOOmq_GCi=Yt9mW1yTOHSMLitHVf3CqpQ@mail.gmail.com>
2017-05-24  8:26       ` Sergey Senozhatsky
2017-05-24  8:39         ` Sergey Senozhatsky
     [not found]           ` <CAD6qDH6zyP535vm4VQ+R0JFRXupvXsxj2xUmDKcd3=iJAfY1Dg@mail.gmail.com>
     [not found]             ` <20170524125907.GA463@jagdpanzerIV.localdomain>
     [not found]               ` <CAD6qDH6+G6WYdP8-aOhLk0VJV1QX1yXLD-dwxLk=o3ZLmG_-JQ@mail.gmail.com>
     [not found]                 ` <20170525052952.GA386@jagdpanzerIV.localdomain>
     [not found]                   ` <CAD6qDH6C2umiP-UOB2nkwawQO+1JBw8Xc+a0NFRCOZHTjnqmtQ@mail.gmail.com>
2017-05-25  5:50                     ` Fwd: " Sergey Senozhatsky
2017-05-25  9:56                       ` Petr Mladek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).