From: Stefan Agner <stefan@agner.ch>
To: linux-amlogic@lists.infradead.org, linux-arm-kernel@lists.infradead.org
Cc: Neil Armstrong <narmstrong@baylibre.com>,
	Jerome Brunet <jbrunet@baylibre.com>,
	Kevin Hilman <khilman@baylibre.com>,
	Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Subject: Re: Random reboots on ODROID-N2+
Date: Tue, 22 Jun 2021 09:39:23 +0200
Message-ID: <a0ef23c932d1a7412d620237d9acde0f@agner.ch>
In-Reply-To: <40ca11f84b7cdbfb9ad2ddd480cb204a@agner.ch>

On 2021-05-17 11:14, Stefan Agner wrote:
> Hi,
> 
> We are currently testing a new release using Linux 5.10.33. Since then
> I've received several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
> 
> After running serial console on several instances, I was able to catch
> this stack trace:
> 
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390

<snip>

We see these crashes at a similar frequency with Linux 5.12:

[129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
[129988.642348] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642350] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642351] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
[129988.642352] pc : free_page_and_swap_cache+0x0/0x110
[129988.642352] lr : tlb_remove_table_rcu+0x30/0x60
[129988.642353] sp : ffff8000115bbdf0
[129988.642354] x29: ffff8000115bbdf0 x28: ffff800010103a18
[129988.642358] x27: 000000000000000a x26: ffff000000120000
[129988.642360] x25: ffff000000120000 x24: ffff8000115bbe90
[129988.642362] x23: ffff800011456680 x22: ffff0000e07df970
[129988.642365] x21: 0000000000000003 x20: 0000000000000001
[129988.642367] x19: ffff000005300000 x18: 0000000000000000
[129988.642369] x17: 0000000000000000 x16: 0000000000000000
[129988.642371] x15: 0000000000000000 x14: 0000000000000500
[129988.642373] x13: 0000000000000002 x12: 0000000000000000
[129988.642375] x11: ffff8000cf5e6000 x10: ffff000028212800
[129988.642377] x9 : 0000000000000001 x8 : 00000000fffff1b8
[129988.642379] x7 : 0000000000015f40 x6 : 0000000000000001
[129988.642381] x5 : ffff80001007cf4c x4 : 0000000000000007
[129988.642383] x3 : ffff0000e07e2e78 x2 : ffff000025a2bd00
[129988.642385] x1 : ffff800010208b60 x0 : fffffc00002e9a80
[129988.642387] Kernel panic - not syncing: Asynchronous SError Interrupt
[129988.642388] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642389] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642390] Call trace:
[129988.642391]  dump_backtrace+0x0/0x1a0
[129988.642392]  show_stack+0x18/0x70
[129988.642392]  dump_stack+0xd0/0x12c
[129988.642393]  panic+0x170/0x338
[129988.642394]  nmi_panic+0x8c/0x90
[129988.642395]  arm64_serror_panic+0x78/0x84
[129988.642395]  do_serror+0x38/0xa0
[129988.642396]  el1_error+0x80/0xf8
[129988.642397]  free_page_and_swap_cache+0x0/0x110
[129988.642398]  rcu_core+0x310/0x5d0
[129988.642398]  rcu_core_si+0x10/0x20
[129988.642399]  _stext+0x128/0x28c
[129988.642400]  irq_exit+0xd8/0x100
[129988.642401]  __handle_domain_irq+0x68/0xc0
[129988.642401]  gic_handle_irq+0xa8/0xe0
[129988.642402]  el1_irq+0xbc/0x180
[129988.642403]  arch_cpu_idle+0x18/0x30
[129988.642404]  default_idle_call+0x20/0x68
[129988.642404]  do_idle+0x218/0x270
[129988.642405]  cpu_startup_entry+0x24/0x70
[129988.642406]  secondary_start_kernel+0x178/0x190
[129988.642418] SMP: stopping secondary CPUs
[129988.642419] Kernel Offset: disabled
[129988.642420] CPU features: 0x00240002,61082004
[129988.642421] Memory Limit: none
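
For reference, the "code 0xbf000000" value in the first line of the trace is the ESR_ELx syndrome. Below is a minimal sketch of splitting it into its architectural fields, assuming the standard Armv8-A layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for SError the IDS flag in ISS bit 24). The function name `decode_esr` is made up for illustration and is not a kernel or library API.

```python
def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into its top-level fields (illustrative helper)."""
    ec = (esr >> 26) & 0x3F      # exception class, bits 31:26
    il = (esr >> 25) & 0x1       # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits 24:0
    ids = (iss >> 24) & 0x1      # SError only: 1 = implementation defined syndrome
    return {
        "ec": ec,
        "il": il,
        "iss": iss,
        "ids": ids,
        "impdef": iss & 0xFFFFFF,  # remaining 24 bits, meaning depends on ids
        "is_serror": ec == 0x2F,   # EC 0x2F is the SError interrupt class
    }


fields = decode_esr(0xBF000000)
for name, value in fields.items():
    print(f"{name}: {value:#x}" if isinstance(value, int) and not isinstance(value, bool) else f"{name}: {value}")
```

For the value in the trace this gives EC 0x2f with IDS set and all remaining syndrome bits zero, i.e. the syndrome format is implementation defined and the code by itself does not say what caused the error.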

It seems to be load and/or hardware dependent: on some devices we see it
quite frequently (every few days), while on others it takes multiple weeks.
Of course the ones where we see it frequently are the ones in production :).

I am currently trying various stress-ng runs and other loads to increase
the crash rate before attempting a git bisect.
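
To compare crashes across devices and kernel versions, the captured serial logs can be reduced to a short signature. This is only a rough sketch: the helper name `crash_signature` is hypothetical, the regular expressions are written against the log lines quoted above, and they assume an untainted kernel ("Not tainted"), so they may need adjusting for other log formats.

```python
import re

# Patterns written against the panic output quoted in this thread (assumption:
# other kernels or taint states may format these lines differently).
SERROR_RE = re.compile(r"SError Interrupt on CPU(\d+), code (0x[0-9a-fA-F]+)")
COMM_RE = re.compile(r"Comm: (\S+) Not tainted (\S+)")
PC_RE = re.compile(r"pc : (\S+)\+0x")


def crash_signature(log: str):
    """Return a dict describing the first SError panic in `log`, or None."""
    serror = SERROR_RE.search(log)
    if serror is None:
        return None
    comm = COMM_RE.search(log)
    pc = PC_RE.search(log)
    return {
        "cpu": int(serror.group(1)),
        "code": serror.group(2),
        "comm": comm.group(1) if comm else None,
        "kernel": comm.group(2) if comm else None,
        "pc": pc.group(1) if pc else None,
    }


SAMPLE = """\
[129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
[129988.642348] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642352] pc : free_page_and_swap_cache+0x0/0x110
"""

print(crash_signature(SAMPLE))
```

Since an SError is asynchronous, the pc symbol is not necessarily where the fault originated; it is still useful for spotting whether crashes cluster around particular tasks or CPUs.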

--
Stefan

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
