linux-arm-kernel.lists.infradead.org archive mirror
* Random reboots on ODROID-N2+
@ 2021-05-17  9:14 Stefan Agner
  2021-05-17 21:09 ` Martin Blumenstingl
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Stefan Agner @ 2021-05-17  9:14 UTC (permalink / raw)
  To: linux-amlogic, linux-arm-kernel
  Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl

Hi,

We are currently testing a new release using Linux 5.10.33. Since then,
I've received several reports of random reboots every couple of days.
Unfortunately the log (journald) doesn't show anything, just a hard cut
at some point.

After attaching a serial console to several instances, I was able to
catch this stack trace:

[202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
[202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
[202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
[202983.988160] sp : ffff8000132a3ae0
[202983.988160] x29: ffff8000132a3ae0 x28: ffff8000132a3bf0
[202983.988164] x27: 00000000fb0000e0 x26: ffff8000132a3d58
[202983.988165] x25: 0000000000000073 x24: ffff000007963e24
[202983.988167] x23: ffff8000132a3bf0 x22: ffff000005a72a80
[202983.988169] x21: 0000000000000011 x20: 0000000000000073
[202983.988170] x19: ffff000001a92c00 x18: 0000000000000001
[202983.988172] x17: 0000000000000000 x16: 0000000000000000
[202983.988173] x15: ffff8000132a3460 x14: 00000000ac1e2001
[202983.988175] x13: ffff0000079181a0 x12: 0000000000000028
[202983.988176] x11: ffff8000d3407000 x10: ffff800010ea8af0
[202983.988178] x9 : 000000000000001b x8 : ffff000007963e00
[202983.988179] x7 : ffff000000000000 x6 : 0000046a76b5fe28
[202983.988181] x5 : 0000000000941cc2 x4 : 0000000000000000
[202983.988182] x3 : 0000000000000001 x2 : ffff8000d3407000
[202983.988184] x1 : ffff00002f6e0000 x0 : 0000000100000001
[202983.988186] Kernel panic - not syncing: Asynchronous SError Interrupt
[202983.988187] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988188] Call trace:
[202983.988188]  dump_backtrace+0x0/0x1a0
[202983.988189]  show_stack+0x18/0x70
[202983.988190]  dump_stack+0xd0/0x12c
[202983.988190]  panic+0x170/0x338
[202983.988191]  nmi_panic+0x8c/0x90
[202983.988191]  arm64_serror_panic+0x78/0x84
[202983.988192]  do_serror+0x38/0xa0
[202983.988193]  el1_error+0x88/0x108
[202983.988193]  udp_send_skb.isra.0+0x178/0x390
[202983.988194]  udp_sendmsg+0x7c8/0x9c0
[202983.988194]  inet_sendmsg+0x44/0x70
[202983.988195]  sock_sendmsg+0x4c/0x60
[202983.988196]  __sys_sendto+0xd0/0x140
[202983.988196]  __arm64_sys_sendto+0x28/0x40
[202983.988197]  el0_svc_common.constprop.0+0x78/0x1a0
[202983.988197]  do_el0_svc+0x24/0x90
[202983.988198]  el0_svc+0x14/0x20
[202983.988199]  el0_sync_handler+0xb0/0xc0
[202983.988199]  el0_sync+0x178/0x180
[202983.988211] SMP: stopping secondary CPUs
[202983.988212] Kernel Offset: disabled
[202983.988212] CPU features: 0x0240002,61082004
[202983.988213] Memory Limit: none
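
For reference, the "code" value in the first line of the trace is the
ESR_EL1 syndrome register as printed by the arm64 SError handler. A
minimal Python sketch for decoding it, assuming the architected field
layout (EC in bits 31:26, IL in bit 25 and, for SError, the IDS flag in
bit 24). decode_esr is a hypothetical helper name, not a kernel function:

```python
def decode_esr(esr: int) -> dict:
    """Split an arm64 ESR_EL1 value into its top-level fields."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, 0x2f = SError interrupt
        "il": (esr >> 25) & 0x1,    # instruction length flag
        "ids": (esr >> 24) & 0x1,   # SError only: implementation defined syndrome
        "iss_low": esr & 0xFFFFFF,  # remaining syndrome bits [23:0]
    }


# Value taken from "SError Interrupt on CPU3, code 0xbf000000"
fields = decode_esr(0xBF000000)
print({name: hex(value) for name, value in fields.items()})
```

For 0xbf000000 this gives EC 0x2f (SError interrupt) with IDS set and the
remaining bits zero, so the syndrome is implementation defined and carries
no architected detail about the cause.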

Has anyone observed such an issue? I am pretty sure that this is a new
issue, as we have many installations running Linux 5.9.16 stably on the
same hardware.

Now that I can tell that it is network related I'll try to increase
network load to see if I can find a quicker way to reproduce this.

--
Stefan

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: Random reboots on ODROID-N2+
  2021-05-17  9:14 Random reboots on ODROID-N2+ Stefan Agner
@ 2021-05-17 21:09 ` Martin Blumenstingl
  2021-05-18  9:16   ` Stefan Agner
  2021-05-18  1:33 ` Andrew Lunn
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Martin Blumenstingl @ 2021-05-17 21:09 UTC (permalink / raw)
  To: Stefan Agner
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman

Hi Stefan,

On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote:
>
> Hi,
>
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
I'm sorry to hear that some things are not working right

[...]
> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988188] Call trace:
> [202983.988188]  dump_backtrace+0x0/0x1a0
> [202983.988189]  show_stack+0x18/0x70
> [202983.988190]  dump_stack+0xd0/0x12c
> [202983.988190]  panic+0x170/0x338
> [202983.988191]  nmi_panic+0x8c/0x90
> [202983.988191]  arm64_serror_panic+0x78/0x84
> [202983.988192]  do_serror+0x38/0xa0
> [202983.988193]  el1_error+0x88/0x108
> [202983.988193]  udp_send_skb.isra.0+0x178/0x390
> [202983.988194]  udp_sendmsg+0x7c8/0x9c0
> [202983.988194]  inet_sendmsg+0x44/0x70
> [202983.988195]  sock_sendmsg+0x4c/0x60
> [202983.988196]  __sys_sendto+0xd0/0x140
> [202983.988196]  __arm64_sys_sendto+0x28/0x40
> [202983.988197]  el0_svc_common.constprop.0+0x78/0x1a0
> [202983.988197]  do_el0_svc+0x24/0x90
> [202983.988198]  el0_svc+0x14/0x20
> [202983.988199]  el0_sync_handler+0xb0/0xc0
> [202983.988199]  el0_sync+0x178/0x180
> [202983.988211] SMP: stopping secondary CPUs
> [202983.988212] Kernel Offset: disabled
> [202983.988212] CPU features: 0x0240002,61082004
> [202983.988213] Memory Limit: none
that looks weird

> Anyone observed such an issue? I am pretty sure that this is a new issue
> as we have many installations using Linux 5.9.16 running stable on the
> same hardware,.
I haven't, but I am currently trying to hunt down a (probably
unrelated) Ethernet issue on an older Meson8m2 SoC.
All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity, plus
there's a little bit of "glue" IP for the xMII connecting to the SoC's
IO pads.

I think it's a good idea to involve the netdev and (probably even more
important) stmmac maintainers.
Anything skb related is handled by the stmmac driver.
So I am hoping that someone with expertise in that area can give any
hints for debugging or reproducing this.


Best regards,
Martin


* Re: Random reboots on ODROID-N2+
  2021-05-17  9:14 Random reboots on ODROID-N2+ Stefan Agner
  2021-05-17 21:09 ` Martin Blumenstingl
@ 2021-05-18  1:33 ` Andrew Lunn
  2021-05-18 10:15   ` Stefan Agner
  2021-05-19 20:09 ` Stefan Agner
  2021-06-22  7:39 ` Stefan Agner
  3 siblings, 1 reply; 17+ messages in thread
From: Andrew Lunn @ 2021-05-18  1:33 UTC (permalink / raw)
  To: Stefan Agner
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl

On Mon, May 17, 2021 at 11:14:18AM +0200, Stefan Agner wrote:
> Hi,
> 
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
> 
> After running serial console on several instances, I was able to catch
> this stack trace:
> 
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390

Hi Stefan

Could you generate net/ipv4/udp.lst so we can see what
udp_send_skb.isra.0+0x178/0x390 is trying to do, and what bit of C
code it maps to?

     Andrew


* Re: Random reboots on ODROID-N2+
  2021-05-17 21:09 ` Martin Blumenstingl
@ 2021-05-18  9:16   ` Stefan Agner
  2021-05-18  9:35     ` Neil Armstrong
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-05-18  9:16 UTC (permalink / raw)
  To: Martin Blumenstingl
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman

Hi Martin,

On 2021-05-17 23:09, Martin Blumenstingl wrote:
> Hi Stefan,
> 
> On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote:
>>
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
> I'm sorry to hear that some things are not working right
> 
> [...]
>> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988188] Call trace:
>> [202983.988188]  dump_backtrace+0x0/0x1a0
>> [202983.988189]  show_stack+0x18/0x70
>> [202983.988190]  dump_stack+0xd0/0x12c
>> [202983.988190]  panic+0x170/0x338
>> [202983.988191]  nmi_panic+0x8c/0x90
>> [202983.988191]  arm64_serror_panic+0x78/0x84
>> [202983.988192]  do_serror+0x38/0xa0
>> [202983.988193]  el1_error+0x88/0x108
>> [202983.988193]  udp_send_skb.isra.0+0x178/0x390
>> [202983.988194]  udp_sendmsg+0x7c8/0x9c0
>> [202983.988194]  inet_sendmsg+0x44/0x70
>> [202983.988195]  sock_sendmsg+0x4c/0x60
>> [202983.988196]  __sys_sendto+0xd0/0x140
>> [202983.988196]  __arm64_sys_sendto+0x28/0x40
>> [202983.988197]  el0_svc_common.constprop.0+0x78/0x1a0
>> [202983.988197]  do_el0_svc+0x24/0x90
>> [202983.988198]  el0_svc+0x14/0x20
>> [202983.988199]  el0_sync_handler+0xb0/0xc0
>> [202983.988199]  el0_sync+0x178/0x180
>> [202983.988211] SMP: stopping secondary CPUs
>> [202983.988212] Kernel Offset: disabled
>> [202983.988212] CPU features: 0x0240002,61082004
>> [202983.988213] Memory Limit: none
> that looks weird
> 
>> Anyone observed such an issue? I am pretty sure that this is a new issue
>> as we have many installations using Linux 5.9.16 running stable on the
>> same hardware,.
> I haven't but I am currently trying to hunt down a (probably
> unrelated) Ethernet issue on an older Meson8m2 SoC currently.
> All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity plus
> there's a little bit of "glue" IP for the xMII connecting to the SoC's
> IO pads
> 
> I think it's a good idea to involve the netdev and (probably even more
> important) stmmac maintainers.
> Anything skb related is handled by the stmmac driver.
> So I am hoping that someone with expertise in that area can give any
> hints for debugging or reproducing this.

OK, I'll do that. I am currently waiting to see the same trace a second
time, just to make sure it's really always caused by that code path.

--
Stefan


* Re: Random reboots on ODROID-N2+
  2021-05-18  9:16   ` Stefan Agner
@ 2021-05-18  9:35     ` Neil Armstrong
  0 siblings, 0 replies; 17+ messages in thread
From: Neil Armstrong @ 2021-05-18  9:35 UTC (permalink / raw)
  To: Stefan Agner, Martin Blumenstingl
  Cc: linux-amlogic, linux-arm-kernel, Jerome Brunet, Kevin Hilman

Hi Stefan,

On 18/05/2021 11:16, Stefan Agner wrote:
> Hi Martin,
> 
> On 2021-05-17 23:09, Martin Blumenstingl wrote:
>> Hi Stefan,
>>
>> On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote:
>>>
>>> Hi,
>>>
>>> We are currently testing a new release using Linux 5.10.33. I've
>>> received since several reports of random reboots every couple of days.
>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>> at some point.
>> I'm sorry to hear that some things are not working right
>>
>> [...]
>>> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>> [202983.988188] Call trace:
>>> [202983.988188]  dump_backtrace+0x0/0x1a0
>>> [202983.988189]  show_stack+0x18/0x70
>>> [202983.988190]  dump_stack+0xd0/0x12c
>>> [202983.988190]  panic+0x170/0x338
>>> [202983.988191]  nmi_panic+0x8c/0x90
>>> [202983.988191]  arm64_serror_panic+0x78/0x84
>>> [202983.988192]  do_serror+0x38/0xa0
>>> [202983.988193]  el1_error+0x88/0x108
>>> [202983.988193]  udp_send_skb.isra.0+0x178/0x390
>>> [202983.988194]  udp_sendmsg+0x7c8/0x9c0
>>> [202983.988194]  inet_sendmsg+0x44/0x70
>>> [202983.988195]  sock_sendmsg+0x4c/0x60
>>> [202983.988196]  __sys_sendto+0xd0/0x140
>>> [202983.988196]  __arm64_sys_sendto+0x28/0x40
>>> [202983.988197]  el0_svc_common.constprop.0+0x78/0x1a0
>>> [202983.988197]  do_el0_svc+0x24/0x90
>>> [202983.988198]  el0_svc+0x14/0x20
>>> [202983.988199]  el0_sync_handler+0xb0/0xc0
>>> [202983.988199]  el0_sync+0x178/0x180
>>> [202983.988211] SMP: stopping secondary CPUs
>>> [202983.988212] Kernel Offset: disabled
>>> [202983.988212] CPU features: 0x0240002,61082004
>>> [202983.988213] Memory Limit: none
>> that looks weird
>>
>>> Anyone observed such an issue? I am pretty sure that this is a new issue
>>> as we have many installations using Linux 5.9.16 running stable on the
>>> same hardware,.
>> I haven't but I am currently trying to hunt down a (probably
>> unrelated) Ethernet issue on an older Meson8m2 SoC currently.
>> All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity plus
>> there's a little bit of "glue" IP for the xMII connecting to the SoC's
>> IO pads
>>
>> I think it's a good idea to involve the netdev and (probably even more
>> important) stmmac maintainers.
>> Anything skb related is handled by the stmmac driver.
>> So I am hoping that someone with expertise in that area can give any
>> hints for debugging or reproducing this.
> 
> Ok I'll do that, I currently wait to see the same trace a second time,
> just to make sure its really caused by that code path always.

A good next step would be to bisect between the last known working
version and the currently failing one.

An SError Interrupt looks like a HW issue triggered by a change in v5.10.

Neil

> 
> --
> Stefan
> 



* Re: Random reboots on ODROID-N2+
  2021-05-18  1:33 ` Andrew Lunn
@ 2021-05-18 10:15   ` Stefan Agner
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Agner @ 2021-05-18 10:15 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl

On 2021-05-18 03:33, Andrew Lunn wrote:
> On Mon, May 17, 2021 at 11:14:18AM +0200, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>> #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
> 
> Hi Stefan

Hi Andrew,

> 
> Could you generate net/ipv4/udp.lst so we can see what
> udp_send_skb.isra.0+0x178/0x390 is trying to do, and what bit of C
> code it maps to.

OK, I built net/ipv4/udp.lst using the same build environment (buildroot)
that was used to build the kernel which generated the stack trace, so I
think this should add up:

ffff800010c1bb60 <udp_send_skb.isra.0>:
static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
...
                udp4_hwcsum(skb, fl4->saddr, fl4->daddr);
ffff800010c1bc78:       29450ae1        ldp     w1, w2, [x23, #40]
ffff800010c1bc7c:       aa1303e0        mov     x0, x19
ffff800010c1bc80:       94000000        bl      ffff800010c184b0 <udp4_hwcsum>
                        ffff800010c1bc80: R_AARCH64_CALL26      udp4_hwcsum
        err = ip_send_skb(sock_net(sk), skb);
ffff800010c1bc84:       f9401ac0        ldr     x0, [x22, #48]
ffff800010c1bc88:       aa1303e1        mov     x1, x19
ffff800010c1bc8c:       94000000        bl      0 <ip_send_skb>
                        ffff800010c1bc8c: R_AARCH64_CALL26      ip_send_skb
        if (err) {
ffff800010c1bc90:       350008e0        cbnz    w0, ffff800010c1bdac <udp_send_skb.isra.0+0x24c>
...
        u64 pc = READ_ONCE(ti->preempt_count);
ffff800010c1bcd4:       f9400820        ldr     x0, [x1, #16]
        WRITE_ONCE(ti->preempt.count, --pc);
ffff800010c1bcd8:       d1000400        sub     x0, x0, #0x1
ffff800010c1bcdc:       b9001020        str     w0, [x1, #16]
        return !pc || !READ_ONCE(ti->preempt_count);
...

The full udp.lst file:
https://drive.google.com/file/d/1j0RKOfuMXmCRWILpkG3uk_beohWrr-ho/view?usp=sharing

Since I only have this one trace, I am not 100% sure whether this trace
is just a random one or always the same.

But things seem to add up to me: mdns-repeater deals with UDP packets,
and it seems that the code tries to make use of HW checksumming
(judging from lr)? This would explain why only this platform shows the
problem.
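
The offsets in the trace can be cross-checked against the listing with
simple address arithmetic. A small Python sketch, assuming the listing
above belongs to the exact kernel binary that crashed (resolve and
SYM_BASE are names made up for this illustration):

```python
# Start address of udp_send_skb.isra.0, copied from the listing header.
SYM_BASE = 0xFFFF800010C1BB60


def resolve(offset: int, base: int = SYM_BASE) -> int:
    """Turn the 'symbol+0xOFF' form used in the trace into an address."""
    return base + offset


pc = resolve(0x178)  # pc : udp_send_skb.isra.0+0x178/0x390
lr = resolve(0x130)  # lr : udp_send_skb.isra.0+0x130/0x390

print(f"pc -> {pc:#x}")
print(f"lr -> {lr:#x}")
```

This puts pc at ffff800010c1bcd8, the sub instruction of the preempt
count decrement, and lr at ffff800010c1bc90, the cbnz right after the bl
to ip_send_skb. Going by the listing, lr is therefore the return address
of the ip_send_skb call rather than of udp4_hwcsum. Since an SError is
asynchronous, neither address necessarily points at the access that
triggered it.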

--
Stefan


* Re: Random reboots on ODROID-N2+
  2021-05-17  9:14 Random reboots on ODROID-N2+ Stefan Agner
  2021-05-17 21:09 ` Martin Blumenstingl
  2021-05-18  1:33 ` Andrew Lunn
@ 2021-05-19 20:09 ` Stefan Agner
  2021-06-22  7:39 ` Stefan Agner
  3 siblings, 0 replies; 17+ messages in thread
From: Stefan Agner @ 2021-05-19 20:09 UTC (permalink / raw)
  To: linux-amlogic, linux-arm-kernel
  Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl, andrew

On 2021-05-17 11:14, Stefan Agner wrote:
> Hi,
> 
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
> 
> After running serial console on several instances, I was able to catch
> this stack trace:
> 
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
> [202983.988160] sp : ffff8000132a3ae0
> [202983.988160] x29: ffff8000132a3ae0 x28: ffff8000132a3bf0
> [202983.988164] x27: 00000000fb0000e0 x26: ffff8000132a3d58
> [202983.988165] x25: 0000000000000073 x24: ffff000007963e24
> [202983.988167] x23: ffff8000132a3bf0 x22: ffff000005a72a80
> [202983.988169] x21: 0000000000000011 x20: 0000000000000073
> [202983.988170] x19: ffff000001a92c00 x18: 0000000000000001
> [202983.988172] x17: 0000000000000000 x16: 0000000000000000
> [202983.988173] x15: ffff8000132a3460 x14: 00000000ac1e2001
> [202983.988175] x13: ffff0000079181a0 x12: 0000000000000028
> [202983.988176] x11: ffff8000d3407000 x10: ffff800010ea8af0
> [202983.988178] x9 : 000000000000001b x8 : ffff000007963e00
> [202983.988179] x7 : ffff000000000000 x6 : 0000046a76b5fe28
> [202983.988181] x5 : 0000000000941cc2 x4 : 0000000000000000
> [202983.988182] x3 : 0000000000000001 x2 : ffff8000d3407000
> [202983.988184] x1 : ffff00002f6e0000 x0 : 0000000100000001
> [202983.988186] Kernel panic - not syncing: Asynchronous SError
> Interrupt
> [202983.988187] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988188] Call trace:
> [202983.988188]  dump_backtrace+0x0/0x1a0
> [202983.988189]  show_stack+0x18/0x70
> [202983.988190]  dump_stack+0xd0/0x12c
> [202983.988190]  panic+0x170/0x338
> [202983.988191]  nmi_panic+0x8c/0x90
> [202983.988191]  arm64_serror_panic+0x78/0x84
> [202983.988192]  do_serror+0x38/0xa0
> [202983.988193]  el1_error+0x88/0x108
> [202983.988193]  udp_send_skb.isra.0+0x178/0x390
> [202983.988194]  udp_sendmsg+0x7c8/0x9c0
> [202983.988194]  inet_sendmsg+0x44/0x70
> [202983.988195]  sock_sendmsg+0x4c/0x60
> [202983.988196]  __sys_sendto+0xd0/0x140
> [202983.988196]  __arm64_sys_sendto+0x28/0x40
> [202983.988197]  el0_svc_common.constprop.0+0x78/0x1a0
> [202983.988197]  do_el0_svc+0x24/0x90
> [202983.988198]  el0_svc+0x14/0x20
> [202983.988199]  el0_sync_handler+0xb0/0xc0
> [202983.988199]  el0_sync+0x178/0x180
> [202983.988211] SMP: stopping secondary CPUs
> [202983.988212] Kernel Offset: disabled
> [202983.988212] CPU features: 0x0240002,61082004
> [202983.988213] Memory Limit: none
> 

A second stack trace, same build etc. but different board (instance):

[48112.247242] SError Interrupt on CPU5, code 0xbf000000 -- SError
[48112.247244] CPU: 5 PID: 264945 Comm: python3 Not tainted 5.10.33 #1
[48112.247245] Hardware name: Hardkernel ODROID-N2Plus (DT)
[48112.247246] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[48112.247247] pc : __rcu_read_lock+0x18/0x20
[48112.247248] lr : lock_page_memcg+0x28/0xd0
[48112.247249] sp : ffff800013e238e0
[48112.247249] x29: ffff800013e238e0 x28: ffff800013e23b18
[48112.247252] x27: ffff000055c5c780 x26: 0000ffff9163c000
[48112.247254] x25: ffff0000053000c0 x24: 00e00000d40e3bc3
[48112.247256] x23: fffffe00033038c0 x22: ffff800013e23a18
[48112.247257] x21: 0000ffff9163b000 x20: fffffe00033038c0
[48112.247259] x19: fffffe00033038c0 x18: 0000000000000000
[48112.247261] x17: 0000000000000000 x16: 0000000000000000
[48112.247262] x15: 0000000000000002 x14: 0000000000000001
[48112.247264] x13: fffffe0001acdd08 x12: 0000000000000000
[48112.247265] x11: ffff0000e4650100 x10: ffff00004c640000
[48112.247267] x9 : 000000000000000c x8 : 00000000ffffffff
[48112.247268] x7 : 0000000000000020 x6 : 0000000000000000
[48112.247270] x5 : 00000000000d40e3 x4 : 0000ffff9163b000
[48112.247271] x3 : 00000000ffffffff x2 : 0000000000000001
[48112.247273] x1 : ffff000000182ac0 x0 : 0000000000000001
[48112.247275] Kernel panic - not syncing: Asynchronous SError Interrupt
[48112.247275] CPU: 5 PID: 264945 Comm: python3 Not tainted 5.10.33 #1
[48112.247276] Hardware name: Hardkernel ODROID-N2Plus (DT)
[48112.247277] Call trace:
[48112.247277]  dump_backtrace+0x0/0x1a0
[48112.247278]  show_stack+0x18/0x70
[48112.247279]  dump_stack+0xd0/0x12c
[48112.247279]  panic+0x170/0x338
[48112.247280]  nmi_panic+0x8c/0x90
[48112.247280]  arm64_serror_panic+0x78/0x84
[48112.247281]  do_serror+0x38/0xa0
[48112.247281]  el1_error+0x88/0x108
[48112.247282]  __rcu_read_lock+0x18/0x20
[48112.247283]  page_remove_rmap+0x1c/0x560
[48112.247283]  unmap_page_range+0x5b0/0x7b0
[48112.247284]  unmap_single_vma+0x4c/0xb0
[48112.247285]  unmap_vmas+0x70/0xf0
[48112.247285]  exit_mmap+0xc8/0x180
[48112.247286]  mmput+0x7c/0x150
[48112.247286]  begin_new_exec+0x2d4/0xa90
[48112.247287]  load_elf_binary+0x38c/0x1800
[48112.247288]  bprm_execve+0x29c/0x5d0
[48112.247288]  do_execveat_common.isra.0+0x178/0x1d0
[48112.247289]  __arm64_sys_execve+0x40/0x60
[48112.247290]  el0_svc_common.constprop.0+0x78/0x1a0
[48112.247290]  do_el0_svc+0x24/0x90
[48112.247291]  el0_svc+0x14/0x20
[48112.247291]  el0_sync_handler+0xb0/0xc0
[48112.247292]  el0_sync+0x178/0x180
[48112.247303] SMP: stopping secondary CPUs
[48112.247304] Kernel Offset: disabled
[48112.247305] CPU features: 0x0240002,61082004
[48112.247305] Memory Limit: none

The stack trace does not look related to me...

--
Stefan


* Re: Random reboots on ODROID-N2+
  2021-05-17  9:14 Random reboots on ODROID-N2+ Stefan Agner
                   ` (2 preceding siblings ...)
  2021-05-19 20:09 ` Stefan Agner
@ 2021-06-22  7:39 ` Stefan Agner
  2021-07-23 14:25   ` Byron Stanoszek
  3 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-06-22  7:39 UTC (permalink / raw)
  To: linux-amlogic, linux-arm-kernel
  Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl

On 2021-05-17 11:14, Stefan Agner wrote:
> Hi,
> 
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
> 
> After running serial console on several instances, I was able to catch
> this stack trace:
> 
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390

<snip>

We do see those crashes at a similar frequency with Linux 5.12:

[129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
[129988.642348] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642350] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642351] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
[129988.642352] pc : free_page_and_swap_cache+0x0/0x110
[129988.642352] lr : tlb_remove_table_rcu+0x30/0x60
[129988.642353] sp : ffff8000115bbdf0
[129988.642354] x29: ffff8000115bbdf0 x28: ffff800010103a18
[129988.642358] x27: 000000000000000a x26: ffff000000120000
[129988.642360] x25: ffff000000120000 x24: ffff8000115bbe90
[129988.642362] x23: ffff800011456680 x22: ffff0000e07df970
[129988.642365] x21: 0000000000000003 x20: 0000000000000001
[129988.642367] x19: ffff000005300000 x18: 0000000000000000
[129988.642369] x17: 0000000000000000 x16: 0000000000000000
[129988.642371] x15: 0000000000000000 x14: 0000000000000500
[129988.642373] x13: 0000000000000002 x12: 0000000000000000
[129988.642375] x11: ffff8000cf5e6000 x10: ffff000028212800
[129988.642377] x9 : 0000000000000001 x8 : 00000000fffff1b8
[129988.642379] x7 : 0000000000015f40 x6 : 0000000000000001
[129988.642381] x5 : ffff80001007cf4c x4 : 0000000000000007
[129988.642383] x3 : ffff0000e07e2e78 x2 : ffff000025a2bd00
[129988.642385] x1 : ffff800010208b60 x0 : fffffc00002e9a80
[129988.642387] Kernel panic - not syncing: Asynchronous SError Interrupt
[129988.642388] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642389] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642390] Call trace:
[129988.642391]  dump_backtrace+0x0/0x1a0
[129988.642392]  show_stack+0x18/0x70
[129988.642392]  dump_stack+0xd0/0x12c
[129988.642393]  panic+0x170/0x338
[129988.642394]  nmi_panic+0x8c/0x90
[129988.642395]  arm64_serror_panic+0x78/0x84
[129988.642395]  do_serror+0x38/0xa0
[129988.642396]  el1_error+0x80/0xf8
[129988.642397]  free_page_and_swap_cache+0x0/0x110
[129988.642398]  rcu_core+0x310/0x5d0
[129988.642398]  rcu_core_si+0x10/0x20
[129988.642399]  _stext+0x128/0x28c
[129988.642400]  irq_exit+0xd8/0x100
[129988.642401]  __handle_domain_irq+0x68/0xc0
[129988.642401]  gic_handle_irq+0xa8/0xe0
[129988.642402]  el1_irq+0xbc/0x180
[129988.642403]  arch_cpu_idle+0x18/0x30
[129988.642404]  default_idle_call+0x20/0x68
[129988.642404]  do_idle+0x218/0x270
[129988.642405]  cpu_startup_entry+0x24/0x70
[129988.642406]  secondary_start_kernel+0x178/0x190
[129988.642418] SMP: stopping secondary CPUs
[129988.642419] Kernel Offset: disabled
[129988.642420] CPU features: 0x00240002,61082004
[129988.642421] Memory Limit: none

It seems load and/or hardware dependent, since we see it on some devices
quite frequently (every few days), while on others it takes multiple
weeks. Of course the ones where we see it frequently are the ones in
production :).

I am currently trying different stress-ng and other loads to increase
the crash rate before trying to git bisect it.

--
Stefan


* Re: Random reboots on ODROID-N2+
  2021-06-22  7:39 ` Stefan Agner
@ 2021-07-23 14:25   ` Byron Stanoszek
  2021-07-23 15:36     ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Byron Stanoszek @ 2021-07-23 14:25 UTC (permalink / raw)
  To: Stefan Agner
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On Tue, 22 Jun 2021, Stefan Agner wrote:

> On 2021-05-17 11:14, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>> #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> <snip>
>
> We do see those crashes in similar frequency with Linux 5.12:
>
> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>
> It seems load and/or hardware dependent since we see it on some devices
> quite frequent (every few days), and on others it takes multiple weeks.
> Of course the once we see it frequently are the ones in production :).
>
> I am currently trying different stress-ng and other load to accelerate
> the crash rate before then trying to git bisect it.

I have an Odroid-N2+ and was able to track this problem down. The problem is
related to the following dmesg line that reads "failed to reserve memory"
below:

Machine model: Hardkernel ODROID-N2Plus
memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
OF: reserved mem: node linux,cma compatible matching fail
memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
...

A subsequent "cat /proc/iomem" shows that this memory region is still reserved
and the system appears to operate normally, until eventually the SError
Interrupt comes in under heavy memory/page-cache usage. The difference with
later kernels is that now the memory at 0x5000000-0x52fffff is registered under
the "System RAM" memory area, whereas previous kernels had dropped it from
"System RAM".

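To make the /proc/iomem check above concrete, here is a minimal sketch of testing whether one line of that file describes a "System RAM" range covering a given physical address. The helper name iomem_line_covers_ram() is hypothetical, and the sample lines in the note below are illustrative, not captured from a real board. Reading real addresses from /proc/iomem typically requires root.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one /proc/iomem line of the form
 * "start-end : name" (hex addresses, optional leading indentation)
 * and report whether it is a "System RAM" entry containing addr.
 */
static bool iomem_line_covers_ram(const char *line, unsigned long long addr)
{
    unsigned long long start, end;
    char name[64];

    /* Skip leading spaces, read "start-end : name" up to end of line. */
    if (sscanf(line, " %llx-%llx : %63[^\n]", &start, &end, name) != 3)
        return false;

    return strcmp(name, "System RAM") == 0 && addr >= start && addr <= end;
}
```

On a kernel where the no-map marking took effect, one would expect no "System RAM" line to cover 0x5000000; on an affected kernel, a line such as "00000000-e4bfffff : System RAM" (illustrative) would cover it.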
The culprit is this new code introduced in Linux 5.12, in this function in
drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
                                         phys_addr_t size, bool nomap)
{
         if (nomap) {
                 /*
                  * If the memory is already reserved (by another region), we
                  * should not allow it to be marked nomap.
                  */
                 if (memblock_is_region_reserved(base, size))  <------
                         return -EBUSY;                        <------

                 return memblock_mark_nomap(base, size);
         }
         return memblock_reserve(base, size);
}

"nomap" is true, due to this text being present in the FDT:

    reserved-memory {
      ranges;
      secmon_reserved: secmon@5000000 {
        reg = <0x0 0x05000000 0x0 0x300000>;
        no-map;
      };
      ...

But memblock_is_region_reserved() is returning true because there is already an
entry for 0x5000000-0x52fffff in the memory map, which is already marked
reserved, at the time the __reserved_mem_reserve_reg() function is called.
(Perhaps this is being set reserved by u-boot? -- I did not research that far.)

This function is defined as:

bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
         return memblock_overlaps_region(&memblock.reserved, base, size);
}

Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
reserved region "0x5000000-0x52fffff", the function returns true.
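
The overlap test described above can be modelled in userspace as follows. This is only a sketch of the interval logic, not the kernel's memblock code; struct region and both function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region: [base, base + size). */
struct region {
    uint64_t base;
    uint64_t size;
};

/* Two half-open ranges overlap when each one starts before the other ends. */
static bool regions_overlap(struct region a, struct region b)
{
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

/* Model of "is any part of [base, base + size) already reserved?". */
static bool is_region_reserved(const struct region *reserved, int count,
                               uint64_t base, uint64_t size)
{
    struct region query = { base, size };

    for (int i = 0; i < count; i++)
        if (regions_overlap(reserved[i], query))
            return true;
    return false;
}
```

With the two reservations from the dmesg output above (0x8210000-0x8baffff and 0x5000000-0x52fffff), a query for base 0x5000000, size 0x300000 matches the second entry exactly, and an exact match is just a special case of overlap.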

If I comment out the "if (memblock_is_region_reserved(base, size))" code and
allow it to mark the region no-map, then the memory area is properly removed
from the "System RAM" area and the crashing stops.

I've had the system up and running for 15 days now under heavy load without any
crashes, using just the following patch as workaround:


--- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
  					phys_addr_t size, bool nomap)
  {
  	if (nomap) {
-		/*
-		 * If the memory is already reserved (by another region), we
-		 * should not allow it to be marked nomap.
-		 */
-		if (memblock_is_region_reserved(base, size))
-			return -EBUSY;
-
  		return memblock_mark_nomap(base, size);
  	}
  	return memblock_reserve(base, size);


The above patch applies to later versions of Linux 5.10.x through 5.12.x as
well.

Perhaps a more proper fix is to allow the no-map to still proceed, in the case
that the existing reserved region is identical (same start/end) to the region
getting marked no-map.
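
As a rough userspace sketch of that idea (not a tested kernel patch; struct resv_range and nomap_allowed() are hypothetical names), the check would reject only overlaps with a different reservation and tolerate an identical one:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one reserved range: [base, base + size). */
struct resv_range {
    uint64_t base;
    uint64_t size;
};

/*
 * Decide whether [base, base + size) may be marked no-map.
 * Returns 0 if allowed, -EBUSY if it collides with a different
 * reserved range. An identical reservation is tolerated.
 */
static int nomap_allowed(const struct resv_range *reserved, int count,
                         uint64_t base, uint64_t size)
{
    for (int i = 0; i < count; i++) {
        const struct resv_range *r = &reserved[i];
        bool overlap = r->base < base + size && base < r->base + r->size;

        if (!overlap)
            continue;
        if (r->base == base && r->size == size)
            continue;   /* same start and end: let no-map proceed */
        return -EBUSY;  /* overlaps some other reservation */
    }
    return 0;
}
```

In this model the secmon range (base 0x5000000, size 0x300000) passes when the only colliding reservation is the identical one, while a request that merely intersects it still gets -EBUSY.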

  -Byron


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: Random reboots on ODROID-N2+
  2021-07-23 14:25   ` Byron Stanoszek
@ 2021-07-23 15:36     ` Robin Murphy
  2021-07-23 15:56       ` Stefan Agner
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2021-07-23 15:36 UTC (permalink / raw)
  To: Byron Stanoszek, Stefan Agner
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 15:25, Byron Stanoszek wrote:
> On Tue, 22 Jun 2021, Stefan Agner wrote:
> 
>> On 2021-05-17 11:14, Stefan Agner wrote:
>>> Hi,
>>>
>>> We are currently testing a new release using Linux 5.10.33. I've
>>> received since several reports of random reboots every couple of days.
>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>> at some point.
>>>
>>> After running serial console on several instances, I was able to catch
>>> this stack trace:
>>>
>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>>> #1
>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>
>> <snip>
>>
>> We do see those crashes in similar frequency with Linux 5.12:
>>
>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>
>> It seems load and/or hardware dependent since we see it on some devices
>> quite frequently (every few days), and on others it takes multiple weeks.
>> Of course the ones where we see it frequently are the ones in production :).
>>
>> I am currently trying different stress-ng and other load to accelerate
>> the crash rate before then trying to git bisect it.
> 
> I have an Odroid-N2+ and was able to track this problem down. The 
> problem is
> related to the following dmesg line that reads "failed to reserve memory"
> below:
> 
> Machine model: Hardkernel ODROID-N2Plus
> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
> OF: reserved mem: node linux,cma compatible matching fail
> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
> ...
> 
> A subsequent "cat /proc/iomem" shows that this memory region is still 
> reserved
> and the system appears to operate normally, until eventually the SError
> Interrupt comes in under heavy memory/page-cache usage. The difference with
> later kernels is that now the memory at 0x5000000-0x52fffff is 
> registered under
> the "System RAM" memory area, whereas previous kernels had dropped it from
> "System RAM".
> 
> The culprit is this new code introduced in Linux 5.12, in this function in
> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
> 
> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>                                          phys_addr_t size, bool nomap)
> {
>          if (nomap) {
>                  /*
>                   * If the memory is already reserved (by another region), we
>                   * should not allow it to be marked nomap.
>                   */
>                  if (memblock_is_region_reserved(base, size))  <------
>                          return -EBUSY;                        <------
> 
>                  return memblock_mark_nomap(base, size);
>          }
>          return memblock_reserve(base, size);
> }
> 
> "nomap" is true, due to this text being present in the FDT:
> 
>     reserved-memory {
>       ranges;
>       secmon_reserved: secmon@5000000 {
>         reg = <0x0 0x05000000 0x0 0x300000>;
>         no-map;
>       };
>       ...
> 
> But memblock_is_region_reserved() is returning true because there is 
> already an
> entry for 0x5000000-0x52fffff in the memory map, which is already marked
> reserved, at the time the __reserved_mem_reserve_reg() function is called.
> (Perhaps this is being set reserved by u-boot? -- I did not research 
> that far.)
> 
> This function is defined as:
> 
> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
> {
>          return memblock_overlaps_region(&memblock.reserved, base, size);
> }
> 
> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the 
> existing
> reserved region "0x5000000-0x52fffff", the function returns true.
> 
> If I comment out the "if (memblock_is_region_reserved(base, size))" code 
> and
> allow it to mark the region no-map, then the memory area is properly 
> removed
> from the "System RAM" area and the crashing stops.
> 
> I've had the system up and running for 15 days now under heavy load 
> without any
> crashes, using just the following patch as workaround:
> 
> 
> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
> @@ -1157,13 +1157,6 @@
>                       phys_addr_t size, bool nomap)
>   {
>       if (nomap) {
> -        /*
> -         * If the memory is already reserved (by another region), we
> -         * should not allow it to be marked nomap.
> -         */
> -        if (memblock_is_region_reserved(base, size))
> -            return -EBUSY;
> -
>           return memblock_mark_nomap(base, size);
>       }
>       return memblock_reserve(base, size);
> 
> 
> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
> well.
> 
> Perhaps a more proper fix is to allow the no-map to still proceed, in 
> the case
> that the existing reserved region is identical (same start/end) to the 
> region
> getting marked no-map.

If U-Boot is marking regions with the wrong type/attributes in the EFI 
memory map, then the best thing to do would be to fix that. I see a 
fairly recent commit which looks suspiciously relevant:

https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

Booting with "efi=debug" should (among other things) print the memory 
map at boot if you want to double-check that that is the source of the 
mismatch. Our EFI code should be perfectly capable of setting the 
memblock flag if the region *is* described appropriately, see 
reserve_regions() in drivers/firmware/efi/efi-init.c.

Robin.


* Re: Random reboots on ODROID-N2+
  2021-07-23 15:36     ` Robin Murphy
@ 2021-07-23 15:56       ` Stefan Agner
  2021-07-23 16:14         ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-07-23 15:56 UTC (permalink / raw)
  To: Robin Murphy, Byron Stanoszek
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

Hi Byron, Hi Robin,

Very interesting findings!

On 2021-07-23 17:36, Robin Murphy wrote:
> On 2021-07-23 15:25, Byron Stanoszek wrote:
>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>
>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>> Hi,
>>>>
>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>> received since several reports of random reboots every couple of days.
>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>> at some point.
>>>>
>>>> After running serial console on several instances, I was able to catch
>>>> this stack trace:
>>>>
>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>>>> #1
>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>
>>> <snip>
>>>
>>> We do see those crashes in similar frequency with Linux 5.12:
>>>
>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>
>>> It seems load and/or hardware dependent since we see it on some devices
>>> quite frequently (every few days), and on others it takes multiple weeks.
>>> Of course the ones where we see it frequently are the ones in production :).
>>>
>>> I am currently trying different stress-ng and other load to accelerate
>>> the crash rate before then trying to git bisect it.
>>
>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>> related to the following dmesg line that reads "failed to reserve memory"
>> below:
>>
>> Machine model: Hardkernel ODROID-N2Plus
>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

In my 5.9 builds that line isn't present, and it seems all logs I stored
from 5.10 builds have the change already and show this line.

>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>> OF: reserved mem: node linux,cma compatible matching fail
>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>> ...
>>
>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>> and the system appears to operate normally, until eventually the SError
>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>> the "System RAM" memory area, whereas previous kernels had dropped it from
>> "System RAM".
>>
>> The culprit is this new code introduced in Linux 5.12, in this function in
>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

It seems that patch was also backported, which is why I see it with
5.10 as well.

>>
>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>                                          phys_addr_t size, bool nomap)
>> {
>>          if (nomap) {
>>                  /*
>>                   * If the memory is already reserved (by another region), we
>>                   * should not allow it to be marked nomap.
>>                   */
>>                  if (memblock_is_region_reserved(base, size))  <------
>>                          return -EBUSY;                        <------
>>
>>                  return memblock_mark_nomap(base, size);
>>          }
>>          return memblock_reserve(base, size);
>> }
>>
>> "nomap" is true, due to this text being present in the FDT:
>>
>>     reserved-memory {
>>       ranges;
>>       secmon_reserved: secmon@5000000 {
>>         reg = <0x0 0x05000000 0x0 0x300000>;
>>         no-map;
>>       };
>>       ...
>>
>> But memblock_is_region_reserved() is returning true because there is already an
>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>
>> This function is defined as:
>>
>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>> {
>>          return memblock_overlaps_region(&memblock.reserved, base, size);
>> }
>>
>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>> reserved region "0x5000000-0x52fffff", the function returns true.
>>
>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>> allow it to mark the region no-map, then the memory area is properly removed
>> from the "System RAM" area and the crashing stops.
>>
>> I've had the system up and running for 15 days now under heavy load without any
>> crashes, using just the following patch as workaround:
>>
>>
>> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
>> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
>> @@ -1157,13 +1157,6 @@
>>                       phys_addr_t size, bool nomap)
>>   {
>>       if (nomap) {
>> -        /*
>> -         * If the memory is already reserved (by another region), we
>> -         * should not allow it to be marked nomap.
>> -         */
>> -        if (memblock_is_region_reserved(base, size))
>> -            return -EBUSY;
>> -
>>           return memblock_mark_nomap(base, size);
>>       }
>>       return memblock_reserve(base, size);
>>
>>
>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>> well.

Even though it is probably not the correct solution, I'll give this a try on
my end just to verify that we are indeed experiencing the same problem and
that the change fixes it for me too.

>>
>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>> that the existing reserved region is identical (same start/end) to the region
>> getting marked no-map.
> 
> If U-Boot is marking regions with the wrong type/attributes in the EFI
> memory map, then the best thing to do would be to fix that. I see a
> fairly recent commit which looks suspiciously relevant:
> 
> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

It seems that this patch went into U-Boot 2021.04 which I am using, so
that (alone) seems not to fix the mapping.

> 
> Booting with "efi=debug" should (among other things) print the memory
> map at boot if you want to double-check that that is the source of the
> mismatch. Our EFI code should be perfectly capable of setting the
> memblock flag if the region *is* described appropriately, see
> reserve_regions() in drivers/firmware/efi/efi-init.c.

Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
[    0.000000] Machine model: Hardkernel ODROID-N2Plus
[    0.000000] efi: Getting UEFI parameters from /chosen in DT:
[    0.000000] efi: UEFI not found.
[    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

So it seems UEFI is not in play here?

--
Stefan


* Re: Random reboots on ODROID-N2+
  2021-07-23 15:56       ` Stefan Agner
@ 2021-07-23 16:14         ` Robin Murphy
  2021-07-23 17:47           ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2021-07-23 16:14 UTC (permalink / raw)
  To: Stefan Agner, Byron Stanoszek
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 16:56, Stefan Agner wrote:
> Hi Byron, Hi Robin,
> 
> Very interesting findings!
> 
> On 2021-07-23 17:36, Robin Murphy wrote:
>> On 2021-07-23 15:25, Byron Stanoszek wrote:
>>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>>
>>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>>> Hi,
>>>>>
>>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>>> received since several reports of random reboots every couple of days.
>>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>>> at some point.
>>>>>
>>>>> After running serial console on several instances, I was able to catch
>>>>> this stack trace:
>>>>>
>>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>>>>> #1
>>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>>
>>>> <snip>
>>>>
>>>> We do see those crashes in similar frequency with Linux 5.12:
>>>>
>>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>>
>>>> It seems load and/or hardware dependent since we see it on some devices
>>>> quite frequently (every few days), and on others it takes multiple weeks.
>>>> Of course the ones where we see it frequently are the ones in production :).
>>>>
>>>> I am currently trying different stress-ng and other load to accelerate
>>>> the crash rate before then trying to git bisect it.
>>>
>>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>>> related to the following dmesg line that reads "failed to reserve memory"
>>> below:
>>>
>>> Machine model: Hardkernel ODROID-N2Plus
>>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
> 
> In my 5.9 builds that line isn't present, and it seems all logs I stored
> from 5.10 builds have the change already and show this line.
> 
>>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>>> OF: reserved mem: node linux,cma compatible matching fail
>>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>>> ...
>>>
>>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>>> and the system appears to operate normally, until eventually the SError
>>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>>> the "System RAM" memory area, whereas previous kernels had dropped it from
>>> "System RAM".
>>>
>>> The culprit is this new code introduced in Linux 5.12, in this function in
>>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
> 
> It seems that patch was also backported, which is why I see it with
> 5.10 as well.
> 
>>>
>>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>>                                           phys_addr_t size, bool nomap)
>>> {
>>>           if (nomap) {
>>>                   /*
>>>                    * If the memory is already reserved (by another region), we
>>>                    * should not allow it to be marked nomap.
>>>                    */
>>>                   if (memblock_is_region_reserved(base, size))  <------
>>>                           return -EBUSY;                        <------
>>>
>>>                   return memblock_mark_nomap(base, size);
>>>           }
>>>           return memblock_reserve(base, size);
>>> }
>>>
>>> "nomap" is true, due to this text being present in the FDT:
>>>
>>>      reserved-memory {
>>>        ranges;
>>>        secmon_reserved: secmon@5000000 {
>>>          reg = <0x0 0x05000000 0x0 0x300000>;
>>>          no-map;
>>>        };
>>>        ...
>>>
>>> But memblock_is_region_reserved() is returning true because there is already an
>>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>>
>>> This function is defined as:
>>>
>>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>>> {
>>>           return memblock_overlaps_region(&memblock.reserved, base, size);
>>> }
>>>
>>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>>> reserved region "0x5000000-0x52fffff", the function returns true.
>>>
>>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>>> allow it to mark the region no-map, then the memory area is properly removed
>>> from the "System RAM" area and the crashing stops.
>>>
>>> I've had the system up and running for 15 days now under heavy load without any
>>> crashes, using just the following patch as workaround:
>>>
>>>
>>> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
>>> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
>>> @@ -1157,13 +1157,6 @@
>>>                        phys_addr_t size, bool nomap)
>>>    {
>>>        if (nomap) {
>>> -        /*
>>> -         * If the memory is already reserved (by another region), we
>>> -         * should not allow it to be marked nomap.
>>> -         */
>>> -        if (memblock_is_region_reserved(base, size))
>>> -            return -EBUSY;
>>> -
>>>            return memblock_mark_nomap(base, size);
>>>        }
>>>        return memblock_reserve(base, size);
>>>
>>>
>>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>>> well.
> 
> Even though it is probably not the correct solution, I'll give this a try on
> my end just to verify that we are indeed experiencing the same problem and
> that the change fixes it for me too.
> 
>>>
>>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>>> that the existing reserved region is identical (same start/end) to the region
>>> getting marked no-map.
>>
>> If U-Boot is marking regions with the wrong type/attributes in the EFI
>> memory map, then the best thing to do would be to fix that. I see a
>> fairly recent commit which looks suspiciously relevant:
>>
>> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
> 
> It seems that this patch went into U-Boot 2021.04 which I am using, so
> that (alone) seems not to fix the mapping.
> 
>>
>> Booting with "efi=debug" should (among other things) print the memory
>> map at boot if you want to double-check that that is the source of the
>> mismatch. Our EFI code should be perfectly capable of setting the
>> memblock flag if the region *is* described appropriately, see
>> reserve_regions() in drivers/firmware/efi/efi-init.c.
> 
> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
> [    0.000000] efi: UEFI not found.
> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
> 
> So it seems UEFI is not in play here?

Ah, OK, in that case I guess the question remains why does 
early_init_dt_reserve_memory_arch() think the region is already 
reserved? My instinctive assumption was an EFI memory map being present; 
seeing that U-Boot does indeed reflect DT reservations there *and* has 
had a likely-looking bug recently was then just overwhelmingly suggestive :)

Robin.


* Re: Random reboots on ODROID-N2+
  2021-07-23 16:14         ` Robin Murphy
@ 2021-07-23 17:47           ` Robin Murphy
  2021-07-23 19:48             ` Stefan Agner
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2021-07-23 17:47 UTC (permalink / raw)
  To: Stefan Agner, Byron Stanoszek
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 17:14, Robin Murphy wrote:
> On 2021-07-23 16:56, Stefan Agner wrote:
>> Hi Byron, Hi Robin,
>>
>> Very interesting findings!
>>
>> On 2021-07-23 17:36, Robin Murphy wrote:
>>> On 2021-07-23 15:25, Byron Stanoszek wrote:
>>>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>>>
>>>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>>>> received since several reports of random reboots every couple of 
>>>>>> days.
>>>>>> Unfortunately the log (journald) doesn't show anything, just a 
>>>>>> hard cut
>>>>>> at some point.
>>>>>>
>>>>>> After running serial console on several instances, I was able to 
>>>>>> catch
>>>>>> this stack trace:
>>>>>>
>>>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 
>>>>>> 5.10.33
>>>>>> #1
>>>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>>>
>>>>> <snip>
>>>>>
>>>>> We do see those crashes in similar frequency with Linux 5.12:
>>>>>
>>>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>>>
>>>>> It seems load and/or hardware dependent since we see it on some devices
>>>>> quite frequently (every few days), and on others it takes multiple weeks.
>>>>> Of course the ones where we see it frequently are the ones in production :).
>>>>>
>>>>> I am currently trying different stress-ng and other load to accelerate
>>>>> the crash rate before then trying to git bisect it.
>>>>
>>>> I have an Odroid-N2+ and was able to track this problem down. The 
>>>> problem is
>>>> related to the following dmesg line that reads "failed to reserve 
>>>> memory"
>>>> below:
>>>>
>>>> Machine model: Hardkernel ODROID-N2Plus
>>>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>>>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>>>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>>>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>>>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> In my 5.9 builds that line isn't present, and it seems all logs I stored
>> from 5.10 builds have the change already and show this line.
>>
>>>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>>>> OF: reserved mem: node linux,cma compatible matching fail
>>>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>>>> ...
>>>>
>>>> A subsequent "cat /proc/iomem" shows that this memory region is 
>>>> still reserved
>>>> and the system appears to operate normally, until eventually the SError
>>>> Interrupt comes in under heavy memory/page-cache usage. The 
>>>> difference with
>>>> later kernels is that now the memory at 0x5000000-0x52fffff is 
>>>> registered under
>>>> the "System RAM" memory area, whereas previous kernels had dropped 
>>>> it from
>>>> "System RAM".
>>>>
>>>> The culprit is this new code introduced in Linux 5.12, in this 
>>>> function in
>>>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>>
>> It seems that patch was also backported, which is why I see it with
>> 5.10 as well.
>>
>>>>
>>>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>>>                                           phys_addr_t size, bool nomap)
>>>> {
>>>>           if (nomap) {
>>>>                   /*
>>>>                    * If the memory is already reserved (by another region), we
>>>>                    * should not allow it to be marked nomap.
>>>>                    */
>>>>                   if (memblock_is_region_reserved(base, size))  <------
>>>>                           return -EBUSY;                        <------
>>>>
>>>>                   return memblock_mark_nomap(base, size);
>>>>           }
>>>>           return memblock_reserve(base, size);
>>>> }
>>>>
>>>> "nomap" is true, due to this text being present in the FDT:
>>>>
>>>>      reserved-memory {
>>>>        ranges;
>>>>        secmon_reserved: secmon@5000000 {
>>>>          reg = <0x0 0x05000000 0x0 0x300000>;
>>>>          no-map;
>>>>        };
>>>>        ...
>>>>
>>>> But memblock_is_region_reserved() is returning true because there is
>>>> already an entry for 0x5000000-0x52fffff in the memory map, which is
>>>> already marked reserved, at the time the __reserved_mem_reserve_reg()
>>>> function is called. (Perhaps this is being set reserved by u-boot? -- I
>>>> did not research that far.)
>>>>
>>>> This function is defined as:
>>>>
>>>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base,
>>>>                                                  phys_addr_t size)
>>>> {
>>>>           return memblock_overlaps_region(&memblock.reserved, base, size);
>>>> }
>>>>
>>>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the
>>>> existing reserved region "0x5000000-0x52fffff", the function returns true.
>>>>
>>>> If I comment out the "if (memblock_is_region_reserved(base, size))" code
>>>> and allow it to mark the region no-map, then the memory area is properly
>>>> removed from the "System RAM" area and the crashing stops.
>>>>
>>>> I've had the system up and running for 15 days now under heavy load
>>>> without any crashes, using just the following patch as workaround:
>>>>
>>>>
>>>> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
>>>> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
>>>> @@ -1157,13 +1157,6 @@
>>>>                        phys_addr_t size, bool nomap)
>>>>    {
>>>>        if (nomap) {
>>>> -        /*
>>>> -         * If the memory is already reserved (by another region), we
>>>> -         * should not allow it to be marked nomap.
>>>> -         */
>>>> -        if (memblock_is_region_reserved(base, size))
>>>> -            return -EBUSY;
>>>> -
>>>>            return memblock_mark_nomap(base, size);
>>>>        }
>>>>        return memblock_reserve(base, size);
>>>>
>>>>
>>>> The above patch applies to later versions of Linux 5.10.x through 5.12.x
>>>> as well.
>>
>> Even though this is probably not the correct solution, I'll give it a try
>> on my end just to verify that we are indeed experiencing the same problem
>> and that the change fixes it for me too.
>>
>>>>
>>>> Perhaps a more proper fix is to allow the no-map to still proceed in the
>>>> case that the existing reserved region is identical (same start/end) to
>>>> the region getting marked no-map.
>>>
>>> If U-Boot is marking regions with the wrong type/attributes in the EFI
>>> memory map, then the best thing to do would be to fix that. I see a
>>> fairly recent commit which looks suspiciously relevant:
>>>
>>> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004 
>>>
>>
>> It seems that this patch went into U-Boot 2021.04 which I am using, so
>> that (alone) seems not to fix the mapping.
>>
>>>
>>> Booting with "efi=debug" should (among other things) print the memory
>>> map at boot if you want to double-check that that is the source of the
>>> mismatch. Our EFI code should be perfectly capable of setting the
>>> memblock flag if the region *is* described appropriately, see
>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>
>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>> [    0.000000] efi: UEFI not found.
>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> So it seems UEFI is not in play here?
> 
> Ah, OK, in that case I guess the question remains why does 
> early_init_dt_reserve_memory_arch() think the region is already 
> reserved? My instinctive assumption was an EFI memory map being present; 
> seeing that U-Boot does indeed reflect DT reservations there *and* has 
> had a likely-looking bug recently was then just overwhelmingly 
> suggestive :)

Actually, poking at U-Boot a bit more I find 
meson_board_add_reserved_memory() - can you check /sys/firmware/fdt and 
see if the region ends up being passed as a /memreserve/ as well as a 
proper reserved-memory node?

IIRC the semantics of /memreserve/ aren't really well-defined enough to 
be suitable for the kind of things which require no-map, and my new 
guess is that that's what ends up conflicting here.
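
To make that guess concrete, here is a minimal user-space model of the check,
not kernel code. It assumes the /memreserve/ entry is processed before the
reserved-memory node, and every toy_* name is hypothetical; only the -EBUSY
logic mirrors the early_init_dt_reserve_memory_arch() snippet quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 8

/* Hypothetical stand-in for a memblock region: [base, base + size). */
struct toy_region {
    uint64_t base;
    uint64_t size;
};

/* Hypothetical stand-in for the reserved list plus the set of no-map ranges. */
struct toy_memblock {
    struct toy_region reserved[TOY_MAX_REGIONS];
    size_t nr_reserved;
    struct toy_region nomap[TOY_MAX_REGIONS];
    size_t nr_nomap;
};

/* Two half-open ranges overlap when each one starts before the other ends. */
bool toy_overlaps(const struct toy_region *r, uint64_t base, uint64_t size)
{
    return base < r->base + r->size && r->base < base + size;
}

/* Models memblock_is_region_reserved(): any overlap counts, not only an exact match. */
bool toy_is_region_reserved(const struct toy_memblock *mb,
                            uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < mb->nr_reserved; i++)
        if (toy_overlaps(&mb->reserved[i], base, size))
            return true;
    return false;
}

/* Models the quoted function: a no-map request is refused if already reserved. */
int toy_reserve_memory(struct toy_memblock *mb,
                       uint64_t base, uint64_t size, bool nomap)
{
    assert(size > 0);

    if (nomap) {
        if (toy_is_region_reserved(mb, base, size))
            return -EBUSY;
        if (mb->nr_nomap == TOY_MAX_REGIONS)
            return -ENOMEM;
        mb->nomap[mb->nr_nomap++] = (struct toy_region){ base, size };
        return 0;
    }

    if (mb->nr_reserved == TOY_MAX_REGIONS)
        return -ENOMEM;
    mb->reserved[mb->nr_reserved++] = (struct toy_region){ base, size };
    return 0;
}
```

In this model, reserving 0x5000000/0x300000 with nomap = false and then asking
for the same range with nomap = true yields -EBUSY, while the same no-map
request against an empty table succeeds.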

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random reboots on ODROID-N2+
  2021-07-23 17:47           ` Robin Murphy
@ 2021-07-23 19:48             ` Stefan Agner
  2021-07-26  7:54               ` Neil Armstrong
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-07-23 19:48 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Neil Armstrong,
	Jerome Brunet, Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 19:47, Robin Murphy wrote:
> On 2021-07-23 17:14, Robin Murphy wrote:
>> On 2021-07-23 16:56, Stefan Agner wrote:
<snip>
>>>>
>>>> Booting with "efi=debug" should (among other things) print the memory
>>>> map at boot if you want to double-check that that is the source of the
>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>> memblock flag if the region *is* described appropriately, see
>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>
>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>> [    0.000000] efi: UEFI not found.
>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>
>>> So it seems UEFI is not in play here?
>>
>> Ah, OK, in that case I guess the question remains why does
>> early_init_dt_reserve_memory_arch() think the region is already
>> reserved? My instinctive assumption was an EFI memory map being present;
>> seeing that U-Boot does indeed reflect DT reservations there *and* has
>> had a likely-looking bug recently was then just overwhelmingly
>> suggestive :)
> 
> Actually, poking at U-Boot a bit more I find
> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
> and see if the region ends up being passed as a /memreserve/ as well
> as a proper reserved-memory node?
> 
> IIRC the semantics of /memreserve/ aren't really well-defined enough
> to be suitable for the kind of things which require no-map, and my new
> guess is that that's what ends up conflicting here.

Seems to be present in both:

On v5.12.10
# fdtdump /sys/firmware/fdt
...
/memreserve/ 0x5000000 0x300000;
...
    reserved-memory {
        #address-cells = <0x00000002>;
        #size-cells = <0x00000002>;
        ranges;
        secmon@5000000 {
            reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
            no-map;
            phandle = <0x00000068>;
        };
        linux,cma {
            compatible = "shared-dma-pool";
            reusable;
            size = <0x00000000 0x10000000>;
            alignment = <0x00000000 0x00400000>;
            linux,cma-default;
        };
    };
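
The two entries above describe the same range once the reg cells are decoded.
A minimal sketch of that decoding, assuming the cells have already been
converted to host byte order (as in the fdtdump output) and that
#address-cells and #size-cells are 1 or 2; cells_to_u64() and
reg_matches_memreserve() are hypothetical helper names, not libfdt functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Combine one or two 32-bit cells into a 64-bit value, most significant first. */
uint64_t cells_to_u64(const uint32_t *cells, int ncells)
{
    uint64_t value = 0;

    assert(ncells == 1 || ncells == 2);
    for (int i = 0; i < ncells; i++)
        value = (value << 32) | cells[i];
    return value;
}

/* True when a reg property decodes to exactly the given /memreserve/ entry. */
bool reg_matches_memreserve(const uint32_t *reg, int addr_cells, int size_cells,
                            uint64_t rsv_base, uint64_t rsv_size)
{
    uint64_t base = cells_to_u64(reg, addr_cells);
    uint64_t size = cells_to_u64(reg + addr_cells, size_cells);

    return base == rsv_base && size == rsv_size;
}
```

For the secmon node, the four cells decode to base 0x5000000 and size
0x300000, which is exactly the /memreserve/ entry shown at the top.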

--
Stefan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random reboots on ODROID-N2+
  2021-07-23 19:48             ` Stefan Agner
@ 2021-07-26  7:54               ` Neil Armstrong
  2021-07-26 12:07                 ` Stefan Agner
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Armstrong @ 2021-07-26  7:54 UTC (permalink / raw)
  To: Stefan Agner, Robin Murphy
  Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

Hi,

On 23/07/2021 21:48, Stefan Agner wrote:
> On 2021-07-23 19:47, Robin Murphy wrote:
>> On 2021-07-23 17:14, Robin Murphy wrote:
>>> On 2021-07-23 16:56, Stefan Agner wrote:
> <snip>
>>>>>
>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>> map at boot if you want to double-check that that is the source of the
>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>> memblock flag if the region *is* described appropriately, see
>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>
>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>> [    0.000000] efi: UEFI not found.
>>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>
>>>> So it seems UEFI is not in play here?
>>>
>>> Ah, OK, in that case I guess the question remains why does
>>> early_init_dt_reserve_memory_arch() think the region is already
>>> reserved? My instinctive assumption was an EFI memory map being present;
>>> seeing that U-Boot does indeed reflect DT reservations there *and* has
>>> had a likely-looking bug recently was then just overwhelmingly
>>> suggestive :)
>>
>> Actually, poking at U-Boot a bit more I find
>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>> and see if the region ends up being passed as a /memreserve/ as well
>> as a proper reserved-memory node?
>>
>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>> to be suitable for the kind of things which require no-map, and my new
>> guess is that that's what ends up conflicting here.
> 
> Seems to be present in both:

Indeed, this is in order to support any combination of:
- upstream u-boot
- vendor u-boot
- upstream linux
- other OS

The secmon is in the upstream Linux DT, and upstream u-boot reads the secure memory regions
from the first stage bootloaders and adds them into the DT memreserve.

It worked fine since Linux 4.10-ish, until 5.10.

Neil

> 
> On v5.12.10
> # fdtdump /sys/firmware/fdt
> ...
> /memreserve/ 0x5000000 0x300000;
> ...
>     reserved-memory {
>         #address-cells = <0x00000002>;
>         #size-cells = <0x00000002>;
>         ranges;
>         secmon@5000000 {
>             reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
>             no-map;
>             phandle = <0x00000068>;
>         };
>         linux,cma {
>             compatible = "shared-dma-pool";
>             reusable;
>             size = <0x00000000 0x10000000>;
>             alignment = <0x00000000 0x00400000>;
>             linux,cma-default;
>         };
>     };
> 
> --
> Stefan
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random reboots on ODROID-N2+
  2021-07-26  7:54               ` Neil Armstrong
@ 2021-07-26 12:07                 ` Stefan Agner
  2021-07-26 12:31                   ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-07-26 12:07 UTC (permalink / raw)
  To: Neil Armstrong
  Cc: Robin Murphy, Byron Stanoszek, linux-amlogic, linux-arm-kernel,
	Jerome Brunet, Kevin Hilman, Martin Blumenstingl, Mike Rapoport

FWIW, I ran two boards over the weekend with the stress-ng vm test
running to cause memory pressure, one of them with 8a5a75e5e9e55 ("of/fdt:
Make sure no-map does not remove already reserved regions") reverted.
The one without the revert crashed after ~24h; the other ran through
the weekend. This basically confirms what Byron reported.

On 2021-07-26 09:54, Neil Armstrong wrote:
> Hi,
> 
> On 23/07/2021 21:48, Stefan Agner wrote:
>> On 2021-07-23 19:47, Robin Murphy wrote:
>>> On 2021-07-23 17:14, Robin Murphy wrote:
>>>> On 2021-07-23 16:56, Stefan Agner wrote:
>> <snip>
>>>>>>
>>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>>> map at boot if you want to double-check that that is the source of the
>>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>>> memblock flag if the region *is* described appropriately, see
>>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>>
>>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>>> [    0.000000] efi: UEFI not found.
>>>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>>
>>>>> So it seems UEFI is not in play here?
>>>>
>>>> Ah, OK, in that case I guess the question remains why does
>>>> early_init_dt_reserve_memory_arch() think the region is already
>>>> reserved? My instinctive assumption was an EFI memory map being present;
>>>> seeing that U-Boot does indeed reflect DT reservations there *and* has
>>>> had a likely-looking bug recently was then just overwhelmingly
>>>> suggestive :)
>>>
>>> Actually, poking at U-Boot a bit more I find
>>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>>> and see if the region ends up being passed as a /memreserve/ as well
>>> as a proper reserved-memory node?
>>>
>>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>>> to be suitable for the kind of things which require no-map, and my new
>>> guess is that that's what ends up conflicting here.
>>
>> Seems to be present in both:
> 
> Indeed, this is in order to support any combination of:
> - upstream u-boot
> - vendor u-boot
> - upstream linux
> - other OS
> 
> The secmon is in the upstream Linux DT, and upstream u-boot reads the
> secure memory regions from the first stage bootloaders and adds them
> into the DT memreserve.
> 
> It worked fine since Linux 4.10-ish, until 5.10.

Just verified what is probably obvious at this point: after removing
meson_board_add_reserved_memory(), the /memreserve/ region is no longer
present and the "failed to reserve memory" message indeed disappears.

Why is reserving memory not enough? From what I've read, no-map also makes
sure there is no VM mapping, but if the region is reserved, shouldn't
that be enough for Linux to not access the region? I've read that no-map
also prevents access due to speculation; is this what is happening here?

What is the proper solution here? Could maybe
meson_board_add_reserved_memory() check if reserved-memory is present,
and if so avoid adding /memreserve/?
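
One way to picture that check, as a sketch only: before adding a
/memreserve/ entry, test whether an existing reserved-memory node already
covers the range. This is a stand-alone model, not U-Boot code; struct
rsv_range, range_covers() and memreserve_is_redundant() are hypothetical
names, and a real implementation would have to walk the device tree instead
of an array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one reserved-memory node: [base, base + size). */
struct rsv_range {
    uint64_t base;
    uint64_t size;
};

/* True when [base, base + size) lies entirely inside r, without overflowing. */
bool range_covers(const struct rsv_range *r, uint64_t base, uint64_t size)
{
    return base >= r->base &&
           size <= r->size &&
           base - r->base <= r->size - size;
}

/* True when some existing node already covers the candidate /memreserve/ range. */
bool memreserve_is_redundant(const struct rsv_range *nodes, size_t nr_nodes,
                             uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < nr_nodes; i++)
        if (range_covers(&nodes[i], base, size))
            return true;
    return false;
}
```

With a single node at 0x5000000 of size 0x300000, the identical range is
reported as redundant, while a larger 0x400000 range starting at the same
address is not, so it would still get its /memreserve/ entry.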

--
Stefan

> 
> Neil
> 
>>
>> On v5.12.10
>> # fdtdump /sys/firmware/fdt
>> ...
>> /memreserve/ 0x5000000 0x300000;
>> ...
>>     reserved-memory {
>>         #address-cells = <0x00000002>;
>>         #size-cells = <0x00000002>;
>>         ranges;
>>         secmon@5000000 {
>>             reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
>>             no-map;
>>             phandle = <0x00000068>;
>>         };
>>         linux,cma {
>>             compatible = "shared-dma-pool";
>>             reusable;
>>             size = <0x00000000 0x10000000>;
>>             alignment = <0x00000000 0x00400000>;
>>             linux,cma-default;
>>         };
>>     };
>>
>> --
>> Stefan
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random reboots on ODROID-N2+
  2021-07-26 12:07                 ` Stefan Agner
@ 2021-07-26 12:31                   ` Robin Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Robin Murphy @ 2021-07-26 12:31 UTC (permalink / raw)
  To: Stefan Agner, Neil Armstrong
  Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Jerome Brunet,
	Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-26 13:07, Stefan Agner wrote:
> FWIW, I ran two boards over the weekend with the stress-ng vm test
> running to cause memory pressure, one of them with 8a5a75e5e9e55 ("of/fdt:
> Make sure no-map does not remove already reserved regions") reverted.
> The one without the revert crashed after ~24h; the other ran through
> the weekend. This basically confirms what Byron reported.
> 
> On 2021-07-26 09:54, Neil Armstrong wrote:
>> Hi,
>>
>> On 23/07/2021 21:48, Stefan Agner wrote:
>>> On 2021-07-23 19:47, Robin Murphy wrote:
>>>> On 2021-07-23 17:14, Robin Murphy wrote:
>>>>> On 2021-07-23 16:56, Stefan Agner wrote:
>>> <snip>
>>>>>>>
>>>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>>>> map at boot if you want to double-check that that is the source of the
>>>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>>>> memblock flag if the region *is* described appropriately, see
>>>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>>>
>>>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>>>> [    0.000000] efi: UEFI not found.
>>>>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>>>
>>>>>> So it seems UEFI is not in play here?
>>>>>
>>>>> Ah, OK, in that case I guess the question remains why does
>>>>> early_init_dt_reserve_memory_arch() think the region is already
>>>>> reserved? My instinctive assumption was an EFI memory map being present;
>>>>> seeing that U-Boot does indeed reflect DT reservations there *and* has
>>>>> had a likely-looking bug recently was then just overwhelmingly
>>>>> suggestive :)
>>>>
>>>> Actually, poking at U-Boot a bit more I find
>>>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>>>> and see if the region ends up being passed as a /memreserve/ as well
>>>> as a proper reserved-memory node?
>>>>
>>>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>>>> to be suitable for the kind of things which require no-map, and my new
>>>> guess is that that's what ends up conflicting here.
>>>
>>> Seems to be present in both:
>>
>> Indeed, this is in order to support any combination of:
>> - upstream u-boot
>> - vendor u-boot
>> - upstream linux
>> - other OS
>>
>> The secmon is in the upstream Linux DT, and upstream u-boot reads the
>> secure memory regions from the first stage bootloaders and adds them
>> into the DT memreserve.
>>
>> It worked fine since Linux 4.10-ish, until 5.10.
> 
> Just verified what is probably obvious at this point: after removing
> meson_board_add_reserved_memory(), the /memreserve/ region is no longer
> present and the "failed to reserve memory" message indeed disappears.
> 
> Why is reserving memory not enough? From what I've read, no-map also makes
> sure there is no VM mapping, but if the region is reserved, shouldn't
> that be enough for Linux to not access the region? I've read that no-map
> also prevents access due to speculation; is this what is happening here?

Almost certainly - being reserved either way means that Linux won't try 
to access those pages directly, but if they are still present in the 
linear map as Normal memory which allows speculation, legitimate access 
to adjacent pages may well cause the CPU to end up prefetching into them.

> What is the proper solution here? Could maybe
> meson_board_add_reserved_memory() check if reserved-memory is present,
> and if so avoid adding /memreserve/?

Perhaps, although it doesn't help people who can't or don't want to 
update their firmware. As I say, I'm not sure what the expectations are 
supposed to be for /memreserve/, particularly if it duplicates 
reserved-memory. Furthermore, looking at 8a5a75e5e9e55 I'm also not 
really convinced that making the kernel boot for the sake of debugging a 
fundamentally broken bootloader is a common and realistic enough issue 
to justify breaking the existing not-necessarily-invalid bootloader 
behaviour of other widely-deployed systems :/

Robin.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-07-26 12:35 UTC | newest]

Thread overview: 17+ messages
2021-05-17  9:14 Random reboots on ODROID-N2+ Stefan Agner
2021-05-17 21:09 ` Martin Blumenstingl
2021-05-18  9:16   ` Stefan Agner
2021-05-18  9:35     ` Neil Armstrong
2021-05-18  1:33 ` Andrew Lunn
2021-05-18 10:15   ` Stefan Agner
2021-05-19 20:09 ` Stefan Agner
2021-06-22  7:39 ` Stefan Agner
2021-07-23 14:25   ` Byron Stanoszek
2021-07-23 15:36     ` Robin Murphy
2021-07-23 15:56       ` Stefan Agner
2021-07-23 16:14         ` Robin Murphy
2021-07-23 17:47           ` Robin Murphy
2021-07-23 19:48             ` Stefan Agner
2021-07-26  7:54               ` Neil Armstrong
2021-07-26 12:07                 ` Stefan Agner
2021-07-26 12:31                   ` Robin Murphy
