* Random reboots on ODROID-N2+
@ 2021-05-17 9:14 ` Stefan Agner
0 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-05-17 9:14 UTC (permalink / raw)
To: linux-amlogic, linux-arm-kernel
Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl
Hi,
We are currently testing a new release using Linux 5.10.33. Since then, I've
received several reports of random reboots every couple of days.
Unfortunately the log (journald) doesn't show anything, just a hard cut
at some point.
After attaching a serial console to several instances, I was able to catch
this stack trace:
[202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
[202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
[202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
[202983.988160] sp : ffff8000132a3ae0
[202983.988160] x29: ffff8000132a3ae0 x28: ffff8000132a3bf0
[202983.988164] x27: 00000000fb0000e0 x26: ffff8000132a3d58
[202983.988165] x25: 0000000000000073 x24: ffff000007963e24
[202983.988167] x23: ffff8000132a3bf0 x22: ffff000005a72a80
[202983.988169] x21: 0000000000000011 x20: 0000000000000073
[202983.988170] x19: ffff000001a92c00 x18: 0000000000000001
[202983.988172] x17: 0000000000000000 x16: 0000000000000000
[202983.988173] x15: ffff8000132a3460 x14: 00000000ac1e2001
[202983.988175] x13: ffff0000079181a0 x12: 0000000000000028
[202983.988176] x11: ffff8000d3407000 x10: ffff800010ea8af0
[202983.988178] x9 : 000000000000001b x8 : ffff000007963e00
[202983.988179] x7 : ffff000000000000 x6 : 0000046a76b5fe28
[202983.988181] x5 : 0000000000941cc2 x4 : 0000000000000000
[202983.988182] x3 : 0000000000000001 x2 : ffff8000d3407000
[202983.988184] x1 : ffff00002f6e0000 x0 : 0000000100000001
[202983.988186] Kernel panic - not syncing: Asynchronous SError Interrupt
[202983.988187] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988188] Call trace:
[202983.988188] dump_backtrace+0x0/0x1a0
[202983.988189] show_stack+0x18/0x70
[202983.988190] dump_stack+0xd0/0x12c
[202983.988190] panic+0x170/0x338
[202983.988191] nmi_panic+0x8c/0x90
[202983.988191] arm64_serror_panic+0x78/0x84
[202983.988192] do_serror+0x38/0xa0
[202983.988193] el1_error+0x88/0x108
[202983.988193] udp_send_skb.isra.0+0x178/0x390
[202983.988194] udp_sendmsg+0x7c8/0x9c0
[202983.988194] inet_sendmsg+0x44/0x70
[202983.988195] sock_sendmsg+0x4c/0x60
[202983.988196] __sys_sendto+0xd0/0x140
[202983.988196] __arm64_sys_sendto+0x28/0x40
[202983.988197] el0_svc_common.constprop.0+0x78/0x1a0
[202983.988197] do_el0_svc+0x24/0x90
[202983.988198] el0_svc+0x14/0x20
[202983.988199] el0_sync_handler+0xb0/0xc0
[202983.988199] el0_sync+0x178/0x180
[202983.988211] SMP: stopping secondary CPUs
[202983.988212] Kernel Offset: disabled
[202983.988212] CPU features: 0x0240002,61082004
[202983.988213] Memory Limit: none
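For reference, the "code 0xbf000000" value in the first line of the trace is
the exception syndrome (ESR), and it can be split into its fields by hand.
The following is a minimal sketch, assuming the standard ESR_ELx layout (EC in
bits 31:26, IL in bit 25, ISS in bits 24:0, with bit 24 of the ISS acting as
the "implementation defined syndrome" flag for SErrors). The helper name
decode_esr is made up for this example.

```python
# Sketch: split the syndrome value from the trace into ESR_ELx fields.
# decode_esr is a hypothetical helper, not a kernel or library function.

def decode_esr(esr: int) -> dict:
    ec = (esr >> 26) & 0x3F    # Exception Class, bits [31:26]
    il = (esr >> 25) & 0x1     # Instruction Length flag, bit [25]
    iss = esr & 0x1FFFFFF      # Instruction Specific Syndrome, bits [24:0]
    ids = (iss >> 24) & 0x1    # SError only: 1 = implementation-defined syndrome
    return {"ec": ec, "il": il, "iss": iss, "ids": ids}


fields = decode_esr(0xBF000000)
print({name: hex(value) for name, value in fields.items()})
```

Under that layout the value decodes to EC 0x2f (the SError class) with the
implementation-defined flag set and all remaining syndrome bits zero, so the
code itself probably says little about which device or bus raised the error.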
Has anyone observed such an issue? I am pretty sure that this is a new issue,
as we have many installations using Linux 5.9.16 running stably on the
same hardware.
Now that I can tell that it is network-related, I'll try to increase the
network load to see if I can find a quicker way to reproduce this.
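One simple way to generate such load is to push UDP datagrams through the
same sendto() path that shows up in the trace. This is only a sketch of the
idea, not a confirmed reproducer: the function name, the payload size and the
target address are placeholders, and it is an assumption that plain unicast
UDP exercises the failing path at all.

```python
import socket


def send_udp_burst(host: str, port: int, count: int, payload_size: int = 64) -> int:
    """Send `count` UDP datagrams of `payload_size` bytes; return how many were sent."""
    payload = bytes(payload_size)  # zero-filled dummy payload
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            # Each call enters the kernel through the sendto() syscall.
            sock.sendto(payload, (host, port))
            sent += 1
    return sent
```

Run on the board, a call such as send_udp_burst("192.0.2.10", 9, 100000)
would send to a placeholder host on the LAN; the address and port have to be
replaced with a real peer reachable over the Ethernet port.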
--
Stefan
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: Random reboots on ODROID-N2+
2021-05-17 9:14 ` Stefan Agner
@ 2021-05-17 21:09 ` Martin Blumenstingl
-1 siblings, 0 replies; 34+ messages in thread
From: Martin Blumenstingl @ 2021-05-17 21:09 UTC (permalink / raw)
To: Stefan Agner
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman
Hi Stefan,
On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote:
>
> Hi,
>
> We are currently testing a new release using Linux 5.10.33. Since then, I've
> received several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
I'm sorry to hear that some things are not working right.
[...]
> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988188] Call trace:
> [202983.988188] dump_backtrace+0x0/0x1a0
> [202983.988189] show_stack+0x18/0x70
> [202983.988190] dump_stack+0xd0/0x12c
> [202983.988190] panic+0x170/0x338
> [202983.988191] nmi_panic+0x8c/0x90
> [202983.988191] arm64_serror_panic+0x78/0x84
> [202983.988192] do_serror+0x38/0xa0
> [202983.988193] el1_error+0x88/0x108
> [202983.988193] udp_send_skb.isra.0+0x178/0x390
> [202983.988194] udp_sendmsg+0x7c8/0x9c0
> [202983.988194] inet_sendmsg+0x44/0x70
> [202983.988195] sock_sendmsg+0x4c/0x60
> [202983.988196] __sys_sendto+0xd0/0x140
> [202983.988196] __arm64_sys_sendto+0x28/0x40
> [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0
> [202983.988197] do_el0_svc+0x24/0x90
> [202983.988198] el0_svc+0x14/0x20
> [202983.988199] el0_sync_handler+0xb0/0xc0
> [202983.988199] el0_sync+0x178/0x180
> [202983.988211] SMP: stopping secondary CPUs
> [202983.988212] Kernel Offset: disabled
> [202983.988212] CPU features: 0x0240002,61082004
> [202983.988213] Memory Limit: none
That looks weird.
> Has anyone observed such an issue? I am pretty sure that this is a new issue,
> as we have many installations using Linux 5.9.16 running stably on the
> same hardware.
I haven't, but I am currently trying to hunt down a (probably
unrelated) Ethernet issue on an older Meson8m2 SoC.
All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity, plus
there's a little bit of "glue" IP for the xMII connecting to the SoC's
IO pads.
I think it's a good idea to involve the netdev and (probably even more
importantly) stmmac maintainers.
Anything skb related is handled by the stmmac driver.
So I am hoping that someone with expertise in that area can give some
hints for debugging or reproducing this.
Best regards,
Martin
* Re: Random reboots on ODROID-N2+
2021-05-17 9:14 ` Stefan Agner
@ 2021-05-18 1:33 ` Andrew Lunn
-1 siblings, 0 replies; 34+ messages in thread
From: Andrew Lunn @ 2021-05-18 1:33 UTC (permalink / raw)
To: Stefan Agner
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl
On Mon, May 17, 2021 at 11:14:18AM +0200, Stefan Agner wrote:
> Hi,
>
> We are currently testing a new release using Linux 5.10.33. Since then, I've
> received several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
>
> After attaching a serial console to several instances, I was able to catch
> this stack trace:
>
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
Hi Stefan
Could you generate net/ipv4/udp.lst so we can see what
udp_send_skb.isra.0+0x178/0x390 is trying to do, and what bit of C
code it maps to?
Andrew
* Re: Random reboots on ODROID-N2+
2021-05-17 21:09 ` Martin Blumenstingl
@ 2021-05-18 9:16 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-05-18 9:16 UTC (permalink / raw)
To: Martin Blumenstingl
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman
Hi Martin,
On 2021-05-17 23:09, Martin Blumenstingl wrote:
> Hi Stefan,
>
> On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote:
>>
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. Since then, I've
>> received several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
> I'm sorry to hear that some things are not working right.
>
> [...]
>> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988188] Call trace:
>> [202983.988188] dump_backtrace+0x0/0x1a0
>> [202983.988189] show_stack+0x18/0x70
>> [202983.988190] dump_stack+0xd0/0x12c
>> [202983.988190] panic+0x170/0x338
>> [202983.988191] nmi_panic+0x8c/0x90
>> [202983.988191] arm64_serror_panic+0x78/0x84
>> [202983.988192] do_serror+0x38/0xa0
>> [202983.988193] el1_error+0x88/0x108
>> [202983.988193] udp_send_skb.isra.0+0x178/0x390
>> [202983.988194] udp_sendmsg+0x7c8/0x9c0
>> [202983.988194] inet_sendmsg+0x44/0x70
>> [202983.988195] sock_sendmsg+0x4c/0x60
>> [202983.988196] __sys_sendto+0xd0/0x140
>> [202983.988196] __arm64_sys_sendto+0x28/0x40
>> [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0
>> [202983.988197] do_el0_svc+0x24/0x90
>> [202983.988198] el0_svc+0x14/0x20
>> [202983.988199] el0_sync_handler+0xb0/0xc0
>> [202983.988199] el0_sync+0x178/0x180
>> [202983.988211] SMP: stopping secondary CPUs
>> [202983.988212] Kernel Offset: disabled
>> [202983.988212] CPU features: 0x0240002,61082004
>> [202983.988213] Memory Limit: none
> That looks weird.
>
>> Has anyone observed such an issue? I am pretty sure that this is a new issue,
>> as we have many installations using Linux 5.9.16 running stably on the
>> same hardware.
> I haven't, but I am currently trying to hunt down a (probably
> unrelated) Ethernet issue on an older Meson8m2 SoC.
> All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity, plus
> there's a little bit of "glue" IP for the xMII connecting to the SoC's
> IO pads.
>
> I think it's a good idea to involve the netdev and (probably even more
> importantly) stmmac maintainers.
> Anything skb related is handled by the stmmac driver.
> So I am hoping that someone with expertise in that area can give some
> hints for debugging or reproducing this.
OK, I'll do that. I am currently waiting to see the same trace a second time,
just to make sure it's really always caused by that code path.
--
Stefan
* Re: Random reboots on ODROID-N2+
2021-05-18 9:16 ` Stefan Agner
@ 2021-05-18 9:35 ` Neil Armstrong
-1 siblings, 0 replies; 34+ messages in thread
From: Neil Armstrong @ 2021-05-18 9:35 UTC (permalink / raw)
To: Stefan Agner, Martin Blumenstingl
Cc: linux-amlogic, linux-arm-kernel, Jerome Brunet, Kevin Hilman
Hi Stefan,
On 18/05/2021 11:16, Stefan Agner wrote:
> Hi Martin,
>
> On 2021-05-17 23:09, Martin Blumenstingl wrote:
>> Hi Stefan,
>>
>> On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote:
>>>
>>> Hi,
>>>
>>> We are currently testing a new release using Linux 5.10.33. Since then, I've
>>> received several reports of random reboots every couple of days.
>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>> at some point.
>> I'm sorry to hear that some things are not working right.
>>
>> [...]
>>> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>> [202983.988188] Call trace:
>>> [202983.988188] dump_backtrace+0x0/0x1a0
>>> [202983.988189] show_stack+0x18/0x70
>>> [202983.988190] dump_stack+0xd0/0x12c
>>> [202983.988190] panic+0x170/0x338
>>> [202983.988191] nmi_panic+0x8c/0x90
>>> [202983.988191] arm64_serror_panic+0x78/0x84
>>> [202983.988192] do_serror+0x38/0xa0
>>> [202983.988193] el1_error+0x88/0x108
>>> [202983.988193] udp_send_skb.isra.0+0x178/0x390
>>> [202983.988194] udp_sendmsg+0x7c8/0x9c0
>>> [202983.988194] inet_sendmsg+0x44/0x70
>>> [202983.988195] sock_sendmsg+0x4c/0x60
>>> [202983.988196] __sys_sendto+0xd0/0x140
>>> [202983.988196] __arm64_sys_sendto+0x28/0x40
>>> [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0
>>> [202983.988197] do_el0_svc+0x24/0x90
>>> [202983.988198] el0_svc+0x14/0x20
>>> [202983.988199] el0_sync_handler+0xb0/0xc0
>>> [202983.988199] el0_sync+0x178/0x180
>>> [202983.988211] SMP: stopping secondary CPUs
>>> [202983.988212] Kernel Offset: disabled
>>> [202983.988212] CPU features: 0x0240002,61082004
>>> [202983.988213] Memory Limit: none
>> That looks weird.
>>
>>> Has anyone observed such an issue? I am pretty sure that this is a new issue,
>>> as we have many installations using Linux 5.9.16 running stably on the
>>> same hardware.
>> I haven't, but I am currently trying to hunt down a (probably
>> unrelated) Ethernet issue on an older Meson8m2 SoC.
>> All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity, plus
>> there's a little bit of "glue" IP for the xMII connecting to the SoC's
>> IO pads.
>>
>> I think it's a good idea to involve the netdev and (probably even more
>> importantly) stmmac maintainers.
>> Anything skb related is handled by the stmmac driver.
>> So I am hoping that someone with expertise in that area can give some
>> hints for debugging or reproducing this.
>
> OK, I'll do that. I am currently waiting to see the same trace a second time,
> just to make sure it's really always caused by that code path.
A good next step would be to bisect between the last known working version and
the currently failing one.
An SError interrupt looks like a HW issue caused by a change in v5.10.
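The idea behind such a bisect can be sketched as a binary search over an
ordered list of revisions. This is an illustration only: the revision labels
and the is_bad check are hypothetical, and it assumes that once a revision
shows the reboot, every later one does too.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first revision for which is_bad() is true.

    Assumes revisions[0] is known good and revisions[-1] is known bad.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid       # failure already present: look earlier
        else:
            good = mid      # still fine: look later
    return revisions[bad]


# Made-up revision labels, purely for illustration.
revs = [f"rev{i}" for i in range(16)]
print(first_bad(revs, lambda rev: int(rev[3:]) >= 11))
```

Each step halves the remaining range, but with a failure that only shows up
every couple of days, a single "good" verdict takes days to trust, so a
quicker reproducer would make this far more practical.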
Neil
>
> --
> Stefan
>
* Re: Random reboots on ODROID-N2+
2021-05-18 1:33 ` Andrew Lunn
@ 2021-05-18 10:15 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-05-18 10:15 UTC (permalink / raw)
To: Andrew Lunn
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl
On 2021-05-18 03:33, Andrew Lunn wrote:
> On Mon, May 17, 2021 at 11:14:18AM +0200, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. Since then, I've
>> received several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After attaching a serial console to several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> Hi Stefan
Hi Andrew,
>
> Could you generate net/ipv4/udp.lst so we can see what
> udp_send_skb.isra.0+0x178/0x390 is trying to do, and what bit of C
> code it maps to?
OK, I built net/ipv4/udp.lst using the same build environment (buildroot)
that the kernel which generated the stack trace was built with, so I
think this should match up:
ffff800010c1bb60 <udp_send_skb.isra.0>:
static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
...
udp4_hwcsum(skb, fl4->saddr, fl4->daddr);
ffff800010c1bc78: 29450ae1 ldp w1, w2, [x23, #40]
ffff800010c1bc7c: aa1303e0 mov x0, x19
ffff800010c1bc80: 94000000 bl ffff800010c184b0 <udp4_hwcsum>
ffff800010c1bc80: R_AARCH64_CALL26 udp4_hwcsum
err = ip_send_skb(sock_net(sk), skb);
ffff800010c1bc84: f9401ac0 ldr x0, [x22, #48]
ffff800010c1bc88: aa1303e1 mov x1, x19
ffff800010c1bc8c: 94000000 bl 0 <ip_send_skb>
ffff800010c1bc8c: R_AARCH64_CALL26 ip_send_skb
if (err) {
ffff800010c1bc90: 350008e0 cbnz w0, ffff800010c1bdac <udp_send_skb.isra.0+0x24c>
...
u64 pc = READ_ONCE(ti->preempt_count);
ffff800010c1bcd4: f9400820 ldr x0, [x1, #16]
WRITE_ONCE(ti->preempt.count, --pc);
ffff800010c1bcd8: d1000400 sub x0, x0, #0x1
ffff800010c1bcdc: b9001020 str w0, [x1, #16]
return !pc || !READ_ONCE(ti->preempt_count);
...
The full udp.lst file:
https://drive.google.com/file/d/1j0RKOfuMXmCRWILpkG3uk_beohWrr-ho/view?usp=sharing
Since I only have this one trace, I am not 100% sure whether it is just a
random one or whether this is always the case.
But things seem to add up to me: mdns-repeater deals with UDP packets,
and it seems that the code tries to make use of HW checksumming
(from lr)? This would explain why only this platform shows the problem.
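As a cross-check, the pc and lr offsets from the trace can be mapped onto the
listing by adding them to the function's start address. This is a minimal
sketch, assuming the addresses in the excerpt match the running kernel (the
trace reports "Kernel Offset: disabled"); the helper name resolve and the
small lookup table are made up for this example, with the instructions copied
from the excerpt above.

```python
FUNC_BASE = 0xFFFF800010C1BB60  # <udp_send_skb.isra.0> in the udp.lst excerpt

# Instructions copied from the excerpt, keyed by address.
LISTING = {
    0xFFFF800010C1BC80: "bl <udp4_hwcsum>",
    0xFFFF800010C1BC8C: "bl <ip_send_skb>",
    0xFFFF800010C1BC90: "cbnz w0, <udp_send_skb.isra.0+0x24c>",
    0xFFFF800010C1BCD8: "sub x0, x0, #0x1",
}


def resolve(base: int, offset: int) -> str:
    """Turn symbol+offset into an address and look it up in the excerpt."""
    addr = base + offset
    return f"{addr:016x}: {LISTING.get(addr, '(not in excerpt)')}"


print("pc", resolve(FUNC_BASE, 0x178))  # pc : udp_send_skb.isra.0+0x178
print("lr", resolve(FUNC_BASE, 0x130))  # lr : udp_send_skb.isra.0+0x130
```

With these numbers, pc lands on the preempt count decrement and lr on the
instruction right after the call to ip_send_skb, with the udp4_hwcsum call
just before it. Since an SError is reported asynchronously, the pc is not
necessarily the instruction that triggered it.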
--
Stefan
* Re: Random reboots on ODROID-N2+
2021-05-17 9:14 ` Stefan Agner
@ 2021-05-19 20:09 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-05-19 20:09 UTC (permalink / raw)
To: linux-amlogic, linux-arm-kernel
Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl, andrew
On 2021-05-17 11:14, Stefan Agner wrote:
> Hi,
>
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
>
> After running serial console on several instances, I was able to catch
> this stack trace:
>
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
> [202983.988160] sp : ffff8000132a3ae0
> [202983.988160] x29: ffff8000132a3ae0 x28: ffff8000132a3bf0
> [202983.988164] x27: 00000000fb0000e0 x26: ffff8000132a3d58
> [202983.988165] x25: 0000000000000073 x24: ffff000007963e24
> [202983.988167] x23: ffff8000132a3bf0 x22: ffff000005a72a80
> [202983.988169] x21: 0000000000000011 x20: 0000000000000073
> [202983.988170] x19: ffff000001a92c00 x18: 0000000000000001
> [202983.988172] x17: 0000000000000000 x16: 0000000000000000
> [202983.988173] x15: ffff8000132a3460 x14: 00000000ac1e2001
> [202983.988175] x13: ffff0000079181a0 x12: 0000000000000028
> [202983.988176] x11: ffff8000d3407000 x10: ffff800010ea8af0
> [202983.988178] x9 : 000000000000001b x8 : ffff000007963e00
> [202983.988179] x7 : ffff000000000000 x6 : 0000046a76b5fe28
> [202983.988181] x5 : 0000000000941cc2 x4 : 0000000000000000
> [202983.988182] x3 : 0000000000000001 x2 : ffff8000d3407000
> [202983.988184] x1 : ffff00002f6e0000 x0 : 0000000100000001
> [202983.988186] Kernel panic - not syncing: Asynchronous SError
> Interrupt
> [202983.988187] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988188] Call trace:
> [202983.988188] dump_backtrace+0x0/0x1a0
> [202983.988189] show_stack+0x18/0x70
> [202983.988190] dump_stack+0xd0/0x12c
> [202983.988190] panic+0x170/0x338
> [202983.988191] nmi_panic+0x8c/0x90
> [202983.988191] arm64_serror_panic+0x78/0x84
> [202983.988192] do_serror+0x38/0xa0
> [202983.988193] el1_error+0x88/0x108
> [202983.988193] udp_send_skb.isra.0+0x178/0x390
> [202983.988194] udp_sendmsg+0x7c8/0x9c0
> [202983.988194] inet_sendmsg+0x44/0x70
> [202983.988195] sock_sendmsg+0x4c/0x60
> [202983.988196] __sys_sendto+0xd0/0x140
> [202983.988196] __arm64_sys_sendto+0x28/0x40
> [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0
> [202983.988197] do_el0_svc+0x24/0x90
> [202983.988198] el0_svc+0x14/0x20
> [202983.988199] el0_sync_handler+0xb0/0xc0
> [202983.988199] el0_sync+0x178/0x180
> [202983.988211] SMP: stopping secondary CPUs
> [202983.988212] Kernel Offset: disabled
> [202983.988212] CPU features: 0x0240002,61082004
> [202983.988213] Memory Limit: none
>
A second stack trace, same build etc. but different board (instance):
[48112.247242] SError Interrupt on CPU5, code 0xbf000000 -- SError
[48112.247244] CPU: 5 PID: 264945 Comm: python3 Not tainted 5.10.33 #1
[48112.247245] Hardware name: Hardkernel ODROID-N2Plus (DT)
[48112.247246] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[48112.247247] pc : __rcu_read_lock+0x18/0x20
[48112.247248] lr : lock_page_memcg+0x28/0xd0
[48112.247249] sp : ffff800013e238e0
[48112.247249] x29: ffff800013e238e0 x28: ffff800013e23b18
[48112.247252] x27: ffff000055c5c780 x26: 0000ffff9163c000
[48112.247254] x25: ffff0000053000c0 x24: 00e00000d40e3bc3
[48112.247256] x23: fffffe00033038c0 x22: ffff800013e23a18
[48112.247257] x21: 0000ffff9163b000 x20: fffffe00033038c0
[48112.247259] x19: fffffe00033038c0 x18: 0000000000000000
[48112.247261] x17: 0000000000000000 x16: 0000000000000000
[48112.247262] x15: 0000000000000002 x14: 0000000000000001
[48112.247264] x13: fffffe0001acdd08 x12: 0000000000000000
[48112.247265] x11: ffff0000e4650100 x10: ffff00004c640000
[48112.247267] x9 : 000000000000000c x8 : 00000000ffffffff
[48112.247268] x7 : 0000000000000020 x6 : 0000000000000000
[48112.247270] x5 : 00000000000d40e3 x4 : 0000ffff9163b000
[48112.247271] x3 : 00000000ffffffff x2 : 0000000000000001
[48112.247273] x1 : ffff000000182ac0 x0 : 0000000000000001
[48112.247275] Kernel panic - not syncing: Asynchronous SError Interrupt
[48112.247275] CPU: 5 PID: 264945 Comm: python3 Not tainted 5.10.33 #1
[48112.247276] Hardware name: Hardkernel ODROID-N2Plus (DT)
[48112.247277] Call trace:
[48112.247277] dump_backtrace+0x0/0x1a0
[48112.247278] show_stack+0x18/0x70
[48112.247279] dump_stack+0xd0/0x12c
[48112.247279] panic+0x170/0x338
[48112.247280] nmi_panic+0x8c/0x90
[48112.247280] arm64_serror_panic+0x78/0x84
[48112.247281] do_serror+0x38/0xa0
[48112.247281] el1_error+0x88/0x108
[48112.247282] __rcu_read_lock+0x18/0x20
[48112.247283] page_remove_rmap+0x1c/0x560
[48112.247283] unmap_page_range+0x5b0/0x7b0
[48112.247284] unmap_single_vma+0x4c/0xb0
[48112.247285] unmap_vmas+0x70/0xf0
[48112.247285] exit_mmap+0xc8/0x180
[48112.247286] mmput+0x7c/0x150
[48112.247286] begin_new_exec+0x2d4/0xa90
[48112.247287] load_elf_binary+0x38c/0x1800
[48112.247288] bprm_execve+0x29c/0x5d0
[48112.247288] do_execveat_common.isra.0+0x178/0x1d0
[48112.247289] __arm64_sys_execve+0x40/0x60
[48112.247290] el0_svc_common.constprop.0+0x78/0x1a0
[48112.247290] do_el0_svc+0x24/0x90
[48112.247291] el0_svc+0x14/0x20
[48112.247291] el0_sync_handler+0xb0/0xc0
[48112.247292] el0_sync+0x178/0x180
[48112.247303] SMP: stopping secondary CPUs
[48112.247304] Kernel Offset: disabled
[48112.247305] CPU features: 0x0240002,61082004
[48112.247305] Memory Limit: none
The stack trace does not look related to me...
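Both traces so far carry the same code, 0xbf000000, even though pc differs. An SError is asynchronous, so pc is not necessarily the instruction that triggered it. The code can be split into its ESR_ELx fields as a rough check. This is a sketch only: the field positions follow the Arm architecture's ESR_ELx layout, while the struct and function names are invented here:

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx field layout as defined by the Arm architecture. */
#define ESR_EC_SHIFT	26
#define ESR_EC_MASK	0x3fu
#define ESR_EC_SERROR	0x2fu		/* exception class: SError interrupt */
#define ESR_IL		(1u << 25)	/* instruction length bit */
#define ESR_IDS		(1u << 24)	/* SError: syndrome is implementation defined */
#define ESR_ISS_LOW	0x00ffffffu	/* ISS bits [23:0] */

struct serror_info {
	unsigned int ec;	/* exception class */
	bool il;
	bool impl_defined;	/* IDS bit set */
	uint32_t iss_low;	/* remaining syndrome bits */
};

/* Split a raw ESR value, as printed in the "code 0x..." part of the oops. */
static inline struct serror_info decode_esr(uint32_t esr)
{
	struct serror_info info = {
		.ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK,
		.il = (esr & ESR_IL) != 0,
		.impl_defined = (esr & ESR_IDS) != 0,
		.iss_low = esr & ESR_ISS_LOW,
	};
	return info;
}
```

For 0xbf000000 this yields EC 0x2f (SError), the IDS bit set and all lower syndrome bits zero. In other words the syndrome is implementation defined and carries no architectural detail about the failing access, which fits with the traces pointing at unrelated code.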
--
Stefan
* Re: Random reboots on ODROID-N2+
2021-05-17 9:14 ` Stefan Agner
@ 2021-06-22 7:39 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-06-22 7:39 UTC (permalink / raw)
To: linux-amlogic, linux-arm-kernel
Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl
On 2021-05-17 11:14, Stefan Agner wrote:
> Hi,
>
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
>
> After running serial console on several instances, I was able to catch
> this stack trace:
>
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
<snip>
We see those crashes at a similar frequency with Linux 5.12:
[129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
[129988.642348] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642350] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642351] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
[129988.642352] pc : free_page_and_swap_cache+0x0/0x110
[129988.642352] lr : tlb_remove_table_rcu+0x30/0x60
[129988.642353] sp : ffff8000115bbdf0
[129988.642354] x29: ffff8000115bbdf0 x28: ffff800010103a18
[129988.642358] x27: 000000000000000a x26: ffff000000120000
[129988.642360] x25: ffff000000120000 x24: ffff8000115bbe90
[129988.642362] x23: ffff800011456680 x22: ffff0000e07df970
[129988.642365] x21: 0000000000000003 x20: 0000000000000001
[129988.642367] x19: ffff000005300000 x18: 0000000000000000
[129988.642369] x17: 0000000000000000 x16: 0000000000000000
[129988.642371] x15: 0000000000000000 x14: 0000000000000500
[129988.642373] x13: 0000000000000002 x12: 0000000000000000
[129988.642375] x11: ffff8000cf5e6000 x10: ffff000028212800
[129988.642377] x9 : 0000000000000001 x8 : 00000000fffff1b8
[129988.642379] x7 : 0000000000015f40 x6 : 0000000000000001
[129988.642381] x5 : ffff80001007cf4c x4 : 0000000000000007
[129988.642383] x3 : ffff0000e07e2e78 x2 : ffff000025a2bd00
[129988.642385] x1 : ffff800010208b60 x0 : fffffc00002e9a80
[129988.642387] Kernel panic - not syncing: Asynchronous SError
Interrupt
[129988.642388] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642389] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642390] Call trace:
[129988.642391] dump_backtrace+0x0/0x1a0
[129988.642392] show_stack+0x18/0x70
[129988.642392] dump_stack+0xd0/0x12c
[129988.642393] panic+0x170/0x338
[129988.642394] nmi_panic+0x8c/0x90
[129988.642395] arm64_serror_panic+0x78/0x84
[129988.642395] do_serror+0x38/0xa0
[129988.642396] el1_error+0x80/0xf8
[129988.642397] free_page_and_swap_cache+0x0/0x110
[129988.642398] rcu_core+0x310/0x5d0
[129988.642398] rcu_core_si+0x10/0x20
[129988.642399] _stext+0x128/0x28c
[129988.642400] irq_exit+0xd8/0x100
[129988.642401] __handle_domain_irq+0x68/0xc0
[129988.642401] gic_handle_irq+0xa8/0xe0
[129988.642402] el1_irq+0xbc/0x180
[129988.642403] arch_cpu_idle+0x18/0x30
[129988.642404] default_idle_call+0x20/0x68
[129988.642404] do_idle+0x218/0x270
[129988.642405] cpu_startup_entry+0x24/0x70
[129988.642406] secondary_start_kernel+0x178/0x190
[129988.642418] SMP: stopping secondary CPUs
[129988.642419] Kernel Offset: disabled
[129988.642420] CPU features: 0x00240002,61082004
[129988.642421] Memory Limit: none
It seems to be load and/or hardware dependent, since we see it on some devices
quite frequently (every few days), while on others it takes multiple weeks.
Of course the ones where we see it frequently are the ones in production :).
I am currently trying different stress-ng and other loads to accelerate
the crash rate before trying to git bisect it.
--
Stefan
* Re: Random reboots on ODROID-N2+
2021-06-22 7:39 ` Stefan Agner
@ 2021-07-23 14:25 ` Byron Stanoszek
-1 siblings, 0 replies; 34+ messages in thread
From: Byron Stanoszek @ 2021-07-23 14:25 UTC (permalink / raw)
To: Stefan Agner
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On Tue, 22 Jun 2021, Stefan Agner wrote:
> On 2021-05-17 11:14, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>> #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> <snip>
>
> We do see those crashes in similar frequency with Linux 5.12:
>
> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>
> It seems to be load and/or hardware dependent, since we see it on some devices
> quite frequently (every few days), while on others it takes multiple weeks.
> Of course the ones where we see it frequently are the ones in production :).
>
> I am currently trying different stress-ng and other loads to accelerate
> the crash rate before trying to git bisect it.
I have an Odroid-N2+ and was able to track this problem down. The problem is
related to the following dmesg line that reads "failed to reserve memory"
below:
Machine model: Hardkernel ODROID-N2Plus
memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
OF: reserved mem: node linux,cma compatible matching fail
memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
...
A subsequent "cat /proc/iomem" shows that this memory region is still reserved
and the system appears to operate normally, until eventually the SError
Interrupt comes in under heavy memory/page-cache usage. The difference with
later kernels is that now the memory at 0x5000000-0x52fffff is registered under
the "System RAM" memory area, whereas previous kernels had dropped it from
"System RAM".
The culprit is this new code introduced in Linux 5.12, in this function in
drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
					phys_addr_t size, bool nomap)
{
	if (nomap) {
		/*
		 * If the memory is already reserved (by another region), we
		 * should not allow it to be marked nomap.
		 */
		if (memblock_is_region_reserved(base, size))   <------
			return -EBUSY;                         <------

		return memblock_mark_nomap(base, size);
	}
	return memblock_reserve(base, size);
}
"nomap" is true, due to this text being present in the FDT:
reserved-memory {
	ranges;

	secmon_reserved: secmon@5000000 {
		reg = <0x0 0x05000000 0x0 0x300000>;
		no-map;
	};
	...
But memblock_is_region_reserved() is returning true because there is already an
entry for 0x5000000-0x52fffff in the memory map, which is already marked
reserved, at the time the __reserved_mem_reserve_reg() function is called.
(Perhaps this is being set reserved by u-boot? -- I did not research that far.)
This function is defined as:
bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
	return memblock_overlaps_region(&memblock.reserved, base, size);
}
Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
reserved region "0x5000000-0x52fffff", the function returns true.
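The overlap condition can be modelled in userspace to see why an identical range counts as "already reserved". A minimal sketch, not the kernel's code: `ranges_overlap` is a made-up name, and the `phys_addr_t` typedef merely stands in for the kernel type:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* stand-in for the kernel type */

/*
 * Two ranges [base, base + size) overlap when each one starts before
 * the other one ends.
 */
static inline bool ranges_overlap(phys_addr_t base1, phys_addr_t size1,
				  phys_addr_t base2, phys_addr_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}
```

Comparing base 0x05000000, size 0x300000 against itself satisfies both conditions, so a range always "overlaps" an identical existing reservation, whereas a range starting at 0x05300000 would not.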
If I comment out the "if (memblock_is_region_reserved(base, size))" code and
allow it to mark the region no-map, then the memory area is properly removed
from the "System RAM" area and the crashing stops.
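Whether the secmon range ended up inside "System RAM" can be checked by scanning /proc/iomem for the line that covers 0x05000000. The helper below is a hypothetical sketch that tests a single line of that file, assuming the usual "start-end : name" layout. The file normally has to be read as root, otherwise the addresses may show up as zeros:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if one /proc/iomem line ("start-end : name", hex addresses,
 * optionally indented) has the given name and covers addr.
 */
static bool iomem_line_covers(const char *line, unsigned long long addr,
			      const char *name)
{
	unsigned long long start, end;
	char label[64];

	if (sscanf(line, " %llx-%llx : %63[^\n]", &start, &end, label) != 3)
		return false;

	return addr >= start && addr <= end && strcmp(label, name) == 0;
}
```

Feeding every line read with fgets() through `iomem_line_covers(line, 0x05000000ULL, "System RAM")` should report a match on an affected kernel and none once the region has been marked no-map.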
I've had the system up and running for 15 days now under heavy load without any
crashes, using just the following patch as workaround:
--- linux-5.13.0/drivers/of/fdt.c.bak 2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c 2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
 					phys_addr_t size, bool nomap)
 {
 	if (nomap) {
-		/*
-		 * If the memory is already reserved (by another region), we
-		 * should not allow it to be marked nomap.
-		 */
-		if (memblock_is_region_reserved(base, size))
-			return -EBUSY;
-
 		return memblock_mark_nomap(base, size);
 	}
 	return memblock_reserve(base, size);
The above patch applies to later versions of Linux 5.10.x through 5.12.x as
well.
Perhaps a more proper fix is to allow the no-map to still proceed, in the case
that the existing reserved region is identical (same start/end) to the region
getting marked no-map.
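That idea can be expressed as a small policy check. The following is a userspace model under stated assumptions, not a proposed kernel patch: `struct region`, `nomap_allowed` and the flat array of reserved ranges are all hypothetical, and in the kernel memblock may merge adjacent reservations, so an exact match might not exist as a single entry:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* stand-in for the kernel type */

struct region {
	phys_addr_t base;
	phys_addr_t size;
};

/*
 * Hypothetical policy: refuse a no-map request only if it overlaps a
 * reserved region that is not exactly the same range.
 */
static bool nomap_allowed(const struct region *reserved, size_t n,
			  phys_addr_t base, phys_addr_t size)
{
	for (size_t i = 0; i < n; i++) {
		const struct region *r = &reserved[i];
		bool overlaps = base < r->base + r->size &&
				r->base < base + size;
		bool identical = r->base == base && r->size == size;

		if (overlaps && !identical)
			return false;
	}
	return true;
}
```

With a single reserved entry of base 0x05000000 and size 0x300000, the identical secmon request would be allowed, while a request that only partially overlaps it would still be refused.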
-Byron
* Re: Random reboots on ODROID-N2+
@ 2021-07-23 14:25 ` Byron Stanoszek
0 siblings, 0 replies; 34+ messages in thread
From: Byron Stanoszek @ 2021-07-23 14:25 UTC (permalink / raw)
To: Stefan Agner
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On Tue, 22 Jun 2021, Stefan Agner wrote:
> On 2021-05-17 11:14, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>> #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> <snip>
>
> We do see those crashes in similar frequency with Linux 5.12:
>
> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>
> It seems load and/or hardware dependent since we see it on some devices
> quite frequent (every few days), and on others it takes multiple weeks.
> Of course the once we see it frequently are the ones in production :).
>
> I am currently trying different stress-ng and other load to accelerate
> the crash rate before then trying to git bisect it.
I have an Odroid-N2+ and was able to track this problem down. The problem is
related to the following dmesg line that reads "failed to reserve memory"
below:
Machine model: Hardkernel ODROID-N2Plus
memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
OF: reserved mem: node linux,cma compatible matching fail
memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
...
A subsequent "cat /proc/iomem" shows that this memory region is still reserved
and the system appears to operate normally, until eventually the SError
Interrupt comes in under heavy memory/page-cache usage. The difference with
later kernels is that now the memory at 0x5000000-0x52fffff is registered under
the "System RAM" memory area, whereas previous kernels had dropped it from
"System RAM".
The culprit is this new code introduced in Linux 5.12, in this function in
drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
                                        phys_addr_t size, bool nomap)
{
        if (nomap) {
                /*
                 * If the memory is already reserved (by another region), we
                 * should not allow it to be marked nomap.
                 */
                if (memblock_is_region_reserved(base, size))    <------
                        return -EBUSY;                          <------

                return memblock_mark_nomap(base, size);
        }
        return memblock_reserve(base, size);
}
"nomap" is true, due to this text being present in the FDT:
reserved-memory {
ranges secmon_reserved: secmon@5000000 {
reg = <0x0 0x05000000 0x0 0x300000>
no-map
}
...
But memblock_is_region_reserved() is returning true because there is already an
entry for 0x5000000-0x52fffff in the memory map, which is already marked
reserved, at the time the __reserved_mem_reserve_reg() function is called.
(Perhaps this is being set reserved by u-boot? -- I did not research that far.)
This function is defined as:
bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
        return memblock_overlaps_region(&memblock.reserved, base, size);
}
Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
reserved region "0x5000000-0x52fffff", the function returns true.
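
To spell out why an identical region counts as "overlapping", here is a small userspace model of the interval test. It is a sketch only: struct range and ranges_overlap() are hypothetical names, and the comparison is my understanding of the usual half-open overlap check, not a copy of the memblock code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for an address range; not a kernel structure. */
struct range {
        uint64_t base;
        uint64_t size;
};

/*
 * Half-open interval test: [a.base, a.base + a.size) against
 * [b.base, b.base + b.size). Two identical ranges overlap by definition.
 */
bool ranges_overlap(struct range a, struct range b)
{
        /* Model only: assumes base + size does not wrap around. */
        assert(a.base + a.size >= a.base && b.base + b.size >= b.base);
        return a.base < b.base + b.size && b.base < a.base + a.size;
}
```

With secmon = { 0x05000000, 0x300000 }, ranges_overlap(secmon, secmon) is true, which is the case hit here, while a range that merely starts at 0x05300000 does not overlap it.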
If I comment out the "if (memblock_is_region_reserved(base, size))" code and
allow it to mark the region no-map, then the memory area is properly removed
from the "System RAM" area and the crashing stops.
I've had the system up and running for 15 days now under heavy load without any
crashes, using just the following patch as workaround:
--- linux-5.13.0/drivers/of/fdt.c.bak   2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c       2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
                                        phys_addr_t size, bool nomap)
 {
        if (nomap) {
-               /*
-                * If the memory is already reserved (by another region), we
-                * should not allow it to be marked nomap.
-                */
-               if (memblock_is_region_reserved(base, size))
-                       return -EBUSY;
-
                return memblock_mark_nomap(base, size);
        }
        return memblock_reserve(base, size);
The above patch applies to later versions of Linux 5.10.x through 5.12.x as
well.
Perhaps a more proper fix is to allow the no-map to still proceed, in the case
that the existing reserved region is identical (same start/end) to the region
getting marked no-map.
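
A rough userspace model of that idea, in C, might look like the following. This is only a sketch of the policy, not a proposed kernel patch: struct region and nomap_allowed() are hypothetical, and the real decision would have to be made against memblock.reserved inside early_init_dt_reserve_memory_arch().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for one already-reserved region. */
struct region {
        uint64_t base;
        uint64_t size;
};

/*
 * Hypothetical policy: marking [base, base + size) no-map is allowed if
 * every reserved region it touches is exactly the same range. A partial
 * overlap with a different reservation is still refused with -EBUSY.
 */
int nomap_allowed(const struct region *reserved, size_t n,
                  uint64_t base, uint64_t size)
{
        for (size_t i = 0; i < n; i++) {
                const struct region *r = &reserved[i];
                bool overlap = base < r->base + r->size &&
                               r->base < base + size;

                if (!overlap)
                        continue;
                if (r->base != base || r->size != size)
                        return -EBUSY;
        }
        return 0;
}
```

With the two reservations from the log above, a request for base 0x05000000, size 0x300000 returns 0, while a request that only partly overlaps the secmon range still returns -EBUSY.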
-Byron
_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
2021-07-23 14:25 ` Byron Stanoszek
@ 2021-07-23 15:36 ` Robin Murphy
-1 siblings, 0 replies; 34+ messages in thread
From: Robin Murphy @ 2021-07-23 15:36 UTC (permalink / raw)
To: Byron Stanoszek, Stefan Agner
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On 2021-07-23 15:25, Byron Stanoszek wrote:
> On Tue, 22 Jun 2021, Stefan Agner wrote:
>
>> On 2021-05-17 11:14, Stefan Agner wrote:
>>> Hi,
>>>
>>> We are currently testing a new release using Linux 5.10.33. I've
>>> received since several reports of random reboots every couple of days.
>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>> at some point.
>>>
>>> After running serial console on several instances, I was able to catch
>>> this stack trace:
>>>
>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>>> #1
>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>
>> <snip>
>>
>> We do see those crashes in similar frequency with Linux 5.12:
>>
>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>
>> It seems load and/or hardware dependent since we see it on some devices
>> quite frequent (every few days), and on others it takes multiple weeks.
>> Of course the once we see it frequently are the ones in production :).
>>
>> I am currently trying different stress-ng and other load to accelerate
>> the crash rate before then trying to git bisect it.
>
> I have an Odroid-N2+ and was able to track this problem down. The
> problem is
> related to the following dmesg line that reads "failed to reserve memory"
> below:
>
> Machine model: Hardkernel ODROID-N2Plus
> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
> memblock_reserve: [0x0000000008210000-0x0000000008baffff]
> 0xffffffc0107e36dc
> memblock_reserve: [0x0000000005000000-0x00000000052fffff]
> 0xffffffc0107feb50
> OF: fdt: Reserved memory: failed to reserve memory for node
> 'secmon@5000000': base 0x0000000005000000, size 3 MiB
> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff]
> 0xffffffc0107ff87c
> OF: reserved mem: node linux,cma compatible matching fail
> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
> ...
>
> A subsequent "cat /proc/iomem" shows that this memory region is still
> reserved
> and the system appears to operate normally, until eventually the SError
> Interrupt comes in under heavy memory/page-cache usage. The difference with
> later kernels is that now the memory at 0x5000000-0x52fffff is
> registered under
> the "System RAM" memory area, whereas previous kernels had dropped it from
> "System RAM".
>
> The culprit is this new code introduced in Linux 5.12, in this function in
> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>
> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
> phys_addr_t size, bool nomap)
> {
> if (nomap) {
> /*
> * If the memory is already reserved (by another
> region), we
> * should not allow it to be marked nomap.
> */
> if (memblock_is_region_reserved(base, size)) <------
> return -EBUSY; <------
>
> return memblock_mark_nomap(base, size);
> }
> return memblock_reserve(base, size);
> }
>
> "nomap" is true, due to this text being present in the FDT:
>
> reserved-memory {
> ranges secmon_reserved: secmon@5000000 {
> reg = <0x0 0x05000000 0x0 0x300000>
> no-map
> }
> ...
>
> But memblock_is_region_reserved() is returning true because there is
> already an
> entry for 0x5000000-0x52fffff in the memory map, which is already marked
> reserved, at the time the __reserved_mem_reserve_reg() function is called.
> (Perhaps this is being set reserved by u-boot? -- I did not research
> that far.)
>
> This function is defined as:
>
> bool __init_memblock memblock_is_region_reserved(phys_addr_t base,
> phys_addr_t size)
> {
> return memblock_overlaps_region(&memblock.reserved, base, size);
> }
>
> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the
> existing
> reserved region "0x5000000-0x52fffff", the function returns true.
>
> If I comment out the "if (memblock_is_region_reserved(base, size))" code
> and
> allow it to mark the region no-map, then the memory area is properly
> removed
> from the "System RAM" area and the crashing stops.
>
> I've had the system up and running for 15 days now under heavy load
> without any
> crashes, using just the following patch as workaround:
>
>
> --- linux-5.13.0/drivers/of/fdt.c.bak 2021-07-07 00:22:58.000000000
> -0400
> +++ linux-5.13.0/drivers/of/fdt.c 2021-07-07 00:23:08.000000000 -0400
> @@ -1157,13 +1157,6 @@
> phys_addr_t size, bool nomap)
> {
> if (nomap) {
> - /*
> - * If the memory is already reserved (by another region), we
> - * should not allow it to be marked nomap.
> - */
> - if (memblock_is_region_reserved(base, size))
> - return -EBUSY;
> -
> return memblock_mark_nomap(base, size);
> }
> return memblock_reserve(base, size);
>
>
> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
> well.
>
> Perhaps a more proper fix is to allow the no-map to still proceed, in
> the case
> that the existing reserved region is identical (same start/end) to the
> region
> getting marked no-map.
If U-Boot is marking regions with the wrong type/attributes in the EFI
memory map, then the best thing to do would be to fix that. I see a
fairly recent commit which looks suspiciously relevant:
https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
Booting with "efi=debug" should (among other things) print the memory
map at boot if you want to double-check that that is the source of the
mismatch. Our EFI code should be perfectly capable of setting the
memblock flag if the region *is* described appropriately, see
reserve_regions() in drivers/firmware/efi/efi-init.c.
Robin.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
2021-07-23 15:36 ` Robin Murphy
@ 2021-07-23 15:56 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-07-23 15:56 UTC (permalink / raw)
To: Robin Murphy, Byron Stanoszek
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
Hi Byron, Hi Robin,
Very interesting findings!
On 2021-07-23 17:36, Robin Murphy wrote:
> On 2021-07-23 15:25, Byron Stanoszek wrote:
>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>
>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>> Hi,
>>>>
>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>> received since several reports of random reboots every couple of days.
>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>> at some point.
>>>>
>>>> After running serial console on several instances, I was able to catch
>>>> this stack trace:
>>>>
>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>>>> #1
>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>
>>> <snip>
>>>
>>> We do see those crashes in similar frequency with Linux 5.12:
>>>
>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>
>>> It seems load and/or hardware dependent since we see it on some devices
>>> quite frequent (every few days), and on others it takes multiple weeks.
>>> Of course the once we see it frequently are the ones in production :).
>>>
>>> I am currently trying different stress-ng and other load to accelerate
>>> the crash rate before then trying to git bisect it.
>>
>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>> related to the following dmesg line that reads "failed to reserve memory"
>> below:
>>
>> Machine model: Hardkernel ODROID-N2Plus
>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
In my 5.9 builds that line isn't present, and it seems all logs I stored
from 5.10 builds have the change already and show this line.
>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>> OF: reserved mem: node linux,cma compatible matching fail
>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>> ...
>>
>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>> and the system appears to operate normally, until eventually the SError
>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>> the "System RAM" memory area, whereas previous kernels had dropped it from
>> "System RAM".
>>
>> The culprit is this new code introduced in Linux 5.12, in this function in
>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
It seems that patch also got backported, which is why I see it with
5.10 as well.
>>
>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>> phys_addr_t size, bool nomap)
>> {
>> if (nomap) {
>> /*
>> * If the memory is already reserved (by another region), we
>> * should not allow it to be marked nomap.
>> */
>> if (memblock_is_region_reserved(base, size)) <------
>> return -EBUSY; <------
>>
>> return memblock_mark_nomap(base, size);
>> }
>> return memblock_reserve(base, size);
>> }
>>
>> "nomap" is true, due to this text being present in the FDT:
>>
>> reserved-memory {
>> ranges secmon_reserved: secmon@5000000 {
>> reg = <0x0 0x05000000 0x0 0x300000>
>> no-map
>> }
>> ...
>>
>> But memblock_is_region_reserved() is returning true because there is already an
>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>
>> This function is defined as:
>>
>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>> {
>> return memblock_overlaps_region(&memblock.reserved, base, size);
>> }
>>
>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>> reserved region "0x5000000-0x52fffff", the function returns true.
>>
>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>> allow it to mark the region no-map, then the memory area is properly removed
>> from the "System RAM" area and the crashing stops.
>>
>> I've had the system up and running for 15 days now under heavy load without any
>> crashes, using just the following patch as workaround:
>>
>>
>> --- linux-5.13.0/drivers/of/fdt.c.bak 2021-07-07 00:22:58.000000000 -0400
>> +++ linux-5.13.0/drivers/of/fdt.c 2021-07-07 00:23:08.000000000 -0400
>> @@ -1157,13 +1157,6 @@
>> phys_addr_t size, bool nomap)
>> {
>> if (nomap) {
>> - /*
>> - * If the memory is already reserved (by another region), we
>> - * should not allow it to be marked nomap.
>> - */
>> - if (memblock_is_region_reserved(base, size))
>> - return -EBUSY;
>> -
>> return memblock_mark_nomap(base, size);
>> }
>> return memblock_reserve(base, size);
>>
>>
>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>> well.
Even though this is probably not the correct solution, I'll give it a try
on my end just to verify that we are indeed experiencing the same problem
and that the change fixes it for me too.
>>
>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>> that the existing reserved region is identical (same start/end) to the region
>> getting marked no-map.
>
> If U-Boot is marking regions with the wrong type/attributes in the EFI
> memory map, then the best thing to do would be to fix that. I see a
> fairly recent commit which looks suspiciously relevant:
>
> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
It seems that this patch went into U-Boot 2021.04, which I am using, so
that (alone) does not seem to fix the mapping.
>
> Booting with "efi=debug" should (among other things) print the memory
> map at boot if you want to double-check that that is the source of the
> mismatch. Our EFI code should be perfectly capable of setting the
> memblock flag if the region *is* described appropriately, see
> reserve_regions() in drivers/firmware/efi/efi-init.c.
Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
[ 0.000000] Machine model: Hardkernel ODROID-N2Plus
[ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
[ 0.000000] efi: UEFI not found.
[ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
So it seems UEFI is not in play here?
--
Stefan
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
2021-07-23 15:56 ` Stefan Agner
@ 2021-07-23 16:14 ` Robin Murphy
-1 siblings, 0 replies; 34+ messages in thread
From: Robin Murphy @ 2021-07-23 16:14 UTC (permalink / raw)
To: Stefan Agner, Byron Stanoszek
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On 2021-07-23 16:56, Stefan Agner wrote:
> Hi Byron, Hi Robin,
>
> Very interesting findings!
>
> On 2021-07-23 17:36, Robin Murphy wrote:
>> On 2021-07-23 15:25, Byron Stanoszek wrote:
>>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>>
>>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>>> Hi,
>>>>>
>>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>>> received since several reports of random reboots every couple of days.
>>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>>> at some point.
>>>>>
>>>>> After running serial console on several instances, I was able to catch
>>>>> this stack trace:
>>>>>
>>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>>>>> #1
>>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>>
>>>> <snip>
>>>>
>>>> We do see those crashes in similar frequency with Linux 5.12:
>>>>
>>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>>
>>>> It seems load and/or hardware dependent since we see it on some devices
>>>> quite frequently (every few days), and on others it takes multiple weeks.
>>>> Of course the ones we see it on frequently are the ones in production :).
>>>>
>>>> I am currently trying different stress-ng and other load to accelerate
>>>> the crash rate before then trying to git bisect it.
>>>
>>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>>> related to the following dmesg line that reads "failed to reserve memory"
>>> below:
>>>
>>> Machine model: Hardkernel ODROID-N2Plus
>>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>
> In my 5.9 builds that line isn't present, and it seems all logs I stored
> from 5.10 builds have the change already and show this line.
>
>>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>>> OF: reserved mem: node linux,cma compatible matching fail
>>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>>> ...
>>>
>>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>>> and the system appears to operate normally, until eventually the SError
>>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>>> the "System RAM" memory area, whereas previous kernels had dropped it from
>>> "System RAM".
>>>
>>> The culprit is this new code introduced in Linux 5.12, in this function in
>>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>
> It seems that patch also got backported, so that is why I see it with
> 5.10 as well.
>
>>>
>>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>> 					phys_addr_t size, bool nomap)
>>> {
>>> 	if (nomap) {
>>> 		/*
>>> 		 * If the memory is already reserved (by another region), we
>>> 		 * should not allow it to be marked nomap.
>>> 		 */
>>> 		if (memblock_is_region_reserved(base, size))  <------
>>> 			return -EBUSY;                        <------
>>>
>>> 		return memblock_mark_nomap(base, size);
>>> 	}
>>> 	return memblock_reserve(base, size);
>>> }
>>>
>>> "nomap" is true, due to this text being present in the FDT:
>>>
>>> reserved-memory {
>>> 	ranges
>>> 	secmon_reserved: secmon@5000000 {
>>> 		reg = <0x0 0x05000000 0x0 0x300000>
>>> 		no-map
>>> 	}
>>> ...
>>>
>>> But memblock_is_region_reserved() is returning true because there is already an
>>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>>
>>> This function is defined as:
>>>
>>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>>> {
>>> 	return memblock_overlaps_region(&memblock.reserved, base, size);
>>> }
>>>
>>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>>> reserved region "0x5000000-0x52fffff", the function returns true.
>>>
>>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>>> allow it to mark the region no-map, then the memory area is properly removed
>>> from the "System RAM" area and the crashing stops.
>>>
>>> I've had the system up and running for 15 days now under heavy load without any
>>> crashes, using just the following patch as workaround:
>>>
>>>
>>> --- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
>>> +++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
>>> @@ -1157,13 +1157,6 @@
>>>  					phys_addr_t size, bool nomap)
>>>  {
>>>  	if (nomap) {
>>> -		/*
>>> -		 * If the memory is already reserved (by another region), we
>>> -		 * should not allow it to be marked nomap.
>>> -		 */
>>> -		if (memblock_is_region_reserved(base, size))
>>> -			return -EBUSY;
>>> -
>>>  		return memblock_mark_nomap(base, size);
>>>  	}
>>>  	return memblock_reserve(base, size);
>>>
>>>
>>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>>> well.
>
> Even though it is probably not the correct solution, I'll give this a try on
> my end just to verify we are indeed experiencing the same problem and the
> change fixes it for me too.
>
>>>
>>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>>> that the existing reserved region is identical (same start/end) to the region
>>> getting marked no-map.
>>
>> If U-Boot is marking regions with the wrong type/attributes in the EFI
>> memory map, then the best thing to do would be to fix that. I see a
>> fairly recent commit which looks suspiciously relevant:
>>
>> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
>
> It seems that this patch went into U-Boot 2021.04 which I am using, so
> that (alone) seems not to fix the mapping.
>
>>
>> Booting with "efi=debug" should (among other things) print the memory
>> map at boot if you want to double-check that that is the source of the
>> mismatch. Our EFI code should be perfectly capable of setting the
>> memblock flag if the region *is* described appropriately, see
>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>
> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
> [ 0.000000] efi: UEFI not found.
> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>
> So it seems UEFI is not in play here?
Ah, OK, in that case I guess the question remains why does
early_init_dt_reserve_memory_arch() think the region is already
reserved? My instinctive assumption was an EFI memory map being present;
seeing that U-Boot does indeed reflect DT reservations there *and* has
had a likely-looking bug recently was then just overwhelmingly suggestive :)
Robin.
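
For anyone following along, the rejection described in the quoted report can
be modelled outside the kernel. What follows is only a user-space sketch in C,
not kernel code: struct region, overlaps_any() and model_reserve_nomap() are
hypothetical names made up for illustration, and the real memblock
implementation may differ in detail. It assumes nothing more than what the
quoted log shows, namely that a reservation covering 0x5000000-0x52fffff
already exists when the no-map request for secmon@5000000 arrives.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of a reserved-region list. */
struct region {
	uint64_t base;
	uint64_t size;
};

/*
 * True if [base, base + size) intersects any of the n regions.
 * This mirrors the idea of an overlap test only; it is not the
 * kernel's memblock_overlaps_region().
 */
static bool overlaps_any(const struct region *r, size_t n,
			 uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < n; i++) {
		if (base < r[i].base + r[i].size && r[i].base < base + size)
			return true;
	}
	return false;
}

/*
 * Simplified model of the nomap branch quoted in this thread:
 * refuse with -EBUSY when the range is already reserved,
 * otherwise report success (0).
 */
static int model_reserve_nomap(const struct region *reserved, size_t n,
			       uint64_t base, uint64_t size)
{
	if (overlaps_any(reserved, n, base, size))
		return -EBUSY;
	return 0;
}
```

Fed with the two memblock_reserve ranges from the log (0x8210000-0x8baffff and
0x5000000-0x52fffff), the model returns -EBUSY for the secmon range; with only
the first range present it returns 0. So the open question is purely who adds
that second reservation before the reserved-memory node is processed.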
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
2021-07-23 16:14 ` Robin Murphy
@ 2021-07-23 17:47 ` Robin Murphy
-1 siblings, 0 replies; 34+ messages in thread
From: Robin Murphy @ 2021-07-23 17:47 UTC (permalink / raw)
To: Stefan Agner, Byron Stanoszek
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On 2021-07-23 17:14, Robin Murphy wrote:
> On 2021-07-23 16:56, Stefan Agner wrote:
>> Hi Byron, Hi Robin,
>>
>> Very interesting findings!
>>
>> On 2021-07-23 17:36, Robin Murphy wrote:
>>> On 2021-07-23 15:25, Byron Stanoszek wrote:
>>>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>>>
>>>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are currently testing a new release using Linux 5.10.33. I've since
>>>>>> received several reports of random reboots every couple of days.
>>>>>> Unfortunately the log (journald) doesn't show anything, just a
>>>>>> hard cut
>>>>>> at some point.
>>>>>>
>>>>>> After running serial console on several instances, I was able to
>>>>>> catch
>>>>>> this stack trace:
>>>>>>
>>>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted
>>>>>> 5.10.33
>>>>>> #1
>>>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>>>
>>>>> <snip>
>>>>>
>>>>> We do see those crashes in similar frequency with Linux 5.12:
>>>>>
>>>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>>>
>>>>> It seems load and/or hardware dependent since we see it on some devices
>>>>> quite frequently (every few days), and on others it takes multiple weeks.
>>>>> Of course the ones we see it on frequently are the ones in production :).
>>>>>
>>>>> I am currently trying different stress-ng and other load to accelerate
>>>>> the crash rate before then trying to git bisect it.
>>>>
>>>> I have an Odroid-N2+ and was able to track this problem down. The
>>>> problem is
>>>> related to the following dmesg line that reads "failed to reserve
>>>> memory"
>>>> below:
>>>>
>>>> Machine model: Hardkernel ODROID-N2Plus
>>>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe]
>>>> 0xffffffc0107e3604
>>>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe]
>>>> 0xffffffc0107e3664
>>>> memblock_reserve: [0x0000000008210000-0x0000000008baffff]
>>>> 0xffffffc0107e36dc
>>>> memblock_reserve: [0x0000000005000000-0x00000000052fffff]
>>>> 0xffffffc0107feb50
>>>> OF: fdt: Reserved memory: failed to reserve memory for node
>>>> 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> In my 5.9 builds that line isn't present, and it seems all logs I stored
>> from 5.10 builds have the change already and show this line.
>>
>>>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff]
>>>> 0xffffffc0107ff87c
>>>> OF: reserved mem: node linux,cma compatible matching fail
>>>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff]
>>>> 0xffffffc0107ffca8
>>>> ...
>>>>
>>>> A subsequent "cat /proc/iomem" shows that this memory region is
>>>> still reserved
>>>> and the system appears to operate normally, until eventually the SError
>>>> Interrupt comes in under heavy memory/page-cache usage. The
>>>> difference with
>>>> later kernels is that now the memory at 0x5000000-0x52fffff is
>>>> registered under
>>>> the "System RAM" memory area, whereas previous kernels had dropped
>>>> it from
>>>> "System RAM".
>>>>
>>>> The culprit is this new code introduced in Linux 5.12, in this
>>>> function in
>>>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>>
>> It seems that patch also got backported, so that is why I see it with
>> 5.10 as well.
>>
>>>>
>>>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>>> 					phys_addr_t size, bool nomap)
>>>> {
>>>> 	if (nomap) {
>>>> 		/*
>>>> 		 * If the memory is already reserved (by another region), we
>>>> 		 * should not allow it to be marked nomap.
>>>> 		 */
>>>> 		if (memblock_is_region_reserved(base, size))  <------
>>>> 			return -EBUSY;                        <------
>>>>
>>>> 		return memblock_mark_nomap(base, size);
>>>> 	}
>>>> 	return memblock_reserve(base, size);
>>>> }
>>>>
>>>> "nomap" is true, due to this text being present in the FDT:
>>>>
>>>> reserved-memory {
>>>> 	ranges
>>>> 	secmon_reserved: secmon@5000000 {
>>>> 		reg = <0x0 0x05000000 0x0 0x300000>
>>>> 		no-map
>>>> 	}
>>>> ...
>>>>
>>>> But memblock_is_region_reserved() is returning true because there is
>>>> already an
>>>> entry for 0x5000000-0x52fffff in the memory map, which is already
>>>> marked
>>>> reserved, at the time the __reserved_mem_reserve_reg() function is
>>>> called.
>>>> (Perhaps this is being set reserved by u-boot? -- I did not research
>>>> that far.)
>>>>
>>>> This function is defined as:
>>>>
>>>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>>>> {
>>>> 	return memblock_overlaps_region(&memblock.reserved, base, size);
>>>> }
>>>>
>>>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the
>>>> existing
>>>> reserved region "0x5000000-0x52fffff", the function returns true.
>>>>
>>>> If I comment out the "if (memblock_is_region_reserved(base, size))"
>>>> code and
>>>> allow it to mark the region no-map, then the memory area is properly
>>>> removed
>>>> from the "System RAM" area and the crashing stops.
>>>>
>>>> I've had the system up and running for 15 days now under heavy load
>>>> without any
>>>> crashes, using just the following patch as workaround:
>>>>
>>>>
>>>> --- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
>>>> +++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
>>>> @@ -1157,13 +1157,6 @@
>>>>  					phys_addr_t size, bool nomap)
>>>>  {
>>>>  	if (nomap) {
>>>> -		/*
>>>> -		 * If the memory is already reserved (by another region), we
>>>> -		 * should not allow it to be marked nomap.
>>>> -		 */
>>>> -		if (memblock_is_region_reserved(base, size))
>>>> -			return -EBUSY;
>>>> -
>>>>  		return memblock_mark_nomap(base, size);
>>>>  	}
>>>>  	return memblock_reserve(base, size);
>>>>
>>>>
>>>> The above patch applies to later versions of Linux 5.10.x through
>>>> 5.12.x as
>>>> well.
>>
>> Even though it is probably not the correct solution, I'll give this a try on
>> my end just to verify we are indeed experiencing the same problem and the
>> change fixes it for me too.
>>
>>>>
>>>> Perhaps a more proper fix is to allow the no-map to still proceed,
>>>> in the case
>>>> that the existing reserved region is identical (same start/end) to
>>>> the region
>>>> getting marked no-map.
>>>
>>> If U-Boot is marking regions with the wrong type/attributes in the EFI
>>> memory map, then the best thing to do would be to fix that. I see a
>>> fairly recent commit which looks suspiciously relevant:
>>>
>>> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
>>>
>>
>> It seems that this patch went into U-Boot 2021.04 which I am using, so
>> that (alone) seems not to fix the mapping.
>>
>>>
>>> Booting with "efi=debug" should (among other things) print the memory
>>> map at boot if you want to double-check that that is the source of the
>>> mismatch. Our EFI code should be perfectly capable of setting the
>>> memblock flag if the region *is* described appropriately, see
>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>
>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
>> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
>> [ 0.000000] efi: UEFI not found.
>> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> So it seems UEFI is not in play here?
>
> Ah, OK, in that case I guess the question remains why does
> early_init_dt_reserve_memory_arch() think the region is already
> reserved? My instinctive assumption was an EFI memory map being present;
> seeing that U-Boot does indeed reflect DT reservations there *and* has
> had a likely-looking bug recently was then just overwhelmingly
> suggestive :)
Actually, poking at U-Boot a bit more I find
meson_board_add_reserved_memory() - can you check /sys/firmware/fdt and
see if the region ends up being passed as a /memreserve/ as well as a
proper reserved-memory node?
IIRC the semantics of /memreserve/ aren't really well-defined enough to
be suitable for the kind of things which require no-map, and my new
guess is that that's what ends up conflicting here.
Robin.
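
As a way of doing the /memreserve/ check without extra tooling, here is a
rough sketch in C that walks the memory reservation block of a flattened
device tree blob. The layout it relies on comes from the devicetree
specification (big-endian header with off_mem_rsvmap at byte offset 16,
followed by address/size pairs of 64-bit big-endian values terminated by an
all-zero pair). The names struct fdt_rsv and fdt_list_memreserve() are
hypothetical, chosen for this example only, and error handling is kept to a
bare minimum.

```c
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC	0xd00dfeedu
#define FDT_HDR_SIZE	40	/* size of a version 17 header */

/* One /memreserve/ entry, converted to host byte order. */
struct fdt_rsv {
	uint64_t address;
	uint64_t size;
};

/* Read a big-endian 32-bit value from an unaligned buffer. */
static uint32_t be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value from an unaligned buffer. */
static uint64_t be64(const uint8_t *p)
{
	return ((uint64_t)be32(p) << 32) | be32(p + 4);
}

/*
 * Walk the memory reservation block of the blob and store up to max
 * entries in out. Returns the number of entries found, or -1 if the
 * blob has a bad magic, is too short, or lacks the terminating pair.
 */
static int fdt_list_memreserve(const uint8_t *blob, size_t len,
			       struct fdt_rsv *out, int max)
{
	size_t off;
	int n = 0;

	if (len < FDT_HDR_SIZE || be32(blob) != FDT_MAGIC)
		return -1;

	off = be32(blob + 16);	/* off_mem_rsvmap */
	while (off <= len && len - off >= 16) {
		uint64_t address = be64(blob + off);
		uint64_t size = be64(blob + off + 8);

		if (address == 0 && size == 0)
			return n;	/* terminator reached */
		if (n < max) {
			out[n].address = address;
			out[n].size = size;
		}
		n++;
		off += 16;
	}
	return -1;
}
```

To use it on the board, read /sys/firmware/fdt into a buffer and pass that
in. An entry with address 0x5000000 and size 0x300000 would support the
/memreserve/ guess above. Decompiling the same file with
"dtc -I dtb -O dts /sys/firmware/fdt" should show the same information as
/memreserve/ lines at the top of the output.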
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
@ 2021-07-23 17:47 ` Robin Murphy
0 siblings, 0 replies; 34+ messages in thread
From: Robin Murphy @ 2021-07-23 17:47 UTC (permalink / raw)
To: Stefan Agner, Byron Stanoszek
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On 2021-07-23 17:14, Robin Murphy wrote:
> On 2021-07-23 16:56, Stefan Agner wrote:
>> Hi Byron, Hi Robin,
>>
>> Very interesting findings!
>>
>> On 2021-07-23 17:36, Robin Murphy wrote:
>>> On 2021-07-23 15:25, Byron Stanoszek wrote:
>>>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>>>
>>>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>>>> received since several reports of random reboots every couple of
>>>>>> days.
>>>>>> Unfortunately the log (journald) doesn't show anything, just a
>>>>>> hard cut
>>>>>> at some point.
>>>>>>
>>>>>> After running serial console on several instances, I was able to
>>>>>> catch
>>>>>> this stack trace:
>>>>>>
>>>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted
>>>>>> 5.10.33
>>>>>> #1
>>>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>>>
>>>>> <snip>
>>>>>
>>>>> We do see those crashes in similar frequency with Linux 5.12:
>>>>>
>>>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>>>
>>>>> It seems load and/or hardware dependent since we see it on some
>>>>> devices
>>>>> quite frequent (every few days), and on others it takes multiple
>>>>> weeks.
>>>>> Of course the once we see it frequently are the ones in production :).
>>>>>
>>>>> I am currently trying different stress-ng and other load to accelerate
>>>>> the crash rate before then trying to git bisect it.
>>>>
>>>> I have an Odroid-N2+ and was able to track this problem down. The
>>>> problem is
>>>> related to the following dmesg line that reads "failed to reserve
>>>> memory"
>>>> below:
>>>>
>>>> Machine model: Hardkernel ODROID-N2Plus
>>>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe]
>>>> 0xffffffc0107e3604
>>>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe]
>>>> 0xffffffc0107e3664
>>>> memblock_reserve: [0x0000000008210000-0x0000000008baffff]
>>>> 0xffffffc0107e36dc
>>>> memblock_reserve: [0x0000000005000000-0x00000000052fffff]
>>>> 0xffffffc0107feb50
>>>> OF: fdt: Reserved memory: failed to reserve memory for node
>>>> 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> In my 5.9 builds that line isn't present, and it seems all logs I stored
>> from 5.10 builds have the change already and show this line.
>>
>>>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff]
>>>> 0xffffffc0107ff87c
>>>> OF: reserved mem: node linux,cma compatible matching fail
>>>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff]
>>>> 0xffffffc0107ffca8
>>>> ...
>>>>
>>>> A subsequent "cat /proc/iomem" shows that this memory region is
>>>> still reserved
>>>> and the system appears to operate normally, until eventually the SError
>>>> Interrupt comes in under heavy memory/page-cache usage. The
>>>> difference with
>>>> later kernels is that now the memory at 0x5000000-0x52fffff is
>>>> registered under
>>>> the "System RAM" memory area, whereas previous kernels had dropped
>>>> it from
>>>> "System RAM".
>>>>
>>>> The culprit is this new code introduced in Linux 5.12, in this
>>>> function in
>>>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>>
>> It seems that patch got also backported, so that is why I see it with
>> 5.10 as well.
>>
>>>>
>>>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>>> phys_addr_t size, bool nomap)
>>>> {
>>>> if (nomap) {
>>>> /*
>>>> * If the memory is already reserved (by another
>>>> region), we
>>>> * should not allow it to be marked nomap.
>>>> */
>>>> if (memblock_is_region_reserved(base, size)) <------
>>>> return -EBUSY; <------
>>>>
>>>> return memblock_mark_nomap(base, size);
>>>> }
>>>> return memblock_reserve(base, size);
>>>> }
>>>>
>>>> "nomap" is true, due to this text being present in the FDT:
>>>>
>>>> reserved-memory {
>>>> ranges secmon_reserved: secmon@5000000 {
>>>> reg = <0x0 0x05000000 0x0 0x300000>
>>>> no-map
>>>> }
>>>> ...
>>>>
>>>> But memblock_is_region_reserved() is returning true because there is
>>>> already an
>>>> entry for 0x5000000-0x52fffff in the memory map, which is already
>>>> marked
>>>> reserved, at the time the __reserved_mem_reserve_reg() function is
>>>> called.
>>>> (Perhaps this is being set reserved by u-boot? -- I did not research
>>>> that far.)
>>>>
>>>> This function is defined as:
>>>>
>>>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base,
>>>> phys_addr_t size)
>>>> {
>>>> return memblock_overlaps_region(&memblock.reserved, base,
>>>> size);
>>>> }
>>>>
>>>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the
>>>> existing
>>>> reserved region "0x5000000-0x52fffff", the function returns true.
>>>>
>>>> If I comment out the "if (memblock_is_region_reserved(base, size))"
>>>> code and
>>>> allow it to mark the region no-map, then the memory area is properly
>>>> removed
>>>> from the "System RAM" area and the crashing stops.
>>>>
>>>> I've had the system up and running for 15 days now under heavy load
>>>> without any
>>>> crashes, using just the following patch as workaround:
>>>>
>>>>
>>>> --- linux-5.13.0/drivers/of/fdt.c.bak 2021-07-07
>>>> 00:22:58.000000000 -0400
>>>> +++ linux-5.13.0/drivers/of/fdt.c 2021-07-07 00:23:08.000000000
>>>> -0400
>>>> @@ -1157,13 +1157,6 @@
>>>> phys_addr_t size, bool nomap)
>>>> {
>>>> if (nomap) {
>>>> - /*
>>>> - * If the memory is already reserved (by another region), we
>>>> - * should not allow it to be marked nomap.
>>>> - */
>>>> - if (memblock_is_region_reserved(base, size))
>>>> - return -EBUSY;
>>>> -
>>>> return memblock_mark_nomap(base, size);
>>>> }
>>>> return memblock_reserve(base, size);
>>>>
>>>>
>>>> The above patch applies to later versions of Linux 5.10.x through
>>>> 5.12.x as
>>>> well.
>>
>> Eventhough probably not the correct solution, I'll give this a try on my
>> end just to verify we are indeed experience the same problem and the
>> change fixes it for me too.
>>
>>>>
>>>> Perhaps a more proper fix is to allow the no-map to still proceed,
>>>> in the case
>>>> that the existing reserved region is identical (same start/end) to
>>>> the region
>>>> getting marked no-map.
>>>
>>> If U-Boot is marking regions with the wrong type/attributes in the EFI
>>> memory map, then the best thing to do would be to fix that. I see a
>>> fairly recent commit which looks suspiciously relevant:
>>>
>>> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
>>>
>>
>> It seems that this patch went into U-Boot 2021.04 which I am using, so
>> that (alone) seems not to fix the mapping.
>>
>>>
>>> Booting with "efi=debug" should (among other things) print the memory
>>> map at boot if you want to double-check that that is the source of the
>>> mismatch. Our EFI code should be perfectly capable of setting the
>>> memblock flag if the region *is* described appropriately, see
>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>
>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
>> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
>> [ 0.000000] efi: UEFI not found.
>> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> So it seems UEFI is not in the play here?
>
> Ah, OK, in that case I guess the question remains why does
> early_init_dt_reserve_memory_arch() think the region is already
> reserved? My instinctive assumption was an EFI memory map being present;
> seeing that U-Boot does indeed reflect DT reservations there *and* has
> had a likely-looking bug recently was then just overwhelmingly
> suggestive :)
Actually, poking at U-Boot a bit more I find
meson_board_add_reserved_memory() - can you check /sys/firmware/fdt and
see if the region ends up being passed as a /memreserve/ as well as a
proper reserved-memory node?
IIRC the semantics of /memreserve/ aren't really well-defined enough to
be suitable for the kind of things which require no-map, and my new
guess is that that's what ends up conflicting here.
Robin.
_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic
^ permalink raw reply [flat|nested] 34+ messages in thread
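Editorial sketch: the suggestion quoted at the top of this message (let no-map proceed when the existing reservation is identical, i.e. same start and end) can be written as a small predicate. This is a Python model for illustration only, not kernel code; `overlaps()` and `may_mark_nomap()` are hypothetical names, and regions are assumed to be plain (base, size) tuples.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (base, size), both in bytes


def overlaps(a: Region, b: Region) -> bool:
    """True if the half-open ranges [base, base + size) intersect."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def may_mark_nomap(base: int, size: int, reserved: List[Region]) -> bool:
    """Hypothetical policy: refuse no-map only when an existing
    reservation overlaps the region without being identical to it."""
    wanted = (base, size)
    for region in reserved:
        if overlaps(wanted, region) and region != wanted:
            return False
    return True


# The secmon region from this thread, already reserved with the same bounds.
print(may_mark_nomap(0x05000000, 0x00300000, [(0x05000000, 0x00300000)]))
```

Under this policy a duplicate reservation with identical bounds would not block no-map, while a partially overlapping one still would.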
* Re: Random reboots on ODROID-N2+
2021-07-23 17:47 ` Robin Murphy
@ 2021-07-23 19:48 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-07-23 19:48 UTC (permalink / raw)
To: Robin Murphy
Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Neil Armstrong,
Jerome Brunet, Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On 2021-07-23 19:47, Robin Murphy wrote:
> On 2021-07-23 17:14, Robin Murphy wrote:
>> On 2021-07-23 16:56, Stefan Agner wrote:
<snip>
>>>>
>>>> Booting with "efi=debug" should (among other things) print the memory
>>>> map at boot if you want to double-check that that is the source of the
>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>> memblock flag if the region *is* described appropriately, see
>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>
>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
>>> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>> [ 0.000000] efi: UEFI not found.
>>> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>
>>> So it seems UEFI is not in play here?
>>
>> Ah, OK, in that case I guess the question remains why does early_init_dt_reserve_memory_arch() think the region is already reserved? My instinctive assumption was an EFI memory map being present; seeing that U-Boot does indeed reflect DT reservations there *and* has had a likely-looking bug recently was then just overwhelmingly suggestive :)
>
> Actually, poking at U-Boot a bit more I find
> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
> and see if the region ends up being passed as a /memreserve/ as well
> as a proper reserved-memory node?
>
> IIRC the semantics of /memreserve/ aren't really well-defined enough
> to be suitable for the kind of things which require no-map, and my new
> guess is that that's what ends up conflicting here.
Seems to be present in both:
On v5.12.10
# fdtdump /sys/firmware/fdt
...
/memreserve/ 0x5000000 0x300000;
...
reserved-memory {
    #address-cells = <0x00000002>;
    #size-cells = <0x00000002>;
    ranges;
    secmon@5000000 {
        reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
        no-map;
        phandle = <0x00000068>;
    };
    linux,cma {
        compatible = "shared-dma-pool";
        reusable;
        size = <0x00000000 0x10000000>;
        alignment = <0x00000000 0x00400000>;
        linux,cma-default;
    };
};
--
Stefan
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 34+ messages in thread
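Editorial sketch: the check done by eye in the message above (is the same region present both as a /memreserve/ entry and as a no-map reserved-memory node?) could be scripted roughly as follows. This is a minimal Python sketch that works on fdtdump-style text; the function names are hypothetical, the regular expressions only handle the simple layout shown in the dump, and `#address-cells = #size-cells = 2` is assumed as in that dump.

```python
import re

# Excerpt of the fdtdump output from the message above.
FDT_TEXT = """
/memreserve/ 0x5000000 0x300000;
reserved-memory {
    #address-cells = <0x00000002>;
    #size-cells = <0x00000002>;
    ranges;
    secmon@5000000 {
        reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
        no-map;
        phandle = <0x00000068>;
    };
};
"""


def memreserve_entries(text):
    """Return a list of (base, size) for every /memreserve/ line."""
    pattern = re.compile(r"/memreserve/\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+);")
    return [(int(b, 16), int(s, 16)) for b, s in pattern.findall(text)]


def nomap_nodes(text):
    """Return {node_name: (base, size)} for nodes with both reg and no-map.

    Assumes two address cells and two size cells per reg entry.
    """
    nodes = {}
    node_re = re.compile(r"([\w,-]+@[0-9a-fA-F]+)\s*\{(.*?)\};", re.S)
    reg_re = re.compile(r"reg\s*=\s*<([^>]+)>;")
    for name, body in node_re.findall(text):
        reg = reg_re.search(body)
        if reg is None or "no-map;" not in body:
            continue
        cells = [int(c, 16) for c in reg.group(1).split()]
        base = (cells[0] << 32) | cells[1]
        size = (cells[2] << 32) | cells[3]
        nodes[name] = (base, size)
    return nodes


def duplicated_regions(text):
    """no-map nodes whose exact (base, size) also appears as /memreserve/."""
    reserved = set(memreserve_entries(text))
    return {n: r for n, r in nomap_nodes(text).items() if r in reserved}


for name, (base, size) in duplicated_regions(FDT_TEXT).items():
    print(f"{name}: base={base:#x} size={size:#x} is also a /memreserve/")
```

For the excerpt above this reports secmon@5000000, matching what the dump shows.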
* Re: Random reboots on ODROID-N2+
2021-07-23 19:48 ` Stefan Agner
@ 2021-07-26 7:54 ` Neil Armstrong
-1 siblings, 0 replies; 34+ messages in thread
From: Neil Armstrong @ 2021-07-26 7:54 UTC (permalink / raw)
To: Stefan Agner, Robin Murphy
Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
Hi,
On 23/07/2021 21:48, Stefan Agner wrote:
> On 2021-07-23 19:47, Robin Murphy wrote:
>> On 2021-07-23 17:14, Robin Murphy wrote:
>>> On 2021-07-23 16:56, Stefan Agner wrote:
> <snip>
>>>>>
>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>> map at boot if you want to double-check that that is the source of the
>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>> memblock flag if the region *is* described appropriately, see
>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>
>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>> [ 0.000000] efi: UEFI not found.
>>>> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>
>>>> So it seems UEFI is not in play here?
>>>
>>> Ah, OK, in that case I guess the question remains why does early_init_dt_reserve_memory_arch() think the region is already reserved? My instinctive assumption was an EFI memory map being present; seeing that U-Boot does indeed reflect DT reservations there *and* has had a likely-looking bug recently was then just overwhelmingly suggestive :)
>>
>> Actually, poking at U-Boot a bit more I find
>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>> and see if the region ends up being passed as a /memreserve/ as well
>> as a proper reserved-memory node?
>>
>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>> to be suitable for the kind of things which require no-map, and my new
>> guess is that that's what ends up conflicting here.
>
> Seems to be present in both:
Indeed, in order to support any combination of:
- upstream u-boot
- vendor u-boot
- upstream linux
- other OS
The secmon is in the upstream Linux DT, and upstream u-boot reads the secure memory regions
from the first stage bootloaders and adds them into the DT memreserve.
This had worked fine since Linux 4.10-ish, until 5.10.
Neil
>
> On v5.12.10
> # fdtdump /sys/firmware/fdt
> ...
> /memreserve/ 0x5000000 0x300000;
> ...
> reserved-memory {
>     #address-cells = <0x00000002>;
>     #size-cells = <0x00000002>;
>     ranges;
>     secmon@5000000 {
>         reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
>         no-map;
>         phandle = <0x00000068>;
>     };
>     linux,cma {
>         compatible = "shared-dma-pool";
>         reusable;
>         size = <0x00000000 0x10000000>;
>         alignment = <0x00000000 0x00400000>;
>         linux,cma-default;
>     };
> };
>
> --
> Stefan
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
2021-07-26 7:54 ` Neil Armstrong
@ 2021-07-26 12:07 ` Stefan Agner
-1 siblings, 0 replies; 34+ messages in thread
From: Stefan Agner @ 2021-07-26 12:07 UTC (permalink / raw)
To: Neil Armstrong
Cc: Robin Murphy, Byron Stanoszek, linux-amlogic, linux-arm-kernel,
Jerome Brunet, Kevin Hilman, Martin Blumenstingl, Mike Rapoport
FWIW, I ran two boards over the weekend with the stress-ng vm test
running to create memory pressure, one of them with 8a5a75e5e9e55 ("of/fdt:
Make sure no-map does not remove already reserved regions") reverted.
The one without the revert crashed after ~24h; the other ran through
the weekend. This basically confirms what Byron reported.
On 2021-07-26 09:54, Neil Armstrong wrote:
> Hi,
>
> On 23/07/2021 21:48, Stefan Agner wrote:
>> On 2021-07-23 19:47, Robin Murphy wrote:
>>> On 2021-07-23 17:14, Robin Murphy wrote:
>>>> On 2021-07-23 16:56, Stefan Agner wrote:
>> <snip>
>>>>>>
>>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>>> map at boot if you want to double-check that that is the source of the
>>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>>> memblock flag if the region *is* described appropriately, see
>>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>>
>>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>>> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>>> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>>> [ 0.000000] efi: UEFI not found.
>>>>> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>>
>>>>> So it seems UEFI is not in play here?
>>>>
>>>> Ah, OK, in that case I guess the question remains why does early_init_dt_reserve_memory_arch() think the region is already reserved? My instinctive assumption was an EFI memory map being present; seeing that U-Boot does indeed reflect DT reservations there *and* has had a likely-looking bug recently was then just overwhelmingly suggestive :)
>>>
>>> Actually, poking at U-Boot a bit more I find
>>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>>> and see if the region ends up being passed as a /memreserve/ as well
>>> as a proper reserved-memory node?
>>>
>>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>>> to be suitable for the kind of things which require no-map, and my new
>>> guess is that that's what ends up conflicting here.
>>
>> Seems to be present in both:
>
> Indeed, in order to support any combination of:
> - upstream u-boot
> - vendor u-boot
> - upstream linux
> - other OS
>
> The secmon is in the upstream Linux DT, and upstream u-boot reads the
> secure memory regions
> from the first stage bootloaders and adds them into the DT memreserve.
>
> This had worked fine since Linux 4.10-ish, until 5.10.
Just verified what is probably obvious at this point: with
meson_board_add_reserved_memory() removed, the /memreserve/ region isn't
present and the "failed to reserve memory" message indeed disappears.
Why is reserving memory not enough? From what I've read, no-map also makes
sure there is no VM mapping, but if the region is reserved, shouldn't
that be enough for Linux to not access the region? I've read that no-map
also prevents access due to speculation; is this what is happening here?
What is the proper solution here? Could
meson_board_add_reserved_memory() maybe check whether reserved-memory is
present, and if so avoid adding /memreserve/?
--
Stefan
>
> Neil
>
>>
>> On v5.12.10
>> # fdtdump /sys/firmware/fdt
>> ...
>> /memreserve/ 0x5000000 0x300000;
>> ...
>> reserved-memory {
>>     #address-cells = <0x00000002>;
>>     #size-cells = <0x00000002>;
>>     ranges;
>>     secmon@5000000 {
>>         reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
>>         no-map;
>>         phandle = <0x00000068>;
>>     };
>>     linux,cma {
>>         compatible = "shared-dma-pool";
>>         reusable;
>>         size = <0x00000000 0x10000000>;
>>         alignment = <0x00000000 0x00400000>;
>>         linux,cma-default;
>>     };
>> };
>>
>> --
>> Stefan
>>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Random reboots on ODROID-N2+
2021-07-26 12:07 ` Stefan Agner
@ 2021-07-26 12:31 ` Robin Murphy
-1 siblings, 0 replies; 34+ messages in thread
From: Robin Murphy @ 2021-07-26 12:31 UTC (permalink / raw)
To: Stefan Agner, Neil Armstrong
Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Jerome Brunet,
Kevin Hilman, Martin Blumenstingl, Mike Rapoport
On 2021-07-26 13:07, Stefan Agner wrote:
> FWIW, I ran two boards over the weekend with the stress-ng vm test
> running to create memory pressure, one of them with 8a5a75e5e9e55 ("of/fdt:
> Make sure no-map does not remove already reserved regions") reverted.
> The one without the revert crashed after ~24h; the other ran through
> the weekend. This basically confirms what Byron reported.
>
> On 2021-07-26 09:54, Neil Armstrong wrote:
>> Hi,
>>
>> On 23/07/2021 21:48, Stefan Agner wrote:
>>> On 2021-07-23 19:47, Robin Murphy wrote:
>>>> On 2021-07-23 17:14, Robin Murphy wrote:
>>>>> On 2021-07-23 16:56, Stefan Agner wrote:
>>> <snip>
>>>>>>>
>>>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>>>> map at boot if you want to double-check that that is the source of the
>>>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>>>> memblock flag if the region *is* described appropriately, see
>>>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>>>
>>>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>>>> [ 0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>>>> [ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>>>> [ 0.000000] efi: UEFI not found.
>>>>>> [ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>>>
>>>>>> So it seems UEFI is not in play here?
>>>>>
>>>>> Ah, OK, in that case I guess the question remains why does early_init_dt_reserve_memory_arch() think the region is already reserved? My instinctive assumption was an EFI memory map being present; seeing that U-Boot does indeed reflect DT reservations there *and* has had a likely-looking bug recently was then just overwhelmingly suggestive :)
>>>>
>>>> Actually, poking at U-Boot a bit more I find
>>>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>>>> and see if the region ends up being passed as a /memreserve/ as well
>>>> as a proper reserved-memory node?
>>>>
>>>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>>>> to be suitable for the kind of things which require no-map, and my new
>>>> guess is that that's what ends up conflicting here.
>>>
>>> Seems to be present in both:
>>
>> Indeed, in order to support any combination of:
>> - upstream u-boot
>> - vendor u-boot
>> - upstream linux
>> - other OS
>>
>> The secmon is in the upstream Linux DT, and upstream u-boot reads the
>> secure memory regions
>> from the first stage bootloaders and adds them into the DT memreserve.
>>
>> This had worked fine since Linux 4.10-ish, until 5.10.
>
> Just verified what is probably obvious at this point: with
> meson_board_add_reserved_memory() removed, the /memreserve/ region isn't
> present and the "failed to reserve memory" message indeed disappears.
>
> Why is reserving memory not enough? From what I've read, no-map also makes
> sure there is no VM mapping, but if the region is reserved, shouldn't
> that be enough for Linux to not access the region? I've read that no-map
> also prevents access due to speculation; is this what is happening here?
Almost certainly - being reserved either way means that Linux won't try
to access those pages directly, but if they are still present in the
linear map as Normal memory which allows speculation, legitimate access
to adjacent pages may well cause the CPU to end up prefetching into them.
> What is the proper solution here? Could
> meson_board_add_reserved_memory() maybe check whether reserved-memory is
> present, and if so avoid adding /memreserve/?
Perhaps, although it doesn't help people who can't or don't want to
update their firmware. As I say, I'm not sure what the expectations are
supposed to be for /memreserve/, particularly if it duplicates
reserved-memory. Furthermore, looking at 8a5a75e5e9e55 I'm also not
really convinced that making the kernel boot for the sake of debugging a
fundamentally broken bootloader is a common and realistic enough issue
to justify breaking the existing not-necessarily-invalid bootloader
behaviour of other widely-deployed systems :/
Robin.
^ permalink raw reply [flat|nested] 34+ messages in thread
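Editorial sketch: the sequence discussed in this thread can be modelled in a few lines of Python. The model assumes, as the thread suggests, that the /memreserve/ entry is processed before the reserved-memory node, and that with 8a5a75e5e9e55 applied a no-map request is refused when the region is already reserved. `ToyMemblock`, `boot()` and the `refuse_if_reserved` flag are hypothetical names; this is a toy, not the kernel's memblock implementation.

```python
SECMON = (0x05000000, 0x00300000)  # (base, size) of secmon@5000000


class ToyMemblock:
    """Very small stand-in for the kernel's early memory bookkeeping."""

    def __init__(self):
        self.reserved = set()  # regions the kernel must not allocate from
        self.nomap = set()     # regions also kept out of the linear map

    def reserve(self, base, size):
        self.reserved.add((base, size))

    def is_region_reserved(self, base, size):
        return any(base < b + s and b < base + size for b, s in self.reserved)

    def reserve_dt_node(self, base, size, nomap, refuse_if_reserved):
        """Model of handling one reserved-memory node."""
        if nomap:
            if refuse_if_reserved and self.is_region_reserved(base, size):
                return False  # corresponds to "failed to reserve memory"
            self.nomap.add((base, size))
            return True
        self.reserve(base, size)
        return True


def boot(refuse_if_reserved, with_memreserve=True):
    """Return (node_accepted, region_is_unmapped) for the secmon region."""
    mb = ToyMemblock()
    if with_memreserve:
        mb.reserve(*SECMON)  # /memreserve/ entry added by the bootloader
    ok = mb.reserve_dt_node(*SECMON, nomap=True,
                            refuse_if_reserved=refuse_if_reserved)
    return ok, SECMON in mb.nomap


print("with check:       ", boot(refuse_if_reserved=True))
print("check reverted:   ", boot(refuse_if_reserved=False))
print("no /memreserve/:  ", boot(refuse_if_reserved=True, with_memreserve=False))
```

In this model only the first case leaves the secure monitor region mapped, which is the configuration in which the crashes were reported; the other two correspond to the revert and to removing meson_board_add_reserved_memory() as tested above.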
end of thread, other threads:[~2021-07-26 12:35 UTC | newest]
Thread overview: 34+ messages
2021-05-17 9:14 Random reboots on ODROID-N2+ Stefan Agner
2021-05-17 9:14 ` Stefan Agner
2021-05-17 21:09 ` Martin Blumenstingl
2021-05-17 21:09 ` Martin Blumenstingl
2021-05-18 9:16 ` Stefan Agner
2021-05-18 9:16 ` Stefan Agner
2021-05-18 9:35 ` Neil Armstrong
2021-05-18 9:35 ` Neil Armstrong
2021-05-18 1:33 ` Andrew Lunn
2021-05-18 1:33 ` Andrew Lunn
2021-05-18 10:15 ` Stefan Agner
2021-05-18 10:15 ` Stefan Agner
2021-05-19 20:09 ` Stefan Agner
2021-05-19 20:09 ` Stefan Agner
2021-06-22 7:39 ` Stefan Agner
2021-06-22 7:39 ` Stefan Agner
2021-07-23 14:25 ` Byron Stanoszek
2021-07-23 14:25 ` Byron Stanoszek
2021-07-23 15:36 ` Robin Murphy
2021-07-23 15:36 ` Robin Murphy
2021-07-23 15:56 ` Stefan Agner
2021-07-23 15:56 ` Stefan Agner
2021-07-23 16:14 ` Robin Murphy
2021-07-23 16:14 ` Robin Murphy
2021-07-23 17:47 ` Robin Murphy
2021-07-23 17:47 ` Robin Murphy
2021-07-23 19:48 ` Stefan Agner
2021-07-23 19:48 ` Stefan Agner
2021-07-26 7:54 ` Neil Armstrong
2021-07-26 7:54 ` Neil Armstrong
2021-07-26 12:07 ` Stefan Agner
2021-07-26 12:07 ` Stefan Agner
2021-07-26 12:31 ` Robin Murphy
2021-07-26 12:31 ` Robin Murphy