* Random reboots on ODROID-N2+
From: Stefan Agner @ 2021-05-17  9:14 UTC
To: linux-amlogic, linux-arm-kernel
Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl

Hi,

We are currently testing a new release using Linux 5.10.33. Since then I
have received several reports of random reboots every couple of days.
Unfortunately the log (journald) doesn't show anything, just a hard cut
at some point.

After running a serial console on several instances, I was able to catch
this stack trace:

[202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
[202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
[202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
[202983.988160] sp : ffff8000132a3ae0
[202983.988160] x29: ffff8000132a3ae0 x28: ffff8000132a3bf0
[202983.988164] x27: 00000000fb0000e0 x26: ffff8000132a3d58
[202983.988165] x25: 0000000000000073 x24: ffff000007963e24
[202983.988167] x23: ffff8000132a3bf0 x22: ffff000005a72a80
[202983.988169] x21: 0000000000000011 x20: 0000000000000073
[202983.988170] x19: ffff000001a92c00 x18: 0000000000000001
[202983.988172] x17: 0000000000000000 x16: 0000000000000000
[202983.988173] x15: ffff8000132a3460 x14: 00000000ac1e2001
[202983.988175] x13: ffff0000079181a0 x12: 0000000000000028
[202983.988176] x11: ffff8000d3407000 x10: ffff800010ea8af0
[202983.988178] x9 : 000000000000001b x8 : ffff000007963e00
[202983.988179] x7 : ffff000000000000 x6 : 0000046a76b5fe28
[202983.988181] x5 : 0000000000941cc2 x4 : 0000000000000000
[202983.988182] x3 : 0000000000000001 x2 : ffff8000d3407000
[202983.988184] x1 : ffff00002f6e0000 x0 : 0000000100000001
[202983.988186] Kernel panic - not syncing: Asynchronous SError Interrupt
[202983.988187] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988188] Call trace:
[202983.988188] dump_backtrace+0x0/0x1a0
[202983.988189] show_stack+0x18/0x70
[202983.988190] dump_stack+0xd0/0x12c
[202983.988190] panic+0x170/0x338
[202983.988191] nmi_panic+0x8c/0x90
[202983.988191] arm64_serror_panic+0x78/0x84
[202983.988192] do_serror+0x38/0xa0
[202983.988193] el1_error+0x88/0x108
[202983.988193] udp_send_skb.isra.0+0x178/0x390
[202983.988194] udp_sendmsg+0x7c8/0x9c0
[202983.988194] inet_sendmsg+0x44/0x70
[202983.988195] sock_sendmsg+0x4c/0x60
[202983.988196] __sys_sendto+0xd0/0x140
[202983.988196] __arm64_sys_sendto+0x28/0x40
[202983.988197] el0_svc_common.constprop.0+0x78/0x1a0
[202983.988197] do_el0_svc+0x24/0x90
[202983.988198] el0_svc+0x14/0x20
[202983.988199] el0_sync_handler+0xb0/0xc0
[202983.988199] el0_sync+0x178/0x180
[202983.988211] SMP: stopping secondary CPUs
[202983.988212] Kernel Offset: disabled
[202983.988212] CPU features: 0x0240002,61082004
[202983.988213] Memory Limit: none

Has anyone observed such an issue? I am pretty sure that this is a new
issue, as we have many installations using Linux 5.9.16 that run stable
on the same hardware.

Now that I can tell that it is network related, I'll try to increase the
network load to see if I can find a quicker way to reproduce this.

--
Stefan

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
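As a reading aid for the "code 0xbf000000" value in the first line of the trace: on arm64 that number is the ESR_ELx syndrome value. Below is a minimal sketch of splitting it into its fields, assuming the architectural layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], with ISS bit 24 being the IDS flag for SError). The struct and function names are made up for illustration and are not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields; not a kernel type. */
struct esr_fields {
	uint32_t ec;   /* exception class, bits [31:26] */
	bool     il;   /* instruction length bit, bit 25 */
	uint32_t iss;  /* instruction specific syndrome, bits [24:0] */
	bool     ids;  /* SError only: implementation defined syndrome, ISS bit 24 */
};

/* Split a raw ESR_ELx value into the fields listed above. */
struct esr_fields esr_decode(uint32_t esr)
{
	struct esr_fields f;

	f.ec  = (esr >> 26) & 0x3f;
	f.il  = ((esr >> 25) & 0x1) != 0;
	f.iss = esr & 0x1ffffff;
	f.ids = ((esr >> 24) & 0x1) != 0;
	return f;
}
```

For the value from the trace, esr_decode(0xbf000000) gives EC 0x2f (the SError class) with IDS set and all remaining syndrome bits zero, so the syndrome is implementation defined and the code alone says little about the cause.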
* Re: Random reboots on ODROID-N2+ 2021-05-17 9:14 Random reboots on ODROID-N2+ Stefan Agner @ 2021-05-17 21:09 ` Martin Blumenstingl 2021-05-18 9:16 ` Stefan Agner 2021-05-18 1:33 ` Andrew Lunn ` (2 subsequent siblings) 3 siblings, 1 reply; 17+ messages in thread From: Martin Blumenstingl @ 2021-05-17 21:09 UTC (permalink / raw) To: Stefan Agner Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet, Kevin Hilman Hi Stefan, On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote: > > Hi, > > We are currently testing a new release using Linux 5.10.33. I've > received since several reports of random reboots every couple of days. > Unfortunately the log (journald) doesn't show anything, just a hard cut > at some point. I'm sorry to hear that some things are not working right [...] > [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT) > [202983.988188] Call trace: > [202983.988188] dump_backtrace+0x0/0x1a0 > [202983.988189] show_stack+0x18/0x70 > [202983.988190] dump_stack+0xd0/0x12c > [202983.988190] panic+0x170/0x338 > [202983.988191] nmi_panic+0x8c/0x90 > [202983.988191] arm64_serror_panic+0x78/0x84 > [202983.988192] do_serror+0x38/0xa0 > [202983.988193] el1_error+0x88/0x108 > [202983.988193] udp_send_skb.isra.0+0x178/0x390 > [202983.988194] udp_sendmsg+0x7c8/0x9c0 > [202983.988194] inet_sendmsg+0x44/0x70 > [202983.988195] sock_sendmsg+0x4c/0x60 > [202983.988196] __sys_sendto+0xd0/0x140 > [202983.988196] __arm64_sys_sendto+0x28/0x40 > [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0 > [202983.988197] do_el0_svc+0x24/0x90 > [202983.988198] el0_svc+0x14/0x20 > [202983.988199] el0_sync_handler+0xb0/0xc0 > [202983.988199] el0_sync+0x178/0x180 > [202983.988211] SMP: stopping secondary CPUs > [202983.988212] Kernel Offset: disabled > [202983.988212] CPU features: 0x0240002,61082004 > [202983.988213] Memory Limit: none that looks weird > Anyone observed such an issue? 
I am pretty sure that this is a new issue
> as we have many installations using Linux 5.9.16 running stable on the
> same hardware,.

I haven't, but I am currently trying to hunt down a (probably unrelated)
Ethernet issue on an older Meson8m2 SoC.
All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity, plus
there's a little bit of "glue" IP for the xMII connecting to the SoC's
IO pads.

I think it's a good idea to involve the netdev and (probably even more
important) stmmac maintainers.
Anything skb related is handled by the stmmac driver.
So I am hoping that someone with expertise in that area can give some
hints for debugging or reproducing this.

Best regards,
Martin
* Re: Random reboots on ODROID-N2+ 2021-05-17 21:09 ` Martin Blumenstingl @ 2021-05-18 9:16 ` Stefan Agner 2021-05-18 9:35 ` Neil Armstrong 0 siblings, 1 reply; 17+ messages in thread From: Stefan Agner @ 2021-05-18 9:16 UTC (permalink / raw) To: Martin Blumenstingl Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet, Kevin Hilman Hi Martin, On 2021-05-17 23:09, Martin Blumenstingl wrote: > Hi Stefan, > > On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote: >> >> Hi, >> >> We are currently testing a new release using Linux 5.10.33. I've >> received since several reports of random reboots every couple of days. >> Unfortunately the log (journald) doesn't show anything, just a hard cut >> at some point. > I'm sorry to hear that some things are not working right > > [...] >> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT) >> [202983.988188] Call trace: >> [202983.988188] dump_backtrace+0x0/0x1a0 >> [202983.988189] show_stack+0x18/0x70 >> [202983.988190] dump_stack+0xd0/0x12c >> [202983.988190] panic+0x170/0x338 >> [202983.988191] nmi_panic+0x8c/0x90 >> [202983.988191] arm64_serror_panic+0x78/0x84 >> [202983.988192] do_serror+0x38/0xa0 >> [202983.988193] el1_error+0x88/0x108 >> [202983.988193] udp_send_skb.isra.0+0x178/0x390 >> [202983.988194] udp_sendmsg+0x7c8/0x9c0 >> [202983.988194] inet_sendmsg+0x44/0x70 >> [202983.988195] sock_sendmsg+0x4c/0x60 >> [202983.988196] __sys_sendto+0xd0/0x140 >> [202983.988196] __arm64_sys_sendto+0x28/0x40 >> [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0 >> [202983.988197] do_el0_svc+0x24/0x90 >> [202983.988198] el0_svc+0x14/0x20 >> [202983.988199] el0_sync_handler+0xb0/0xc0 >> [202983.988199] el0_sync+0x178/0x180 >> [202983.988211] SMP: stopping secondary CPUs >> [202983.988212] Kernel Offset: disabled >> [202983.988212] CPU features: 0x0240002,61082004 >> [202983.988213] Memory Limit: none > that looks weird > >> Anyone observed such an issue? 
I am pretty sure that this is a new issue
>> as we have many installations using Linux 5.9.16 running stable on the
>> same hardware,.
> I haven't but I am currently trying to hunt down a (probably
> unrelated) Ethernet issue on an older Meson8m2 SoC currently.
> All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity plus
> there's a little bit of "glue" IP for the xMII connecting to the SoC's
> IO pads
>
> I think it's a good idea to involve the netdev and (probably even more
> important) stmmac maintainers.
> Anything skb related is handled by the stmmac driver.
> So I am hoping that someone with expertise in that area can give any
> hints for debugging or reproducing this.

OK, I'll do that. I am currently waiting to see the same trace a second
time, just to make sure it's really always caused by that code path.

--
Stefan
* Re: Random reboots on ODROID-N2+ 2021-05-18 9:16 ` Stefan Agner @ 2021-05-18 9:35 ` Neil Armstrong 0 siblings, 0 replies; 17+ messages in thread From: Neil Armstrong @ 2021-05-18 9:35 UTC (permalink / raw) To: Stefan Agner, Martin Blumenstingl Cc: linux-amlogic, linux-arm-kernel, Jerome Brunet, Kevin Hilman Hi Stefan, On 18/05/2021 11:16, Stefan Agner wrote: > Hi Martin, > > On 2021-05-17 23:09, Martin Blumenstingl wrote: >> Hi Stefan, >> >> On Mon, May 17, 2021 at 11:14 AM Stefan Agner <stefan@agner.ch> wrote: >>> >>> Hi, >>> >>> We are currently testing a new release using Linux 5.10.33. I've >>> received since several reports of random reboots every couple of days. >>> Unfortunately the log (journald) doesn't show anything, just a hard cut >>> at some point. >> I'm sorry to hear that some things are not working right >> >> [...] >>> [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT) >>> [202983.988188] Call trace: >>> [202983.988188] dump_backtrace+0x0/0x1a0 >>> [202983.988189] show_stack+0x18/0x70 >>> [202983.988190] dump_stack+0xd0/0x12c >>> [202983.988190] panic+0x170/0x338 >>> [202983.988191] nmi_panic+0x8c/0x90 >>> [202983.988191] arm64_serror_panic+0x78/0x84 >>> [202983.988192] do_serror+0x38/0xa0 >>> [202983.988193] el1_error+0x88/0x108 >>> [202983.988193] udp_send_skb.isra.0+0x178/0x390 >>> [202983.988194] udp_sendmsg+0x7c8/0x9c0 >>> [202983.988194] inet_sendmsg+0x44/0x70 >>> [202983.988195] sock_sendmsg+0x4c/0x60 >>> [202983.988196] __sys_sendto+0xd0/0x140 >>> [202983.988196] __arm64_sys_sendto+0x28/0x40 >>> [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0 >>> [202983.988197] do_el0_svc+0x24/0x90 >>> [202983.988198] el0_svc+0x14/0x20 >>> [202983.988199] el0_sync_handler+0xb0/0xc0 >>> [202983.988199] el0_sync+0x178/0x180 >>> [202983.988211] SMP: stopping secondary CPUs >>> [202983.988212] Kernel Offset: disabled >>> [202983.988212] CPU features: 0x0240002,61082004 >>> [202983.988213] Memory Limit: none >> that looks weird >> >>> Anyone 
observed such an issue? I am pretty sure that this is a new issue
>>> as we have many installations using Linux 5.9.16 running stable on the
>>> same hardware,.
>> I haven't but I am currently trying to hunt down a (probably
>> unrelated) Ethernet issue on an older Meson8m2 SoC currently.
>> All Amlogic Meson SoCs use a DWMAC IP for Ethernet connectivity plus
>> there's a little bit of "glue" IP for the xMII connecting to the SoC's
>> IO pads
>>
>> I think it's a good idea to involve the netdev and (probably even more
>> important) stmmac maintainers.
>> Anything skb related is handled by the stmmac driver.
>> So I am hoping that someone with expertise in that area can give any
>> hints for debugging or reproducing this.
>
> Ok I'll do that, I currently wait to see the same trace a second time,
> just to make sure its really caused by that code path always.

A good next step would be to bisect between the last known working
version and the version you are currently running.

An SError Interrupt looks like a HW issue caused by a change in v5.10.

Neil

>
> --
> Stefan
>
* Re: Random reboots on ODROID-N2+
From: Andrew Lunn @ 2021-05-18  1:33 UTC
To: Stefan Agner
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl

On Mon, May 17, 2021 at 11:14:18AM +0200, Stefan Agner wrote:
> Hi,
>
> We are currently testing a new release using Linux 5.10.33. I've
> received since several reports of random reboots every couple of days.
> Unfortunately the log (journald) doesn't show anything, just a hard cut
> at some point.
>
> After running serial console on several instances, I was able to catch
> this stack trace:
>
> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
> #1
> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390

Hi Stefan

Could you generate net/ipv4/udp.lst so we can see what
udp_send_skb.isra.0+0x178/0x390 is trying to do, and what bit of C
code it maps to.

Andrew
* Re: Random reboots on ODROID-N2+
From: Stefan Agner @ 2021-05-18 10:15 UTC
To: Andrew Lunn
Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl

On 2021-05-18 03:33, Andrew Lunn wrote:
> On Mon, May 17, 2021 at 11:14:18AM +0200, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>> #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> Hi Stefan

Hi Andrew,

>
> Could you generate net/ipv4/udp.lst so we can see what
> udp_send_skb.isra.0+0x178/0x390 is trying to do, and what bit of C
> code it maps to.

OK, I built net/ipv4/udp.lst using the same build environment (buildroot)
that the kernel which generated the stack trace was built with, so I
think this should match up:

ffff800010c1bb60 <udp_send_skb.isra.0>:
static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
...
		udp4_hwcsum(skb, fl4->saddr, fl4->daddr);
ffff800010c1bc78:	29450ae1 	ldp	w1, w2, [x23, #40]
ffff800010c1bc7c:	aa1303e0 	mov	x0, x19
ffff800010c1bc80:	94000000 	bl	ffff800010c184b0 <udp4_hwcsum>
			ffff800010c1bc80: R_AARCH64_CALL26	udp4_hwcsum
	err = ip_send_skb(sock_net(sk), skb);
ffff800010c1bc84:	f9401ac0 	ldr	x0, [x22, #48]
ffff800010c1bc88:	aa1303e1 	mov	x1, x19
ffff800010c1bc8c:	94000000 	bl	0 <ip_send_skb>
			ffff800010c1bc8c: R_AARCH64_CALL26	ip_send_skb
	if (err) {
ffff800010c1bc90:	350008e0 	cbnz	w0, ffff800010c1bdac <udp_send_skb.isra.0+0x24c>
...
	u64 pc = READ_ONCE(ti->preempt_count);
ffff800010c1bcd4:	f9400820 	ldr	x0, [x1, #16]
	WRITE_ONCE(ti->preempt.count, --pc);
ffff800010c1bcd8:	d1000400 	sub	x0, x0, #0x1
ffff800010c1bcdc:	b9001020 	str	w0, [x1, #16]
	return !pc || !READ_ONCE(ti->preempt_count);
...

The full udp.lst file:
https://drive.google.com/file/d/1j0RKOfuMXmCRWILpkG3uk_beohWrr-ho/view?usp=sharing

Since I only have this one trace, I am not 100% sure whether this trace
is just a random one or always the case. But things seem to add up to
me: mdns-repeater deals with UDP packets, and it seems that the code
tries to make use of HW checksumming (judging from lr)? This would
explain why only this platform shows the problem.

--
Stefan
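To relate the "symbol+0xOFFSET/0xSIZE" notation in the trace to the listing above, the offset is added to the symbol's start address. Here is a small sketch of that arithmetic, using the start address ffff800010c1bb60 of udp_send_skb.isra.0 from the listing. The helper name symbol_offset_to_addr is made up for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: turn "symbol+0xoff" from an oops into the absolute
 * address that appears in the left column of a .lst/objdump listing.
 */
uint64_t symbol_offset_to_addr(uint64_t symbol_start, uint64_t offset)
{
	return symbol_start + offset;
}
```

With symbol_start = 0xffff800010c1bb60, offset 0x178 (pc) gives ffff800010c1bcd8, the "sub x0, x0, #0x1" in the preempt-count update, and offset 0x130 (lr) gives ffff800010c1bc90, the cbnz right after the call to ip_send_skb. Because the panic is reported as an asynchronous SError, the pc is presumably only where the CPU happened to be when the error was delivered, not necessarily the instruction that triggered it.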
* Re: Random reboots on ODROID-N2+ 2021-05-17 9:14 Random reboots on ODROID-N2+ Stefan Agner 2021-05-17 21:09 ` Martin Blumenstingl 2021-05-18 1:33 ` Andrew Lunn @ 2021-05-19 20:09 ` Stefan Agner 2021-06-22 7:39 ` Stefan Agner 3 siblings, 0 replies; 17+ messages in thread From: Stefan Agner @ 2021-05-19 20:09 UTC (permalink / raw) To: linux-amlogic, linux-arm-kernel Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl, andrew On 2021-05-17 11:14, Stefan Agner wrote: > Hi, > > We are currently testing a new release using Linux 5.10.33. I've > received since several reports of random reboots every couple of days. > Unfortunately the log (journald) doesn't show anything, just a hard cut > at some point. > > After running serial console on several instances, I was able to catch > this stack trace: > > [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError > [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 > #1 > [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT) > [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--) > [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390 > [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390 > [202983.988160] sp : ffff8000132a3ae0 > [202983.988160] x29: ffff8000132a3ae0 x28: ffff8000132a3bf0 > [202983.988164] x27: 00000000fb0000e0 x26: ffff8000132a3d58 > [202983.988165] x25: 0000000000000073 x24: ffff000007963e24 > [202983.988167] x23: ffff8000132a3bf0 x22: ffff000005a72a80 > [202983.988169] x21: 0000000000000011 x20: 0000000000000073 > [202983.988170] x19: ffff000001a92c00 x18: 0000000000000001 > [202983.988172] x17: 0000000000000000 x16: 0000000000000000 > [202983.988173] x15: ffff8000132a3460 x14: 00000000ac1e2001 > [202983.988175] x13: ffff0000079181a0 x12: 0000000000000028 > [202983.988176] x11: ffff8000d3407000 x10: ffff800010ea8af0 > [202983.988178] x9 : 000000000000001b x8 : ffff000007963e00 > [202983.988179] x7 : ffff000000000000 x6 : 
0000046a76b5fe28 > [202983.988181] x5 : 0000000000941cc2 x4 : 0000000000000000 > [202983.988182] x3 : 0000000000000001 x2 : ffff8000d3407000 > [202983.988184] x1 : ffff00002f6e0000 x0 : 0000000100000001 > [202983.988186] Kernel panic - not syncing: Asynchronous SError > Interrupt > [202983.988187] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 > #1 > [202983.988187] Hardware name: Hardkernel ODROID-N2Plus (DT) > [202983.988188] Call trace: > [202983.988188] dump_backtrace+0x0/0x1a0 > [202983.988189] show_stack+0x18/0x70 > [202983.988190] dump_stack+0xd0/0x12c > [202983.988190] panic+0x170/0x338 > [202983.988191] nmi_panic+0x8c/0x90 > [202983.988191] arm64_serror_panic+0x78/0x84 > [202983.988192] do_serror+0x38/0xa0 > [202983.988193] el1_error+0x88/0x108 > [202983.988193] udp_send_skb.isra.0+0x178/0x390 > [202983.988194] udp_sendmsg+0x7c8/0x9c0 > [202983.988194] inet_sendmsg+0x44/0x70 > [202983.988195] sock_sendmsg+0x4c/0x60 > [202983.988196] __sys_sendto+0xd0/0x140 > [202983.988196] __arm64_sys_sendto+0x28/0x40 > [202983.988197] el0_svc_common.constprop.0+0x78/0x1a0 > [202983.988197] do_el0_svc+0x24/0x90 > [202983.988198] el0_svc+0x14/0x20 > [202983.988199] el0_sync_handler+0xb0/0xc0 > [202983.988199] el0_sync+0x178/0x180 > [202983.988211] SMP: stopping secondary CPUs > [202983.988212] Kernel Offset: disabled > [202983.988212] CPU features: 0x0240002,61082004 > [202983.988213] Memory Limit: none > A second stack trace, same build etc. 
but different board (instance):

[48112.247242] SError Interrupt on CPU5, code 0xbf000000 -- SError
[48112.247244] CPU: 5 PID: 264945 Comm: python3 Not tainted 5.10.33 #1
[48112.247245] Hardware name: Hardkernel ODROID-N2Plus (DT)
[48112.247246] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[48112.247247] pc : __rcu_read_lock+0x18/0x20
[48112.247248] lr : lock_page_memcg+0x28/0xd0
[48112.247249] sp : ffff800013e238e0
[48112.247249] x29: ffff800013e238e0 x28: ffff800013e23b18
[48112.247252] x27: ffff000055c5c780 x26: 0000ffff9163c000
[48112.247254] x25: ffff0000053000c0 x24: 00e00000d40e3bc3
[48112.247256] x23: fffffe00033038c0 x22: ffff800013e23a18
[48112.247257] x21: 0000ffff9163b000 x20: fffffe00033038c0
[48112.247259] x19: fffffe00033038c0 x18: 0000000000000000
[48112.247261] x17: 0000000000000000 x16: 0000000000000000
[48112.247262] x15: 0000000000000002 x14: 0000000000000001
[48112.247264] x13: fffffe0001acdd08 x12: 0000000000000000
[48112.247265] x11: ffff0000e4650100 x10: ffff00004c640000
[48112.247267] x9 : 000000000000000c x8 : 00000000ffffffff
[48112.247268] x7 : 0000000000000020 x6 : 0000000000000000
[48112.247270] x5 : 00000000000d40e3 x4 : 0000ffff9163b000
[48112.247271] x3 : 00000000ffffffff x2 : 0000000000000001
[48112.247273] x1 : ffff000000182ac0 x0 : 0000000000000001
[48112.247275] Kernel panic - not syncing: Asynchronous SError Interrupt
[48112.247275] CPU: 5 PID: 264945 Comm: python3 Not tainted 5.10.33 #1
[48112.247276] Hardware name: Hardkernel ODROID-N2Plus (DT)
[48112.247277] Call trace:
[48112.247277] dump_backtrace+0x0/0x1a0
[48112.247278] show_stack+0x18/0x70
[48112.247279] dump_stack+0xd0/0x12c
[48112.247279] panic+0x170/0x338
[48112.247280] nmi_panic+0x8c/0x90
[48112.247280] arm64_serror_panic+0x78/0x84
[48112.247281] do_serror+0x38/0xa0
[48112.247281] el1_error+0x88/0x108
[48112.247282] __rcu_read_lock+0x18/0x20
[48112.247283] page_remove_rmap+0x1c/0x560
[48112.247283] unmap_page_range+0x5b0/0x7b0
[48112.247284] unmap_single_vma+0x4c/0xb0
[48112.247285] unmap_vmas+0x70/0xf0
[48112.247285] exit_mmap+0xc8/0x180
[48112.247286] mmput+0x7c/0x150
[48112.247286] begin_new_exec+0x2d4/0xa90
[48112.247287] load_elf_binary+0x38c/0x1800
[48112.247288] bprm_execve+0x29c/0x5d0
[48112.247288] do_execveat_common.isra.0+0x178/0x1d0
[48112.247289] __arm64_sys_execve+0x40/0x60
[48112.247290] el0_svc_common.constprop.0+0x78/0x1a0
[48112.247290] do_el0_svc+0x24/0x90
[48112.247291] el0_svc+0x14/0x20
[48112.247291] el0_sync_handler+0xb0/0xc0
[48112.247292] el0_sync+0x178/0x180
[48112.247303] SMP: stopping secondary CPUs
[48112.247304] Kernel Offset: disabled
[48112.247305] CPU features: 0x0240002,61082004
[48112.247305] Memory Limit: none

This stack trace does not look related to the first one to me...

--
Stefan
* Re: Random reboots on ODROID-N2+ 2021-05-17 9:14 Random reboots on ODROID-N2+ Stefan Agner ` (2 preceding siblings ...) 2021-05-19 20:09 ` Stefan Agner @ 2021-06-22 7:39 ` Stefan Agner 2021-07-23 14:25 ` Byron Stanoszek 3 siblings, 1 reply; 17+ messages in thread From: Stefan Agner @ 2021-06-22 7:39 UTC (permalink / raw) To: linux-amlogic, linux-arm-kernel Cc: Neil Armstrong, Jerome Brunet, Kevin Hilman, Martin Blumenstingl On 2021-05-17 11:14, Stefan Agner wrote: > Hi, > > We are currently testing a new release using Linux 5.10.33. I've > received since several reports of random reboots every couple of days. > Unfortunately the log (journald) doesn't show anything, just a hard cut > at some point. > > After running serial console on several instances, I was able to catch > this stack trace: > > [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError > [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 > #1 > [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT) > [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--) > [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390 > [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390 <snip> We do see those crashes in similar frequency with Linux 5.12: [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError [129988.642348] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1 [129988.642350] Hardware name: Hardkernel ODROID-N2Plus (DT) [129988.642351] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--) [129988.642352] pc : free_page_and_swap_cache+0x0/0x110 [129988.642352] lr : tlb_remove_table_rcu+0x30/0x60 [129988.642353] sp : ffff8000115bbdf0 [129988.642354] x29: ffff8000115bbdf0 x28: ffff800010103a18 [129988.642358] x27: 000000000000000a x26: ffff000000120000 [129988.642360] x25: ffff000000120000 x24: ffff8000115bbe90 [129988.642362] x23: ffff800011456680 x22: ffff0000e07df970 [129988.642365] x21: 0000000000000003 x20: 0000000000000001 [129988.642367] 
x19: ffff000005300000 x18: 0000000000000000
[129988.642369] x17: 0000000000000000 x16: 0000000000000000
[129988.642371] x15: 0000000000000000 x14: 0000000000000500
[129988.642373] x13: 0000000000000002 x12: 0000000000000000
[129988.642375] x11: ffff8000cf5e6000 x10: ffff000028212800
[129988.642377] x9 : 0000000000000001 x8 : 00000000fffff1b8
[129988.642379] x7 : 0000000000015f40 x6 : 0000000000000001
[129988.642381] x5 : ffff80001007cf4c x4 : 0000000000000007
[129988.642383] x3 : ffff0000e07e2e78 x2 : ffff000025a2bd00
[129988.642385] x1 : ffff800010208b60 x0 : fffffc00002e9a80
[129988.642387] Kernel panic - not syncing: Asynchronous SError Interrupt
[129988.642388] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.12.10 #1
[129988.642389] Hardware name: Hardkernel ODROID-N2Plus (DT)
[129988.642390] Call trace:
[129988.642391] dump_backtrace+0x0/0x1a0
[129988.642392] show_stack+0x18/0x70
[129988.642392] dump_stack+0xd0/0x12c
[129988.642393] panic+0x170/0x338
[129988.642394] nmi_panic+0x8c/0x90
[129988.642395] arm64_serror_panic+0x78/0x84
[129988.642395] do_serror+0x38/0xa0
[129988.642396] el1_error+0x80/0xf8
[129988.642397] free_page_and_swap_cache+0x0/0x110
[129988.642398] rcu_core+0x310/0x5d0
[129988.642398] rcu_core_si+0x10/0x20
[129988.642399] _stext+0x128/0x28c
[129988.642400] irq_exit+0xd8/0x100
[129988.642401] __handle_domain_irq+0x68/0xc0
[129988.642401] gic_handle_irq+0xa8/0xe0
[129988.642402] el1_irq+0xbc/0x180
[129988.642403] arch_cpu_idle+0x18/0x30
[129988.642404] default_idle_call+0x20/0x68
[129988.642404] do_idle+0x218/0x270
[129988.642405] cpu_startup_entry+0x24/0x70
[129988.642406] secondary_start_kernel+0x178/0x190
[129988.642418] SMP: stopping secondary CPUs
[129988.642419] Kernel Offset: disabled
[129988.642420] CPU features: 0x00240002,61082004
[129988.642421] Memory Limit: none

It seems to be load and/or hardware dependent, since we see it on some
devices quite frequently (every few days), while on others it takes
multiple weeks. Of course the ones where we see it frequently are the
ones in production :).

I am currently trying different stress-ng runs and other loads to
increase the crash rate, before then trying to git bisect it.

--
Stefan
* Re: Random reboots on ODROID-N2+
  2021-06-22  7:39 ` Stefan Agner
@ 2021-07-23 14:25   ` Byron Stanoszek
  2021-07-23 15:36     ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Byron Stanoszek @ 2021-07-23 14:25 UTC
  To: Stefan Agner
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
      Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On Tue, 22 Jun 2021, Stefan Agner wrote:

> On 2021-05-17 11:14, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33
>> #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> <snip>
>
> We do see those crashes in similar frequency with Linux 5.12:
>
> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>
> It seems load and/or hardware dependent since we see it on some devices
> quite frequent (every few days), and on others it takes multiple weeks.
> Of course the ones we see it frequently are the ones in production :).
>
> I am currently trying different stress-ng and other load to accelerate
> the crash rate before then trying to git bisect it.

I have an Odroid-N2+ and was able to track this problem down. The problem is
related to the following dmesg line that reads "failed to reserve memory"
below:

  Machine model: Hardkernel ODROID-N2Plus
  memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
  memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
  memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
  memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
  OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
  memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
  OF: reserved mem: node linux,cma compatible matching fail
  memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
  ...

A subsequent "cat /proc/iomem" shows that this memory region is still reserved
and the system appears to operate normally, until eventually the SError
Interrupt comes in under heavy memory/page-cache usage. The difference with
later kernels is that now the memory at 0x5000000-0x52fffff is registered under
the "System RAM" memory area, whereas previous kernels had dropped it from
"System RAM".

The culprit is this new code introduced in Linux 5.12, in this function in
drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

  int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
                                          phys_addr_t size, bool nomap)
  {
          if (nomap) {
                  /*
                   * If the memory is already reserved (by another region), we
                   * should not allow it to be marked nomap.
                   */
                  if (memblock_is_region_reserved(base, size))   <------
                          return -EBUSY;                         <------

                  return memblock_mark_nomap(base, size);
          }
          return memblock_reserve(base, size);
  }

"nomap" is true, due to this text being present in the FDT:

  reserved-memory {
          ranges;
          secmon_reserved: secmon@5000000 {
                  reg = <0x0 0x05000000 0x0 0x300000>;
                  no-map;
          };
          ...

But memblock_is_region_reserved() is returning true because there is already an
entry for 0x5000000-0x52fffff in the memory map, which is already marked
reserved, at the time the __reserved_mem_reserve_reg() function is called.
(Perhaps this is being set reserved by u-boot? -- I did not research that far.)

This function is defined as:

  bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
  {
          return memblock_overlaps_region(&memblock.reserved, base, size);
  }

Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
reserved region "0x5000000-0x52fffff", the function returns true.

If I comment out the "if (memblock_is_region_reserved(base, size))" code and
allow it to mark the region no-map, then the memory area is properly removed
from the "System RAM" area and the crashing stops.

I've had the system up and running for 15 days now under heavy load without any
crashes, using just the following patch as workaround:


--- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
 					phys_addr_t size, bool nomap)
 {
 	if (nomap) {
-		/*
-		 * If the memory is already reserved (by another region), we
-		 * should not allow it to be marked nomap.
-		 */
-		if (memblock_is_region_reserved(base, size))
-			return -EBUSY;
-
 		return memblock_mark_nomap(base, size);
 	}
 	return memblock_reserve(base, size);


The above patch applies to later versions of Linux 5.10.x through 5.12.x as
well.

Perhaps a more proper fix is to allow the no-map to still proceed, in the case
that the existing reserved region is identical (same start/end) to the region
getting marked no-map.

 -Byron
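
The difference between the overlap check quoted above and the suggested
"identical region" relaxation can be modeled in a few lines of userspace C.
This is only a sketch under stated assumptions: struct region and the three
helpers below are hypothetical names, not kernel APIs, and the real memblock
code tracks flags and merged regions that are ignored here.

```c
/*
 * Userspace model only. struct region, regions_overlap(),
 * nomap_allowed_strict() and nomap_allowed_if_identical() are
 * hypothetical names used for illustration, not kernel functions.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct region {
	uint64_t base;
	uint64_t size;
};

/* True if the half-open ranges [base, base + size) intersect. */
static bool regions_overlap(const struct region *a, const struct region *b)
{
	return a->base < b->base + b->size && b->base < a->base + a->size;
}

/*
 * Models the check quoted above: any overlap with an already-reserved
 * region refuses the no-map request (the kernel returns -EBUSY there).
 */
static bool nomap_allowed_strict(const struct region *reserved, size_t n,
				 const struct region *req)
{
	assert(req != NULL);
	for (size_t i = 0; i < n; i++)
		if (regions_overlap(&reserved[i], req))
			return false;
	return true;
}

/*
 * Models the suggested relaxation: an overlap is tolerated only when the
 * existing reservation has exactly the same start and size as the request.
 */
static bool nomap_allowed_if_identical(const struct region *reserved, size_t n,
				       const struct region *req)
{
	assert(req != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!regions_overlap(&reserved[i], req))
			continue;
		if (reserved[i].base != req->base ||
		    reserved[i].size != req->size)
			return false;
	}
	return true;
}
```

With the two reservations from the dmesg output above as input, the strict
variant rejects the secmon request while the relaxed one accepts it; a request
that only partially overlaps an existing reservation is rejected by both.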
* Re: Random reboots on ODROID-N2+
  2021-07-23 14:25   ` Byron Stanoszek
@ 2021-07-23 15:36     ` Robin Murphy
  2021-07-23 15:56       ` Stefan Agner
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2021-07-23 15:36 UTC
  To: Byron Stanoszek, Stefan Agner
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
      Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 15:25, Byron Stanoszek wrote:

<snip>

> Perhaps a more proper fix is to allow the no-map to still proceed, in
> the case
> that the existing reserved region is identical (same start/end) to the
> region
> getting marked no-map.

If U-Boot is marking regions with the wrong type/attributes in the EFI
memory map, then the best thing to do would be to fix that. I see a
fairly recent commit which looks suspiciously relevant:

https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

Booting with "efi=debug" should (among other things) print the memory
map at boot if you want to double-check that that is the source of the
mismatch. Our EFI code should be perfectly capable of setting the
memblock flag if the region *is* described appropriately, see
reserve_regions() in drivers/firmware/efi/efi-init.c.

Robin.
* Re: Random reboots on ODROID-N2+
  2021-07-23 15:36     ` Robin Murphy
@ 2021-07-23 15:56       ` Stefan Agner
  2021-07-23 16:14         ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-07-23 15:56 UTC
  To: Robin Murphy, Byron Stanoszek
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
      Kevin Hilman, Martin Blumenstingl, Mike Rapoport

Hi Byron, Hi Robin,

Very interesting findings!

On 2021-07-23 17:36, Robin Murphy wrote:
> On 2021-07-23 15:25, Byron Stanoszek wrote:

<snip>

>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

In my 5.9 builds that line isn't present, and it seems all logs I stored
from 5.10 builds have the change already and show this line.

<snip>

>> The culprit is this new code introduced in Linux 5.12, in this function in
>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

It seems that patch also got backported, so that is why I see it with
5.10 as well.

<snip>

>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>> well.

Even though it is probably not the correct solution, I'll give this a try on
my end just to verify we are indeed experiencing the same problem and the
change fixes it for me too.

>>
>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>> that the existing reserved region is identical (same start/end) to the region
>> getting marked no-map.
>
> If U-Boot is marking regions with the wrong type/attributes in the EFI
> memory map, then the best thing to do would be to fix that. I see a
> fairly recent commit which looks suspiciously relevant:
>
> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

It seems that this patch went into U-Boot 2021.04 which I am using, so
that (alone) seems not to fix the mapping.

>
> Booting with "efi=debug" should (among other things) print the memory
> map at boot if you want to double-check that that is the source of the
> mismatch. Our EFI code should be perfectly capable of setting the
> memblock flag if the region *is* described appropriately, see
> reserve_regions() in drivers/firmware/efi/efi-init.c.

Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:

[    0.000000] Machine model: Hardkernel ODROID-N2Plus
[    0.000000] efi: Getting UEFI parameters from /chosen in DT:
[    0.000000] efi: UEFI not found.
[    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

So it seems UEFI is not in play here?

--
Stefan
* Re: Random reboots on ODROID-N2+
  2021-07-23 15:56       ` Stefan Agner
@ 2021-07-23 16:14         ` Robin Murphy
  2021-07-23 17:47           ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2021-07-23 16:14 UTC
  To: Stefan Agner, Byron Stanoszek
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
      Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 16:56, Stefan Agner wrote:

<snip>

> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
> [    0.000000] efi: UEFI not found.
> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>
> So it seems UEFI is not in the play here?

Ah, OK, in that case I guess the question remains why does
early_init_dt_reserve_memory_arch() think the region is already
reserved? My instinctive assumption was an EFI memory map being present;
seeing that U-Boot does indeed reflect DT reservations there *and* has
had a likely-looking bug recently was then just overwhelmingly
suggestive :)

Robin.
* Re: Random reboots on ODROID-N2+
  2021-07-23 16:14         ` Robin Murphy
@ 2021-07-23 17:47           ` Robin Murphy
  2021-07-23 19:48             ` Stefan Agner
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2021-07-23 17:47 UTC
  To: Stefan Agner, Byron Stanoszek
  Cc: linux-amlogic, linux-arm-kernel, Neil Armstrong, Jerome Brunet,
      Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 17:14, Robin Murphy wrote:
> On 2021-07-23 16:56, Stefan Agner wrote:

<snip>

>> So it seems UEFI is not in the play here?
>
> Ah, OK, in that case I guess the question remains why does
> early_init_dt_reserve_memory_arch() think the region is already
> reserved? My instinctive assumption was an EFI memory map being present;
> seeing that U-Boot does indeed reflect DT reservations there *and* has
> had a likely-looking bug recently was then just overwhelmingly
> suggestive :)

Actually, poking at U-Boot a bit more I find
meson_board_add_reserved_memory() - can you check /sys/firmware/fdt and
see if the region ends up being passed as a /memreserve/ as well as a
proper reserved-memory node?

IIRC the semantics of /memreserve/ aren't really well-defined enough to
be suitable for the kind of things which require no-map, and my new
guess is that that's what ends up conflicting here.

Robin.
* Re: Random reboots on ODROID-N2+
  2021-07-23 17:47 ` Robin Murphy
@ 2021-07-23 19:48 ` Stefan Agner
  2021-07-26  7:54 ` Neil Armstrong
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-07-23 19:48 UTC
To: Robin Murphy
Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Neil Armstrong,
    Jerome Brunet, Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-23 19:47, Robin Murphy wrote:
> On 2021-07-23 17:14, Robin Murphy wrote:
>> On 2021-07-23 16:56, Stefan Agner wrote:
<snip>
>>>>
>>>> Booting with "efi=debug" should (among other things) print the memory
>>>> map at boot if you want to double-check that that is the source of the
>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>> memblock flag if the region *is* described appropriately, see
>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>
>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>> [    0.000000] efi: UEFI not found.
>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>
>>> So it seems UEFI is not in play here?
>>
>> Ah, OK, in that case I guess the question remains why does
>> early_init_dt_reserve_memory_arch() think the region is already
>> reserved? My instinctive assumption was an EFI memory map being
>> present; seeing that U-Boot does indeed reflect DT reservations there
>> *and* has had a likely-looking bug recently was then just
>> overwhelmingly suggestive :)
>
> Actually, poking at U-Boot a bit more I find
> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
> and see if the region ends up being passed as a /memreserve/ as well
> as a proper reserved-memory node?
>
> IIRC the semantics of /memreserve/ aren't really well-defined enough
> to be suitable for the kind of things which require no-map, and my new
> guess is that that's what ends up conflicting here.

Seems to be present in both:

On v5.12.10
# fdtdump /sys/firmware/fdt
...
/memreserve/ 0x5000000 0x300000;
...
    reserved-memory {
        #address-cells = <0x00000002>;
        #size-cells = <0x00000002>;
        ranges;
        secmon@5000000 {
            reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
            no-map;
            phandle = <0x00000068>;
        };
        linux,cma {
            compatible = "shared-dma-pool";
            reusable;
            size = <0x00000000 0x10000000>;
            alignment = <0x00000000 0x00400000>;
            linux,cma-default;
        };
    };

--
Stefan

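[Editorial sketch] The manual check above (is the same range both a /memreserve/ entry and a no-map reserved-memory node?) could be automated roughly as follows. This is a minimal sketch under stated assumptions: the input is text in the style of the fdtdump output quoted in this message, `#address-cells` and `#size-cells` are both 2 as shown there, and only leaf nodes are inspected. The function names and regular expressions are hypothetical helpers, not part of any existing tool.

```python
import re

# Matches lines such as: /memreserve/ 0x5000000 0x300000;
MEMRESERVE_RE = re.compile(
    r"/memreserve/\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s*;")
# Matches leaf nodes only (a body without nested braces).
NODE_RE = re.compile(r"(\S+)\s*\{([^{}]*)\}", re.S)
REG_RE = re.compile(r"reg\s*=\s*<([^>]*)>")
NOMAP_RE = re.compile(r"\bno-map\s*;")


def memreserve_entries(dts):
    """Return (base, size) for every /memreserve/ line."""
    return [(int(b, 16), int(s, 16)) for b, s in MEMRESERVE_RE.findall(dts)]


def nomap_nodes(dts):
    """Return (name, base, size) for leaf nodes carrying reg and no-map.

    Assumes two address cells and two size cells, as in the dump above.
    """
    found = []
    for name, body in NODE_RE.findall(dts):
        reg = REG_RE.search(body)
        if reg is None or not NOMAP_RE.search(body):
            continue
        cells = [int(c, 16) for c in reg.group(1).split()]
        if len(cells) != 4:
            continue  # layout not covered by this sketch
        base = (cells[0] << 32) | cells[1]
        size = (cells[2] << 32) | cells[3]
        found.append((name, base, size))
    return found


def double_reserved(dts):
    """no-map nodes whose range overlaps a /memreserve/ entry."""
    rsv = memreserve_entries(dts)
    return [(name, base, size) for name, base, size in nomap_nodes(dts)
            if any(b < base + size and base < b + s for b, s in rsv)]


# Abridged copy of the dump quoted in this message.
SAMPLE = """
/memreserve/ 0x5000000 0x300000;
/ {
    reserved-memory {
        #address-cells = <0x00000002>;
        #size-cells = <0x00000002>;
        ranges;
        secmon@5000000 {
            reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
            no-map;
            phandle = <0x00000068>;
        };
        linux,cma {
            compatible = "shared-dma-pool";
            reusable;
            size = <0x00000000 0x10000000>;
        };
    };
};
"""

if __name__ == "__main__":
    for name, base, size in double_reserved(SAMPLE):
        print(f"{name}: base {base:#x}, size {size:#x} is also in /memreserve/")
```

On a board, the text could come from `fdtdump /sys/firmware/fdt` as used above; the sketch does not handle nested nodes or other cell layouts.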
* Re: Random reboots on ODROID-N2+
  2021-07-23 19:48 ` Stefan Agner
@ 2021-07-26  7:54 ` Neil Armstrong
  2021-07-26 12:07 ` Stefan Agner
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Armstrong @ 2021-07-26 7:54 UTC
To: Stefan Agner, Robin Murphy
Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Jerome Brunet,
    Kevin Hilman, Martin Blumenstingl, Mike Rapoport

Hi,

On 23/07/2021 21:48, Stefan Agner wrote:
> On 2021-07-23 19:47, Robin Murphy wrote:
>> On 2021-07-23 17:14, Robin Murphy wrote:
>>> On 2021-07-23 16:56, Stefan Agner wrote:
> <snip>
>>>>>
>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>> map at boot if you want to double-check that that is the source of the
>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>> memblock flag if the region *is* described appropriately, see
>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>
>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>> [    0.000000] efi: UEFI not found.
>>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>
>>>> So it seems UEFI is not in play here?
>>>
>>> Ah, OK, in that case I guess the question remains why does
>>> early_init_dt_reserve_memory_arch() think the region is already
>>> reserved? My instinctive assumption was an EFI memory map being
>>> present; seeing that U-Boot does indeed reflect DT reservations there
>>> *and* has had a likely-looking bug recently was then just
>>> overwhelmingly suggestive :)
>>
>> Actually, poking at U-Boot a bit more I find
>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>> and see if the region ends up being passed as a /memreserve/ as well
>> as a proper reserved-memory node?
>>
>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>> to be suitable for the kind of things which require no-map, and my new
>> guess is that that's what ends up conflicting here.
>
> Seems to be present in both:

Indeed, in order to support any combination:
- upstream u-boot
- vendor u-boot
- upstream linux
- other OS

The secmon is in the upstream Linux DT, and upstream u-boot reads the
secure memory regions from the first stage bootloaders and adds them
into the DT memreserve.

It worked fine since Linux 4.10-ish, until 5.10.

Neil

>
> On v5.12.10
> # fdtdump /sys/firmware/fdt
> ...
> /memreserve/ 0x5000000 0x300000;
> ...
>     reserved-memory {
>         #address-cells = <0x00000002>;
>         #size-cells = <0x00000002>;
>         ranges;
>         secmon@5000000 {
>             reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
>             no-map;
>             phandle = <0x00000068>;
>         };
>         linux,cma {
>             compatible = "shared-dma-pool";
>             reusable;
>             size = <0x00000000 0x10000000>;
>             alignment = <0x00000000 0x00400000>;
>             linux,cma-default;
>         };
>     };
>
> --
> Stefan
>

* Re: Random reboots on ODROID-N2+
  2021-07-26  7:54 ` Neil Armstrong
@ 2021-07-26 12:07 ` Stefan Agner
  2021-07-26 12:31 ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Agner @ 2021-07-26 12:07 UTC
To: Neil Armstrong
Cc: Robin Murphy, Byron Stanoszek, linux-amlogic, linux-arm-kernel,
    Jerome Brunet, Kevin Hilman, Martin Blumenstingl, Mike Rapoport

FWIW, I did run two boards over the weekend with the stress-ng vm test
running to cause memory pressure, one board with 8a5a75e5e9e55 ("of/fdt:
Make sure no-map does not remove already reserved regions") reverted.
The one without the revert crashed after ~24h, the other ran through
the weekend. This basically confirms what Byron reported.

On 2021-07-26 09:54, Neil Armstrong wrote:
> Hi,
>
> On 23/07/2021 21:48, Stefan Agner wrote:
>> On 2021-07-23 19:47, Robin Murphy wrote:
>>> On 2021-07-23 17:14, Robin Murphy wrote:
>>>> On 2021-07-23 16:56, Stefan Agner wrote:
>> <snip>
>>>>>>
>>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>>> map at boot if you want to double-check that that is the source of the
>>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>>> memblock flag if the region *is* described appropriately, see
>>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>>
>>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>>> [    0.000000] efi: UEFI not found.
>>>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>>
>>>>> So it seems UEFI is not in play here?
>>>>
>>>> Ah, OK, in that case I guess the question remains why does
>>>> early_init_dt_reserve_memory_arch() think the region is already
>>>> reserved? My instinctive assumption was an EFI memory map being
>>>> present; seeing that U-Boot does indeed reflect DT reservations there
>>>> *and* has had a likely-looking bug recently was then just
>>>> overwhelmingly suggestive :)
>>>
>>> Actually, poking at U-Boot a bit more I find
>>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>>> and see if the region ends up being passed as a /memreserve/ as well
>>> as a proper reserved-memory node?
>>>
>>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>>> to be suitable for the kind of things which require no-map, and my new
>>> guess is that that's what ends up conflicting here.
>>
>> Seems to be present in both:
>
> Indeed, in order to support any combination:
> - upstream u-boot
> - vendor u-boot
> - upstream linux
> - other OS
>
> The secmon is in the upstream Linux DT, and upstream u-boot reads the
> secure memory regions from the first stage bootloaders and adds them
> into the DT memreserve.
>
> It worked fine since Linux 4.10-ish, until 5.10.

Just verified what is probably obvious at this point: by removing
meson_board_add_reserved_memory() the /memreserve/ region isn't present
and the "failed to reserve memory" message does indeed disappear.

Why is reserving memory not enough? From what I've read, no-map also
makes sure there is no VM mapping, but if the region is reserved,
shouldn't that be enough for Linux to not access the region? I've read
that no-map also prevents access due to speculation, is this what is
happening here?

What is the proper solution here? Could maybe
meson_board_add_reserved_memory() check if reserved-memory is present,
and if so avoid adding /memreserve/?

--
Stefan

>
> Neil
>
>>
>> On v5.12.10
>> # fdtdump /sys/firmware/fdt
>> ...
>> /memreserve/ 0x5000000 0x300000;
>> ...
>>     reserved-memory {
>>         #address-cells = <0x00000002>;
>>         #size-cells = <0x00000002>;
>>         ranges;
>>         secmon@5000000 {
>>             reg = <0x00000000 0x05000000 0x00000000 0x00300000>;
>>             no-map;
>>             phandle = <0x00000068>;
>>         };
>>         linux,cma {
>>             compatible = "shared-dma-pool";
>>             reusable;
>>             size = <0x00000000 0x10000000>;
>>             alignment = <0x00000000 0x00400000>;
>>             linux,cma-default;
>>         };
>>     };
>>
>> --
>> Stefan
>>

* Re: Random reboots on ODROID-N2+
  2021-07-26 12:07 ` Stefan Agner
@ 2021-07-26 12:31 ` Robin Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Robin Murphy @ 2021-07-26 12:31 UTC
To: Stefan Agner, Neil Armstrong
Cc: Byron Stanoszek, linux-amlogic, linux-arm-kernel, Jerome Brunet,
    Kevin Hilman, Martin Blumenstingl, Mike Rapoport

On 2021-07-26 13:07, Stefan Agner wrote:
> FWIW, I did run two boards over the weekend with the stress-ng vm test
> running to cause memory pressure, one board with 8a5a75e5e9e55 ("of/fdt:
> Make sure no-map does not remove already reserved regions") reverted.
> The one without the revert crashed after ~24h, the other ran through
> the weekend. This basically confirms what Byron reported.
>
> On 2021-07-26 09:54, Neil Armstrong wrote:
>> Hi,
>>
>> On 23/07/2021 21:48, Stefan Agner wrote:
>>> On 2021-07-23 19:47, Robin Murphy wrote:
>>>> On 2021-07-23 17:14, Robin Murphy wrote:
>>>>> On 2021-07-23 16:56, Stefan Agner wrote:
>>> <snip>
>>>>>>>
>>>>>>> Booting with "efi=debug" should (among other things) print the memory
>>>>>>> map at boot if you want to double-check that that is the source of the
>>>>>>> mismatch. Our EFI code should be perfectly capable of setting the
>>>>>>> memblock flag if the region *is* described appropriately, see
>>>>>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>>>>>
>>>>>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>>>>>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>>>>>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>>>>>> [    0.000000] efi: UEFI not found.
>>>>>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for
>>>>>> node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>>>>>
>>>>>> So it seems UEFI is not in play here?
>>>>>
>>>>> Ah, OK, in that case I guess the question remains why does
>>>>> early_init_dt_reserve_memory_arch() think the region is already
>>>>> reserved? My instinctive assumption was an EFI memory map being
>>>>> present; seeing that U-Boot does indeed reflect DT reservations there
>>>>> *and* has had a likely-looking bug recently was then just
>>>>> overwhelmingly suggestive :)
>>>>
>>>> Actually, poking at U-Boot a bit more I find
>>>> meson_board_add_reserved_memory() - can you check /sys/firmware/fdt
>>>> and see if the region ends up being passed as a /memreserve/ as well
>>>> as a proper reserved-memory node?
>>>>
>>>> IIRC the semantics of /memreserve/ aren't really well-defined enough
>>>> to be suitable for the kind of things which require no-map, and my new
>>>> guess is that that's what ends up conflicting here.
>>>
>>> Seems to be present in both:
>>
>> Indeed, in order to support any combination:
>> - upstream u-boot
>> - vendor u-boot
>> - upstream linux
>> - other OS
>>
>> The secmon is in the upstream Linux DT, and upstream u-boot reads the
>> secure memory regions from the first stage bootloaders and adds them
>> into the DT memreserve.
>>
>> It worked fine since Linux 4.10-ish, until 5.10.
>
> Just verified what is probably obvious at this point: by removing
> meson_board_add_reserved_memory() the /memreserve/ region isn't present
> and the "failed to reserve memory" message does indeed disappear.
>
> Why is reserving memory not enough? From what I've read, no-map also
> makes sure there is no VM mapping, but if the region is reserved,
> shouldn't that be enough for Linux to not access the region? I've read
> that no-map also prevents access due to speculation, is this what is
> happening here?

Almost certainly - being reserved either way means that Linux won't try
to access those pages directly, but if they are still present in the
linear map as Normal memory which allows speculation, legitimate access
to adjacent pages may well cause the CPU to end up prefetching into them.

> What is the proper solution here? Could maybe
> meson_board_add_reserved_memory() check if reserved-memory is present,
> and if so avoid adding /memreserve/?

Perhaps, although it doesn't help people who can't or don't want to
update their firmware. As I say, I'm not sure what the expectations are
supposed to be for /memreserve/, particularly if it duplicates
reserved-memory.

Furthermore, looking at 8a5a75e5e9e55 I'm also not really convinced that
making the kernel boot for the sake of debugging a fundamentally broken
bootloader is a common and realistic enough issue to justify breaking
the existing not-necessarily-invalid bootloader behaviour of other
widely-deployed systems :/

Robin.

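[Editorial sketch] The firmware-side idea floated above (skip the /memreserve/ entry when a reserved-memory node already describes the range) amounts to a simple containment filter. The sketch below only illustrates that decision; it is not U-Boot code, `memreserve_to_add` and `covered` are hypothetical names, and the second firmware region in the usage example is invented for illustration. As noted in the reply, such a change would not help systems whose firmware cannot be updated.

```python
def covered(region, nodes):
    """True if (base, size) lies entirely inside one reserved-memory node."""
    base, size = region
    return any(nb <= base and base + size <= nb + ns for nb, ns in nodes)


def memreserve_to_add(firmware_regions, reserved_memory_nodes):
    """Firmware regions that still need a /memreserve/ entry.

    Both arguments are lists of (base, size) tuples. Regions already
    described by a reserved-memory node are skipped, so the kernel sees
    them only once and can honour their no-map property.
    """
    return [r for r in firmware_regions
            if not covered(r, reserved_memory_nodes)]


if __name__ == "__main__":
    # secmon range taken from the dump earlier in the thread.
    nodes = [(0x05000000, 0x00300000)]
    # The second region is hypothetical, used only to show the filter.
    firmware = [(0x05000000, 0x00300000), (0x07000000, 0x00100000)]
    for base, size in memreserve_to_add(firmware, nodes):
        print(f"/memreserve/ {base:#x} {size:#x};")
```

Requiring full containment rather than any overlap is a deliberate choice in this sketch: a firmware region that is only partly described by a node keeps its /memreserve/ entry, which errs on the side of reserving too much.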
end of thread, other threads:[~2021-07-26 12:35 UTC | newest]

Thread overview: 17+ messages
2021-05-17  9:14 Random reboots on ODROID-N2+ Stefan Agner
2021-05-17 21:09 ` Martin Blumenstingl
2021-05-18  9:16 ` Stefan Agner
2021-05-18  9:35 ` Neil Armstrong
2021-05-18  1:33 ` Andrew Lunn
2021-05-18 10:15 ` Stefan Agner
2021-05-19 20:09 ` Stefan Agner
2021-06-22  7:39 ` Stefan Agner
2021-07-23 14:25 ` Byron Stanoszek
2021-07-23 15:36 ` Robin Murphy
2021-07-23 15:56 ` Stefan Agner
2021-07-23 16:14 ` Robin Murphy
2021-07-23 17:47 ` Robin Murphy
2021-07-23 19:48 ` Stefan Agner
2021-07-26  7:54 ` Neil Armstrong
2021-07-26 12:07 ` Stefan Agner
2021-07-26 12:31 ` Robin Murphy