* stmmac on Banana PI CPU stalls since Linux 6.6
From: Marc Haber @ 2024-01-21 20:17 UTC
To: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
Hi,
I am running a bunch of Banana Pis with Debian stable and unstable but
with a bleeding edge kernel. Since kernel 6.6, especially the test
system running Debian unstable is plagued by self-detected stalls on
CPU. The system seems to continue running normally locally but doesn't
answer on the network any more. Sometimes, after a few hours, things
heal themselves.
Here is an example log output:
[73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
[73929.368653] rcu: 1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
[73929.377796] rcu: (t=5250 jiffies g=851349 q=113 ncpus=2)
[73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G L 6.6.0-zgbpi-armmp-lpae+ #1
[73929.383222] Hardware name: Allwinner sun7i (A20) Family
[73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
[73929.383363] LR is at dev_get_stats+0x44/0x144
[73929.383389] pc : [<bf126db0>] lr : [<c09525e8>] psr: 200f0013
[73929.383401] sp : f0c59c78 ip : f0c59df8 fp : c2bb8000
[73929.383412] r10: 00800001 r9 : c3443dd8 r8 : 00000143
[73929.383423] r7 : 00000001 r6 : 00000000 r5 : c2bbb000 r4 : 00000001
[73929.383434] r3 : 0004c891 r2 : c2bbae48 r1 : f0c59d30 r0 : c2bb8000
[73929.383447] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[73929.383463] Control: 30c5387d Table: 49b553c0 DAC: a7f66f60
[73929.383486] stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144
[73929.383564] dev_get_stats from dev_seq_printf_stats+0x40/0x194
[73929.383593] dev_seq_printf_stats from dev_seq_show+0x18/0x4c
[73929.383617] dev_seq_show from seq_read_iter+0x3c4/0x57c
[73929.383647] seq_read_iter from seq_read+0x9c/0xdc
[73929.383674] seq_read from proc_reg_read+0xb0/0xe4
[73929.383706] proc_reg_read from vfs_read+0xa8/0x2f4
[73929.383735] vfs_read from ksys_read+0x78/0x10c
[73929.383757] ksys_read from ret_fast_syscall+0x0/0x4c
[73929.383781] Exception stack(0xf0c59fa8 to 0xf0c59ff0)
[73929.383800] 9fa0: 024b7190 00000498 00000003 024cac10 00000400 00000001
[73929.383817] 9fc0: 024b7190 00000498 b6ef6d20 00000003 0000000a be9eb15c 00000000 00000000
[73929.383831] 9fe0: 00000003 be9eb030 b6e90eeb b6e0ab06
The issue is still present in Linux 6.7. I tried transplanting the stmmac
subdirectory from Linux 6.5 into Linux 6.6, but the changes were too big;
the result doesn't even build.
I have been running a bisect since before Christmas, but since it can take
up to a day for the issue to show itself on a "bad" kernel, I let
"good" kernels run for four days before I declare them good. That takes a
lot of wall clock (or better, wall calendar) time.
If you might have some ideas why this is happening on my Banana Pis,
I'm open to suggestions. Tentative patches against 6.6.$HIGH or
6.7.$CURRENT would be appreciated as well.
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Andrew Lunn @ 2024-01-21 21:52 UTC
To: Marc Haber
Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> Hi,
>
> I am running a bunch of Banana Pis with Debian stable and unstable but
> with a bleeding edge kernel. Since kernel 6.6, especially the test
> system running Debian unstable is plagued by self-detected stalls on
> CPU. The system seems to continue running normally locally but doesn't
> answer on the network any more. Sometimes, after a few hours, things
> heal themselves.
>
> Here is an example log output:
> [73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
> [73929.368653] rcu: 1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
> [73929.377796] rcu: (t=5250 jiffies g=851349 q=113 ncpus=2)
> [73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G L 6.6.0-zgbpi-armmp-lpae+ #1
> [73929.383222] Hardware name: Allwinner sun7i (A20) Family
> [73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> [73929.383363] LR is at dev_get_stats+0x44/0x144
> [73929.383389] pc : [<bf126db0>] lr : [<c09525e8>] psr: 200f0013
> [73929.383401] sp : f0c59c78 ip : f0c59df8 fp : c2bb8000
> [73929.383412] r10: 00800001 r9 : c3443dd8 r8 : 00000143
> [73929.383423] r7 : 00000001 r6 : 00000000 r5 : c2bbb000 r4 : 00000001
> [73929.383434] r3 : 0004c891 r2 : c2bbae48 r1 : f0c59d30 r0 : c2bb8000
> [73929.383447] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
> [73929.383463] Control: 30c5387d Table: 49b553c0 DAC: a7f66f60
> [73929.383486] stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144
Hi Marc
https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949
My _guess_ would be that it's stuck in one of the loops which look like:
	do {
		start = u64_stats_fetch_begin(&txq_stats->syncp);
		tx_packets = txq_stats->tx_packets;
		tx_bytes = txq_stats->tx_bytes;
	} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
Next time you get a backtrace, could you do:
  make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
You can then use whatever it is reporting for:
  PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
to find where it is in the listing.
Once we know if it's the RX or the TX loop, we have a better idea where
to look for an unbalanced u64_stats_update_begin() /
u64_stats_update_end().
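To illustrate why an unbalanced begin/end pair would produce exactly this kind of stall, here is a small user-space model of the reader/writer protocol. It is only a sketch and not the kernel's u64_stats_sync code: struct model_stats, model_update_begin(), model_update_end() and model_fetch_bounded() are made-up names, and the retry limit exists only so the demonstration terminates where the real reader would keep spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of a sequence counter guarding two
 * statistics fields. Not the kernel's u64_stats_sync implementation. */
struct model_stats {
	unsigned int seq;	/* even: stable, odd: update in progress */
	uint64_t tx_packets;
	uint64_t tx_bytes;
};

/* Writer side: each begin must be paired with exactly one end. */
void model_update_begin(struct model_stats *s) { s->seq++; }
void model_update_end(struct model_stats *s)   { s->seq++; }

/* Reader side: retry while an update is in progress or the sequence
 * changed during the read. Returns false after max_tries attempts,
 * which is the point where an unbounded loop would keep spinning. */
bool model_fetch_bounded(const struct model_stats *s,
			 uint64_t *packets, uint64_t *bytes,
			 unsigned int max_tries)
{
	for (unsigned int i = 0; i < max_tries; i++) {
		unsigned int start = s->seq;

		if (start & 1)
			continue;	/* writer active, try again */
		*packets = s->tx_packets;
		*bytes = s->tx_bytes;
		if (s->seq == start)
			return true;	/* consistent snapshot */
	}
	return false;
}
```

With balanced calls the reader gets its snapshot on the first try. After one extra model_update_begin() the sequence stays odd and the reader never succeeds, no matter how large max_tries is.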
> I have been running a bisect since before Christmas, but since it can take
> up to a day for the issue to show itself on a "bad" kernel, I let
> "good" kernels run for four days before I declare them good. That takes a
> lot of wall clock (or better, wall calendar) time.
You might be able to speed it up with:
while true ; do cat /proc/net/dev > /dev/null ; done
and iperf or similar to generate a lot of traffic.
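If a bounded run is easier to script during a bisect than an endless shell loop, the same polling can be sketched in C. This is an assumption-laden illustration: poll_stats_file() is a hypothetical helper, and it relies only on the fact that every read of /proc/net/dev goes through dev_get_stats(), as the backtrace above shows.

```c
#include <assert.h>
#include <stdio.h>

/* Read the whole file at `path` `iterations` times, discarding the data.
 * Returns the total number of bytes read, or -1 if the file cannot be
 * opened. Hypothetical helper, not part of any existing tool. */
long poll_stats_file(const char *path, unsigned long iterations)
{
	char buf[4096];
	long total = 0;

	for (unsigned long i = 0; i < iterations; i++) {
		FILE *f = fopen(path, "r");
		size_t n;

		if (!f)
			return -1;
		while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
			total += (long)n;
		fclose(f);
	}
	return total;
}
```

Calling poll_stats_file("/proc/net/dev", n) from a small main() while iperf is running would exercise the statistics readers a known number of times per test kernel.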
Andrew
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Andrey Jr. Melnikov @ 2024-01-22 21:34 UTC
To: Andrew Lunn
Cc: Marc Haber, alexandre.torgue, Jose Abreu, Chen-Yu Tsai,
Jernej Skrabec, Samuel Holland, Jisheng Zhang, netdev
On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> > Hi,
Hello. I have the same symptom on the same board.
[skip]
> make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> use whatever it is reporting for:
>
> PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
>
> to find where it is in the listing.
root@bpi:~# grep -ah 'PC is at ' /var/log/syslog*
Jan 22 20:13:04 bpi kernel: [256048.826170] PC is at stmmac_get_stats64+0x5c/0x1f8 [stmmac]
Jan 22 20:14:51 bpi kernel: [256156.077831] PC is at stmmac_get_stats64+0x40/0x1f8 [stmmac]
Jan 22 20:15:18 bpi kernel: [256183.687522] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:50:44 bpi kernel: [156104.837571] PC is at stmmac_get_stats64+0x4c/0x1f8 [stmmac]
Jan 17 10:51:52 bpi kernel: [156172.085436] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:52:37 bpi kernel: [156217.161344] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:53:03 bpi kernel: [156243.852175] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:54:40 bpi kernel: [156340.689082] PC is at stmmac_get_stats64+0x48/0x1f8 [stmmac]
Jan 17 10:55:07 bpi kernel: [156367.851904] PC is at stmmac_get_stats64+0x50/0x1f8 [stmmac]
Jan 17 10:56:11 bpi kernel: [156431.692860] PC is at stmmac_get_stats64+0x44/0x1f8 [stmmac]
Jan 17 10:56:49 bpi kernel: [156469.648758] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:57:15 bpi kernel: [156495.851573] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:59:20 bpi kernel: [156620.036359] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 11:00:31 bpi kernel: [156691.276191] PC is at stmmac_get_stats64+0x38/0x1f8 [stmmac]
Jan 17 11:01:07 bpi kernel: [156727.700103] PC is at stmmac_get_stats64+0x40/0x1f8 [stmmac]
Jan 17 11:01:31 bpi kernel: [156751.850926] PC is at stmmac_get_stats64+0x48/0x1f8 [stmmac]
So the PC is always after the first memory barrier (according to objdump -DS stmmac.ko):
....
00005b6c <stmmac_get_stats64>:
5b6c: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr}
5b70: e52de004 push {lr} @ (str lr, [sp, #-4]!)
5b74: ebfffffe bl 0 <__gnu_mcount_nc>
5b78: e2805a03 add r5, r0, #12288 @ 0x3000
5b7c: e59535c0 ldr r3, [r5, #1472] @ 0x5c0
5b80: e5937078 ldr r7, [r3, #120] @ 0x78
5b84: e5934074 ldr r4, [r3, #116] @ 0x74
5b88: e3570000 cmp r7, #0 // r7 -
5b8c: 12802db9 addne r2, r0, #11840 @ 0x2e40
5b90: 12822008 addne r2, r2, #8
5b94: 13a06000 movne r6, #0
5b98: 1a00000b bne 5bcc <stmmac_get_stats64+0x60>
5b9c: ea000026 b 5c3c <stmmac_get_stats64+0xd0>
5ba0: f57ff05b dmb ish
5ba4: e320f000 nop {0}
5ba8: e320f000 nop {0}
5bac: e320f000 nop {0}
5bb0: e320f000 nop {0}
5bb4: e320f000 nop {0}
5bb8: e320f000 nop {0}
5bbc: e320f000 nop {0}
5bc0: e320f000 nop {0}
5bc4: e320f000 nop {0}
5bc8: e320f000 nop {0}
5bcc: e5923000 ldr r3, [r2]
5bd0: e3130001 tst r3, #1
5bd4: 1afffff1 bne 5ba0 <stmmac_get_stats64+0x34>
5bd8: f57ff05b dmb ish
....
It loops in the TX stats reading.
> Once we know if it's the RX or the TX loop, we have a better idea where
> to look for an unbalanced u64_stats_update_begin() /
> u64_stats_update_end().
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Marc Haber @ 2024-01-25 18:01 UTC
To: Andrew Lunn
Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
Hi,
On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> > Hi,
[skip]
> Hi Marc
>
> https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949
That is just for reference to the source? Or am I supposed to do
something with that link?
> My _guess_ would be that it's stuck in one of the loops which look like:
>
> 	do {
> 		start = u64_stats_fetch_begin(&txq_stats->syncp);
> 		tx_packets = txq_stats->tx_packets;
> 		tx_bytes = txq_stats->tx_bytes;
> 	} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
>
> Next time you get a backtrace, could you do:
>
> make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> use whatever it is reporting for:
I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
the first bad commit that the bisect eventually identified) and tried
running:
[56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
SYNC include/config/auto.conf.cmd
SYSHDR arch/arm/include/generated/uapi/asm/unistd-oabi.h
SYSHDR arch/arm/include/generated/uapi/asm/unistd-eabi.h
HOSTCC scripts/kallsyms
UPD include/config/kernel.release
UPD include/generated/uapi/linux/version.h
UPD include/generated/utsrelease.h
SYSNR arch/arm/include/generated/asm/unistd-nr.h
SYSTBL arch/arm/include/generated/calls-oabi.S
SYSTBL arch/arm/include/generated/calls-eabi.S
CC scripts/mod/empty.o
MKELF scripts/mod/elfconfig.h
HOSTCC scripts/mod/modpost.o
CC scripts/mod/devicetable-offsets.s
UPD scripts/mod/devicetable-offsets.h
HOSTCC scripts/mod/file2alias.o
HOSTCC scripts/mod/sumversion.o
HOSTLD scripts/mod/modpost
CC kernel/bounds.s
CC arch/arm/kernel/asm-offsets.s
UPD include/generated/asm-offsets.h
CALL scripts/checksyscalls.sh
CHKSHA1 include/linux/atomic/atomic-arch-fallback.h
CHKSHA1 include/linux/atomic/atomic-instrumented.h
MKLST drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
[57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $
That is not what it was supposed to yield, right?
>
> PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
>
> to find where it is in the listing.
>
> Once we know if it's the RX or the TX loop, we have a better idea where
> to look for an unbalanced u64_stats_update_begin() /
> u64_stats_update_end().
>
> > I have been running a bisect since before Christmas, but since it can take
> > up to a day for the issue to show itself on a "bad" kernel, I let
> > "good" kernels run for four days before I declare them good. That takes a
> > lot of wall clock (or better, wall calendar) time.
>
> You might be able to speed it up with:
>
> while true ; do cat /proc/net/dev > /dev/null ; done
>
> and iperf or similar to generate a lot of traffic.
My bisect eventually completed and identified
2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
Sadly, it doesn't contain any loops, any calls to u64_stats_update_begin()
or u64_stats_update_end(), or anything else that looks suspicious to the casual
reader.
I have backed that commit out of 6.7.1 and have booted that kernel.
It has not been running long enough to say anything yet.
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Andrew Lunn @ 2024-01-25 19:54 UTC
To: Marc Haber
Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
> I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
> the first bad commit that the bisect eventually identified) and tried
> running:
>
> [56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
[skip]
> MKLST drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
> ./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
> [57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $
>
> That is not what it was supposed to yield, right?
No. But did it actually generate
drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst? Sometimes errors
like this are not fatal.
> My bisect eventually completed and identified
> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
I can make a guess.
- memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));
It's removed, not moved later. Deep within this structure are the
stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
to return, and the two syncp variables are in it as well.
My guess is that they are left in an invalid state when this memset is missing.
Try putting the memset back.
I also guess that this is not the real fix, and that there are missing calls to
u64_stats_init().
Andrew
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Florian Fainelli @ 2024-01-25 20:00 UTC
To: Andrew Lunn, Marc Haber, Petr Tesarik
Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
On 1/25/24 11:54, Andrew Lunn wrote:
[skip]
>> My bisect eventually completed and identified
>> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
>
> I can make a guess.
>
> - memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));
>
> It's removed, not moved later. Deep within this structure are the
> stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
> to return, and the two syncp variables are in it as well.
>
> My guess is that they are left in an invalid state when this memset is missing.
>
> Try putting the memset back.
>
> I also guess that this is not the real fix, and that there are missing calls to
> u64_stats_init().
Didn't Petr try to address essentially the same problem?
https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
That was not deemed the proper solution, and I don't think one has been
posted since then, but it looks like your issue here, Marc.
--
Florian
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Petr Tesařík @ 2024-01-26 7:51 UTC
To: Florian Fainelli
Cc: Andrew Lunn, Marc Haber, alexandre.torgue, Jose Abreu,
Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
netdev
On Thu, 25 Jan 2024 12:00:46 -0800
Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 1/25/24 11:54, Andrew Lunn wrote:
[skip]
> >> My bisect eventually completed and identified
> >> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
> >
> > I can make a guess.
> >
> > - memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));
> >
> > It's removed, not moved later. Deep within this structure are the
> > stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
> > to return, and the two syncp variables are in it as well.
> >
> > My guess is that they are left in an invalid state when this memset is missing.
> >
> > Try putting the memset back.
> >
> > I also guess that this is not the real fix, and that there are missing calls to
> > u64_stats_init().
>
> Didn't Petr try to address essentially the same problem?
>
> https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
>
> That was not deemed the proper solution, and I don't think one has been
> posted since then, but it looks like your issue here, Marc.
Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
I've been busy with other things lately, so I could not test and submit
my changes.
Essentially, the write side of the statistics seqlock is not protected
and will eventually miss an increment, causing the read side to spin
forever. The final plan is to split the statistics into three parts:
1. fields updated only under the tx queue lock,
2. fields updated only during NAPI poll,
3. fields updated only from interrupt context.
The first two groups can each have its own seqlock. The third group
(actually a single counter) can be converted to a per-CPU variable. The
read side will then aggregate the values as appropriate.
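How a single missed increment can arise is easiest to see by interleaving two writers by hand. The following is a deterministic, single-threaded sketch, not the driver code: struct split_inc and the inc_load()/inc_store() helpers are hypothetical and merely split one non-atomic increment into its load and store halves.

```c
#include <assert.h>

/* One non-atomic increment split into its load and store halves, so that
 * two writers can be interleaved by hand. Purely illustrative. */
struct split_inc {
	unsigned int loaded;
};

void inc_load(struct split_inc *w, const unsigned int *seq)
{
	w->loaded = *seq;
}

void inc_store(const struct split_inc *w, unsigned int *seq)
{
	*seq = w->loaded + 1;
}

/* Two unserialized writers A and B both start an update; B loads the
 * sequence before A's store lands, so one "begin" increment is lost. */
unsigned int racy_final_sequence(void)
{
	unsigned int seq = 0;
	struct split_inc a, b;

	/* "begin" of both updates, overlapping */
	inc_load(&a, &seq);
	inc_load(&b, &seq);
	inc_store(&a, &seq);	/* seq = 1 */
	inc_store(&b, &seq);	/* seq = 1 again, not 2 */

	/* "end" of both updates, one after the other */
	inc_load(&a, &seq);
	inc_store(&a, &seq);	/* seq = 2 */
	inc_load(&b, &seq);
	inc_store(&b, &seq);	/* seq = 3 */

	return seq;		/* odd while no writer is active */
}

/* The same four increments, properly serialized. */
unsigned int serialized_final_sequence(void)
{
	unsigned int seq = 0;

	for (int i = 0; i < 4; i++)
		seq++;
	return seq;
}
```

In the racy interleaving the sequence ends up odd at rest, so a reader that waits for an even value sees a permanent "update in progress". Giving each update context its own seqlock, as in the plan above, removes the overlap between writers.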
I hope I can find some time for this bug again during the coming weekend
(it's not for my day job). It's motivating to know that I'm not the
only affected person on the planet. ;-)
Petr T
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Marc Haber @ 2024-01-26 10:48 UTC
To: Andrew Lunn
Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
On Thu, Jan 25, 2024 at 07:01:40PM +0100, Marc Haber wrote:
> On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> > On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
[skip]
> > My _guess_ would be that it's stuck in one of the loops which look like:
> >
> > 	do {
> > 		start = u64_stats_fetch_begin(&txq_stats->syncp);
> > 		tx_packets = txq_stats->tx_packets;
> > 		tx_bytes = txq_stats->tx_bytes;
> > 	} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
> >
> > Next time you get a backtrace, could you do:
> >
> > make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> > use whatever it is reporting for:
So, if I have in my current backtrace:
PC is at stmmac_get_stats64+0x48/0x20c [stmmac]
I look in the generated stmmac_main.lst for the function
stmmac_get_stats64:
00005e9c <stmmac_get_stats64>:
{
5e9c: e92d47f0 push {r4, r5, r6, r7, r8, r9, sl, lr}
5ea0: e52de004 push {lr} @ (str lr, [sp, #-4]!)
5ea4: ebfffffe bl 0 <__gnu_mcount_nc>
5ea4: R_ARM_CALL __gnu_mcount_nc
u32 tx_cnt = priv->plat->tx_queues_to_use;
5ea8: e2805a03 add r5, r0, #12288 @ 0x3000
5eac: e59535c0 ldr r3, [r5, #1472] @ 0x5c0
5eb0: e5937078 ldr r7, [r3, #120] @ 0x78
u32 rx_cnt = priv->plat->rx_queues_to_use;
5eb4: e5934074 ldr r4, [r3, #116] @ 0x74
for (q = 0; q < tx_cnt; q++) {
5eb8: e3570000 cmp r7, #0
5ebc: 12802db9 addne r2, r0, #11840 @ 0x2e40
5ec0: 12822008 addne r2, r2, #8
5ec4: 13a06000 movne r6, #0
5ec8: 1a00000b bne 5efc <stmmac_get_stats64+0x60>
5ecc: ea000026 b 5f6c <stmmac_get_stats64+0xd0>
local_irq_restore(flags);
}
The address in the first line is the base address, so the instruction in
question is at 0x5e9c + 0x48 = 0x5ee4, which is already outside the function?!
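One way to double-check that arithmetic is to treat the backtrace suffix as offset/size, i.e. "+0x48/0x20c" read as "0x48 bytes into a function that is 0x20c bytes long". The sketch below assumes that reading; struct sym_ref, listing_address() and inside_function() are illustrative names, not an existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of a "symbol+0xOFFSET/0xSIZE" backtrace entry. */
struct sym_ref {
	uint32_t offset;	/* bytes from the start of the function */
	uint32_t size;		/* total size of the function in bytes */
};

/* Address of the instruction in a listing whose function starts at base. */
uint32_t listing_address(uint32_t base, struct sym_ref ref)
{
	return base + ref.offset;
}

/* True if the offset lies inside the function body. */
bool inside_function(struct sym_ref ref)
{
	return ref.offset < ref.size;
}
```

Under that reading, a function starting at 0x5e9c with size 0x20c spans 0x5e9c up to (but not including) 0x60a8, so 0x5ee4 is still inside it and only lies beyond the last address shown in the excerpt (0x5ecc). Applied to the objdump output posted earlier in the thread (base 0x5b6c), offset 0x64 gives 0x5bd0, the "tst r3, #1" inside the retry loop.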
> My bisect eventually completed and identified
> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
> Sadly, it doesn't contain any loops, any calls to u64_stats_update_begin()
> or u64_stats_update_end(), or anything else that looks suspicious to the casual
> reader.
>
> I have backed that commit out of 6.7.1 and have booted that kernel.
> It has not been running long enough to say anything yet.
That didn't fix the hangs; the PC is at
stmmac_get_stats64+0x34/0x20c
stmmac_get_stats64+0x38/0x20c
stmmac_get_stats64+0x3c/0x20c
stmmac_get_stats64+0x40/0x20c
stmmac_get_stats64+0x44/0x20c
stmmac_get_stats64+0x48/0x20c
stmmac_get_stats64+0x4c/0x20c
stmmac_get_stats64+0x50/0x20c
stmmac_get_stats64+0x54/0x20c
stmmac_get_stats64+0x58/0x20c
stmmac_get_stats64+0x5c/0x20c
stmmac_get_stats64+0x60/0x20c
stmmac_get_stats64+0x64/0x20c
(sorted, uniq, about 66 instances in about 18 hours)
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
From: Marc Haber @ 2024-01-26 10:54 UTC
To: Petr Tesařík
Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
netdev
On Fri, Jan 26, 2024 at 08:51:22AM +0100, Petr Tesařík wrote:
> On Thu, 25 Jan 2024 12:00:46 -0800
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> > Didn't Petr try to address essentially the same problem?
> >
> > https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
> >
> > That was not deemed the proper solution, and I don't think one has been
> > posted since then, but it looks like your issue here, Marc.
>
> Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
> I've been busy with other things lately, so I could not test and submit
> my changes.
Is it worth trying your patch from the message cited above, knowing that
it is not the final solution?
> I hope I can find some time for this bug again during the coming weekend
> (it's not for my day job). It's motivating to know that I'm not the
> only affected person on the planet. ;-)
I am ready to test if you want me to ;-)
Greetings
Marc
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-01-26 10:54 ` Marc Haber
@ 2024-01-26 11:10 ` Petr Tesařík
2024-02-05 20:12 ` Marc Haber
0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-01-26 11:10 UTC (permalink / raw)
To: Marc Haber
Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
netdev
On Fri, 26 Jan 2024 11:54:20 +0100
Marc Haber <mh+netdev@zugschlus.de> wrote:
> On Fri, Jan 26, 2024 at 08:51:22AM +0100, Petr Tesařík wrote:
> > On Thu, 25 Jan 2024 12:00:46 -0800
> > Florian Fainelli <f.fainelli@gmail.com> wrote:
> > > Did not Petr try to address the same problem essentially:
> > >
> > > https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
> > >
> > > this was not deemed the proper solution and I don't think one has been
> > > posted since then, but it looks about your issue here Marc.
> >
> > Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
> > I've been busy with other things lately, so I could not test and submit
> > my changes.
>
> Is it worth trying your patch from the message cited above, knowing that
> it is not the final solution?
Depends. It solves the deadlock (at least for me); my NanoPi has been
running stable for over a month with this patch. But it also introduces
a new spinlock, which usually reduces performance.
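For readers wondering why a statistics reader can hang at all, here is a toy model in Python. It rests on an assumption that this thread does not spell out: that on 32-bit ARM the 64-bit counters are guarded by a sequence counter, that readers wait while the sequence is odd, and that writers must be serialized for that scheme to work. The class and function names are made up for illustration; this is not kernel code.

```python
class SeqCount:
    """Toy stand-in for a sequence counter guarding 64-bit statistics."""

    def __init__(self):
        self.sequence = 0


def reader_would_spin(sc):
    # An odd sequence means "a write is in progress", so a reader keeps retrying.
    return sc.sequence % 2 == 1


def serialized_writers(sc, writers=2):
    # With mutual exclusion, each writer runs begin (+1) and end (+1) undisturbed.
    for _ in range(writers):
        sc.sequence += 1  # write begin: sequence becomes odd
        sc.sequence += 1  # write end: sequence becomes even again


def racing_writers(sc):
    # Two writers begin at the same time. The increment is a plain
    # load/add/store, so both load the same value and one update is lost.
    seen_by_a = sc.sequence
    seen_by_b = sc.sequence
    sc.sequence = seen_by_a + 1
    sc.sequence = seen_by_b + 1  # stores the same value again
    sc.sequence += 1  # writer A ends
    sc.sequence += 1  # writer B ends: sequence is now odd with no writer active
```

In this model the racing case leaves the sequence odd forever, so every later reader spins, which would match a CPU being caught in a short loop inside stmmac_get_stats64. Serializing the writers (for example with a spinlock, at some cost in performance) keeps the sequence even between updates.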
In any case, you can give it a try to verify that you hit the same
issue.
> > I hope I can find some time for this bug again during the coming weekend
> > (it's not for my day job). It's motivating to know that I'm not the
> > only affected person on the planet. ;-)
>
> I am ready to test if you want me to ;-)
Then you may want to start by verifying that it is indeed the same
issue. Try the linked patch.
Thank you!
Petr T
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-01-26 11:10 ` Petr Tesařík
@ 2024-02-05 20:12 ` Marc Haber
2024-02-05 21:50 ` Florian Fainelli
0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2024-02-05 20:12 UTC (permalink / raw)
To: Petr Tesařík
Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
netdev
On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> Then you may want to start by verifying that it is indeed the same
> issue. Try the linked patch.
The linked patch seemed to help for 6.7.2, the test machine ran for five
days without problems. After going to unpatched 6.7.2, the issue was
back in six hours.
Greetings
Marc
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-05 20:12 ` Marc Haber
@ 2024-02-05 21:50 ` Florian Fainelli
2024-02-06 8:23 ` Petr Tesařík
0 siblings, 1 reply; 18+ messages in thread
From: Florian Fainelli @ 2024-02-05 21:50 UTC (permalink / raw)
To: Marc Haber, Petr Tesařík
Cc: Andrew Lunn, alexandre.torgue, Jose Abreu, Chen-Yu Tsai,
Jernej Skrabec, Samuel Holland, Jisheng Zhang, netdev
On 2/5/24 12:12, Marc Haber wrote:
> On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
>> Then you may want to start by verifying that it is indeed the same
>> issue. Try the linked patch.
>
> The linked patch seemed to help for 6.7.2, the test machine ran for five
> days without problems. After going to unpatched 6.7.2, the issue was
> back in six hours.
Do you mind responding to Petr's patch with a Tested-by? Thanks!
--
Florian
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-05 21:50 ` Florian Fainelli
@ 2024-02-06 8:23 ` Petr Tesařík
2024-02-12 12:15 ` Marc Haber
0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-02-06 8:23 UTC (permalink / raw)
To: Florian Fainelli
Cc: Marc Haber, Andrew Lunn, alexandre.torgue, Jose Abreu,
Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
netdev
Hi Florian,
On Mon, 5 Feb 2024 13:50:35 -0800
Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 2/5/24 12:12, Marc Haber wrote:
> > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> >> Then you may want to start by verifying that it is indeed the same
> >> issue. Try the linked patch.
> >
> > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > days without problems. After going to unpatched 6.7.2, the issue was
> > back in six hours.
>
> Do you mind responding to Petr's patch with a Tested-by? Thanks!
I believe Marc tested my first attempt at a solution (the one with
spinlocks), not the latest incarnation. FWIW I have tested a similar
scenario, with similar results.
@Marc: I was able to reduce the time until hang by running a "ping -f"
from another machine on the same LAN and running "ethtool -S" in a
tight loop on the system under testing (over an SSH connection, so it
probably contributed substantially to the network traffic). The
unpatched kernel froze within a few minutes.
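The "ethtool -S" half of this reproducer could be scripted roughly as follows. This is a sketch, not the exact loop used here: the function name hammer_stats, the interface name and the iteration count are assumptions, and the "ping -f" flood still has to be started separately from another machine on the same LAN.

```python
import subprocess


def hammer_stats(iface, iterations, run=subprocess.run):
    """Run `ethtool -S <iface>` in a tight loop; return how many calls succeeded.

    `run` defaults to subprocess.run and can be replaced for dry runs.
    """
    succeeded = 0
    for _ in range(iterations):
        result = run(
            ["ethtool", "-S", iface],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        if result.returncode == 0:
            succeeded += 1
    return succeeded
```

On the system under test this would be called as, say, hammer_stats("eth0", 100000), with "eth0" replaced by the actual stmmac interface. If the machine is affected, the expectation from the description above is that the call never returns because the network stops responding first.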
Petr T
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-06 8:23 ` Petr Tesařík
@ 2024-02-12 12:15 ` Marc Haber
2024-02-19 19:20 ` Christian Stewart
0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2024-02-12 12:15 UTC (permalink / raw)
To: Petr Tesařík
Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
netdev
On Tue, Feb 06, 2024 at 09:23:51AM +0100, Petr Tesařík wrote:
> On Mon, 5 Feb 2024 13:50:35 -0800
> Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> > On 2/5/24 12:12, Marc Haber wrote:
> > > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> > >> Then you may want to start by verifying that it is indeed the same
> > >> issue. Try the linked patch.
> > >
> > > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > > days without problems. After going to unpatched 6.7.2, the issue was
> > > back in six hours.
> >
> > Do you mind responding to Petr's patch with a Tested-by? Thanks!
>
> I believe Marc tested my first attempt at a solution (the one with
> spinlocks), not the latest incarnation. FWIW I have tested a similar
> scenario, with similar results.
Where is the latest patch? I can give it a try.
Sorry for not responding any earlier, February 10 is an important tax
due date in Germany.
Greetings
Marc
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-12 12:15 ` Marc Haber
@ 2024-02-19 19:20 ` Christian Stewart
2024-02-19 19:44 ` Petr Tesařík
0 siblings, 1 reply; 18+ messages in thread
From: Christian Stewart @ 2024-02-19 19:20 UTC (permalink / raw)
To: Marc Haber
Cc: Petr Tesařík, Florian Fainelli, Andrew Lunn,
alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
Samuel Holland, Jisheng Zhang, netdev
Hi all,
On Mon, Feb 12, 2024 at 4:15 AM Marc Haber <mh+netdev@zugschlus.de> wrote:
>
> On Tue, Feb 06, 2024 at 09:23:51AM +0100, Petr Tesařík wrote:
> > On Mon, 5 Feb 2024 13:50:35 -0800
> > Florian Fainelli <f.fainelli@gmail.com> wrote:
> >
> > > On 2/5/24 12:12, Marc Haber wrote:
> > > > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> > > >> Then you may want to start by verifying that it is indeed the same
> > > >> issue. Try the linked patch.
> > > >
> > > > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > > > days without problems. After going to unpatched 6.7.2, the issue was
> > > > back in six hours.
> > >
> > > Do you mind responding to Petr's patch with a Tested-by? Thanks!
> >
> > I believe Marc tested my first attempt at a solution (the one with
> > spinlocks), not the latest incarnation. FWIW I have tested a similar
> > scenario, with similar results.
>
> Where is the latest patch? I can give it a try.
>
> Sorry for not responding any earlier, February 10 is an important tax
> due date in Germany.
>
> Greetings
> Marc
We are seeing the same kernel panic on shutdown with 6.7.4 on a
BananaPi M2 Ultra:
[** ] (3 of 3) A stop job is running for Network Manager (33s / 52s)
[ 259.463772] rcu: INFO: rcu_sched self-detected stall on CPU
[ 259.469388] rcu: 0-....: (2099 ticks this GP) idle=0fdc/1/0x40000002 softirq=12003/12003 fqs=1034
[ 259.478360] rcu: (t=2100 jiffies g=16277 q=36 ncpus=4)
[ 259.483595] CPU: 0 PID: 4462 Comm: ip Tainted: G C 6.7.4 #1
[ 259.490562] Hardware name: Allwinner sun8i Family
[ 259.495268] PC is at stmmac_get_stats64+0x30/0x198
[ 259.500081] LR is at dev_get_stats+0x3c/0x160
[ 259.504445] pc : [<c06b9924>] lr : [<c07bf7a8>] psr: 200f0013
[ 259.510712] sp : f1e6d9b8 ip : c3ca478c fp : c23e0000
[ 259.515941] r10: 00000000 r9 : c3ca4598 r8 : 00000000
[ 259.521168] r7 : 00000001 r6 : 00000000 r5 : c23e3000 r4 : 00000001
[ 259.527697] r3 : 00005c1b r2 : c23e2e08 r1 : c3ca46c4 r0 : c23e0000
[ 259.534226] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 259.541363] Control: 10c5387d Table: 429cc06a DAC: 00000051
[ 259.547117] stmmac_get_stats64 from dev_get_stats+0x3c/0x160
[ 259.552882] dev_get_stats from rtnl_fill_stats+0x30/0x118
[ 259.552899] rtnl_fill_stats from rtnl_fill_ifinfo+0x720/0x135c
[ 259.564306] rtnl_fill_ifinfo from rtnl_dump_ifinfo+0x330/0x6a8
[ 259.570240] rtnl_dump_ifinfo from netlink_dump+0x16c/0x350
[ 259.575830] netlink_dump from __netlink_dump_start+0x1bc/0x280
[ 259.581766] __netlink_dump_start from rtnetlink_rcv_msg+0xf4/0x2f0
[ 259.588047] rtnetlink_rcv_msg from netlink_rcv_skb+0xb8/0x118
[ 259.593893] netlink_rcv_skb from netlink_unicast+0x1fc/0x2d8
[ 259.599655] netlink_unicast from netlink_sendmsg+0x1c8/0x440
[ 259.605416] netlink_sendmsg from sock_write_iter+0xa0/0x10c
[ 259.611094] sock_write_iter from vfs_write+0x338/0x398
[ 259.616334] vfs_write from ksys_write+0xbc/0xf0
[ 259.620961] ksys_write from ret_fast_syscall+0x0/0x54
[ 259.626110] Exception stack(0xf1e6dfa8 to 0xf1e6dff0)
[ 259.631169] dfa0: 00000003 be997dd8 00000003 be997dd8 00000014 00000001
[ 259.639351] dfc0: 00000003 be997dd8 00000014 00000004 00519548 be997e08 b6fd0ce0 0051783c
https://github.com/skiffos/SkiffOS/issues/307
I'm writing to ask if anyone has found a fix for this yet?
Thanks!
Christian Stewart
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-19 19:20 ` Christian Stewart
@ 2024-02-19 19:44 ` Petr Tesařík
2024-02-20 14:59 ` Jakub Kicinski
0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-02-19 19:44 UTC (permalink / raw)
To: Christian Stewart
Cc: Marc Haber, Florian Fainelli, Andrew Lunn, alexandre.torgue,
Jose Abreu, Chen-Yu Tsai, Jernej Skrabec, Samuel Holland,
Jisheng Zhang, netdev
On Mon, 19 Feb 2024 11:20:35 -0800
Christian Stewart <christian@aperture.us> wrote:
> Hi all,
>
> On Mon, Feb 12, 2024 at 4:15 AM Marc Haber <mh+netdev@zugschlus.de> wrote:
> >
> > On Tue, Feb 06, 2024 at 09:23:51AM +0100, Petr Tesařík wrote:
> > > On Mon, 5 Feb 2024 13:50:35 -0800
> > > Florian Fainelli <f.fainelli@gmail.com> wrote:
> > >
> > > > On 2/5/24 12:12, Marc Haber wrote:
> > > > > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> > > > >> Then you may want to start by verifying that it is indeed the same
> > > > >> issue. Try the linked patch.
> > > > >
> > > > > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > > > > days without problems. After going to unpatched 6.7.2, the issue was
> > > > > back in six hours.
> > > >
> > > > Do you mind responding to Petr's patch with a Tested-by? Thanks!
> > >
> > > I believe Marc tested my first attempt at a solution (the one with
> > > spinlocks), not the latest incarnation. FWIW I have tested a similar
> > > scenario, with similar results.
> >
> > Where is the latest patch? I can give it a try.
> >
> > Sorry for not responding any earlier, February 10 is an important tax
> > due date in Germany.
> >
> > Greetings
> > Marc
>
> We are seeing the same kernel panic on shutdown with 6.7.4 on a
> BananaPi M2 Ultra:
>
> [** ] (3 of 3) A stop job is running for Network Manager (33s / 52s)
> [ 259.463772] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 259.469388] rcu: 0-....: (2099 ticks this GP) idle=0fdc/1/0x40000002 softirq=12003/12003 fqs=1034
> [ 259.478360] rcu: (t=2100 jiffies g=16277 q=36 ncpus=4)
> [ 259.483595] CPU: 0 PID: 4462 Comm: ip Tainted: G C 6.7.4 #1
> [ 259.490562] Hardware name: Allwinner sun8i Family
> [ 259.495268] PC is at stmmac_get_stats64+0x30/0x198
> [ 259.500081] LR is at dev_get_stats+0x3c/0x160
> [ 259.504445] pc : [<c06b9924>] lr : [<c07bf7a8>] psr: 200f0013
> [ 259.510712] sp : f1e6d9b8 ip : c3ca478c fp : c23e0000
> [ 259.515941] r10: 00000000 r9 : c3ca4598 r8 : 00000000
> [ 259.521168] r7 : 00000001 r6 : 00000000 r5 : c23e3000 r4 : 00000001
> [ 259.527697] r3 : 00005c1b r2 : c23e2e08 r1 : c3ca46c4 r0 : c23e0000
> [ 259.534226] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
> [ 259.541363] Control: 10c5387d Table: 429cc06a DAC: 00000051
> [ 259.547117] stmmac_get_stats64 from dev_get_stats+0x3c/0x160
> [ 259.552882] dev_get_stats from rtnl_fill_stats+0x30/0x118
> [ 259.552899] rtnl_fill_stats from rtnl_fill_ifinfo+0x720/0x135c
> [ 259.564306] rtnl_fill_ifinfo from rtnl_dump_ifinfo+0x330/0x6a8
> [ 259.570240] rtnl_dump_ifinfo from netlink_dump+0x16c/0x350
> [ 259.575830] netlink_dump from __netlink_dump_start+0x1bc/0x280
> [ 259.581766] __netlink_dump_start from rtnetlink_rcv_msg+0xf4/0x2f0
> [ 259.588047] rtnetlink_rcv_msg from netlink_rcv_skb+0xb8/0x118
> [ 259.593893] netlink_rcv_skb from netlink_unicast+0x1fc/0x2d8
> [ 259.599655] netlink_unicast from netlink_sendmsg+0x1c8/0x440
> [ 259.605416] netlink_sendmsg from sock_write_iter+0xa0/0x10c
> [ 259.611094] sock_write_iter from vfs_write+0x338/0x398
> [ 259.616334] vfs_write from ksys_write+0xbc/0xf0
> [ 259.620961] ksys_write from ret_fast_syscall+0x0/0x54
> [ 259.626110] Exception stack(0xf1e6dfa8 to 0xf1e6dff0)
> [ 259.631169] dfa0: 00000003 be997dd8 00000003 be997dd8 00000014 00000001
> [ 259.639351] dfc0: 00000003 be997dd8 00000014 00000004 00519548 be997e08 b6fd0ce0 0051783c
>
> https://github.com/skiffos/SkiffOS/issues/307
>
> I'm writing to ask if anyone has found a fix for this yet?
If you're running a 6.7 stable kernel, my patch has just been added to
the 6.7-stable tree.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-6.7/net-stmmac-protect-updates-of-64-bit-statistics-counters.patch
However, lockdep has reported an issue with it:
https://lore.kernel.org/lkml/ea1567d9-ce66-45e6-8168-ac40a47d1821@roeck-us.net/
This new report has not yet been properly understood, but FWIW I've
been running stable with my patch for over a month now.
Petr T
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-19 19:44 ` Petr Tesařík
@ 2024-02-20 14:59 ` Jakub Kicinski
2024-02-23 20:38 ` Christian Stewart
0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2024-02-20 14:59 UTC (permalink / raw)
To: Petr Tesařík, Christian Stewart
Cc: Marc Haber, Florian Fainelli, Andrew Lunn, alexandre.torgue,
Jose Abreu, Chen-Yu Tsai, Jernej Skrabec, Samuel Holland,
Jisheng Zhang, netdev
On Mon, 19 Feb 2024 20:44:21 +0100 Petr Tesařík wrote:
> If you're running a 6.7 stable kernel, my patch has just been added to
> the 6.7-stable tree.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-6.7/net-stmmac-protect-updates-of-64-bit-statistics-counters.patch
>
> However, lockdep has reported an issue with it:
>
> https://lore.kernel.org/lkml/ea1567d9-ce66-45e6-8168-ac40a47d1821@roeck-us.net/
>
> This new report has not yet been properly understood, but FWIW I've
> been running stable with my patch for over a month now.
Christian got an actual soft lockup, not just a lockdep warning, tho.
Christian, could you run the stack trace thru scripts/decode_stacktrace
and tell us which loop it's stuck on?
* Re: stmmac on Banana PI CPU stalls since Linux 6.6
2024-02-20 14:59 ` Jakub Kicinski
@ 2024-02-23 20:38 ` Christian Stewart
0 siblings, 0 replies; 18+ messages in thread
From: Christian Stewart @ 2024-02-23 20:38 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Petr Tesařík, Marc Haber, Florian Fainelli,
Andrew Lunn, alexandre.torgue, Jose Abreu, Chen-Yu Tsai,
Jernej Skrabec, Samuel Holland, Jisheng Zhang, netdev
On Tue, Feb 20, 2024 at 6:59 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 19 Feb 2024 20:44:21 +0100 Petr Tesařík wrote:
> > This new report has not yet been properly understood, but FWIW I've
> > been running stable with my patch for over a month now.
>
> Christian got an actual soft lockup, not just a lockdep warning, tho.
> Christian, could you run the stack trace thru scripts/decode_stacktrace
> and tell us which loop it's stuck on?
This was a crash report from a user and unfortunately I don't have the
kernel sources & build artifacts from that device to be able to run
decode_stacktrace. If it happens again I will request the user send me
their kernel build tree & will report back with the decoded
stacktrace.
Thanks!
Christian Stewart
Thread overview: 18+ messages
2024-01-21 20:17 stmmac on Banana PI CPU stalls since Linux 6.6 Marc Haber
2024-01-21 21:52 ` Andrew Lunn
2024-01-22 21:34 ` Andrey Jr. Melnikov
2024-01-25 18:01 ` Marc Haber
2024-01-25 19:54 ` Andrew Lunn
2024-01-25 20:00 ` Florian Fainelli
2024-01-26 7:51 ` Petr Tesařík
2024-01-26 10:54 ` Marc Haber
2024-01-26 11:10 ` Petr Tesařík
2024-02-05 20:12 ` Marc Haber
2024-02-05 21:50 ` Florian Fainelli
2024-02-06 8:23 ` Petr Tesařík
2024-02-12 12:15 ` Marc Haber
2024-02-19 19:20 ` Christian Stewart
2024-02-19 19:44 ` Petr Tesařík
2024-02-20 14:59 ` Jakub Kicinski
2024-02-23 20:38 ` Christian Stewart
2024-01-26 10:48 ` Marc Haber