* stmmac on Banana PI CPU stalls since Linux 6.6
@ 2024-01-21 20:17 Marc Haber
  2024-01-21 21:52 ` Andrew Lunn
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2024-01-21 20:17 UTC (permalink / raw)
  To: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

Hi,

I am running a bunch of Banana Pis with Debian stable and unstable but
with a bleeding edge kernel. Since kernel 6.6, the test system running
Debian unstable in particular has been plagued by self-detected stalls
on CPU. The system seems to continue running normally locally but no
longer answers on the network. Sometimes, after a few hours, things
heal themselves.

Here is an example log output:
[73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
[73929.368653] rcu:     1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
[73929.377796] rcu:     (t=5250 jiffies g=851349 q=113 ncpus=2)
[73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G             L     6.6.0-zgbpi-armmp-lpae+ #1
[73929.383222] Hardware name: Allwinner sun7i (A20) Family
[73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
[73929.383363] LR is at dev_get_stats+0x44/0x144
[73929.383389] pc : [<bf126db0>]    lr : [<c09525e8>]    psr: 200f0013
[73929.383401] sp : f0c59c78  ip : f0c59df8  fp : c2bb8000
[73929.383412] r10: 00800001  r9 : c3443dd8  r8 : 00000143
[73929.383423] r7 : 00000001  r6 : 00000000  r5 : c2bbb000  r4 : 00000001
[73929.383434] r3 : 0004c891  r2 : c2bbae48  r1 : f0c59d30  r0 : c2bb8000
[73929.383447] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[73929.383463] Control: 30c5387d  Table: 49b553c0  DAC: a7f66f60
[73929.383486]  stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144
[73929.383564]  dev_get_stats from dev_seq_printf_stats+0x40/0x194
[73929.383593]  dev_seq_printf_stats from dev_seq_show+0x18/0x4c
[73929.383617]  dev_seq_show from seq_read_iter+0x3c4/0x57c
[73929.383647]  seq_read_iter from seq_read+0x9c/0xdc
[73929.383674]  seq_read from proc_reg_read+0xb0/0xe4
[73929.383706]  proc_reg_read from vfs_read+0xa8/0x2f4
[73929.383735]  vfs_read from ksys_read+0x78/0x10c
[73929.383757]  ksys_read from ret_fast_syscall+0x0/0x4c
[73929.383781] Exception stack(0xf0c59fa8 to 0xf0c59ff0)
[73929.383800] 9fa0:                   024b7190 00000498 00000003 024cac10 00000400 00000001
[73929.383817] 9fc0: 024b7190 00000498 b6ef6d20 00000003 0000000a be9eb15c 00000000 00000000
[73929.383831] 9fe0: 00000003 be9eb030 b6e90eeb b6e0ab06

The issue is still present in Linux 6.7. I tried transplanting the stmmac
sub directory from Linux 6.5 to Linux 6.6, but the changes were too big,
the result doesn't even build.

I have been running a bisect since before Christmas, but since it takes
up to a day for the issue to show itself on a "bad" kernel, I let
"good" kernels run for four days before I declare them good. That takes a
lot of wall clock (or better, wall calendar) time.

If you have any ideas why this is happening on my Banana Pis,
I'm open to suggestions. Tentative patches against 6.6.$HIGH or
6.7.$CURRENT would be appreciated as well.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-21 20:17 stmmac on Banana PI CPU stalls since Linux 6.6 Marc Haber
@ 2024-01-21 21:52 ` Andrew Lunn
  2024-01-22 21:34   ` Andrey Jr. Melnikov
  2024-01-25 18:01   ` Marc Haber
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Lunn @ 2024-01-21 21:52 UTC (permalink / raw)
  To: Marc Haber
  Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> Hi,
> 
> I am running a bunch of Banana Pis with Debian stable and unstable but
> with a bleeding edge kernel. Since kernel 6.6, especially the test
> system running Debian unstable is plagued by self-detected stalls on
> CPU. The system seems to continue running normally locally but doesn't
> answer on the network any more. Sometimes, after a few hours, things
> heal themselves.
> 
> Here is an example log output:
> [73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
> [73929.368653] rcu:     1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
> [73929.377796] rcu:     (t=5250 jiffies g=851349 q=113 ncpus=2)
> [73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G             L     6.6.0-zgbpi-armmp-lpae+ #1
> [73929.383222] Hardware name: Allwinner sun7i (A20) Family
> [73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> [73929.383363] LR is at dev_get_stats+0x44/0x144
> [73929.383389] pc : [<bf126db0>]    lr : [<c09525e8>]    psr: 200f0013
> [73929.383401] sp : f0c59c78  ip : f0c59df8  fp : c2bb8000
> [73929.383412] r10: 00800001  r9 : c3443dd8  r8 : 00000143
> [73929.383423] r7 : 00000001  r6 : 00000000  r5 : c2bbb000  r4 : 00000001
> [73929.383434] r3 : 0004c891  r2 : c2bbae48  r1 : f0c59d30  r0 : c2bb8000
> [73929.383447] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [73929.383463] Control: 30c5387d  Table: 49b553c0  DAC: a7f66f60
> [73929.383486]  stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144

Hi Marc

https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949

My _guess_ would be that it's stuck in one of the loops which look like:

		do {
			start = u64_stats_fetch_begin(&txq_stats->syncp);
			tx_packets = txq_stats->tx_packets;
			tx_bytes   = txq_stats->tx_bytes;
		} while (u64_stats_fetch_retry(&txq_stats->syncp, start));

Next time you get a backtrace, could you run:

make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst

You can then use whatever it is reporting for:

PC is at stmmac_get_stats64+0x64/0x20c [stmmac]

to find where it is in the listing.

Once we know if it's the RX or the TX loop, we have a better idea where
to look for an unbalanced u64_stats_update_begin() /
u64_stats_update_end().
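
To illustrate why an unbalanced begin/end pair would hang the reader,
here is a minimal user-space sketch of a sequence counter. It is not the
kernel implementation: struct demo_sync and every demo_* function are
hypothetical stand-ins for u64_stats_sync and the u64_stats_* helpers,
and the reader gives up after a bounded number of spins only so that the
sketch can terminate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's u64_stats_sync sequence counter. */
struct demo_sync {
    unsigned int seq;   /* even: idle, odd: a writer is mid-update */
};

/* Writer side: each update is bracketed by two increments. */
static void demo_update_begin(struct demo_sync *s) { s->seq++; }
static void demo_update_end(struct demo_sync *s)   { s->seq++; }

/*
 * Reader entry: wait until the sequence is even, then remember it.
 * A real reader spins without a limit; this sketch stops after
 * max_spins attempts and returns false instead of hanging.
 */
static bool demo_fetch_begin(const struct demo_sync *s,
                             unsigned int max_spins, unsigned int *start)
{
    assert(s && start);
    for (unsigned int i = 0; i < max_spins; i++) {
        unsigned int seq = s->seq;

        if (!(seq & 1)) {
            *start = seq;
            return true;
        }
    }
    return false;
}

/* Reader exit: retry if a writer ran since demo_fetch_begin(). */
static bool demo_fetch_retry(const struct demo_sync *s, unsigned int start)
{
    return s->seq != start;
}
```

After a balanced demo_update_begin()/demo_update_end() pair the counter
is even and demo_fetch_begin() succeeds at once. After a begin without a
matching end the counter stays odd and the reader never gets past the
entry check, which is the kind of spin the stall report would show.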

> I am running a bisect attempt since before christmas, but since it takes
> up to a day for the issue to show themselves on a "bad" kernel, I'll let
> "good" kernels run for four days until I declare them good. That takes a
> lot of wall clock (or better, wall calendar) time.

You might be able to speed it up with:

while true ; do cat /proc/net/dev > /dev/null ; done

and iperf or similar to generate a lot of traffic.

    Andrew


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-21 21:52 ` Andrew Lunn
@ 2024-01-22 21:34   ` Andrey Jr. Melnikov
  2024-01-25 18:01   ` Marc Haber
  1 sibling, 0 replies; 18+ messages in thread
From: Andrey Jr. Melnikov @ 2024-01-22 21:34 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Marc Haber, alexandre.torgue, Jose Abreu, Chen-Yu Tsai,
	Jernej Skrabec, Samuel Holland, Jisheng Zhang, netdev

On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> > Hi,

Hello. I have the same symptom on the same board.

[skip]
 
> make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> use whatever it is reporting for:
> 
> PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> 
> to find where it is in the listing.

root@bpi:~# grep -ah 'PC is at ' /var/log/syslog*
Jan 22 20:13:04 bpi kernel: [256048.826170] PC is at stmmac_get_stats64+0x5c/0x1f8 [stmmac]
Jan 22 20:14:51 bpi kernel: [256156.077831] PC is at stmmac_get_stats64+0x40/0x1f8 [stmmac]
Jan 22 20:15:18 bpi kernel: [256183.687522] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:50:44 bpi kernel: [156104.837571] PC is at stmmac_get_stats64+0x4c/0x1f8 [stmmac]
Jan 17 10:51:52 bpi kernel: [156172.085436] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:52:37 bpi kernel: [156217.161344] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:53:03 bpi kernel: [156243.852175] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:54:40 bpi kernel: [156340.689082] PC is at stmmac_get_stats64+0x48/0x1f8 [stmmac]
Jan 17 10:55:07 bpi kernel: [156367.851904] PC is at stmmac_get_stats64+0x50/0x1f8 [stmmac]
Jan 17 10:56:11 bpi kernel: [156431.692860] PC is at stmmac_get_stats64+0x44/0x1f8 [stmmac]
Jan 17 10:56:49 bpi kernel: [156469.648758] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:57:15 bpi kernel: [156495.851573] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 10:59:20 bpi kernel: [156620.036359] PC is at stmmac_get_stats64+0x64/0x1f8 [stmmac]
Jan 17 11:00:31 bpi kernel: [156691.276191] PC is at stmmac_get_stats64+0x38/0x1f8 [stmmac]
Jan 17 11:01:07 bpi kernel: [156727.700103] PC is at stmmac_get_stats64+0x40/0x1f8 [stmmac]
Jan 17 11:01:31 bpi kernel: [156751.850926] PC is at stmmac_get_stats64+0x48/0x1f8 [stmmac]

So the PC is always after the first memory barrier (according to objdump -DS stmmac.ko):

....

00005b6c <stmmac_get_stats64>:
    5b6c:       e92d47f0        push    {r4, r5, r6, r7, r8, r9, sl, lr}
    5b70:       e52de004        push    {lr}            @ (str lr, [sp, #-4]!)
    5b74:       ebfffffe        bl      0 <__gnu_mcount_nc>
    5b78:       e2805a03        add     r5, r0, #12288  @ 0x3000
    5b7c:       e59535c0        ldr     r3, [r5, #1472] @ 0x5c0
    5b80:       e5937078        ldr     r7, [r3, #120]  @ 0x78
    5b84:       e5934074        ldr     r4, [r3, #116]  @ 0x74
    5b88:       e3570000        cmp     r7, #0 // r7 - 
    5b8c:       12802db9        addne   r2, r0, #11840  @ 0x2e40
    5b90:       12822008        addne   r2, r2, #8
    5b94:       13a06000        movne   r6, #0
    5b98:       1a00000b        bne     5bcc <stmmac_get_stats64+0x60>
    5b9c:       ea000026        b       5c3c <stmmac_get_stats64+0xd0>
    5ba0:       f57ff05b        dmb     ish
    5ba4:       e320f000        nop     {0}
    5ba8:       e320f000        nop     {0}
    5bac:       e320f000        nop     {0}
    5bb0:       e320f000        nop     {0}
    5bb4:       e320f000        nop     {0}
    5bb8:       e320f000        nop     {0}
    5bbc:       e320f000        nop     {0}
    5bc0:       e320f000        nop     {0}
    5bc4:       e320f000        nop     {0}
    5bc8:       e320f000        nop     {0}
    5bcc:       e5923000        ldr     r3, [r2]
    5bd0:       e3130001        tst     r3, #1
    5bd4:       1afffff1        bne     5ba0 <stmmac_get_stats64+0x34>
    5bd8:       f57ff05b        dmb     ish

....

It loops in the TX stats reading.
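
The register dump in the first message of this thread can be read
against this disassembly. In this build 0x5b6c + 0x64 = 0x5bd0, the
"tst r3, #1" that follows "ldr r3, [r2]", so at a PC of +0x64 r3 would
hold the sequence value just loaded and r2 its address. The sketch below
assumes that the module in the original report uses the same instruction
sequence at +0x60/+0x64 (the later listing in this thread shows the same
r2 setup and the same branch target at +0x60). The DUMP_* macros and
demo_* helpers are hypothetical names; the values are copied from the
dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register values copied from the stall report in the first message. */
#define DUMP_R0 0xc2bb8000u   /* r0: first argument of stmmac_get_stats64 */
#define DUMP_R2 0xc2bbae48u   /* r2: address the loop reads from */
#define DUMP_R3 0x0004c891u   /* r3: value last loaded from [r2] */

/* Mirrors "addne r2, r0, #11840" followed by "addne r2, r2, #8". */
static uint32_t demo_sync_addr(uint32_t r0)
{
    assert(r0 <= UINT32_MAX - 11848u);   /* no wrap-around */
    return r0 + 11840u + 8u;
}

/* Mirrors "tst r3, #1": an odd sequence sends the reader back to spin. */
static bool demo_seq_is_odd(uint32_t seq)
{
    return (seq & 1u) != 0;
}
```

Under that assumption demo_sync_addr(DUMP_R0) equals DUMP_R2 and
demo_seq_is_odd(DUMP_R3) is true, which would be consistent with a
reader waiting on a sequence counter that is stuck at an odd value.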
 
> Once we know if its the RX or the TX loop, we have a better idea where
> to look for an unbalanced u64_stats_update_begin() /
> u64_stats_update_end().



* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-21 21:52 ` Andrew Lunn
  2024-01-22 21:34   ` Andrey Jr. Melnikov
@ 2024-01-25 18:01   ` Marc Haber
  2024-01-25 19:54     ` Andrew Lunn
  2024-01-26 10:48     ` Marc Haber
  1 sibling, 2 replies; 18+ messages in thread
From: Marc Haber @ 2024-01-25 18:01 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

Hi,

On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> > Hi,
> > 
> > I am running a bunch of Banana Pis with Debian stable and unstable but
> > with a bleeding edge kernel. Since kernel 6.6, especially the test
> > system running Debian unstable is plagued by self-detected stalls on
> > CPU. The system seems to continue running normally locally but doesn't
> > answer on the network any more. Sometimes, after a few hours, things
> > heal themselves.
> > 
> > Here is an example log output:
> > [73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
> > [73929.368653] rcu:     1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
> > [73929.377796] rcu:     (t=5250 jiffies g=851349 q=113 ncpus=2)
> > [73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G             L     6.6.0-zgbpi-armmp-lpae+ #1
> > [73929.383222] Hardware name: Allwinner sun7i (A20) Family
> > [73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> > [73929.383363] LR is at dev_get_stats+0x44/0x144
> > [73929.383389] pc : [<bf126db0>]    lr : [<c09525e8>]    psr: 200f0013
> > [73929.383401] sp : f0c59c78  ip : f0c59df8  fp : c2bb8000
> > [73929.383412] r10: 00800001  r9 : c3443dd8  r8 : 00000143
> > [73929.383423] r7 : 00000001  r6 : 00000000  r5 : c2bbb000  r4 : 00000001
> > [73929.383434] r3 : 0004c891  r2 : c2bbae48  r1 : f0c59d30  r0 : c2bb8000
> > [73929.383447] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > [73929.383463] Control: 30c5387d  Table: 49b553c0  DAC: a7f66f60
> > [73929.383486]  stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144
> 
> Hi Marc
> 
> https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949

That is just for reference to the source? Or am I supposed to do
something with that link?

> My _guess_ would be, its stuck in one of the loops which look like:
> 
> 		do {
> 			start = u64_stats_fetch_begin(&txq_stats->syncp);
> 			tx_packets = txq_stats->tx_packets;
> 			tx_bytes   = txq_stats->tx_bytes;
> 		} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
> 
> Next time you get a backtrace, could you do:
> 
> make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> use whatever it is reporting for:

I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
the first bad commit that the bisect eventually identified) and tried
running:

[56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
  SYNC    include/config/auto.conf.cmd
  SYSHDR  arch/arm/include/generated/uapi/asm/unistd-oabi.h
  SYSHDR  arch/arm/include/generated/uapi/asm/unistd-eabi.h
  HOSTCC  scripts/kallsyms
  UPD     include/config/kernel.release
  UPD     include/generated/uapi/linux/version.h
  UPD     include/generated/utsrelease.h
  SYSNR   arch/arm/include/generated/asm/unistd-nr.h
  SYSTBL  arch/arm/include/generated/calls-oabi.S
  SYSTBL  arch/arm/include/generated/calls-eabi.S
  CC      scripts/mod/empty.o
  MKELF   scripts/mod/elfconfig.h
  HOSTCC  scripts/mod/modpost.o
  CC      scripts/mod/devicetable-offsets.s
  UPD     scripts/mod/devicetable-offsets.h
  HOSTCC  scripts/mod/file2alias.o
  HOSTCC  scripts/mod/sumversion.o
  HOSTLD  scripts/mod/modpost
  CC      kernel/bounds.s
  CC      arch/arm/kernel/asm-offsets.s
  UPD     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  CHKSHA1 include/linux/atomic/atomic-arch-fallback.h
  CHKSHA1 include/linux/atomic/atomic-instrumented.h
  MKLST   drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
[57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $

That is not what it was supposed to yield, right?

> 
> PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> 
> to find where it is in the listing.
> 
> Once we know if its the RX or the TX loop, we have a better idea where
> to look for an unbalanced u64_stats_update_begin() /
> u64_stats_update_end().
> 
> > I am running a bisect attempt since before christmas, but since it takes
> > up to a day for the issue to show themselves on a "bad" kernel, I'll let
> > "good" kernels run for four days until I declare them good. That takes a
> > lot of wall clock (or better, wall calendar) time.
> 
> You might be able to speed it up with:
> 
> while true ; do cat /proc/net/dev > /dev/null ; done
> 
> and iperf or similar to generate a lot of traffic.

My bisect eventually completed and identified
2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
Sadly, it doesn't contain any loops, any calls to u64_stats_update_begin()
or u64_stats_update_end(), or anything else that looks suspicious to the
casual reader.

I have backed that commit out of 6.7.1 and have booted that kernel.
It has not been running long enough to say anything yet.

Greetings
Marc



* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-25 18:01   ` Marc Haber
@ 2024-01-25 19:54     ` Andrew Lunn
  2024-01-25 20:00       ` Florian Fainelli
  2024-01-26 10:48     ` Marc Haber
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2024-01-25 19:54 UTC (permalink / raw)
  To: Marc Haber
  Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

> I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
> the first bad commit that the bisect eventually identified) and tried
> running:
> 
> [56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
>   SYNC    include/config/auto.conf.cmd
>   SYSHDR  arch/arm/include/generated/uapi/asm/unistd-oabi.h
>   SYSHDR  arch/arm/include/generated/uapi/asm/unistd-eabi.h
>   HOSTCC  scripts/kallsyms
>   UPD     include/config/kernel.release
>   UPD     include/generated/uapi/linux/version.h
>   UPD     include/generated/utsrelease.h
>   SYSNR   arch/arm/include/generated/asm/unistd-nr.h
>   SYSTBL  arch/arm/include/generated/calls-oabi.S
>   SYSTBL  arch/arm/include/generated/calls-eabi.S
>   CC      scripts/mod/empty.o
>   MKELF   scripts/mod/elfconfig.h
>   HOSTCC  scripts/mod/modpost.o
>   CC      scripts/mod/devicetable-offsets.s
>   UPD     scripts/mod/devicetable-offsets.h
>   HOSTCC  scripts/mod/file2alias.o
>   HOSTCC  scripts/mod/sumversion.o
>   HOSTLD  scripts/mod/modpost
>   CC      kernel/bounds.s
>   CC      arch/arm/kernel/asm-offsets.s
>   UPD     include/generated/asm-offsets.h
>   CALL    scripts/checksyscalls.sh
>   CHKSHA1 include/linux/atomic/atomic-arch-fallback.h
>   CHKSHA1 include/linux/atomic/atomic-instrumented.h
>   MKLST   drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
> ./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
> [57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $
> 
> That is not what it was suppsoed to yield, right?

No. But did it actually generate
drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst? Sometimes errors
like this are not fatal.

> My bisect eventually completed and identified
> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.

I can make a guess.

-       memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));

It's removed, not moved later. Deep within this structure are the
stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
to return, and the two syncp variables are in there as well.

My guess is that they have an invalid state when this memset is missing.

Try putting the memset back.

I also guess that is not the real fix: there are missing calls to
u64_stats_init().

	Andrew


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-25 19:54     ` Andrew Lunn
@ 2024-01-25 20:00       ` Florian Fainelli
  2024-01-26  7:51         ` Petr Tesařík
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Fainelli @ 2024-01-25 20:00 UTC (permalink / raw)
  To: Andrew Lunn, Marc Haber, Petr Tesarik
  Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

On 1/25/24 11:54, Andrew Lunn wrote:
>> I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
>> the first bad commit that the bisect eventually identified) and tried
>> running:
>>
>> [56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
>>    SYNC    include/config/auto.conf.cmd
>>    SYSHDR  arch/arm/include/generated/uapi/asm/unistd-oabi.h
>>    SYSHDR  arch/arm/include/generated/uapi/asm/unistd-eabi.h
>>    HOSTCC  scripts/kallsyms
>>    UPD     include/config/kernel.release
>>    UPD     include/generated/uapi/linux/version.h
>>    UPD     include/generated/utsrelease.h
>>    SYSNR   arch/arm/include/generated/asm/unistd-nr.h
>>    SYSTBL  arch/arm/include/generated/calls-oabi.S
>>    SYSTBL  arch/arm/include/generated/calls-eabi.S
>>    CC      scripts/mod/empty.o
>>    MKELF   scripts/mod/elfconfig.h
>>    HOSTCC  scripts/mod/modpost.o
>>    CC      scripts/mod/devicetable-offsets.s
>>    UPD     scripts/mod/devicetable-offsets.h
>>    HOSTCC  scripts/mod/file2alias.o
>>    HOSTCC  scripts/mod/sumversion.o
>>    HOSTLD  scripts/mod/modpost
>>    CC      kernel/bounds.s
>>    CC      arch/arm/kernel/asm-offsets.s
>>    UPD     include/generated/asm-offsets.h
>>    CALL    scripts/checksyscalls.sh
>>    CHKSHA1 include/linux/atomic/atomic-arch-fallback.h
>>    CHKSHA1 include/linux/atomic/atomic-instrumented.h
>>    MKLST   drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
>> ./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
>> [57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $
>>
>> That is not what it was suppsoed to yield, right?
> 
> No. But did it actually generate
> drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst Sometime errors
> like this are not always fatal.
> 
>> My bisect eventually completed and identified
>> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
> 
> I can make a guess.
> 
> -       memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));
> 
> Its removed, not moved later. Deep within this structure is the
> stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
> to return, and the two syncp variables are in it as well.
> 
> My guess is, they have an invalid state, when this memset is missing.
> 
> Try putting the memset back.
> 
> I also guess that is not the real fix, there are missing calls to
> u64_stats_init().

Did Petr not try to address essentially the same problem?

https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/

This was not deemed the proper solution and I don't think one has been
posted since then, but it looks like your issue here, Marc.
-- 
Florian



* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-25 20:00       ` Florian Fainelli
@ 2024-01-26  7:51         ` Petr Tesařík
  2024-01-26 10:54           ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-01-26  7:51 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Marc Haber, alexandre.torgue, Jose Abreu,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
	netdev

On Thu, 25 Jan 2024 12:00:46 -0800
Florian Fainelli <f.fainelli@gmail.com> wrote:

> On 1/25/24 11:54, Andrew Lunn wrote:
> >> I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
> >> the first bad commit that the bisect eventually identified) and tried
> >> running:
> >>
> >> [56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
> >>    SYNC    include/config/auto.conf.cmd
> >>    SYSHDR  arch/arm/include/generated/uapi/asm/unistd-oabi.h
> >>    SYSHDR  arch/arm/include/generated/uapi/asm/unistd-eabi.h
> >>    HOSTCC  scripts/kallsyms
> >>    UPD     include/config/kernel.release
> >>    UPD     include/generated/uapi/linux/version.h
> >>    UPD     include/generated/utsrelease.h
> >>    SYSNR   arch/arm/include/generated/asm/unistd-nr.h
> >>    SYSTBL  arch/arm/include/generated/calls-oabi.S
> >>    SYSTBL  arch/arm/include/generated/calls-eabi.S
> >>    CC      scripts/mod/empty.o
> >>    MKELF   scripts/mod/elfconfig.h
> >>    HOSTCC  scripts/mod/modpost.o
> >>    CC      scripts/mod/devicetable-offsets.s
> >>    UPD     scripts/mod/devicetable-offsets.h
> >>    HOSTCC  scripts/mod/file2alias.o
> >>    HOSTCC  scripts/mod/sumversion.o
> >>    HOSTLD  scripts/mod/modpost
> >>    CC      kernel/bounds.s
> >>    CC      arch/arm/kernel/asm-offsets.s
> >>    UPD     include/generated/asm-offsets.h
> >>    CALL    scripts/checksyscalls.sh
> >>    CHKSHA1 include/linux/atomic/atomic-arch-fallback.h
> >>    CHKSHA1 include/linux/atomic/atomic-instrumented.h
> >>    MKLST   drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
> >> ./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
> >> [57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $
> >>
> >> That is not what it was suppsoed to yield, right?  
> > 
> > No. But did it actually generate
> > drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst Sometime errors
> > like this are not always fatal.
> >   
> >> My bisect eventually completed and identified
> >> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.  
> > 
> > I can make a guess.
> > 
> > -       memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));
> > 
> > Its removed, not moved later. Deep within this structure is the
> > stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
> > to return, and the two syncp variables are in it as well.
> > 
> > My guess is, they have an invalid state, when this memset is missing.
> > 
> > Try putting the memset back.
> > 
> > I also guess that is not the real fix, there are missing calls to
> > u64_stats_init().  
> 
> Did not Petr try to address the same problem essentially:
> 
> https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
> 
> this was not deemed the proper solution and I don't think one has been 
> posted since then, but it looks about your issue here Marc.

Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
I've been busy with other things lately, so I could not test and submit
my changes.

Essentially, the write side of the statistics seqlock is not protected
and will eventually miss an increment, causing the read side to spin
forever. The final plan is to split the statistics into three parts:

1. fields updated only under the tx queue lock,
2. fields updated only during NAPI poll,
3. fields updated only from interrupt context.

The first two groups can each have their own seqlock. The third group
(actually a single counter) can be converted to a per-CPU variable. The
read side will then aggregate the values as appropriate.
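
To make the missed increment concrete, here is a minimal sketch that
replays one possible interleaving of two unserialized writers in a
single thread. It is an illustration under assumptions, not the driver
code: struct demo_seq and the demo_* functions are hypothetical, and the
non-atomic increment is split by hand into its load and store halves so
that the interleaving is deterministic.

```c
#include <assert.h>

/* Hypothetical stand-in for the sequence counter; not the kernel type. */
struct demo_seq {
    unsigned int seq;
};

/* The two halves of a plain, non-atomic "seq++". */
static unsigned int demo_load(const struct demo_seq *s)
{
    assert(s);
    return s->seq;
}

static void demo_store(struct demo_seq *s, unsigned int loaded)
{
    assert(s);
    s->seq = loaded + 1;
}

/* Two updates that run one after the other: four increments in total. */
static unsigned int demo_serialized_updates(void)
{
    struct demo_seq s = { 0 };

    for (int i = 0; i < 4; i++)
        demo_store(&s, demo_load(&s));
    return s.seq;
}

/* Writer A's "end" increment races with writer B's "begin" increment. */
static unsigned int demo_racy_updates(void)
{
    struct demo_seq s = { 0 };
    unsigned int a, b;

    demo_store(&s, demo_load(&s));   /* A begins: 0 -> 1 */

    a = demo_load(&s);               /* A ends: loads 1 ... */
    b = demo_load(&s);               /* B begins: also loads 1 */
    demo_store(&s, a);               /* A stores 2 */
    demo_store(&s, b);               /* B stores 2: one increment lost */

    demo_store(&s, demo_load(&s));   /* B ends: 2 -> 3 */
    return s.seq;
}
```

The serialized sequence ends at 4, which is even, while the racy one
ends at 3. With no writer active the counter is then odd, so a reader
that waits for an even value has nothing left to wait for and spins
until some later race happens to flip the parity back.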

I hope I can find some time for this bug again during the coming weekend
(it's not for my day job). It's motivating to know that I'm not the
only affected person on the planet. ;-)

Petr T


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-25 18:01   ` Marc Haber
  2024-01-25 19:54     ` Andrew Lunn
@ 2024-01-26 10:48     ` Marc Haber
  1 sibling, 0 replies; 18+ messages in thread
From: Marc Haber @ 2024-01-26 10:48 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

On Thu, Jan 25, 2024 at 07:01:40PM +0100, Marc Haber wrote:
> On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> > On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> > > Hi,
> > > 
> > > I am running a bunch of Banana Pis with Debian stable and unstable but
> > > with a bleeding edge kernel. Since kernel 6.6, especially the test
> > > system running Debian unstable is plagued by self-detected stalls on
> > > CPU. The system seems to continue running normally locally but doesn't
> > > answer on the network any more. Sometimes, after a few hours, things
> > > heal themselves.
> > > 
> > > Here is an example log output:
> > > [73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
> > > [73929.368653] rcu:     1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
> > > [73929.377796] rcu:     (t=5250 jiffies g=851349 q=113 ncpus=2)
> > > [73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G             L     6.6.0-zgbpi-armmp-lpae+ #1
> > > [73929.383222] Hardware name: Allwinner sun7i (A20) Family
> > > [73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> > > [73929.383363] LR is at dev_get_stats+0x44/0x144
> > > [73929.383389] pc : [<bf126db0>]    lr : [<c09525e8>]    psr: 200f0013
> > > [73929.383401] sp : f0c59c78  ip : f0c59df8  fp : c2bb8000
> > > [73929.383412] r10: 00800001  r9 : c3443dd8  r8 : 00000143
> > > [73929.383423] r7 : 00000001  r6 : 00000000  r5 : c2bbb000  r4 : 00000001
> > > [73929.383434] r3 : 0004c891  r2 : c2bbae48  r1 : f0c59d30  r0 : c2bb8000
> > > [73929.383447] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > > [73929.383463] Control: 30c5387d  Table: 49b553c0  DAC: a7f66f60
> > > [73929.383486]  stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144
> > 
> > Hi Marc
> > 
> > https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949
> 
> That is just for reference to the source? Or am I supposed to do
> something with that link?
> 
> > My _guess_ would be, its stuck in one of the loops which look like:
> > 
> > 		do {
> > 			start = u64_stats_fetch_begin(&txq_stats->syncp);
> > 			tx_packets = txq_stats->tx_packets;
> > 			tx_bytes   = txq_stats->tx_bytes;
> > 		} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
> > 
> > Next time you get a backtrace, could you do:
> > 
> > make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> > use whatever it is reporting for:

So, if I have in my current backtrace:
PC is at stmmac_get_stats64+0x48/0x20c [stmmac]
I look in the generated stmmac_main.lst for the function
stmmac_get_stats64:
00005e9c <stmmac_get_stats64>:
{
    5e9c:       e92d47f0        push    {r4, r5, r6, r7, r8, r9, sl, lr}
    5ea0:       e52de004        push    {lr}            @ (str lr, [sp, #-4]!)
    5ea4:       ebfffffe        bl      0 <__gnu_mcount_nc>
                        5ea4: R_ARM_CALL        __gnu_mcount_nc
        u32 tx_cnt = priv->plat->tx_queues_to_use;
    5ea8:       e2805a03        add     r5, r0, #12288  @ 0x3000
    5eac:       e59535c0        ldr     r3, [r5, #1472] @ 0x5c0
    5eb0:       e5937078        ldr     r7, [r3, #120]  @ 0x78
        u32 rx_cnt = priv->plat->rx_queues_to_use;
    5eb4:       e5934074        ldr     r4, [r3, #116]  @ 0x74
        for (q = 0; q < tx_cnt; q++) {
    5eb8:       e3570000        cmp     r7, #0
    5ebc:       12802db9        addne   r2, r0, #11840  @ 0x2e40
    5ec0:       12822008        addne   r2, r2, #8
    5ec4:       13a06000        movne   r6, #0
    5ec8:       1a00000b        bne     5efc <stmmac_get_stats64+0x60>
    5ecc:       ea000026        b       5f6c <stmmac_get_stats64+0xd0>
        local_irq_restore(flags);
}

the address in the first line is the base address, so the line in
question is 0x5e9c+0x48=0x5ee4, which is already outside the function?!
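
A backtrace entry has the form symbol+offset/size, so the 0x20c is the
length of stmmac_get_stats64 in this build and the function would run
from 0x5e9c up to, but not including, 0x60a8. The short sketch below
spells out that arithmetic; the demo_* helper names are hypothetical.
On that reading 0x5ee4 lies inside the function, and the excerpt quoted
above simply does not reach that address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Address in the listing for a "symbol+offset/size" backtrace entry. */
static uint32_t demo_listing_addr(uint32_t sym_start, uint32_t offset)
{
    assert(sym_start <= UINT32_MAX - offset);   /* no wrap-around */
    return sym_start + offset;
}

/* True when the offset lies within a function of the given size. */
static bool demo_offset_in_function(uint32_t offset, uint32_t size)
{
    return offset < size;
}
```

For the entry stmmac_get_stats64+0x48/0x20c with the symbol at 0x5e9c,
demo_listing_addr(0x5e9c, 0x48) gives 0x5ee4 and
demo_offset_in_function(0x48, 0x20c) is true.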

> My bisect eventually completed and identified
> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
> Sadly, it doesn't contain any loops, no calls to u64_stats_update_begin()
> or u64_stats_update_end() or other suspicious things to the casual
> reader.
> 
> I have backed that commit out of 6.7.1 and have booted that kernel.
> Not long enough to be able to say something yet.

That didn't fix the hangs; the PC is at one of:
stmmac_get_stats64+0x34/0x20c
stmmac_get_stats64+0x38/0x20c
stmmac_get_stats64+0x3c/0x20c
stmmac_get_stats64+0x40/0x20c
stmmac_get_stats64+0x44/0x20c
stmmac_get_stats64+0x48/0x20c
stmmac_get_stats64+0x4c/0x20c
stmmac_get_stats64+0x50/0x20c
stmmac_get_stats64+0x54/0x20c
stmmac_get_stats64+0x58/0x20c
stmmac_get_stats64+0x5c/0x20c
stmmac_get_stats64+0x60/0x20c
stmmac_get_stats64+0x64/0x20c
(sorted, uniq, about 66 instances in about 18 hours)

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-26  7:51         ` Petr Tesařík
@ 2024-01-26 10:54           ` Marc Haber
  2024-01-26 11:10             ` Petr Tesařík
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2024-01-26 10:54 UTC (permalink / raw)
  To: Petr Tesařík
  Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
	netdev

On Fri, Jan 26, 2024 at 08:51:22AM +0100, Petr Tesařík wrote:
> On Thu, 25 Jan 2024 12:00:46 -0800
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> > Did not Petr try to address the same problem essentially:
> > 
> > https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
> > 
> > this was not deemed the proper solution and I don't think one has been
> > posted since then, but it looks like your issue here, Marc.
> 
> Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
> I've been busy with other things lately, so I could not test and submit
> my changes.

Is it worth trying your patch from the message cited above, knowing that
it is not the final solution?

> I hope I can find some time for this bug again during the coming weekend
> (it's not for my day job). It's motivating to know that I'm not the
> only affected person on the planet. ;-)

I am ready to test if you want me to ;-)

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-26 10:54           ` Marc Haber
@ 2024-01-26 11:10             ` Petr Tesařík
  2024-02-05 20:12               ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-01-26 11:10 UTC (permalink / raw)
  To: Marc Haber
  Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
	netdev

On Fri, 26 Jan 2024 11:54:20 +0100
Marc Haber <mh+netdev@zugschlus.de> wrote:

> On Fri, Jan 26, 2024 at 08:51:22AM +0100, Petr Tesařík wrote:
> > On Thu, 25 Jan 2024 12:00:46 -0800
> > Florian Fainelli <f.fainelli@gmail.com> wrote:  
> > > Did not Petr try to address the same problem essentially:
> > > 
> > > https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
> > > 
> > > this was not deemed the proper solution and I don't think one has been
> > > posted since then, but it looks like your issue here, Marc.
> > 
> > Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
> > I've been busy with other things lately, so I could not test and submit
> > my changes.  
> 
> Is it worth trying your patch from the message cited above, knowing that
> it is not the final solution?

Depends. It solves the deadlock (at least for me); my NanoPi has been
running stable for over a month with this patch. But it also introduces
a new spinlock, which usually reduces performance.

In any case, you can give it a try to verify that you hit the same
issue.

> > I hope I can find some time for this bug again during the coming weekend
> > (it's not for my day job). It's motivating to know that I'm not the
> > only affected person on the planet. ;-)  
> 
> I am ready to test if you want me to ;-)

Then you may want to start by verifying that it is indeed the same
issue. Try the linked patch.

Thank you!

Petr T


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-01-26 11:10             ` Petr Tesařík
@ 2024-02-05 20:12               ` Marc Haber
  2024-02-05 21:50                 ` Florian Fainelli
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2024-02-05 20:12 UTC (permalink / raw)
  To: Petr Tesařík
  Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
	netdev

On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> Then you may want to start by verifying that it is indeed the same
> issue. Try the linked patch.

The linked patch seemed to help for 6.7.2, the test machine ran for five
days without problems. After going to unpatched 6.7.2, the issue was
back in six hours.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-05 20:12               ` Marc Haber
@ 2024-02-05 21:50                 ` Florian Fainelli
  2024-02-06  8:23                   ` Petr Tesařík
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Fainelli @ 2024-02-05 21:50 UTC (permalink / raw)
  To: Marc Haber, Petr Tesařík
  Cc: Andrew Lunn, alexandre.torgue, Jose Abreu, Chen-Yu Tsai,
	Jernej Skrabec, Samuel Holland, Jisheng Zhang, netdev

On 2/5/24 12:12, Marc Haber wrote:
> On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
>> Then you may want to start by verifying that it is indeed the same
>> issue. Try the linked patch.
> 
> The linked patch seemed to help for 6.7.2, the test machine ran for five
> days without problems. After going to unpatched 6.7.2, the issue was
> back in six hours.

Do you mind responding to Petr's patch with a Tested-by? Thanks!
-- 
Florian



* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-05 21:50                 ` Florian Fainelli
@ 2024-02-06  8:23                   ` Petr Tesařík
  2024-02-12 12:15                     ` Marc Haber
  0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-02-06  8:23 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Marc Haber, Andrew Lunn, alexandre.torgue, Jose Abreu,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
	netdev

Hi Florian,

On Mon, 5 Feb 2024 13:50:35 -0800
Florian Fainelli <f.fainelli@gmail.com> wrote:

> On 2/5/24 12:12, Marc Haber wrote:
> > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:  
> >> Then you may want to start by verifying that it is indeed the same
> >> issue. Try the linked patch.  
> > 
> > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > days without problems. After going to unpatched 6.7.2, the issue was
> > back in six hours.  
> 
> Do you mind responding to Petr's patch with a Tested-by? Thanks!

I believe Marc tested my first attempt at a solution (the one with
spinlocks), not the latest incarnation. FWIW I have tested a similar
scenario, with similar results.

@Marc: I was able to reduce the time until hang by running a "ping -f"
from another machine on the same LAN and running "ethtool -S" in a
tight loop on the system under testing (over an SSH connection, so it
probably contributed substantially to the network traffic). The
unpatched kernel froze within a few minutes.
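
A rough C stand-in for the reader side of such a stress test is sketched
below. It is not the "ethtool -S" loop described above. It is an assumption
on my part that hammering /proc/net/dev exercises the same code, based on the
first backtrace in this thread, which reaches stmmac_get_stats64 via
dev_seq_show. The function name poll_stats_file and its parameters are
invented for this example.

```c
#include <stdio.h>

/*
 * Read the file at `path` from start to end, `iterations` times.
 * Returns the total number of bytes read, or -1 if it cannot be opened.
 */
static long poll_stats_file(const char *path, unsigned int iterations)
{
	char buf[4096];
	long total = 0;

	for (unsigned int i = 0; i < iterations; i++) {
		FILE *f = fopen(path, "r");
		size_t n;

		if (!f)
			return -1;
		/* Each full read makes the kernel collect the statistics again. */
		while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
			total += (long)n;
		fclose(f);
	}
	return total;
}
```

Calling poll_stats_file("/proc/net/dev", 100000) on the system under test,
while another machine floods it with "ping -f", would be one way to try this.
Whether it triggers the hang as quickly as the ethtool loop is untested.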

Petr T


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-06  8:23                   ` Petr Tesařík
@ 2024-02-12 12:15                     ` Marc Haber
  2024-02-19 19:20                       ` Christian Stewart
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Haber @ 2024-02-12 12:15 UTC (permalink / raw)
  To: Petr Tesařík
  Cc: Florian Fainelli, Andrew Lunn, alexandre.torgue, Jose Abreu,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland, Jisheng Zhang,
	netdev

On Tue, Feb 06, 2024 at 09:23:51AM +0100, Petr Tesařík wrote:
> On Mon, 5 Feb 2024 13:50:35 -0800
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> 
> > On 2/5/24 12:12, Marc Haber wrote:
> > > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:  
> > >> Then you may want to start by verifying that it is indeed the same
> > >> issue. Try the linked patch.  
> > > 
> > > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > > days without problems. After going to unpatched 6.7.2, the issue was
> > > back in six hours.  
> > 
> > Do you mind responding to Petr's patch with a Tested-by? Thanks!
> 
> I believe Marc tested my first attempt at a solution (the one with
> spinlocks), not the latest incarnation. FWIW I have tested a similar
> scenario, with similar results.

Where is the latest patch? I can give it a try.

Sorry for not responding any earlier, February 10 is an important tax
due date in Germany.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-12 12:15                     ` Marc Haber
@ 2024-02-19 19:20                       ` Christian Stewart
  2024-02-19 19:44                         ` Petr Tesařík
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Stewart @ 2024-02-19 19:20 UTC (permalink / raw)
  To: Marc Haber
  Cc: Petr Tesařík, Florian Fainelli, Andrew Lunn,
	alexandre.torgue, Jose Abreu, Chen-Yu Tsai, Jernej Skrabec,
	Samuel Holland, Jisheng Zhang, netdev

Hi all,

On Mon, Feb 12, 2024 at 4:15 AM Marc Haber <mh+netdev@zugschlus.de> wrote:
>
> On Tue, Feb 06, 2024 at 09:23:51AM +0100, Petr Tesařík wrote:
> > On Mon, 5 Feb 2024 13:50:35 -0800
> > Florian Fainelli <f.fainelli@gmail.com> wrote:
> >
> > > On 2/5/24 12:12, Marc Haber wrote:
> > > > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:
> > > >> Then you may want to start by verifying that it is indeed the same
> > > >> issue. Try the linked patch.
> > > >
> > > > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > > > days without problems. After going to unpatched 6.7.2, the issue was
> > > > back in six hours.
> > >
> > > Do you mind responding to Petr's patch with a Tested-by? Thanks!
> >
> > I believe Marc tested my first attempt at a solution (the one with
> > spinlocks), not the latest incarnation. FWIW I have tested a similar
> > scenario, with similar results.
>
> Where is the latest patch? I can give it a try.
>
> Sorry for not responding any earlier, February 10 is an important tax
> due date in Germany.
>
> Greetings
> Marc

We are seeing the same CPU stall on shutdown with 6.7.4 on a
BananaPi M2 Ultra:

[**    ] (3 of 3) A stop job is running for Network Manager (33s / 52s)
[  259.463772] rcu: INFO: rcu_sched self-detected stall on CPU
[  259.469388] rcu:     0-....: (2099 ticks this GP) idle=0fdc/1/0x40000002 softirq=12003/12003 fqs=1034
[  259.478360] rcu:     (t=2100 jiffies g=16277 q=36 ncpus=4)
[  259.483595] CPU: 0 PID: 4462 Comm: ip Tainted: G         C         6.7.4 #1
[  259.490562] Hardware name: Allwinner sun8i Family
[  259.495268] PC is at stmmac_get_stats64+0x30/0x198
[  259.500081] LR is at dev_get_stats+0x3c/0x160
[  259.504445] pc : [<c06b9924>]    lr : [<c07bf7a8>]    psr: 200f0013
[  259.510712] sp : f1e6d9b8  ip : c3ca478c  fp : c23e0000
[  259.515941] r10: 00000000  r9 : c3ca4598  r8 : 00000000
[  259.521168] r7 : 00000001  r6 : 00000000  r5 : c23e3000  r4 : 00000001
[  259.527697] r3 : 00005c1b  r2 : c23e2e08  r1 : c3ca46c4  r0 : c23e0000
[  259.534226] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  259.541363] Control: 10c5387d  Table: 429cc06a  DAC: 00000051
[  259.547117]  stmmac_get_stats64 from dev_get_stats+0x3c/0x160
[  259.552882]  dev_get_stats from rtnl_fill_stats+0x30/0x118
[  259.552899]  rtnl_fill_stats from rtnl_fill_ifinfo+0x720/0x135c
[  259.564306]  rtnl_fill_ifinfo from rtnl_dump_ifinfo+0x330/0x6a8
[  259.570240]  rtnl_dump_ifinfo from netlink_dump+0x16c/0x350
[  259.575830]  netlink_dump from __netlink_dump_start+0x1bc/0x280
[  259.581766]  __netlink_dump_start from rtnetlink_rcv_msg+0xf4/0x2f0
[  259.588047]  rtnetlink_rcv_msg from netlink_rcv_skb+0xb8/0x118
[  259.593893]  netlink_rcv_skb from netlink_unicast+0x1fc/0x2d8
[  259.599655]  netlink_unicast from netlink_sendmsg+0x1c8/0x440
[  259.605416]  netlink_sendmsg from sock_write_iter+0xa0/0x10c
[  259.611094]  sock_write_iter from vfs_write+0x338/0x398
[  259.616334]  vfs_write from ksys_write+0xbc/0xf0
[  259.620961]  ksys_write from ret_fast_syscall+0x0/0x54
[  259.626110] Exception stack(0xf1e6dfa8 to 0xf1e6dff0)
[  259.631169] dfa0:                   00000003 be997dd8 00000003 be997dd8 00000014 00000001
[  259.639351] dfc0: 00000003 be997dd8 00000014 00000004 00519548 be997e08 b6fd0ce0 0051783c

https://github.com/skiffos/SkiffOS/issues/307

I'm writing to ask if anyone has found a fix for this yet?

Thanks!
Christian Stewart


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-19 19:20                       ` Christian Stewart
@ 2024-02-19 19:44                         ` Petr Tesařík
  2024-02-20 14:59                           ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Petr Tesařík @ 2024-02-19 19:44 UTC (permalink / raw)
  To: Christian Stewart
  Cc: Marc Haber, Florian Fainelli, Andrew Lunn, alexandre.torgue,
	Jose Abreu, Chen-Yu Tsai, Jernej Skrabec, Samuel Holland,
	Jisheng Zhang, netdev

On Mon, 19 Feb 2024 11:20:35 -0800
Christian Stewart <christian@aperture.us> wrote:

> Hi all,
> 
> On Mon, Feb 12, 2024 at 4:15 AM Marc Haber <mh+netdev@zugschlus.de> wrote:
> >
> > On Tue, Feb 06, 2024 at 09:23:51AM +0100, Petr Tesařík wrote:  
> > > On Mon, 5 Feb 2024 13:50:35 -0800
> > > Florian Fainelli <f.fainelli@gmail.com> wrote:
> > >  
> > > > On 2/5/24 12:12, Marc Haber wrote:  
> > > > > On Fri, Jan 26, 2024 at 12:10:28PM +0100, Petr Tesařík wrote:  
> > > > >> Then you may want to start by verifying that it is indeed the same
> > > > >> issue. Try the linked patch.  
> > > > >
> > > > > The linked patch seemed to help for 6.7.2, the test machine ran for five
> > > > > days without problems. After going to unpatched 6.7.2, the issue was
> > > > > back in six hours.  
> > > >
> > > > Do you mind responding to Petr's patch with a Tested-by? Thanks!  
> > >
> > > I believe Marc tested my first attempt at a solution (the one with
> > > spinlocks), not the latest incarnation. FWIW I have tested a similar
> > > scenario, with similar results.  
> >
> > Where is the latest patch? I can give it a try.
> >
> > Sorry for not responding any earlier, February 10 is an important tax
> > due date in Germany.
> >
> > Greetings
> > Marc  
> 
> We are seeing the same CPU stall on shutdown with 6.7.4 on a
> BananaPi M2 Ultra:
> 
> [**    ] (3 of 3) A stop job is running for Network Manager (33s / 52s)
> [  259.463772] rcu: INFO: rcu_sched self-detected stall on CPU
> [  259.469388] rcu:     0-....: (2099 ticks this GP) idle=0fdc/1/0x40000002 softirq=12003/12003 fqs=1034
> [  259.478360] rcu:     (t=2100 jiffies g=16277 q=36 ncpus=4)
> [  259.483595] CPU: 0 PID: 4462 Comm: ip Tainted: G         C         6.7.4 #1
> [  259.490562] Hardware name: Allwinner sun8i Family
> [  259.495268] PC is at stmmac_get_stats64+0x30/0x198
> [  259.500081] LR is at dev_get_stats+0x3c/0x160
> [  259.504445] pc : [<c06b9924>]    lr : [<c07bf7a8>]    psr: 200f0013
> [  259.510712] sp : f1e6d9b8  ip : c3ca478c  fp : c23e0000
> [  259.515941] r10: 00000000  r9 : c3ca4598  r8 : 00000000
> [  259.521168] r7 : 00000001  r6 : 00000000  r5 : c23e3000  r4 : 00000001
> [  259.527697] r3 : 00005c1b  r2 : c23e2e08  r1 : c3ca46c4  r0 : c23e0000
> [  259.534226] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [  259.541363] Control: 10c5387d  Table: 429cc06a  DAC: 00000051
> [  259.547117]  stmmac_get_stats64 from dev_get_stats+0x3c/0x160
> [  259.552882]  dev_get_stats from rtnl_fill_stats+0x30/0x118
> [  259.552899]  rtnl_fill_stats from rtnl_fill_ifinfo+0x720/0x135c
> [  259.564306]  rtnl_fill_ifinfo from rtnl_dump_ifinfo+0x330/0x6a8
> [  259.570240]  rtnl_dump_ifinfo from netlink_dump+0x16c/0x350
> [  259.575830]  netlink_dump from __netlink_dump_start+0x1bc/0x280
> [  259.581766]  __netlink_dump_start from rtnetlink_rcv_msg+0xf4/0x2f0
> [  259.588047]  rtnetlink_rcv_msg from netlink_rcv_skb+0xb8/0x118
> [  259.593893]  netlink_rcv_skb from netlink_unicast+0x1fc/0x2d8
> [  259.599655]  netlink_unicast from netlink_sendmsg+0x1c8/0x440
> [  259.605416]  netlink_sendmsg from sock_write_iter+0xa0/0x10c
> [  259.611094]  sock_write_iter from vfs_write+0x338/0x398
> [  259.616334]  vfs_write from ksys_write+0xbc/0xf0
> [  259.620961]  ksys_write from ret_fast_syscall+0x0/0x54
> [  259.626110] Exception stack(0xf1e6dfa8 to 0xf1e6dff0)
> [  259.631169] dfa0:                   00000003 be997dd8 00000003 be997dd8 00000014 00000001
> [  259.639351] dfc0: 00000003 be997dd8 00000014 00000004 00519548 be997e08 b6fd0ce0 0051783c
> 
> https://github.com/skiffos/SkiffOS/issues/307
> 
> I'm writing to ask if anyone has found a fix for this yet?

If you're running a 6.7 stable kernel, my patch has just been added to
the 6.7-stable tree.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-6.7/net-stmmac-protect-updates-of-64-bit-statistics-counters.patch

However, lockdep has reported an issue with it:

https://lore.kernel.org/lkml/ea1567d9-ce66-45e6-8168-ac40a47d1821@roeck-us.net/

This new report has not yet been properly understood, but FWIW I've
been running stable with my patch for over a month now.
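
To illustrate why unprotected updates can block a reader forever, here is a
toy, single-threaded model of a sequence counter. It is a sketch under my own
assumptions, not the kernel's u64_stats_sync code, and every name in it is
made up. The assumption, as far as I understand the problem, is that two
writers on a 32-bit system enter the update section without mutual exclusion
and one of the two "begin" increments gets lost. The counter then stays odd
after both writers have finished, and a reader that waits for an even value
never gets one.

```c
#include <stdbool.h>

/* Toy sequence counter: even means idle, odd means an update is in progress. */
struct toy_sync {
	unsigned int seq;
};

static void toy_update_begin(struct toy_sync *s) { s->seq++; }
static void toy_update_end(struct toy_sync *s)   { s->seq++; }

/*
 * Two unserialized writers start an update at the same time: both load the
 * same old value and both store old + 1, so one increment is lost.
 */
static void toy_racy_double_begin(struct toy_sync *s)
{
	unsigned int seen_by_a = s->seq;
	unsigned int seen_by_b = s->seq;

	s->seq = seen_by_a + 1;
	s->seq = seen_by_b + 1;
}

/*
 * Bounded stand-in for the reader's retry loop. Returns true if a
 * consistent snapshot could be taken within max_tries attempts.
 */
static bool toy_reader_can_finish(const struct toy_sync *s,
				  unsigned int max_tries)
{
	for (unsigned int i = 0; i < max_tries; i++) {
		unsigned int start = s->seq;

		if (start & 1)
			continue;	/* looks like a writer is active: retry */
		/* ... the counters would be copied here ... */
		if (s->seq == start)
			return true;
	}
	return false;
}
```

After a properly paired begin/end the counter is even again and the reader
succeeds on its first attempt. After toy_racy_double_begin() followed by two
toy_update_end() calls the counter is 3, so the reader fails no matter how
many attempts it is given, which in this model corresponds to the endless
retry loop in stmmac_get_stats64.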

Petr T


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-19 19:44                         ` Petr Tesařík
@ 2024-02-20 14:59                           ` Jakub Kicinski
  2024-02-23 20:38                             ` Christian Stewart
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2024-02-20 14:59 UTC (permalink / raw)
  To: Petr Tesařík, Christian Stewart
  Cc: Marc Haber, Florian Fainelli, Andrew Lunn, alexandre.torgue,
	Jose Abreu, Chen-Yu Tsai, Jernej Skrabec, Samuel Holland,
	Jisheng Zhang, netdev

On Mon, 19 Feb 2024 20:44:21 +0100 Petr Tesařík wrote:
> If you're running a 6.7 stable kernel, my patch has just been added to
> the 6.7-stable tree.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-6.7/net-stmmac-protect-updates-of-64-bit-statistics-counters.patch
> 
> However, lockdep has reported an issue with it:
> 
> https://lore.kernel.org/lkml/ea1567d9-ce66-45e6-8168-ac40a47d1821@roeck-us.net/
> 
> This new report has not yet been properly understood, but FWIW I've
> been running stable with my patch for over a month now.

Christian got an actual soft lockup, not just a lockdep warning, tho.
Christian, could you run the stack trace thru scripts/decode_stacktrace
and tell us which loop it's stuck on?


* Re: stmmac on Banana PI CPU stalls since Linux 6.6
  2024-02-20 14:59                           ` Jakub Kicinski
@ 2024-02-23 20:38                             ` Christian Stewart
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Stewart @ 2024-02-23 20:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Petr Tesařík, Marc Haber, Florian Fainelli,
	Andrew Lunn, alexandre.torgue, Jose Abreu, Chen-Yu Tsai,
	Jernej Skrabec, Samuel Holland, Jisheng Zhang, netdev

On Tue, Feb 20, 2024 at 6:59 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 19 Feb 2024 20:44:21 +0100 Petr Tesařík wrote:
> > This new report has not yet been properly understood, but FWIW I've
> > been running stable with my patch for over a month now.
>
> Christian got an actual soft lockup, not just a lockdep warning, tho.
> Christian, could you run the stack trace thru scripts/decode_stacktrace
> and tell us which loop it's stuck on?

This was a crash report from a user and unfortunately I don't have the
kernel sources & build artifacts from that device to be able to run
decode_stacktrace. If it happens again I will request the user send me
their kernel build tree & will report back with the decoded
stacktrace.

Thanks!
Christian Stewart


Thread overview: 18+ messages
2024-01-21 20:17 stmmac on Banana PI CPU stalls since Linux 6.6 Marc Haber
2024-01-21 21:52 ` Andrew Lunn
2024-01-22 21:34   ` Andrey Jr. Melnikov
2024-01-25 18:01   ` Marc Haber
2024-01-25 19:54     ` Andrew Lunn
2024-01-25 20:00       ` Florian Fainelli
2024-01-26  7:51         ` Petr Tesařík
2024-01-26 10:54           ` Marc Haber
2024-01-26 11:10             ` Petr Tesařík
2024-02-05 20:12               ` Marc Haber
2024-02-05 21:50                 ` Florian Fainelli
2024-02-06  8:23                   ` Petr Tesařík
2024-02-12 12:15                     ` Marc Haber
2024-02-19 19:20                       ` Christian Stewart
2024-02-19 19:44                         ` Petr Tesařík
2024-02-20 14:59                           ` Jakub Kicinski
2024-02-23 20:38                             ` Christian Stewart
2024-01-26 10:48     ` Marc Haber
