* [BUG] irqchip: armada-370-xp: workqueue lockup
@ 2021-09-21  8:40 Steffen Trumtrar
  2021-09-21 15:18 ` Marc Zyngier
  2021-09-22 13:27 ` [irqchip: irq/irqchip-fixes] irqchip/armada-370-xp: Fix ack/eoi breakage irqchip-bot for Marc Zyngier
  0 siblings, 2 replies; 6+ messages in thread
From: Steffen Trumtrar @ 2021-09-21  8:40 UTC (permalink / raw)
  To: Valentin Schneider, Marc Zyngier
  Cc: Andrew Lunn, Gregory Clement, Sebastian Hesselbarth, linux-arm-kernel


Hi,

I noticed that after the patch

        e52e73b7e9f7d08b8c2ef6fb1657105093e22a03
        From: Valentin Schneider <valentin.schneider@arm.com>
        Date: Mon, 9 Nov 2020 09:41:18 +0000
        Subject: [PATCH] irqchip/armada-370-xp: Make IPIs use
        handle_percpu_devid_irq()

        As done for the Arm GIC irqchips, move IPIs to handle_percpu_devid_irq() as
        handle_percpu_devid_fasteoi_ipi() isn't actually required.

        Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
        Signed-off-by: Marc Zyngier <maz@kernel.org>
        Link: https://lore.kernel.org/r/20201109094121.29975-3-valentin.schneider@arm.com
        ---
        drivers/irqchip/irq-armada-370-xp.c | 2 +-
        1 file changed, 1 insertion(+), 1 deletion(-)

        diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
        index d7eb2e93db8f..32938dfc0e46 100644
        --- a/drivers/irqchip/irq-armada-370-xp.c
        +++ b/drivers/irqchip/irq-armada-370-xp.c
        @@ -382,7 +382,7 @@ static int armada_370_xp_ipi_alloc(struct irq_domain *d,
                        irq_set_percpu_devid(virq + i);
                        irq_domain_set_info(d, virq + i, i, &ipi_irqchip,
                                        d->host_data,
        -                                   handle_percpu_devid_fasteoi_ipi,
        +                                   handle_percpu_devid_irq,
                                        NULL, NULL);
                }

I get workqueue lockups on my Armada-XP based board.
When I run the following test on v5.15-rc2

        stress-ng --cpu 8 --io 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 120s

I get a backtrace like this:

        stress-ng: info:  [7740] dispatching hogs: 8 cpu, 4 io, 2 vm, 4 fork
        [ 1670.169087] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        [ 1670.169102] 	(detected by 0, t=5252 jiffies, g=50257, q=3369)
        [ 1670.169112] rcu: All QSes seen, last rcu_preempt kthread activity 5252 (342543-337291), jiffies_till_next_fqs=1, root ->qsmask 0x0
        [ 1670.169121] rcu: rcu_preempt kthread timer wakeup didn't happen for 5251 jiffies! g50257 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
        [ 1670.169128] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=20398
        [ 1670.169132] rcu: rcu_preempt kthread starved for 5252 jiffies! g50257 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=1
        [ 1670.169140] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
        [ 1670.169143] rcu: RCU grace-period kthread stack dump:
        [ 1670.169146] task:rcu_preempt     state:R stack:    0 pid:   13 ppid:     2 flags:0x00000000
        [ 1670.169157] Backtrace:
        [ 1670.169163] [<c0a19c20>] (__schedule) from [<c0a1a458>] (schedule+0x64/0x110)
        [ 1670.169185]  r10:00000001 r9:c190e000 r8:c137b690 r7:c137b69c r6:c190fed4 r5:c190e000
        [ 1670.169189]  r4:c197c880
        [ 1670.169192] [<c0a1a3f4>] (schedule) from [<c0a20048>] (schedule_timeout+0xa8/0x1c0)
        [ 1670.169206]  r5:c1303d00 r4:0005258c
        [ 1670.169209] [<c0a1ffa0>] (schedule_timeout) from [<c01a1664>] (rcu_gp_fqs_loop+0x120/0x3ac)
        [ 1670.169227]  r7:c137b69c r6:c1303d00 r5:c137b4c0 r4:00000000
        [ 1670.169230] [<c01a1544>] (rcu_gp_fqs_loop) from [<c01a3dac>] (rcu_gp_kthread+0xfc/0x1b0)
        [ 1670.169247]  r10:c190ff5c r9:c1303d00 r8:c137b4c0 r7:c190e000 r6:c137b69e r5:c137b690
        [ 1670.169251]  r4:c137b69c
        [ 1670.169253] [<c01a3cb0>] (rcu_gp_kthread) from [<c0153b14>] (kthread+0x16c/0x1a0)
        [ 1670.169268]  r7:00000000
        [ 1670.169271] [<c01539a8>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
        [ 1670.169282] Exception stack(0xc190ffb0 to 0xc190fff8)
        [ 1670.169288] ffa0:                                     ???????? ???????? ???????? ????????
        [ 1670.169293] ffc0: ???????? ???????? ???????? ???????? ???????? ???????? ???????? ????????
        [ 1670.169297] ffe0: ???????? ???????? ???????? ???????? ???????? ????????
        [ 1670.169305]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c01539a8
        [ 1670.169310]  r4:c19320c0 r3:00000000
        [ 1670.169313] rcu: Stack dump where RCU GP kthread last ran:
        [ 1670.169316] Sending NMI from CPU 0 to CPUs 1:
        [ 1670.169327] NMI backtrace for cpu 1
        [ 1670.169335] CPU: 1 PID: 7764 Comm: stress-ng-cpu Tainted: G        W         5.15.0-rc2+ #5
        [ 1670.169343] Hardware name: Marvell Armada 370/XP (Device Tree)
        [ 1670.169346] PC is at 0x4bde7a
        [ 1670.169354] LR is at 0x4bdf21
        [ 1670.169359] pc : [<004bde7a>]    lr : [<004bdf21>]    psr: 20030030
        [ 1670.169363] sp : beb8270c  ip : 00004650  fp : beb8289c
        [ 1670.169367] r10: 00e5e800  r9 : 00514760  r8 : 0000036b
        [ 1670.169371] r7 : beb828a8  r6 : 000001f7  r5 : 000001fd  r4 : 000bacd7
        [ 1670.169375] r3 : 004bde30  r2 : 0000000b  r1 : 000001fd  r0 : 0001bbd7
        [ 1670.169380] Flags: nzCv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
        [ 1670.169386] Control: 10c5387d  Table: 0334806a  DAC: 00000055
        [ 1670.169389] CPU: 1 PID: 7764 Comm: stress-ng-cpu Tainted: G        W         5.15.0-rc2+ #5
        [ 1670.169395] Hardware name: Marvell Armada 370/XP (Device Tree)
        [ 1670.169398] Backtrace:
        [ 1670.169402] [<c0a0b758>] (dump_backtrace) from [<c0a0b9a4>] (show_stack+0x20/0x24)
        [ 1670.169418]  r7:c18db400 r6:c7875fb0 r5:60030193 r4:c1099c7c
        [ 1670.169421] [<c0a0b984>] (show_stack) from [<c0a11988>] (dump_stack_lvl+0x48/0x54)
        [ 1670.169433] [<c0a11940>] (dump_stack_lvl) from [<c0a119ac>] (dump_stack+0x18/0x1c)
        [ 1670.169445]  r5:00000001 r4:20030193
        [ 1670.169447] [<c0a11994>] (dump_stack) from [<c0109984>] (show_regs+0x1c/0x20)
        [ 1670.169461] [<c0109968>] (show_regs) from [<c05f6af8>] (nmi_cpu_backtrace+0xc0/0x10c)
        [ 1670.169474] [<c05f6a38>] (nmi_cpu_backtrace) from [<c010ffa4>] (do_handle_IPI+0x54/0x3b8)
        [ 1670.169489]  r7:c18db400 r6:00000017 r5:00000001 r4:00000007
        [ 1670.169491] [<c010ff50>] (do_handle_IPI) from [<c0110330>] (ipi_handler+0x28/0x30)
        [ 1670.169505]  r10:c7875f58 r9:c7875fb0 r8:c7875f30 r7:c18db400 r6:00000017 r5:c13ecadc
        [ 1670.169509]  r4:c18d9300 r3:00000010
        [ 1670.169511] [<c0110308>] (ipi_handler) from [<c0193200>] (handle_percpu_devid_irq+0xb4/0x288)
        [ 1670.169525] [<c019314c>] (handle_percpu_devid_irq) from [<c018c4b4>] (handle_domain_irq+0x8c/0xc0)
        [ 1670.169539]  r9:c7875fb0 r8:00000007 r7:00000000 r6:c1863d80 r5:00000000 r4:c12781e0
        [ 1670.169542] [<c018c428>] (handle_domain_irq) from [<c01012cc>] (armada_370_xp_handle_irq+0xdc/0x124)
        [ 1670.169556]  r10:00e5e800 r9:00514760 r8:10c5387d r7:c147d604 r6:c7875fb0 r5:000003fe
        [ 1670.169560]  r4:00000007 r3:00000007
        [ 1670.169562] [<c01011f0>] (armada_370_xp_handle_irq) from [<c0100e58>] (__irq_usr+0x58/0x80)
        [ 1670.169571] Exception stack(0xc7875fb0 to 0xc7875ff8)
        [ 1670.169576] 5fa0:                                     ???????? ???????? ???????? ????????
        [ 1670.169580] 5fc0: ???????? ???????? ???????? ???????? ???????? ???????? ???????? ????????
        [ 1670.169584] 5fe0: ???????? ???????? ???????? ???????? ???????? ????????
        [ 1670.169590]  r7:10c5387d r6:ffffffff r5:20030030 r4:004bde7a
        [ 1690.589098] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 38s!
        [ 1690.589133] Showing busy workqueues and worker pools:
        [ 1690.589138] workqueue events_unbound: flags=0x2
        [ 1690.589142]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=3/512 refcnt=5
        [ 1690.589157]     in-flight: 7:call_usermodehelper_exec_work
        [ 1690.589177]     pending: flush_memcg_stats_work, flush_memcg_stats_dwork
        [ 1690.589198] workqueue events_power_efficient: flags=0x80
        [ 1690.589203]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=5/256 refcnt=6
        [ 1690.589218]     in-flight: 53:fb_flashcursor fb_flashcursor
        [ 1690.589236]     pending: neigh_periodic_work, neigh_periodic_work, do_cache_clean
        [ 1690.589265] workqueue mm_percpu_wq: flags=0x8
        [ 1690.589269]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
        [ 1690.589284]     pending: vmstat_update
        [ 1690.589301] workqueue edac-poller: flags=0xa000a
        [ 1690.589305]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/1 refcnt=4
        [ 1690.589318]     pending: edac_mc_workq_function
        [ 1690.589331]     inactive: edac_device_workq_function
        [ 1690.589346] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=38s workers=3 idle: 7621 6478
        [ 1690.589370] pool 4: cpus=0-1 flags=0x4 nice=0 hung=41s workers=3 idle: 6967 5672
        [ 1721.313097] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 69s!
        [ 1721.313136] BUG: workqueue lockup - pool cpus=0-1 flags=0x4 nice=0 stuck for 72s!
        [ 1721.313149] Showing busy workqueues and worker pools:
        [ 1721.313154] workqueue events_unbound: flags=0x2
        [ 1721.313158]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=3/512 refcnt=5
        [ 1721.313173]     in-flight: 7:call_usermodehelper_exec_work
        [ 1721.313193]     pending: flush_memcg_stats_work, flush_memcg_stats_dwork
        [ 1721.313213] workqueue events_power_efficient: flags=0x80
        [ 1721.313218]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=5/256 refcnt=6
        [ 1721.313234]     in-flight: 53:fb_flashcursor fb_flashcursor
        [ 1721.313251]     pending: neigh_periodic_work, neigh_periodic_work, do_cache_clean
        [ 1721.313282] workqueue mm_percpu_wq: flags=0x8
        [ 1721.313285]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
        [ 1721.313301]     pending: vmstat_update
        [ 1721.313319] workqueue edac-poller: flags=0xa000a
        [ 1721.313323]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/1 refcnt=4
        [ 1721.313336]     pending: edac_mc_workq_function
        [ 1721.313349]     inactive: edac_device_workq_function
        [ 1721.313366] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=69s workers=3 idle: 7621 6478
        [ 1721.313390] pool 4: cpus=0-1 flags=0x4 nice=0 hung=72s workers=3 idle: 6967 5672
        [ 1733.189086] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        [ 1733.189101] 	(detected by 0, t=21007 jiffies, g=50257, q=13112)
        [ 1733.189111] rcu: All QSes seen, last rcu_preempt kthread activity 21007 (358298-337291), jiffies_till_next_fqs=1, root ->qsmask 0x0
        [ 1733.189119] rcu: rcu_preempt kthread timer wakeup didn't happen for 21006 jiffies! g50257 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
        [ 1733.189126] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=20834
        [ 1733.189131] rcu: rcu_preempt kthread starved for 21007 jiffies! g50257 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=1
        [ 1733.189138] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
        [ 1733.189141] rcu: RCU grace-period kthread stack dump:
        [ 1733.189144] task:rcu_preempt     state:R stack:    0 pid:   13 ppid:     2 flags:0x00000000
        [ 1733.189156] Backtrace:
        [ 1733.189162] [<c0a19c20>] (__schedule) from [<c0a1a458>] (schedule+0x64/0x110)
        [ 1733.189184]  r10:00000001 r9:c190e000 r8:c137b690 r7:c137b69c r6:c190fed4 r5:c190e000
        [ 1733.189188]  r4:c197c880
        [ 1733.189191] [<c0a1a3f4>] (schedule) from [<c0a20048>] (schedule_timeout+0xa8/0x1c0)
        [ 1733.189205]  r5:c1303d00 r4:0005258c
        [ 1733.189208] [<c0a1ffa0>] (schedule_timeout) from [<c01a1664>] (rcu_gp_fqs_loop+0x120/0x3ac)
        [ 1733.189226]  r7:c137b69c r6:c1303d00 r5:c137b4c0 r4:00000000
        [ 1733.189229] [<c01a1544>] (rcu_gp_fqs_loop) from [<c01a3dac>] (rcu_gp_kthread+0xfc/0x1b0)
        [ 1733.189246]  r10:c190ff5c r9:c1303d00 r8:c137b4c0 r7:c190e000 r6:c137b69e r5:c137b690
        [ 1733.189249]  r4:c137b69c
        [ 1733.189252] [<c01a3cb0>] (rcu_gp_kthread) from [<c0153b14>] (kthread+0x16c/0x1a0)
        [ 1733.189267]  r7:00000000
        [ 1733.189270] [<c01539a8>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
        [ 1733.189281] Exception stack(0xc190ffb0 to 0xc190fff8)
        [ 1733.189287] ffa0:                                     ???????? ???????? ???????? ????????
        [ 1733.189292] ffc0: ???????? ???????? ???????? ???????? ???????? ???????? ???????? ????????
        [ 1733.189297] ffe0: ???????? ???????? ???????? ???????? ???????? ????????
        [ 1733.189304]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c01539a8
        [ 1733.189309]  r4:c19320c0 r3:00000000
        [ 1733.189312] rcu: Stack dump where RCU GP kthread last ran:
        [ 1733.189315] Sending NMI from CPU 0 to CPUs 1:
        [ 1733.189327] NMI backtrace for cpu 1
        [ 1733.189335] CPU: 1 PID: 7755 Comm: stress-ng-cpu Tainted: G        W         5.15.0-rc2+ #5
        [ 1733.189343] Hardware name: Marvell Armada 370/XP (Device Tree)
        [ 1733.189346] PC is at 0x4bdee0
        [ 1733.189354] LR is at 0x4bdf21
        [ 1733.189358] pc : [<004bdee0>]    lr : [<004bdf21>]    psr: 20030030
        [ 1733.189363] sp : beb8270c  ip : 00004650  fp : beb8289c
        [ 1733.189367] r10: 00e5e800  r9 : 00514760  r8 : 00000358
        [ 1733.189370] r7 : beb828a8  r6 : 00000047  r5 : 0000004d  r4 : 000b2ab7
        [ 1733.189375] r3 : 004bde10  r2 : 00001217  r1 : 0000004f  r0 : 00000085
        [ 1733.189379] Flags: nzCv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
        [ 1733.189385] Control: 10c5387d  Table: 0734006a  DAC: 00000055
        [ 1733.189389] CPU: 1 PID: 7755 Comm: stress-ng-cpu Tainted: G        W         5.15.0-rc2+ #5
        [ 1733.189395] Hardware name: Marvell Armada 370/XP (Device Tree)
        [ 1733.189397] Backtrace:
        [ 1733.189402] [<c0a0b758>] (dump_backtrace) from [<c0a0b9a4>] (show_stack+0x20/0x24)
        [ 1733.189417]  r7:c18db400 r6:c7375fb0 r5:60030193 r4:c1099c7c
        [ 1733.189420] [<c0a0b984>] (show_stack) from [<c0a11988>] (dump_stack_lvl+0x48/0x54)
        [ 1733.189432] [<c0a11940>] (dump_stack_lvl) from [<c0a119ac>] (dump_stack+0x18/0x1c)
        [ 1733.189444]  r5:00000001 r4:20030193
        [ 1733.189446] [<c0a11994>] (dump_stack) from [<c0109984>] (show_regs+0x1c/0x20)
        [ 1733.189460] [<c0109968>] (show_regs) from [<c05f6af8>] (nmi_cpu_backtrace+0xc0/0x10c)
        [ 1733.189473] [<c05f6a38>] (nmi_cpu_backtrace) from [<c010ffa4>] (do_handle_IPI+0x54/0x3b8)
        [ 1733.189488]  r7:c18db400 r6:00000017 r5:00000001 r4:00000007
        [ 1733.189490] [<c010ff50>] (do_handle_IPI) from [<c0110330>] (ipi_handler+0x28/0x30)
        [ 1733.189504]  r10:c7375f58 r9:c7375fb0 r8:c7375f30 r7:c18db400 r6:00000017 r5:c13ecadc
        [ 1733.189508]  r4:c18d9300 r3:00000010
        [ 1733.189510] [<c0110308>] (ipi_handler) from [<c0193200>] (handle_percpu_devid_irq+0xb4/0x288)
        [ 1733.189523] [<c019314c>] (handle_percpu_devid_irq) from [<c018c4b4>] (handle_domain_irq+0x8c/0xc0)
        [ 1733.189538]  r9:c7375fb0 r8:00000007 r7:00000000 r6:c1863d80 r5:00000000 r4:c12781e0
        [ 1733.189540] [<c018c428>] (handle_domain_irq) from [<c01012cc>] (armada_370_xp_handle_irq+0xdc/0x124)
        [ 1733.189555]  r10:00e5e800 r9:00514760 r8:10c5387d r7:c147d604 r6:c7375fb0 r5:000003fe
        [ 1733.189559]  r4:00000007 r3:00000007
        [ 1733.189561] [<c01011f0>] (armada_370_xp_handle_irq) from [<c0100e58>] (__irq_usr+0x58/0x80)
        [ 1733.189570] Exception stack(0xc7375fb0 to 0xc7375ff8)
        [ 1733.189575] 5fa0:                                     ???????? ???????? ???????? ????????
        [ 1733.189579] 5fc0: ???????? ???????? ???????? ???????? ???????? ???????? ???????? ????????
        [ 1733.189583] 5fe0: ???????? ???????? ???????? ???????? ???????? ????????
        [ 1733.189589]  r7:10c5387d r6:ffffffff r5:20030030 r4:004bdee0
        [ 1752.029102] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 100s!
        [ 1752.029137] Showing busy workqueues and worker pools:
        [ 1752.029141] workqueue events_unbound: flags=0x2
        [ 1752.029146]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=3/512 refcnt=5
        [ 1752.029161]     in-flight: 7:call_usermodehelper_exec_work
        [ 1752.029180]     pending: flush_memcg_stats_work, flush_memcg_stats_dwork
        [ 1752.029200] workqueue events_power_efficient: flags=0x80
        [ 1752.029205]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=5/256 refcnt=6
        [ 1752.029221]     in-flight: 53:fb_flashcursor fb_flashcursor
        [ 1752.029239]     pending: neigh_periodic_work, neigh_periodic_work, do_cache_clean
        [ 1752.029269] workqueue mm_percpu_wq: flags=0x8
        [ 1752.029272]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
        [ 1752.029288]     pending: vmstat_update
        [ 1752.029306] workqueue edac-poller: flags=0xa000a
        [ 1752.029310]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/1 refcnt=4
        [ 1752.029323]     pending: edac_mc_workq_function
        [ 1752.029337]     inactive: edac_device_workq_function
        [ 1752.029353] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=100s workers=3 idle: 7621 6478
        [ 1752.029378] pool 4: cpus=0-1 flags=0x4 nice=0 hung=102s workers=3 idle: 6967 5672
        stress-ng: info:  [7740] successful run completed in 125.31s (2 mins, 5.31 secs)

Earlier kernels (e.g. v5.13.9) completely froze the machine, resulting in
the watchdog triggering and rebooting it. So, $something was
already fixed here.

Bisecting leads to the mentioned commit; reverting it results
in a BUG-less run of the stress-ng test.
Any idea what might cause this and how to fix it?


Best regards,
Steffen Trumtrar

--
Pengutronix e.K.                | Dipl.-Inform. Steffen Trumtrar |
Steuerwalder Str. 21            | https://www.pengutronix.de/    |
31137 Hildesheim, Germany       | Phone: +49-5121-206917-0       |
Amtsgericht Hildesheim, HRA 2686| Fax:   +49-5121-206917-5555    |

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: [BUG] irqchip: armada-370-xp: workqueue lockup
  2021-09-21  8:40 [BUG] irqchip: armada-370-xp: workqueue lockup Steffen Trumtrar
@ 2021-09-21 15:18 ` Marc Zyngier
  2021-09-22  6:49   ` Steffen Trumtrar
  2021-09-22 13:27 ` [irqchip: irq/irqchip-fixes] irqchip/armada-370-xp: Fix ack/eoi breakage irqchip-bot for Marc Zyngier
  1 sibling, 1 reply; 6+ messages in thread
From: Marc Zyngier @ 2021-09-21 15:18 UTC (permalink / raw)
  To: Steffen Trumtrar
  Cc: Valentin Schneider, Andrew Lunn, Gregory Clement,
	Sebastian Hesselbarth, linux-arm-kernel

Hi Steffen,

On Tue, 21 Sep 2021 09:40:59 +0100,
Steffen Trumtrar <s.trumtrar@pengutronix.de> wrote:
> 
> 
> Hi,
> 
> I noticed that after the patch
> 
>         e52e73b7e9f7d08b8c2ef6fb1657105093e22a03
>         From: Valentin Schneider <valentin.schneider@arm.com>
>         Date: Mon, 9 Nov 2020 09:41:18 +0000
>         Subject: [PATCH] irqchip/armada-370-xp: Make IPIs use
>         handle_percpu_devid_irq()
> 
>         As done for the Arm GIC irqchips, move IPIs to handle_percpu_devid_irq() as
>         handle_percpu_devid_fasteoi_ipi() isn't actually required.
> 
>         Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>         Signed-off-by: Marc Zyngier <maz@kernel.org>
>         Link: https://lore.kernel.org/r/20201109094121.29975-3-valentin.schneider@arm.com
>         ---
>         drivers/irqchip/irq-armada-370-xp.c | 2 +-
>         1 file changed, 1 insertion(+), 1 deletion(-)
> 
>         diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
>         index d7eb2e93db8f..32938dfc0e46 100644
>         --- a/drivers/irqchip/irq-armada-370-xp.c
>         +++ b/drivers/irqchip/irq-armada-370-xp.c
>         @@ -382,7 +382,7 @@ static int armada_370_xp_ipi_alloc(struct irq_domain *d,
>                         irq_set_percpu_devid(virq + i);
>                         irq_domain_set_info(d, virq + i, i, &ipi_irqchip,
>                                         d->host_data,
>         -                                   handle_percpu_devid_fasteoi_ipi,
>         +                                   handle_percpu_devid_irq,
>                                         NULL, NULL);
>                 }
> 
> I get workqueue lockups on my Armada-XP based board.
> When I run the following test on v5.15-rc2
> 
>         stress-ng --cpu 8 --io 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 120s
>
> I get a backtrace like this:

[...]

> Earlier kernels (i.e v5.13.9) completely froze the machine resulting in
> the watchdog triggering and rebooting the machine. So, $something was
> already fixed here.

Fixed? Or broken? More likely the latter.

> Bisecting leads to the mentioned commit, reverting of the commit results
> in a BUG-less run of the stress-ng test.
> Any idea what might cause this and how to fix it?

It isn't obvious to me how reverting this patch fixes anything. The
fasteoi flow does the same thing as far as the IPI driver is concerned.

However, it appears that I have broken that part much earlier in
f02147dd02eb ("irqchip/armada-370-xp: Configure IPIs as standard
interrupts"), as the write to ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS that
used to occur before the handling (an ACK) has now been moved after as
an EOI. That's a pretty good way to lose edge interrupts.

Could you try the following patch on top of v5.15-rc2?

Thanks,

	M.

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index 7557ab551295..53e0fb0562c1 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -359,16 +359,16 @@ static void armada_370_xp_ipi_send_mask(struct irq_data *d,
 		ARMADA_370_XP_SW_TRIG_INT_OFFS);
 }
 
-static void armada_370_xp_ipi_eoi(struct irq_data *d)
+static void armada_370_xp_ipi_ack(struct irq_data *d)
 {
 	writel(~BIT(d->hwirq), per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS);
 }
 
 static struct irq_chip ipi_irqchip = {
 	.name		= "IPI",
+	.irq_ack	= armada_370_xp_ipi_ack,
 	.irq_mask	= armada_370_xp_ipi_mask,
 	.irq_unmask	= armada_370_xp_ipi_unmask,
-	.irq_eoi	= armada_370_xp_ipi_eoi,
 	.ipi_send_mask	= armada_370_xp_ipi_send_mask,
 };
 

-- 
Without deviation from the norm, progress is not possible.



* Re: [BUG] irqchip: armada-370-xp: workqueue lockup
  2021-09-21 15:18 ` Marc Zyngier
@ 2021-09-22  6:49   ` Steffen Trumtrar
  2021-09-22  8:12     ` Marc Zyngier
  0 siblings, 1 reply; 6+ messages in thread
From: Steffen Trumtrar @ 2021-09-22  6:49 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Steffen Trumtrar, Valentin Schneider, Andrew Lunn,
	Gregory Clement, Sebastian Hesselbarth, linux-arm-kernel


Hi,

Marc Zyngier <maz@kernel.org> writes:
> It isn't obvious to me how reverting this patch fixes anything.  The
> fasteoi flow does the same thing as far as the IPI driver is concerned
>

Didn't the fasteoi flow just call the irq_eoi earlier? The same as the
irq_ack does now?

>
> However, it appears that I have broken that part much earlier in
> f02147dd02eb ("irqchip/armada-370-xp: Configure IPIs as standard
> interrupts"), as the write to ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS that
> used to occur before the handling (an ACK) has now been moved after as
> an EOI. That's a pretty good way to lose edge interrupts.
>
> Could you try the following patch on top of 5.12-rc2?
>
> Thanks,
>
> 	M.
>
> diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
> index 7557ab551295..53e0fb0562c1 100644
> --- a/drivers/irqchip/irq-armada-370-xp.c
> +++ b/drivers/irqchip/irq-armada-370-xp.c
> @@ -359,16 +359,16 @@ static void armada_370_xp_ipi_send_mask(struct irq_data *d,
>  		ARMADA_370_XP_SW_TRIG_INT_OFFS);
>  }
>
> -static void armada_370_xp_ipi_eoi(struct irq_data *d)
> +static void armada_370_xp_ipi_ack(struct irq_data *d)
>  {
>  	writel(~BIT(d->hwirq), per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS);
>  }
>
>  static struct irq_chip ipi_irqchip = {
>  	.name		= "IPI",
> +	.irq_ack	= armada_370_xp_ipi_ack,
>  	.irq_mask	= armada_370_xp_ipi_mask,
>  	.irq_unmask	= armada_370_xp_ipi_unmask,
> -	.irq_eoi	= armada_370_xp_ipi_eoi,
>  	.ipi_send_mask	= armada_370_xp_ipi_send_mask,
>  };

This fixes it, yes \o/


Best regards,
Steffen




* Re: [BUG] irqchip: armada-370-xp: workqueue lockup
  2021-09-22  6:49   ` Steffen Trumtrar
@ 2021-09-22  8:12     ` Marc Zyngier
  2021-09-22  8:24       ` Steffen Trumtrar
  0 siblings, 1 reply; 6+ messages in thread
From: Marc Zyngier @ 2021-09-22  8:12 UTC (permalink / raw)
  To: Steffen Trumtrar
  Cc: Valentin Schneider, Andrew Lunn, Gregory Clement,
	Sebastian Hesselbarth, linux-arm-kernel

On Wed, 22 Sep 2021 07:49:05 +0100,
Steffen Trumtrar <s.trumtrar@pengutronix.de> wrote:
> 
> 
> Hi,
> 
> Marc Zyngier <maz@kernel.org> writes:
> > It isn't obvious to me how reverting this patch fixes anything.  The
> > fasteoi flow does the same thing as far as the IPI driver is concerned
> >
> 
> didn't the fasteoi flow just call the irq_eoi earlier? Same as the
> irq_ack now?

Yes, of course, you are correct. Another proof that the whole initial
fasteoi flow that used EOI as an ACK was *a bad idea* (tm).

> 
> >
> > However, it appears that I have broken that part much earlier in
> > f02147dd02eb ("irqchip/armada-370-xp: Configure IPIs as standard
> > interrupts"), as the write to ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS that
> > used to occur before the handling (an ACK) has now been moved after as
> > an EOI. That's a pretty good way to lose edge interrupts.
> >
> > Could you try the following patch on top of 5.12-rc2?
> >
> > Thanks,
> >
> > 	M.
> >
> > diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
> > index 7557ab551295..53e0fb0562c1 100644
> > --- a/drivers/irqchip/irq-armada-370-xp.c
> > +++ b/drivers/irqchip/irq-armada-370-xp.c
> > @@ -359,16 +359,16 @@ static void armada_370_xp_ipi_send_mask(struct irq_data *d,
> >  		ARMADA_370_XP_SW_TRIG_INT_OFFS);
> >  }
> >
> > -static void armada_370_xp_ipi_eoi(struct irq_data *d)
> > +static void armada_370_xp_ipi_ack(struct irq_data *d)
> >  {
> >  	writel(~BIT(d->hwirq), per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS);
> >  }
> >
> >  static struct irq_chip ipi_irqchip = {
> >  	.name		= "IPI",
> > +	.irq_ack	= armada_370_xp_ipi_ack,
> >  	.irq_mask	= armada_370_xp_ipi_mask,
> >  	.irq_unmask	= armada_370_xp_ipi_unmask,
> > -	.irq_eoi	= armada_370_xp_ipi_eoi,
> >  	.ipi_send_mask	= armada_370_xp_ipi_send_mask,
> >  };
> 
> This fixes it, yes \o/

Thanks. Can I use this as a Tested-by: tag in the official patch?

	M.




* Re: [BUG] irqchip: armada-370-xp: workqueue lockup
  2021-09-22  8:12     ` Marc Zyngier
@ 2021-09-22  8:24       ` Steffen Trumtrar
  0 siblings, 0 replies; 6+ messages in thread
From: Steffen Trumtrar @ 2021-09-22  8:24 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Steffen Trumtrar, Valentin Schneider, Andrew Lunn,
	Gregory Clement, linux-arm-kernel


Marc Zyngier <maz@kernel.org> writes:

> On Wed, 22 Sep 2021 07:49:05 +0100,
> Steffen Trumtrar <s.trumtrar@pengutronix.de> wrote:
>>
>>
>> Hi,
>>
>> Marc Zyngier <maz@kernel.org> writes:
>> > It isn't obvious to me how reverting this patch fixes anything.  The
>> > fasteoi flow does the same thing as far as the IPI driver is concerned
>> >
>>
>> didn't the fasteoi flow just call the irq_eoi earlier? Same as the
>> irq_ack now?
>
> Yes, of course, you are correct. Another proof that the whole initial
> fasteoi flow that used EOI as an ACK was *a bad idea* (tm).
>
>>
>> >
>> > However, it appears that I have broken that part much earlier in
>> > f02147dd02eb ("irqchip/armada-370-xp: Configure IPIs as standard
>> > interrupts"), as the write to ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS that
>> > used to occur before the handling (an ACK) has now been moved after as
>> > an EOI. That's a pretty good way to lose edge interrupts.
>> >
>> > Could you try the following patch on top of 5.12-rc2?
>> >
>> > Thanks,
>> >
>> > 	M.
>> >
>> > diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
>> > index 7557ab551295..53e0fb0562c1 100644
>> > --- a/drivers/irqchip/irq-armada-370-xp.c
>> > +++ b/drivers/irqchip/irq-armada-370-xp.c
>> > @@ -359,16 +359,16 @@ static void armada_370_xp_ipi_send_mask(struct irq_data *d,
>> >  		ARMADA_370_XP_SW_TRIG_INT_OFFS);
>> >  }
>> >
>> > -static void armada_370_xp_ipi_eoi(struct irq_data *d)
>> > +static void armada_370_xp_ipi_ack(struct irq_data *d)
>> >  {
>> >  	writel(~BIT(d->hwirq), per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS);
>> >  }
>> >
>> >  static struct irq_chip ipi_irqchip = {
>> >  	.name		= "IPI",
>> > +	.irq_ack	= armada_370_xp_ipi_ack,
>> >  	.irq_mask	= armada_370_xp_ipi_mask,
>> >  	.irq_unmask	= armada_370_xp_ipi_unmask,
>> > -	.irq_eoi	= armada_370_xp_ipi_eoi,
>> >  	.ipi_send_mask	= armada_370_xp_ipi_send_mask,
>> >  };
>>
>> This fixes it, yes \o/
>
> Thanks. Can I use this as a Tested-by: tag in the official patch?
>

Yes, of course. Go ahead.


Thanks,
Steffen




* [irqchip: irq/irqchip-fixes] irqchip/armada-370-xp: Fix ack/eoi breakage
  2021-09-21  8:40 [BUG] irqchip: armada-370-xp: workqueue lockup Steffen Trumtrar
  2021-09-21 15:18 ` Marc Zyngier
@ 2021-09-22 13:27 ` irqchip-bot for Marc Zyngier
  1 sibling, 0 replies; 6+ messages in thread
From: irqchip-bot for Marc Zyngier @ 2021-09-22 13:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steffen Trumtrar, Marc Zyngier, Valentin Schneider, stable, tglx

The following commit has been merged into the irq/irqchip-fixes branch of irqchip:

Commit-ID:     2a7313dc81e88adc7bb09d0f056985fa8afc2b89
Gitweb:        https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms/2a7313dc81e88adc7bb09d0f056985fa8afc2b89
Author:        Marc Zyngier <maz@kernel.org>
AuthorDate:    Wed, 22 Sep 2021 14:19:41 +01:00
Committer:     Marc Zyngier <maz@kernel.org>
CommitterDate: Wed, 22 Sep 2021 14:24:49 +01:00

irqchip/armada-370-xp: Fix ack/eoi breakage

When converting the driver to use handle_percpu_devid_irq(),
we forgot to repaint the irq_eoi() callback into irq_ack(),
as handle_percpu_devid_fasteoi_ipi() actually issued the EOI
really early in the handling. Yes, this was a stupid idea.

Fix this by using the HW ack method as irq_ack().

Fixes: e52e73b7e9f7 ("irqchip/armada-370-xp: Make IPIs use handle_percpu_devid_irq()")
Reported-by: Steffen Trumtrar <s.trumtrar@pengutronix.de>
Tested-by: Steffen Trumtrar <s.trumtrar@pengutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuiexq5f.fsf@pengutronix.de
---
 drivers/irqchip/irq-armada-370-xp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index 7557ab5..53e0fb0 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -359,16 +359,16 @@ static void armada_370_xp_ipi_send_mask(struct irq_data *d,
 		ARMADA_370_XP_SW_TRIG_INT_OFFS);
 }
 
-static void armada_370_xp_ipi_eoi(struct irq_data *d)
+static void armada_370_xp_ipi_ack(struct irq_data *d)
 {
 	writel(~BIT(d->hwirq), per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS);
 }
 
 static struct irq_chip ipi_irqchip = {
 	.name		= "IPI",
+	.irq_ack	= armada_370_xp_ipi_ack,
 	.irq_mask	= armada_370_xp_ipi_mask,
 	.irq_unmask	= armada_370_xp_ipi_unmask,
-	.irq_eoi	= armada_370_xp_ipi_eoi,
 	.ipi_send_mask	= armada_370_xp_ipi_send_mask,
 };
 


