All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] riscv: fix potential panic during CPU hot-plug
       [not found] <CGME20231108112923eucas1p27474e921565e3f175e27e6598f0b71c3@eucas1p2.samsung.com>
@ 2023-11-08 11:29   ` Marek Szyprowski
  0 siblings, 0 replies; 8+ messages in thread
From: Marek Szyprowski @ 2023-11-08 11:29 UTC (permalink / raw)
  To: linux-riscv, linux-kernel
  Cc: Marek Szyprowski, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Conor Dooley, Andrew Jones, Heiko Stuebner, Evan Green,
	Jisheng Zhang, Clément Léger

Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
test for unaligned access speed. It is being performed when given CPU is
onlined. Then, another test for misaligned access emulation has been
added in commit 71c54b3d169d ("riscv: report misaligned accesses
emulation to hwprobe"). This unaligned access speed doesn't really change
after the boot, so it is sufficient to do this test only once. This has
been partially added by commit c20d36cc2a20 ("riscv: don't probe
unaligned access speed if already done"), but this optimisation works
only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
check. Otherwise the 'misaligned_access_speed' pcpu varliable is
overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
first check to be always performed.

Recently I've noticed that the first check introduced a regression in the
CPU hot-plug mechanism. This can be observed as a following panic on
QEmu:

CPU1: off
cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
CPU1: off
CPU1: failed to come online
------------[ cut here ]------------
WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
Modules linked in:
CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
Hardware name: riscv-virtio,qemu (DT)
epc : __flush_smp_call_function_queue+0x90/0x292
 ra : smpcfd_dying_cpu+0xe/0x1c
...
[<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
[<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
[<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
[<ffffffff800142f8>] _cpu_up+0x218/0x322
[<ffffffff8001445e>] cpu_up+0x5c/0x98
[<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
[<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
[<ffffffff806680e8>] device_online+0x56/0x72
[<ffffffff80668190>] online_store+0x8c/0xae
[<ffffffff80662dde>] dev_attr_store+0xe/0x1a
[<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
[<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
[<ffffffff80230170>] vfs_write+0x2a4/0x3e0
[<ffffffff802303e8>] ksys_write+0x5e/0xc8
[<ffffffff80230460>] sys_write+0xe/0x16
[<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
[<ffffffff80003e8c>] ret_from_exception+0x0/0x64
irq event stamp: 12335
hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
kernel BUG at kernel/irq_work.c:245!
Kernel BUG [#1]
Modules linked in:
CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
Hardware name: riscv-virtio,qemu (DT)
epc : irq_work_run_list+0x30/0x32
 ra : irq_work_run+0x2a/0x4c
...
[<ffffffff8011a346>] irq_work_run_list+0x30/0x32
[<ffffffff8011a372>] irq_work_run+0x2a/0x4c
[<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
[<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
[<ffffffff800142f8>] _cpu_up+0x218/0x322
[<ffffffff8001445e>] cpu_up+0x5c/0x98
[<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
[<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
[<ffffffff806680e8>] device_online+0x56/0x72
[<ffffffff80668190>] online_store+0x8c/0xae
[<ffffffff80662dde>] dev_attr_store+0xe/0x1a
[<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
[<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
[<ffffffff80230170>] vfs_write+0x2a4/0x3e0
[<ffffffff802303e8>] ksys_write+0x5e/0xc8
[<ffffffff80230460>] sys_write+0xe/0x16
[<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
[<ffffffff80003e8c>] ret_from_exception+0x0/0x64
Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception in interrupt
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0-1
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

To avoid calling those checks in the CPU hot-plug paths again and again,
simply move the check at the beginning of the check_unaligned_access()
function and rely on the value determined during the system boot.

Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 arch/riscv/kernel/cpufeature.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 2f51dc4eaf77..3c502840a548 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -571,13 +571,13 @@ void check_unaligned_access(int cpu)
 	void *src;
 	long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
 
-	if (check_unaligned_access_emulated(cpu))
-		return;
-
 	/* We are already set since the last check */
 	if (per_cpu(misaligned_access_speed, cpu) != RISCV_HWPROBE_MISALIGNED_UNKNOWN)
 		return;
 
+	if (check_unaligned_access_emulated(cpu))
+		return;
+
 	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
 	if (!page) {
 		pr_warn("Can't alloc pages to measure memcpy performance");
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH] riscv: fix potential panic during CPU hot-plug
@ 2023-11-08 11:29   ` Marek Szyprowski
  0 siblings, 0 replies; 8+ messages in thread
From: Marek Szyprowski @ 2023-11-08 11:29 UTC (permalink / raw)
  To: linux-riscv, linux-kernel
  Cc: Marek Szyprowski, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Conor Dooley, Andrew Jones, Heiko Stuebner, Evan Green,
	Jisheng Zhang, Clément Léger

Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
test for unaligned access speed. It is being performed when given CPU is
onlined. Then, another test for misaligned access emulation has been
added in commit 71c54b3d169d ("riscv: report misaligned accesses
emulation to hwprobe"). This unaligned access speed doesn't really change
after the boot, so it is sufficient to do this test only once. This has
been partially added by commit c20d36cc2a20 ("riscv: don't probe
unaligned access speed if already done"), but this optimisation works
only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
check. Otherwise the 'misaligned_access_speed' pcpu varliable is
overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
first check to be always performed.

Recently I've noticed that the first check introduced a regression in the
CPU hot-plug mechanism. This can be observed as a following panic on
QEmu:

CPU1: off
cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
CPU1: off
CPU1: failed to come online
------------[ cut here ]------------
WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
Modules linked in:
CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
Hardware name: riscv-virtio,qemu (DT)
epc : __flush_smp_call_function_queue+0x90/0x292
 ra : smpcfd_dying_cpu+0xe/0x1c
...
[<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
[<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
[<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
[<ffffffff800142f8>] _cpu_up+0x218/0x322
[<ffffffff8001445e>] cpu_up+0x5c/0x98
[<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
[<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
[<ffffffff806680e8>] device_online+0x56/0x72
[<ffffffff80668190>] online_store+0x8c/0xae
[<ffffffff80662dde>] dev_attr_store+0xe/0x1a
[<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
[<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
[<ffffffff80230170>] vfs_write+0x2a4/0x3e0
[<ffffffff802303e8>] ksys_write+0x5e/0xc8
[<ffffffff80230460>] sys_write+0xe/0x16
[<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
[<ffffffff80003e8c>] ret_from_exception+0x0/0x64
irq event stamp: 12335
hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
kernel BUG at kernel/irq_work.c:245!
Kernel BUG [#1]
Modules linked in:
CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
Hardware name: riscv-virtio,qemu (DT)
epc : irq_work_run_list+0x30/0x32
 ra : irq_work_run+0x2a/0x4c
...
[<ffffffff8011a346>] irq_work_run_list+0x30/0x32
[<ffffffff8011a372>] irq_work_run+0x2a/0x4c
[<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
[<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
[<ffffffff800142f8>] _cpu_up+0x218/0x322
[<ffffffff8001445e>] cpu_up+0x5c/0x98
[<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
[<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
[<ffffffff806680e8>] device_online+0x56/0x72
[<ffffffff80668190>] online_store+0x8c/0xae
[<ffffffff80662dde>] dev_attr_store+0xe/0x1a
[<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
[<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
[<ffffffff80230170>] vfs_write+0x2a4/0x3e0
[<ffffffff802303e8>] ksys_write+0x5e/0xc8
[<ffffffff80230460>] sys_write+0xe/0x16
[<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
[<ffffffff80003e8c>] ret_from_exception+0x0/0x64
Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception in interrupt
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0-1
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

To avoid calling those checks in the CPU hot-plug paths again and again,
simply move the check at the beginning of the check_unaligned_access()
function and rely on the value determined during the system boot.

Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 arch/riscv/kernel/cpufeature.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 2f51dc4eaf77..3c502840a548 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -571,13 +571,13 @@ void check_unaligned_access(int cpu)
 	void *src;
 	long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
 
-	if (check_unaligned_access_emulated(cpu))
-		return;
-
 	/* We are already set since the last check */
 	if (per_cpu(misaligned_access_speed, cpu) != RISCV_HWPROBE_MISALIGNED_UNKNOWN)
 		return;
 
+	if (check_unaligned_access_emulated(cpu))
+		return;
+
 	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
 	if (!page) {
 		pr_warn("Can't alloc pages to measure memcpy performance");
-- 
2.34.1


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: fix potential panic during CPU hot-plug
  2023-11-08 11:29   ` Marek Szyprowski
@ 2023-11-08 16:47     ` Evan Green
  -1 siblings, 0 replies; 8+ messages in thread
From: Evan Green @ 2023-11-08 16:47 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-riscv, linux-kernel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Conor Dooley, Andrew Jones, Heiko Stuebner,
	Jisheng Zhang, Clément Léger

On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
> test for unaligned access speed. It is being performed when given CPU is
> onlined. Then, another test for misaligned access emulation has been
> added in commit 71c54b3d169d ("riscv: report misaligned accesses
> emulation to hwprobe"). This unaligned access speed doesn't really change
> after the boot, so it is sufficient to do this test only once. This has
> been partially added by commit c20d36cc2a20 ("riscv: don't probe
> unaligned access speed if already done"), but this optimisation works
> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
> first check to be always performed.
>
> Recently I've noticed that the first check introduced a regression in the
> CPU hot-plug mechanism. This can be observed as a following panic on
> QEmu:
>
> CPU1: off
> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
> CPU1: off
> CPU1: failed to come online
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
> Modules linked in:
> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
> Hardware name: riscv-virtio,qemu (DT)
> epc : __flush_smp_call_function_queue+0x90/0x292
>  ra : smpcfd_dying_cpu+0xe/0x1c
> ...
> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> [<ffffffff806680e8>] device_online+0x56/0x72
> [<ffffffff80668190>] online_store+0x8c/0xae
> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> [<ffffffff80230460>] sys_write+0xe/0x16
> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> irq event stamp: 12335
> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
> ---[ end trace 0000000000000000 ]---
> ------------[ cut here ]------------
> kernel BUG at kernel/irq_work.c:245!
> Kernel BUG [#1]
> Modules linked in:
> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
> Hardware name: riscv-virtio,qemu (DT)
> epc : irq_work_run_list+0x30/0x32
>  ra : irq_work_run+0x2a/0x4c
> ...
> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> [<ffffffff806680e8>] device_online+0x56/0x72
> [<ffffffff80668190>] online_store+0x8c/0xae
> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> [<ffffffff80230460>] sys_write+0xe/0x16
> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> SMP: stopping secondary CPUs
> SMP: failed to stop secondary CPUs 0-1
> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> To avoid calling those checks in the CPU hot-plug paths again and again,
> simply move the check at the beginning of the check_unaligned_access()
> function and rely on the value determined during the system boot.
>
> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>

Hi Marek,
Thanks for the patch. I happened to spot that bug too
(check_unaligned_access_emulated() clobbering the per-cpu variable
back to unknown). Since I was rearranging that code to try to run the
speed measurements in parallel, I moved that check around to hopefully
solve the same issue you're reporting. Can you see if this patch also
fixes your issue:
https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
. It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
misaligned access speed in parallel").

-Evan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: fix potential panic during CPU hot-plug
@ 2023-11-08 16:47     ` Evan Green
  0 siblings, 0 replies; 8+ messages in thread
From: Evan Green @ 2023-11-08 16:47 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-riscv, linux-kernel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Conor Dooley, Andrew Jones, Heiko Stuebner,
	Jisheng Zhang, Clément Léger

On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
> test for unaligned access speed. It is being performed when given CPU is
> onlined. Then, another test for misaligned access emulation has been
> added in commit 71c54b3d169d ("riscv: report misaligned accesses
> emulation to hwprobe"). This unaligned access speed doesn't really change
> after the boot, so it is sufficient to do this test only once. This has
> been partially added by commit c20d36cc2a20 ("riscv: don't probe
> unaligned access speed if already done"), but this optimisation works
> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
> first check to be always performed.
>
> Recently I've noticed that the first check introduced a regression in the
> CPU hot-plug mechanism. This can be observed as a following panic on
> QEmu:
>
> CPU1: off
> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
> CPU1: off
> CPU1: failed to come online
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
> Modules linked in:
> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
> Hardware name: riscv-virtio,qemu (DT)
> epc : __flush_smp_call_function_queue+0x90/0x292
>  ra : smpcfd_dying_cpu+0xe/0x1c
> ...
> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> [<ffffffff806680e8>] device_online+0x56/0x72
> [<ffffffff80668190>] online_store+0x8c/0xae
> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> [<ffffffff80230460>] sys_write+0xe/0x16
> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> irq event stamp: 12335
> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
> ---[ end trace 0000000000000000 ]---
> ------------[ cut here ]------------
> kernel BUG at kernel/irq_work.c:245!
> Kernel BUG [#1]
> Modules linked in:
> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
> Hardware name: riscv-virtio,qemu (DT)
> epc : irq_work_run_list+0x30/0x32
>  ra : irq_work_run+0x2a/0x4c
> ...
> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> [<ffffffff806680e8>] device_online+0x56/0x72
> [<ffffffff80668190>] online_store+0x8c/0xae
> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> [<ffffffff80230460>] sys_write+0xe/0x16
> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> SMP: stopping secondary CPUs
> SMP: failed to stop secondary CPUs 0-1
> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> To avoid calling those checks in the CPU hot-plug paths again and again,
> simply move the check at the beginning of the check_unaligned_access()
> function and rely on the value determined during the system boot.
>
> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>

Hi Marek,
Thanks for the patch. I happened to spot that bug too
(check_unaligned_access_emulated() clobbering the per-cpu variable
back to unknown). Since I was rearranging that code to try to run the
speed measurements in parallel, I moved that check around to hopefully
solve the same issue you're reporting. Can you see if this patch also
fixes your issue:
https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
. It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
misaligned access speed in parallel").

-Evan

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: fix potential panic during CPU hot-plug
  2023-11-08 16:47     ` Evan Green
@ 2023-11-08 17:18       ` Marek Szyprowski
  -1 siblings, 0 replies; 8+ messages in thread
From: Marek Szyprowski @ 2023-11-08 17:18 UTC (permalink / raw)
  To: Evan Green
  Cc: linux-riscv, linux-kernel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Conor Dooley, Andrew Jones, Heiko Stuebner,
	Jisheng Zhang, Clément Léger

Dear Evan,

On 08.11.2023 17:47, Evan Green wrote:
> On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
> <m.szyprowski@samsung.com>  wrote:
>> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
>> test for unaligned access speed. It is being performed when given CPU is
>> onlined. Then, another test for misaligned access emulation has been
>> added in commit 71c54b3d169d ("riscv: report misaligned accesses
>> emulation to hwprobe"). This unaligned access speed doesn't really change
>> after the boot, so it is sufficient to do this test only once. This has
>> been partially added by commit c20d36cc2a20 ("riscv: don't probe
>> unaligned access speed if already done"), but this optimisation works
>> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
>> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
>> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
>> first check to be always performed.
>>
>> Recently I've noticed that the first check introduced a regression in the
>> CPU hot-plug mechanism. This can be observed as a following panic on
>> QEmu:
>>
>> CPU1: off
>> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
>> CPU1: off
>> CPU1: failed to come online
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
>> Modules linked in:
>> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
>> Hardware name: riscv-virtio,qemu (DT)
>> epc : __flush_smp_call_function_queue+0x90/0x292
>>   ra : smpcfd_dying_cpu+0xe/0x1c
>> ...
>> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
>> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
>> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
>> [<ffffffff800142f8>] _cpu_up+0x218/0x322
>> [<ffffffff8001445e>] cpu_up+0x5c/0x98
>> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
>> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
>> [<ffffffff806680e8>] device_online+0x56/0x72
>> [<ffffffff80668190>] online_store+0x8c/0xae
>> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
>> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
>> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
>> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
>> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
>> [<ffffffff80230460>] sys_write+0xe/0x16
>> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
>> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
>> irq event stamp: 12335
>> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
>> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
>> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
>> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
>> ---[ end trace 0000000000000000 ]---
>> ------------[ cut here ]------------
>> kernel BUG at kernel/irq_work.c:245!
>> Kernel BUG [#1]
>> Modules linked in:
>> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
>> Hardware name: riscv-virtio,qemu (DT)
>> epc : irq_work_run_list+0x30/0x32
>>   ra : irq_work_run+0x2a/0x4c
>> ...
>> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
>> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
>> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
>> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
>> [<ffffffff800142f8>] _cpu_up+0x218/0x322
>> [<ffffffff8001445e>] cpu_up+0x5c/0x98
>> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
>> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
>> [<ffffffff806680e8>] device_online+0x56/0x72
>> [<ffffffff80668190>] online_store+0x8c/0xae
>> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
>> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
>> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
>> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
>> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
>> [<ffffffff80230460>] sys_write+0xe/0x16
>> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
>> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
>> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
>> ---[ end trace 0000000000000000 ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>> SMP: stopping secondary CPUs
>> SMP: failed to stop secondary CPUs 0-1
>> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>>
>> To avoid calling those checks in the CPU hot-plug paths again and again,
>> simply move the check at the beginning of the check_unaligned_access()
>> function and rely on the value determined during the system boot.
>>
>> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
>> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
>> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
>> Signed-off-by: Marek Szyprowski<m.szyprowski@samsung.com>
> Hi Marek,
> Thanks for the patch. I happened to spot that bug too
> (check_unaligned_access_emulated() clobbering the per-cpu variable
> back to unknown). Since I was rearranging that code to try to run the
> speed measurements in parallel, I moved that check around to hopefully
> solve the same issue you're reporting. Can you see if this patch also
> fixes your issue:
> https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
> . It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
> misaligned access speed in parallel").

Thanks for for the information. Yes, Your patch also fixes this issue, 
but it is a bit invasive (it perfectly fits for the -next material 
though). Maybe my little cleanup could be applied as a fix during -rc 
cycle? It has an advantage that it can be easily backported to v6.6-stable.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: fix potential panic during CPU hot-plug
@ 2023-11-08 17:18       ` Marek Szyprowski
  0 siblings, 0 replies; 8+ messages in thread
From: Marek Szyprowski @ 2023-11-08 17:18 UTC (permalink / raw)
  To: Evan Green
  Cc: linux-riscv, linux-kernel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Conor Dooley, Andrew Jones, Heiko Stuebner,
	Jisheng Zhang, Clément Léger

Dear Evan,

On 08.11.2023 17:47, Evan Green wrote:
> On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
> <m.szyprowski@samsung.com>  wrote:
>> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
>> test for unaligned access speed. It is being performed when given CPU is
>> onlined. Then, another test for misaligned access emulation has been
>> added in commit 71c54b3d169d ("riscv: report misaligned accesses
>> emulation to hwprobe"). This unaligned access speed doesn't really change
>> after the boot, so it is sufficient to do this test only once. This has
>> been partially added by commit c20d36cc2a20 ("riscv: don't probe
>> unaligned access speed if already done"), but this optimisation works
>> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
>> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
>> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
>> first check to be always performed.
>>
>> Recently I've noticed that the first check introduced a regression in the
>> CPU hot-plug mechanism. This can be observed as a following panic on
>> QEmu:
>>
>> CPU1: off
>> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
>> CPU1: off
>> CPU1: failed to come online
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
>> Modules linked in:
>> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
>> Hardware name: riscv-virtio,qemu (DT)
>> epc : __flush_smp_call_function_queue+0x90/0x292
>>   ra : smpcfd_dying_cpu+0xe/0x1c
>> ...
>> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
>> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
>> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
>> [<ffffffff800142f8>] _cpu_up+0x218/0x322
>> [<ffffffff8001445e>] cpu_up+0x5c/0x98
>> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
>> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
>> [<ffffffff806680e8>] device_online+0x56/0x72
>> [<ffffffff80668190>] online_store+0x8c/0xae
>> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
>> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
>> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
>> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
>> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
>> [<ffffffff80230460>] sys_write+0xe/0x16
>> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
>> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
>> irq event stamp: 12335
>> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
>> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
>> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
>> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
>> ---[ end trace 0000000000000000 ]---
>> ------------[ cut here ]------------
>> kernel BUG at kernel/irq_work.c:245!
>> Kernel BUG [#1]
>> Modules linked in:
>> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
>> Hardware name: riscv-virtio,qemu (DT)
>> epc : irq_work_run_list+0x30/0x32
>>   ra : irq_work_run+0x2a/0x4c
>> ...
>> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
>> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
>> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
>> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
>> [<ffffffff800142f8>] _cpu_up+0x218/0x322
>> [<ffffffff8001445e>] cpu_up+0x5c/0x98
>> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
>> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
>> [<ffffffff806680e8>] device_online+0x56/0x72
>> [<ffffffff80668190>] online_store+0x8c/0xae
>> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
>> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
>> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
>> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
>> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
>> [<ffffffff80230460>] sys_write+0xe/0x16
>> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
>> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
>> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
>> ---[ end trace 0000000000000000 ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>> SMP: stopping secondary CPUs
>> SMP: failed to stop secondary CPUs 0-1
>> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>>
>> To avoid calling those checks in the CPU hot-plug paths again and again,
>> simply move the check at the beginning of the check_unaligned_access()
>> function and rely on the value determined during the system boot.
>>
>> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
>> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
>> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
>> Signed-off-by: Marek Szyprowski<m.szyprowski@samsung.com>
> Hi Marek,
> Thanks for the patch. I happened to spot that bug too
> (check_unaligned_access_emulated() clobbering the per-cpu variable
> back to unknown). Since I was rearranging that code to try to run the
> speed measurements in parallel, I moved that check around to hopefully
> solve the same issue you're reporting. Can you see if this patch also
> fixes your issue:
> https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
> . It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
> misaligned access speed in parallel").

Thanks for for the information. Yes, Your patch also fixes this issue, 
but it is a bit invasive (it perfectly fits for the -next material 
though). Maybe my little cleanup could be applied as a fix during -rc 
cycle? It has an advantage that it can be easily backported to v6.6-stable.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: fix potential panic during CPU hot-plug
  2023-11-08 17:18       ` Marek Szyprowski
@ 2023-11-08 17:30         ` Evan Green
  -1 siblings, 0 replies; 8+ messages in thread
From: Evan Green @ 2023-11-08 17:30 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-riscv, linux-kernel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Conor Dooley, Andrew Jones, Heiko Stuebner,
	Jisheng Zhang, Clément Léger

On Wed, Nov 8, 2023 at 9:18 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Dear Evan,
>
> On 08.11.2023 17:47, Evan Green wrote:
> > On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
> > <m.szyprowski@samsung.com>  wrote:
> >> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
> >> test for unaligned access speed. It is being performed when given CPU is
> >> onlined. Then, another test for misaligned access emulation has been
> >> added in commit 71c54b3d169d ("riscv: report misaligned accesses
> >> emulation to hwprobe"). This unaligned access speed doesn't really change
> >> after the boot, so it is sufficient to do this test only once. This has
> >> been partially added by commit c20d36cc2a20 ("riscv: don't probe
> >> unaligned access speed if already done"), but this optimisation works
> >> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
> >> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
> >> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
> >> first check to be always performed.
> >>
> >> Recently I've noticed that the first check introduced a regression in the
> >> CPU hot-plug mechanism. This can be observed as a following panic on
> >> QEmu:
> >>
> >> CPU1: off
> >> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
> >> CPU1: off
> >> CPU1: failed to come online
> >> ------------[ cut here ]------------
> >> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
> >> Modules linked in:
> >> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
> >> Hardware name: riscv-virtio,qemu (DT)
> >> epc : __flush_smp_call_function_queue+0x90/0x292
> >>   ra : smpcfd_dying_cpu+0xe/0x1c
> >> ...
> >> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
> >> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
> >> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> >> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> >> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> >> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> >> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> >> [<ffffffff806680e8>] device_online+0x56/0x72
> >> [<ffffffff80668190>] online_store+0x8c/0xae
> >> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> >> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> >> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> >> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> >> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> >> [<ffffffff80230460>] sys_write+0xe/0x16
> >> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> >> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> >> irq event stamp: 12335
> >> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
> >> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
> >> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
> >> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
> >> ---[ end trace 0000000000000000 ]---
> >> ------------[ cut here ]------------
> >> kernel BUG at kernel/irq_work.c:245!
> >> Kernel BUG [#1]
> >> Modules linked in:
> >> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
> >> Hardware name: riscv-virtio,qemu (DT)
> >> epc : irq_work_run_list+0x30/0x32
> >>   ra : irq_work_run+0x2a/0x4c
> >> ...
> >> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
> >> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
> >> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
> >> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> >> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> >> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> >> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> >> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> >> [<ffffffff806680e8>] device_online+0x56/0x72
> >> [<ffffffff80668190>] online_store+0x8c/0xae
> >> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> >> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> >> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> >> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> >> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> >> [<ffffffff80230460>] sys_write+0xe/0x16
> >> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> >> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> >> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
> >> ---[ end trace 0000000000000000 ]---
> >> Kernel panic - not syncing: Fatal exception in interrupt
> >> SMP: stopping secondary CPUs
> >> SMP: failed to stop secondary CPUs 0-1
> >> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> >>
> >> To avoid calling those checks in the CPU hot-plug paths again and again,
> >> simply move the check at the beginning of the check_unaligned_access()
> >> function and rely on the value determined during the system boot.
> >>
> >> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> >> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
> >> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
> >> Signed-off-by: Marek Szyprowski<m.szyprowski@samsung.com>
> > Hi Marek,
> > Thanks for the patch. I happened to spot that bug too
> > (check_unaligned_access_emulated() clobbering the per-cpu variable
> > back to unknown). Since I was rearranging that code to try to run the
> > speed measurements in parallel, I moved that check around to hopefully
> > solve the same issue you're reporting. Can you see if this patch also
> > fixes your issue:
> > https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
> > . It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
> > misaligned access speed in parallel").
>
> Thanks for for the information. Yes, Your patch also fixes this issue,
> but it is a bit invasive (it perfectly fits for the -next material
> though). Maybe my little cleanup could be applied as a fix during -rc
> cycle? It has an advantage that it can be easily backported to v6.6-stable.

That's fine with me.

Reviewed-by: Evan Green <evan@rivosinc.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: fix potential panic during CPU hot-plug
@ 2023-11-08 17:30         ` Evan Green
  0 siblings, 0 replies; 8+ messages in thread
From: Evan Green @ 2023-11-08 17:30 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-riscv, linux-kernel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Conor Dooley, Andrew Jones, Heiko Stuebner,
	Jisheng Zhang, Clément Léger

On Wed, Nov 8, 2023 at 9:18 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Dear Evan,
>
> On 08.11.2023 17:47, Evan Green wrote:
> > On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
> > <m.szyprowski@samsung.com>  wrote:
> >> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
> >> test for unaligned access speed. It is being performed when given CPU is
> >> onlined. Then, another test for misaligned access emulation has been
> >> added in commit 71c54b3d169d ("riscv: report misaligned accesses
> >> emulation to hwprobe"). This unaligned access speed doesn't really change
> >> after the boot, so it is sufficient to do this test only once. This has
> >> been partially added by commit c20d36cc2a20 ("riscv: don't probe
> >> unaligned access speed if already done"), but this optimisation works
> >> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
> >> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
> >> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
> >> first check to be always performed.
> >>
> >> Recently I've noticed that the first check introduced a regression in the
> >> CPU hot-plug mechanism. This can be observed as a following panic on
> >> QEmu:
> >>
> >> CPU1: off
> >> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
> >> CPU1: off
> >> CPU1: failed to come online
> >> ------------[ cut here ]------------
> >> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
> >> Modules linked in:
> >> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
> >> Hardware name: riscv-virtio,qemu (DT)
> >> epc : __flush_smp_call_function_queue+0x90/0x292
> >>   ra : smpcfd_dying_cpu+0xe/0x1c
> >> ...
> >> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
> >> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
> >> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> >> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> >> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> >> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> >> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> >> [<ffffffff806680e8>] device_online+0x56/0x72
> >> [<ffffffff80668190>] online_store+0x8c/0xae
> >> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> >> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> >> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> >> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> >> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> >> [<ffffffff80230460>] sys_write+0xe/0x16
> >> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> >> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> >> irq event stamp: 12335
> >> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
> >> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
> >> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
> >> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
> >> ---[ end trace 0000000000000000 ]---
> >> ------------[ cut here ]------------
> >> kernel BUG at kernel/irq_work.c:245!
> >> Kernel BUG [#1]
> >> Modules linked in:
> >> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
> >> Hardware name: riscv-virtio,qemu (DT)
> >> epc : irq_work_run_list+0x30/0x32
> >>   ra : irq_work_run+0x2a/0x4c
> >> ...
> >> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
> >> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
> >> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
> >> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> >> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> >> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> >> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> >> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> >> [<ffffffff806680e8>] device_online+0x56/0x72
> >> [<ffffffff80668190>] online_store+0x8c/0xae
> >> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> >> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> >> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> >> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> >> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> >> [<ffffffff80230460>] sys_write+0xe/0x16
> >> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> >> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> >> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
> >> ---[ end trace 0000000000000000 ]---
> >> Kernel panic - not syncing: Fatal exception in interrupt
> >> SMP: stopping secondary CPUs
> >> SMP: failed to stop secondary CPUs 0-1
> >> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> >>
> >> To avoid calling those checks in the CPU hot-plug paths again and again,
> >> simply move the check at the beginning of the check_unaligned_access()
> >> function and rely on the value determined during the system boot.
> >>
> >> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> >> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
> >> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
> >> Signed-off-by: Marek Szyprowski<m.szyprowski@samsung.com>
> > Hi Marek,
> > Thanks for the patch. I happened to spot that bug too
> > (check_unaligned_access_emulated() clobbering the per-cpu variable
> > back to unknown). Since I was rearranging that code to try to run the
> > speed measurements in parallel, I moved that check around to hopefully
> > solve the same issue you're reporting. Can you see if this patch also
> > fixes your issue:
> > https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
> > . It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
> > misaligned access speed in parallel").
>
> Thanks for for the information. Yes, Your patch also fixes this issue,
> but it is a bit invasive (it perfectly fits for the -next material
> though). Maybe my little cleanup could be applied as a fix during -rc
> cycle? It has an advantage that it can be easily backported to v6.6-stable.

That's fine with me.

Reviewed-by: Evan Green <evan@rivosinc.com>

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-11-08 17:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20231108112923eucas1p27474e921565e3f175e27e6598f0b71c3@eucas1p2.samsung.com>
2023-11-08 11:29 ` [PATCH] riscv: fix potential panic during CPU hot-plug Marek Szyprowski
2023-11-08 11:29   ` Marek Szyprowski
2023-11-08 16:47   ` Evan Green
2023-11-08 16:47     ` Evan Green
2023-11-08 17:18     ` Marek Szyprowski
2023-11-08 17:18       ` Marek Szyprowski
2023-11-08 17:30       ` Evan Green
2023-11-08 17:30         ` Evan Green

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.