* [PATCH 0/3] Try to make SMP booting slightly less fragile
@ 2019-08-27 15:18 Will Deacon
2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon
Hi everyone,
After spending some time investigating SMP boot issues when using
(random?) configs from Qian Cai with lots of debug options enabled, I
hacked together these two patches which make SMP booting a little more
robust.
The whole thing is still a racy mess, but I'm not sure we can do much
about that without reworking it to use per-cpu data structures which
we can update atomically.
Will
--->8
Will Deacon (3):
arm64: smp: Increase secondary CPU boot timeout value
arm64: smp: Don't enter kernel with NULL stack pointer or task struct
arm64: smp: Treat unknown boot failures as being 'stuck in kernel'
arch/arm64/kernel/head.S | 8 ++++++++
arch/arm64/kernel/smp.c | 4 +++-
2 files changed, 11 insertions(+), 1 deletion(-)
--
2.11.0
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value
2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
@ 2019-08-27 15:18 ` Will Deacon
2019-08-27 16:05 ` Mark Rutland
2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon
When many debug options are enabled simultaneously (e.g. PROVE_LOCKING,
KMEMLEAK, DEBUG_PAGE_ALLOC, KASAN etc), it is possible for us to timeout
when attempting to boot a secondary CPU and give up. Unfortunately, the
CPU will /eventually/ appear, and sit in the background happily stuck
in a recursive exception due to a NULL stack pointer.
Increase the timeout to 5s, which will of course be enough for anybody.
Signed-off-by: Will Deacon <will@kernel.org>
---
arch/arm64/kernel/smp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 018a33e01b0e..63c7a7682e93 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -123,7 +123,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
* time out.
*/
wait_for_completion_timeout(&cpu_running,
- msecs_to_jiffies(1000));
+ msecs_to_jiffies(5000));
if (!cpu_online(cpu)) {
pr_crit("CPU%u: failed to come online\n", cpu);
--
2.11.0
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct
2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
@ 2019-08-27 15:18 ` Will Deacon
2019-08-27 16:04 ` Mark Rutland
2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon
Although SMP bringup is inherently racy, we can significantly reduce
the window during which secondary CPUs can unexpectedly enter the
kernel by sanity checking the 'stack' and 'task' fields of the
'secondary_data' structure. If the booting CPU gave up waiting for us,
then they will have been cleared to NULL and we should spin in a WFE; WFI
loop instead.
Signed-off-by: Will Deacon <will@kernel.org>
---
arch/arm64/kernel/head.S | 8 ++++++++
arch/arm64/kernel/smp.c | 1 +
2 files changed, 9 insertions(+)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 2cdacd1c141b..0baadf335172 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -724,14 +724,22 @@ __secondary_switched:
adr_l x0, secondary_data
ldr x1, [x0, #CPU_BOOT_STACK] // get secondary_data.stack
+ cbz x1, __secondary_too_slow
mov sp, x1
ldr x2, [x0, #CPU_BOOT_TASK]
+ cbz x2, __secondary_too_slow
msr sp_el0, x2
mov x29, #0
mov x30, #0
b secondary_start_kernel
ENDPROC(__secondary_switched)
+__secondary_too_slow:
+ wfe
+ wfi
+ b __secondary_too_slow
+ENDPROC(__secondary_too_slow)
+
/*
* The booting CPU updates the failed status @__early_cpu_boot_status,
* with MMU turned off.
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 63c7a7682e93..1f8aeb77cba5 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -136,6 +136,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
secondary_data.task = NULL;
secondary_data.stack = NULL;
+ __flush_dcache_area(&secondary_data, sizeof(secondary_data));
status = READ_ONCE(secondary_data.status);
if (ret && status) {
--
2.11.0
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel'
2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
@ 2019-08-27 15:18 ` Will Deacon
2019-08-27 15:59 ` Mark Rutland
2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon
When we fail to bring a secondary CPU online and it fails in an unknown
state, we should assume the worst and increment 'cpus_stuck_in_kernel'
so that things like kexec() are disabled.
Signed-off-by: Will Deacon <will@kernel.org>
---
arch/arm64/kernel/smp.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 1f8aeb77cba5..dc9fe879c279 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -147,6 +147,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
default:
pr_err("CPU%u: failed in unknown state : 0x%lx\n",
cpu, status);
+ cpus_stuck_in_kernel++;
break;
case CPU_KILL_ME:
if (!op_cpu_kill(cpu)) {
--
2.11.0
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel'
2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
@ 2019-08-27 15:59 ` Mark Rutland
0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2019-08-27 15:59 UTC (permalink / raw)
To: Will Deacon; +Cc: catalin.marinas, linux-arm-kernel
On Tue, Aug 27, 2019 at 04:18:15PM +0100, Will Deacon wrote:
> When we fail to bring a secondary CPU online and it fails in an unknown
> state, we should assume the worst and increment 'cpus_stuck_in_kernel'
> so that things like kexec() are disabled.
Definitely! I has assumed we already did this, but I see that we don't.
> Signed-off-by: Will Deacon <will@kernel.org>
I don't see a nicer way of doing this, so:
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Thanks,
Mark.
> ---
> arch/arm64/kernel/smp.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 1f8aeb77cba5..dc9fe879c279 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -147,6 +147,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
> default:
> pr_err("CPU%u: failed in unknown state : 0x%lx\n",
> cpu, status);
> + cpus_stuck_in_kernel++;
> break;
> case CPU_KILL_ME:
> if (!op_cpu_kill(cpu)) {
> --
> 2.11.0
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct
2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
@ 2019-08-27 16:04 ` Mark Rutland
0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2019-08-27 16:04 UTC (permalink / raw)
To: Will Deacon; +Cc: catalin.marinas, linux-arm-kernel
On Tue, Aug 27, 2019 at 04:18:14PM +0100, Will Deacon wrote:
> Although SMP bringup is inherently racy, we can significantly reduce
> the window during which secondary CPUs can unexpectedly enter the
> kernel by sanity checking the 'stack' and 'task' fields of the
> 'secondary_data' structure. If the booting CPU gave up waiting for us,
> then they will have been cleared to NULL and we should spin in a WFE; WFI
> loop instead.
>
> Signed-off-by: Will Deacon <will@kernel.org>
In future I think that we can handle this better using the MPIDR hash
and atomics to flip the values and cater for races, but this is better
than what we have, so FWIW:
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Thanks,
Mark.
> ---
> arch/arm64/kernel/head.S | 8 ++++++++
> arch/arm64/kernel/smp.c | 1 +
> 2 files changed, 9 insertions(+)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 2cdacd1c141b..0baadf335172 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -724,14 +724,22 @@ __secondary_switched:
>
> adr_l x0, secondary_data
> ldr x1, [x0, #CPU_BOOT_STACK] // get secondary_data.stack
> + cbz x1, __secondary_too_slow
> mov sp, x1
> ldr x2, [x0, #CPU_BOOT_TASK]
> + cbz x2, __secondary_too_slow
> msr sp_el0, x2
> mov x29, #0
> mov x30, #0
> b secondary_start_kernel
> ENDPROC(__secondary_switched)
>
> +__secondary_too_slow:
> + wfe
> + wfi
> + b __secondary_too_slow
> +ENDPROC(__secondary_too_slow)
> +
> /*
> * The booting CPU updates the failed status @__early_cpu_boot_status,
> * with MMU turned off.
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 63c7a7682e93..1f8aeb77cba5 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -136,6 +136,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>
> secondary_data.task = NULL;
> secondary_data.stack = NULL;
> + __flush_dcache_area(&secondary_data, sizeof(secondary_data));
> status = READ_ONCE(secondary_data.status);
> if (ret && status) {
>
> --
> 2.11.0
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value
2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
@ 2019-08-27 16:05 ` Mark Rutland
0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2019-08-27 16:05 UTC (permalink / raw)
To: Will Deacon; +Cc: catalin.marinas, linux-arm-kernel
On Tue, Aug 27, 2019 at 04:18:13PM +0100, Will Deacon wrote:
> When many debug options are enabled simultaneously (e.g. PROVE_LOCKING,
> KMEMLEAK, DEBUG_PAGE_ALLOC, KASAN etc), it is possible for us to timeout
> when attempting to boot a secondary CPU and give up. Unfortunately, the
> CPU will /eventually/ appear, and sit in the background happily stuck
> in a recursive exception due to a NULL stack pointer.
>
> Increase the timeout to 5s, which will of course be enough for anybody.
>
> Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Thanks,
Mark.
> ---
> arch/arm64/kernel/smp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 018a33e01b0e..63c7a7682e93 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -123,7 +123,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
> * time out.
> */
> wait_for_completion_timeout(&cpu_running,
> - msecs_to_jiffies(1000));
> + msecs_to_jiffies(5000));
>
> if (!cpu_online(cpu)) {
> pr_crit("CPU%u: failed to come online\n", cpu);
> --
> 2.11.0
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2019-08-27 16:05 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
2019-08-27 16:05 ` Mark Rutland
2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
2019-08-27 16:04 ` Mark Rutland
2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
2019-08-27 15:59 ` Mark Rutland
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).