linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Try to make SMP booting slightly less fragile
@ 2019-08-27 15:18 Will Deacon
  2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon

Hi everyone,

After spending some time investigating SMP boot issues when using
(random?) configs from Qian Cai with lots of debug options enabled, I
hacked together these two patches which make SMP booting a little more
robust.

The whole thing is still a racy mess, but I'm not sure we can do much
about that without reworking it to use per-cpu data structures which
we can update atomically.

Will

--->8

Will Deacon (3):
  arm64: smp: Increase secondary CPU boot timeout value
  arm64: smp: Don't enter kernel with NULL stack pointer or task struct
  arm64: smp: Treat unknown boot failures as being 'stuck in kernel'

 arch/arm64/kernel/head.S | 8 ++++++++
 arch/arm64/kernel/smp.c  | 4 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

-- 
2.11.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value
  2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
@ 2019-08-27 15:18 ` Will Deacon
  2019-08-27 16:05   ` Mark Rutland
  2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
  2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
  2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon

When many debug options are enabled simultaneously (e.g. PROVE_LOCKING,
KMEMLEAK, DEBUG_PAGE_ALLOC, KASAN etc), it is possible for us to timeout
when attempting to boot a secondary CPU and give up. Unfortunately, the
CPU will /eventually/ appear, and sit in the background happily stuck
in a recursive exception due to a NULL stack pointer.

Increase the timeout to 5s, which will of course be enough for anybody.

Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 018a33e01b0e..63c7a7682e93 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -123,7 +123,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 		 * time out.
 		 */
 		wait_for_completion_timeout(&cpu_running,
-					    msecs_to_jiffies(1000));
+					    msecs_to_jiffies(5000));
 
 		if (!cpu_online(cpu)) {
 			pr_crit("CPU%u: failed to come online\n", cpu);
-- 
2.11.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct
  2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
  2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
@ 2019-08-27 15:18 ` Will Deacon
  2019-08-27 16:04   ` Mark Rutland
  2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
  2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon

Although SMP bringup is inherently racy, we can significantly reduce
the window during which secondary CPUs can unexpectedly enter the
kernel by sanity checking the 'stack' and 'task' fields of the
'secondary_data' structure. If the booting CPU gave up waiting for us,
then they will have been cleared to NULL and we should spin in a WFE; WFI
loop instead.

Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/head.S | 8 ++++++++
 arch/arm64/kernel/smp.c  | 1 +
 2 files changed, 9 insertions(+)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 2cdacd1c141b..0baadf335172 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -724,14 +724,22 @@ __secondary_switched:
 
 	adr_l	x0, secondary_data
 	ldr	x1, [x0, #CPU_BOOT_STACK]	// get secondary_data.stack
+	cbz	x1, __secondary_too_slow
 	mov	sp, x1
 	ldr	x2, [x0, #CPU_BOOT_TASK]
+	cbz	x2, __secondary_too_slow
 	msr	sp_el0, x2
 	mov	x29, #0
 	mov	x30, #0
 	b	secondary_start_kernel
 ENDPROC(__secondary_switched)
 
+__secondary_too_slow:
+	wfe
+	wfi
+	b	__secondary_too_slow
+ENDPROC(__secondary_too_slow)
+
 /*
  * The booting CPU updates the failed status @__early_cpu_boot_status,
  * with MMU turned off.
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 63c7a7682e93..1f8aeb77cba5 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -136,6 +136,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 
 	secondary_data.task = NULL;
 	secondary_data.stack = NULL;
+	__flush_dcache_area(&secondary_data, sizeof(secondary_data));
 	status = READ_ONCE(secondary_data.status);
 	if (ret && status) {
 
-- 
2.11.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel'
  2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
  2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
  2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
@ 2019-08-27 15:18 ` Will Deacon
  2019-08-27 15:59   ` Mark Rutland
  2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2019-08-27 15:18 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: mark.rutland, catalin.marinas, Will Deacon

When we fail to bring a secondary CPU online and it fails in an unknown
state, we should assume the worst and increment 'cpus_stuck_in_kernel'
so that things like kexec() are disabled.

Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/smp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 1f8aeb77cba5..dc9fe879c279 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -147,6 +147,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 		default:
 			pr_err("CPU%u: failed in unknown state : 0x%lx\n",
 					cpu, status);
+			cpus_stuck_in_kernel++;
 			break;
 		case CPU_KILL_ME:
 			if (!op_cpu_kill(cpu)) {
-- 
2.11.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel'
  2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
@ 2019-08-27 15:59   ` Mark Rutland
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2019-08-27 15:59 UTC (permalink / raw)
  To: Will Deacon; +Cc: catalin.marinas, linux-arm-kernel

On Tue, Aug 27, 2019 at 04:18:15PM +0100, Will Deacon wrote:
> When we fail to bring a secondary CPU online and it fails in an unknown
> state, we should assume the worst and increment 'cpus_stuck_in_kernel'
> so that things like kexec() are disabled.

Definitely! I has assumed we already did this, but I see that we don't.

> Signed-off-by: Will Deacon <will@kernel.org>

I don't see a nicer way of doing this, so:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> ---
>  arch/arm64/kernel/smp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 1f8aeb77cba5..dc9fe879c279 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -147,6 +147,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  		default:
>  			pr_err("CPU%u: failed in unknown state : 0x%lx\n",
>  					cpu, status);
> +			cpus_stuck_in_kernel++;
>  			break;
>  		case CPU_KILL_ME:
>  			if (!op_cpu_kill(cpu)) {
> -- 
> 2.11.0
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct
  2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
@ 2019-08-27 16:04   ` Mark Rutland
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2019-08-27 16:04 UTC (permalink / raw)
  To: Will Deacon; +Cc: catalin.marinas, linux-arm-kernel

On Tue, Aug 27, 2019 at 04:18:14PM +0100, Will Deacon wrote:
> Although SMP bringup is inherently racy, we can significantly reduce
> the window during which secondary CPUs can unexpectedly enter the
> kernel by sanity checking the 'stack' and 'task' fields of the
> 'secondary_data' structure. If the booting CPU gave up waiting for us,
> then they will have been cleared to NULL and we should spin in a WFE; WFI
> loop instead.
> 
> Signed-off-by: Will Deacon <will@kernel.org>

In future I think that we can handle this better using the MPIDR hash
and atomics to flip the values and cater for races, but this is better
than what we have, so FWIW:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> ---
>  arch/arm64/kernel/head.S | 8 ++++++++
>  arch/arm64/kernel/smp.c  | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 2cdacd1c141b..0baadf335172 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -724,14 +724,22 @@ __secondary_switched:
>  
>  	adr_l	x0, secondary_data
>  	ldr	x1, [x0, #CPU_BOOT_STACK]	// get secondary_data.stack
> +	cbz	x1, __secondary_too_slow
>  	mov	sp, x1
>  	ldr	x2, [x0, #CPU_BOOT_TASK]
> +	cbz	x2, __secondary_too_slow
>  	msr	sp_el0, x2
>  	mov	x29, #0
>  	mov	x30, #0
>  	b	secondary_start_kernel
>  ENDPROC(__secondary_switched)
>  
> +__secondary_too_slow:
> +	wfe
> +	wfi
> +	b	__secondary_too_slow
> +ENDPROC(__secondary_too_slow)
> +
>  /*
>   * The booting CPU updates the failed status @__early_cpu_boot_status,
>   * with MMU turned off.
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 63c7a7682e93..1f8aeb77cba5 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -136,6 +136,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  
>  	secondary_data.task = NULL;
>  	secondary_data.stack = NULL;
> +	__flush_dcache_area(&secondary_data, sizeof(secondary_data));
>  	status = READ_ONCE(secondary_data.status);
>  	if (ret && status) {
>  
> -- 
> 2.11.0
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value
  2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
@ 2019-08-27 16:05   ` Mark Rutland
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2019-08-27 16:05 UTC (permalink / raw)
  To: Will Deacon; +Cc: catalin.marinas, linux-arm-kernel

On Tue, Aug 27, 2019 at 04:18:13PM +0100, Will Deacon wrote:
> When many debug options are enabled simultaneously (e.g. PROVE_LOCKING,
> KMEMLEAK, DEBUG_PAGE_ALLOC, KASAN etc), it is possible for us to timeout
> when attempting to boot a secondary CPU and give up. Unfortunately, the
> CPU will /eventually/ appear, and sit in the background happily stuck
> in a recursive exception due to a NULL stack pointer.
> 
> Increase the timeout to 5s, which will of course be enough for anybody.
> 
> Signed-off-by: Will Deacon <will@kernel.org>

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> ---
>  arch/arm64/kernel/smp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 018a33e01b0e..63c7a7682e93 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -123,7 +123,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>  		 * time out.
>  		 */
>  		wait_for_completion_timeout(&cpu_running,
> -					    msecs_to_jiffies(1000));
> +					    msecs_to_jiffies(5000));
>  
>  		if (!cpu_online(cpu)) {
>  			pr_crit("CPU%u: failed to come online\n", cpu);
> -- 
> 2.11.0
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-08-27 16:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-27 15:18 [PATCH 0/3] Try to make SMP booting slightly less fragile Will Deacon
2019-08-27 15:18 ` [PATCH 1/3] arm64: smp: Increase secondary CPU boot timeout value Will Deacon
2019-08-27 16:05   ` Mark Rutland
2019-08-27 15:18 ` [PATCH 2/3] arm64: smp: Don't enter kernel with NULL stack pointer or task struct Will Deacon
2019-08-27 16:04   ` Mark Rutland
2019-08-27 15:18 ` [PATCH 3/3] arm64: smp: Treat unknown boot failures as being 'stuck in kernel' Will Deacon
2019-08-27 15:59   ` Mark Rutland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).