[PATCH RFC] x86/reboot: Don't wait infinitely for IPI completion during reboot

* [PATCH RFC] x86/reboot: Don't wait infinitely for IPI completion during reboot
@ 2019-06-28 12:28 Grzegorz Halat
  2019-07-25 13:40 ` [tip:x86/core] x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI fails tip-bot for Grzegorz Halat
  2019-07-25 14:19 ` [tip:x86/apic] " tip-bot for Grzegorz Halat
  0 siblings, 2 replies; 3+ messages in thread
From: Grzegorz Halat @ 2019-06-28 12:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	x86, Don Zickus, Oleksandr Natalenko, Grzegorz Halat

Hi,
This patch is trying to address a deadlock that may happen during a
reboot.

CPU0 can stop a CPU which is holding a lock and there may be another CPU
with disabled interrupts waiting for the same lock. In this situation,
the CPU waiting for the lock can't be stopped and CPU0 will be
indefinitely waiting for IPI completion.

To avoid indefinitely looping kernel should send NMI after timeout.
It seems to me that this was the original intention of
commit 7d007d21e539 ("x86/reboot: Use NMI to assist in shutting down if
IRQ fails") but the removal of 'wait' in the first loop has been
forgotten.

I'm sending the patch as RFC, maybe there is a better way to handle this
situation. Another thing which wonders me is pr_emerg() after IPI.
Is this safe? I think that IPI can shut down a CPU that is holding a
lock used by pr_emerg(), so there may be a deadlock.

CPU0 is stuck in native_stop_other_cpus() waiting for shutdown of CPU2:

 #6 [ffffb1a7826b7dd0] delay_tsc at ffffffffb640f0b4
 #7 [ffffb1a7826b7dd0] native_stop_other_cpus at ffffffffb5c47ffa 
 #8 [ffffb1a7826b7df0] native_machine_shutdown at ffffffffb5c47132
 #9 [ffffb1a7826b7df8] native_machine_restart at ffffffffb5c47649
#10 [ffffb1a7826b7e00] __do_sys_reboot at ffffffffb5ccff62
#11 [ffffb1a7826b7f38] do_syscall_64 at ffffffffb5c0424b

The loop, in case of reboot() syscall wait==1:
File: linux/arch/x86/kernel/smp.c
218: 		timeout = USEC_PER_SEC;
219: 		while (num_online_cpus() > 1 && (wait || timeout--))
220: 			udelay(1);

CPU1 has been shutdown while holding tasklist_lock:

 #0 [ffff984418083fb8] stop_this_cpu at ffffffffb5c28495  << CPU stopped
 #1 [ffff984418083fc0] smp_reboot_interrupt at ffffffffb5c48169
 #2 [ffff984418083ff0] reboot_interrupt at ffffffffb6600bbf
--- <IRQ stack> ---
 #3 [ffffb1a782417be8] reboot_interrupt at ffffffffb6600bbf
    [exception RIP: delay_tsc+32]
 #4 [ffffb1a782417c98] probe_12098 at ffffffffc0813fab 
 #5 [ffffb1a782417cb0] do_wait at ffffffffb5cae5b4
 #6 [ffffb1a782417d08] optimized_callback at ffffffffb5c56be3
 #7 [ffffb1a782417d98] do_wait at ffffffffb5cae5b3 << read_lock() here
 #8 [ffffb1a782417df0] kernel_wait4 at ffffffffb5cafa4e
 #9 [ffffb1a782417e78] __do_sys_wait4 at ffffffffb5cafb73
#10 [ffffb1a782417f38] do_syscall_64 at ffffffffb5c0424b

Note: probe_12098 is SystemTap probe which injects delay.
I'm using SystemTap to reliably reproduce the issue 

CPU2 is trying to grab tasklist_lock with disabled interrupts:

RFLAGS: 00000006
 #5 [ffffb1a782397e78] queued_write_lock_slowpath at ffffffffb5d0052e
 #6 [ffffb1a782397e88] do_exit at ffffffffb5caf08e
 #7 [ffffb1a782397f08] do_group_exit at ffffffffb5caf8da
 #8 [ffffb1a782397f30] __x64_sys_exit_group at ffffffffb5caf954
 #9 [ffffb1a782397f38] do_syscall_64 at ffffffffb5c0424b

Signed-off-by: Grzegorz Halat <ghalat@redhat.com>
---
 arch/x86/kernel/smp.c | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 4693e2f3a03e..9186e432b8d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -211,30 +211,23 @@ static void native_stop_other_cpus(int wait)
 
 		apic->send_IPI_allbutself(REBOOT_VECTOR);
 
-		/*
-		 * Don't wait longer than a second if the caller
-		 * didn't ask us to wait.
-		 */
+		/* Don't wait longer than a second for IPI completion */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (num_online_cpus() > 1 && timeout--)
 			udelay(1);
 	}
 	
 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if ((num_online_cpus() > 1) && (!smp_no_nmi_ipi))  {
-		if (register_nmi_handler(NMI_LOCAL, smp_stop_nmi_callback,
-					 NMI_FLAG_FIRST, "smp_stop"))
-			/* Note: we ignore failures here */
-			/* Hope the REBOOT_IRQ is good enough */
-			goto finish;
-
-		/* sync above data before sending IRQ */
-		wmb();
-
-		pr_emerg("Shutting down cpus with NMI\n");
+	if (num_online_cpus() > 1) {
+		if (!smp_no_nmi_ipi && !register_nmi_handler(NMI_LOCAL,
+			smp_stop_nmi_callback, NMI_FLAG_FIRST, "smp_stop")){
+			/* sync above data before sending IRQ */
+			wmb();
 
-		apic->send_IPI_allbutself(NMI_VECTOR);
+			pr_emerg("Shutting down cpus with NMI\n");
 
+			apic->send_IPI_allbutself(NMI_VECTOR);
+		}
 		/*
 		 * Don't wait longer than a 10 ms if the caller
 		 * didn't ask us to wait.
@@ -244,7 +237,6 @@ static void native_stop_other_cpus(int wait)
 			udelay(1);
 	}
 
-finish:
 	local_irq_save(flags);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread