* [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
@ 2023-06-03 20:06 Thomas Gleixner
  2023-06-03 20:06 ` [patch 1/6] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:06 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

Hi!

Ashok observed triple faults when executing kexec() on a kernel which has
'nosmt' on the kernel commandline and HT enabled in the BIOS.

'nosmt' brings up the HT siblings to the point where they have initialized
the CPU and then rolls the bringup back, which parks them in mwait_play_dead().
The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
MCE will immediately shut down the machine.

Some detective work revealed that:

  1) The kexec kernel can overwrite text, pagetables, stack and data of the
     previous kernel.

  2) If the kexec kernel writes to the memory which is monitored by an
     "offline" CPU, that CPU resumes execution. That's obviously doomed
     when the kexec kernel overwrote text, pagetables, data or stack.

While on my test machine the first kexec() after reset always "worked", the
second one reliably ended up in a triple fault.

The following series cures this by:

  1) Bringing offline CPUs which are stuck in mwait_play_dead() out of
     mwait by writing to the monitored cacheline

  2) Letting the woken up CPUs check the written control word and drop into
     a HLT loop if the control word requests it.

This is only half safe because HLT can resume execution due to NMI, SMI and
MCE. Unfortunately there is no truly safe mechanism to "park" a CPU reliably,
but there is at least one which rules out NMI and SMI as wakeup causes: INIT.

  3) If the system uses the regular INIT/STARTUP sequence to wake up
     secondary CPUS, then "park" all CPUs including the "offline" ones
     by sending them INIT IPIs.

The INIT IPI brings the CPU into a wait for wakeup state which is not
affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
problem comes back.

But that's not really any different from a CPU sitting in the HLT loop on
the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
the CPU tries to handle the MCE on overwritten text, pagetables etc.

So parking them via INIT is not completely solving the problem, but it
takes at least NMI and SMI out of the picture.
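The handshake in steps 1) and 2) can be modeled in plain user-space C. This
is an illustrative sketch only: an atomic control word stands in for the
MONITOR'ed cacheline (a store to it models the write that brings MWAIT out
of its wait), the constants mirror names used later in the series, and the
function names are invented for the example:

```c
#include <stdatomic.h>
#include <stdint.h>

#define CPUDEAD_MWAIT_WAIT      0xDEADBEEFu
#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEADu

struct mwait_cpu_dead {
	_Atomic uint32_t control;
	_Atomic uint32_t status;
};

/*
 * Model of the "offline" CPU waking from MWAIT and inspecting the
 * control word: on a kexec request it acknowledges via ->status and
 * would then spin in HLT; otherwise it would re-arm MONITOR/MWAIT.
 */
static int offline_cpu_check(struct mwait_cpu_dead *md)
{
	if (atomic_load(&md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
		atomic_store(&md->status, CPUDEAD_MWAIT_KEXEC_HLT);
		return 1;	/* park in a HLT loop */
	}
	return 0;		/* go back to MWAIT */
}

/*
 * Model of the kexec side: the store to ->control is what brings the
 * real CPU out of MWAIT, because that cacheline is being monitored.
 */
static int kick_offline_cpu(struct mwait_cpu_dead *md)
{
	if (atomic_load(&md->status) != CPUDEAD_MWAIT_WAIT)
		return 0;	/* not parked in mwait_play_dead() */
	atomic_store(&md->control, CPUDEAD_MWAIT_KEXEC_HLT);
	offline_cpu_check(md);	/* on real hardware the woken CPU runs this */
	return atomic_load(&md->status) == CPUDEAD_MWAIT_KEXEC_HLT;
}
```

On real hardware the two sides run on different CPUs and the ack has to be
polled with a timeout; the single-threaded model above only shows the state
transitions of the control and status words.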

The series is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/kexec

Thanks,

	tglx
---
 include/asm/smp.h |    4 +
 kernel/smp.c      |   62 +++++++++++++---------
 kernel/smpboot.c  |  151 ++++++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 156 insertions(+), 61 deletions(-)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/6] x86/smp: Remove pointless wmb() from native_stop_other_cpus()
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
@ 2023-06-03 20:06 ` Thomas Gleixner
  2023-06-03 20:06 ` [patch 2/6] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:06 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

The wmb() after the successful atomic_cmpxchg() is complete voodoo along
with the comment stating "sync above data before sending IRQ".

There is no "above" data except for the atomic_t stopping_cpu which has
just been acquired. The reboot IPI handler does not check any data and
unconditionally disables the CPU.

Remove this cargo cult barrier.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/smp.c |    3 ---
 1 file changed, 3 deletions(-)

--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -171,9 +171,6 @@ static void native_stop_other_cpus(int w
 		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
 			return;
 
-		/* sync above data before sending IRQ */
-		wmb();
-
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
 
 		/*



* [patch 2/6] x86/smp: Acquire stopping_cpu unconditionally
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
  2023-06-03 20:06 ` [patch 1/6] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
@ 2023-06-03 20:06 ` Thomas Gleixner
  2023-06-03 20:07 ` [patch 3/6] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:06 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

There is no reason to acquire the stopping_cpu atomic_t only when there is
more than one online CPU.

Make it unconditional to prepare for fixing the kexec() problem when there
are present but "offline" CPUs which play dead in mwait_play_dead().

They need to be brought out of mwait before kexec() as kexec() can
overwrite text, pagetables, stacks and the monitored cacheline of the
original kernel. The latter causes mwait to resume execution, which
obviously wreaks havoc on the kexec kernel and usually results in triple
faults.

Move the acquire out of the num_online_cpus() > 1 condition so the upcoming
'kick mwait' fixup is properly protected.
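As a rough user-space model of that first-CPU-wins acquisition (C11 atomics
standing in for the kernel's atomic_t and atomic_cmpxchg(); the helper name
is invented for the sketch):

```c
#include <stdatomic.h>

/* -1 means "nobody owns the stop machinery yet". */
static _Atomic int stopping_cpu = -1;

/*
 * The first caller wins and proceeds with the shutdown sequence; any
 * later caller (e.g. a concurrent panic on another CPU) backs off.
 * This mirrors moving the atomic_cmpxchg() in front of all shutdown
 * work, instead of doing it only when other CPUs are online.
 */
static int try_acquire_stopping_cpu(int this_cpu)
{
	int expected = -1;

	return atomic_compare_exchange_strong(&stopping_cpu,
					      &expected, this_cpu);
}
```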

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/smp.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -152,6 +152,10 @@ static void native_stop_other_cpus(int w
 	if (reboot_force)
 		return;
 
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
+		return;
+
 	/*
 	 * Use an own vector here because smp_call_function
 	 * does lots of things not suitable in a panic situation.
@@ -167,10 +171,6 @@ static void native_stop_other_cpus(int w
 	 * finish their work before we force them off with the NMI.
 	 */
 	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
-
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
 
 		/*



* [patch 3/6] x86/smp: Use dedicated cache-line for mwait_play_dead()
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
  2023-06-03 20:06 ` [patch 1/6] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
  2023-06-03 20:06 ` [patch 2/6] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
@ 2023-06-03 20:07 ` Thomas Gleixner
  2023-06-03 20:07 ` [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all that is needed is a cache line which is not written
to by other CPUs.

But there is a use case where a "dead" CPU needs to be brought out of that
mwait(): kexec().

The CPU needs to be brought out of mwait before kexec() as kexec() can
overwrite text, pagetables, stacks and the monitored cacheline of the
original kernel. The latter causes mwait to resume execution, which
obviously wreaks havoc on the kexec kernel and usually results in triple
faults.

Use dedicated per-CPU storage to prepare for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/smpboot.c |   24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
+struct mwait_cpu_dead {
+	unsigned int	control;
+	unsigned int	status;
+};
+
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
+
 /* Logical package management. We might want to allocate that dynamically */
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
@@ -1758,10 +1769,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0);
  */
 static inline void mwait_play_dead(void)
 {
+	struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
-	void *mwait_ptr;
 	int i;
 
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
@@ -1796,13 +1807,6 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
-	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors.  The actual
-	 * content is immaterial as it is not actually modified in any way.
-	 */
-	mwait_ptr = &current_thread_info()->flags;
-
 	wbinvd();
 
 	while (1) {
@@ -1814,9 +1818,9 @@ static inline void mwait_play_dead(void)
 		 * case where we return around the loop.
 		 */
 		mb();
-		clflush(mwait_ptr);
+		clflush(md);
 		mb();
-		__monitor(mwait_ptr, 0, 0);
+		__monitor(md, 0, 0);
 		mb();
 		__mwait(eax, 0);
 



* [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
                   ` (2 preceding siblings ...)
  2023-06-03 20:07 ` [patch 3/6] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
@ 2023-06-03 20:07 ` Thomas Gleixner
  2023-06-03 20:54   ` Ashok Raj
  2023-06-04  3:19   ` Ashok Raj
  2023-06-03 20:07 ` [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman, Ashok Raj

TLDR: It's a mess.

When kexec() is executed on a system with "offline" CPUs which are parked
in mwait_play_dead(), it can end up in a triple fault during the bootup of
the kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernel's data.

Cure this by bringing the offline CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offline CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help if a stray NMI, MCE or SMI hits the offline CPUs, as
those make them come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.
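The kicking side cannot wait forever for the acknowledgement, so the wait
is bounded. A minimal stand-alone sketch of such a bounded poll (generic
names invented for the example; the caller-supplied delay callback stands
in for udelay()):

```c
#include <stdint.h>

/*
 * Poll *status until it reaches 'want', re-issuing the wakeup write to
 * *control each round, for at most 'rounds' attempts.  Models a bounded
 * wait for the woken CPU to acknowledge the control word.  Returns 1 on
 * success, 0 when the CPU never acknowledged (it is then reported as
 * stuck and shutdown proceeds anyway).
 */
static int poll_for_ack(volatile uint32_t *control, volatile uint32_t *status,
			uint32_t want, unsigned int rounds,
			void (*delay)(void))
{
	unsigned int i;

	for (i = 0; *status != want && i < rounds; i++) {
		*control = want;	/* bring the CPU out of mwait */
		if (delay)
			delay();
	}
	return *status == want;
}
```

In the real patch the status word is only updated by the remote CPU once it
has left MWAIT, which is why the poll has to keep rewriting the control
word rather than writing it once.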

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/smp.h |    2 +
 arch/x86/kernel/smp.c      |   21 +++++++---------
 arch/x86/kernel/smpboot.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+), 11 deletions(-)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 void cond_wakeup_cpu0(void);
 
+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -156,19 +157,17 @@ static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
 		return;
 
-	/*
-	 * Use an own vector here because smp_call_function
-	 * does lots of things not suitable in a panic situation.
-	 */
+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
 
 	/*
-	 * We start by using the REBOOT_VECTOR irq.
-	 * The irq is treated as a sync point to allow critical
-	 * regions of code on other cpus to release their spin locks
-	 * and re-enable irqs.  Jumping straight to an NMI might
-	 * accidentally cause deadlocks with further shutdown/panic
-	 * code.  By syncing, we give the cpus up to one second to
-	 * finish their work before we force them off with the NMI.
+	 * Start by using the REBOOT_VECTOR. That acts as a sync point to
+	 * allow critical regions of code on other cpus to leave their
+	 * critical regions. Jumping straight to an NMI might accidentally
+	 * cause deadlocks with further shutdown code. This gives the CPUs
+	 * up to one second to finish their work before forcing them off
+	 * with the NMI.
 	 */
 	if (num_online_cpus() > 1) {
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@
 #include <linux/tboot.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/kexec.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
@@ -106,6 +107,9 @@ struct mwait_cpu_dead {
 	unsigned int	status;
 };
 
+#define CPUDEAD_MWAIT_WAIT	0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT	0x4A17DEAD
+
 /*
  * Cache line aligned data for mwait_play_dead(). Separate on purpose so
  * that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@ static void smp_callin(void)
 {
 	int cpuid;
 
+	/* Mop up eventual mwait_play_dead() wreckage */
+	this_cpu_write(mwait_cpu_dead.status, 0);
+	this_cpu_write(mwait_cpu_dead.control, 0);
+
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
 	 * cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
 	wbinvd();
 
 	while (1) {
@@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
 		mb();
 		__mwait(eax, 0);
 
+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while(1)
+				native_halt();
+		}
+
 		cond_wakeup_cpu0();
 	}
 }
 
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+	struct mwait_cpu_dead *md;
+	unsigned int cpu, i;
+
+	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+		/* Does it sit in mwait_play_dead() ? */
+		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+			continue;
+
+		/* Wait maximal 5ms */
+		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+			/* Bring it out of mwait */
+			WRITE_ONCE(md->control, newstate);
+			udelay(5);
+		}
+
+		if (READ_ONCE(md->status) != newstate)
+			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
+	}
+}
+
 void __noreturn hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)



* [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
                   ` (3 preceding siblings ...)
  2023-06-03 20:07 ` [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
@ 2023-06-03 20:07 ` Thomas Gleixner
  2023-06-04  4:02   ` Mika Penttilä
  2023-06-05  8:23   ` [patch v2 " Thomas Gleixner
  2023-06-03 20:07 ` [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
  2023-06-05 17:41 ` [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Sean Christopherson
  6 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

Putting CPUs into INIT is a safer way to park them during kexec().

Split the INIT assert/deassert sequence out so it can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/smpboot.c |   51 +++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 29 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -853,47 +853,40 @@ wakeup_secondary_cpu_via_nmi(int apicid,
 	return (send_status | accept_status);
 }
 
-static int
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+static void send_init_sequence(int phys_apicid)
 {
-	unsigned long send_status = 0, accept_status = 0;
-	int maxlvt, num_starts, j;
-
-	maxlvt = lapic_get_maxlvt();
+	int maxlvt = lapic_get_maxlvt();
 
-	/*
-	 * Be paranoid about clearing APIC errors.
-	 */
+	/* Be paranoid about clearing APIC errors. */
 	if (APIC_INTEGRATED(boot_cpu_apic_version)) {
-		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP.  */
+		/* Due to the Pentium erratum 3AP.  */
+		if (maxlvt > 3)
 			apic_write(APIC_ESR, 0);
 		apic_read(APIC_ESR);
 	}
 
-	pr_debug("Asserting INIT\n");
-
-	/*
-	 * Turn INIT on target chip
-	 */
-	/*
-	 * Send IPI
-	 */
-	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
-		       phys_apicid);
-
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	/* Assert INIT on the target CPU */
+	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
 
 	udelay(init_udelay);
 
-	pr_debug("Deasserting INIT\n");
-
-	/* Target chip */
-	/* Send IPI */
+	/* Deassert INIT on the target CPU */
 	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
+}
+
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+{
+	unsigned long send_status = 0, accept_status = 0;
+	int maxlvt, num_starts, j;
+
+	preempt_disable();
 
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	send_init_sequence(phys_apicid);
 
 	mb();
 



* [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
                   ` (4 preceding siblings ...)
  2023-06-03 20:07 ` [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
@ 2023-06-03 20:07 ` Thomas Gleixner
  2023-06-03 20:57   ` Ashok Raj
  2023-06-05 17:41 ` [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Sean Christopherson
  6 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-03 20:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, so it has the same issue as the
MWAIT loop.

Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.

A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.

So choose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which do not
use INIT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/smp.h |    2 ++
 arch/x86/kernel/smp.c      |   38 +++++++++++++++++++++++++++++---------
 arch/x86/kernel/smpboot.c  |   19 +++++++++++++++++++
 3 files changed, 50 insertions(+), 9 deletions(-)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -139,6 +139,8 @@ void native_send_call_func_ipi(const str
 void native_send_call_func_single_ipi(int cpu);
 void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
+bool smp_park_nonboot_cpus_in_init(void);
+
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
 
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -130,7 +130,7 @@ static int smp_stop_nmi_callback(unsigne
 }
 
 /*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
  */
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
@@ -147,8 +147,7 @@ static int register_stop_handler(void)
 
 static void native_stop_other_cpus(int wait)
 {
-	unsigned long flags;
-	unsigned long timeout;
+	unsigned long flags, timeout;
 
 	if (reboot_force)
 		return;
@@ -164,10 +163,10 @@ static void native_stop_other_cpus(int w
 	/*
 	 * Start by using the REBOOT_VECTOR. That acts as a sync point to
 	 * allow critical regions of code on other cpus to leave their
-	 * critical regions. Jumping straight to an NMI might accidentally
-	 * cause deadlocks with further shutdown code. This gives the CPUs
-	 * up to one second to finish their work before forcing them off
-	 * with the NMI.
+	 * critical regions. Jumping straight to NMI or INIT might
+	 * accidentally cause deadlocks with further shutdown code. This
+	 * gives the CPUs up to one second to finish their work before
+	 * forcing them off with the NMI or INIT.
 	 */
 	if (num_online_cpus() > 1) {
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
@@ -175,7 +174,7 @@ static void native_stop_other_cpus(int w
 		/*
 		 * Don't wait longer than a second for IPI completion. The
 		 * wait request is not checked here because that would
-		 * prevent an NMI shutdown attempt in case that not all
+		 * prevent an NMI/INIT shutdown in case that not all
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
@@ -183,7 +182,27 @@ static void native_stop_other_cpus(int w
 			udelay(1);
 	}
 
-	/* if the REBOOT_VECTOR didn't work, try with the NMI */
+	/*
+	 * Park all nonboot CPUs in INIT including offline CPUs, if
+	 * possible. That's a safe place where they can't resume execution
+	 * of HLT and then execute the HLT loop from overwritten text or
+	 * page tables.
+	 *
+	 * The only downside is a broadcast MCE, but up to the point where
+	 * the kexec() kernel brought all APs online again an MCE will just
+	 * make HLT resume and handle the MCE. The machine crashes and burns
+	 * due to overwritten text, page tables and data. So there is a
+	 * choice between fire and frying pan. The result is pretty much
+	 * the same. Choose frying pan until x86 provides a sane mechanism
+	 * to park a CPU.
+	 */
+	if (smp_park_nonboot_cpus_in_init())
+		goto done;
+
+	/*
+	 * If park with INIT was not possible and the REBOOT_VECTOR didn't
+	 * take all secondary CPUs offline, try with the NMI.
+	 */
 	if (num_online_cpus() > 1) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
@@ -208,6 +227,7 @@ static void native_stop_other_cpus(int w
 			udelay(1);
 	}
 
+done:
 	local_irq_save(flags);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1467,6 +1467,25 @@ void arch_thaw_secondary_cpus_end(void)
 	cache_aps_init();
 }
 
+bool smp_park_nonboot_cpus_in_init(void)
+{
+	unsigned int cpu, this_cpu = smp_processor_id();
+	unsigned int apicid;
+
+	if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
+		return false;
+
+	for_each_present_cpu(cpu) {
+		if (cpu == this_cpu)
+			continue;
+		apicid = apic->cpu_present_to_apicid(cpu);
+		if (apicid == BAD_APICID)
+			continue;
+		send_init_sequence(apicid);
+	}
+	return true;
+}
+
 /*
  * Early setup to make printk work.
  */



* Re: [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-03 20:07 ` [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
@ 2023-06-03 20:54   ` Ashok Raj
  2023-06-04  3:19   ` Ashok Raj
  1 sibling, 0 replies; 29+ messages in thread
From: Ashok Raj @ 2023-06-03 20:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman, Ashok Raj

On Sat, Jun 03, 2023 at 10:07:01PM +0200, Thomas Gleixner wrote:
> TLDR: It's a mess.
> 
> When kexec() is executed on a system with "offline" CPUs, which are parked
> in mwait_play_dead() it can end up in a triple fault during the bootup of
> the kexec kernel or cause hard to diagnose data corruption.
> 
> The reason is that kexec() eventually overwrites the previous kernels text,
> page tables, data and stack, If it writes to the cache line which is
> monitored by an previously offlined CPU, MWAIT resumes execution and ends
> up executing the wrong text, dereferencing overwritten page tables or
> corrupting the kexec kernels data.
> 
> Cure this by bringing the offline CPUs out of MWAIT into HLT.
> 
> Write to the monitored cache line of each offline CPU, which makes MWAIT
> resume execution. The written control word tells the offline CPUs to issue
> HLT, which does not have the MWAIT problem.
> 
> That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
> those make it come out of HLT.
> 
> A follow up change will put them into INIT, which protects at least against
> NMI and SMI.
> 
> Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
> Reported-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Does this need a stable tag?

Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Ashok Raj <ashok.raj@intel.com>




* Re: [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible
  2023-06-03 20:07 ` [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
@ 2023-06-03 20:57   ` Ashok Raj
  0 siblings, 0 replies; 29+ messages in thread
From: Ashok Raj @ 2023-06-03 20:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman, Ashok Raj

On Sat, Jun 03, 2023 at 10:07:04PM +0200, Thomas Gleixner wrote:
> Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
> resume execution due to NMI, SMI and MCE, which has the same issue as the
> MWAIT loop.
> 
> Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.
> 
> A broadcast MCE will take the machine down, but a broadcast MCE which makes
> HLT resume and execute overwritten text, pagetables or data will end up in
> a disaster too.
> 
> So chose the lesser of two evils and kick the secondary CPUs into INIT
> unless the system has installed special wakeup mechanisms which are not
> using INIT.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

For the whole series.

Reviewed-by: Ashok Raj <ashok.raj@intel.com>


* Re: [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-03 20:07 ` [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
  2023-06-03 20:54   ` Ashok Raj
@ 2023-06-04  3:19   ` Ashok Raj
  2023-06-05  7:41     ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Ashok Raj @ 2023-06-04  3:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman, Ashok Raj

On Sat, Jun 03, 2023 at 10:07:01PM +0200, Thomas Gleixner wrote:

[snip]

> Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
> Reported-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/smp.h |    2 +
>  arch/x86/kernel/smp.c      |   21 +++++++---------
>  arch/x86/kernel/smpboot.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 71 insertions(+), 11 deletions(-)
> 
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
>  int wbinvd_on_all_cpus(void);
>  void cond_wakeup_cpu0(void);
>  
> +void smp_kick_mwait_play_dead(void);
> +

This seems to be missing a stub for the #else / !CONFIG_SMP case:

 #else /* !CONFIG_SMP */
+#define smp_kick_mwait_play_dead(void) { }

Sorry I missed noticing this


* Re: [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-03 20:07 ` [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
@ 2023-06-04  4:02   ` Mika Penttilä
  2023-06-04 10:24     ` Ashok Raj
  2023-06-05  7:54     ` Thomas Gleixner
  2023-06-05  8:23   ` [patch v2 " Thomas Gleixner
  1 sibling, 2 replies; 29+ messages in thread
From: Mika Penttilä @ 2023-06-04  4:02 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

Hi,

On 3.6.2023 23.07, Thomas Gleixner wrote:
> Putting CPUs into INIT is a safer place during kexec() to park CPUs.
> 
> Split the INIT assert/deassert sequence out so it can be reused.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/kernel/smpboot.c |   51 +++++++++++++++++++---------------------------
>   1 file changed, 22 insertions(+), 29 deletions(-)
> 
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -853,47 +853,40 @@ wakeup_secondary_cpu_via_nmi(int apicid,
>   	return (send_status | accept_status);
>   }
>   
> -static int
> -wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> +static void send_init_sequence(int phys_apicid)
>   {
> -	unsigned long send_status = 0, accept_status = 0;
> -	int maxlvt, num_starts, j;
> -
> -	maxlvt = lapic_get_maxlvt();
> +	int maxlvt = lapic_get_maxlvt();
>   
> -	/*
> -	 * Be paranoid about clearing APIC errors.
> -	 */
> +	/* Be paranoid about clearing APIC errors. */
>   	if (APIC_INTEGRATED(boot_cpu_apic_version)) {
> -		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP.  */
> +		/* Due to the Pentium erratum 3AP.  */
> +		if (maxlvt > 3)
>   			apic_write(APIC_ESR, 0);
>   		apic_read(APIC_ESR);
>   	}
>   
> -	pr_debug("Asserting INIT\n");
> -
> -	/*
> -	 * Turn INIT on target chip
> -	 */
> -	/*
> -	 * Send IPI
> -	 */
> -	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
> -		       phys_apicid);
> -
> -	pr_debug("Waiting for send to finish...\n");
> -	send_status = safe_apic_wait_icr_idle();
> +	/* Assert INIT on the target CPU */
> +	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
> +	safe_apic_wait_icr_idle();
>   
>   	udelay(init_udelay);
>   
> -	pr_debug("Deasserting INIT\n");
> -
> -	/* Target chip */
> -	/* Send IPI */
> +	/* Deassert INIT on the target CPU */
>   	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
> +	safe_apic_wait_icr_idle();
> +}
> +
> +/*
> + * Wake up AP by INIT, INIT, STARTUP sequence.
> + */
> +static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> +{
> +	unsigned long send_status = 0, accept_status = 0;
> +	int maxlvt, num_starts, j;
> +
> +	preempt_disable();

This seems like an unbalanced preempt disable..

>   
> -	pr_debug("Waiting for send to finish...\n");
> -	send_status = safe_apic_wait_icr_idle();
> +	send_init_sequence(phys_apicid);
>   
>   	mb();
>   
> 


--Mika


* Re: [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-04  4:02   ` Mika Penttilä
@ 2023-06-04 10:24     ` Ashok Raj
  2023-06-05  7:54     ` Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Ashok Raj @ 2023-06-04 10:24 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen, Tony Luck,
	Arjan van de Veen, Peter Zijlstra, Eric Biederman, Ashok Raj

Hi

On Sun, Jun 04, 2023 at 07:02:33AM +0300, Mika Penttilä wrote:
> Hi,
> 
> On 3.6.2023 23.07, Thomas Gleixner wrote:
> > Putting CPUs into INIT is a safer place during kexec() to park CPUs.
> > 
> > Split the INIT assert/deassert sequence out so it can be reused.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >   arch/x86/kernel/smpboot.c |   51 +++++++++++++++++++---------------------------
> >   1 file changed, 22 insertions(+), 29 deletions(-)
> > 
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -853,47 +853,40 @@ wakeup_secondary_cpu_via_nmi(int apicid,
> >   	return (send_status | accept_status);
> >   }
> > -static int
> > -wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> > +static void send_init_sequence(int phys_apicid)
> >   {
> > -	unsigned long send_status = 0, accept_status = 0;
> > -	int maxlvt, num_starts, j;
> > -
> > -	maxlvt = lapic_get_maxlvt();
> > +	int maxlvt = lapic_get_maxlvt();
> > -	/*
> > -	 * Be paranoid about clearing APIC errors.
> > -	 */
> > +	/* Be paranoid about clearing APIC errors. */
> >   	if (APIC_INTEGRATED(boot_cpu_apic_version)) {
> > -		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP.  */
> > +		/* Due to the Pentium erratum 3AP.  */
> > +		if (maxlvt > 3)
> >   			apic_write(APIC_ESR, 0);
> >   		apic_read(APIC_ESR);
> >   	}
> > -	pr_debug("Asserting INIT\n");
> > -
> > -	/*
> > -	 * Turn INIT on target chip
> > -	 */
> > -	/*
> > -	 * Send IPI
> > -	 */
> > -	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
> > -		       phys_apicid);
> > -
> > -	pr_debug("Waiting for send to finish...\n");
> > -	send_status = safe_apic_wait_icr_idle();
> > +	/* Assert INIT on the target CPU */
> > +	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
> > +	safe_apic_wait_icr_idle();
> >   	udelay(init_udelay);
> > -	pr_debug("Deasserting INIT\n");
> > -
> > -	/* Target chip */
> > -	/* Send IPI */
> > +	/* Deassert INIT on the target CPU */
> >   	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
> > +	safe_apic_wait_icr_idle();
> > +}
> > +
> > +/*
> > + * Wake up AP by INIT, INIT, STARTUP sequence.
> > + */
> > +static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> > +{
> > +	unsigned long send_status = 0, accept_status = 0;
> > +	int maxlvt, num_starts, j;

Also, maxlvt is used uninitialized.

> > +
> > +	preempt_disable();
> 
> This seems like an unbalanced preempt disable..
> 
> > -	pr_debug("Waiting for send to finish...\n");
> > -	send_status = safe_apic_wait_icr_idle();
> > +	send_init_sequence(phys_apicid);
> >   	mb();
> > 
> 
> 
> --Mika
> 


* Re: [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  2023-06-04  3:19   ` Ashok Raj
@ 2023-06-05  7:41     ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-05  7:41 UTC (permalink / raw)
  To: Ashok Raj
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman, Ashok Raj

On Sat, Jun 03 2023 at 20:19, Ashok Raj wrote:
> On Sat, Jun 03, 2023 at 10:07:01PM +0200, Thomas Gleixner wrote:
>>  
>> +void smp_kick_mwait_play_dead(void);
>> +
>
> This seems like its missing prototype for #else for !CONFIG_SMP
>
>  #else /* !CONFIG_SMP */
> +#define smp_kick_mwait_play_dead(void) { }

Nope. The code which calls this is only compiled when SMP=y


* Re: [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-04  4:02   ` Mika Penttilä
  2023-06-04 10:24     ` Ashok Raj
@ 2023-06-05  7:54     ` Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-05  7:54 UTC (permalink / raw)
  To: Mika Penttilä, LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

On Sun, Jun 04 2023 at 07:02, Mika Penttilä wrote:
> On 3.6.2023 23.07, Thomas Gleixner wrote:
>> +/*
>> + * Wake up AP by INIT, INIT, STARTUP sequence.
>> + */
>> +static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
>> +{
>> +	unsigned long send_status = 0, accept_status = 0;
>> +	int maxlvt, num_starts, j;
>> +
>> +	preempt_disable();
>
> This seems like an unbalanced preempt disable..

Right. I screwed up a rebase completely....


* [patch v2 5/6] x86/smp: Split sending INIT IPI out into a helper function
  2023-06-03 20:07 ` [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
  2023-06-04  4:02   ` Mika Penttilä
@ 2023-06-05  8:23   ` Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-05  8:23 UTC (permalink / raw)
  To: LKML
  Cc: x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

Subject: x86/smp: Split sending INIT IPI out into a helper function
From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 02 Jun 2023 15:04:10 +0200

Putting CPUs into INIT is a safer place during kexec() to park CPUs.

Split the INIT assert/deassert sequence out so it can be reused.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Fix rebase screwup
---
 arch/x86/kernel/smpboot.c |   49 ++++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -853,47 +853,38 @@ wakeup_secondary_cpu_via_nmi(int apicid,
 	return (send_status | accept_status);
 }
 
-static int
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+static void send_init_sequence(int phys_apicid)
 {
-	unsigned long send_status = 0, accept_status = 0;
-	int maxlvt, num_starts, j;
-
-	maxlvt = lapic_get_maxlvt();
+	int maxlvt = lapic_get_maxlvt();
 
-	/*
-	 * Be paranoid about clearing APIC errors.
-	 */
+	/* Be paranoid about clearing APIC errors. */
 	if (APIC_INTEGRATED(boot_cpu_apic_version)) {
-		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP.  */
+		/* Due to the Pentium erratum 3AP.  */
+		if (maxlvt > 3)
 			apic_write(APIC_ESR, 0);
 		apic_read(APIC_ESR);
 	}
 
-	pr_debug("Asserting INIT\n");
-
-	/*
-	 * Turn INIT on target chip
-	 */
-	/*
-	 * Send IPI
-	 */
-	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
-		       phys_apicid);
-
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	/* Assert INIT on the target CPU */
+	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
 
 	udelay(init_udelay);
 
-	pr_debug("Deasserting INIT\n");
-
-	/* Target chip */
-	/* Send IPI */
+	/* Deassert INIT on the target CPU */
 	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+	safe_apic_wait_icr_idle();
+}
+
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+{
+	unsigned long send_status = 0, accept_status = 0;
+	int num_starts, j, maxlvt = lapic_get_maxlvt();
 
-	pr_debug("Waiting for send to finish...\n");
-	send_status = safe_apic_wait_icr_idle();
+	send_init_sequence(phys_apicid);
 
 	mb();
 


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
                   ` (5 preceding siblings ...)
  2023-06-03 20:07 ` [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
@ 2023-06-05 17:41 ` Sean Christopherson
  2023-06-05 22:41   ` Thomas Gleixner
  6 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2023-06-05 17:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> Hi!
> 
> Ashok observed triple faults when executing kexec() on a kernel which has
> 'nosmt' on the kernel commandline and HT enabled in the BIOS.
> 
> 'nosmt' brings up the HT siblings to the point where they initialized the
> CPU and then rolls the bringup back which parks them in mwait_play_dead().
> The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
> MCE will immediately shut down the machine.

...

> This is only half safe because HLT can resume execution due to NMI, SMI and
> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,

On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
*might* cause shutdown, but at least there's a chance it will work.  And presumably
modern CPUs do pend the #MC until GIF=1.
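The AMD idea sketched above would look roughly like the following kernel-context fragment. This is a non-runnable sketch of the concept only (CPL0, SVM-capable CPU assumed); it is not code from this patch series:

```c
/*
 * Park an AP with GIF=0: enable SVM in EFER, then CLGI so that
 * NMI, SMI and (implementation-dependent) #MC are held off until
 * GIF is set again or the CPU is RESET.
 */
u64 efer;

rdmsrl(MSR_EFER, efer);
wrmsrl(MSR_EFER, efer | EFER_SVME);
asm volatile("clgi");		/* GIF=0: interrupts-of-all-kinds held off */
for (;;)
	native_halt();
```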

> but there is at least one which prevents the NMI and SMI cause: INIT.
> 
>   3) If the system uses the regular INIT/STARTUP sequence to wake up
>      secondary CPUS, then "park" all CPUs including the "offline" ones
>      by sending them INIT IPIs.
> 
> The INIT IPI brings the CPU into a wait for wakeup state which is not
> affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
> problem comes back.
> 
> But that's not really any different from a CPU sitting in the HLT loop on
> the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
> the CPU tries to handle the MCE on overwritten text, pagetables etc.
> 
> So parking them via INIT is not completely solving the problem, but it
> takes at least NMI and SMI out of the picture.

Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
potentially cause problems too?

Why not carve out a page that's hidden across kexec() to hold whatever code+data
is needed to safely execute a HLT loop indefinitely?  E.g. doesn't the original
kernel provide the e820 tables for the post-kexec() kernel?  To avoid OOM after
many kexec(), reserving a page could be done iff the current kernel wasn't itself
kexec()'d.


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-05 17:41 ` [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Sean Christopherson
@ 2023-06-05 22:41   ` Thomas Gleixner
  2023-06-05 23:08     ` Sean Christopherson
  2023-06-07 16:21     ` Ashok Raj
  0 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-05 22:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> On Sat, Jun 03, 2023, Thomas Gleixner wrote:
>> This is only half safe because HLT can resume execution due to NMI, SMI and
>> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
>
> On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
> single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
> *might* cause shutdown, but at least there's a chance it will work.  And presumably
> modern CPUs do pend the #MC until GIF=1.

Abusing SVME for that is definitely in the realm of creative bonus
points, but not necessarily a general-purpose solution.

>> So parking them via INIT is not completely solving the problem, but it
>> takes at least NMI and SMI out of the picture.
>
> Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
> potentially cause problems too?

Not that I'm aware of. If so then this would be a hideous firmware bug
as firmware must be aware of CPUs which hang around in INIT independent
of this.

> Why not carve out a page that's hidden across kexec() to hold whatever code+data
> is needed to safely execute a HLT loop indefinitely?

See below.

> E.g. doesn't the original kernel provide the e820 tables for the
> post-kexec() kernel?

Only for crash kernels if I'm not missing something.

Making this work for regular kexec() including this:

> To avoid OOM after many kexec(), reserving a page could be done iff
> the current kernel wasn't itself kexec()'d.

would be possible and I thought about it, but that needs a complete new
design of "offline", "shutdown offline" and a non-trivial amount of
backwards compatibility magic because you can't assume that the kexec()
kernel version is greater or equal to the current one. kexec() is
supposed to work both ways, downgrading and upgrading. IOW, that ship
sailed long ago.

Thanks,

        tglx

* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-05 22:41   ` Thomas Gleixner
@ 2023-06-05 23:08     ` Sean Christopherson
  2023-06-06  7:20       ` Thomas Gleixner
  2023-06-07 16:21     ` Ashok Raj
  1 sibling, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2023-06-05 23:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

On Tue, Jun 06, 2023, Thomas Gleixner wrote:
> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> >> This is only half safe because HLT can resume execution due to NMI, SMI and
> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> >
> > On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
> > *might* cause shutdown, but at least there's a chance it will work.  And presumably
> > modern CPUs do pend the #MC until GIF=1.
> 
> Abusing SVME for that is definitely in the realm of creative bonus
> points, but not necessarily a general purpose solution.

Heh, my follow-up ideas for Intel are to abuse XuCode or SEAM ;-)

> >> So parking them via INIT is not completely solving the problem, but it
> >> takes at least NMI and SMI out of the picture.
> >
> > Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
> > potentially cause problems too?
> 
> Not that I'm aware of. If so then this would be a hideous firmware bug
> as firmware must be aware of CPUs which hang around in INIT independent
> of this.

I was thinking of the EDKII code in UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c, e.g.
SmmWaitForApArrival().  I've never dug deeply into how EDKII uses SMM, what its
timeouts are, etc., I just remember coming across that code when poking around
EDKII for other stuff.

> > Why not carve out a page that's hidden across kexec() to hold whatever code+data
> > is needed to safely execute a HLT loop indefinitely?
> 
> See below.
> 
> > E.g. doesn't the original kernel provide the e820 tables for the
> > post-kexec() kernel?
> 
> Only for crash kernels if I'm not missing something.

Ah, drat.

> Making this work for regular kexec() including this:
> 
> > To avoid OOM after many kexec(), reserving a page could be done iff
> > the current kernel wasn't itself kexec()'d.
> 
> would be possible and I thought about it, but that needs a complete new
> design of "offline", "shutdown offline" and a non-trivial amount of
> backwards compatibility magic because you can't assume that the kexec()
> kernel version is greater or equal to the current one. kexec() is
> supposed to work both ways, downgrading and upgrading. IOW, that ship
> sailed long ago.

Right, but doesn't gaining "full" protection require ruling out unenlightened
downgrades?  E.g. if someone downgrades to an old kernel, doesn't hide the "offline"
CPUs from the kexec() kernel, and boots the old kernel with 'nosmt' or whatever,
then that old kernel will do the naive MWAIT or unprotected HLT and it's hosed again.

If we're relying on the admin to hide the offline CPUs, could we usurp an existing
kernel param to hide a small chunk of memory instead?


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-05 23:08     ` Sean Christopherson
@ 2023-06-06  7:20       ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2023-06-06  7:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, x86, Ashok Raj, Dave Hansen, Tony Luck, Arjan van de Veen,
	Peter Zijlstra, Eric Biederman

On Mon, Jun 05 2023 at 16:08, Sean Christopherson wrote:
> On Tue, Jun 06, 2023, Thomas Gleixner wrote:
>> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
>> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
>> >> This is only half safe because HLT can resume execution due to NMI, SMI and
>> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
>> >
>> > On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
>> > single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
>> > *might* cause shutdown, but at least there's a chance it will work.  And presumably
>> > modern CPUs do pend the #MC until GIF=1.
>> 
>> Abusing SVME for that is definitely in the realm of creative bonus
>> points, but not necessarily a general purpose solution.
>
> Heh, my follow-up ideas for Intel are to abuse XuCode or SEAM ;-)

I feared that :)

>> >> So parking them via INIT is not completely solving the problem, but it
>> >> takes at least NMI and SMI out of the picture.
>> >
>> > Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
>> > potentially cause problems too?
>> 
>> Not that I'm aware of. If so then this would be a hideous firmware bug
>> as firmware must be aware of CPUs which hang around in INIT independent
>> of this.
>
> I was thinking of the EDKII code in UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c, e.g.
> SmmWaitForApArrival().  I've never dug deeply into how EDKII uses SMM, what its
> timeouts are, etc., I just remember coming across that code when poking around
> EDKII for other stuff.

There is a comment:

  Note the SMI Handlers must ALWAYS take into account the cases that not
  all APs are available in an SMI run.

Also, not all SMIs require global synchronization. But it's all an
impenetrable mess...

>> Making this work for regular kexec() including this:
>> 
>> > To avoid OOM after many kexec(), reserving a page could be done iff
>> > the current kernel wasn't itself kexec()'d.
>> 
>> would be possible and I thought about it, but that needs a complete new
>> design of "offline", "shutdown offline" and a non-trivial amount of
>> backwards compatibility magic because you can't assume that the kexec()
>> kernel version is greater or equal to the current one. kexec() is
>> supposed to work both ways, downgrading and upgrading. IOW, that ship
>> sailed long ago.
>
> Right, but doesn't gaining "full" protection require ruling out unenlightened
> downgrades?  E.g. if someone downgrades to an old kernel, doesn't hide the "offline"
> CPUs from the kexec() kernel, and boots the old kernel with -nosmt or whatever,
> then that old kernel will do the naive MWAIT or unprotected HLT and
> it's hosed again.

Of course.

> If we're relying on the admin to hide the offline CPUs, could we usurp
> an existing kernel param to hide a small chunk of memory instead?

The only "safe" place is below 1M, I think. Not sure whether we have
an existing command line option to "hide" a range there. Neither am I
sure that it would always be the same range.
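For what it's worth, there is a long-standing boot option that can mark an arbitrary physical range as reserved; whether it is actually suitable for a below-1M parking page is exactly the open question above. A sketch (the address is purely illustrative, and '$' typically needs escaping in the bootloader config):

```
# From Documentation/admin-guide/kernel-parameters.txt:
#   memmap=nn[KMG]$ss[KMG]  -- mark nn bytes at ss as reserved
# e.g. reserve 4K at 0x98000 (GRUB needs the '$' escaped):
memmap=4K\$0x98000
```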

More questions than answers :)

Thanks

        tglx

* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-05 22:41   ` Thomas Gleixner
  2023-06-05 23:08     ` Sean Christopherson
@ 2023-06-07 16:21     ` Ashok Raj
  2023-06-07 17:33       ` Sean Christopherson
  1 sibling, 1 reply; 29+ messages in thread
From: Ashok Raj @ 2023-06-07 16:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sean Christopherson, LKML, x86, Ashok Raj, Dave Hansen,
	Tony Luck, Arjan van de Veen, Peter Zijlstra, Eric Biederman,
	Ashok Raj

On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> >> This is only half safe because HLT can resume execution due to NMI, SMI and
> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> >
> > On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
> > *might* cause shutdown, but at least there's a chance it will work.  And presumably
> > modern CPUs do pend the #MC until GIF=1.
> 
> Abusing SVME for that is definitely in the realm of creative bonus
> points, but not necessarily a general purpose solution.
> 
> >> So parking them via INIT is not completely solving the problem, but it
> >> takes at least NMI and SMI out of the picture.
> >
> > Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
> > potentially cause problems too?
> 
> Not that I'm aware of. If so then this would be a hideous firmware bug
> as firmware must be aware of CPUs which hang around in INIT independent
> of this.

SMM does do the rendezvous of all CPUs, but it also has a way to detect
the blocked ones, in wait-for-SIPI (WFS), via a package-scoped ubox
register, so it knows to skip those. I can find this in internal
sources, but it isn't available in the edk2 open reference code. It
happens to be documented only in the BWG, which isn't freely available.

I believe it's behind the GetSmmDelayedBlockedDisabledCount() ->
	SmmCpuFeaturesGetSmmRegister() path.


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-07 16:21     ` Ashok Raj
@ 2023-06-07 17:33       ` Sean Christopherson
  2023-06-07 22:19         ` Ashok Raj
  2023-06-09  8:40         ` Paolo Bonzini
  0 siblings, 2 replies; 29+ messages in thread
From: Sean Christopherson @ 2023-06-07 17:33 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen, Tony Luck,
	Arjan van de Veen, Peter Zijlstra, Eric Biederman, Ashok Raj

On Wed, Jun 07, 2023, Ashok Raj wrote:
> On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> > >> This is only half safe because HLT can resume execution due to NMI, SMI and
> > >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> > >
> > > On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > > single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
> > > *might* cause shutdown, but at least there's a chance it will work.  And presumably
> > > modern CPUs do pend the #MC until GIF=1.
> > 
> > Abusing SVME for that is definitely in the realm of creative bonus
> > points, but not necessarily a general purpose solution.
> > 
> > >> So parking them via INIT is not completely solving the problem, but it
> > >> takes at least NMI and SMI out of the picture.
> > >
> > > Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
> > > potentially cause problems too?
> > 
> > Not that I'm aware of. If so then this would be a hideous firmware bug
> > as firmware must be aware of CPUs which hang around in INIT independent
> > of this.
> 
> SMM does do the rendezvous of all CPUs, but also has a way to detect the
> blocked ones, in WFS via some package scoped ubox register. So it knows to
> skip those. I can find this in internal sources, but they aren't available
> in the edk2 open reference code. They happen to be documented only in the
> BWG, which isn't available freely.

Ah, so putting CPUs into WFS shouldn't result in odd delays.  At least not on
bare metal.  Hmm, and AFAIK the primary use case for SMM in VMs is for secure
boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.

Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
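One possible shape for such a smoke test, for whoever volunteers: an OVMF firmware built with SMM support, secure boot enforced via the q35 machine type. The firmware file names and paths below are distro-specific assumptions (Fedora's edk2-ovmf naming), not requirements:

```shell
qemu-system-x86_64 \
    -machine q35,smm=on,accel=kvm \
    -global ICH9-LPC.disable_s3=1 \
    -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.secboot.fd \
    -drive if=pflash,format=raw,file=OVMF_VARS.secboot.fd \
    -smp 8 -m 4G -cpu host \
    -drive file=guest.img,format=qcow2
# Then, inside the guest: boot with 'nosmt' and run kexec to exercise
# the INIT parking path while SMIs are being delivered by the firmware.
```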


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-07 17:33       ` Sean Christopherson
@ 2023-06-07 22:19         ` Ashok Raj
  2023-06-08  3:46           ` Sean Christopherson
  2023-06-09  8:40         ` Paolo Bonzini
  1 sibling, 1 reply; 29+ messages in thread
From: Ashok Raj @ 2023-06-07 22:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen, Tony Luck,
	Arjan van de Veen, Peter Zijlstra, Eric Biederman, Ashok Raj

On Wed, Jun 07, 2023 at 10:33:35AM -0700, Sean Christopherson wrote:
> On Wed, Jun 07, 2023, Ashok Raj wrote:
> > On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > > On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > > > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> > > >> This is only half safe because HLT can resume execution due to NMI, SMI and
> > > >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> > > >
> > > > On Intel.  On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > > > single-step #DB (lol) and RESET.  #MC handling is implementation-dependent and
> > > > *might* cause shutdown, but at least there's a chance it will work.  And presumably
> > > > modern CPUs do pend the #MC until GIF=1.
> > > 
> > > Abusing SVME for that is definitely in the realm of creative bonus
> > > points, but not necessarily a general purpose solution.
> > > 
> > > >> So parking them via INIT is not completely solving the problem, but it
> > > >> takes at least NMI and SMI out of the picture.
> > > >
> > > > Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
> > > > potentially cause problems too?
> > > 
> > > Not that I'm aware of. If so then this would be a hideous firmware bug
> > > as firmware must be aware of CPUs which hang around in INIT independent
> > > of this.
> > 
> > SMM does do the rendezvous of all CPUs, but also has a way to detect the
> > blocked ones, in WFS via some package scoped ubox register. So it knows to
> > skip those. I can find this in internal sources, but they aren't available
> > in the edk2 open reference code. They happen to be documented only in the
> > BWG, which isn't available freely.
> 
> Ah, so putting CPUs into WFS shouldn't result in odd delays.  At least not on
> bare metal.  Hmm, and AFAIK the primary use case for SMM in VMs is for secure

Never knew SMM had any role in VMs... I thought SMM was always native.

Who owns this SMM for VMs? Is it in the virtual BIOS?

> boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.
> 
> Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.

I always seem to turn off Secure Boot when installing Ubuntu :-).. I'll
try to find someone who might know, especially about doing SMM in a VM.

Can you tell what needs to be validated in the guest?  Would doing kexec
inside the guest with the new patch set be sufficient? 

Or do you mean: in the guest, do a kexec and then secure-boot the new kernel?

If there is a specific test you want done, let me know.
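The kexec half of such a test is straightforward with kexec-tools; the kernel and initrd paths below are placeholders:

```shell
# Stage the new kernel, reusing the running kernel's command line
# (which should include 'nosmt' to exercise the parked-sibling path):
kexec -l /boot/vmlinuz-new --initrd=/boot/initrd.img-new --reuse-cmdline

# Jump into it, either via systemd (orderly shutdown first):
systemctl kexec
# ...or directly:
kexec -e
```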


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-07 22:19         ` Ashok Raj
@ 2023-06-08  3:46           ` Sean Christopherson
  2023-06-08  4:03             ` Ashok Raj
  2023-06-16 15:07             ` Ashok Raj
  0 siblings, 2 replies; 29+ messages in thread
From: Sean Christopherson @ 2023-06-08  3:46 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen, Tony Luck,
	Arjan van de Veen, Peter Zijlstra, Eric Biederman, Ashok Raj

On Wed, Jun 07, 2023, Ashok Raj wrote:
> On Wed, Jun 07, 2023 at 10:33:35AM -0700, Sean Christopherson wrote:
> > On Wed, Jun 07, 2023, Ashok Raj wrote:
> > > On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > > > >> So parking them via INIT is not completely solving the problem, but it
> > > > >> takes at least NMI and SMI out of the picture.
> > > > >
> > > > > Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs indefinitely
> > > > > potentially cause problems too?
> > > > 
> > > > Not that I'm aware of. If so then this would be a hideous firmware bug
> > > > as firmware must be aware of CPUs which hang around in INIT independent
> > > > of this.
> > > 
> > > SMM does do the rendezvous of all CPUs, but also has a way to detect the
> > > blocked ones, in WFS via some package scoped ubox register. So it knows to
> > > skip those. I can find this in internal sources, but they aren't available
> > > in the edk2 open reference code. They happen to be documented only in the
> > > BWG, which isn't available freely.
> > 
> > Ah, so putting CPUs into WFS shouldn't result in odd delays.  At least not on
> > bare metal.  Hmm, and AFAIK the primary use case for SMM in VMs is for secure
> 
> Never knew SMM had any role in VM's.. I thought SMM was always native. 
> 
> Who owns this SMM for VM's.. from the VirtualBIOS?

Yes?

> > boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.
> > 
> > Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
> 
> I always seem to turn off secureboot installing Ubuntu :-)

Yeah, I don't utilize it in any of my VMs either.

> I'll try to find someone who might know especially doing SMM In VM. 
> 
> Can you tell what needs to be validated in the guest?  Would doing kexec
> inside the guest with the new patch set be sufficient? 
> 
> Or you mean in guest, do a kexec and launch secure boot of new kernel?

Yes?  I don't actually have hands-on experience with such a setup; I'm familiar
with it purely through bug reports, e.g. this one

https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com

> If there is a specific test you want done, let me know.

Smoke testing is all I was thinking.  I wouldn't put too much effort into trying
to make sure this all works.  Like I said earlier, nice to have, but certainly not
necessary.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-08  3:46           ` Sean Christopherson
@ 2023-06-08  4:03             ` Ashok Raj
  2023-06-16 15:07             ` Ashok Raj
  1 sibling, 0 replies; 29+ messages in thread
From: Ashok Raj @ 2023-06-08  4:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ashok Raj, Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen,
	Tony Luck, Arjan van de Veen, Peter Zijlstra, Eric Biederman,
	Ashok Raj

On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
> 
> Yes?  I don't actually have hands on experience with such a setup, I'm familiar
> with it purely through bug reports, e.g. this one
> 
> https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
> 
> > If there is a specific test you want done, let me know.
> 
> Smoke testing is all I was thinking.  I wouldn't put too much effort into trying
> to make sure this all works.  Like I said earlier, nice to have, but certainly not
> necessary.

Thanks for the link, Sean. I'll do some follow-up.

-- 
Cheers,
Ashok



* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-07 17:33       ` Sean Christopherson
  2023-06-07 22:19         ` Ashok Raj
@ 2023-06-09  8:40         ` Paolo Bonzini
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2023-06-09  8:40 UTC (permalink / raw)
  To: Sean Christopherson, Ashok Raj
  Cc: Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen, Tony Luck,
	Arjan van de Veen, Peter Zijlstra, Eric Biederman, Ashok Raj

On 6/7/23 19:33, Sean Christopherson wrote:
>>>> Don't most SMM handlers rendezvous all CPUs?  I.e. won't blocking SMIs
>>>> indefinitely potentially cause problems too?
>>>
>>> Not that I'm aware of. If so then this would be a hideous firmware bug
>>> as firmware must be aware of CPUs which hang around in INIT independent
>>> of this.
>>
>> SMM does do the rendezvous of all CPUs, but also has a way to detect the
>> blocked ones, in WFS via some package scoped ubox register. So it knows to
>> skip those. I can find this in internal sources, but they aren't available
>> in the edk2 open reference code. They happen to be documented only in the
>> BWG, which isn't available freely.
> 
> Ah, so putting CPUs into WFS shouldn't result in odd delays.  At least not on
> bare metal.  Hmm, and AFAIK the primary use case for SMM in VMs is for secure
> boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.

VMs do not have things like periodic or watchdog SMIs, they only enter 
SMM in response to IPIs or writes to 0xB1.

The writes to 0xB1 in turn should only happen from UEFI runtime services 
related to the UEFI variable store.  Another possibility could be ACPI 
bytecode from either DSDT or APEI; not implemented yet and very unlikely 
to happen in the future, but not impossible either.

Either way they should not happen before the kexec-ed kernel has brought 
up all CPUs.

Paolo

> Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.



* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-08  3:46           ` Sean Christopherson
  2023-06-08  4:03             ` Ashok Raj
@ 2023-06-16 15:07             ` Ashok Raj
  2023-06-16 19:00               ` Sean Christopherson
  1 sibling, 1 reply; 29+ messages in thread
From: Ashok Raj @ 2023-06-16 15:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ashok Raj, Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen,
	Tony Luck, Arjan van de Veen, Peter Zijlstra, Eric Biederman,
	Ashok Raj, Dhanraj, Vijay, Paolo Bonzini, Laszlo Ersek,
	Gerd Hoffmann, Andrea Bolognani, Daniel P. Berrangé

On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
> 
> https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
> 
> > If there is a specific test you want done, let me know.
> 
> Smoke testing is all I was thinking.  I wouldn't put too much effort into trying
> to make sure this all works.  Like I said earlier, nice to have, but certainly not
> necessary.

+ Vijay who was helping with testing this inside the VM.
+ Paolo, Laszlo 

I haven't found the exact method to test with secure boot/trusted boot yet.
But here is what we were able to test thus far.

Vijay was able to get OVMF recompiled with SMM included.

Thanks to Laszlo for pointing me in the right direction. And Paolo for
helping with some basic questions.

https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt

Surprisingly SMM emulation is sadly damn good :-) 

Recipe is to generate SMI by writing to port 0xb2. 

- On native, this does generate a broadcast SMI, the SMI_COUNT MSR 0x34
  goes up by 1 on all logical CPUs.
- Turn off SMT by #echo off > /sys/devices/system/cpu/smt/control
- Do another port 0xb2, we don't see any hangs
- Bring up SMT by echo on > control, and we can see even the offline CPUs
  got the SMI as indicated by MSR 0x34. (Which is as expected)

On guest, the only difference was when we turn on HT again, waking the CPUs
from INIT, SMI_COUNT has zeroed as opposed to the native. (Which is
perfectly fine) All I was looking for was "no hang". And a normal kexec
with newly updated code works well inside a guest.
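(For reference, the pass criterion described above amounts to a simple per-CPU bookkeeping check. The helper below is a hypothetical sketch, not code from the patch set; on real hardware the counts would come from MSR 0x34, e.g. `rdmsr -a 0x34` with msr-tools, read before and after poking port 0xb2.)

```python
def all_cpus_saw_smi(before, after):
    """True if every logical CPU's SMI count (MSR 0x34, SMI_COUNT)
    incremented, i.e. the broadcast SMI reached all CPUs, including
    the ones parked "offline".  Dicts map cpu number -> count."""
    return all(after[cpu] > before[cpu] for cpu in before)

# Illustrative values only; on hardware these would be sampled
# per-CPU before and after writing to port 0xb2.
before = {0: 5, 1: 5, 2: 5, 3: 5}
after = {0: 6, 1: 6, 2: 6, 3: 6}
print(all_cpus_saw_smi(before, after))  # True
```

This is only the bookkeeping half; actually raising the SMI and reading the MSR require root and real hardware (or an SMM-enabled OVMF guest) as described above.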

Would this qualify for the smoke test pass? I'll continue to look for a
secure boot install if this doesn't close it, just haven't landed at the
right spot yet.

-- 
Cheers,
Ashok


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-16 15:07             ` Ashok Raj
@ 2023-06-16 19:00               ` Sean Christopherson
  2023-06-16 19:03                 ` Ashok Raj
  0 siblings, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2023-06-16 19:00 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Ashok Raj, Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen,
	Tony Luck, Arjan van de Veen, Peter Zijlstra, Eric Biederman,
	Vijay Dhanraj, Paolo Bonzini, Laszlo Ersek, Gerd Hoffmann,
	Andrea Bolognani, Daniel P. Berrangé

On Fri, Jun 16, 2023, Ashok Raj wrote:
> On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
> > 
> > https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
> > 
> > > If there is a specific test you want done, let me know.
> > 
> > Smoke testing is all I was thinking.  I wouldn't put too much effort into trying
> > to make sure this all works.  Like I said earlier, nice to have, but certainly not
> > necessary.
> 
> + Vijay who was helping with testing this inside the VM.
> + Paolo, Laszlo 
> 
> I haven't found the exact method to test with secure boot/trusted boot yet.
> But here is what we were able to test thus far.
> 
> Vijay was able to get OVMF recompiled with SMM included.
> 
> Thanks to Laszlo for pointing me in the right direction. And Paolo for
> helping with some basic questions.
> 
> https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt
> 
> Surprisingly SMM emulation is sadly damn good :-) 
> 
> Recipe is to generate SMI by writing to port 0xb2. 
> 
> - On native, this does generate a broadcast SMI, the SMI_COUNT MSR 0x34
>   goes up by 1 on all logical CPUs.
> - Turn off SMT by #echo off > /sys/devices/system/cpu/smt/control
> - Do another port 0xb2, we don't see any hangs
> - Bring up SMT by echo on > control, and we can see even the offline CPUs
>   got the SMI as indicated by MSR 0x34. (Which is as expected)
> 
> On guest, the only difference was when we turn on HT again, waking the CPUs
> from INIT, SMI_COUNT has zeroed as opposed to the native. (Which is
> perfectly fine) All I was looking for was "no hang". And a normal kexec
> with newly updated code works well inside a guest.
> 
> Would this qualify for the smoke test pass? I'll continue to look for a
> secure boot install if this doesn't close it, just haven't landed at the
> right spot yet.

Good enough for me, thanks much!


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-16 19:00               ` Sean Christopherson
@ 2023-06-16 19:03                 ` Ashok Raj
  2023-06-16 19:08                   ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Ashok Raj @ 2023-06-16 19:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ashok Raj, Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen,
	Tony Luck, Arjan van de Veen, Peter Zijlstra, Eric Biederman,
	Vijay Dhanraj, Paolo Bonzini, Laszlo Ersek, Gerd Hoffmann,
	Andrea Bolognani, Daniel P. Berrangé,
	Ashok Raj

On Fri, Jun 16, 2023 at 12:00:13PM -0700, Sean Christopherson wrote:
> > Would this qualify for the smoke test pass? I'll continue to look for a
> > secure boot install if this doesn't close it, just haven't landed at the
> > right spot yet.
> 
> Good enough for me, thanks much!

Thanks a ton, Sean. If anything, you have now scared me that there is life for
SMM after all, even in a guest :-)... Completely took me by surprise!

Cheers,
Ashok


* Re: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
  2023-06-16 19:03                 ` Ashok Raj
@ 2023-06-16 19:08                   ` Sean Christopherson
  0 siblings, 0 replies; 29+ messages in thread
From: Sean Christopherson @ 2023-06-16 19:08 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Ashok Raj, Thomas Gleixner, LKML, x86, Ashok Raj, Dave Hansen,
	Tony Luck, Arjan van de Veen, Peter Zijlstra, Eric Biederman,
	Vijay Dhanraj, Paolo Bonzini, Laszlo Ersek, Gerd Hoffmann,
	Andrea Bolognani, Daniel P. Berrangé

On Fri, Jun 16, 2023, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 12:00:13PM -0700, Sean Christopherson wrote:
> > > Would this qualify for the smoke test pass? I'll continue to look for a
> > > secure boot install if this doesn't close it, just haven't landed at the
> > > right spot yet.
> > 
> > Good enough for me, thanks much!
> 
> Thanks a ton, Sean. If anything, you have now scared me that there is life for
> SMM after all, even in a guest :-)

LOL, you and me both.  Thankfully GCE's implementation of Secure Boot for VMs
doesn't utilize SMM, so to a large extent I can close my eyes and plug my ears :-)


end of thread, other threads:[~2023-06-16 19:08 UTC | newest]

Thread overview: 29+ messages
2023-06-03 20:06 [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Thomas Gleixner
2023-06-03 20:06 ` [patch 1/6] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
2023-06-03 20:06 ` [patch 2/6] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
2023-06-03 20:07 ` [patch 3/6] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
2023-06-03 20:07 ` [patch 4/6] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
2023-06-03 20:54   ` Ashok Raj
2023-06-04  3:19   ` Ashok Raj
2023-06-05  7:41     ` Thomas Gleixner
2023-06-03 20:07 ` [patch 5/6] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
2023-06-04  4:02   ` Mika Penttilä
2023-06-04 10:24     ` Ashok Raj
2023-06-05  7:54     ` Thomas Gleixner
2023-06-05  8:23   ` [patch v2 " Thomas Gleixner
2023-06-03 20:07 ` [patch 6/6] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
2023-06-03 20:57   ` Ashok Raj
2023-06-05 17:41 ` [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles Sean Christopherson
2023-06-05 22:41   ` Thomas Gleixner
2023-06-05 23:08     ` Sean Christopherson
2023-06-06  7:20       ` Thomas Gleixner
2023-06-07 16:21     ` Ashok Raj
2023-06-07 17:33       ` Sean Christopherson
2023-06-07 22:19         ` Ashok Raj
2023-06-08  3:46           ` Sean Christopherson
2023-06-08  4:03             ` Ashok Raj
2023-06-16 15:07             ` Ashok Raj
2023-06-16 19:00               ` Sean Christopherson
2023-06-16 19:03                 ` Ashok Raj
2023-06-16 19:08                   ` Sean Christopherson
2023-06-09  8:40         ` Paolo Bonzini
