linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv4 00/14] x86/tdx: Add kexec support
@ 2023-12-05  0:44 Kirill A. Shutemov
  2023-12-05  0:44 ` [PATCHv4 01/14] x86/acpi: Extract ACPI MADT wakeup code into a separate file Kirill A. Shutemov
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The patchset adds bits and pieces to get kexec (and crashkernel) work on
TDX guest.

The last patch implements CPU offlining according to the approved ACPI
spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
kernel. It requires BIOS-side enabling. If it missing we fallback to booting
2nd kernel with single CPU.

Please review. I would be glad for any feedback.

v4:
  - Fix build for !KEXEC_CORE;
  - Cleaner ATLERNATIVE use;
  - Update commit messages and comments;
  - Add Reviewed-bys;
v3:
  - Rework acpi_mp_crash_stop_other_cpus() to avoid invoking hotplug state
    machine;
  - Free page tables if reset vector setup failed;
  - Change asm_acpi_mp_play_dead() to pass reset vector and PGD as arguments;
  - Mark acpi_mp_* variables as static and __ro_after_init;
  - Use u32 for apicid;
  - Disable CPU offlining if reset vector setup failed;
  - Rename madt.S -> madt_playdead.S;
  - Mark tdx_kexec_unshare_mem() as static;
  - Rebase onto up-to-date tip/master;
  - Whitespace fixes;
  - Reorder patches;
  - Add Reviewed-bys;
  - Update comments and commit messages;
v2:
  - Rework how unsharing hook ups into kexec codepath;
  - Rework kvmclock_disable() fix based on Sean's;
  - s/cpu_hotplug_not_supported()/cpu_hotplug_disable_offlining()/;
  - use play_dead_common() to implement acpi_mp_play_dead();
  - cond_resched() in tdx_shared_memory_show();
  - s/target kernel/second kernel/;
  - Update commit messages and comments;

[1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher

Kirill A. Shutemov (14):
  x86/acpi: Extract ACPI MADT wakeup code into a separate file
  x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
  cpu/hotplug: Add support for declaring CPU offlining not supported
  cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
  x86/mm: Return correct level from lookup_address() if pte is none
  x86/tdx: Account shared memory
  x86/tdx: Convert shared memory back to private on kexec
  x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
  x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure
  x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

 arch/x86/Kconfig                     |   7 +
 arch/x86/coco/core.c                 |   1 -
 arch/x86/coco/tdx/kexec.c            |   0
 arch/x86/coco/tdx/tdx.c              | 205 ++++++++++++++-
 arch/x86/hyperv/ivm.c                |   9 +-
 arch/x86/include/asm/acpi.h          |   5 +
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/include/asm/smp.h           |   1 +
 arch/x86/include/asm/x86_init.h      |   5 +-
 arch/x86/kernel/acpi/Makefile        |  11 +-
 arch/x86/kernel/acpi/boot.c          |  86 +------
 arch/x86/kernel/acpi/madt_playdead.S |  21 ++
 arch/x86/kernel/acpi/madt_wakeup.c   | 363 +++++++++++++++++++++++++++
 arch/x86/kernel/crash.c              |   4 +
 arch/x86/kernel/e820.c               |   9 +-
 arch/x86/kernel/kvmclock.c           |  12 +-
 arch/x86/kernel/reboot.c             |  22 +-
 arch/x86/kernel/relocate_kernel_64.S |   3 +
 arch/x86/kernel/x86_init.c           |   4 +-
 arch/x86/mm/mem_encrypt_amd.c        |   8 +-
 arch/x86/mm/pat/set_memory.c         |  17 +-
 include/acpi/actbl2.h                |  19 +-
 include/linux/cc_platform.h          |  10 -
 include/linux/cpu.h                  |   2 +
 kernel/cpu.c                         |  12 +-
 25 files changed, 689 insertions(+), 148 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c
 create mode 100644 arch/x86/kernel/acpi/madt_playdead.S
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

-- 
2.41.0


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCHv4 01/14] x86/acpi: Extract ACPI MADT wakeup code into a separate file
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
@ 2023-12-05  0:44 ` Kirill A. Shutemov
  2023-12-05  0:44 ` [PATCHv4 02/14] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init Kirill A. Shutemov
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.

Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig                   |  7 +++
 arch/x86/include/asm/acpi.h        |  5 ++
 arch/x86/kernel/acpi/Makefile      | 11 ++--
 arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
 arch/x86/kernel/acpi/madt_wakeup.c | 81 ++++++++++++++++++++++++++++
 5 files changed, 100 insertions(+), 90 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c456c9b1fc7c..969dca563077 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1104,6 +1104,13 @@ config X86_LOCAL_APIC
 	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
 	select IRQ_DOMAIN_HIERARCHY
 
+config X86_ACPI_MADT_WAKEUP
+	def_bool y
+	depends on X86_64
+	depends on ACPI
+	depends on SMP
+	depends on X86_LOCAL_APIC
+
 config X86_IO_APIC
 	def_bool y
 	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index f896eed4516c..2625b915ae7f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -76,6 +76,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
 
 #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
 
+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end);
+
 /*
  * Check if the CPU can handle C2 and deeper
  */
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..8c7329c88a75 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,11 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_ACPI)		+= boot.o
-obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
-obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
+obj-$(CONFIG_ACPI)			+= boot.o
+obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI)			+= apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
-obj-y				+= cstate.o
+obj-y					+= cstate.o
 endif
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 1a0dd80d81ac..171d86fe71ef 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -67,13 +67,6 @@ static bool has_lapic_cpus __initdata;
 static bool acpi_support_online_capable;
 #endif
 
-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -369,60 +362,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
-{
-	/*
-	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
-	 *
-	 * Wakeup of secondary CPUs is fully serialized in the core code.
-	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
-	 */
-	if (!acpi_mp_wake_mailbox) {
-		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
-						sizeof(*acpi_mp_wake_mailbox),
-						MEMREMAP_WB);
-	}
-
-	/*
-	 * Mailbox memory is shared between the firmware and OS. Firmware will
-	 * listen on mailbox command address, and once it receives the wakeup
-	 * command, the CPU associated with the given apicid will be booted.
-	 *
-	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
-	 * firmware before the wakeup command is visible.  smp_store_release()
-	 * ensures ordering and visibility.
-	 */
-	acpi_mp_wake_mailbox->apic_id	    = apicid;
-	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
-	smp_store_release(&acpi_mp_wake_mailbox->command,
-			  ACPI_MP_WAKE_COMMAND_WAKEUP);
-
-	/*
-	 * Wait for the CPU to wake up.
-	 *
-	 * The CPU being woken up is essentially in a spin loop waiting to be
-	 * woken up. It should not take long for it wake up and acknowledge by
-	 * zeroing out ->command.
-	 *
-	 * ACPI specification doesn't provide any guidance on how long kernel
-	 * has to wait for a wake up acknowledgement. It also doesn't provide
-	 * a way to cancel a wake up request if it takes too long.
-	 *
-	 * In TDX environment, the VMM has control over how long it takes to
-	 * wake up secondary. It can postpone scheduling secondary vCPU
-	 * indefinitely. Giving up on wake up request and reporting error opens
-	 * possible attack vector for VMM: it can wake up a secondary CPU when
-	 * kernel doesn't expect it. Wait until positive result of the wake up
-	 * request.
-	 */
-	while (READ_ONCE(acpi_mp_wake_mailbox->command))
-		cpu_relax();
-
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1159,29 +1098,6 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
-				     const unsigned long end)
-{
-	struct acpi_madt_multiproc_wakeup *mp_wake;
-
-	if (!IS_ENABLED(CONFIG_SMP))
-		return -ENODEV;
-
-	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
-	if (BAD_MADT_ENTRY(mp_wake, end))
-		return -EINVAL;
-
-	acpi_table_print_madt_entry(&header->common);
-
-	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
-
-	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
-
-	return 0;
-}
-#endif				/* CONFIG_X86_64 */
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1378,7 +1294,7 @@ static void __init acpi_process_madt(void)
 				smp_found_config = 1;
 			}
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_ACPI_MADT_WAKEUP
 			/*
 			 * Parse MADT MP Wake entry.
 			 */
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..f4be492b7e4c
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,81 @@
+#include <linux/acpi.h>
+#include <linux/io.h>
+#include <asm/apic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
+{
+	/*
+	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+	 *
+	 * Wakeup of secondary CPUs is fully serialized in the core code.
+	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+	 */
+	if (!acpi_mp_wake_mailbox) {
+		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+						sizeof(*acpi_mp_wake_mailbox),
+						MEMREMAP_WB);
+	}
+
+	/*
+	 * Mailbox memory is shared between the firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, the CPU associated with the given apicid will be booted.
+	 *
+	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
+	 * firmware before the wakeup command is visible.  smp_store_release()
+	 * ensures ordering and visibility.
+	 */
+	acpi_mp_wake_mailbox->apic_id	    = apicid;
+	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * Wait for the CPU to wake up.
+	 *
+	 * The CPU being woken up is essentially in a spin loop waiting to be
+	 * woken up. It should not take long for it wake up and acknowledge by
+	 * zeroing out ->command.
+	 *
+	 * ACPI specification doesn't provide any guidance on how long kernel
+	 * has to wait for a wake up acknowledgement. It also doesn't provide
+	 * a way to cancel a wake up request if it takes too long.
+	 *
+	 * In TDX environment, the VMM has control over how long it takes to
+	 * wake up secondary. It can postpone scheduling secondary vCPU
+	 * indefinitely. Giving up on wake up request and reporting error opens
+	 * possible attack vector for VMM: it can wake up a secondary CPU when
+	 * kernel doesn't expect it. Wait until positive result of the wake up
+	 * request.
+	 */
+	while (READ_ONCE(acpi_mp_wake_mailbox->command))
+		cpu_relax();
+
+	return 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+	return 0;
+}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 02/14] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
  2023-12-05  0:44 ` [PATCHv4 01/14] x86/acpi: Extract ACPI MADT wakeup code into a separate file Kirill A. Shutemov
@ 2023-12-05  0:44 ` Kirill A. Shutemov
  2023-12-05  0:44 ` [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported Kirill A. Shutemov
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox initialized once
during ACPI MADT init and never changed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kernel/acpi/madt_wakeup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index f4be492b7e4c..38ffd4524e44 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -5,10 +5,10 @@
 #include <asm/processor.h>
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
+static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;
 
 /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_after_init;
 
 static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
 {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
  2023-12-05  0:44 ` [PATCHv4 01/14] x86/acpi: Extract ACPI MADT wakeup code into a separate file Kirill A. Shutemov
  2023-12-05  0:44 ` [PATCHv4 02/14] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init Kirill A. Shutemov
@ 2023-12-05  0:44 ` Kirill A. Shutemov
  2023-12-15 19:42   ` Thomas Gleixner
  2023-12-05  0:45 ` [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup Kirill A. Shutemov
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The ACPI MADT mailbox wakeup method doesn't allow to offline CPU after
it got woke up.

Currently offlining hotplug is prevented based on the confidential
computing attribute which is set for Intel TDX. But TDX is not
the only possible user of the wake up method. The MADT wakeup can be
implemented outside of a confidential computing environment. Offline
support is a property of the wakeup method, not the CoCo implementation.

Introduce cpu_hotplug_disable_offlining() that can be called to indicate
that CPU offlining should be disabled.

This function is going to replace CC_ATTR_HOTPLUG_DISABLED for ACPI
MADT wakeup method.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/cpu.h |  2 ++
 kernel/cpu.c        | 13 ++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index fc8094419084..46f2e34a0c5e 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -134,6 +134,7 @@ extern void cpus_read_lock(void);
 extern void cpus_read_unlock(void);
 extern int  cpus_read_trylock(void);
 extern void lockdep_assert_cpus_held(void);
+extern void cpu_hotplug_disable_offlining(void);
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 void clear_tasks_mm_cpumask(int cpu);
@@ -149,6 +150,7 @@ static inline void cpus_read_lock(void) { }
 static inline void cpus_read_unlock(void) { }
 static inline int  cpus_read_trylock(void) { return true; }
 static inline void lockdep_assert_cpus_held(void) { }
+static inline void cpu_hotplug_disable_offlining(void) { }
 static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a86972a91991..af8034ccda8e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -484,6 +484,8 @@ static int cpu_hotplug_disabled;
 
 DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
 
+static bool cpu_hotplug_offline_disabled;
+
 void cpus_read_lock(void)
 {
 	percpu_down_read(&cpu_hotplug_lock);
@@ -543,6 +545,14 @@ static void lockdep_release_cpus_lock(void)
 	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
 }
 
+/* Declare CPU offlining not supported */
+void cpu_hotplug_disable_offlining(void)
+{
+	cpu_maps_update_begin();
+	cpu_hotplug_offline_disabled = true;
+	cpu_maps_update_done();
+}
+
 /*
  * Wait for currently running CPU hotplug operations to complete (if any) and
  * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
@@ -1522,7 +1532,8 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
+	    cpu_hotplug_offline_disabled)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2023-12-05  0:44 ` [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-15 19:43   ` Thomas Gleixner
  2023-12-05  0:45 ` [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled Kirill A. Shutemov
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

ACPI MADT doesn't allow to offline CPU after it got woke up.

Currently hotplug prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible
user of the wake up method.

Disable CPU offlining on ACPI MADT wakeup enumeration.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/core.c               |  1 -
 arch/x86/kernel/acpi/madt_wakeup.c |  3 +++
 include/linux/cc_platform.h        | 10 ----------
 kernel/cpu.c                       |  3 +--
 4 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..f07c3bb7deab 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
-	case CC_ATTR_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
 	case CC_ATTR_MEM_ENCRYPT:
 		return true;
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 38ffd4524e44..f7e33cea1be5 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,4 +1,5 @@
 #include <linux/acpi.h>
+#include <linux/cpu.h>
 #include <linux/io.h>
 #include <asm/apic.h>
 #include <asm/barrier.h>
@@ -75,6 +76,8 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
 
+	cpu_hotplug_disable_offlining();
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..d08dd65b5c43 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -80,16 +80,6 @@ enum cc_attr {
 	 * using AMD SEV-SNP features.
 	 */
 	CC_ATTR_GUEST_SEV_SNP,
-
-	/**
-	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
-	 *
-	 * The platform/OS is running as a guest/virtual machine does not
-	 * support CPU hotplug feature.
-	 *
-	 * Examples include TDX Guest.
-	 */
-	CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index af8034ccda8e..a9e1628cebbb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1532,8 +1532,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
-	    cpu_hotplug_offline_disabled)
+	if (cpu_hotplug_offline_disabled)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-11 23:10   ` Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest Kirill A. Shutemov
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov, Vitaly Kuznetsov, Paolo Bonzini, Wanpeng Li

kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
present in the VM. It leads to write to a MSR that doesn't exist on some
configurations, namely in TDX guest:

	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)

kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
features.

Do not disable kvmclock if it was not enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kernel/kvmclock.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..f2fff625576d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -24,8 +24,8 @@
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
-static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static int msr_kvm_system_time __ro_after_init;
+static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
@@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-	native_write_msr(msr_kvm_system_time, 0, 0);
+	if (msr_kvm_system_time)
+		native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
@@ -294,7 +295,10 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-	} else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+	} else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+	} else {
 		return;
 	}
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05 23:58   ` Huang, Kai
  2023-12-05  0:45 ` [PATCHv4 07/14] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno Kirill A. Shutemov
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
to #VE.

Use alternatives to keep the flag during kexec for TDX guests.

The change doesn't affect non-TDX-guest environments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kernel/relocate_kernel_64.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..cd6a53667c6b 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,12 +145,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	 * Set cr4 to a known state:
 	 *  - physical address extension enabled
 	 *  - 5-level paging, if it was enabled before
+	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
+	 *    in TDX guests.
 	 */
 	movl	$X86_CR4_PAE, %eax
 	testq	$X86_CR4_LA57, %r13
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
 1:
+	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %eax), X86_FEATURE_TDX_GUEST
 	movq	%rax, %cr4
 
 	jmp 1f
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 07/14] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 08/14] x86/mm: Return correct level from lookup_address() if pte is none Kirill A. Shutemov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX is going to have more than one reason to fail
enc_status_change_prepare().

Change the callback to return errno instead of assuming -EIO;
enc_status_change_finish() changed too to keep the interface symmetric.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c         | 20 +++++++++++---------
 arch/x86/hyperv/ivm.c           |  9 +++------
 arch/x86/include/asm/x86_init.h |  4 ++--
 arch/x86/kernel/x86_init.c      |  4 ++--
 arch/x86/mm/mem_encrypt_amd.c   |  8 ++++----
 arch/x86/mm/pat/set_memory.c    |  9 +++++----
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1b5d17a9f70d..2d90043a0e91 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -797,28 +797,30 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	return true;
 }
 
-static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
-					  bool enc)
+static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+					 bool enc)
 {
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc)
-		return tdx_enc_status_changed(vaddr, numpages, enc);
-	return true;
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+		return -EIO;
+
+	return 0;
 }
 
-static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 					 bool enc)
 {
 	/*
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc)
-		return tdx_enc_status_changed(vaddr, numpages, enc);
-	return true;
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+		return -EIO;
+
+	return 0;
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 02e55237d919..2e1be1afeebe 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -510,13 +510,12 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
  * with host. This function works as wrap of hv_mark_gpa_visibility()
  * with memory base and size.
  */
-static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
+static int hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
 {
 	enum hv_mem_host_visibility visibility = enc ?
 			VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
 	u64 *pfn_array;
 	int ret = 0;
-	bool result = true;
 	int i, pfn;
 
 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
@@ -530,17 +529,15 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
 			ret = hv_mark_gpa_visibility(pfn, pfn_array,
 						     visibility);
-			if (ret) {
-				result = false;
+			if (ret)
 				goto err_free_pfn_array;
-			}
 			pfn = 0;
 		}
 	}
 
  err_free_pfn_array:
 	kfree(pfn_array);
-	return result;
+	return ret;
 }
 
 static bool hv_vtom_tlb_flush_required(bool private)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c878616a18b8..c9503fe2d13a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,8 +150,8 @@ struct x86_init_acpi {
  * @enc_cache_flush_required	Returns true if a cache flush is needed before changing page encryption status
  */
 struct x86_guest {
-	bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
-	bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
+	int (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..f0f54e109eb9 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -131,8 +131,8 @@ struct x86_cpuinit_ops x86_cpuinit = {
 
 static void default_nmi_init(void) { };
 
-static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
+static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
 static bool is_private_mmio_noop(u64 addr) {return false; }
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index a68f2dda0948..6cf6cc8ae6a6 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -282,7 +282,7 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
 #endif
 }
 
-static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * To maintain the security guarantees of SEV-SNP guests, make sure
@@ -291,11 +291,11 @@ static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
 	if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
 		snp_set_memory_shared(vaddr, npages);
 
-	return true;
+	return 0;
 }
 
 /* Return true unconditionally: return value doesn't matter for the SEV side */
-static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * After memory is mapped encrypted in the page table, validate it
@@ -307,7 +307,7 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
 	if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
 		enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
 
-	return true;
+	return 0;
 }
 
 static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..6fbf22d5fa56 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2152,8 +2152,9 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
 
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
-	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
-		return -EIO;
+	ret = x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
+	if (ret)
+		return ret;
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2168,8 +2169,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 
 	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
 	if (!ret) {
-		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
-			ret = -EIO;
+		ret = x86_platform.guest.enc_status_change_finish(addr,
+								  numpages, enc);
 	}
 
 	return ret;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 08/14] x86/mm: Return correct level from lookup_address() if pte is none
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 07/14] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 09/14] x86/tdx: Account shared memory Kirill A. Shutemov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

lookup_address() only returns correct page table level for the entry if
the entry is not none.

Make the helper to always return correct 'level'. It allows to implement
iterator over kernel page tables using lookup_address().

Add one more entry into enum pg_level to indicate size of VA covered by
one PGD entry in 5-level paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/pat/set_memory.c         | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..3f648ffdfbe5 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -548,6 +548,7 @@ enum pg_level {
 	PG_LEVEL_2M,
 	PG_LEVEL_1G,
 	PG_LEVEL_512G,
+	PG_LEVEL_256T,
 	PG_LEVEL_NUM
 };
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6fbf22d5fa56..01f827eb8e80 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -666,32 +666,32 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 
-	*level = PG_LEVEL_NONE;
+	*level = PG_LEVEL_256T;
 
 	if (pgd_none(*pgd))
 		return NULL;
 
+	*level = PG_LEVEL_512G;
 	p4d = p4d_offset(pgd, address);
 	if (p4d_none(*p4d))
 		return NULL;
 
-	*level = PG_LEVEL_512G;
 	if (p4d_large(*p4d) || !p4d_present(*p4d))
 		return (pte_t *)p4d;
 
+	*level = PG_LEVEL_1G;
 	pud = pud_offset(p4d, address);
 	if (pud_none(*pud))
 		return NULL;
 
-	*level = PG_LEVEL_1G;
 	if (pud_large(*pud) || !pud_present(*pud))
 		return (pte_t *)pud;
 
+	*level = PG_LEVEL_2M;
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return NULL;
 
-	*level = PG_LEVEL_2M;
 	if (pmd_large(*pmd) || !pmd_present(*pmd))
 		return (pte_t *)pmd;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 09/14] x86/tdx: Account shared memory
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 08/14] x86/mm: Return correct level from lookup_address() if pte is none Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The kernel will convert all shared memory back to private during kexec.
The direct mapping page tables will provide information on which memory
is shared.

It is extremely important to convert all shared memory. If a page is
missed, it will cause the second kernel to crash when it accesses it.

Keep track of the number of shared pages. This will allow for
cross-checking against the shared information in the direct mapping and
reporting if the shared bit is lost.

Include a debugfs interface that allows for the check to be performed at
any point.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 69 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 2d90043a0e91..fcc159497554 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/debugfs.h>
 #include <linux/export.h>
 #include <linux/io.h>
 #include <asm/coco.h>
@@ -37,6 +38,13 @@
 
 #define TDREPORT_SUBTYPE_0	0
 
+static atomic_long_t nr_shared;
+
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __noreturn __tdx_hypercall_failed(void)
 {
@@ -820,6 +828,11 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
 		return -EIO;
 
+	if (enc)
+		atomic_long_sub(numpages, &nr_shared);
+	else
+		atomic_long_add(numpages, &nr_shared);
+
 	return 0;
 }
 
@@ -895,3 +908,59 @@ void __init tdx_early_init(void)
 
 	pr_info("Guest detected\n");
 }
+
+#ifdef CONFIG_DEBUG_FS
+static int tdx_shared_memory_show(struct seq_file *m, void *p)
+{
+	unsigned long addr, end;
+	unsigned long found = 0;
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte))
+			found += size / PAGE_SIZE;
+
+		addr += size;
+
+		cond_resched();
+	}
+
+	seq_printf(m, "Number of shared pages in kernel page tables:  %16lu\n",
+		   found);
+	seq_printf(m, "Number of pages accounted as shared:           %16ld\n",
+		   atomic_long_read(&nr_shared));
+	return 0;
+}
+
+static int tdx_shared_memory_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, tdx_shared_memory_show, NULL);
+}
+
+static const struct file_operations tdx_shared_memory_fops = {
+	.open           = tdx_shared_memory_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static __init int debug_tdx_shared_memory(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return 0;
+
+	debugfs_create_file("tdx_shared_memory", S_IRUSR, arch_debugfs_dir,
+			    NULL, &tdx_shared_memory_fops);
+	return 0;
+}
+fs_initcall(debug_tdx_shared_memory);
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 09/14] x86/tdx: Account shared memory Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-06  1:28   ` Edgecombe, Rick P
  2023-12-05  0:45 ` [PATCHv4 11/14] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges Kirill A. Shutemov
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

On kexec walk direct mapping and convert all shared memory back to
private. It makes all RAM private again and second kernel may use it
normally.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/kexec.c       |   0
 arch/x86/coco/tdx/tdx.c         | 120 +++++++++++++++++++++++++++++++-
 arch/x86/include/asm/x86_init.h |   1 +
 arch/x86/kernel/crash.c         |   4 ++
 arch/x86/kernel/reboot.c        |  10 +++
 5 files changed, 133 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c

diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index fcc159497554..46355ea9f4cf 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -6,14 +6,17 @@
 
 #include <linux/cpufeature.h>
 #include <linux/debugfs.h>
+#include <linux/delay.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/kexec.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/set_memory.h>
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -40,6 +43,9 @@
 
 static atomic_long_t nr_shared;
 
+static atomic_t conversions_in_progress;
+static bool conversion_allowed = true;
+
 static inline bool pte_decrypted(pte_t pte)
 {
 	return cc_mkdec(pte_val(pte)) == pte_val(pte);
@@ -725,6 +731,14 @@ static bool tdx_tlb_flush_required(bool private)
 
 static bool tdx_cache_flush_required(void)
 {
+	/*
+	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
+	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
+	 * implicit conversion to shared.
+	 */
+	if (!conversion_allowed)
+		return false;
+
 	/*
 	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
 	 * TDX doesn't have such capability.
@@ -808,12 +822,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
 					 bool enc)
 {
+	atomic_inc(&conversions_in_progress);
+
+	/*
+	 * Check after bumping conversions_in_progress to serialize
+	 * against tdx_shutdown().
+	 */
+	if (!conversion_allowed) {
+		atomic_dec(&conversions_in_progress);
+		return -EBUSY;
+	}
+
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	return 0;
 }
@@ -825,17 +852,104 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	if (enc)
 		atomic_long_sub(numpages, &nr_shared);
 	else
 		atomic_long_add(numpages, &nr_shared);
 
+	atomic_dec(&conversions_in_progress);
+
 	return 0;
 }
 
+static void tdx_kexec_unshare_mem(bool crash)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	/* Stop new private<->shared conversions */
+	conversion_allowed = false;
+
+	/*
+	 * Crash kernel reaches here with interrupts disabled: can't wait for
+	 * conversions to finish.
+	 *
+	 * If race happened, just report and proceed.
+	 */
+	if (!crash) {
+		unsigned long timeout;
+
+		/*
+		 * Wait for in-flight conversions to complete.
+		 *
+		 * Do not wait more than 30 seconds.
+		 */
+		timeout = 30 * USEC_PER_SEC;
+		while (atomic_read(&conversions_in_progress) && timeout--)
+			udelay(1);
+	}
+
+	if (atomic_read(&conversions_in_progress))
+		pr_warn("Failed to finish shared<->private conversions\n");
+
+	/*
+	 * Walk direct mapping and convert all shared memory back to private,
+	 */
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 *
+			 * Bypass unmapping for crash scenario. Unmapping
+			 * requires sleepable context, but in crash case kernel
+			 * hits the code path with interrupts disabled.
+			 * It shouldn't be a problem as all secondary CPUs are
+			 * down and kernel runs with interrupts disabled, so
+			 * there is no room for race.
+			 */
+			if (!crash)
+				set_memory_np(addr, pages);
+
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -895,6 +1009,8 @@ void __init tdx_early_init(void)
 	x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
 
+	x86_platform.guest.enc_kexec_unshare_mem     = tdx_kexec_unshare_mem;
+
 	/*
 	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 	 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c9503fe2d13a..917358821a31 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -154,6 +154,7 @@ struct x86_guest {
 	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
+	void (*enc_kexec_unshare_mem)(bool crash);
 };
 
 /**
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c92d88680dbf..1618224775f5 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -40,6 +40,7 @@
 #include <asm/intel_pt.h>
 #include <asm/crash.h>
 #include <asm/cmdline.h>
+#include <asm/tdx.h>
 
 /* Used while preparing memory map entries for second kernel */
 struct crash_memmap_data {
@@ -107,6 +108,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
 	crash_smp_send_stop();
 
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+		x86_platform.guest.enc_kexec_unshare_mem(true);
+
 	cpu_emergency_disable_virtualization();
 
 	/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 830425e6d38e..c81afffaa954 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -12,6 +12,7 @@
 #include <linux/delay.h>
 #include <linux/objtool.h>
 #include <linux/pgtable.h>
+#include <linux/kexec.h>
 #include <acpi/reboot.h>
 #include <asm/io.h>
 #include <asm/apic.h>
@@ -31,6 +32,7 @@
 #include <asm/realmode.h>
 #include <asm/x86_init.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
 
 /*
  * Power off function, if any
@@ -716,6 +718,14 @@ static void native_machine_emergency_restart(void)
 
 void native_machine_shutdown(void)
 {
+	/*
+	 * Call enc_kexec_unshare_mem() while all CPUs are still active and
+	 * interrupts are enabled. This will allow all in-flight memory
+	 * conversions to finish cleanly before unsharing all memory.
+	 */
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) && kexec_in_progress)
+		x86_platform.guest.enc_kexec_unshare_mem(false);
+
 	/* Stop the cpus and apics */
 #ifdef CONFIG_X86_IO_APIC
 	/*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 11/14] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 12/14] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure Kirill A. Shutemov
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

e820__end_of_ram_pfn() is used to calculate max_pfn which, among other
things, guides where direct mapping ends. Any memory above max_pfn is
not going to be present in the direct mapping.

e820__end_of_ram_pfn() finds the end of the ram based on the highest
E820_TYPE_RAM range. But it doesn't includes E820_TYPE_ACPI ranges into
calculation.

Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI
tables and might be required by kernel to function properly.

Usually the problem is hidden because there is some E820_TYPE_RAM memory
above E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash
memory as E820_TYPE_RAM on boot. If the preallocated range is small, it
can fit under the last E820_TYPE_ACPI range.

Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.

The problem was discovered during debugging kexec for TDX guest. TDX
guest uses E820_TYPE_ACPI to store the unaccepted memory bitmap and pass
it between the kernels on kexec.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/e820.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..99c80680dc9e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -827,7 +827,7 @@ u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type type)
+static unsigned long __init e820_end_ram_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -838,7 +838,8 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (entry->type != type)
+		if (entry->type != E820_TYPE_RAM &&
+		    entry->type != E820_TYPE_ACPI)
 			continue;
 
 		start_pfn = entry->addr >> PAGE_SHIFT;
@@ -864,12 +865,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 
 unsigned long __init e820__end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
+	return e820_end_ram_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820__end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL << (32 - PAGE_SHIFT), E820_TYPE_RAM);
+	return e820_end_ram_pfn(1UL << (32 - PAGE_SHIFT));
 }
 
 static void __init early_panic(char *msg)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 12/14] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 11/14] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case Kirill A. Shutemov
  2023-12-05  0:45 ` [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
  13 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

To prepare for the addition of support for MADT wakeup structure version
1, it is necessary to provide more appropriate names for the fields in
the structure.

The field 'mailbox_version' renamed as 'version'. This field signifies
the version of the structure and the related protocols, rather than the
version of the mailbox. This field has not been utilized in the code
thus far.

The field 'base_address' renamed as 'mailbox_address' to clarify the
kind of address it represents. In version 1, the structure includes the
reset vector address. Clear and distinct naming helps to prevent any
confusion.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/acpi/madt_wakeup.c | 2 +-
 include/acpi/actbl2.h              | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index f7e33cea1be5..386adbb03094 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -74,7 +74,7 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	acpi_table_print_madt_entry(&header->common);
 
-	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+	acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
 
 	cpu_hotplug_disable_offlining();
 
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index 3751ae69432f..23b4cfb640fc 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1109,9 +1109,9 @@ struct acpi_madt_generic_translator {
 
 struct acpi_madt_multiproc_wakeup {
 	struct acpi_subtable_header header;
-	u16 mailbox_version;
+	u16 version;
 	u32 reserved;		/* reserved - must be zero */
-	u64 base_address;
+	u64 mailbox_address;
 };
 
 #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE        2032
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 12/14] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-15 20:08   ` Thomas Gleixner
  2023-12-05  0:45 ` [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
  13 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
kexec: the second kernel won't be able to use more than one CPU.

Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
The acpi_wakeup_cpu() will use it to bring up secondary cpus.

Zero out mailbox address in the ACPI MADT wakeup structure to indicate
that the mailbox is not usable.  This prevents the kexec()-ed kernel
from reading a vaild mailbox, which in turn makes the kexec()-ed kernel
only be able to use the boot CPU.

This is Linux-specific protocol and not reflected in ACPI spec.

Booting the second kernel with signle CPU is enough to cover the most
common case for kexec -- kdump.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/acpi/madt_wakeup.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 386adbb03094..5d92d12f1042 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -13,6 +13,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_afte
 
 static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
 {
+	if (!acpi_mp_wake_mailbox_paddr) {
+		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
+		return -EOPNOTSUPP;
+	}
+
 	/*
 	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
 	 *
@@ -78,6 +83,23 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	cpu_hotplug_disable_offlining();
 
+	/*
+	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
+	 * It limits kexec: the second kernel won't be able to use more than
+	 * one CPU.
+	 *
+	 * Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
+	 * The acpi_wakeup_cpu() will use it to bring up secondary cpus.
+	 *
+	 * Zero out mailbox address in the ACPI MADT wakeup structure to
+	 * indicate that the mailbox is not usable.  This prevents the
+	 * kexec()-ed kernel from reading a vaild mailbox, which in turn
+	 * makes the kexec()-ed kernel only be able to use the boot CPU.
+	 *
+	 * This is Linux-specific protocol and not reflected in ACPI spec.
+	 */
+	mp_wake->mailbox_address = 0;
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2023-12-05  0:45 ` [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case Kirill A. Shutemov
@ 2023-12-05  0:45 ` Kirill A. Shutemov
  2023-12-05 23:36   ` Huang, Kai
  2023-12-15 20:29   ` Thomas Gleixner
  13 siblings, 2 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-05  0:45 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

MADT Multiprocessor Wakeup structure version 1 brings support of CPU
offlining: BIOS provides a reset vector where the CPU has to jump to
offline itself. The new TEST mailbox command can be used to test the CPU
offlined successfully and BIOS has control over it.

Add CPU offling support for ACPI MADT wakeup method by implementing
custom cpu_die, play_dead and stop_other_cpus SMP operations.

CPU offlining makes is possible to hand over secondary CPUs over kexec,
not limiting the second kernel to single CPU.

The change conforms to the approved ACPI spec change proposal. See the
Link.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
---
 arch/x86/include/asm/smp.h           |   1 +
 arch/x86/kernel/acpi/Makefile        |   2 +-
 arch/x86/kernel/acpi/madt_playdead.S |  21 ++
 arch/x86/kernel/acpi/madt_wakeup.c   | 295 +++++++++++++++++++++++++--
 arch/x86/kernel/reboot.c             |  12 +-
 include/acpi/actbl2.h                |  15 +-
 6 files changed, 321 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt_playdead.S

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4fab2ed454f3..3c8efba86d5c 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -38,6 +38,7 @@ struct smp_ops {
 	int (*cpu_disable)(void);
 	void (*cpu_die)(unsigned int cpu);
 	void (*play_dead)(void);
+	void (*crash_play_dead)(void);
 
 	void (*send_call_func_ipi)(const struct cpumask *mask);
 	void (*send_call_func_single_ipi)(int cpu);
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index 8c7329c88a75..37b1f28846de 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_ACPI)			+= boot.o
 obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
 obj-$(CONFIG_ACPI_APEI)			+= apei.o
 obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
-obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o madt_playdead.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
 obj-y					+= cstate.o
diff --git a/arch/x86/kernel/acpi/madt_playdead.S b/arch/x86/kernel/acpi/madt_playdead.S
new file mode 100644
index 000000000000..68f83865a1e3
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_playdead.S
@@ -0,0 +1,21 @@
+#include <linux/linkage.h>
+#include <asm/nospec-branch.h>
+#include <asm/page_types.h>
+#include <asm/processor-flags.h>
+
+	.text
+	.align PAGE_SIZE
+SYM_FUNC_START(asm_acpi_mp_play_dead)
+	/* Turn off global entries. Following CR3 write will flush them. */
+	movq	%cr4, %rdx
+	andq	$~(X86_CR4_PGE), %rdx
+	movq	%rdx, %cr4
+
+	/* Switch to identity mapping */
+	movq	%rsi, %rax
+	movq	%rax, %cr3
+
+	/* Jump to reset vector */
+	ANNOTATE_RETPOLINE_SAFE
+	jmp	*%rdi
+SYM_FUNC_END(asm_acpi_mp_play_dead)
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 5d92d12f1042..f8cf7a048743 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,9 +1,18 @@
 #include <linux/acpi.h>
 #include <linux/cpu.h>
+#include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/kexec.h>
+#include <linux/memblock.h>
+#include <linux/pgtable.h>
+#include <linux/sched/hotplug.h>
 #include <asm/apic.h>
 #include <asm/barrier.h>
+#include <asm/init.h>
+#include <asm/intel_pt.h>
+#include <asm/nmi.h>
 #include <asm/processor.h>
+#include <asm/reboot.h>
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
 static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;
@@ -11,6 +20,228 @@ static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;
 /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
 static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_after_init;
 
+static u64 acpi_mp_pgd __ro_after_init;
+static u64 acpi_mp_reset_vector_paddr __ro_after_init;
+
+void asm_acpi_mp_play_dead(u64 reset_vector, u64 pgd_pa);
+
+static void crash_acpi_mp_play_dead(void)
+{
+	asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr,
+			      acpi_mp_pgd);
+}
+
+static void acpi_mp_play_dead(void)
+{
+	play_dead_common();
+	asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr,
+			      acpi_mp_pgd);
+}
+
+static void acpi_mp_cpu_die(unsigned int cpu)
+{
+	u32 apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	unsigned long timeout;
+
+	/*
+	 * Use TEST mailbox command to prove that BIOS got control over
+	 * the CPU before declaring it dead.
+	 *
+	 * BIOS has to clear 'command' field of the mailbox.
+	 */
+	acpi_mp_wake_mailbox->apic_id = apicid;
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_TEST);
+
+	/* Don't wait longer than a second. */
+	timeout = USEC_PER_SEC;
+	while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
+		udelay(1);
+}
+
+static void acpi_mp_stop_other_cpus(int wait)
+{
+	smp_shutdown_nonboot_cpus(smp_processor_id());
+}
+
+/* The argument is required to match type of x86_mapping_info::alloc_pgt_page */
+static void __init *alloc_pgt_page(void *dummy)
+{
+	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+/*
+ * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
+ * the same place as in the kernel page tables. asm_acpi_mp_play_dead() switches
+ * to the identity mapping and the function has be present at the same spot in
+ * the virtual address space before and after switching page tables.
+ */
+static int __init init_transition_pgtable(pgd_t *pgd)
+{
+	pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+	unsigned long vaddr, paddr;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	vaddr = (unsigned long)asm_acpi_mp_play_dead;
+	pgd += pgd_index(vaddr);
+	if (!pgd_present(*pgd)) {
+		p4d = (p4d_t *)alloc_pgt_page(NULL);
+		if (!p4d)
+			return -ENOMEM;
+		set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+	}
+	p4d = p4d_offset(pgd, vaddr);
+	if (!p4d_present(*p4d)) {
+		pud = (pud_t *)alloc_pgt_page(NULL);
+		if (!pud)
+			return -ENOMEM;
+		set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+	}
+	pud = pud_offset(p4d, vaddr);
+	if (!pud_present(*pud)) {
+		pmd = (pmd_t *)alloc_pgt_page(NULL);
+		if (!pmd)
+			return -ENOMEM;
+		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+	}
+	pmd = pmd_offset(pud, vaddr);
+	if (!pmd_present(*pmd)) {
+		pte = (pte_t *)alloc_pgt_page(NULL);
+		if (!pte)
+			return -ENOMEM;
+		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+	}
+	pte = pte_offset_kernel(pmd, vaddr);
+
+	paddr = __pa(vaddr);
+	set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+
+	return 0;
+}
+
+static void __init free_pte(pmd_t *pmd)
+{
+	pte_t *pte = pte_offset_kernel(pmd, 0);
+
+	memblock_free(pte, PAGE_SIZE);
+}
+
+static void __init free_pmd(pud_t *pud)
+{
+	pmd_t *pmd = pmd_offset(pud, 0);
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_present(pmd[i]))
+		    continue;
+
+		if (pmd_leaf(pmd[i]))
+		    continue;
+
+		free_pte(&pmd[i]);
+	}
+
+	memblock_free(pmd, PAGE_SIZE);
+}
+
+static void __init free_pud(p4d_t *p4d)
+{
+	pud_t *pud = pud_offset(p4d, 0);
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_present(pud[i]))
+			continue;
+
+		if (pud_leaf(pud[i]))
+		    continue;
+
+		free_pmd(&pud[i]);
+	}
+
+	memblock_free(pud, PAGE_SIZE);
+}
+
+static void __init free_p4d(pgd_t *pgd)
+{
+	p4d_t *p4d = p4d_offset(pgd, 0);
+	int i;
+
+	for (i = 0; i < PTRS_PER_P4D; i++) {
+		if (!p4d_present(p4d[i]))
+			continue;
+
+		free_pud(&p4d[i]);
+	}
+
+	if (pgtable_l5_enabled())
+		memblock_free(p4d, PAGE_SIZE);
+}
+
+static void __init free_pgd(pgd_t *pgd)
+{
+	int i;
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		if (!pgd_present(pgd[i]))
+			continue;
+
+		free_p4d(&pgd[i]);
+	}
+
+	memblock_free(pgd, PAGE_SIZE);
+}
+
+static int __init acpi_mp_setup_reset(u64 reset_vector)
+{
+	pgd_t *pgd;
+	struct x86_mapping_info info = {
+		.alloc_pgt_page = alloc_pgt_page,
+		.page_flag      = __PAGE_KERNEL_LARGE_EXEC,
+		.kernpg_flag    = _KERNPG_TABLE_NOENC,
+	};
+
+	pgd = alloc_pgt_page(NULL);
+	if (!pgd)
+		return -ENOMEM;
+
+	for (int i = 0; i < nr_pfn_mapped; i++) {
+		unsigned long mstart, mend;
+
+		mstart = pfn_mapped[i].start << PAGE_SHIFT;
+		mend   = pfn_mapped[i].end << PAGE_SHIFT;
+		if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) {
+			free_pgd(pgd);
+			return -ENOMEM;
+		}
+	}
+
+	if (kernel_ident_mapping_init(&info, pgd,
+				      PAGE_ALIGN_DOWN(reset_vector),
+				      PAGE_ALIGN(reset_vector + 1))) {
+		free_pgd(pgd);
+		return -ENOMEM;
+	}
+
+	if (init_transition_pgtable(pgd)) {
+		free_pgd(pgd);
+		return -ENOMEM;
+	}
+
+	smp_ops.play_dead = acpi_mp_play_dead;
+	smp_ops.crash_play_dead = crash_acpi_mp_play_dead;
+	smp_ops.cpu_die = acpi_mp_cpu_die;
+	smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
+
+	acpi_mp_reset_vector_paddr = reset_vector;
+	acpi_mp_pgd = __pa(pgd);
+
+	return 0;
+}
+
 static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
 {
 	if (!acpi_mp_wake_mailbox_paddr) {
@@ -68,37 +299,63 @@ static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
 	return 0;
 }
 
+static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
+{
+	cpu_hotplug_disable_offlining();
+
+	/*
+	 * Zero out mailbox address in the ACPI MADT wakeup structure
+	 * to indicate that the mailbox is not usable.  This prevents
+	 * the kexec()-ed kernel from reading a vaild mailbox, which in
+	 * turn makes the kexec()-ed kernel only be able to use the boot
+	 * CPU.
+	 *
+	 * This is Linux-specific protocol and not reflected in ACPI spec.
+	 *
+	 * acpi_mp_wake_mailbox_paddr already has the mailbox address.
+	 * The acpi_wakeup_cpu() will use it to bring up secondary cpus for
+	 * the current kernel.
+	 */
+	mp_wake->mailbox_address = 0;
+}
+
 int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 			      const unsigned long end)
 {
 	struct acpi_madt_multiproc_wakeup *mp_wake;
 
 	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
-	if (BAD_MADT_ENTRY(mp_wake, end))
+
+        /*
+         * Cannot use the standard BAD_MADT_ENTRY() to sanity check the @mp_wake
+         * entry.  'sizeof (struct acpi_madt_multiproc_wakeup)' can be larger
+         * than the actual size of the MP wakeup entry in ACPI table because the
+	 * 'reset_vector' is only available in the V1 MP wakeup structure.
+         */
+	if (!mp_wake)
+		return -EINVAL;
+	if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0)
+		return -EINVAL;
+	if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0)
 		return -EINVAL;
 
 	acpi_table_print_madt_entry(&header->common);
 
 	acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
 
-	cpu_hotplug_disable_offlining();
-
-	/*
-	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
-	 * It limits kexec: the second kernel won't be able to use more than
-	 * one CPU.
-	 *
-	 * Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
-	 * The acpi_wakeup_cpu() will use it to bring up secondary cpus.
-	 *
-	 * Zero out mailbox address in the ACPI MADT wakeup structure to
-	 * indicate that the mailbox is not usable.  This prevents the
-	 * kexec()-ed kernel from reading a vaild mailbox, which in turn
-	 * makes the kexec()-ed kernel only be able to use the boot CPU.
-	 *
-	 * This is Linux-specific protocol and not reflected in ACPI spec.
-	 */
-	mp_wake->mailbox_address = 0;
+	if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 &&
+	    mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) {
+		if (acpi_mp_setup_reset(mp_wake->reset_vector)) {
+			pr_warn("Failed to setup MADT reset vector\n");
+			acpi_mp_disable_offlining(mp_wake);
+		}
+	} else {
+		/*
+		 * CPU offlining requires version 1 of the ACPI MADT wakeup
+		 * structure.
+		 */
+		acpi_mp_disable_offlining(mp_wake);
+	}
 
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index c81afffaa954..99e6ab552da0 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -878,10 +878,14 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 	cpu_emergency_disable_virtualization();
 
 	atomic_dec(&waiting_for_crash_ipi);
-	/* Assume hlt works */
-	halt();
-	for (;;)
-		cpu_relax();
+
+	if (smp_ops.crash_play_dead) {
+	    smp_ops.crash_play_dead();
+	} else {
+		halt();
+		for (;;)
+			cpu_relax();
+	}
 
 	return NMI_HANDLED;
 }
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index 23b4cfb640fc..8348bf46a648 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1112,8 +1112,20 @@ struct acpi_madt_multiproc_wakeup {
 	u16 version;
 	u32 reserved;		/* reserved - must be zero */
 	u64 mailbox_address;
+	u64 reset_vector;
 };
 
+/* Values for Version field above */
+
+enum acpi_madt_multiproc_wakeup_version {
+	ACPI_MADT_MP_WAKEUP_VERSION_NONE = 0,
+	ACPI_MADT_MP_WAKEUP_VERSION_V1 = 1,
+	ACPI_MADT_MP_WAKEUP_VERSION_RESERVED = 2, /* 2 and greater are reserved */
+};
+
+#define ACPI_MADT_MP_WAKEUP_SIZE_V0	16
+#define ACPI_MADT_MP_WAKEUP_SIZE_V1	24
+
 #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE        2032
 #define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE  2048
 
@@ -1126,7 +1138,8 @@ struct acpi_madt_multiproc_wakeup_mailbox {
 	u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE];	/* reserved for firmware use */
 };
 
-#define ACPI_MP_WAKE_COMMAND_WAKEUP    1
+#define ACPI_MP_WAKE_COMMAND_WAKEUP	1
+#define ACPI_MP_WAKE_COMMAND_TEST	2
 
 /* 17: CPU Core Interrupt Controller (ACPI 6.5) */
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-12-05  0:45 ` [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
@ 2023-12-05 23:36   ` Huang, Kai
  2023-12-22 11:19     ` kirill.shutemov
  2023-12-15 20:29   ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Huang, Kai @ 2023-12-05 23:36 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, linux-kernel, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, ashish.kalra, kexec, seanjc, bhe, linux-coco


> +
> +static void acpi_mp_stop_other_cpus(int wait)
> +{
> +	smp_shutdown_nonboot_cpus(smp_processor_id());
> +}

Is this and ...

+	smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;

... this below still needed?

I think the current native_stop_other_cpus() should just work given you have set
up ...

+	smp_ops.crash_play_dead = crash_acpi_mp_play_dead;

... for TDX guest?

> +
> +/* The argument is required to match type of x86_mapping_info::alloc_pgt_page */
> +static void __init *alloc_pgt_page(void *dummy)
> +{
> +	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +/*
> + * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
> + * the same place as in the kernel page tables. asm_acpi_mp_play_dead() switches
> + * to the identity mapping and the function has be present at the same spot in
> + * the virtual address space before and after switching page tables.
> + */
> +static int __init init_transition_pgtable(pgd_t *pgd)
> +{
> +	pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
> +	unsigned long vaddr, paddr;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +
> +	vaddr = (unsigned long)asm_acpi_mp_play_dead;
> +	pgd += pgd_index(vaddr);
> +	if (!pgd_present(*pgd)) {
> +		p4d = (p4d_t *)alloc_pgt_page(NULL);
> +		if (!p4d)
> +			return -ENOMEM;
> +		set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
> +	}
> +	p4d = p4d_offset(pgd, vaddr);
> +	if (!p4d_present(*p4d)) {
> +		pud = (pud_t *)alloc_pgt_page(NULL);
> +		if (!pud)
> +			return -ENOMEM;
> +		set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
> +	}
> +	pud = pud_offset(p4d, vaddr);
> +	if (!pud_present(*pud)) {
> +		pmd = (pmd_t *)alloc_pgt_page(NULL);
> +		if (!pmd)
> +			return -ENOMEM;
> +		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
> +	}
> +	pmd = pmd_offset(pud, vaddr);
> +	if (!pmd_present(*pmd)) {
> +		pte = (pte_t *)alloc_pgt_page(NULL);
> +		if (!pte)
> +			return -ENOMEM;
> +		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
> +	}
> +	pte = pte_offset_kernel(pmd, vaddr);
> +
> +	paddr = __pa(vaddr);
> +	set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
> +
> +	return 0;
> +}

Sorry for saying this late.  I think we can also use kernel_ident_mapping_init()
to do the init_transition_pgtable()?  We can set struct x86_mapping_info::offset
to __PAGE_OFFSET to do that?

Looks set_up_temporary_mappings() in arch/x86/power/hibernate_64.c uses the same
trick.

Anyway I am not sure how many LoC (assuming can do) can be saved so up to you.

> +
> +static void __init free_pte(pmd_t *pmd)
> +{
> +	pte_t *pte = pte_offset_kernel(pmd, 0);
> +
> +	memblock_free(pte, PAGE_SIZE);
> +}
> +
> +static void __init free_pmd(pud_t *pud)
> +{
> +	pmd_t *pmd = pmd_offset(pud, 0);
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		if (!pmd_present(pmd[i]))
> +		    continue;
> +
> +		if (pmd_leaf(pmd[i]))
> +		    continue;
> +
> +		free_pte(&pmd[i]);
> +	}
> +
> +	memblock_free(pmd, PAGE_SIZE);
> +}
> +
> +static void __init free_pud(p4d_t *p4d)
> +{
> +	pud_t *pud = pud_offset(p4d, 0);
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		if (!pud_present(pud[i]))
> +			continue;
> +
> +		if (pud_leaf(pud[i]))
> +		    continue;
> +
> +		free_pmd(&pud[i]);
> +	}
> +
> +	memblock_free(pud, PAGE_SIZE);
> +}
> +
> +static void __init free_p4d(pgd_t *pgd)
> +{
> +	p4d_t *p4d = p4d_offset(pgd, 0);
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_P4D; i++) {
> +		if (!p4d_present(p4d[i]))
> +			continue;
> +
> +		free_pud(&p4d[i]);
> +	}
> +
> +	if (pgtable_l5_enabled())
> +		memblock_free(p4d, PAGE_SIZE);
> +}
> +
> +static void __init free_pgd(pgd_t *pgd)
> +{
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PGD; i++) {
> +		if (!pgd_present(pgd[i]))
> +			continue;
> +
> +		free_p4d(&pgd[i]);
> +	}
> +
> +	memblock_free(pgd, PAGE_SIZE);
> +}

It's a little bit sad such cleanup code isn't in common code, e.g., with a 

	void (*free_pgt_page)(void *);

to allow the user to specify how to free the page table.

But this can be future job if needed.


[...]

>  int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  			      const unsigned long end)
>  {
>  	struct acpi_madt_multiproc_wakeup *mp_wake;
>  
>  	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> -	if (BAD_MADT_ENTRY(mp_wake, end))
> +
> +        /*
> +         * Cannot use the standard BAD_MADT_ENTRY() to sanity check the @mp_wake
> +         * entry.  'sizeof (struct acpi_madt_multiproc_wakeup)' can be larger
> +         * than the actual size of the MP wakeup entry in ACPI table because the
> +	 * 'reset_vector' is only available in the V1 MP wakeup structure.
> +         */

Space/tab issue.

> +	if (!mp_wake)
> +		return -EINVAL;
> +	if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0)
> +		return -EINVAL;
> +	if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0)
>  		return -EINVAL;
>  


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  2023-12-05  0:45 ` [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest Kirill A. Shutemov
@ 2023-12-05 23:58   ` Huang, Kai
  2023-12-06 13:26     ` kirill.shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Kai @ 2023-12-05 23:58 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, linux-kernel, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, ashish.kalra, kexec, seanjc, bhe, linux-coco

On Tue, 2023-12-05 at 03:45 +0300, Kirill A. Shutemov wrote:
> TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> to #VE.
> 
> Use alternatives to keep the flag during kexec for TDX guests.
> 
> The change doesn't affect non-TDX-guest environments.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/kernel/relocate_kernel_64.S | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 56cab1bb25f5..cd6a53667c6b 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -145,12 +145,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	 * Set cr4 to a known state:
>  	 *  - physical address extension enabled
>  	 *  - 5-level paging, if it was enabled before
> +	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
> +	 *    in TDX guests.
>  	 */
>  	movl	$X86_CR4_PAE, %eax
>  	testq	$X86_CR4_LA57, %r13
>  	jz	1f
>  	orl	$X86_CR4_LA57, %eax
>  1:
> +	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %eax), X86_FEATURE_TDX_GUEST
>  	movq	%rax, %cr4
>  
>  	jmp 1f

Nit:

It seems <asm/alternative.h> isn't included in relocate_kernel_64.S.  Maybe it's
better to do it explicitly.

Maybe even better to explicitly include <linux/stringify.h> too, but I see
<asm/alternative.h> already does that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec
  2023-12-05  0:45 ` [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
@ 2023-12-06  1:28   ` Edgecombe, Rick P
  2023-12-06 15:07     ` kirill.shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Edgecombe, Rick P @ 2023-12-06  1:28 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: kexec, Reshetova, Elena, Nakajima, Jun, rafael, peterz, Huang,
	Kai, sathyanarayanan.kuppuswamy, Hunter, Adrian, thomas.lendacky,
	ashish.kalra, linux-coco, seanjc, bhe, linux-kernel

On Tue, 2023-12-05 at 03:45 +0300, Kirill A. Shutemov wrote: 
> +static void tdx_kexec_unshare_mem(bool crash)
> +{
> +       unsigned long addr, end;
> +       long found = 0, shared;
> +
> +       /* Stop new private<->shared conversions */
> +       conversion_allowed = false;

I wonder if this might need a compiler barrier here to be totally safe.
I'm not sure.

> +
> +       /*
> +        * Crash kernel reaches here with interrupts disabled: can't
> wait for
> +        * conversions to finish.
> +        *
> +        * If race happened, just report and proceed.
> +        */
> +       if (!crash) {
> +               unsigned long timeout;
> +
> +               /*
> +                * Wait for in-flight conversions to complete.
> +                *
> +                * Do not wait more than 30 seconds.
> +                */
> +               timeout = 30 * USEC_PER_SEC;
> +               while (atomic_read(&conversions_in_progress) &&
> timeout--)
> +                       udelay(1);
> +       }
> +
> +       if (atomic_read(&conversions_in_progress))
> +               pr_warn("Failed to finish shared<->private
> conversions\n");

I can't think of any non-ridiculous way to handle this case. Maybe we
need VMM help.

> 
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 830425e6d38e..c81afffaa954 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -12,6 +12,7 @@
>  #include <linux/delay.h>
>  #include <linux/objtool.h>
>  #include <linux/pgtable.h>
> +#include <linux/kexec.h>
>  #include <acpi/reboot.h>
>  #include <asm/io.h>
>  #include <asm/apic.h>
> @@ -31,6 +32,7 @@
>  #include <asm/realmode.h>
>  #include <asm/x86_init.h>
>  #include <asm/efi.h>
> +#include <asm/tdx.h>
>  
>  /*
>   * Power off function, if any
> @@ -716,6 +718,14 @@ static void
> native_machine_emergency_restart(void)
>  
>  void native_machine_shutdown(void)
>  {
> +       /*
> +        * Call enc_kexec_unshare_mem() while all CPUs are still
> active and
> +        * interrupts are enabled. This will allow all in-flight
> memory
> +        * conversions to finish cleanly before unsharing all memory.
> +        */
> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> kexec_in_progress)
> +               x86_platform.guest.enc_kexec_unshare_mem(false);

These questions are coming from an incomplete understanding of the
kexec/reboot operation. Please disregard if it is not helpful.

By doing this while other tasks can still run, it handles the
conversion races in the !crash case. But then it sets shared pages to
NP. What happens if another active task tries to write to one?

I guess we rely on the kernel_restart_prepare()->device_shutdown() to
clean up, which runs before native_machine_shutdown(). So there might
be conversions in progress when tdx_kexec_unshare_mem() is called, from
the allocator work queues. But the actual memory won't be accessed
during that operation.

But the console must be active? Or otherwise who can see these
warnings. It doesn't use a shared page? Or the KVM clock, which looks
to clean up at cpu tear down, which now happens after
tdx_kexec_unshare_mem()? So I wonder if there might be cases.

If so, maybe you could halt the conversions in
native_machine_shutdown(), then do the actual reset to private after
tasks can't schedule. I'd still wonder about if anything might try to
access a shared page triggered by the console output.


> +
>         /* Stop the cpus and apics */
>  #ifdef CONFIG_X86_IO_APIC
>         /*


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  2023-12-05 23:58   ` Huang, Kai
@ 2023-12-06 13:26     ` kirill.shutemov
  0 siblings, 0 replies; 32+ messages in thread
From: kirill.shutemov @ 2023-12-06 13:26 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, linux-kernel,
	sathyanarayanan.kuppuswamy, Hunter, Adrian, thomas.lendacky,
	ashish.kalra, kexec, seanjc, bhe, linux-coco

On Tue, Dec 05, 2023 at 11:58:45PM +0000, Huang, Kai wrote:
> On Tue, 2023-12-05 at 03:45 +0300, Kirill A. Shutemov wrote:
> > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> > to #VE.
> > 
> > Use alternatives to keep the flag during kexec for TDX guests.
> > 
> > The change doesn't affect non-TDX-guest environments.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/kernel/relocate_kernel_64.S | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 56cab1bb25f5..cd6a53667c6b 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -145,12 +145,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	 * Set cr4 to a known state:
> >  	 *  - physical address extension enabled
> >  	 *  - 5-level paging, if it was enabled before
> > +	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
> > +	 *    in TDX guests.
> >  	 */
> >  	movl	$X86_CR4_PAE, %eax
> >  	testq	$X86_CR4_LA57, %r13
> >  	jz	1f
> >  	orl	$X86_CR4_LA57, %eax
> >  1:
> > +	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %eax), X86_FEATURE_TDX_GUEST
> >  	movq	%rax, %cr4
> >  
> >  	jmp 1f
> 
> Nit:
> 
> It seems <asm/alternative.h> isn't included in relocate_kernel_64.S.  Maybe it's
> better to do it explicitly.
> 
> Maybe even better to explicitly include <linux/stringify.h> too, but I see
> <asm/alternative.h> already does that.

Okay, I will add both.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec
  2023-12-06  1:28   ` Edgecombe, Rick P
@ 2023-12-06 15:07     ` kirill.shutemov
  2023-12-06 18:32       ` Edgecombe, Rick P
  0 siblings, 1 reply; 32+ messages in thread
From: kirill.shutemov @ 2023-12-06 15:07 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: tglx, mingo, x86, bp, dave.hansen, kexec, Reshetova, Elena,
	Nakajima, Jun, rafael, peterz, Huang, Kai,
	sathyanarayanan.kuppuswamy, Hunter, Adrian, thomas.lendacky,
	ashish.kalra, linux-coco, seanjc, bhe, linux-kernel

On Wed, Dec 06, 2023 at 01:28:08AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2023-12-05 at 03:45 +0300, Kirill A. Shutemov wrote: 
> > +static void tdx_kexec_unshare_mem(bool crash)
> > +{
> > +       unsigned long addr, end;
> > +       long found = 0, shared;
> > +
> > +       /* Stop new private<->shared conversions */
> > +       conversion_allowed = false;
> 
> I wonder if this might need a compiler barrier here to be totally safe.
> I'm not sure.

Yeah, it should be cleaner with a barrier.

> > +
> > +       /*
> > +        * Crash kernel reaches here with interrupts disabled: can't
> > wait for
> > +        * conversions to finish.
> > +        *
> > +        * If race happened, just report and proceed.
> > +        */
> > +       if (!crash) {
> > +               unsigned long timeout;
> > +
> > +               /*
> > +                * Wait for in-flight conversions to complete.
> > +                *
> > +                * Do not wait more than 30 seconds.
> > +                */
> > +               timeout = 30 * USEC_PER_SEC;
> > +               while (atomic_read(&conversions_in_progress) &&
> > timeout--)
> > +                       udelay(1);
> > +       }
> > +
> > +       if (atomic_read(&conversions_in_progress))
> > +               pr_warn("Failed to finish shared<->private
> > conversions\n");
> 
> I can't think of any non-ridiculous way to handle this case. Maybe we
> need VMM help.

Do you see a specific way how VMM can help here?

> > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > index 830425e6d38e..c81afffaa954 100644
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/delay.h>
> >  #include <linux/objtool.h>
> >  #include <linux/pgtable.h>
> > +#include <linux/kexec.h>
> >  #include <acpi/reboot.h>
> >  #include <asm/io.h>
> >  #include <asm/apic.h>
> > @@ -31,6 +32,7 @@
> >  #include <asm/realmode.h>
> >  #include <asm/x86_init.h>
> >  #include <asm/efi.h>
> > +#include <asm/tdx.h>
> >  
> >  /*
> >   * Power off function, if any
> > @@ -716,6 +718,14 @@ static void
> > native_machine_emergency_restart(void)
> >  
> >  void native_machine_shutdown(void)
> >  {
> > +       /*
> > +        * Call enc_kexec_unshare_mem() while all CPUs are still
> > active and
> > +        * interrupts are enabled. This will allow all in-flight
> > memory
> > +        * conversions to finish cleanly before unsharing all memory.
> > +        */
> > +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> > kexec_in_progress)
> > +               x86_platform.guest.enc_kexec_unshare_mem(false);
> 
> These questions are coming from an incomplete understanding of the
> kexec/reboot operation. Please disregard if it is not helpful.
> 
> By doing this while other tasks can still run, it handles the
> conversion races in the !crash case. But then it sets shared pages to
> NP. What happens if another active task tries to write to one?
> 
> I guess we rely on the kernel_restart_prepare()->device_shutdown() to
> clean up, which runs before native_machine_shutdown(). So there might
> be conversions in progress when tdx_kexec_unshare_mem() is called, from
> the allocator work queues. But the actual memory won't be accessed
> during that operation.

Right, devices has to be shutdown by then.

> But the console must be active? Or otherwise who can see these
> warnings. It doesn't use a shared page? Or the KVM clock, which looks
> to clean up at cpu tear down, which now happens after
> tdx_kexec_unshare_mem()? So I wonder if there might be cases.

Virtio console is not functional by then, but serial is. Serial uses port
I/O and doesn't need shared memory.

> If so, maybe you could halt the conversions in
> native_machine_shutdown(), then do the actual reset to private after
> tasks can't schedule.

It would also mean that we cannot use set_memory_np() there as it requires
sleepable context. I would rather keep conversion in
native_machine_shutdown() path.

> I'd still wonder about if anything might try to
> access a shared page triggered by the console output.

set_memory_np() would make it obvious if it ever happens.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec
  2023-12-06 15:07     ` kirill.shutemov
@ 2023-12-06 18:32       ` Edgecombe, Rick P
  0 siblings, 0 replies; 32+ messages in thread
From: Edgecombe, Rick P @ 2023-12-06 18:32 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kexec, linux-coco, Huang, Kai, ashish.kalra, dave.hansen,
	thomas.lendacky, Hunter, Adrian, Reshetova, Elena, linux-kernel,
	mingo, seanjc, tglx, bhe, Nakajima, Jun, peterz, bp, rafael,
	sathyanarayanan.kuppuswamy, x86

On Wed, 2023-12-06 at 18:07 +0300, kirill.shutemov@linux.intel.com
wrote:
>  I can't think of any non-ridiculous way to handle this case. Maybe
> we
> > need VMM help.
> 
> Do you see a specific way how VMM can help here?

I didn't have a specific idea. I was just thinking that the problem is
that guest doesn't know the exact private/shared state of the GFNs
because of the potentially interrupted conversion processes. But the
VMM does have this information. 

What about something like: The VMM could expose something like MapGPA
that searches for a shared GPA and return it. So you ask it to convert
the next shared GPA it can find to private and it searches (in the
host) the xarray stuff to find a GPA that is shared. Then in the guest,
it has a shared GPA and check the direct map PTE to reset, and accept.

The guest could call the new MapGPA-like hypercall in a loop until all
GPAs are reset.

> > I'd still wonder about if anything might try to
> > access a shared page triggered by the console output.
> 
> set_memory_np() would make it obvious if it ever happens.

I think this is a worthwhile improvement over the existing complete
lack of support, but it's not race free. With the barrier comments, and
given the lack of good alternatives:

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-12-05  0:45 ` [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled Kirill A. Shutemov
@ 2023-12-11 23:10   ` Kirill A. Shutemov
  2023-12-13 17:22     ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-11 23:10 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Huang, Kai,
	Baoquan He, kexec, linux-coco, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li

On Tue, Dec 05, 2023 at 03:45:01AM +0300, Kirill A. Shutemov wrote:
> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
> 
> 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> 
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
> 
> Do not disable kvmclock if it was not enabled.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>

Paolo, Sean, any chance you can get it in through KVM tree while the rest
of kexec patchset is pending? The problem is visible on normal reboot too.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-12-11 23:10   ` Kirill A. Shutemov
@ 2023-12-13 17:22     ` Sean Christopherson
  2024-01-04 15:05       ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2023-12-13 17:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
	Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
	Jun Nakajima, Rick Edgecombe, Tom Lendacky, Ashish Kalra,
	Kai Huang, Baoquan He, kexec, linux-coco, linux-kernel,
	Vitaly Kuznetsov, Wanpeng Li

On Tue, Dec 12, 2023, Kirill A. Shutemov wrote:
> On Tue, Dec 05, 2023 at 03:45:01AM +0300, Kirill A. Shutemov wrote:
> > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > present in the VM. It leads to write to a MSR that doesn't exist on some
> > configurations, namely in TDX guest:
> > 
> > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > 
> > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > features.
> > 
> > Do not disable kvmclock if it was not enabled.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> > Reviewed-by: Sean Christopherson <seanjc@google.com>
> > Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> 
> Paolo, Sean, any chance you can get it in through KVM tree while the rest
> of kexec patchset is pending? The problem is visible on normal reboot too.

Paolo is going to grab this (possibly for 6.7-rc?).  I'll keep this tagged on my
end in case that doesn't happen "soon".

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported
  2023-12-05  0:44 ` [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported Kirill A. Shutemov
@ 2023-12-15 19:42   ` Thomas Gleixner
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2023-12-15 19:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Tue, Dec 05 2023 at 03:44, Kirill A. Shutemov wrote:
>  
> +static bool cpu_hotplug_offline_disabled;

__ro_after_init?

Other than that:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
  2023-12-05  0:45 ` [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup Kirill A. Shutemov
@ 2023-12-15 19:43   ` Thomas Gleixner
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2023-12-15 19:43 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Tue, Dec 05 2023 at 03:45, Kirill A. Shutemov wrote:

> ACPI MADT doesn't allow to offline CPU after it got woke up.
>
> Currently hotplug prevented based on the confidential computing

Currently CPU hotplug is prevented...

Other than that:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  2023-12-05  0:45 ` [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case Kirill A. Shutemov
@ 2023-12-15 20:08   ` Thomas Gleixner
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2023-12-15 20:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Tue, Dec 05 2023 at 03:45, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline CPU after it got woke up. It limits

to offline a CPU after it was onlined. This limits kexec: ...

> kexec: the second kernel won't be able to use more than one CPU.

... one CPU, which is enough to cover the kdump case.


> Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
> The acpi_wakeup_cpu() will use it to bring up secondary cpus.
>
> Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> that the mailbox is not usable.  This prevents the kexec()-ed kernel
> from reading a vaild mailbox, which in turn makes the kexec()-ed kernel
> only be able to use the boot CPU.
>
> This is Linux-specific protocol and not reflected in ACPI spec.
>
> Booting the second kernel with signle CPU is enough to cover the most
> common case for kexec -- kdump.

This is confusing at best and I doubt that kdump is the most common case
for every one.

  To prevent a kexec kernel from onlining secondary CPUs invalidate the
  mailbox address in the ACPI MADT wakeup structure which prevents a
  kexec kernel to use it.

  This is safe as the booting kernel has the mailbox address cached
  already and acpi_wakeup_cpu() uses the cached value to bring up the
  secondary CPUs.

  Note: This is a Linux specific convention and not covered by the
        ACPI specification.

Hmm?

> +	/*
> +	 * ACPI MADT doesn't allow to offline CPU after it got woke up.

to offline a CPU after it was onlined.

> +	 * It limits kexec: the second kernel won't be able to use more than

           This limits kexec: ...

> +	 * one CPU.
> +	 *
> +	 * Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
> +	 * The acpi_wakeup_cpu() will use it to bring up secondary cpus.
> +	 *
> +	 * Zero out mailbox address in the ACPI MADT wakeup structure to
> +	 * indicate that the mailbox is not usable.  This prevents the
> +	 * kexec()-ed kernel from reading a vaild mailbox, which in turn
> +	 * makes the kexec()-ed kernel only be able to use the boot CPU.
> +	 *
> +	 * This is Linux-specific protocol and not reflected in ACPI spec.

See changelog comment...

> +	 */
> +	mp_wake->mailbox_address = 0;
> +
>  	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
>  
>  	return 0;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-12-05  0:45 ` [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
  2023-12-05 23:36   ` Huang, Kai
@ 2023-12-15 20:29   ` Thomas Gleixner
  2023-12-22 16:34     ` Kirill A. Shutemov
  1 sibling, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2023-12-15 20:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Tue, Dec 05 2023 at 03:45, Kirill A. Shutemov wrote:

> MADT Multiprocessor Wakeup structure version 1 brings support of CPU
> offlining: BIOS provides a reset vector where the CPU has to jump to
> offline itself.

CPU has to jump to for offlining itself.

> The new TEST mailbox command can be used to test the CPU offlined
> successfully and BIOS has control over it.

test whether the CPU offlined itself which means the BIOS has control
over the CPU and can online it again via the ACPI MADT wakeup method.

> Add CPU offling support for ACPI MADT wakeup method by implementing

for the ACPI

> custom cpu_die, play_dead and stop_other_cpus SMP operations.

cpu_die(), play_dead() ...

> CPU offlining makes is possible to hand over secondary CPUs over kexec,
> not limiting the second kernel to single CPU.

to a single CPU.

> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4fab2ed454f3..3c8efba86d5c 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -38,6 +38,7 @@ struct smp_ops {
>  	int (*cpu_disable)(void);
>  	void (*cpu_die)(unsigned int cpu);
>  	void (*play_dead)(void);
> +	void (*crash_play_dead)(void);

This new callback and the callsite change wants to be introduced in a
preparatory patch. This one is doing too many things at once, really.
  
> diff --git a/arch/x86/kernel/acpi/madt_playdead.S b/arch/x86/kernel/acpi/madt_playdead.S
> new file mode 100644
> index 000000000000..68f83865a1e3
> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt_playdead.S
> @@ -0,0 +1,21 @@
> +#include <linux/linkage.h>
> +#include <asm/nospec-branch.h>
> +#include <asm/page_types.h>
> +#include <asm/processor-flags.h>
> +
> +	.text
> +	.align PAGE_SIZE

Newline please

Please document what the register arguments to this function are.

> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> +	/* Turn off global entries. Following CR3 write will flush them. */
> +	movq	%cr4, %rdx
> +	andq	$~(X86_CR4_PGE), %rdx
> +	movq	%rdx, %cr4
> +
> +	/* Switch to identity mapping */
> +	movq	%rsi, %rax
> +	movq	%rax, %cr3
> +
> +	/* Jump to reset vector */
> +	ANNOTATE_RETPOLINE_SAFE
> +	jmp	*%rdi
> +SYM_FUNC_END(asm_acpi_mp_play_dead)

> +static u64 acpi_mp_pgd __ro_after_init;
> +static u64 acpi_mp_reset_vector_paddr __ro_after_init;
> +
> +void asm_acpi_mp_play_dead(u64 reset_vector, u64 pgd_pa);

Declarations want to be in a header file.

> +static void crash_acpi_mp_play_dead(void)
> +{
> +	asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr,
> +			      acpi_mp_pgd);

Pointless line break.

> +}
> +
> +static void acpi_mp_play_dead(void)
> +{
> +	play_dead_common();
> +	asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr,
> +			      acpi_mp_pgd);

Ditto.

> +}
> +
> +static void acpi_mp_cpu_die(unsigned int cpu)
> +{
> +	u32 apicid = per_cpu(x86_cpu_to_apicid, cpu);
> +	unsigned long timeout;
> +
> +	/*
> +	 * Use TEST mailbox command to prove that BIOS got control over
> +	 * the CPU before declaring it dead.
> +	 *
> +	 * BIOS has to clear 'command' field of the mailbox.
> +	 */
> +	acpi_mp_wake_mailbox->apic_id = apicid;
> +	smp_store_release(&acpi_mp_wake_mailbox->command,
> +			  ACPI_MP_WAKE_COMMAND_TEST);
> +
> +	/* Don't wait longer than a second. */
> +	timeout = USEC_PER_SEC;
> +	while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
> +		udelay(1);

So this waits and then does nothing if the wait fails. What's the point?

...
<SNIP 170 lines of pagetable muck>

Do we really need this specific hackery or is there some similar
identity mapping muck which can be generalized?

> +	smp_ops.play_dead = acpi_mp_play_dead;
> +	smp_ops.crash_play_dead = crash_acpi_mp_play_dead;
> +	smp_ops.cpu_die = acpi_mp_cpu_die;
> +	smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
> +
> +	acpi_mp_reset_vector_paddr = reset_vector;
> +	acpi_mp_pgd = __pa(pgd);
> +
> +	return 0;
> +}
> +
>  static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
>  {
>  	if (!acpi_mp_wake_mailbox_paddr) {
> @@ -68,37 +299,63 @@ static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
>  	return 0;
>  }
>  
> +static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
> +{
> +	cpu_hotplug_disable_offlining();
> +
> +	/*
> +	 * Zero out mailbox address in the ACPI MADT wakeup structure
> +	 * to indicate that the mailbox is not usable.  This prevents
> +	 * the kexec()-ed kernel from reading a vaild mailbox, which in
> +	 * turn makes the kexec()-ed kernel only be able to use the boot
> +	 * CPU.
> +	 *
> +	 * This is Linux-specific protocol and not reflected in ACPI spec.
> +	 *
> +	 * acpi_mp_wake_mailbox_paddr already has the mailbox address.
> +	 * The acpi_wakeup_cpu() will use it to bring up secondary cpus for
> +	 * the current kernel.
> +	 */
> +	mp_wake->mailbox_address = 0;
> +}

The previous patch could have split this out into a helper already, no?

> +
>  int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  			      const unsigned long end)
>  {
>  	struct acpi_madt_multiproc_wakeup *mp_wake;
>  
>  	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> -	if (BAD_MADT_ENTRY(mp_wake, end))
> +
> +        /*
> +         * Cannot use the standard BAD_MADT_ENTRY() to sanity check the @mp_wake
> +         * entry.  'sizeof (struct acpi_madt_multiproc_wakeup)' can be larger
> +         * than the actual size of the MP wakeup entry in ACPI table because the
> +	 * 'reset_vector' is only available in the V1 MP wakeup structure.
> +         */

The comment is white space damaged. Use tabs everywhere please and not
only in one line.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-12-05 23:36   ` Huang, Kai
@ 2023-12-22 11:19     ` kirill.shutemov
  2023-12-22 11:38       ` Huang, Kai
  0 siblings, 1 reply; 32+ messages in thread
From: kirill.shutemov @ 2023-12-22 11:19 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, linux-kernel,
	sathyanarayanan.kuppuswamy, Hunter, Adrian, thomas.lendacky,
	ashish.kalra, kexec, seanjc, bhe, linux-coco

On Tue, Dec 05, 2023 at 11:36:55PM +0000, Huang, Kai wrote:
> 
> > +
> > +static void acpi_mp_stop_other_cpus(int wait)
> > +{
> > +	smp_shutdown_nonboot_cpus(smp_processor_id());
> > +}
> 
> Is this and ...
> 
> +	smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
> 
> ... this below still needed?
> 
> I think the current native_stop_other_cpus() should just work given you have set
> up ...
> 
> +	smp_ops.crash_play_dead = crash_acpi_mp_play_dead;
> 
> ... for TDX guest?

To make it work stop_this_cpu() would need to be modified to use
smp_ops.crash_play_dead() instead of native_halt(). But name of the
callback doesn't match the function, so I renamed it to
smp_ops.stop_this_cpu().

> Sorry for saying this late.  I think we can also use kernel_ident_mapping_init()
> to do the init_transition_pgtable()?  We can set struct x86_mapping_info::offset
> to __PAGE_OFFSET to do that?
> 
> Looks set_up_temporary_mappings() in arch/x86/power/hibernate_64.c uses the same
> trick.
> 
> Anyway I am not sure how many LoC (assuming can do) can be saved so up to you.

Yeah. Benefit is not clear to me. I will leave it as is.


> 
> It's a little bit sad such cleanup code isn't in common code, e.g., with a 
> 
> 	void (*free_pgt_page)(void *);
> 
> to allow the user to specify how to free the page table.
> 
> But this can be future job if needed.

I will consider moving this cleanup in common code. And maybe fix other
users of kernel_ident_mapping_init(). Nobody seems to care to cleanup page
tables on ENOMEM.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-12-22 11:19     ` kirill.shutemov
@ 2023-12-22 11:38       ` Huang, Kai
  0 siblings, 0 replies; 32+ messages in thread
From: Huang, Kai @ 2023-12-22 11:38 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kexec, linux-coco, ashish.kalra, dave.hansen, thomas.lendacky,
	Hunter, Adrian, Reshetova, Elena, linux-kernel, mingo, seanjc,
	tglx, bhe, Nakajima, Jun, peterz, bp, Edgecombe, Rick P, rafael,
	sathyanarayanan.kuppuswamy, x86

On Fri, 2023-12-22 at 14:19 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Dec 05, 2023 at 11:36:55PM +0000, Huang, Kai wrote:
> > 
> > > +
> > > +static void acpi_mp_stop_other_cpus(int wait)
> > > +{
> > > +	smp_shutdown_nonboot_cpus(smp_processor_id());
> > > +}
> > 
> > Is this and ...
> > 
> > +	smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
> > 
> > ... this below still needed?
> > 
> > I think the current native_stop_other_cpus() should just work given you have set
> > up ...
> > 
> > +	smp_ops.crash_play_dead = crash_acpi_mp_play_dead;
> > 
> > ... for TDX guest?
> 
> To make it work stop_this_cpu() would need to be modified to use
> smp_ops.crash_play_dead() instead of native_halt(). But name of the
> callback doesn't match the function, so I renamed it to
> smp_ops.stop_this_cpu().

Seems reasonable to me.  Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-12-15 20:29   ` Thomas Gleixner
@ 2023-12-22 16:34     ` Kirill A. Shutemov
  0 siblings, 0 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2023-12-22 16:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Baoquan He, kexec, linux-coco, linux-kernel

On Fri, Dec 15, 2023 at 09:29:13PM +0100, Thomas Gleixner wrote:
> So this waits and then does nothing if the wait fails. What's the point?
> 
> ...
> <SNIP 170 lines of pagetable muck>
> 
> Do we really need this specific hackery or is there some similar
> identity mapping muck which can be generalized?

I've addressed all your feedback, but this gave me pause. Looks like none
of kernel_ident_mapping_init() users frees memory on failure.

Is it okay to get this part as is and I will follow up with patchset that
fixes memory handling for all kernel_ident_mapping_init() users?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-12-13 17:22     ` Sean Christopherson
@ 2024-01-04 15:05       ` Kirill A. Shutemov
  2024-01-09 14:59         ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2024-01-04 15:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
	Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
	Jun Nakajima, Rick Edgecombe, Tom Lendacky, Ashish Kalra,
	Kai Huang, Baoquan He, kexec, linux-coco, linux-kernel,
	Vitaly Kuznetsov, Wanpeng Li

On Wed, Dec 13, 2023 at 09:22:34AM -0800, Sean Christopherson wrote:
> On Tue, Dec 12, 2023, Kirill A. Shutemov wrote:
> > On Tue, Dec 05, 2023 at 03:45:01AM +0300, Kirill A. Shutemov wrote:
> > > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > > present in the VM. It leads to write to a MSR that doesn't exist on some
> > > configurations, namely in TDX guest:
> > > 
> > > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > > 
> > > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > > features.
> > > 
> > > Do not disable kvmclock if it was not enabled.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> > > Reviewed-by: Sean Christopherson <seanjc@google.com>
> > > Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Wanpeng Li <wanpengli@tencent.com>
> > 
> > Paolo, Sean, any chance you can get it in through KVM tree while the rest
> > of kexec patchset is pending? The problem is visible on normal reboot too.
> 
> Paolo is going to grab this (possibly for 6.7-rc?).  I'll keep this tagged on my
> end in case that doesn't happen "soon".

Sean, any update on this?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2024-01-04 15:05       ` Kirill A. Shutemov
@ 2024-01-09 14:59         ` Sean Christopherson
  0 siblings, 0 replies; 32+ messages in thread
From: Sean Christopherson @ 2024-01-09 14:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
	Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
	Jun Nakajima, Rick Edgecombe, Tom Lendacky, Ashish Kalra,
	Kai Huang, Baoquan He, kexec, linux-coco, linux-kernel,
	Vitaly Kuznetsov, Wanpeng Li

On Thu, Jan 04, 2024, Kirill A. Shutemov wrote:
> On Wed, Dec 13, 2023 at 09:22:34AM -0800, Sean Christopherson wrote:
> > On Tue, Dec 12, 2023, Kirill A. Shutemov wrote:
> > > On Tue, Dec 05, 2023 at 03:45:01AM +0300, Kirill A. Shutemov wrote:
> > > > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > > > present in the VM. It leads to write to a MSR that doesn't exist on some
> > > > configurations, namely in TDX guest:
> > > > 
> > > > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > > > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > > > 
> > > > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > > > features.
> > > > 
> > > > Do not disable kvmclock if it was not enabled.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> > > > Reviewed-by: Sean Christopherson <seanjc@google.com>
> > > > Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > Cc: Wanpeng Li <wanpengli@tencent.com>
> > > 
> > > Paolo, Sean, any chance you can get it in through KVM tree while the rest
> > > of kexec patchset is pending? The problem is visible on normal reboot too.
> > 
> > Paolo is going to grab this (possibly for 6.7-rc?).  I'll keep this tagged on my
> > end in case that doesn't happen "soon".
> 
> Sean, any update on this?

'Tis now in kvm/next, commit 1c6d984f523f ("x86/kvm: Do not try to disable kvmclock
if it was not enabled").  The one time procrastinating on responding actually worked. ;-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-01-09 14:59 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-05  0:44 [PATCHv4 00/14] x86/tdx: Add kexec support Kirill A. Shutemov
2023-12-05  0:44 ` [PATCHv4 01/14] x86/acpi: Extract ACPI MADT wakeup code into a separate file Kirill A. Shutemov
2023-12-05  0:44 ` [PATCHv4 02/14] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init Kirill A. Shutemov
2023-12-05  0:44 ` [PATCHv4 03/14] cpu/hotplug: Add support for declaring CPU offlining not supported Kirill A. Shutemov
2023-12-15 19:42   ` Thomas Gleixner
2023-12-05  0:45 ` [PATCHv4 04/14] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup Kirill A. Shutemov
2023-12-15 19:43   ` Thomas Gleixner
2023-12-05  0:45 ` [PATCHv4 05/14] x86/kvm: Do not try to disable kvmclock if it was not enabled Kirill A. Shutemov
2023-12-11 23:10   ` Kirill A. Shutemov
2023-12-13 17:22     ` Sean Christopherson
2024-01-04 15:05       ` Kirill A. Shutemov
2024-01-09 14:59         ` Sean Christopherson
2023-12-05  0:45 ` [PATCHv4 06/14] x86/kexec: Keep CR4.MCE set during kexec for TDX guest Kirill A. Shutemov
2023-12-05 23:58   ` Huang, Kai
2023-12-06 13:26     ` kirill.shutemov
2023-12-05  0:45 ` [PATCHv4 07/14] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno Kirill A. Shutemov
2023-12-05  0:45 ` [PATCHv4 08/14] x86/mm: Return correct level from lookup_address() if pte is none Kirill A. Shutemov
2023-12-05  0:45 ` [PATCHv4 09/14] x86/tdx: Account shared memory Kirill A. Shutemov
2023-12-05  0:45 ` [PATCHv4 10/14] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
2023-12-06  1:28   ` Edgecombe, Rick P
2023-12-06 15:07     ` kirill.shutemov
2023-12-06 18:32       ` Edgecombe, Rick P
2023-12-05  0:45 ` [PATCHv4 11/14] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges Kirill A. Shutemov
2023-12-05  0:45 ` [PATCHv4 12/14] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure Kirill A. Shutemov
2023-12-05  0:45 ` [PATCHv4 13/14] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case Kirill A. Shutemov
2023-12-15 20:08   ` Thomas Gleixner
2023-12-05  0:45 ` [PATCHv4 14/14] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
2023-12-05 23:36   ` Huang, Kai
2023-12-22 11:19     ` kirill.shutemov
2023-12-22 11:38       ` Huang, Kai
2023-12-15 20:29   ` Thomas Gleixner
2023-12-22 16:34     ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).