All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/13] x86/tdx: Add kexec support
@ 2023-10-05 13:13 ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The patchset adds bits and pieces to get kexec (and crashkernel) work on
TDX guest.

They bring kexec support to the point when we can start the new kernel,
but it will only be able to use single CPU. It should be enough to cover
the most common case: crashkernel.

The last patch implements CPU offlining according to the approved ACPI
spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
kernel.

Please review. I would be glad for any feedback.

[1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher

Kirill A. Shutemov (13):
  x86/acpi: Extract ACPI MADT wakeup code into a separate file
  kernel/cpu: Add support for declaring CPU hotplug not supported
  cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
  x86/mm: Return correct level from lookup_address() if pte is none
  KVM: x86: Add config option to gate emergency virt callback support
  x86/tdx: Account shared memory
  x86/tdx: Convert shared memory back to private on kexec
  x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
  x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

 arch/x86/Kconfig                     |   8 +
 arch/x86/coco/core.c                 |   1 -
 arch/x86/coco/tdx/kexec.c            |   0
 arch/x86/coco/tdx/tdx.c              | 220 +++++++++++++++++++++-
 arch/x86/hyperv/ivm.c                |   9 +-
 arch/x86/include/asm/acpi.h          |   5 +
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/include/asm/reboot.h        |   4 +-
 arch/x86/include/asm/x86_init.h      |   4 +-
 arch/x86/kernel/acpi/Makefile        |  11 +-
 arch/x86/kernel/acpi/boot.c          |  88 +--------
 arch/x86/kernel/acpi/madt.S          |  28 +++
 arch/x86/kernel/acpi/madt_wakeup.c   | 262 +++++++++++++++++++++++++++
 arch/x86/kernel/e820.c               |   9 +-
 arch/x86/kernel/kvmclock.c           |   9 +-
 arch/x86/kernel/reboot.c             |   4 +-
 arch/x86/kernel/relocate_kernel_64.S |   5 +
 arch/x86/kernel/x86_init.c           |   4 +-
 arch/x86/kvm/Kconfig                 |   5 +
 arch/x86/mm/mem_encrypt_amd.c        |   8 +-
 arch/x86/mm/pat/set_memory.c         |  17 +-
 include/acpi/actbl2.h                |  19 +-
 include/linux/cc_platform.h          |  10 -
 include/linux/cpu.h                  |   2 +
 kernel/cpu.c                         |  17 +-
 25 files changed, 604 insertions(+), 146 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c
 create mode 100644 arch/x86/kernel/acpi/madt.S
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH 00/13] x86/tdx: Add kexec support
@ 2023-10-05 13:13 ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The patchset adds bits and pieces to get kexec (and crashkernel) work on
TDX guest.

They bring kexec support to the point when we can start the new kernel,
but it will only be able to use single CPU. It should be enough to cover
the most common case: crashkernel.

The last patch implements CPU offlining according to the approved ACPI
spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
kernel.

Please review. I would be glad for any feedback.

[1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher

Kirill A. Shutemov (13):
  x86/acpi: Extract ACPI MADT wakeup code into a separate file
  kernel/cpu: Add support for declaring CPU hotplug not supported
  cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
  x86/mm: Return correct level from lookup_address() if pte is none
  KVM: x86: Add config option to gate emergency virt callback support
  x86/tdx: Account shared memory
  x86/tdx: Convert shared memory back to private on kexec
  x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
  x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

 arch/x86/Kconfig                     |   8 +
 arch/x86/coco/core.c                 |   1 -
 arch/x86/coco/tdx/kexec.c            |   0
 arch/x86/coco/tdx/tdx.c              | 220 +++++++++++++++++++++-
 arch/x86/hyperv/ivm.c                |   9 +-
 arch/x86/include/asm/acpi.h          |   5 +
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/include/asm/reboot.h        |   4 +-
 arch/x86/include/asm/x86_init.h      |   4 +-
 arch/x86/kernel/acpi/Makefile        |  11 +-
 arch/x86/kernel/acpi/boot.c          |  88 +--------
 arch/x86/kernel/acpi/madt.S          |  28 +++
 arch/x86/kernel/acpi/madt_wakeup.c   | 262 +++++++++++++++++++++++++++
 arch/x86/kernel/e820.c               |   9 +-
 arch/x86/kernel/kvmclock.c           |   9 +-
 arch/x86/kernel/reboot.c             |   4 +-
 arch/x86/kernel/relocate_kernel_64.S |   5 +
 arch/x86/kernel/x86_init.c           |   4 +-
 arch/x86/kvm/Kconfig                 |   5 +
 arch/x86/mm/mem_encrypt_amd.c        |   8 +-
 arch/x86/mm/pat/set_memory.c         |  17 +-
 include/acpi/actbl2.h                |  19 +-
 include/linux/cc_platform.h          |  10 -
 include/linux/cpu.h                  |   2 +
 kernel/cpu.c                         |  17 +-
 25 files changed, 604 insertions(+), 146 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c
 create mode 100644 arch/x86/kernel/acpi/madt.S
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

-- 
2.41.0


^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

In order to prepare for the expansion of support for the ACPI MADT
wakeup method, the relevant code has been moved into a separate file.
A new configuration option has been introduced to clearly indicate
dependencies without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                   |  7 +++
 arch/x86/include/asm/acpi.h        |  5 ++
 arch/x86/kernel/acpi/Makefile      | 11 ++--
 arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
 arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
 5 files changed, 99 insertions(+), 90 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3154dbc49cf5..7368d254d01f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
 	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
 	select IRQ_DOMAIN_HIERARCHY
 
+config X86_ACPI_MADT_WAKEUP
+	def_bool y
+	depends on X86_64
+	depends on ACPI
+	depends on SMP
+	depends on X86_LOCAL_APIC
+
 config X86_IO_APIC
 	def_bool y
 	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index c8a7fc23f63c..b536b5a6a57b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
 
 #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
 
+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end);
+
 /*
  * Check if the CPU can handle C2 and deeper
  */
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..8c7329c88a75 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,11 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_ACPI)		+= boot.o
-obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
-obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
+obj-$(CONFIG_ACPI)			+= boot.o
+obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI)			+= apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
-obj-y				+= cstate.o
+obj-y					+= cstate.o
 endif
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2a0ea38955df..111bd226ad99 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -66,13 +66,6 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 static bool acpi_support_online_capable;
 #endif
 
-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -357,60 +350,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
-{
-	/*
-	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
-	 *
-	 * Wakeup of secondary CPUs is fully serialized in the core code.
-	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
-	 */
-	if (!acpi_mp_wake_mailbox) {
-		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
-						sizeof(*acpi_mp_wake_mailbox),
-						MEMREMAP_WB);
-	}
-
-	/*
-	 * Mailbox memory is shared between the firmware and OS. Firmware will
-	 * listen on mailbox command address, and once it receives the wakeup
-	 * command, the CPU associated with the given apicid will be booted.
-	 *
-	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
-	 * firmware before the wakeup command is visible.  smp_store_release()
-	 * ensures ordering and visibility.
-	 */
-	acpi_mp_wake_mailbox->apic_id	    = apicid;
-	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
-	smp_store_release(&acpi_mp_wake_mailbox->command,
-			  ACPI_MP_WAKE_COMMAND_WAKEUP);
-
-	/*
-	 * Wait for the CPU to wake up.
-	 *
-	 * The CPU being woken up is essentially in a spin loop waiting to be
-	 * woken up. It should not take long for it wake up and acknowledge by
-	 * zeroing out ->command.
-	 *
-	 * ACPI specification doesn't provide any guidance on how long kernel
-	 * has to wait for a wake up acknowledgement. It also doesn't provide
-	 * a way to cancel a wake up request if it takes too long.
-	 *
-	 * In TDX environment, the VMM has control over how long it takes to
-	 * wake up secondary. It can postpone scheduling secondary vCPU
-	 * indefinitely. Giving up on wake up request and reporting error opens
-	 * possible attack vector for VMM: it can wake up a secondary CPU when
-	 * kernel doesn't expect it. Wait until positive result of the wake up
-	 * request.
-	 */
-	while (READ_ONCE(acpi_mp_wake_mailbox->command))
-		cpu_relax();
-
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1160,29 +1099,6 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
-				     const unsigned long end)
-{
-	struct acpi_madt_multiproc_wakeup *mp_wake;
-
-	if (!IS_ENABLED(CONFIG_SMP))
-		return -ENODEV;
-
-	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
-	if (BAD_MADT_ENTRY(mp_wake, end))
-		return -EINVAL;
-
-	acpi_table_print_madt_entry(&header->common);
-
-	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
-
-	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
-
-	return 0;
-}
-#endif				/* CONFIG_X86_64 */
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1379,7 +1295,7 @@ static void __init acpi_process_madt(void)
 				smp_found_config = 1;
 			}
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_ACPI_MADT_WAKEUP
 			/*
 			 * Parse MADT MP Wake entry.
 			 */
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..1b9747bfd5b9
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,80 @@
+#include <linux/acpi.h>
+#include <asm/apic.h>
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	/*
+	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+	 *
+	 * Wakeup of secondary CPUs is fully serialized in the core code.
+	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+	 */
+	if (!acpi_mp_wake_mailbox) {
+		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+						sizeof(*acpi_mp_wake_mailbox),
+						MEMREMAP_WB);
+	}
+
+	/*
+	 * Mailbox memory is shared between the firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, the CPU associated with the given apicid will be booted.
+	 *
+	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
+	 * firmware before the wakeup command is visible.  smp_store_release()
+	 * ensures ordering and visibility.
+	 */
+	acpi_mp_wake_mailbox->apic_id	    = apicid;
+	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * Wait for the CPU to wake up.
+	 *
+	 * The CPU being woken up is essentially in a spin loop waiting to be
+	 * woken up. It should not take long for it wake up and acknowledge by
+	 * zeroing out ->command.
+	 *
+	 * ACPI specification doesn't provide any guidance on how long kernel
+	 * has to wait for a wake up acknowledgement. It also doesn't provide
+	 * a way to cancel a wake up request if it takes too long.
+	 *
+	 * In TDX environment, the VMM has control over how long it takes to
+	 * wake up secondary. It can postpone scheduling secondary vCPU
+	 * indefinitely. Giving up on wake up request and reporting error opens
+	 * possible attack vector for VMM: it can wake up a secondary CPU when
+	 * kernel doesn't expect it. Wait until positive result of the wake up
+	 * request.
+	 */
+	while (READ_ONCE(acpi_mp_wake_mailbox->command))
+		cpu_relax();
+
+	return 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+	return 0;
+}
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

In order to prepare for the expansion of support for the ACPI MADT
wakeup method, the relevant code has been moved into a separate file.
A new configuration option has been introduced to clearly indicate
dependencies without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                   |  7 +++
 arch/x86/include/asm/acpi.h        |  5 ++
 arch/x86/kernel/acpi/Makefile      | 11 ++--
 arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
 arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
 5 files changed, 99 insertions(+), 90 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3154dbc49cf5..7368d254d01f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
 	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
 	select IRQ_DOMAIN_HIERARCHY
 
+config X86_ACPI_MADT_WAKEUP
+	def_bool y
+	depends on X86_64
+	depends on ACPI
+	depends on SMP
+	depends on X86_LOCAL_APIC
+
 config X86_IO_APIC
 	def_bool y
 	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index c8a7fc23f63c..b536b5a6a57b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
 
 #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
 
+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end);
+
 /*
  * Check if the CPU can handle C2 and deeper
  */
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..8c7329c88a75 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,11 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_ACPI)		+= boot.o
-obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
-obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
+obj-$(CONFIG_ACPI)			+= boot.o
+obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI)			+= apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
-obj-y				+= cstate.o
+obj-y					+= cstate.o
 endif
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2a0ea38955df..111bd226ad99 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -66,13 +66,6 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 static bool acpi_support_online_capable;
 #endif
 
-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -357,60 +350,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
-{
-	/*
-	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
-	 *
-	 * Wakeup of secondary CPUs is fully serialized in the core code.
-	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
-	 */
-	if (!acpi_mp_wake_mailbox) {
-		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
-						sizeof(*acpi_mp_wake_mailbox),
-						MEMREMAP_WB);
-	}
-
-	/*
-	 * Mailbox memory is shared between the firmware and OS. Firmware will
-	 * listen on mailbox command address, and once it receives the wakeup
-	 * command, the CPU associated with the given apicid will be booted.
-	 *
-	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
-	 * firmware before the wakeup command is visible.  smp_store_release()
-	 * ensures ordering and visibility.
-	 */
-	acpi_mp_wake_mailbox->apic_id	    = apicid;
-	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
-	smp_store_release(&acpi_mp_wake_mailbox->command,
-			  ACPI_MP_WAKE_COMMAND_WAKEUP);
-
-	/*
-	 * Wait for the CPU to wake up.
-	 *
-	 * The CPU being woken up is essentially in a spin loop waiting to be
-	 * woken up. It should not take long for it wake up and acknowledge by
-	 * zeroing out ->command.
-	 *
-	 * ACPI specification doesn't provide any guidance on how long kernel
-	 * has to wait for a wake up acknowledgement. It also doesn't provide
-	 * a way to cancel a wake up request if it takes too long.
-	 *
-	 * In TDX environment, the VMM has control over how long it takes to
-	 * wake up secondary. It can postpone scheduling secondary vCPU
-	 * indefinitely. Giving up on wake up request and reporting error opens
-	 * possible attack vector for VMM: it can wake up a secondary CPU when
-	 * kernel doesn't expect it. Wait until positive result of the wake up
-	 * request.
-	 */
-	while (READ_ONCE(acpi_mp_wake_mailbox->command))
-		cpu_relax();
-
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1160,29 +1099,6 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
-
-#ifdef CONFIG_X86_64
-static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
-				     const unsigned long end)
-{
-	struct acpi_madt_multiproc_wakeup *mp_wake;
-
-	if (!IS_ENABLED(CONFIG_SMP))
-		return -ENODEV;
-
-	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
-	if (BAD_MADT_ENTRY(mp_wake, end))
-		return -EINVAL;
-
-	acpi_table_print_madt_entry(&header->common);
-
-	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
-
-	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
-
-	return 0;
-}
-#endif				/* CONFIG_X86_64 */
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1379,7 +1295,7 @@ static void __init acpi_process_madt(void)
 				smp_found_config = 1;
 			}
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_ACPI_MADT_WAKEUP
 			/*
 			 * Parse MADT MP Wake entry.
 			 */
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..1b9747bfd5b9
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,80 @@
+#include <linux/acpi.h>
+#include <asm/apic.h>
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	/*
+	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+	 *
+	 * Wakeup of secondary CPUs is fully serialized in the core code.
+	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+	 */
+	if (!acpi_mp_wake_mailbox) {
+		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+						sizeof(*acpi_mp_wake_mailbox),
+						MEMREMAP_WB);
+	}
+
+	/*
+	 * Mailbox memory is shared between the firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, the CPU associated with the given apicid will be booted.
+	 *
+	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
+	 * firmware before the wakeup command is visible.  smp_store_release()
+	 * ensures ordering and visibility.
+	 */
+	acpi_mp_wake_mailbox->apic_id	    = apicid;
+	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * Wait for the CPU to wake up.
+	 *
+	 * The CPU being woken up is essentially in a spin loop waiting to be
+	 * woken up. It should not take long for it wake up and acknowledge by
+	 * zeroing out ->command.
+	 *
+	 * ACPI specification doesn't provide any guidance on how long kernel
+	 * has to wait for a wake up acknowledgement. It also doesn't provide
+	 * a way to cancel a wake up request if it takes too long.
+	 *
+	 * In TDX environment, the VMM has control over how long it takes to
+	 * wake up secondary. It can postpone scheduling secondary vCPU
+	 * indefinitely. Giving up on wake up request and reporting error opens
+	 * possible attack vector for VMM: it can wake up a secondary CPU when
+	 * kernel doesn't expect it. Wait until positive result of the wake up
+	 * request.
+	 */
+	while (READ_ONCE(acpi_mp_wake_mailbox->command))
+		cpu_relax();
+
+	return 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+			      const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+	return 0;
+}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
  2023-10-05 13:13 ` Kirill A. Shutemov
  (?)
  (?)
@ 2023-10-05 13:13 ` Kirill A. Shutemov
  2023-10-10 13:35     ` Kuppuswamy Sathyanarayanan
  2023-10-11 13:08     ` Thomas Gleixner
  -1 siblings, 2 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The function cpu_hotplug_not_supported() can be called to indicate that
CPU hotplug should be disabled. It does not prevent the initial bring up
of the CPU, but it stops subsequent offlining.

This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/cpu.h |  2 ++
 kernel/cpu.c        | 17 ++++++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index f19f56501809..aab3887cadbc 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
 extern void cpus_read_unlock(void);
 extern int  cpus_read_trylock(void);
 extern void lockdep_assert_cpus_held(void);
+extern void cpu_hotplug_not_supported(void);
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 void clear_tasks_mm_cpumask(int cpu);
@@ -147,6 +148,7 @@ static inline void cpus_read_lock(void) { }
 static inline void cpus_read_unlock(void) { }
 static inline int  cpus_read_trylock(void) { return true; }
 static inline void lockdep_assert_cpus_held(void) { }
+static inline void cpu_hotplug_not_supported(void) { }
 static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6de7c6bb74ee..cf536fe1a88a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -484,6 +484,9 @@ static int cpu_hotplug_disabled;
 
 DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
 
+/* Cleared if platform declares CPU hotplug not supported */
+static bool cpu_hotplug_supported = true;
+
 void cpus_read_lock(void)
 {
 	percpu_down_read(&cpu_hotplug_lock);
@@ -543,6 +546,18 @@ static void lockdep_release_cpus_lock(void)
 	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
 }
 
+/*
+ * Declare CPU hotplug not supported.
+ *
+ * It doesn't prevent initial bring up of the CPU, but stops offlining.
+ */
+void cpu_hotplug_not_supported(void)
+{
+	cpu_maps_update_begin();
+	cpu_hotplug_supported = false;
+	cpu_maps_update_done();
+}
+
 /*
  * Wait for currently running CPU hotplug operations to complete (if any) and
  * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
@@ -1507,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

ACPI MADT doesn't allow to offline CPU after it got woke up.

Currently hotplug prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible
user of the wake up method.

Mark CPU hotplug as "not supported" on ACPI MADT wakeup enumeration.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/core.c               |  1 -
 arch/x86/kernel/acpi/madt_wakeup.c |  4 ++++
 include/linux/cc_platform.h        | 10 ----------
 kernel/cpu.c                       |  2 +-
 4 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..f07c3bb7deab 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
-	case CC_ATTR_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
 	case CC_ATTR_MEM_ENCRYPT:
 		return true;
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 1b9747bfd5b9..15bdf10b1393 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,4 +1,5 @@
 #include <linux/acpi.h>
+#include <linux/cpu.h>
 #include <asm/apic.h>
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
@@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
 
+	/* Disable CPU onlining/offlining */
+	cpu_hotplug_not_supported();
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..d08dd65b5c43 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -80,16 +80,6 @@ enum cc_attr {
 	 * using AMD SEV-SNP features.
 	 */
 	CC_ATTR_GUEST_SEV_SNP,
-
-	/**
-	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
-	 *
-	 * The platform/OS is running as a guest/virtual machine does not
-	 * support CPU hotplug feature.
-	 *
-	 * Examples include TDX Guest.
-	 */
-	CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index cf536fe1a88a..9d4279476b40 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
+	if (!cpu_hotplug_supported)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

ACPI MADT doesn't allow to offline CPU after it got woke up.

Currently hotplug prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible
user of the wake up method.

Mark CPU hotplug as "not supported" on ACPI MADT wakeup enumeration.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/core.c               |  1 -
 arch/x86/kernel/acpi/madt_wakeup.c |  4 ++++
 include/linux/cc_platform.h        | 10 ----------
 kernel/cpu.c                       |  2 +-
 4 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..f07c3bb7deab 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
-	case CC_ATTR_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
 	case CC_ATTR_MEM_ENCRYPT:
 		return true;
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 1b9747bfd5b9..15bdf10b1393 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,4 +1,5 @@
 #include <linux/acpi.h>
+#include <linux/cpu.h>
 #include <asm/apic.h>
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
@@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
 
+	/* Disable CPU onlining/offlining */
+	cpu_hotplug_not_supported();
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..d08dd65b5c43 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -80,16 +80,6 @@ enum cc_attr {
 	 * using AMD SEV-SNP features.
 	 */
 	CC_ATTR_GUEST_SEV_SNP,
-
-	/**
-	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
-	 *
-	 * The platform/OS is running as a guest/virtual machine does not
-	 * support CPU hotplug feature.
-	 *
-	 * Examples include TDX Guest.
-	 */
-	CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index cf536fe1a88a..9d4279476b40 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
+	if (!cpu_hotplug_supported)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
present in the VM. It leads to write to a MSR that doesn't exist on some
configurations, namely in TDX guest:

	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)

kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
features.

Do not disable kvmclock if it was not enumerated or disabled by user
from kernel command line.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
---
 arch/x86/kernel/kvmclock.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..cba2e732e53f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -22,7 +22,7 @@
 #include <asm/x86_init.h>
 #include <asm/kvmclock.h>
 
-static int kvmclock __initdata = 1;
+static int kvmclock __ro_after_init = 1;
 static int kvmclock_vsyscall __initdata = 1;
 static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
@@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-	native_write_msr(msr_kvm_system_time, 0, 0);
+	if (!kvm_para_available() || !kvmclock)
+		return;
+
+	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
+	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
+		native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
present in the VM. It leads to write to a MSR that doesn't exist on some
configurations, namely in TDX guest:

	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)

kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
features.

Do not disable kvmclock if it was not enumerated or disabled by user
from kernel command line.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
---
 arch/x86/kernel/kvmclock.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..cba2e732e53f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -22,7 +22,7 @@
 #include <asm/x86_init.h>
 #include <asm/kvmclock.h>
 
-static int kvmclock __initdata = 1;
+static int kvmclock __ro_after_init = 1;
 static int kvmclock_vsyscall __initdata = 1;
 static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
@@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-	native_write_msr(msr_kvm_system_time, 0, 0);
+	if (!kvm_para_available() || !kvmclock)
+		return;
+
+	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
+	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
+		native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
to #VE.

Use alternatives to keep the flag during kexec for TDX guests.

The change doesn't affect non-TDX environments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/relocate_kernel_64.S | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..bea89814b48e 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	 * Set cr4 to a known state:
 	 *  - physical address extension enabled
 	 *  - 5-level paging, if it was enabled before
+	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
+	 *    in TDX guests.
 	 */
 	movl	$X86_CR4_PAE, %eax
 	testq	$X86_CR4_LA57, %r13
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
+1:
+	ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
+	orl	$X86_CR4_MCE, %eax
 1:
 	movq	%rax, %cr4
 
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
to #VE.

Use alternatives to keep the flag during kexec for TDX guests.

The change doesn't affect non-TDX environments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/relocate_kernel_64.S | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..bea89814b48e 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	 * Set cr4 to a known state:
 	 *  - physical address extension enabled
 	 *  - 5-level paging, if it was enabled before
+	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
+	 *    in TDX guests.
 	 */
 	movl	$X86_CR4_PAE, %eax
 	testq	$X86_CR4_LA57, %r13
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
+1:
+	ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
+	orl	$X86_CR4_MCE, %eax
 1:
 	movq	%rax, %cr4
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 06/13] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
  2023-10-05 13:13 ` Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  (?)
@ 2023-10-05 13:13 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX is going to have more than one reason to fail
enc_status_change_prepare().

Change the callback to return errno instead of assuming -EIO;
enc_status_change_finish() changed too to keep the interface symmetric.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c         | 20 +++++++++++---------
 arch/x86/hyperv/ivm.c           |  9 +++------
 arch/x86/include/asm/x86_init.h |  4 ++--
 arch/x86/kernel/x86_init.c      |  4 ++--
 arch/x86/mm/mem_encrypt_amd.c   |  8 ++++----
 arch/x86/mm/pat/set_memory.c    |  9 +++++----
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3e6dbd2199cf..46022283d955 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -776,28 +776,30 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	return true;
 }
 
-static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
-					  bool enc)
+static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+					 bool enc)
 {
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc)
-		return tdx_enc_status_changed(vaddr, numpages, enc);
-	return true;
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+		return -EIO;
+
+	return 0;
 }
 
-static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 					 bool enc)
 {
 	/*
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc)
-		return tdx_enc_status_changed(vaddr, numpages, enc);
-	return true;
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+		return -EIO;
+
+	return 0;
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index c1088d3661d5..7d2241059d49 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -510,13 +510,12 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
  * with host. This function works as wrap of hv_mark_gpa_visibility()
  * with memory base and size.
  */
-static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
+static int hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
 {
 	enum hv_mem_host_visibility visibility = enc ?
 			VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
 	u64 *pfn_array;
 	int ret = 0;
-	bool result = true;
 	int i, pfn;
 
 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
@@ -530,17 +529,15 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
 			ret = hv_mark_gpa_visibility(pfn, pfn_array,
 						     visibility);
-			if (ret) {
-				result = false;
+			if (ret)
 				goto err_free_pfn_array;
-			}
 			pfn = 0;
 		}
 	}
 
  err_free_pfn_array:
 	kfree(pfn_array);
-	return result;
+	return ret;
 }
 
 static bool hv_vtom_tlb_flush_required(bool private)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5240d88db52a..5031cbc6e211 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,8 +150,8 @@ struct x86_init_acpi {
  * @enc_cache_flush_required	Returns true if a cache flush is needed before changing page encryption status
  */
 struct x86_guest {
-	bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
-	bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
+	int (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..f0f54e109eb9 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -131,8 +131,8 @@ struct x86_cpuinit_ops x86_cpuinit = {
 
 static void default_nmi_init(void) { };
 
-static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
+static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
 static bool is_private_mmio_noop(u64 addr) {return false; }
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 6faea41e99b6..9cbdfbf8cd45 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -318,7 +318,7 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
 #endif
 }
 
-static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * To maintain the security guarantees of SEV-SNP guests, make sure
@@ -327,11 +327,11 @@ static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
 	if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
 		snp_set_memory_shared(vaddr, npages);
 
-	return true;
+	return 0;
 }
 
 /* Return true unconditionally: return value doesn't matter for the SEV side */
-static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * After memory is mapped encrypted in the page table, validate it
@@ -343,7 +343,7 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
 	if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
 		enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
 
-	return true;
+	return 0;
 }
 
 static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..6fbf22d5fa56 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2152,8 +2152,9 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
 
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
-	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
-		return -EIO;
+	ret = x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
+	if (ret)
+		return ret;
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2168,8 +2169,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 
 	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
 	if (!ret) {
-		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
-			ret = -EIO;
+		ret = x86_platform.guest.enc_status_change_finish(addr,
+								  numpages, enc);
 	}
 
 	return ret;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 07/13] x86/mm: Return correct level from lookup_address() if pte is none
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

lookup_address() only returns correct page table level for the entry if
the entry is not none.

Make the helper to always return correct 'level'. It allows to implement
iterator over kernel page tables using lookup_address().

Add one more entry into enum pg_level to indicate size of VA covered by
one PGD entry in 5-level paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/pat/set_memory.c         | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..3f648ffdfbe5 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -548,6 +548,7 @@ enum pg_level {
 	PG_LEVEL_2M,
 	PG_LEVEL_1G,
 	PG_LEVEL_512G,
+	PG_LEVEL_256T,
 	PG_LEVEL_NUM
 };
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6fbf22d5fa56..01f827eb8e80 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -666,32 +666,32 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 
-	*level = PG_LEVEL_NONE;
+	*level = PG_LEVEL_256T;
 
 	if (pgd_none(*pgd))
 		return NULL;
 
+	*level = PG_LEVEL_512G;
 	p4d = p4d_offset(pgd, address);
 	if (p4d_none(*p4d))
 		return NULL;
 
-	*level = PG_LEVEL_512G;
 	if (p4d_large(*p4d) || !p4d_present(*p4d))
 		return (pte_t *)p4d;
 
+	*level = PG_LEVEL_1G;
 	pud = pud_offset(p4d, address);
 	if (pud_none(*pud))
 		return NULL;
 
-	*level = PG_LEVEL_1G;
 	if (pud_large(*pud) || !pud_present(*pud))
 		return (pte_t *)pud;
 
+	*level = PG_LEVEL_2M;
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return NULL;
 
-	*level = PG_LEVEL_2M;
 	if (pmd_large(*pmd) || !pmd_present(*pmd))
 		return (pte_t *)pmd;
 
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 07/13] x86/mm: Return correct level from lookup_address() if pte is none
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

lookup_address() only returns correct page table level for the entry if
the entry is not none.

Make the helper to always return correct 'level'. It allows to implement
iterator over kernel page tables using lookup_address().

Add one more entry into enum pg_level to indicate size of VA covered by
one PGD entry in 5-level paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/pat/set_memory.c         | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..3f648ffdfbe5 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -548,6 +548,7 @@ enum pg_level {
 	PG_LEVEL_2M,
 	PG_LEVEL_1G,
 	PG_LEVEL_512G,
+	PG_LEVEL_256T,
 	PG_LEVEL_NUM
 };
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6fbf22d5fa56..01f827eb8e80 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -666,32 +666,32 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 
-	*level = PG_LEVEL_NONE;
+	*level = PG_LEVEL_256T;
 
 	if (pgd_none(*pgd))
 		return NULL;
 
+	*level = PG_LEVEL_512G;
 	p4d = p4d_offset(pgd, address);
 	if (p4d_none(*p4d))
 		return NULL;
 
-	*level = PG_LEVEL_512G;
 	if (p4d_large(*p4d) || !p4d_present(*p4d))
 		return (pte_t *)p4d;
 
+	*level = PG_LEVEL_1G;
 	pud = pud_offset(p4d, address);
 	if (pud_none(*pud))
 		return NULL;
 
-	*level = PG_LEVEL_1G;
 	if (pud_large(*pud) || !pud_present(*pud))
 		return (pte_t *)pud;
 
+	*level = PG_LEVEL_2M;
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return NULL;
 
-	*level = PG_LEVEL_2M;
 	if (pmd_large(*pmd) || !pmd_present(*pmd))
 		return (pte_t *)pmd;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 08/13] KVM: x86: Add config option to gate emergency virt callback support
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

KVM uses emergency virt call back to shutdown virtualization extension
during crash, so the crash kernel can work correctly.

So far the virt callback is only supported if KVM_INTEL or KVM_AMD is
enabled. TDX guest has similar needs.

Add a config option to gate virt emergency callback support.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/reboot.h | 4 ++--
 arch/x86/kernel/reboot.c      | 4 ++--
 arch/x86/kvm/Kconfig          | 5 +++++
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 6536873f8fc0..f72bdd4abbe8 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,14 +25,14 @@ void __noreturn machine_real_restart(unsigned int type);
 #define MRR_BIOS	0
 #define MRR_APM		1
 
-#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
+#ifdef CONFIG_EMERGENCY_VIRT_CALLBACK
 typedef void (cpu_emergency_virt_cb)(void);
 void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback);
 void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback);
 void cpu_emergency_disable_virtualization(void);
 #else
 static inline void cpu_emergency_disable_virtualization(void) {}
-#endif /* CONFIG_KVM_INTEL || CONFIG_KVM_AMD */
+#endif /* CONFIG_EMERGENCY_VIRT_CALLBACK */
 
 typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 830425e6d38e..6a781f2f11c8 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -529,7 +529,7 @@ static inline void kb_wait(void)
 
 static inline void nmi_shootdown_cpus_on_restart(void);
 
-#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
+#ifdef CONFIG_EMERGENCY_VIRT_CALLBACK
 /* RCU-protected callback to disable virtualization prior to reboot. */
 static cpu_emergency_virt_cb __rcu *cpu_emergency_virt_callback;
 
@@ -599,7 +599,7 @@ static void emergency_reboot_disable_virtualization(void)
 }
 #else
 static void emergency_reboot_disable_virtualization(void) { }
-#endif /* CONFIG_KVM_INTEL || CONFIG_KVM_AMD */
+#endif /* CONFIG_EMERGENCY_VIRT_CALLBACK */
 
 void __attribute__((weak)) mach_reboot_fixups(void)
 {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ed90f148140d..7df3f0c45cfe 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -80,6 +80,7 @@ config KVM_WERROR
 config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
+	select EMERGENCY_VIRT_CALLBACK
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
@@ -102,6 +103,7 @@ config X86_SGX_KVM
 config KVM_AMD
 	tristate "KVM for AMD processors support"
 	depends on KVM && (CPU_SUP_AMD || CPU_SUP_HYGON)
+	select EMERGENCY_VIRT_CALLBACK
 	help
 	  Provides support for KVM on AMD processors equipped with the AMD-V
 	  (SVM) extensions.
@@ -155,3 +157,6 @@ config KVM_EXTERNAL_WRITE_TRACKING
 	bool
 
 endif # VIRTUALIZATION
+
+config EMERGENCY_VIRT_CALLBACK
+	bool
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 08/13] KVM: x86: Add config option to gate emergency virt callback support
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

KVM uses emergency virt call back to shutdown virtualization extension
during crash, so the crash kernel can work correctly.

So far the virt callback is only supported if KVM_INTEL or KVM_AMD is
enabled. TDX guest has similar needs.

Add a config option to gate virt emergency callback support.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/reboot.h | 4 ++--
 arch/x86/kernel/reboot.c      | 4 ++--
 arch/x86/kvm/Kconfig          | 5 +++++
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 6536873f8fc0..f72bdd4abbe8 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -25,14 +25,14 @@ void __noreturn machine_real_restart(unsigned int type);
 #define MRR_BIOS	0
 #define MRR_APM		1
 
-#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
+#ifdef CONFIG_EMERGENCY_VIRT_CALLBACK
 typedef void (cpu_emergency_virt_cb)(void);
 void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback);
 void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback);
 void cpu_emergency_disable_virtualization(void);
 #else
 static inline void cpu_emergency_disable_virtualization(void) {}
-#endif /* CONFIG_KVM_INTEL || CONFIG_KVM_AMD */
+#endif /* CONFIG_EMERGENCY_VIRT_CALLBACK */
 
 typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 830425e6d38e..6a781f2f11c8 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -529,7 +529,7 @@ static inline void kb_wait(void)
 
 static inline void nmi_shootdown_cpus_on_restart(void);
 
-#if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
+#ifdef CONFIG_EMERGENCY_VIRT_CALLBACK
 /* RCU-protected callback to disable virtualization prior to reboot. */
 static cpu_emergency_virt_cb __rcu *cpu_emergency_virt_callback;
 
@@ -599,7 +599,7 @@ static void emergency_reboot_disable_virtualization(void)
 }
 #else
 static void emergency_reboot_disable_virtualization(void) { }
-#endif /* CONFIG_KVM_INTEL || CONFIG_KVM_AMD */
+#endif /* CONFIG_EMERGENCY_VIRT_CALLBACK */
 
 void __attribute__((weak)) mach_reboot_fixups(void)
 {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ed90f148140d..7df3f0c45cfe 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -80,6 +80,7 @@ config KVM_WERROR
 config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
+	select EMERGENCY_VIRT_CALLBACK
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
@@ -102,6 +103,7 @@ config X86_SGX_KVM
 config KVM_AMD
 	tristate "KVM for AMD processors support"
 	depends on KVM && (CPU_SUP_AMD || CPU_SUP_HYGON)
+	select EMERGENCY_VIRT_CALLBACK
 	help
 	  Provides support for KVM on AMD processors equipped with the AMD-V
 	  (SVM) extensions.
@@ -155,3 +157,6 @@ config KVM_EXTERNAL_WRITE_TRACKING
 	bool
 
 endif # VIRTUALIZATION
+
+config EMERGENCY_VIRT_CALLBACK
+	bool
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 09/13] x86/tdx: Account shared memory
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The kernel will convert all shared memory back to private during kexec.
The direct mapping page tables will provide information on which memory
is shared.

It is extremely important to convert all shared memory. If a page is
missed, it will cause the target kernel to crash when it accesses it.

Keep track of the number of shared pages. This will allow for
cross-checking against the shared information in the direct mapping and
reporting if the shared bit is lost.

Include a debugfs interface that allows for the check to be performed at
any point.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 67 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 46022283d955..56e152126f20 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/debugfs.h>
 #include <linux/export.h>
 #include <linux/io.h>
 #include <asm/coco.h>
@@ -37,6 +38,13 @@
 
 #define TDREPORT_SUBTYPE_0	0
 
+static atomic_long_t nr_shared;
+
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __noreturn __tdx_hypercall_failed(void)
 {
@@ -799,6 +807,11 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
 		return -EIO;
 
+	if (enc)
+		atomic_long_sub(numpages, &nr_shared);
+	else
+		atomic_long_add(numpages, &nr_shared);
+
 	return 0;
 }
 
@@ -871,3 +884,57 @@ void __init tdx_early_init(void)
 
 	pr_info("Guest detected\n");
 }
+
+#ifdef CONFIG_DEBUG_FS
+static int tdx_shared_memory_show(struct seq_file *m, void *p)
+{
+	unsigned long addr, end;
+	unsigned long found = 0;
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte))
+			found += size / PAGE_SIZE;
+
+		addr += size;
+	}
+
+	seq_printf(m, "Number of unshared pages in kernel page tables:  %16lu\n",
+		   found);
+	seq_printf(m, "Number of pages accounted as unshared:           %16ld\n",
+		   atomic_long_read(&nr_shared));
+	return 0;
+}
+
+static int tdx_shared_memory_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, tdx_shared_memory_show, NULL);
+}
+
+static const struct file_operations tdx_shared_memory_fops = {
+	.open           = tdx_shared_memory_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static __init int debug_tdx_shared_memory(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return 0;
+
+	debugfs_create_file("tdx_shared_memory", S_IRUSR, arch_debugfs_dir,
+			    NULL, &tdx_shared_memory_fops);
+	return 0;
+}
+fs_initcall(debug_tdx_shared_memory);
+#endif
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 09/13] x86/tdx: Account shared memory
@ 2023-10-05 13:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

The kernel will convert all shared memory back to private during kexec.
The direct mapping page tables will provide information on which memory
is shared.

It is extremely important to convert all shared memory. If a page is
missed, it will cause the target kernel to crash when it accesses it.

Keep track of the number of shared pages. This will allow for
cross-checking against the shared information in the direct mapping and
reporting if the shared bit is lost.

Include a debugfs interface that allows for the check to be performed at
any point.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 67 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 46022283d955..56e152126f20 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/debugfs.h>
 #include <linux/export.h>
 #include <linux/io.h>
 #include <asm/coco.h>
@@ -37,6 +38,13 @@
 
 #define TDREPORT_SUBTYPE_0	0
 
+static atomic_long_t nr_shared;
+
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __noreturn __tdx_hypercall_failed(void)
 {
@@ -799,6 +807,11 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
 		return -EIO;
 
+	if (enc)
+		atomic_long_sub(numpages, &nr_shared);
+	else
+		atomic_long_add(numpages, &nr_shared);
+
 	return 0;
 }
 
@@ -871,3 +884,57 @@ void __init tdx_early_init(void)
 
 	pr_info("Guest detected\n");
 }
+
+#ifdef CONFIG_DEBUG_FS
+static int tdx_shared_memory_show(struct seq_file *m, void *p)
+{
+	unsigned long addr, end;
+	unsigned long found = 0;
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte))
+			found += size / PAGE_SIZE;
+
+		addr += size;
+	}
+
+	seq_printf(m, "Number of unshared pages in kernel page tables:  %16lu\n",
+		   found);
+	seq_printf(m, "Number of pages accounted as unshared:           %16ld\n",
+		   atomic_long_read(&nr_shared));
+	return 0;
+}
+
+static int tdx_shared_memory_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, tdx_shared_memory_show, NULL);
+}
+
+static const struct file_operations tdx_shared_memory_fops = {
+	.open           = tdx_shared_memory_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static __init int debug_tdx_shared_memory(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return 0;
+
+	debugfs_create_file("tdx_shared_memory", S_IRUSR, arch_debugfs_dir,
+			    NULL, &tdx_shared_memory_fops);
+	return 0;
+}
+fs_initcall(debug_tdx_shared_memory);
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 13:13 ` Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  (?)
@ 2023-10-05 13:13 ` Kirill A. Shutemov
  2023-10-05 18:41     ` Kalra, Ashish
                     ` (2 more replies)
  -1 siblings, 3 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The target kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

On TD shutdown (also covers kexec), walk direct mapping and convert all
shared memory back to private. It makes all RAM private again and target
kernel may use it normally.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig          |   1 +
 arch/x86/coco/tdx/kexec.c |   0
 arch/x86/coco/tdx/tdx.c   | 137 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 136 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/coco/tdx/kexec.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7368d254d01f..b5acf9fb4c70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
 	select X86_MEM_ENCRYPT
 	select X86_MCE
 	select UNACCEPTED_MEMORY
+	select EMERGENCY_VIRT_CALLBACK
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 56e152126f20..ac0745303983 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -6,6 +6,7 @@
 
 #include <linux/cpufeature.h>
 #include <linux/debugfs.h>
+#include <linux/delay.h>
 #include <linux/export.h>
 #include <linux/io.h>
 #include <asm/coco.h>
@@ -14,6 +15,8 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/reboot.h>
+#include <asm/set_memory.h>
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -40,6 +43,9 @@
 
 static atomic_long_t nr_shared;
 
+static atomic_t conversions_in_progress;
+static bool conversion_allowed = true;
+
 static inline bool pte_decrypted(pte_t pte)
 {
 	return cc_mkdec(pte_val(pte)) == pte_val(pte);
@@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
 
 static bool tdx_cache_flush_required(void)
 {
+	/*
+	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
+	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
+	 * implicit conversion to shared.
+	 */
+	if (!conversion_allowed)
+		return false;
+
 	/*
 	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
 	 * TDX doesn't have such capability.
@@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
 					 bool enc)
 {
+	atomic_inc(&conversions_in_progress);
+
+	/*
+	 * Check after bumping conversions_in_progress to serialize
+	 * against tdx_shutdown().
+	 */
+	if (!conversion_allowed) {
+		atomic_dec(&conversions_in_progress);
+		return -EBUSY;
+	}
+
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	return 0;
 }
@@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	if (enc)
 		atomic_long_sub(numpages, &nr_shared);
 	else
 		atomic_long_add(numpages, &nr_shared);
 
+	atomic_dec(&conversions_in_progress);
+
 	return 0;
 }
 
+static void unshare_all_memory(bool unmap)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	/*
+	 * Walk direct mapping and convert all shared memory back to private,
+	 */
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 *
+			 * Bypass unmapping for crash scenario. Unmapping
+			 * requires sleepable context, but in crash case kernel
+			 * hits the code path with interrupts disabled.
+			 * It shouldn't be a problem as all secondary CPUs are
+			 * down and kernel runs with interrupts disabled, so
+			 * there is no room for race.
+			 */
+			if (unmap)
+				set_memory_np(addr, pages);
+
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
+static void tdx_shutdown(void)
+{
+	unsigned long timeout;
+
+	/*
+	 * Stop new private<->shared conversions and wait for in-flight
+	 * conversions to complete.
+	 *
+	 * Do not wait more than 30 seconds.
+	 */
+	timeout = 30 * USEC_PER_SEC;
+	conversion_allowed = false;
+	while (atomic_read(&conversions_in_progress) && timeout--)
+		udelay(1);
+
+	if (!timeout)
+		pr_warn("Failed to finish shared<->private conversions\n");
+
+	unshare_all_memory(true);
+
+	native_machine_shutdown();
+}
+
+static void tdx_crash_shutdown(void)
+{
+	/*
+	 * Crash can race with private<->shared conversion.
+	 *
+	 * There's no clean way out: report and proceed.
+	 */
+	if (atomic_read(&conversions_in_progress))
+		pr_warn("Failed to finish shared<->private conversions\n");
+
+	unshare_all_memory(false);
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
 	 */
 	x86_cpuinit.parallel_bringup = false;
 
+	machine_ops.shutdown = tdx_shutdown;
+
+	/*
+	 * KVM overrides machine_ops.crash_shutdown, use emergency
+	 * virt callback instead.
+	 */
+	cpu_emergency_register_virt_callback(tdx_crash_shutdown);
+
 	pr_info("Guest detected\n");
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 11/13] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:14   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

e820__end_of_ram_pfn() is used to calculate max_pfn which, among other
things, guides where direct mapping ends. Any memory above max_pfn is
not going to be present in the direct mapping.

e820__end_of_ram_pfn() finds the end of the ram based on the highest
E820_TYPE_RAM range. But it doesn't includes E820_TYPE_ACPI ranges into
calculation.

Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI
tables and might be required by kernel to function properly.

Usually the problem is hidden because there is some E820_TYPE_RAM memory
above E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash
memory as E820_TYPE_RAM on boot. If the preallocated range is small, it
can fit under the last E820_TYPE_ACPI range.

Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.

The problem was discovered during debugging kexec for TDX guest. TDX
guest uses E820_TYPE_ACPI to store the unaccepted memory bitmap and pass
it between the kernels on kexec.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/e820.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..99c80680dc9e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -827,7 +827,7 @@ u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type type)
+static unsigned long __init e820_end_ram_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -838,7 +838,8 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (entry->type != type)
+		if (entry->type != E820_TYPE_RAM &&
+		    entry->type != E820_TYPE_ACPI)
 			continue;
 
 		start_pfn = entry->addr >> PAGE_SHIFT;
@@ -864,12 +865,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 
 unsigned long __init e820__end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
+	return e820_end_ram_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820__end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL << (32 - PAGE_SHIFT), E820_TYPE_RAM);
+	return e820_end_ram_pfn(1UL << (32 - PAGE_SHIFT));
 }
 
 static void __init early_panic(char *msg)
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 11/13] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
@ 2023-10-05 13:14   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

e820__end_of_ram_pfn() is used to calculate max_pfn which, among other
things, guides where direct mapping ends. Any memory above max_pfn is
not going to be present in the direct mapping.

e820__end_of_ram_pfn() finds the end of the ram based on the highest
E820_TYPE_RAM range. But it doesn't includes E820_TYPE_ACPI ranges into
calculation.

Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI
tables and might be required by kernel to function properly.

Usually the problem is hidden because there is some E820_TYPE_RAM memory
above E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash
memory as E820_TYPE_RAM on boot. If the preallocated range is small, it
can fit under the last E820_TYPE_ACPI range.

Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.

The problem was discovered during debugging kexec for TDX guest. TDX
guest uses E820_TYPE_ACPI to store the unaccepted memory bitmap and pass
it between the kernels on kexec.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/e820.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..99c80680dc9e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -827,7 +827,7 @@ u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type type)
+static unsigned long __init e820_end_ram_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -838,7 +838,8 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (entry->type != type)
+		if (entry->type != E820_TYPE_RAM &&
+		    entry->type != E820_TYPE_ACPI)
 			continue;
 
 		start_pfn = entry->addr >> PAGE_SHIFT;
@@ -864,12 +865,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 
 unsigned long __init e820__end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
+	return e820_end_ram_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820__end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL << (32 - PAGE_SHIFT), E820_TYPE_RAM);
+	return e820_end_ram_pfn(1UL << (32 - PAGE_SHIFT));
 }
 
 static void __init early_panic(char *msg)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-05 13:14   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
kexec: target kernel won't be able to use more than one CPU.

Zero out mailbox address in the ACPI MADT wakeup structure to indicate
that the mailbox is not usable.

This is Linux-specific protocol and not reflected in ACPI spec.

Booting the target kernel with signle CPU is enough to cover the most
common case for kexec -- kdump.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/acpi/madt_wakeup.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 15bdf10b1393..4e92d1d4a5fa 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
 
 static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
 {
+	if (!acpi_mp_wake_mailbox_paddr) {
+		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
+		return -EOPNOTSUPP;
+	}
+
 	/*
 	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
 	 *
@@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 	/* Disable CPU onlining/offlining */
 	cpu_hotplug_not_supported();
 
+	/*
+	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
+	 * It limits kexec: target kernel won't be able to use more than
+	 * one CPU.
+	 *
+	 * Zero out mailbox address in the ACPI MADT wakeup structure to
+	 * indicate that the mailbox is not usable.
+	 *
+	 * This is Linux-specific protocol and not reflected in ACPI spec.
+	 */
+	mp_wake->base_address = 0;
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
-- 
2.41.0


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
@ 2023-10-05 13:14   ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
kexec: target kernel won't be able to use more than one CPU.

Zero out mailbox address in the ACPI MADT wakeup structure to indicate
that the mailbox is not usable.

This is Linux-specific protocol and not reflected in ACPI spec.

Booting the target kernel with signle CPU is enough to cover the most
common case for kexec -- kdump.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/acpi/madt_wakeup.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 15bdf10b1393..4e92d1d4a5fa 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
 
 static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
 {
+	if (!acpi_mp_wake_mailbox_paddr) {
+		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
+		return -EOPNOTSUPP;
+	}
+
 	/*
 	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
 	 *
@@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 	/* Disable CPU onlining/offlining */
 	cpu_hotplug_not_supported();
 
+	/*
+	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
+	 * It limits kexec: target kernel won't be able to use more than
+	 * one CPU.
+	 *
+	 * Zero out mailbox address in the ACPI MADT wakeup structure to
+	 * indicate that the mailbox is not usable.
+	 *
+	 * This is Linux-specific protocol and not reflected in ACPI spec.
+	 */
+	mp_wake->base_address = 0;
+
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
 	return 0;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-10-05 13:13 ` Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  (?)
@ 2023-10-05 13:14 ` Kirill A. Shutemov
  2023-10-20  9:49     ` Huang, Kai
  2023-10-20 11:21     ` Huang, Kai
  -1 siblings, 2 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

MADT mailbox version 1 brings support of CPU offlining: BIOS provides
a reset vector where the CPU has to jump to offline itself. The new TEST
mailbox command can be used to test the CPU offlined successfully and
BIOS has control over it.

Add CPU offling support for ACPI MADT wakeup method by implementing
custom cpu_die, play_dead and stop_other_cpus SMP operations.

CPU offlining makes possible to hand over secondary CPUs over kexec, not
limiting the target kernel with single CPU.

The change conforms to the approved ACPI spec change proposal. See the
Link.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
---
 arch/x86/kernel/acpi/Makefile      |   2 +-
 arch/x86/kernel/acpi/boot.c        |   2 +
 arch/x86/kernel/acpi/madt.S        |  28 +++++
 arch/x86/kernel/acpi/madt_wakeup.c | 191 ++++++++++++++++++++++++++---
 include/acpi/actbl2.h              |  19 ++-
 5 files changed, 223 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/kernel/acpi/madt.S

diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index 8c7329c88a75..ccb8198dd8d1 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_ACPI)			+= boot.o
 obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
 obj-$(CONFIG_ACPI_APEI)			+= apei.o
 obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
-obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o madt.o
 
 ifneq ($(CONFIG_ACPI_PROCESSOR),)
 obj-y					+= cstate.o
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 111bd226ad99..d537dbffa697 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
 #include <linux/efi-bgrt.h>
 #include <linux/serial_core.h>
 #include <linux/pgtable.h>
+#include <linux/sched/hotplug.h>
 
 #include <asm/e820/api.h>
 #include <asm/irqdomain.h>
@@ -33,6 +34,7 @@
 #include <asm/smp.h>
 #include <asm/i8259.h>
 #include <asm/setup.h>
+#include <asm/init.h>
 
 #include "sleep.h" /* To include x86_acpi_suspend_lowlevel */
 static int __initdata acpi_force = 0;
diff --git a/arch/x86/kernel/acpi/madt.S b/arch/x86/kernel/acpi/madt.S
new file mode 100644
index 000000000000..5d00d315e44e
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt.S
@@ -0,0 +1,28 @@
+#include <linux/linkage.h>
+#include <asm/nospec-branch.h>
+#include <asm/page_types.h>
+#include <asm/processor-flags.h>
+
+	.text
+	.align PAGE_SIZE
+SYM_FUNC_START(asm_acpi_mp_play_dead)
+	/* Load address of reset vector into RCX to jump when kernel is ready */
+	movq	acpi_mp_reset_vector_paddr(%rip), %rcx
+
+	/* zero out flags, and disable interrupts */
+	pushq	$0
+	popfq
+
+	/* Turn off global entries. Following CR3 write will flush them. */
+	movq	%cr4, %rdx
+	andq	$~(X86_CR4_PGE), %rdx
+	movq	%rdx, %cr4
+
+	/* Switch to identity mapping */
+	movq	acpi_mp_pgd(%rip), %rax
+	movq	%rax, %cr3
+
+	/* Jump to reset vector */
+	ANNOTATE_RETPOLINE_SAFE
+	jmp	*%rcx
+SYM_FUNC_END(asm_acpi_mp_play_dead)
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 4e92d1d4a5fa..2cc8590ec7a5 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,12 +1,162 @@
 #include <linux/acpi.h>
 #include <linux/cpu.h>
+#include <linux/delay.h>
+#include <linux/memblock.h>
+#include <linux/pgtable.h>
+#include <linux/sched/hotplug.h>
 #include <asm/apic.h>
+#include <asm/init.h>
 
 /* Physical address of the Multiprocessor Wakeup Structure mailbox */
 static u64 acpi_mp_wake_mailbox_paddr;
 /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
 static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
 
+unsigned long acpi_mp_pgd;
+u64 acpi_mp_reset_vector_paddr;
+
+void asm_acpi_mp_play_dead(void);
+
+static void __init *alloc_pgt_page(void *context)
+{
+	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+/*
+ * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
+ * the same place as in the kernel page tables. The function switches to
+ * the identity mapping and has be present at the same spot in before and
+ * after transition.
+ */
+static int __init init_transition_pgtable(pgd_t *pgd)
+{
+	pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+	unsigned long vaddr, paddr;
+	int result = -ENOMEM;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	vaddr = (unsigned long)asm_acpi_mp_play_dead;
+	pgd += pgd_index(vaddr);
+	if (!pgd_present(*pgd)) {
+		p4d = (p4d_t *)alloc_pgt_page(NULL);
+		if (!p4d)
+			goto err;
+		set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+	}
+	p4d = p4d_offset(pgd, vaddr);
+	if (!p4d_present(*p4d)) {
+		pud = (pud_t *)alloc_pgt_page(NULL);
+		if (!pud)
+			goto err;
+		set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+	}
+	pud = pud_offset(p4d, vaddr);
+	if (!pud_present(*pud)) {
+		pmd = (pmd_t *)alloc_pgt_page(NULL);
+		if (!pmd)
+			goto err;
+		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+	}
+	pmd = pmd_offset(pud, vaddr);
+	if (!pmd_present(*pmd)) {
+		pte = (pte_t *)alloc_pgt_page(NULL);
+		if (!pte)
+			goto err;
+		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+	}
+	pte = pte_offset_kernel(pmd, vaddr);
+
+	paddr = __pa(vaddr);
+	set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+
+	return 0;
+err:
+	return result;
+}
+
+static void acpi_mp_play_dead(void)
+{
+	idle_task_exit();
+	cpuhp_ap_report_dead();
+	asm_acpi_mp_play_dead();
+}
+
+static void acpi_mp_cpu_die(unsigned int cpu)
+{
+	int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	unsigned long timeout;
+
+	/*
+	 * Use TEST mailbox command to prove that BIOS got control over
+	 * the CPU before declaring it dead.
+	 *
+	 * BIOS has to clear 'command' field of the mailbox.
+	 */
+	acpi_mp_wake_mailbox->apic_id = apicid;
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_TEST);
+
+	/* Don't wait longer than a second. */
+	timeout = USEC_PER_SEC;
+	while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
+		udelay(1);
+}
+
+static void acpi_mp_stop_other_cpus(int wait)
+{
+	smp_shutdown_nonboot_cpus(smp_processor_id());
+}
+
+static void acpi_mp_crash_stop_other_cpus(void)
+{
+	smp_shutdown_nonboot_cpus(smp_processor_id());
+
+	/* The kernel is broken so disable interrupts */
+	local_irq_disable();
+}
+
+static int __init acpi_mp_setup_reset(u64 reset_vector)
+{
+	pgd_t *pgd;
+	struct x86_mapping_info info = {
+		.alloc_pgt_page = alloc_pgt_page,
+		.page_flag      = __PAGE_KERNEL_LARGE_EXEC,
+		.kernpg_flag    = _KERNPG_TABLE_NOENC,
+	};
+
+	pgd = alloc_pgt_page(NULL);
+
+	for (int i = 0; i < nr_pfn_mapped; i++) {
+		unsigned long mstart, mend;
+		mstart = pfn_mapped[i].start << PAGE_SHIFT;
+		mend   = pfn_mapped[i].end << PAGE_SHIFT;
+		if (kernel_ident_mapping_init(&info, pgd, mstart, mend))
+			return -ENOMEM;
+	}
+
+	if (kernel_ident_mapping_init(&info, pgd,
+				      PAGE_ALIGN_DOWN(reset_vector),
+				      PAGE_ALIGN(reset_vector + 1))) {
+		return -ENOMEM;
+	}
+
+	if (init_transition_pgtable(pgd))
+		return -ENOMEM;
+
+	smp_ops.play_dead = acpi_mp_play_dead;
+	smp_ops.cpu_die = acpi_mp_cpu_die;
+	smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
+	smp_ops.crash_stop_other_cpus = acpi_mp_crash_stop_other_cpus;
+
+	acpi_mp_reset_vector_paddr = reset_vector;
+	acpi_mp_pgd = __pa(pgd);
+
+	return 0;
+}
+
 static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
 {
 	if (!acpi_mp_wake_mailbox_paddr) {
@@ -73,27 +223,38 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 		return -ENODEV;
 
 	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
-	if (BAD_MADT_ENTRY(mp_wake, end))
+	if (!mp_wake)
+		return -EINVAL;
+
+	if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0)
+		return -EINVAL;
+	if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0)
 		return -EINVAL;
 
 	acpi_table_print_madt_entry(&header->common);
 
-	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+	acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
 
-	/* Disable CPU onlining/offlining */
-	cpu_hotplug_not_supported();
+	if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 &&
+	    mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) {
+		acpi_mp_setup_reset(mp_wake->reset_vector);
+	} else {
+		/* Disable CPU onlining/offlining */
+		cpu_hotplug_not_supported();
 
-	/*
-	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
-	 * It limits kexec: target kernel won't be able to use more than
-	 * one CPU.
-	 *
-	 * Zero out mailbox address in the ACPI MADT wakeup structure to
-	 * indicate that the mailbox is not usable.
-	 *
-	 * This is Linux-specific protocol and not reflected in ACPI spec.
-	 */
-	mp_wake->base_address = 0;
+		/*
+		 * Without reset vector support, ACPI MADT doesn't allow to
+		 * offline CPU after it got woke up. It limits kexec: target
+		 * kernel won't be able to use more than one CPU.
+		 *
+		 * Zero out mailbox address in the ACPI MADT wakeup structure
+		 * to indicate that the mailbox is not usable.
+		 *
+		 * This is Linux-specific protocol and not reflected in ACPI
+		 * spec.
+		 */
+		mp_wake->mailbox_address = 0;
+	}
 
 	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
 
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index 3751ae69432f..8348bf46a648 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1109,11 +1109,23 @@ struct acpi_madt_generic_translator {
 
 struct acpi_madt_multiproc_wakeup {
 	struct acpi_subtable_header header;
-	u16 mailbox_version;
+	u16 version;
 	u32 reserved;		/* reserved - must be zero */
-	u64 base_address;
+	u64 mailbox_address;
+	u64 reset_vector;
 };
 
+/* Values for Version field above */
+
+enum acpi_madt_multiproc_wakeup_version {
+	ACPI_MADT_MP_WAKEUP_VERSION_NONE = 0,
+	ACPI_MADT_MP_WAKEUP_VERSION_V1 = 1,
+	ACPI_MADT_MP_WAKEUP_VERSION_RESERVED = 2, /* 2 and greater are reserved */
+};
+
+#define ACPI_MADT_MP_WAKEUP_SIZE_V0	16
+#define ACPI_MADT_MP_WAKEUP_SIZE_V1	24
+
 #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE        2032
 #define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE  2048
 
@@ -1126,7 +1138,8 @@ struct acpi_madt_multiproc_wakeup_mailbox {
 	u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE];	/* reserved for firmware use */
 };
 
-#define ACPI_MP_WAKE_COMMAND_WAKEUP    1
+#define ACPI_MP_WAKE_COMMAND_WAKEUP	1
+#define ACPI_MP_WAKE_COMMAND_TEST	2
 
 /* 17: CPU Core Interrupt Controller (ACPI 6.5) */
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 13:13 ` [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
@ 2023-10-05 18:41     ` Kalra, Ashish
  2023-10-06 14:58     ` Sean Christopherson
  2023-10-08  8:35     ` Baoquan He
  2 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-05 18:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

Hello Kirill,

On 10/5/2023 8:13 AM, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
> 
> The target kernel has no idea what memory is converted this way. It only
> sees E820_TYPE_RAM.
> 
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
> 
> On TD shutdown (also covers kexec), walk direct mapping and convert all
> shared memory back to private. It makes all RAM private again and target
> kernel may use it normally.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   arch/x86/Kconfig          |   1 +
>   arch/x86/coco/tdx/kexec.c |   0
>   arch/x86/coco/tdx/tdx.c   | 137 +++++++++++++++++++++++++++++++++++++-
>   3 files changed, 136 insertions(+), 2 deletions(-)
>   create mode 100644 arch/x86/coco/tdx/kexec.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>   	select X86_MEM_ENCRYPT
>   	select X86_MCE
>   	select UNACCEPTED_MEMORY
> +	select EMERGENCY_VIRT_CALLBACK
>   	help
>   	  Support running as a guest under Intel TDX.  Without this support,
>   	  the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
> new file mode 100644
> index 000000000000..e69de29bb2d1
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 56e152126f20..ac0745303983 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -6,6 +6,7 @@
>   
>   #include <linux/cpufeature.h>
>   #include <linux/debugfs.h>
> +#include <linux/delay.h>
>   #include <linux/export.h>
>   #include <linux/io.h>
>   #include <asm/coco.h>
> @@ -14,6 +15,8 @@
>   #include <asm/insn.h>
>   #include <asm/insn-eval.h>
>   #include <asm/pgtable.h>
> +#include <asm/reboot.h>
> +#include <asm/set_memory.h>
>   
>   /* MMIO direction */
>   #define EPT_READ	0
> @@ -40,6 +43,9 @@
>   
>   static atomic_long_t nr_shared;
>   
> +static atomic_t conversions_in_progress;
> +static bool conversion_allowed = true;
> +
>   static inline bool pte_decrypted(pte_t pte)
>   {
>   	return cc_mkdec(pte_val(pte)) == pte_val(pte);
> @@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
>   
>   static bool tdx_cache_flush_required(void)
>   {
> +	/*
> +	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> +	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
> +	 * implicit conversion to shared.
> +	 */
> +	if (!conversion_allowed)
> +		return false;
> +
>   	/*
>   	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
>   	 * TDX doesn't have such capability.
> @@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>   static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
>   					 bool enc)
>   {
> +	atomic_inc(&conversions_in_progress);
> +
> +	/*
> +	 * Check after bumping conversions_in_progress to serialize
> +	 * against tdx_shutdown().
> +	 */
> +	if (!conversion_allowed) {
> +		atomic_dec(&conversions_in_progress);
> +		return -EBUSY;
> +	}
> +
>   	/*
>   	 * Only handle shared->private conversion here.
>   	 * See the comment in tdx_early_init().
>   	 */
> -	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>   		return -EIO;
> +	}
>   
>   	return 0;
>   }
> @@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
>   	 * Only handle private->shared conversion here.
>   	 * See the comment in tdx_early_init().
>   	 */
> -	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>   		return -EIO;
> +	}
>   
>   	if (enc)
>   		atomic_long_sub(numpages, &nr_shared);
>   	else
>   		atomic_long_add(numpages, &nr_shared);
>   
> +	atomic_dec(&conversions_in_progress);
> +
>   	return 0;
>   }
>   
> +static void unshare_all_memory(bool unmap)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	/*
> +	 * Walk direct mapping and convert all shared memory back to private,
> +	 */
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);

IIRC, you were earlier walking the direct mapping using 
walk_page_range_novma(), any particular reason to use lookup_address() 
instead ?

> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {

Additionally need to add check for pte_none() here to handle physical 
memory holes in direct mapping.

> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 *
> +			 * Bypass unmapping for crash scenario. Unmapping
> +			 * requires sleepable context, but in crash case kernel
> +			 * hits the code path with interrupts disabled.

In case of SNP we will need to temporarily enable interrupts during this 
unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() 
in cpa_flush() if interrupts are disabled.

Thanks,
Ashish

> +			 * It shouldn't be a problem as all secondary CPUs are
> +			 * down and kernel runs with interrupts disabled, so
> +			 * there is no room for race.
> +			 */
> +			if (unmap)
> +				set_memory_np(addr, pages);
> +
> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}
> +
> +static void tdx_shutdown(void)
> +{
> +	unsigned long timeout;
> +
> +	/*
> +	 * Stop new private<->shared conversions and wait for in-flight
> +	 * conversions to complete.
> +	 *
> +	 * Do not wait more than 30 seconds.
> +	 */
> +	timeout = 30 * USEC_PER_SEC;
> +	conversion_allowed = false;
> +	while (atomic_read(&conversions_in_progress) && timeout--)
> +		udelay(1);
> +
> +	if (!timeout)
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(true);
> +
> +	native_machine_shutdown();
> +}
> +
> +static void tdx_crash_shutdown(void)
> +{
> +	/*
> +	 * Crash can race with private<->shared conversion.
> +	 *
> +	 * There's no clean way out: report and proceed.
> +	 */
> +	if (atomic_read(&conversions_in_progress))
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(false);
> +}
> +
>   void __init tdx_early_init(void)
>   {
>   	struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>   	 */
>   	x86_cpuinit.parallel_bringup = false;
>   
> +	machine_ops.shutdown = tdx_shutdown;
> +
> +	/*
> +	 * KVM overrides machine_ops.crash_shutdown, use emergency
> +	 * virt callback instead.
> +	 */
> +	cpu_emergency_register_virt_callback(tdx_crash_shutdown);
> +
>   	pr_info("Guest detected\n");
>   }
>   
> 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-05 18:41     ` Kalra, Ashish
  0 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-05 18:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

Hello Kirill,

On 10/5/2023 8:13 AM, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
> 
> The target kernel has no idea what memory is converted this way. It only
> sees E820_TYPE_RAM.
> 
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
> 
> On TD shutdown (also covers kexec), walk direct mapping and convert all
> shared memory back to private. It makes all RAM private again and target
> kernel may use it normally.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   arch/x86/Kconfig          |   1 +
>   arch/x86/coco/tdx/kexec.c |   0
>   arch/x86/coco/tdx/tdx.c   | 137 +++++++++++++++++++++++++++++++++++++-
>   3 files changed, 136 insertions(+), 2 deletions(-)
>   create mode 100644 arch/x86/coco/tdx/kexec.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>   	select X86_MEM_ENCRYPT
>   	select X86_MCE
>   	select UNACCEPTED_MEMORY
> +	select EMERGENCY_VIRT_CALLBACK
>   	help
>   	  Support running as a guest under Intel TDX.  Without this support,
>   	  the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
> new file mode 100644
> index 000000000000..e69de29bb2d1
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 56e152126f20..ac0745303983 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -6,6 +6,7 @@
>   
>   #include <linux/cpufeature.h>
>   #include <linux/debugfs.h>
> +#include <linux/delay.h>
>   #include <linux/export.h>
>   #include <linux/io.h>
>   #include <asm/coco.h>
> @@ -14,6 +15,8 @@
>   #include <asm/insn.h>
>   #include <asm/insn-eval.h>
>   #include <asm/pgtable.h>
> +#include <asm/reboot.h>
> +#include <asm/set_memory.h>
>   
>   /* MMIO direction */
>   #define EPT_READ	0
> @@ -40,6 +43,9 @@
>   
>   static atomic_long_t nr_shared;
>   
> +static atomic_t conversions_in_progress;
> +static bool conversion_allowed = true;
> +
>   static inline bool pte_decrypted(pte_t pte)
>   {
>   	return cc_mkdec(pte_val(pte)) == pte_val(pte);
> @@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
>   
>   static bool tdx_cache_flush_required(void)
>   {
> +	/*
> +	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> +	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
> +	 * implicit conversion to shared.
> +	 */
> +	if (!conversion_allowed)
> +		return false;
> +
>   	/*
>   	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
>   	 * TDX doesn't have such capability.
> @@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>   static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
>   					 bool enc)
>   {
> +	atomic_inc(&conversions_in_progress);
> +
> +	/*
> +	 * Check after bumping conversions_in_progress to serialize
> +	 * against tdx_shutdown().
> +	 */
> +	if (!conversion_allowed) {
> +		atomic_dec(&conversions_in_progress);
> +		return -EBUSY;
> +	}
> +
>   	/*
>   	 * Only handle shared->private conversion here.
>   	 * See the comment in tdx_early_init().
>   	 */
> -	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>   		return -EIO;
> +	}
>   
>   	return 0;
>   }
> @@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
>   	 * Only handle private->shared conversion here.
>   	 * See the comment in tdx_early_init().
>   	 */
> -	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>   		return -EIO;
> +	}
>   
>   	if (enc)
>   		atomic_long_sub(numpages, &nr_shared);
>   	else
>   		atomic_long_add(numpages, &nr_shared);
>   
> +	atomic_dec(&conversions_in_progress);
> +
>   	return 0;
>   }
>   
> +static void unshare_all_memory(bool unmap)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	/*
> +	 * Walk direct mapping and convert all shared memory back to private,
> +	 */
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);

IIRC, you were earlier walking the direct mapping using 
walk_page_range_novma(), any particular reason to use lookup_address() 
instead ?

> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {

Additionally need to add check for pte_none() here to handle physical 
memory holes in direct mapping.

> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 *
> +			 * Bypass unmapping for crash scenario. Unmapping
> +			 * requires sleepable context, but in crash case kernel
> +			 * hits the code path with interrupts disabled.

In case of SNP we will need to temporarily enable interrupts during this 
unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() 
in cpa_flush() if interrupts are disabled.

Thanks,
Ashish

> +			 * It shouldn't be a problem as all secondary CPUs are
> +			 * down and kernel runs with interrupts disabled, so
> +			 * there is no room for race.
> +			 */
> +			if (unmap)
> +				set_memory_np(addr, pages);
> +
> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}
> +
> +static void tdx_shutdown(void)
> +{
> +	unsigned long timeout;
> +
> +	/*
> +	 * Stop new private<->shared conversions and wait for in-flight
> +	 * conversions to complete.
> +	 *
> +	 * Do not wait more than 30 seconds.
> +	 */
> +	timeout = 30 * USEC_PER_SEC;
> +	conversion_allowed = false;
> +	while (atomic_read(&conversions_in_progress) && timeout--)
> +		udelay(1);
> +
> +	if (!timeout)
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(true);
> +
> +	native_machine_shutdown();
> +}
> +
> +static void tdx_crash_shutdown(void)
> +{
> +	/*
> +	 * Crash can race with private<->shared conversion.
> +	 *
> +	 * There's no clean way out: report and proceed.
> +	 */
> +	if (atomic_read(&conversions_in_progress))
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(false);
> +}
> +
>   void __init tdx_early_init(void)
>   {
>   	struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>   	 */
>   	x86_cpuinit.parallel_bringup = false;
>   
> +	machine_ops.shutdown = tdx_shutdown;
> +
> +	/*
> +	 * KVM overrides machine_ops.crash_shutdown, use emergency
> +	 * virt callback instead.
> +	 */
> +	cpu_emergency_register_virt_callback(tdx_crash_shutdown);
> +
>   	pr_info("Guest detected\n");
>   }
>   
> 

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 18:41     ` Kalra, Ashish
@ 2023-10-05 21:28       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 21:28 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > +static void unshare_all_memory(bool unmap)
> > +{
> > +	unsigned long addr, end;
> > +	long found = 0, shared;
> > +
> > +	/*
> > +	 * Walk direct mapping and convert all shared memory back to private,
> > +	 */
> > +
> > +	addr = PAGE_OFFSET;
> > +	end  = PAGE_OFFSET + get_max_mapped();
> > +
> > +	while (addr < end) {
> > +		unsigned long size;
> > +		unsigned int level;
> > +		pte_t *pte;
> > +
> > +		pte = lookup_address(addr, &level);
> 
> IIRC, you were earlier walking the direct mapping using
> walk_page_range_novma(), any particular reason to use lookup_address()
> instead ?

walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
we run here from atomic context in case of crash.

I considered using trylock to bypass the limitation, but it is a hack.

> 
> > +		size = page_level_size(level);
> > +
> > +		if (pte && pte_decrypted(*pte)) {
> 
> Additionally need to add check for pte_none() here to handle physical memory
> holes in direct mapping.

lookup_address() returns NULL for none entries.

> > +			int pages = size / PAGE_SIZE;
> > +
> > +			/*
> > +			 * Touching memory with shared bit set triggers implicit
> > +			 * conversion to shared.
> > +			 *
> > +			 * Make sure nobody touches the shared range from
> > +			 * now on.
> > +			 *
> > +			 * Bypass unmapping for crash scenario. Unmapping
> > +			 * requires sleepable context, but in crash case kernel
> > +			 * hits the code path with interrupts disabled.
> 
> In case of SNP we will need to temporarily enable interrupts during this
> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
> cpa_flush() if interrupts are disabled.

Do you really need full set_memory_encrypted()? Can't you do something
ligher?


-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-05 21:28       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 21:28 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > +static void unshare_all_memory(bool unmap)
> > +{
> > +	unsigned long addr, end;
> > +	long found = 0, shared;
> > +
> > +	/*
> > +	 * Walk direct mapping and convert all shared memory back to private,
> > +	 */
> > +
> > +	addr = PAGE_OFFSET;
> > +	end  = PAGE_OFFSET + get_max_mapped();
> > +
> > +	while (addr < end) {
> > +		unsigned long size;
> > +		unsigned int level;
> > +		pte_t *pte;
> > +
> > +		pte = lookup_address(addr, &level);
> 
> IIRC, you were earlier walking the direct mapping using
> walk_page_range_novma(), any particular reason to use lookup_address()
> instead ?

walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
we run here from atomic context in case of crash.

I considered using trylock to bypass the limitation, but it is a hack.

> 
> > +		size = page_level_size(level);
> > +
> > +		if (pte && pte_decrypted(*pte)) {
> 
> Additionally need to add check for pte_none() here to handle physical memory
> holes in direct mapping.

lookup_address() returns NULL for none entries.

> > +			int pages = size / PAGE_SIZE;
> > +
> > +			/*
> > +			 * Touching memory with shared bit set triggers implicit
> > +			 * conversion to shared.
> > +			 *
> > +			 * Make sure nobody touches the shared range from
> > +			 * now on.
> > +			 *
> > +			 * Bypass unmapping for crash scenario. Unmapping
> > +			 * requires sleepable context, but in crash case kernel
> > +			 * hits the code path with interrupts disabled.
> 
> In case of SNP we will need to temporarily enable interrupts during this
> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
> cpa_flush() if interrupts are disabled.

Do you really need full set_memory_encrypted()? Can't you do something
ligher?


-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 21:28       ` Kirill A. Shutemov
@ 2023-10-05 22:01         ` Kalra, Ashish
  -1 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-05 22:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
>>> +static void unshare_all_memory(bool unmap)
>>> +{
>>> +	unsigned long addr, end;
>>> +	long found = 0, shared;
>>> +
>>> +	/*
>>> +	 * Walk direct mapping and convert all shared memory back to private,
>>> +	 */
>>> +
>>> +	addr = PAGE_OFFSET;
>>> +	end  = PAGE_OFFSET + get_max_mapped();
>>> +
>>> +	while (addr < end) {
>>> +		unsigned long size;
>>> +		unsigned int level;
>>> +		pte_t *pte;
>>> +
>>> +		pte = lookup_address(addr, &level);
>>
>> IIRC, you were earlier walking the direct mapping using
>> walk_page_range_novma(), any particular reason to use lookup_address()
>> instead ?
> 
> walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> we run here from atomic context in case of crash.
> 
> I considered using trylock to bypass the limitation, but it is a hack.
> 
>>
>>> +		size = page_level_size(level);
>>> +
>>> +		if (pte && pte_decrypted(*pte)) {
>>
>> Additionally need to add check for pte_none() here to handle physical memory
>> holes in direct mapping.
> 
> lookup_address() returns NULL for none entries.
>

Looking at lookup_address_in_pgd(), at pte level it is simply returning
pte_offset_kernel() and there does not seem to be a check for returning 
NULL if pte_none() ?

>>> +			int pages = size / PAGE_SIZE;
>>> +
>>> +			/*
>>> +			 * Touching memory with shared bit set triggers implicit
>>> +			 * conversion to shared.
>>> +			 *
>>> +			 * Make sure nobody touches the shared range from
>>> +			 * now on.
>>> +			 *
>>> +			 * Bypass unmapping for crash scenario. Unmapping
>>> +			 * requires sleepable context, but in crash case kernel
>>> +			 * hits the code path with interrupts disabled.
>>
>> In case of SNP we will need to temporarily enable interrupts during this
>> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
>> cpa_flush() if interrupts are disabled.
> 
> Do you really need full set_memory_encrypted()? Can't you do something
> ligher?
> 
We need to modify the PTE for setting c-bit to 1 so that will require 
cpa_flush(), though probably can add something lighter to do
clflush_cache_range() directly ?

Thanks,
Ashish


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-05 22:01         ` Kalra, Ashish
  0 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-05 22:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
>>> +static void unshare_all_memory(bool unmap)
>>> +{
>>> +	unsigned long addr, end;
>>> +	long found = 0, shared;
>>> +
>>> +	/*
>>> +	 * Walk direct mapping and convert all shared memory back to private,
>>> +	 */
>>> +
>>> +	addr = PAGE_OFFSET;
>>> +	end  = PAGE_OFFSET + get_max_mapped();
>>> +
>>> +	while (addr < end) {
>>> +		unsigned long size;
>>> +		unsigned int level;
>>> +		pte_t *pte;
>>> +
>>> +		pte = lookup_address(addr, &level);
>>
>> IIRC, you were earlier walking the direct mapping using
>> walk_page_range_novma(), any particular reason to use lookup_address()
>> instead ?
> 
> walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> we run here from atomic context in case of crash.
> 
> I considered using trylock to bypass the limitation, but it is a hack.
> 
>>
>>> +		size = page_level_size(level);
>>> +
>>> +		if (pte && pte_decrypted(*pte)) {
>>
>> Additionally need to add check for pte_none() here to handle physical memory
>> holes in direct mapping.
> 
> lookup_address() returns NULL for none entries.
>

Looking at lookup_address_in_pgd(), at pte level it is simply returning
pte_offset_kernel() and there does not seem to be a check for returning 
NULL if pte_none() ?

>>> +			int pages = size / PAGE_SIZE;
>>> +
>>> +			/*
>>> +			 * Touching memory with shared bit set triggers implicit
>>> +			 * conversion to shared.
>>> +			 *
>>> +			 * Make sure nobody touches the shared range from
>>> +			 * now on.
>>> +			 *
>>> +			 * Bypass unmapping for crash scenario. Unmapping
>>> +			 * requires sleepable context, but in crash case kernel
>>> +			 * hits the code path with interrupts disabled.
>>
>> In case of SNP we will need to temporarily enable interrupts during this
>> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
>> cpa_flush() if interrupts are disabled.
> 
> Do you really need full set_memory_encrypted()? Can't you do something
> ligher?
> 
We need to modify the PTE for setting c-bit to 1 so that will require 
cpa_flush(), though probably can add something lighter to do
clflush_cache_range() directly ?

Thanks,
Ashish


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 22:01         ` Kalra, Ashish
@ 2023-10-05 22:28           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 22:28 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > +static void unshare_all_memory(bool unmap)
> > > > +{
> > > > +	unsigned long addr, end;
> > > > +	long found = 0, shared;
> > > > +
> > > > +	/*
> > > > +	 * Walk direct mapping and convert all shared memory back to private,
> > > > +	 */
> > > > +
> > > > +	addr = PAGE_OFFSET;
> > > > +	end  = PAGE_OFFSET + get_max_mapped();
> > > > +
> > > > +	while (addr < end) {
> > > > +		unsigned long size;
> > > > +		unsigned int level;
> > > > +		pte_t *pte;
> > > > +
> > > > +		pte = lookup_address(addr, &level);
> > > 
> > > IIRC, you were earlier walking the direct mapping using
> > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > instead ?
> > 
> > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > we run here from atomic context in case of crash.
> > 
> > I considered using trylock to bypass the limitation, but it is a hack.
> > 
> > > 
> > > > +		size = page_level_size(level);
> > > > +
> > > > +		if (pte && pte_decrypted(*pte)) {
> > > 
> > > Additionally need to add check for pte_none() here to handle physical memory
> > > holes in direct mapping.
> > 
> > lookup_address() returns NULL for none entries.
> > 
> 
> Looking at lookup_address_in_pgd(), at pte level it is simply returning
> pte_offset_kernel() and there does not seem to be a check for returning NULL
> if pte_none() ?

Hm. You are right.

I think it yet another quirk in how lookup_address() implemented. We need
to make it straight too.

There's two options: either make lookup_address() return pointer for entry
even if it is NULL, or add check for pte_none() after pte_offset_kernel()
and return NULL if it is true.

I like the first option more as it allows caller to populate the entry if
it wants.

> > > > +			int pages = size / PAGE_SIZE;
> > > > +
> > > > +			/*
> > > > +			 * Touching memory with shared bit set triggers implicit
> > > > +			 * conversion to shared.
> > > > +			 *
> > > > +			 * Make sure nobody touches the shared range from
> > > > +			 * now on.
> > > > +			 *
> > > > +			 * Bypass unmapping for crash scenario. Unmapping
> > > > +			 * requires sleepable context, but in crash case kernel
> > > > +			 * hits the code path with interrupts disabled.
> > > 
> > > In case of SNP we will need to temporarily enable interrupts during this
> > > unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
> > > cpa_flush() if interrupts are disabled.
> > 
> > Do you really need full set_memory_encrypted()? Can't you do something
> > ligher?
> > 
> We need to modify the PTE for setting c-bit to 1 so that will require
> cpa_flush(), though probably can add something lighter to do
> clflush_cache_range() directly ?

For TDX, I don't touch shared bit as nobody suppose to touch the memory
after the point (ans set_memory_np() enforces it for !crash case).

Can't SNP do the same?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-05 22:28           ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-05 22:28 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > +static void unshare_all_memory(bool unmap)
> > > > +{
> > > > +	unsigned long addr, end;
> > > > +	long found = 0, shared;
> > > > +
> > > > +	/*
> > > > +	 * Walk direct mapping and convert all shared memory back to private,
> > > > +	 */
> > > > +
> > > > +	addr = PAGE_OFFSET;
> > > > +	end  = PAGE_OFFSET + get_max_mapped();
> > > > +
> > > > +	while (addr < end) {
> > > > +		unsigned long size;
> > > > +		unsigned int level;
> > > > +		pte_t *pte;
> > > > +
> > > > +		pte = lookup_address(addr, &level);
> > > 
> > > IIRC, you were earlier walking the direct mapping using
> > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > instead ?
> > 
> > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > we run here from atomic context in case of crash.
> > 
> > I considered using trylock to bypass the limitation, but it is a hack.
> > 
> > > 
> > > > +		size = page_level_size(level);
> > > > +
> > > > +		if (pte && pte_decrypted(*pte)) {
> > > 
> > > Additionally need to add check for pte_none() here to handle physical memory
> > > holes in direct mapping.
> > 
> > lookup_address() returns NULL for none entries.
> > 
> 
> Looking at lookup_address_in_pgd(), at pte level it is simply returning
> pte_offset_kernel() and there does not seem to be a check for returning NULL
> if pte_none() ?

Hm. You are right.

I think it yet another quirk in how lookup_address() implemented. We need
to make it straight too.

There's two options: either make lookup_address() return pointer for entry
even if it is NULL, or add check for pte_none() after pte_offset_kernel()
and return NULL if it is true.

I like the first option more as it allows caller to populate the entry if
it wants.

> > > > +			int pages = size / PAGE_SIZE;
> > > > +
> > > > +			/*
> > > > +			 * Touching memory with shared bit set triggers implicit
> > > > +			 * conversion to shared.
> > > > +			 *
> > > > +			 * Make sure nobody touches the shared range from
> > > > +			 * now on.
> > > > +			 *
> > > > +			 * Bypass unmapping for crash scenario. Unmapping
> > > > +			 * requires sleepable context, but in crash case kernel
> > > > +			 * hits the code path with interrupts disabled.
> > > 
> > > In case of SNP we will need to temporarily enable interrupts during this
> > > unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
> > > cpa_flush() if interrupts are disabled.
> > 
> > Do you really need full set_memory_encrypted()? Can't you do something
> > ligher?
> > 
> We need to modify the PTE for setting c-bit to 1 so that will require
> cpa_flush(), though probably can add something lighter to do
> clflush_cache_range() directly ?

For TDX, I don't touch shared bit as nobody suppose to touch the memory
after the point (ans set_memory_np() enforces it for !crash case).

Can't SNP do the same?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-06 10:22     ` Huang, Kai
  -1 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-06 10:22 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, the relevant code has been moved into a separate file.
> A new configuration option has been introduced to clearly indicate
> dependencies without the use of ifdefs.

Use imperative mood?  

- Move the relevant code into ...
- Introduce a new configuration option to ...
 
> 
> There have been no functional changes.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig                   |  7 +++
>  arch/x86/include/asm/acpi.h        |  5 ++
>  arch/x86/kernel/acpi/Makefile      | 11 ++--
>  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
>  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
>  5 files changed, 99 insertions(+), 90 deletions(-)
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3154dbc49cf5..7368d254d01f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
>  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
>  	select IRQ_DOMAIN_HIERARCHY
>  
> +config X86_ACPI_MADT_WAKEUP
> +	def_bool y
> +	depends on X86_64
> +	depends on ACPI
> +	depends on SMP
> +	depends on X86_LOCAL_APIC
> +
>  config X86_IO_APIC
>  	def_bool y
>  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>  
>  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>  
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +			      const unsigned long end);
> +
>  /*
>   * Check if the CPU can handle C2 and deeper
>   */
> diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> index fc17b3f136fe..8c7329c88a75 100644
> --- a/arch/x86/kernel/acpi/Makefile
> +++ b/arch/x86/kernel/acpi/Makefile
> @@ -1,11 +1,12 @@
>  # SPDX-License-Identifier: GPL-2.0
>  
> -obj-$(CONFIG_ACPI)		+= boot.o
> -obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
> -obj-$(CONFIG_ACPI_APEI)		+= apei.o
> -obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
> +obj-$(CONFIG_ACPI)			+= boot.o
> +obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
> +obj-$(CONFIG_ACPI_APEI)			+= apei.o
> +obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
> +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
>  
>  ifneq ($(CONFIG_ACPI_PROCESSOR),)
> -obj-y				+= cstate.o
> +obj-y					+= cstate.o
>  endif

unintended code change?

[...]

> 
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> new file mode 100644
> index 000000000000..1b9747bfd5b9
> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -0,0 +1,80 @@
> +#include <linux/acpi.h>
> +#include <asm/apic.h>

Functions like memremap(), smp_store_release() and cpu_relax() are used in this
file.  Should we explicitly include the relevant headers?

> +
> +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> +static u64 acpi_mp_wake_mailbox_paddr;
> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> +
> +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> +{
> +	/*
> +	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> +	 *
> +	 * Wakeup of secondary CPUs is fully serialized in the core code.
> +	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> +	 */
> +	if (!acpi_mp_wake_mailbox) {
> +		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> +						sizeof(*acpi_mp_wake_mailbox),
> +						MEMREMAP_WB);
> +	}
> +
> +	/*
> +	 * Mailbox memory is shared between the firmware and OS. Firmware will
> +	 * listen on mailbox command address, and once it receives the wakeup
> +	 * command, the CPU associated with the given apicid will be booted.
> +	 *
> +	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> +	 * firmware before the wakeup command is visible.  smp_store_release()
> +	 * ensures ordering and visibility.
> +	 */
> +	acpi_mp_wake_mailbox->apic_id	    = apicid;
> +	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> +	smp_store_release(&acpi_mp_wake_mailbox->command,
> +			  ACPI_MP_WAKE_COMMAND_WAKEUP);
> +
> +	/*
> +	 * Wait for the CPU to wake up.
> +	 *
> +	 * The CPU being woken up is essentially in a spin loop waiting to be
> +	 * woken up. It should not take long for it wake up and acknowledge by
> +	 * zeroing out ->command.
> +	 *
> +	 * ACPI specification doesn't provide any guidance on how long kernel
> +	 * has to wait for a wake up acknowledgement. It also doesn't provide
> +	 * a way to cancel a wake up request if it takes too long.
> +	 *
> +	 * In TDX environment, the VMM has control over how long it takes to
> +	 * wake up secondary. It can postpone scheduling secondary vCPU
> +	 * indefinitely. Giving up on wake up request and reporting error opens
> +	 * possible attack vector for VMM: it can wake up a secondary CPU when
> +	 * kernel doesn't expect it. Wait until positive result of the wake up
> +	 * request.
> +	 */
> +	while (READ_ONCE(acpi_mp_wake_mailbox->command))
> +		cpu_relax();
> +
> +	return 0;
> +}
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +			      const unsigned long end)
> +{
> +	struct acpi_madt_multiproc_wakeup *mp_wake;
> +
> +	if (!IS_ENABLED(CONFIG_SMP))
> +		return -ENODEV;

Now you have made X86_ACPI_MADT_WAKEUP depend on SMP, and this file will only be
compiled when  X86_ACPI_MADT_WAKEUP is turned on.  IIUC this essentially means
IS_ENABLED(CONFIG_SMP) will always report true?

> +
> +	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> +	if (BAD_MADT_ENTRY(mp_wake, end))
> +		return -EINVAL;
> +
> +	acpi_table_print_madt_entry(&header->common);
> +
> +	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> +
> +	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
> +
> +	return 0;
> +}


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
@ 2023-10-06 10:22     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-06 10:22 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, the relevant code has been moved into a separate file.
> A new configuration option has been introduced to clearly indicate
> dependencies without the use of ifdefs.

Use imperative mood?  

- Move the relevant code into ...
- Introduce a new configuration option to ...
 
> 
> There have been no functional changes.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig                   |  7 +++
>  arch/x86/include/asm/acpi.h        |  5 ++
>  arch/x86/kernel/acpi/Makefile      | 11 ++--
>  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
>  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
>  5 files changed, 99 insertions(+), 90 deletions(-)
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3154dbc49cf5..7368d254d01f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
>  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
>  	select IRQ_DOMAIN_HIERARCHY
>  
> +config X86_ACPI_MADT_WAKEUP
> +	def_bool y
> +	depends on X86_64
> +	depends on ACPI
> +	depends on SMP
> +	depends on X86_LOCAL_APIC
> +
>  config X86_IO_APIC
>  	def_bool y
>  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>  
>  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>  
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +			      const unsigned long end);
> +
>  /*
>   * Check if the CPU can handle C2 and deeper
>   */
> diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> index fc17b3f136fe..8c7329c88a75 100644
> --- a/arch/x86/kernel/acpi/Makefile
> +++ b/arch/x86/kernel/acpi/Makefile
> @@ -1,11 +1,12 @@
>  # SPDX-License-Identifier: GPL-2.0
>  
> -obj-$(CONFIG_ACPI)		+= boot.o
> -obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
> -obj-$(CONFIG_ACPI_APEI)		+= apei.o
> -obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
> +obj-$(CONFIG_ACPI)			+= boot.o
> +obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
> +obj-$(CONFIG_ACPI_APEI)			+= apei.o
> +obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
> +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
>  
>  ifneq ($(CONFIG_ACPI_PROCESSOR),)
> -obj-y				+= cstate.o
> +obj-y					+= cstate.o
>  endif

unintended code change?

[...]

> 
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> new file mode 100644
> index 000000000000..1b9747bfd5b9
> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -0,0 +1,80 @@
> +#include <linux/acpi.h>
> +#include <asm/apic.h>

Functions like memremap(), smp_store_release() and cpu_relax() are used in this
file.  Should we explicitly include the relevant headers?

> +
> +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> +static u64 acpi_mp_wake_mailbox_paddr;
> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> +
> +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> +{
> +	/*
> +	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> +	 *
> +	 * Wakeup of secondary CPUs is fully serialized in the core code.
> +	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> +	 */
> +	if (!acpi_mp_wake_mailbox) {
> +		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> +						sizeof(*acpi_mp_wake_mailbox),
> +						MEMREMAP_WB);
> +	}
> +
> +	/*
> +	 * Mailbox memory is shared between the firmware and OS. Firmware will
> +	 * listen on mailbox command address, and once it receives the wakeup
> +	 * command, the CPU associated with the given apicid will be booted.
> +	 *
> +	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> +	 * firmware before the wakeup command is visible.  smp_store_release()
> +	 * ensures ordering and visibility.
> +	 */
> +	acpi_mp_wake_mailbox->apic_id	    = apicid;
> +	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> +	smp_store_release(&acpi_mp_wake_mailbox->command,
> +			  ACPI_MP_WAKE_COMMAND_WAKEUP);
> +
> +	/*
> +	 * Wait for the CPU to wake up.
> +	 *
> +	 * The CPU being woken up is essentially in a spin loop waiting to be
> +	 * woken up. It should not take long for it wake up and acknowledge by
> +	 * zeroing out ->command.
> +	 *
> +	 * ACPI specification doesn't provide any guidance on how long kernel
> +	 * has to wait for a wake up acknowledgement. It also doesn't provide
> +	 * a way to cancel a wake up request if it takes too long.
> +	 *
> +	 * In TDX environment, the VMM has control over how long it takes to
> +	 * wake up secondary. It can postpone scheduling secondary vCPU
> +	 * indefinitely. Giving up on wake up request and reporting error opens
> +	 * possible attack vector for VMM: it can wake up a secondary CPU when
> +	 * kernel doesn't expect it. Wait until positive result of the wake up
> +	 * request.
> +	 */
> +	while (READ_ONCE(acpi_mp_wake_mailbox->command))
> +		cpu_relax();
> +
> +	return 0;
> +}
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +			      const unsigned long end)
> +{
> +	struct acpi_madt_multiproc_wakeup *mp_wake;
> +
> +	if (!IS_ENABLED(CONFIG_SMP))
> +		return -ENODEV;

Now you have made X86_ACPI_MADT_WAKEUP depend on SMP, and this file will only be
compiled when  X86_ACPI_MADT_WAKEUP is turned on.  IIUC this essentially means
IS_ENABLED(CONFIG_SMP) will always report true?

> +
> +	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> +	if (BAD_MADT_ENTRY(mp_wake, end))
> +		return -EINVAL;
> +
> +	acpi_table_print_madt_entry(&header->common);
> +
> +	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> +
> +	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
> +
> +	return 0;
> +}

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
  2023-10-06 10:22     ` Huang, Kai
@ 2023-10-06 11:59       ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-06 11:59 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 06, 2023 at 10:22:54AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> > In order to prepare for the expansion of support for the ACPI MADT
> > wakeup method, the relevant code has been moved into a separate file.
> > A new configuration option has been introduced to clearly indicate
> > dependencies without the use of ifdefs.
> 
> Use imperative mood?  
> 
> - Move the relevant code into ...
> - Introduce a new configuration option to ...

Okay.

> > There have been no functional changes.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig                   |  7 +++
> >  arch/x86/include/asm/acpi.h        |  5 ++
> >  arch/x86/kernel/acpi/Makefile      | 11 ++--
> >  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
> >  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> >  5 files changed, 99 insertions(+), 90 deletions(-)
> >  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3154dbc49cf5..7368d254d01f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> >  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> >  	select IRQ_DOMAIN_HIERARCHY
> >  
> > +config X86_ACPI_MADT_WAKEUP
> > +	def_bool y
> > +	depends on X86_64
> > +	depends on ACPI
> > +	depends on SMP
> > +	depends on X86_LOCAL_APIC
> > +
> >  config X86_IO_APIC
> >  	def_bool y
> >  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index c8a7fc23f63c..b536b5a6a57b 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
> >  
> >  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
> >  
> > +union acpi_subtable_headers;
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > +			      const unsigned long end);
> > +
> >  /*
> >   * Check if the CPU can handle C2 and deeper
> >   */
> > diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> > index fc17b3f136fe..8c7329c88a75 100644
> > --- a/arch/x86/kernel/acpi/Makefile
> > +++ b/arch/x86/kernel/acpi/Makefile
> > @@ -1,11 +1,12 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  
> > -obj-$(CONFIG_ACPI)		+= boot.o
> > -obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
> > -obj-$(CONFIG_ACPI_APEI)		+= apei.o
> > -obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
> > +obj-$(CONFIG_ACPI)			+= boot.o
> > +obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
> > +obj-$(CONFIG_ACPI_APEI)			+= apei.o
> > +obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
> > +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
> >  
> >  ifneq ($(CONFIG_ACPI_PROCESSOR),)
> > -obj-y				+= cstate.o
> > +obj-y					+= cstate.o
> >  endif
> 
> unintended code change?

No. It keeps += aligned as they are before the patch.
> 
> [...]
> 
> > 
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > new file mode 100644
> > index 000000000000..1b9747bfd5b9
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -0,0 +1,80 @@
> > +#include <linux/acpi.h>
> > +#include <asm/apic.h>
> 
> Functions like memremap(), smp_store_release() and cpu_relax() are used in this
> file.  Should we explicitly include the relevant headers?

Okay, will do.

> > +
> > +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > +static u64 acpi_mp_wake_mailbox_paddr;
> > +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> > +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> > +
> > +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> > +{
> > +	/*
> > +	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> > +	 *
> > +	 * Wakeup of secondary CPUs is fully serialized in the core code.
> > +	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> > +	 */
> > +	if (!acpi_mp_wake_mailbox) {
> > +		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> > +						sizeof(*acpi_mp_wake_mailbox),
> > +						MEMREMAP_WB);
> > +	}
> > +
> > +	/*
> > +	 * Mailbox memory is shared between the firmware and OS. Firmware will
> > +	 * listen on mailbox command address, and once it receives the wakeup
> > +	 * command, the CPU associated with the given apicid will be booted.
> > +	 *
> > +	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> > +	 * firmware before the wakeup command is visible.  smp_store_release()
> > +	 * ensures ordering and visibility.
> > +	 */
> > +	acpi_mp_wake_mailbox->apic_id	    = apicid;
> > +	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> > +	smp_store_release(&acpi_mp_wake_mailbox->command,
> > +			  ACPI_MP_WAKE_COMMAND_WAKEUP);
> > +
> > +	/*
> > +	 * Wait for the CPU to wake up.
> > +	 *
> > +	 * The CPU being woken up is essentially in a spin loop waiting to be
> > +	 * woken up. It should not take long for it wake up and acknowledge by
> > +	 * zeroing out ->command.
> > +	 *
> > +	 * ACPI specification doesn't provide any guidance on how long kernel
> > +	 * has to wait for a wake up acknowledgement. It also doesn't provide
> > +	 * a way to cancel a wake up request if it takes too long.
> > +	 *
> > +	 * In TDX environment, the VMM has control over how long it takes to
> > +	 * wake up secondary. It can postpone scheduling secondary vCPU
> > +	 * indefinitely. Giving up on wake up request and reporting error opens
> > +	 * possible attack vector for VMM: it can wake up a secondary CPU when
> > +	 * kernel doesn't expect it. Wait until positive result of the wake up
> > +	 * request.
> > +	 */
> > +	while (READ_ONCE(acpi_mp_wake_mailbox->command))
> > +		cpu_relax();
> > +
> > +	return 0;
> > +}
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > +			      const unsigned long end)
> > +{
> > +	struct acpi_madt_multiproc_wakeup *mp_wake;
> > +
> > +	if (!IS_ENABLED(CONFIG_SMP))
> > +		return -ENODEV;
> 
> Now you have made X86_ACPI_MADT_WAKEUP depend on SMP, and this file will only be
> compiled when  X86_ACPI_MADT_WAKEUP is turned on.  IIUC this essentially means
> IS_ENABLED(CONFIG_SMP) will always report true?

Good catch. I didn't have 'depends SMP' initially, but it triggered build
issue, so I added it, but forgot to drop IS_ENABLED().

> > +
> > +	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> > +	if (BAD_MADT_ENTRY(mp_wake, end))
> > +		return -EINVAL;
> > +
> > +	acpi_table_print_madt_entry(&header->common);
> > +
> > +	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> > +
> > +	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
> > +
> > +	return 0;
> > +}
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
@ 2023-10-06 11:59       ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-06 11:59 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 06, 2023 at 10:22:54AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> > In order to prepare for the expansion of support for the ACPI MADT
> > wakeup method, the relevant code has been moved into a separate file.
> > A new configuration option has been introduced to clearly indicate
> > dependencies without the use of ifdefs.
> 
> Use imperative mood?  
> 
> - Move the relevant code into ...
> - Introduce a new configuration option to ...

Okay.

> > There have been no functional changes.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig                   |  7 +++
> >  arch/x86/include/asm/acpi.h        |  5 ++
> >  arch/x86/kernel/acpi/Makefile      | 11 ++--
> >  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
> >  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> >  5 files changed, 99 insertions(+), 90 deletions(-)
> >  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3154dbc49cf5..7368d254d01f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> >  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> >  	select IRQ_DOMAIN_HIERARCHY
> >  
> > +config X86_ACPI_MADT_WAKEUP
> > +	def_bool y
> > +	depends on X86_64
> > +	depends on ACPI
> > +	depends on SMP
> > +	depends on X86_LOCAL_APIC
> > +
> >  config X86_IO_APIC
> >  	def_bool y
> >  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index c8a7fc23f63c..b536b5a6a57b 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
> >  
> >  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
> >  
> > +union acpi_subtable_headers;
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > +			      const unsigned long end);
> > +
> >  /*
> >   * Check if the CPU can handle C2 and deeper
> >   */
> > diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> > index fc17b3f136fe..8c7329c88a75 100644
> > --- a/arch/x86/kernel/acpi/Makefile
> > +++ b/arch/x86/kernel/acpi/Makefile
> > @@ -1,11 +1,12 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  
> > -obj-$(CONFIG_ACPI)		+= boot.o
> > -obj-$(CONFIG_ACPI_SLEEP)	+= sleep.o wakeup_$(BITS).o
> > -obj-$(CONFIG_ACPI_APEI)		+= apei.o
> > -obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc.o
> > +obj-$(CONFIG_ACPI)			+= boot.o
> > +obj-$(CONFIG_ACPI_SLEEP)		+= sleep.o wakeup_$(BITS).o
> > +obj-$(CONFIG_ACPI_APEI)			+= apei.o
> > +obj-$(CONFIG_ACPI_CPPC_LIB)		+= cppc.o
> > +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP)	+= madt_wakeup.o
> >  
> >  ifneq ($(CONFIG_ACPI_PROCESSOR),)
> > -obj-y				+= cstate.o
> > +obj-y					+= cstate.o
> >  endif
> 
> unintended code change?

No. It keeps += aligned as they are before the patch.
> 
> [...]
> 
> > 
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > new file mode 100644
> > index 000000000000..1b9747bfd5b9
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -0,0 +1,80 @@
> > +#include <linux/acpi.h>
> > +#include <asm/apic.h>
> 
> Functions like memremap(), smp_store_release() and cpu_relax() are used in this
> file.  Should we explicitly include the relevant headers?

Okay, will do.

> > +
> > +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > +static u64 acpi_mp_wake_mailbox_paddr;
> > +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> > +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> > +
> > +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> > +{
> > +	/*
> > +	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> > +	 *
> > +	 * Wakeup of secondary CPUs is fully serialized in the core code.
> > +	 * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> > +	 */
> > +	if (!acpi_mp_wake_mailbox) {
> > +		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> > +						sizeof(*acpi_mp_wake_mailbox),
> > +						MEMREMAP_WB);
> > +	}
> > +
> > +	/*
> > +	 * Mailbox memory is shared between the firmware and OS. Firmware will
> > +	 * listen on mailbox command address, and once it receives the wakeup
> > +	 * command, the CPU associated with the given apicid will be booted.
> > +	 *
> > +	 * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> > +	 * firmware before the wakeup command is visible.  smp_store_release()
> > +	 * ensures ordering and visibility.
> > +	 */
> > +	acpi_mp_wake_mailbox->apic_id	    = apicid;
> > +	acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> > +	smp_store_release(&acpi_mp_wake_mailbox->command,
> > +			  ACPI_MP_WAKE_COMMAND_WAKEUP);
> > +
> > +	/*
> > +	 * Wait for the CPU to wake up.
> > +	 *
> > +	 * The CPU being woken up is essentially in a spin loop waiting to be
> > +	 * woken up. It should not take long for it wake up and acknowledge by
> > +	 * zeroing out ->command.
> > +	 *
> > +	 * ACPI specification doesn't provide any guidance on how long kernel
> > +	 * has to wait for a wake up acknowledgement. It also doesn't provide
> > +	 * a way to cancel a wake up request if it takes too long.
> > +	 *
> > +	 * In TDX environment, the VMM has control over how long it takes to
> > +	 * wake up secondary. It can postpone scheduling secondary vCPU
> > +	 * indefinitely. Giving up on wake up request and reporting error opens
> > +	 * possible attack vector for VMM: it can wake up a secondary CPU when
> > +	 * kernel doesn't expect it. Wait until positive result of the wake up
> > +	 * request.
> > +	 */
> > +	while (READ_ONCE(acpi_mp_wake_mailbox->command))
> > +		cpu_relax();
> > +
> > +	return 0;
> > +}
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > +			      const unsigned long end)
> > +{
> > +	struct acpi_madt_multiproc_wakeup *mp_wake;
> > +
> > +	if (!IS_ENABLED(CONFIG_SMP))
> > +		return -ENODEV;
> 
> Now you have made X86_ACPI_MADT_WAKEUP depend on SMP, and this file will only be
> compiled when  X86_ACPI_MADT_WAKEUP is turned on.  IIUC this essentially means
> IS_ENABLED(CONFIG_SMP) will always report true?

Good catch. I didn't have 'depends SMP' initially, but it triggered build
issue, so I added it, but forgot to drop IS_ENABLED().

> > +
> > +	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> > +	if (BAD_MADT_ENTRY(mp_wake, end))
> > +		return -EINVAL;
> > +
> > +	acpi_table_print_madt_entry(&header->common);
> > +
> > +	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> > +
> > +	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
> > +
> > +	return 0;
> > +}
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-06 14:36     ` Sean Christopherson
  -1 siblings, 0 replies; 106+ messages in thread
From: Sean Christopherson @ 2023-10-06 14:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Paolo Bonzini

+Paolo

Please use get_maintainers...

On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
> 
> 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> 
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
> 
> Do not disable kvmclock if it was not enumerated or disabled by user
> from kernel command line.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> ---
>  arch/x86/kernel/kvmclock.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..cba2e732e53f 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -22,7 +22,7 @@
>  #include <asm/x86_init.h>
>  #include <asm/kvmclock.h>
>  
> -static int kvmclock __initdata = 1;
> +static int kvmclock __ro_after_init = 1;
>  static int kvmclock_vsyscall __initdata = 1;
>  static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
>  static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> @@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
>  
>  void kvmclock_disable(void)
>  {
> -	native_write_msr(msr_kvm_system_time, 0, 0);
> +	if (!kvm_para_available() || !kvmclock)
> +		return;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
> +	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
> +		native_write_msr(msr_kvm_system_time, 0, 0);

Rather than recheck everything and preserve kvmclock, what about leaving the MSR
indices '0' by default and then disable msr_kvm_system_time iff it's non-zero.
That way the disable path won't become stale if the conditions for enabling
kvmclock change.

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..f2fff625576d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -24,8 +24,8 @@
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
-static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static int msr_kvm_system_time __ro_after_init;
+static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
@@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-       native_write_msr(msr_kvm_system_time, 0, 0);
+       if (msr_kvm_system_time)
+               native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
@@ -294,7 +295,10 @@ void __init kvmclock_init(void)
        if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
                msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
                msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-       } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+       } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+               msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+               msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+       } else {
                return;
        }


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
@ 2023-10-06 14:36     ` Sean Christopherson
  0 siblings, 0 replies; 106+ messages in thread
From: Sean Christopherson @ 2023-10-06 14:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Paolo Bonzini

+Paolo

Please use get_maintainers...

On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
> 
> 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> 
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
> 
> Do not disable kvmclock if it was not enumerated or disabled by user
> from kernel command line.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> ---
>  arch/x86/kernel/kvmclock.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..cba2e732e53f 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -22,7 +22,7 @@
>  #include <asm/x86_init.h>
>  #include <asm/kvmclock.h>
>  
> -static int kvmclock __initdata = 1;
> +static int kvmclock __ro_after_init = 1;
>  static int kvmclock_vsyscall __initdata = 1;
>  static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
>  static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> @@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
>  
>  void kvmclock_disable(void)
>  {
> -	native_write_msr(msr_kvm_system_time, 0, 0);
> +	if (!kvm_para_available() || !kvmclock)
> +		return;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
> +	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
> +		native_write_msr(msr_kvm_system_time, 0, 0);

Rather than recheck everything and preserve kvmclock, what about leaving the MSR
indices '0' by default and then disable msr_kvm_system_time iff it's non-zero.
That way the disable path won't become stale if the conditions for enabling
kvmclock change.

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..f2fff625576d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -24,8 +24,8 @@
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
-static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static int msr_kvm_system_time __ro_after_init;
+static int msr_kvm_wall_clock __ro_after_init;
 static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
@@ -195,7 +195,8 @@ static void kvm_setup_secondary_clock(void)
 
 void kvmclock_disable(void)
 {
-       native_write_msr(msr_kvm_system_time, 0, 0);
+       if (msr_kvm_system_time)
+               native_write_msr(msr_kvm_system_time, 0, 0);
 }
 
 static void __init kvmclock_init_mem(void)
@@ -294,7 +295,10 @@ void __init kvmclock_init(void)
        if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
                msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
                msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-       } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+       } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+               msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+               msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+       } else {
                return;
        }


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-10-06 14:36     ` Sean Christopherson
@ 2023-10-06 14:50       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-06 14:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Paolo Bonzini

On Fri, Oct 06, 2023 at 07:36:54AM -0700, Sean Christopherson wrote:
> +Paolo
> 
> Please use get_maintainers...

Will do, sorry.

> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > present in the VM. It leads to write to a MSR that doesn't exist on some
> > configurations, namely in TDX guest:
> > 
> > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > 
> > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > features.
> > 
> > Do not disable kvmclock if it was not enumerated or disabled by user
> > from kernel command line.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> > ---
> >  arch/x86/kernel/kvmclock.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index fb8f52149be9..cba2e732e53f 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -22,7 +22,7 @@
> >  #include <asm/x86_init.h>
> >  #include <asm/kvmclock.h>
> >  
> > -static int kvmclock __initdata = 1;
> > +static int kvmclock __ro_after_init = 1;
> >  static int kvmclock_vsyscall __initdata = 1;
> >  static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
> >  static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> > @@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
> >  
> >  void kvmclock_disable(void)
> >  {
> > -	native_write_msr(msr_kvm_system_time, 0, 0);
> > +	if (!kvm_para_available() || !kvmclock)
> > +		return;
> > +
> > +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
> > +	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
> > +		native_write_msr(msr_kvm_system_time, 0, 0);
> 
> Rather than recheck everything and preserve kvmclock, what about leaving the MSR
> indices '0' by default and then disable msr_kvm_system_time iff it's non-zero.
> That way the disable path won't become stale if the conditions for enabling
> kvmclock change.

Okay, works for me too.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
@ 2023-10-06 14:50       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-06 14:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Paolo Bonzini

On Fri, Oct 06, 2023 at 07:36:54AM -0700, Sean Christopherson wrote:
> +Paolo
> 
> Please use get_maintainers...

Will do, sorry.

> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > present in the VM. It leads to write to a MSR that doesn't exist on some
> > configurations, namely in TDX guest:
> > 
> > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > 
> > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > features.
> > 
> > Do not disable kvmclock if it was not enumerated or disabled by user
> > from kernel command line.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> > ---
> >  arch/x86/kernel/kvmclock.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index fb8f52149be9..cba2e732e53f 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -22,7 +22,7 @@
> >  #include <asm/x86_init.h>
> >  #include <asm/kvmclock.h>
> >  
> > -static int kvmclock __initdata = 1;
> > +static int kvmclock __ro_after_init = 1;
> >  static int kvmclock_vsyscall __initdata = 1;
> >  static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
> >  static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> > @@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
> >  
> >  void kvmclock_disable(void)
> >  {
> > -	native_write_msr(msr_kvm_system_time, 0, 0);
> > +	if (!kvm_para_available() || !kvmclock)
> > +		return;
> > +
> > +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
> > +	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
> > +		native_write_msr(msr_kvm_system_time, 0, 0);
> 
> Rather than recheck everything and preserve kvmclock, what about leaving the MSR
> indices '0' by default and then disable msr_kvm_system_time iff it's non-zero.
> That way the disable path won't become stale if the conditions for enabling
> kvmclock change.

Okay, works for me too.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 13:13 ` [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
@ 2023-10-06 14:58     ` Sean Christopherson
  2023-10-06 14:58     ` Sean Christopherson
  2023-10-08  8:35     ` Baoquan He
  2 siblings, 0 replies; 106+ messages in thread
From: Sean Christopherson @ 2023-10-06 14:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>  	select X86_MEM_ENCRYPT
>  	select X86_MCE
>  	select UNACCEPTED_MEMORY
> +	select EMERGENCY_VIRT_CALLBACK
>  	help
>  	  Support running as a guest under Intel TDX.  Without this support,
>  	  the guest kernel can not boot or run under TDX.

...

>  void __init tdx_early_init(void)
>  {
>  	struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>  	 */
>  	x86_cpuinit.parallel_bringup = false;
>  
> +	machine_ops.shutdown = tdx_shutdown;
> +
> +	/*
> +	 * KVM overrides machine_ops.crash_shutdown, use emergency

This is going to be super confusing.  KVM utilizes the emergency virt callback.
The KVM paravirt guest code uses .crash_shutdown().  People that are passingly
familiar with virt and know what KVM is, but don't already know the difference
between the two are going to be all kinds of confused.

I also feel like you're playing with fire, e.g. what's to stop the hypervisor
specific paravirt guest support from using .shutdown() in the future?

And the callback is invoked for far more than just kexec().  I don't see how the
host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
stuff register state to emulate INIT or RESET.  Unless I'm missing something,
converting shared memory back to private for a shutdown or reboot is undesirable
as adds one more thing that can go wrong and prevent the system from cleanly
shutting down ASAP (for some definitions of "cleanly").

Lastly, doesn't SEV need similar behavior?  This seems like core functionality
for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).  Why not make the
"unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-06 14:58     ` Sean Christopherson
  0 siblings, 0 replies; 106+ messages in thread
From: Sean Christopherson @ 2023-10-06 14:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>  	select X86_MEM_ENCRYPT
>  	select X86_MCE
>  	select UNACCEPTED_MEMORY
> +	select EMERGENCY_VIRT_CALLBACK
>  	help
>  	  Support running as a guest under Intel TDX.  Without this support,
>  	  the guest kernel can not boot or run under TDX.

...

>  void __init tdx_early_init(void)
>  {
>  	struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>  	 */
>  	x86_cpuinit.parallel_bringup = false;
>  
> +	machine_ops.shutdown = tdx_shutdown;
> +
> +	/*
> +	 * KVM overrides machine_ops.crash_shutdown, use emergency

This is going to be super confusing.  KVM utilizes the emergency virt callback.
The KVM paravirt guest code uses .crash_shutdown().  People that are passingly
familiar with virt and know what KVM is, but don't already know the difference
between the two are going to be all kinds of confused.

I also feel like you're playing with fire, e.g. what's to stop the hypervisor
specific paravirt guest support from using .shutdown() in the future?

And the callback is invoked for far more than just kexec().  I don't see how the
host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
stuff register state to emulate INIT or RESET.  Unless I'm missing something,
converting shared memory back to private for a shutdown or reboot is undesirable
as adds one more thing that can go wrong and prevent the system from cleanly
shutting down ASAP (for some definitions of "cleanly").

Lastly, doesn't SEV need similar behavior?  This seems like core functionality
for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).  Why not make the
"unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-06 14:58     ` Sean Christopherson
@ 2023-10-06 15:11       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-06 15:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kalra, Ashish

On Fri, Oct 06, 2023 at 07:58:03AM -0700, Sean Christopherson wrote:
> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7368d254d01f..b5acf9fb4c70 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> >  	select X86_MEM_ENCRYPT
> >  	select X86_MCE
> >  	select UNACCEPTED_MEMORY
> > +	select EMERGENCY_VIRT_CALLBACK
> >  	help
> >  	  Support running as a guest under Intel TDX.  Without this support,
> >  	  the guest kernel can not boot or run under TDX.
> 
> ...
> 
> >  void __init tdx_early_init(void)
> >  {
> >  	struct tdx_module_args args = {
> > @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
> >  	 */
> >  	x86_cpuinit.parallel_bringup = false;
> >  
> > +	machine_ops.shutdown = tdx_shutdown;
> > +
> > +	/*
> > +	 * KVM overrides machine_ops.crash_shutdown, use emergency
> 
> This is going to be super confusing.  KVM utilizes the emergency virt callback.
> The KVM paravirt guest code uses .crash_shutdown().  People that are passingly
> familiar with virt and know what KVM is, but don't already know the difference
> between the two are going to be all kinds of confused.
> 
> I also feel like you're playing with fire, e.g. what's to stop the hypervisor
> specific paravirt guest support from using .shutdown() in the future?
> 
> And the callback is invoked for far more than just kexec().  I don't see how the
> host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
> stuff register state to emulate INIT or RESET.  Unless I'm missing something,
> converting shared memory back to private for a shutdown or reboot is undesirable
> as adds one more thing that can go wrong and prevent the system from cleanly
> shutting down ASAP (for some definitions of "cleanly").

Okay, fair enough. I will look for better way to hookup into kexec
process. That was the best fit I found so far, but yes it is not ideal.

> Lastly, doesn't SEV need similar behavior?  This seems like core functionality
> for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).  Why not make the
> "unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?

I don't know SEV specifics. I am open to collaboration on this.

Tom, Ashish, let me know if you need this in generic code. I can arrange
that.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-06 15:11       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-06 15:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kalra, Ashish

On Fri, Oct 06, 2023 at 07:58:03AM -0700, Sean Christopherson wrote:
> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7368d254d01f..b5acf9fb4c70 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> >  	select X86_MEM_ENCRYPT
> >  	select X86_MCE
> >  	select UNACCEPTED_MEMORY
> > +	select EMERGENCY_VIRT_CALLBACK
> >  	help
> >  	  Support running as a guest under Intel TDX.  Without this support,
> >  	  the guest kernel can not boot or run under TDX.
> 
> ...
> 
> >  void __init tdx_early_init(void)
> >  {
> >  	struct tdx_module_args args = {
> > @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
> >  	 */
> >  	x86_cpuinit.parallel_bringup = false;
> >  
> > +	machine_ops.shutdown = tdx_shutdown;
> > +
> > +	/*
> > +	 * KVM overrides machine_ops.crash_shutdown, use emergency
> 
> This is going to be super confusing.  KVM utilizes the emergency virt callback.
> The KVM paravirt guest code uses .crash_shutdown().  People that are passingly
> familiar with virt and know what KVM is, but don't already know the difference
> between the two are going to be all kinds of confused.
> 
> I also feel like you're playing with fire, e.g. what's to stop the hypervisor
> specific paravirt guest support from using .shutdown() in the future?
> 
> And the callback is invoked for far more than just kexec().  I don't see how the
> host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
> stuff register state to emulate INIT or RESET.  Unless I'm missing something,
> converting shared memory back to private for a shutdown or reboot is undesirable
> as adds one more thing that can go wrong and prevent the system from cleanly
> shutting down ASAP (for some definitions of "cleanly").

Okay, fair enough. I will look for better way to hookup into kexec
process. That was the best fit I found so far, but yes it is not ideal.

> Lastly, doesn't SEV need similar behavior?  This seems like core functionality
> for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).  Why not make the
> "unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?

I don't know SEV specifics. I am open to collaboration on this.

Tom, Ashish, let me know if you need this in generic code. I can arrange
that.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-06 18:33     ` Kuppuswamy Sathyanarayanan
  -1 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-06 18:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

Hi Kirill,

On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, the relevant code has been moved into a separate file.
> A new configuration option has been introduced to clearly indicate
> dependencies without the use of ifdefs.
> 
> There have been no functional changes.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig                   |  7 +++
>  arch/x86/include/asm/acpi.h        |  5 ++
>  arch/x86/kernel/acpi/Makefile      | 11 ++--
>  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
>  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
>  5 files changed, 99 insertions(+), 90 deletions(-)
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3154dbc49cf5..7368d254d01f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
>  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
>  	select IRQ_DOMAIN_HIERARCHY
>  
> +config X86_ACPI_MADT_WAKEUP
> +	def_bool y
> +	depends on X86_64
> +	depends on ACPI
> +	depends on SMP
> +	depends on X86_LOCAL_APIC
> +
>  config X86_IO_APIC
>  	def_bool y
>  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>  
>  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>  
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +			      const unsigned long end);
> +

IMO, you don't need to declare acpi_parse_mp_wake() in asm/acpi.h. Since the
only user of this function is in arch/x86/kernel/acpi, you can either create
a header file there or re-use sleep.h.

If you want to leave it here, do you want to protect it with
CONFIG_X86_ACPI_MADT_WAKEUP?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
@ 2023-10-06 18:33     ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-06 18:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

Hi Kirill,

On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, the relevant code has been moved into a separate file.
> A new configuration option has been introduced to clearly indicate
> dependencies without the use of ifdefs.
> 
> There have been no functional changes.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig                   |  7 +++
>  arch/x86/include/asm/acpi.h        |  5 ++
>  arch/x86/kernel/acpi/Makefile      | 11 ++--
>  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
>  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
>  5 files changed, 99 insertions(+), 90 deletions(-)
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3154dbc49cf5..7368d254d01f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
>  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
>  	select IRQ_DOMAIN_HIERARCHY
>  
> +config X86_ACPI_MADT_WAKEUP
> +	def_bool y
> +	depends on X86_64
> +	depends on ACPI
> +	depends on SMP
> +	depends on X86_LOCAL_APIC
> +
>  config X86_IO_APIC
>  	def_bool y
>  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>  
>  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>  
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> +			      const unsigned long end);
> +

IMO, you don't need to declare acpi_parse_mp_wake() in asm/acpi.h. Since the
only user of this function is in arch/x86/kernel/acpi, you can either create
a header file there or re-use sleep.h.

If you want to leave it here, do you want to protect it with
CONFIG_X86_ACPI_MADT_WAKEUP?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 22:28           ` Kirill A. Shutemov
@ 2023-10-06 19:24             ` Kalra, Ashish
  -1 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-06 19:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel


On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
>> On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
>>>>> +static void unshare_all_memory(bool unmap)
>>>>> +{
>>>>> +	unsigned long addr, end;
>>>>> +	long found = 0, shared;
>>>>> +
>>>>> +	/*
>>>>> +	 * Walk direct mapping and convert all shared memory back to private,
>>>>> +	 */
>>>>> +
>>>>> +	addr = PAGE_OFFSET;
>>>>> +	end  = PAGE_OFFSET + get_max_mapped();
>>>>> +
>>>>> +	while (addr < end) {
>>>>> +		unsigned long size;
>>>>> +		unsigned int level;
>>>>> +		pte_t *pte;
>>>>> +
>>>>> +		pte = lookup_address(addr, &level);
>>>>
>>>> IIRC, you were earlier walking the direct mapping using
>>>> walk_page_range_novma(), any particular reason to use lookup_address()
>>>> instead ?
>>>
>>> walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
>>> we run here from atomic context in case of crash.
>>>
>>> I considered using trylock to bypass the limitation, but it is a hack.
>>>
>>>>
>>>>> +		size = page_level_size(level);
>>>>> +
>>>>> +		if (pte && pte_decrypted(*pte)) {
>>>>
>>>> Additionally need to add check for pte_none() here to handle physical memory
>>>> holes in direct mapping.
>>>
>>> lookup_address() returns NULL for none entries.
>>>
>>
>> Looking at lookup_address_in_pgd(), at pte level it is simply returning
>> pte_offset_kernel() and there does not seem to be a check for returning NULL
>> if pte_none() ?
> 
> Hm. You are right.
> 
> I think it yet another quirk in how lookup_address() implemented. We need
> to make it straight too.
> 
> There's two options: either make lookup_address() return pointer for entry
> even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> and return NULL if it is true.
> 
> I like the first option more as it allows caller to populate the entry if
> it wants.

Yes, i like the first option.

> 
>>>>> +			int pages = size / PAGE_SIZE;
>>>>> +
>>>>> +			/*
>>>>> +			 * Touching memory with shared bit set triggers implicit
>>>>> +			 * conversion to shared.
>>>>> +			 *
>>>>> +			 * Make sure nobody touches the shared range from
>>>>> +			 * now on.
>>>>> +			 *
>>>>> +			 * Bypass unmapping for crash scenario. Unmapping
>>>>> +			 * requires sleepable context, but in crash case kernel
>>>>> +			 * hits the code path with interrupts disabled.
>>>>
>>>> In case of SNP we will need to temporarily enable interrupts during this
>>>> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
>>>> cpa_flush() if interrupts are disabled.
>>>
>>> Do you really need full set_memory_encrypted()? Can't you do something
>>> ligher?
>>>
>> We need to modify the PTE for setting c-bit to 1 so that will require
>> cpa_flush(), though probably can add something lighter to do
>> clflush_cache_range() directly ?
> 
> For TDX, I don't touch shared bit as nobody suppose to touch the memory
> after the point (ans set_memory_np() enforces it for !crash case).
> 
> Can't SNP do the same?
> 

No, we need to make the PSC call for HV to update the RMP, then set 
C-bit=1 in the PTE and then do a PVALIDATE to switch the page back to 
private, so it needs something like a full set_memory_encrypted().

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-06 19:24             ` Kalra, Ashish
  0 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-06 19:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel


On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
>> On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
>>>>> +static void unshare_all_memory(bool unmap)
>>>>> +{
>>>>> +	unsigned long addr, end;
>>>>> +	long found = 0, shared;
>>>>> +
>>>>> +	/*
>>>>> +	 * Walk direct mapping and convert all shared memory back to private,
>>>>> +	 */
>>>>> +
>>>>> +	addr = PAGE_OFFSET;
>>>>> +	end  = PAGE_OFFSET + get_max_mapped();
>>>>> +
>>>>> +	while (addr < end) {
>>>>> +		unsigned long size;
>>>>> +		unsigned int level;
>>>>> +		pte_t *pte;
>>>>> +
>>>>> +		pte = lookup_address(addr, &level);
>>>>
>>>> IIRC, you were earlier walking the direct mapping using
>>>> walk_page_range_novma(), any particular reason to use lookup_address()
>>>> instead ?
>>>
>>> walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
>>> we run here from atomic context in case of crash.
>>>
>>> I considered using trylock to bypass the limitation, but it is a hack.
>>>
>>>>
>>>>> +		size = page_level_size(level);
>>>>> +
>>>>> +		if (pte && pte_decrypted(*pte)) {
>>>>
>>>> Additionally need to add check for pte_none() here to handle physical memory
>>>> holes in direct mapping.
>>>
>>> lookup_address() returns NULL for none entries.
>>>
>>
>> Looking at lookup_address_in_pgd(), at pte level it is simply returning
>> pte_offset_kernel() and there does not seem to be a check for returning NULL
>> if pte_none() ?
> 
> Hm. You are right.
> 
> I think it yet another quirk in how lookup_address() implemented. We need
> to make it straight too.
> 
> There's two options: either make lookup_address() return pointer for entry
> even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> and return NULL if it is true.
> 
> I like the first option more as it allows caller to populate the entry if
> it wants.

Yes, i like the first option.

> 
>>>>> +			int pages = size / PAGE_SIZE;
>>>>> +
>>>>> +			/*
>>>>> +			 * Touching memory with shared bit set triggers implicit
>>>>> +			 * conversion to shared.
>>>>> +			 *
>>>>> +			 * Make sure nobody touches the shared range from
>>>>> +			 * now on.
>>>>> +			 *
>>>>> +			 * Bypass unmapping for crash scenario. Unmapping
>>>>> +			 * requires sleepable context, but in crash case kernel
>>>>> +			 * hits the code path with interrupts disabled.
>>>>
>>>> In case of SNP we will need to temporarily enable interrupts during this
>>>> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
>>>> cpa_flush() if interrupts are disabled.
>>>
>>> Do you really need full set_memory_encrypted()? Can't you do something
>>> ligher?
>>>
>> We need to modify the PTE for setting c-bit to 1 so that will require
>> cpa_flush(), though probably can add something lighter to do
>> clflush_cache_range() directly ?
> 
> For TDX, I don't touch shared bit as nobody suppose to touch the memory
> after the point (ans set_memory_np() enforces it for !crash case).
> 
> Can't SNP do the same?
> 

No, we need to make the PSC call for HV to update the RMP, then set 
C-bit=1 in the PTE and then do a PVALIDATE to switch the page back to 
private, so it needs something like a full set_memory_encrypted().

Thanks,
Ashish

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-06 15:11       ` Kirill A. Shutemov
@ 2023-10-06 22:15         ` Kalra, Ashish
  -1 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-06 22:15 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel


On 10/6/2023 10:11 AM, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 07:58:03AM -0700, Sean Christopherson wrote:
>> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 7368d254d01f..b5acf9fb4c70 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>>>   	select X86_MEM_ENCRYPT
>>>   	select X86_MCE
>>>   	select UNACCEPTED_MEMORY
>>> +	select EMERGENCY_VIRT_CALLBACK
>>>   	help
>>>   	  Support running as a guest under Intel TDX.  Without this support,
>>>   	  the guest kernel can not boot or run under TDX.
>>
>> ...
>>
>>>   void __init tdx_early_init(void)
>>>   {
>>>   	struct tdx_module_args args = {
>>> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>>>   	 */
>>>   	x86_cpuinit.parallel_bringup = false;
>>>   
>>> +	machine_ops.shutdown = tdx_shutdown;
>>> +
>>> +	/*
>>> +	 * KVM overrides machine_ops.crash_shutdown, use emergency
>>
>> This is going to be super confusing.  KVM utilizes the emergency virt callback.
>> The KVM paravirt guest code uses .crash_shutdown().  People that are passingly
>> familiar with virt and know what KVM is, but don't already know the difference
>> between the two are going to be all kinds of confused.
>>
>> I also feel like you're playing with fire, e.g. what's to stop the hypervisor
>> specific paravirt guest support from using .shutdown() in the future?
>>
>> And the callback is invoked for far more than just kexec().  I don't see how the
>> host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
>> stuff register state to emulate INIT or RESET.  Unless I'm missing something,
>> converting shared memory back to private for a shutdown or reboot is undesirable
>> as adds one more thing that can go wrong and prevent the system from cleanly
>> shutting down ASAP (for some definitions of "cleanly").
> 
> Okay, fair enough. I will look for better way to hookup into kexec
> process. That was the best fit I found so far, but yes it is not ideal.
> 
>> Lastly, doesn't SEV need similar behavior?  This seems like core functionality
>> for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).  Why not make the
>> "unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?
> 
> I don't know SEV specifics. I am open to collaboration on this.
> 
> Tom, Ashish, let me know if you need this in generic code. I can arrange
> that.
> 

Yes, some kind of a generic interface like unshare_on_kexec() gated with 
CC_ATTR_GUEST_MEM_ENCRYPT is needed, we can then add SNP specific kexec 
functionality as part of this.

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-06 22:15         ` Kalra, Ashish
  0 siblings, 0 replies; 106+ messages in thread
From: Kalra, Ashish @ 2023-10-06 22:15 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel


On 10/6/2023 10:11 AM, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 07:58:03AM -0700, Sean Christopherson wrote:
>> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 7368d254d01f..b5acf9fb4c70 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>>>   	select X86_MEM_ENCRYPT
>>>   	select X86_MCE
>>>   	select UNACCEPTED_MEMORY
>>> +	select EMERGENCY_VIRT_CALLBACK
>>>   	help
>>>   	  Support running as a guest under Intel TDX.  Without this support,
>>>   	  the guest kernel can not boot or run under TDX.
>>
>> ...
>>
>>>   void __init tdx_early_init(void)
>>>   {
>>>   	struct tdx_module_args args = {
>>> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>>>   	 */
>>>   	x86_cpuinit.parallel_bringup = false;
>>>   
>>> +	machine_ops.shutdown = tdx_shutdown;
>>> +
>>> +	/*
>>> +	 * KVM overrides machine_ops.crash_shutdown, use emergency
>>
>> This is going to be super confusing.  KVM utilizes the emergency virt callback.
>> The KVM paravirt guest code uses .crash_shutdown().  People that are passingly
>> familiar with virt and know what KVM is, but don't already know the difference
>> between the two are going to be all kinds of confused.
>>
>> I also feel like you're playing with fire, e.g. what's to stop the hypervisor
>> specific paravirt guest support from using .shutdown() in the future?
>>
>> And the callback is invoked for far more than just kexec().  I don't see how the
>> host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
>> stuff register state to emulate INIT or RESET.  Unless I'm missing something,
>> converting shared memory back to private for a shutdown or reboot is undesirable
>> as adds one more thing that can go wrong and prevent the system from cleanly
>> shutting down ASAP (for some definitions of "cleanly").
> 
> Okay, fair enough. I will look for better way to hookup into kexec
> process. That was the best fit I found so far, but yes it is not ideal.
> 
>> Lastly, doesn't SEV need similar behavior?  This seems like core functionality
>> for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).  Why not make the
>> "unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?
> 
> I don't know SEV specifics. I am open to collaboration on this.
> 
> Tom, Ashish, let me know if you need this in generic code. I can arrange
> that.
> 

Yes, some kind of a generic interface like unshare_on_kexec() gated with 
CC_ATTR_GUEST_MEM_ENCRYPT is needed, we can then add SNP specific kexec 
functionality as part of this.

Thanks,
Ashish

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-05 13:13 ` [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
@ 2023-10-08  8:35     ` Baoquan He
  2023-10-06 14:58     ` Sean Christopherson
  2023-10-08  8:35     ` Baoquan He
  2 siblings, 0 replies; 106+ messages in thread
From: Baoquan He @ 2023-10-08  8:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
> 
> The target kernel has no idea what memory is converted this way. It only
      ~~~~~~~~~~~~~
> sees E820_TYPE_RAM.

I finally realized it means the 2nd kernel of kexec rebooting. Maybe we
can call it 2nd kernel always, it works for both kexec and kdump
jumping. 

> 
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
> 
> On TD shutdown (also covers kexec), walk direct mapping and convert all
> shared memory back to private. It makes all RAM private again and target
> kernel may use it normally.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig          |   1 +
>  arch/x86/coco/tdx/kexec.c |   0
>  arch/x86/coco/tdx/tdx.c   | 137 +++++++++++++++++++++++++++++++++++++-
>  3 files changed, 136 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/coco/tdx/kexec.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>  	select X86_MEM_ENCRYPT
>  	select X86_MCE
>  	select UNACCEPTED_MEMORY
> +	select EMERGENCY_VIRT_CALLBACK
>  	help
>  	  Support running as a guest under Intel TDX.  Without this support,
>  	  the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
> new file mode 100644
> index 000000000000..e69de29bb2d1
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 56e152126f20..ac0745303983 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -6,6 +6,7 @@
>  
>  #include <linux/cpufeature.h>
>  #include <linux/debugfs.h>
> +#include <linux/delay.h>
>  #include <linux/export.h>
>  #include <linux/io.h>
>  #include <asm/coco.h>
> @@ -14,6 +15,8 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/reboot.h>
> +#include <asm/set_memory.h>
>  
>  /* MMIO direction */
>  #define EPT_READ	0
> @@ -40,6 +43,9 @@
>  
>  static atomic_long_t nr_shared;
>  
> +static atomic_t conversions_in_progress;
> +static bool conversion_allowed = true;
> +
>  static inline bool pte_decrypted(pte_t pte)
>  {
>  	return cc_mkdec(pte_val(pte)) == pte_val(pte);
> @@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
>  
>  static bool tdx_cache_flush_required(void)
>  {
> +	/*
> +	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> +	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
> +	 * implicit conversion to shared.
> +	 */
> +	if (!conversion_allowed)
> +		return false;
> +
>  	/*
>  	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
>  	 * TDX doesn't have such capability.
> @@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>  static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
>  					 bool enc)
>  {
> +	atomic_inc(&conversions_in_progress);
> +
> +	/*
> +	 * Check after bumping conversions_in_progress to serialize
> +	 * against tdx_shutdown().
> +	 */
> +	if (!conversion_allowed) {
> +		atomic_dec(&conversions_in_progress);
> +		return -EBUSY;
> +	}
> +
>  	/*
>  	 * Only handle shared->private conversion here.
>  	 * See the comment in tdx_early_init().
>  	 */
> -	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>  		return -EIO;
> +	}
>  
>  	return 0;
>  }
> @@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
>  	 * Only handle private->shared conversion here.
>  	 * See the comment in tdx_early_init().
>  	 */
> -	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>  		return -EIO;
> +	}
>  
>  	if (enc)
>  		atomic_long_sub(numpages, &nr_shared);
>  	else
>  		atomic_long_add(numpages, &nr_shared);
>  
> +	atomic_dec(&conversions_in_progress);
> +
>  	return 0;
>  }
>  
> +static void unshare_all_memory(bool unmap)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	/*
> +	 * Walk direct mapping and convert all shared memory back to private,
> +	 */
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {
> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 *
> +			 * Bypass unmapping for crash scenario. Unmapping
> +			 * requires sleepable context, but in crash case kernel
> +			 * hits the code path with interrupts disabled.
> +			 * It shouldn't be a problem as all secondary CPUs are
> +			 * down and kernel runs with interrupts disabled, so
> +			 * there is no room for race.
> +			 */
> +			if (unmap)
> +				set_memory_np(addr, pages);
> +
> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}
> +
> +static void tdx_shutdown(void)
> +{
> +	unsigned long timeout;
> +
> +	/*
> +	 * Stop new private<->shared conversions and wait for in-flight
> +	 * conversions to complete.
> +	 *
> +	 * Do not wait more than 30 seconds.
> +	 */
> +	timeout = 30 * USEC_PER_SEC;
> +	conversion_allowed = false;
> +	while (atomic_read(&conversions_in_progress) && timeout--)
> +		udelay(1);
> +
> +	if (!timeout)
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(true);
> +
> +	native_machine_shutdown();
> +}
> +
> +static void tdx_crash_shutdown(void)
> +{
> +	/*
> +	 * Crash can race with private<->shared conversion.
> +	 *
> +	 * There's no clean way out: report and proceed.
> +	 */
> +	if (atomic_read(&conversions_in_progress))
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(false);
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>  	 */
>  	x86_cpuinit.parallel_bringup = false;
>  
> +	machine_ops.shutdown = tdx_shutdown;
> +
> +	/*
> +	 * KVM overrides machine_ops.crash_shutdown, use emergency
> +	 * virt callback instead.
> +	 */
> +	cpu_emergency_register_virt_callback(tdx_crash_shutdown);
> +
>  	pr_info("Guest detected\n");
>  }
>  
> -- 
> 2.41.0
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-08  8:35     ` Baoquan He
  0 siblings, 0 replies; 106+ messages in thread
From: Baoquan He @ 2023-10-08  8:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
> 
> The target kernel has no idea what memory is converted this way. It only
      ~~~~~~~~~~~~~
> sees E820_TYPE_RAM.

I finally realized it means the 2nd kernel of kexec rebooting. Maybe we
can call it 2nd kernel always, it works for both kexec and kdump
jumping. 

> 
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
> 
> On TD shutdown (also covers kexec), walk direct mapping and convert all
> shared memory back to private. It makes all RAM private again and target
> kernel may use it normally.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig          |   1 +
>  arch/x86/coco/tdx/kexec.c |   0
>  arch/x86/coco/tdx/tdx.c   | 137 +++++++++++++++++++++++++++++++++++++-
>  3 files changed, 136 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/coco/tdx/kexec.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>  	select X86_MEM_ENCRYPT
>  	select X86_MCE
>  	select UNACCEPTED_MEMORY
> +	select EMERGENCY_VIRT_CALLBACK
>  	help
>  	  Support running as a guest under Intel TDX.  Without this support,
>  	  the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
> new file mode 100644
> index 000000000000..e69de29bb2d1
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 56e152126f20..ac0745303983 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -6,6 +6,7 @@
>  
>  #include <linux/cpufeature.h>
>  #include <linux/debugfs.h>
> +#include <linux/delay.h>
>  #include <linux/export.h>
>  #include <linux/io.h>
>  #include <asm/coco.h>
> @@ -14,6 +15,8 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/reboot.h>
> +#include <asm/set_memory.h>
>  
>  /* MMIO direction */
>  #define EPT_READ	0
> @@ -40,6 +43,9 @@
>  
>  static atomic_long_t nr_shared;
>  
> +static atomic_t conversions_in_progress;
> +static bool conversion_allowed = true;
> +
>  static inline bool pte_decrypted(pte_t pte)
>  {
>  	return cc_mkdec(pte_val(pte)) == pte_val(pte);
> @@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
>  
>  static bool tdx_cache_flush_required(void)
>  {
> +	/*
> +	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> +	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
> +	 * implicit conversion to shared.
> +	 */
> +	if (!conversion_allowed)
> +		return false;
> +
>  	/*
>  	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
>  	 * TDX doesn't have such capability.
> @@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>  static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
>  					 bool enc)
>  {
> +	atomic_inc(&conversions_in_progress);
> +
> +	/*
> +	 * Check after bumping conversions_in_progress to serialize
> +	 * against tdx_shutdown().
> +	 */
> +	if (!conversion_allowed) {
> +		atomic_dec(&conversions_in_progress);
> +		return -EBUSY;
> +	}
> +
>  	/*
>  	 * Only handle shared->private conversion here.
>  	 * See the comment in tdx_early_init().
>  	 */
> -	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>  		return -EIO;
> +	}
>  
>  	return 0;
>  }
> @@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
>  	 * Only handle private->shared conversion here.
>  	 * See the comment in tdx_early_init().
>  	 */
> -	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> +	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> +		atomic_dec(&conversions_in_progress);
>  		return -EIO;
> +	}
>  
>  	if (enc)
>  		atomic_long_sub(numpages, &nr_shared);
>  	else
>  		atomic_long_add(numpages, &nr_shared);
>  
> +	atomic_dec(&conversions_in_progress);
> +
>  	return 0;
>  }
>  
> +static void unshare_all_memory(bool unmap)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	/*
> +	 * Walk direct mapping and convert all shared memory back to private,
> +	 */
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {
> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 *
> +			 * Bypass unmapping for crash scenario. Unmapping
> +			 * requires sleepable context, but in crash case kernel
> +			 * hits the code path with interrupts disabled.
> +			 * It shouldn't be a problem as all secondary CPUs are
> +			 * down and kernel runs with interrupts disabled, so
> +			 * there is no room for race.
> +			 */
> +			if (unmap)
> +				set_memory_np(addr, pages);
> +
> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}
> +
> +static void tdx_shutdown(void)
> +{
> +	unsigned long timeout;
> +
> +	/*
> +	 * Stop new private<->shared conversions and wait for in-flight
> +	 * conversions to complete.
> +	 *
> +	 * Do not wait more than 30 seconds.
> +	 */
> +	timeout = 30 * USEC_PER_SEC;
> +	conversion_allowed = false;
> +	while (atomic_read(&conversions_in_progress) && timeout--)
> +		udelay(1);
> +
> +	if (!timeout)
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(true);
> +
> +	native_machine_shutdown();
> +}
> +
> +static void tdx_crash_shutdown(void)
> +{
> +	/*
> +	 * Crash can race with private<->shared conversion.
> +	 *
> +	 * There's no clean way out: report and proceed.
> +	 */
> +	if (atomic_read(&conversions_in_progress))
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +
> +	unshare_all_memory(false);
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>  	 */
>  	x86_cpuinit.parallel_bringup = false;
>  
> +	machine_ops.shutdown = tdx_shutdown;
> +
> +	/*
> +	 * KVM overrides machine_ops.crash_shutdown, use emergency
> +	 * virt callback instead.
> +	 */
> +	cpu_emergency_register_virt_callback(tdx_crash_shutdown);
> +
>  	pr_info("Guest detected\n");
>  }
>  
> -- 
> 2.41.0
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 00/13] x86/tdx: Add kexec support
  2023-10-05 13:13 ` Kirill A. Shutemov
@ 2023-10-08 23:49   ` Baoquan He
  -1 siblings, 0 replies; 106+ messages in thread
From: Baoquan He @ 2023-10-08 23:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> The patchset adds bits and pieces to get kexec (and crashkernel) work on
> TDX guest.
> 
> They bring kexec support to the point when we can start the new kernel,
> but it will only be able to use single CPU. It should be enough to cover
> the most common case: crashkernel.

Not sure if this question has been raised and answered in the past. Please
forgive my bad memory if it has. The one cpu is fine to kdump kernel most
of time, while we enable all CPUs by default when kexec rebooting. And kdump
kernel with multiple cpu is allowed too. Wondering if there's plan to
support the multiple cpu on TDX in the future.

Thanks
Baoquan

> 
> The last patch implements CPU offlining according to the approved ACPI
> spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
> kernel.
> 
> Please review. I would be glad for any feedback.
> 
> [1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
> 
> Kirill A. Shutemov (13):
>   x86/acpi: Extract ACPI MADT wakeup code into a separate file
>   kernel/cpu: Add support for declaring CPU hotplug not supported
>   cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
>   x86/kvm: Do not try to disable kvmclock if it was not enabled
>   x86/kexec: Keep CR4.MCE set during kexec for TDX guest
>   x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
>   x86/mm: Return correct level from lookup_address() if pte is none
>   KVM: x86: Add config option to gate emergency virt callback support
>   x86/tdx: Account shared memory
>   x86/tdx: Convert shared memory back to private on kexec
>   x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
>   x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
>   x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
> 
>  arch/x86/Kconfig                     |   8 +
>  arch/x86/coco/core.c                 |   1 -
>  arch/x86/coco/tdx/kexec.c            |   0
>  arch/x86/coco/tdx/tdx.c              | 220 +++++++++++++++++++++-
>  arch/x86/hyperv/ivm.c                |   9 +-
>  arch/x86/include/asm/acpi.h          |   5 +
>  arch/x86/include/asm/pgtable_types.h |   1 +
>  arch/x86/include/asm/reboot.h        |   4 +-
>  arch/x86/include/asm/x86_init.h      |   4 +-
>  arch/x86/kernel/acpi/Makefile        |  11 +-
>  arch/x86/kernel/acpi/boot.c          |  88 +--------
>  arch/x86/kernel/acpi/madt.S          |  28 +++
>  arch/x86/kernel/acpi/madt_wakeup.c   | 262 +++++++++++++++++++++++++++
>  arch/x86/kernel/e820.c               |   9 +-
>  arch/x86/kernel/kvmclock.c           |   9 +-
>  arch/x86/kernel/reboot.c             |   4 +-
>  arch/x86/kernel/relocate_kernel_64.S |   5 +
>  arch/x86/kernel/x86_init.c           |   4 +-
>  arch/x86/kvm/Kconfig                 |   5 +
>  arch/x86/mm/mem_encrypt_amd.c        |   8 +-
>  arch/x86/mm/pat/set_memory.c         |  17 +-
>  include/acpi/actbl2.h                |  19 +-
>  include/linux/cc_platform.h          |  10 -
>  include/linux/cpu.h                  |   2 +
>  kernel/cpu.c                         |  17 +-
>  25 files changed, 604 insertions(+), 146 deletions(-)
>  create mode 100644 arch/x86/coco/tdx/kexec.c
>  create mode 100644 arch/x86/kernel/acpi/madt.S
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> -- 
> 2.41.0
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 00/13] x86/tdx: Add kexec support
@ 2023-10-08 23:49   ` Baoquan He
  0 siblings, 0 replies; 106+ messages in thread
From: Baoquan He @ 2023-10-08 23:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> The patchset adds bits and pieces to get kexec (and crashkernel) work on
> TDX guest.
> 
> They bring kexec support to the point when we can start the new kernel,
> but it will only be able to use single CPU. It should be enough to cover
> the most common case: crashkernel.

Not sure if this question has been raised and answered in the past. Please
forgive my bad memory if it has. The one cpu is fine to kdump kernel most
of time, while we enable all CPUs by default when kexec rebooting. And kdump
kernel with multiple cpu is allowed too. Wondering if there's plan to
support the multiple cpu on TDX in the future.

Thanks
Baoquan

> 
> The last patch implements CPU offlining according to the approved ACPI
> spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
> kernel.
> 
> Please review. I would be glad for any feedback.
> 
> [1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
> 
> Kirill A. Shutemov (13):
>   x86/acpi: Extract ACPI MADT wakeup code into a separate file
>   kernel/cpu: Add support for declaring CPU hotplug not supported
>   cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
>   x86/kvm: Do not try to disable kvmclock if it was not enabled
>   x86/kexec: Keep CR4.MCE set during kexec for TDX guest
>   x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
>   x86/mm: Return correct level from lookup_address() if pte is none
>   KVM: x86: Add config option to gate emergency virt callback support
>   x86/tdx: Account shared memory
>   x86/tdx: Convert shared memory back to private on kexec
>   x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
>   x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
>   x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
> 
>  arch/x86/Kconfig                     |   8 +
>  arch/x86/coco/core.c                 |   1 -
>  arch/x86/coco/tdx/kexec.c            |   0
>  arch/x86/coco/tdx/tdx.c              | 220 +++++++++++++++++++++-
>  arch/x86/hyperv/ivm.c                |   9 +-
>  arch/x86/include/asm/acpi.h          |   5 +
>  arch/x86/include/asm/pgtable_types.h |   1 +
>  arch/x86/include/asm/reboot.h        |   4 +-
>  arch/x86/include/asm/x86_init.h      |   4 +-
>  arch/x86/kernel/acpi/Makefile        |  11 +-
>  arch/x86/kernel/acpi/boot.c          |  88 +--------
>  arch/x86/kernel/acpi/madt.S          |  28 +++
>  arch/x86/kernel/acpi/madt_wakeup.c   | 262 +++++++++++++++++++++++++++
>  arch/x86/kernel/e820.c               |   9 +-
>  arch/x86/kernel/kvmclock.c           |   9 +-
>  arch/x86/kernel/reboot.c             |   4 +-
>  arch/x86/kernel/relocate_kernel_64.S |   5 +
>  arch/x86/kernel/x86_init.c           |   4 +-
>  arch/x86/kvm/Kconfig                 |   5 +
>  arch/x86/mm/mem_encrypt_amd.c        |   8 +-
>  arch/x86/mm/pat/set_memory.c         |  17 +-
>  include/acpi/actbl2.h                |  19 +-
>  include/linux/cc_platform.h          |  10 -
>  include/linux/cpu.h                  |   2 +
>  kernel/cpu.c                         |  17 +-
>  25 files changed, 604 insertions(+), 146 deletions(-)
>  create mode 100644 arch/x86/coco/tdx/kexec.c
>  create mode 100644 arch/x86/kernel/acpi/madt.S
>  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> 
> -- 
> 2.41.0
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-09 12:30     ` Huang, Kai
  -1 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-09 12:30 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> to #VE.
> 
> Use alternatives to keep the flag during kexec for TDX guests.
> 
> The change doesn't affect non-TDX environments.

Nit: non-TDX-guest environments. ?

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Kai Huang <kai.huang@intel.com>

> ---
>  arch/x86/kernel/relocate_kernel_64.S | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 56cab1bb25f5..bea89814b48e 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	 * Set cr4 to a known state:
>  	 *  - physical address extension enabled
>  	 *  - 5-level paging, if it was enabled before
> +	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
> +	 *    in TDX guests.
>  	 */
>  	movl	$X86_CR4_PAE, %eax
>  	testq	$X86_CR4_LA57, %r13
>  	jz	1f
>  	orl	$X86_CR4_LA57, %eax
> +1:
> +	ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
> +	orl	$X86_CR4_MCE, %eax
>  1:
>  	movq	%rax, %cr4
>  


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
@ 2023-10-09 12:30     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-09 12:30 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> to #VE.
> 
> Use alternatives to keep the flag during kexec for TDX guests.
> 
> The change doesn't affect non-TDX environments.

Nit: non-TDX-guest environments. ?

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Kai Huang <kai.huang@intel.com>

> ---
>  arch/x86/kernel/relocate_kernel_64.S | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 56cab1bb25f5..bea89814b48e 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	 * Set cr4 to a known state:
>  	 *  - physical address extension enabled
>  	 *  - 5-level paging, if it was enabled before
> +	 *  - Machine check exception on TDX guest. Clearing MCE is not allowed
> +	 *    in TDX guests.
>  	 */
>  	movl	$X86_CR4_PAE, %eax
>  	testq	$X86_CR4_LA57, %r13
>  	jz	1f
>  	orl	$X86_CR4_LA57, %eax
> +1:
> +	ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
> +	orl	$X86_CR4_MCE, %eax
>  1:
>  	movq	%rax, %cr4
>  

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
  2023-10-06 18:33     ` Kuppuswamy Sathyanarayanan
@ 2023-10-09 13:32       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-09 13:32 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

On Fri, Oct 06, 2023 at 11:33:47AM -0700, Kuppuswamy Sathyanarayanan wrote:
> Hi Kirill,
> 
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > In order to prepare for the expansion of support for the ACPI MADT
> > wakeup method, the relevant code has been moved into a separate file.
> > A new configuration option has been introduced to clearly indicate
> > dependencies without the use of ifdefs.
> > 
> > There have been no functional changes.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig                   |  7 +++
> >  arch/x86/include/asm/acpi.h        |  5 ++
> >  arch/x86/kernel/acpi/Makefile      | 11 ++--
> >  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
> >  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> >  5 files changed, 99 insertions(+), 90 deletions(-)
> >  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3154dbc49cf5..7368d254d01f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> >  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> >  	select IRQ_DOMAIN_HIERARCHY
> >  
> > +config X86_ACPI_MADT_WAKEUP
> > +	def_bool y
> > +	depends on X86_64
> > +	depends on ACPI
> > +	depends on SMP
> > +	depends on X86_LOCAL_APIC
> > +
> >  config X86_IO_APIC
> >  	def_bool y
> >  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index c8a7fc23f63c..b536b5a6a57b 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
> >  
> >  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
> >  
> > +union acpi_subtable_headers;
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > +			      const unsigned long end);
> > +
> 
> IMO, you don't need to declare acpi_parse_mp_wake() in asm/acpi.h. Since the
> only user of this function is in arch/x86/kernel/acpi, you can either create
> a header file there or re-use sleep.h.

Is it a really a bid deal? I don't see how it fits into sleep.h and
introducing one more header file for one declaration seems excessive.

> If you want to leave it here, do you want to protect it with
> CONFIG_X86_ACPI_MADT_WAKEUP?

Declarations are harmless if nobody uses them. Needless ifdeffery hurts
eyes :P

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file
@ 2023-10-09 13:32       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-09 13:32 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

On Fri, Oct 06, 2023 at 11:33:47AM -0700, Kuppuswamy Sathyanarayanan wrote:
> Hi Kirill,
> 
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > In order to prepare for the expansion of support for the ACPI MADT
> > wakeup method, the relevant code has been moved into a separate file.
> > A new configuration option has been introduced to clearly indicate
> > dependencies without the use of ifdefs.
> > 
> > There have been no functional changes.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig                   |  7 +++
> >  arch/x86/include/asm/acpi.h        |  5 ++
> >  arch/x86/kernel/acpi/Makefile      | 11 ++--
> >  arch/x86/kernel/acpi/boot.c        | 86 +-----------------------------
> >  arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> >  5 files changed, 99 insertions(+), 90 deletions(-)
> >  create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3154dbc49cf5..7368d254d01f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> >  	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> >  	select IRQ_DOMAIN_HIERARCHY
> >  
> > +config X86_ACPI_MADT_WAKEUP
> > +	def_bool y
> > +	depends on X86_64
> > +	depends on ACPI
> > +	depends on SMP
> > +	depends on X86_LOCAL_APIC
> > +
> >  config X86_IO_APIC
> >  	def_bool y
> >  	depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index c8a7fc23f63c..b536b5a6a57b 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
> >  
> >  #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
> >  
> > +union acpi_subtable_headers;
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > +			      const unsigned long end);
> > +
> 
> IMO, you don't need to declare acpi_parse_mp_wake() in asm/acpi.h. Since the
> only user of this function is in arch/x86/kernel/acpi, you can either create
> a header file there or re-use sleep.h.

Is it a really a bid deal? I don't see how it fits into sleep.h and
introducing one more header file for one declaration seems excessive.

> If you want to leave it here, do you want to protect it with
> CONFIG_X86_ACPI_MADT_WAKEUP?

Declarations are harmless if nobody uses them. Needless ifdeffery hurts
eyes :P

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
  2023-10-09 12:30     ` Huang, Kai
@ 2023-10-09 13:32       ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-09 13:32 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Mon, Oct 09, 2023 at 12:30:55PM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> > to #VE.
> > 
> > Use alternatives to keep the flag during kexec for TDX guests.
> > 
> > The change doesn't affect non-TDX environments.
> 
> Nit: non-TDX-guest environments. ?

Okay, will fix.

> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Reviewed-by: Kai Huang <kai.huang@intel.com>

Thanks.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
@ 2023-10-09 13:32       ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-09 13:32 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Mon, Oct 09, 2023 at 12:30:55PM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> > to #VE.
> > 
> > Use alternatives to keep the flag during kexec for TDX guests.
> > 
> > The change doesn't affect non-TDX environments.
> 
> Nit: non-TDX-guest environments. ?

Okay, will fix.

> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Reviewed-by: Kai Huang <kai.huang@intel.com>

Thanks.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-08  8:35     ` Baoquan He
@ 2023-10-09 13:35       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-09 13:35 UTC (permalink / raw)
  To: Baoquan He
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Sun, Oct 08, 2023 at 04:35:27PM +0800, Baoquan He wrote:
> On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > TDX guests allocate shared buffers to perform I/O. It is done by
> > allocating pages normally from the buddy allocator and converting them
> > to shared with set_memory_decrypted().
> > 
> > The target kernel has no idea what memory is converted this way. It only
>       ~~~~~~~~~~~~~
> > sees E820_TYPE_RAM.
> 
> I finally realized it means the 2nd kernel of kexec rebooting. Maybe we
> can call it 2nd kernel always, it works for both kexec and kdump
> jumping. 

Okay. Will fix. I am new to kexec and I don't know proper terminology :)

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-09 13:35       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-09 13:35 UTC (permalink / raw)
  To: Baoquan He
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Sun, Oct 08, 2023 at 04:35:27PM +0800, Baoquan He wrote:
> On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > TDX guests allocate shared buffers to perform I/O. It is done by
> > allocating pages normally from the buddy allocator and converting them
> > to shared with set_memory_decrypted().
> > 
> > The target kernel has no idea what memory is converted this way. It only
>       ~~~~~~~~~~~~~
> > sees E820_TYPE_RAM.
> 
> I finally realized it means the 2nd kernel of kexec rebooting. Maybe we
> can call it 2nd kernel always, it works for both kexec and kdump
> jumping. 

Okay. Will fix. I am new to kexec and I don't know proper terminology :)

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 00/13] x86/tdx: Add kexec support
  2023-10-08 23:49   ` Baoquan He
@ 2023-10-09 13:36     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-09 13:36 UTC (permalink / raw)
  To: Baoquan He
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Mon, Oct 09, 2023 at 07:49:35AM +0800, Baoquan He wrote:
> On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > The patchset adds bits and pieces to get kexec (and crashkernel) work on
> > TDX guest.
> > 
> > They bring kexec support to the point when we can start the new kernel,
> > but it will only be able to use single CPU. It should be enough to cover
> > the most common case: crashkernel.
> 
> Not sure if this question has been raised and answered in the past. Please
> forgive my bad memory if it has. The one cpu is fine to kdump kernel most
> of time, while we enable all CPUs by default when kexec rebooting. And kdump
> kernel with multiple cpu is allowed too. Wondering if there's plan to
> support the multiple cpu on TDX in the future.

Sorry, I didn't update this part of cover letter properly, but the last
patch of the patchset makes possible to kexec with multiple CPUs and the
2nd kernel will see them all. It requires support on BIOS side, otherwise
we fallback to single CPU kexec.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 00/13] x86/tdx: Add kexec support
@ 2023-10-09 13:36     ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-09 13:36 UTC (permalink / raw)
  To: Baoquan He
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Mon, Oct 09, 2023 at 07:49:35AM +0800, Baoquan He wrote:
> On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > The patchset adds bits and pieces to get kexec (and crashkernel) work on
> > TDX guest.
> > 
> > They bring kexec support to the point when we can start the new kernel,
> > but it will only be able to use single CPU. It should be enough to cover
> > the most common case: crashkernel.
> 
> Not sure if this question has been raised and answered in the past. Please
> forgive my bad memory if it has. The one cpu is fine to kdump kernel most
> of time, while we enable all CPUs by default when kexec rebooting. And kdump
> kernel with multiple cpu is allowed too. Wondering if there's plan to
> support the multiple cpu on TDX in the future.

Sorry, I didn't update this part of cover letter properly, but the last
patch of the patchset makes possible to kexec with multiple CPUs and the
2nd kernel will see them all. It requires support on BIOS side, otherwise
we fallback to single CPU kexec.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 00/13] x86/tdx: Add kexec support
  2023-10-09 13:36     ` Kirill A. Shutemov
@ 2023-10-09 14:13       ` Baoquan He
  -1 siblings, 0 replies; 106+ messages in thread
From: Baoquan He @ 2023-10-09 14:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/09/23 at 04:36pm, Kirill A. Shutemov wrote:
> On Mon, Oct 09, 2023 at 07:49:35AM +0800, Baoquan He wrote:
> > On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > > The patchset adds bits and pieces to get kexec (and crashkernel) work on
> > > TDX guest.
> > > 
> > > They bring kexec support to the point when we can start the new kernel,
> > > but it will only be able to use single CPU. It should be enough to cover
> > > the most common case: crashkernel.
> > 
> > Not sure if this question has been raised and answered in the past. Please
> > forgive my bad memory if it has. The one cpu is fine to kdump kernel most
> > of time, while we enable all CPUs by default when kexec rebooting. And kdump
> > kernel with multiple cpu is allowed too. Wondering if there's plan to
> > support the multiple cpu on TDX in the future.
> 
> Sorry, I didn't update this part of cover letter properly, but the last
> patch of the patchset makes possible to kexec with multiple CPUs and the
> 2nd kernel will see them all. It requires support on BIOS side, otherwise
> we fallback to single CPU kexec.

Oops, I didn't read them carefully. You have mentioned that in the last
paragraph of cover letter. That's a great news, thanks.


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 00/13] x86/tdx: Add kexec support
@ 2023-10-09 14:13       ` Baoquan He
  0 siblings, 0 replies; 106+ messages in thread
From: Baoquan He @ 2023-10-09 14:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On 10/09/23 at 04:36pm, Kirill A. Shutemov wrote:
> On Mon, Oct 09, 2023 at 07:49:35AM +0800, Baoquan He wrote:
> > On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > > The patchset adds bits and pieces to get kexec (and crashkernel) work on
> > > TDX guest.
> > > 
> > > They bring kexec support to the point when we can start the new kernel,
> > > but it will only be able to use single CPU. It should be enough to cover
> > > the most common case: crashkernel.
> > 
> > Not sure if this question has been raised and answered in the past. Please
> > forgive my bad memory if it has. The one cpu is fine to kdump kernel most
> > of time, while we enable all CPUs by default when kexec rebooting. And kdump
> > kernel with multiple cpu is allowed too. Wondering if there's plan to
> > support the multiple cpu on TDX in the future.
> 
> Sorry, I didn't update this part of cover letter properly, but the last
> patch of the patchset makes possible to kexec with multiple CPUs and the
> 2nd kernel will see them all. It requires support on BIOS side, otherwise
> we fallback to single CPU kexec.

Oops, I didn't read them carefully. You have mentioned that in the last
paragraph of cover letter. That's a great news, thanks.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 09/13] x86/tdx: Account shared memory
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-10 10:05     ` Huang, Kai
  -1 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-10 10:05 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco


> +#ifdef CONFIG_DEBUG_FS
> +static int tdx_shared_memory_show(struct seq_file *m, void *p)
> +{
> +	unsigned long addr, end;
> +	unsigned long found = 0;
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte))
> +			found += size / PAGE_SIZE;
> +
> +		addr += size;

This could be a long loop, perhaps add cond_resched() here?

> +	}
> +
> +	seq_printf(m, "Number of unshared pages in kernel page tables:  %16lu\n",
> +		   found);
> +	seq_printf(m, "Number of pages accounted as unshared:           %16ld\n",
> +		   atomic_long_read(&nr_shared));
> +	return 0;
> +}
> +


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 09/13] x86/tdx: Account shared memory
@ 2023-10-10 10:05     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-10 10:05 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco


> +#ifdef CONFIG_DEBUG_FS
> +static int tdx_shared_memory_show(struct seq_file *m, void *p)
> +{
> +	unsigned long addr, end;
> +	unsigned long found = 0;
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte))
> +			found += size / PAGE_SIZE;
> +
> +		addr += size;

This could be a long loop, perhaps add cond_resched() here?

> +	}
> +
> +	seq_printf(m, "Number of unshared pages in kernel page tables:  %16lu\n",
> +		   found);
> +	seq_printf(m, "Number of pages accounted as unshared:           %16ld\n",
> +		   atomic_long_read(&nr_shared));
> +	return 0;
> +}
> +

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-10 10:24     ` Huang, Kai
  -1 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-10 10:24 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco


>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  
>  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
>  
> +	/* Disable CPU onlining/offlining */
> +	cpu_hotplug_not_supported();
> +

Both onlining/offlining are prevented, or just offlining?

The previous patch says:

	It does not prevent the initial bring up of the CPU, but it stops 
	subsequent offlining.

And ...

[...]


> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  	 * If the platform does not support hotplug, report it explicitly to
>  	 * differentiate it from a transient offlining failure.
>  	 */
> -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> +	if (!cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;

... here cpu_down_maps_locked() only prevents offlining if I am reading
correctly.

Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match the
behaviour better?

Anyway, isn't it a little bit odd to have:

	if (!cpu_hotplug_supported)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;

?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
@ 2023-10-10 10:24     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-10 10:24 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco


>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  
>  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
>  
> +	/* Disable CPU onlining/offlining */
> +	cpu_hotplug_not_supported();
> +

Both onlining/offlining are prevented, or just offlining?

The previous patch says:

	It does not prevent the initial bring up of the CPU, but it stops 
	subsequent offlining.

And ...

[...]


> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  	 * If the platform does not support hotplug, report it explicitly to
>  	 * differentiate it from a transient offlining failure.
>  	 */
> -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> +	if (!cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;

... here cpu_down_maps_locked() only prevents offlining if I am reading
correctly.

Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match the
behaviour better?

Anyway, isn't it a little bit odd to have:

	if (!cpu_hotplug_supported)
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;

?
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
  2023-10-05 13:13 ` [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported Kirill A. Shutemov
@ 2023-10-10 13:35     ` Kuppuswamy Sathyanarayanan
  2023-10-11 13:08     ` Thomas Gleixner
  1 sibling, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-10 13:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel



On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> The function cpu_hotplug_not_supported() can be called to indicate that
> CPU hotplug should be disabled. It does not prevent the initial bring up
> of the CPU, but it stops subsequent offlining.
> 
> This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.
> 

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/cpu.h |  2 ++
>  kernel/cpu.c        | 17 ++++++++++++++++-
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index f19f56501809..aab3887cadbc 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
>  extern void cpus_read_unlock(void);
>  extern int  cpus_read_trylock(void);
>  extern void lockdep_assert_cpus_held(void);
> +extern void cpu_hotplug_not_supported(void);
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  void clear_tasks_mm_cpumask(int cpu);
> @@ -147,6 +148,7 @@ static inline void cpus_read_lock(void) { }
>  static inline void cpus_read_unlock(void) { }
>  static inline int  cpus_read_trylock(void) { return true; }
>  static inline void lockdep_assert_cpus_held(void) { }
> +static inline void cpu_hotplug_not_supported(void) { }
>  static inline void cpu_hotplug_disable(void) { }
>  static inline void cpu_hotplug_enable(void) { }
>  static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6de7c6bb74ee..cf536fe1a88a 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -484,6 +484,9 @@ static int cpu_hotplug_disabled;
>  
>  DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
>  
> +/* Cleared if platform declares CPU hotplug not supported */
> +static bool cpu_hotplug_supported = true;
> +
>  void cpus_read_lock(void)
>  {
>  	percpu_down_read(&cpu_hotplug_lock);
> @@ -543,6 +546,18 @@ static void lockdep_release_cpus_lock(void)
>  	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
>  }
>  
> +/*
> + * Declare CPU hotplug not supported.
> + *
> + * It doesn't prevent initial bring up of the CPU, but stops offlining.
> + */
> +void cpu_hotplug_not_supported(void)
> +{
> +	cpu_maps_update_begin();
> +	cpu_hotplug_supported = false;
> +	cpu_maps_update_done();
> +}

Since this function is not used in this patch, do you need to add __maybe_unused to
avoid warnings?

> +
>  /*
>   * Wait for currently running CPU hotplug operations to complete (if any) and
>   * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
> @@ -1507,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  	 * If the platform does not support hotplug, report it explicitly to
>  	 * differentiate it from a transient offlining failure.
>  	 */
> -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
> +	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
@ 2023-10-10 13:35     ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-10 13:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel



On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> The function cpu_hotplug_not_supported() can be called to indicate that
> CPU hotplug should be disabled. It does not prevent the initial bring up
> of the CPU, but it stops subsequent offlining.
> 
> This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.
> 

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/cpu.h |  2 ++
>  kernel/cpu.c        | 17 ++++++++++++++++-
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index f19f56501809..aab3887cadbc 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
>  extern void cpus_read_unlock(void);
>  extern int  cpus_read_trylock(void);
>  extern void lockdep_assert_cpus_held(void);
> +extern void cpu_hotplug_not_supported(void);
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  void clear_tasks_mm_cpumask(int cpu);
> @@ -147,6 +148,7 @@ static inline void cpus_read_lock(void) { }
>  static inline void cpus_read_unlock(void) { }
>  static inline int  cpus_read_trylock(void) { return true; }
>  static inline void lockdep_assert_cpus_held(void) { }
> +static inline void cpu_hotplug_not_supported(void) { }
>  static inline void cpu_hotplug_disable(void) { }
>  static inline void cpu_hotplug_enable(void) { }
>  static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6de7c6bb74ee..cf536fe1a88a 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -484,6 +484,9 @@ static int cpu_hotplug_disabled;
>  
>  DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
>  
> +/* Cleared if platform declares CPU hotplug not supported */
> +static bool cpu_hotplug_supported = true;
> +
>  void cpus_read_lock(void)
>  {
>  	percpu_down_read(&cpu_hotplug_lock);
> @@ -543,6 +546,18 @@ static void lockdep_release_cpus_lock(void)
>  	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
>  }
>  
> +/*
> + * Declare CPU hotplug not supported.
> + *
> + * It doesn't prevent initial bring up of the CPU, but stops offlining.
> + */
> +void cpu_hotplug_not_supported(void)
> +{
> +	cpu_maps_update_begin();
> +	cpu_hotplug_supported = false;
> +	cpu_maps_update_done();
> +}

Since this function is not used in this patch, do you need to add __maybe_unused to
avoid warnings?

> +
>  /*
>   * Wait for currently running CPU hotplug operations to complete (if any) and
>   * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
> @@ -1507,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  	 * If the platform does not support hotplug, report it explicitly to
>  	 * differentiate it from a transient offlining failure.
>  	 */
> -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
> +	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-10 13:39     ` Kuppuswamy Sathyanarayanan
  -1 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-10 13:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel



On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline CPU after it got woke up.
> 

I think you can use the term "CPU hotplug" instead of just offline.

> Currently hotplug prevented based on the confidential computing
> attribute which is set for Intel TDX. But TDX is not the only possible
> user of the wake up method.
> 
> Mark CPU hotplug as "not supported" on ACPI MADT wakeup enumeration.

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/coco/core.c               |  1 -
>  arch/x86/kernel/acpi/madt_wakeup.c |  4 ++++
>  include/linux/cc_platform.h        | 10 ----------
>  kernel/cpu.c                       |  2 +-
>  4 files changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index eeec9986570e..f07c3bb7deab 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
>  {
>  	switch (attr) {
>  	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> -	case CC_ATTR_HOTPLUG_DISABLED:
>  	case CC_ATTR_GUEST_MEM_ENCRYPT:
>  	case CC_ATTR_MEM_ENCRYPT:
>  		return true;
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 1b9747bfd5b9..15bdf10b1393 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -1,4 +1,5 @@
>  #include <linux/acpi.h>
> +#include <linux/cpu.h>
>  #include <asm/apic.h>
>  
>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  
>  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
>  
> +	/* Disable CPU onlining/offlining */
> +	cpu_hotplug_not_supported();
> +
>  	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
>  
>  	return 0;
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index cb0d6cd1c12f..d08dd65b5c43 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -80,16 +80,6 @@ enum cc_attr {
>  	 * using AMD SEV-SNP features.
>  	 */
>  	CC_ATTR_GUEST_SEV_SNP,
> -
> -	/**
> -	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
> -	 *
> -	 * The platform/OS is running as a guest/virtual machine does not
> -	 * support CPU hotplug feature.
> -	 *
> -	 * Examples include TDX Guest.
> -	 */
> -	CC_ATTR_HOTPLUG_DISABLED,
>  };
>  
>  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index cf536fe1a88a..9d4279476b40 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  	 * If the platform does not support hotplug, report it explicitly to
>  	 * differentiate it from a transient offlining failure.
>  	 */
> -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> +	if (!cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
@ 2023-10-10 13:39     ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-10 13:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel



On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline CPU after it got woke up.
> 

I think you can use the term "CPU hotplug" instead of just offline.

> Currently hotplug prevented based on the confidential computing
> attribute which is set for Intel TDX. But TDX is not the only possible
> user of the wake up method.
> 
> Mark CPU hotplug as "not supported" on ACPI MADT wakeup enumeration.

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/coco/core.c               |  1 -
>  arch/x86/kernel/acpi/madt_wakeup.c |  4 ++++
>  include/linux/cc_platform.h        | 10 ----------
>  kernel/cpu.c                       |  2 +-
>  4 files changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index eeec9986570e..f07c3bb7deab 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -20,7 +20,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
>  {
>  	switch (attr) {
>  	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> -	case CC_ATTR_HOTPLUG_DISABLED:
>  	case CC_ATTR_GUEST_MEM_ENCRYPT:
>  	case CC_ATTR_MEM_ENCRYPT:
>  		return true;
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 1b9747bfd5b9..15bdf10b1393 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -1,4 +1,5 @@
>  #include <linux/acpi.h>
> +#include <linux/cpu.h>
>  #include <asm/apic.h>
>  
>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  
>  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
>  
> +	/* Disable CPU onlining/offlining */
> +	cpu_hotplug_not_supported();
> +
>  	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
>  
>  	return 0;
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index cb0d6cd1c12f..d08dd65b5c43 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -80,16 +80,6 @@ enum cc_attr {
>  	 * using AMD SEV-SNP features.
>  	 */
>  	CC_ATTR_GUEST_SEV_SNP,
> -
> -	/**
> -	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
> -	 *
> -	 * The platform/OS is running as a guest/virtual machine does not
> -	 * support CPU hotplug feature.
> -	 *
> -	 * Examples include TDX Guest.
> -	 */
> -	CC_ATTR_HOTPLUG_DISABLED,
>  };
>  
>  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index cf536fe1a88a..9d4279476b40 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  	 * If the platform does not support hotplug, report it explicitly to
>  	 * differentiate it from a transient offlining failure.
>  	 */
> -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> +	if (!cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-10 13:53     ` Kuppuswamy Sathyanarayanan
  -1 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-10 13:53 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel



On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
> 
> 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> 
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
> 
> Do not disable kvmclock if it was not enumerated or disabled by user
> from kernel command line.

For the above warning,  check for CLOCKSOURCE and CLOCKSOURCE2
feature is sufficient, right? Do we need to include user/command-line
disable check here?

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> ---
>  arch/x86/kernel/kvmclock.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..cba2e732e53f 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -22,7 +22,7 @@
>  #include <asm/x86_init.h>
>  #include <asm/kvmclock.h>
>  
> -static int kvmclock __initdata = 1;
> +static int kvmclock __ro_after_init = 1;
>  static int kvmclock_vsyscall __initdata = 1;
>  static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
>  static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> @@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
>  
>  void kvmclock_disable(void)
>  {
> -	native_write_msr(msr_kvm_system_time, 0, 0);
> +	if (!kvm_para_available() || !kvmclock)
> +		return;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
> +	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
> +		native_write_msr(msr_kvm_system_time, 0, 0);
>  }
>  
>  static void __init kvmclock_init_mem(void)

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
@ 2023-10-10 13:53     ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 106+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2023-10-10 13:53 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel



On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> present in the VM. It leads to write to a MSR that doesn't exist on some
> configurations, namely in TDX guest:
> 
> 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> 
> kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> features.
> 
> Do not disable kvmclock if it was not enumerated or disabled by user
> from kernel command line.

For the above warning,  check for CLOCKSOURCE and CLOCKSOURCE2
feature is sufficient, right? Do we need to include user/command-line
disable check here?

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> ---
>  arch/x86/kernel/kvmclock.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..cba2e732e53f 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -22,7 +22,7 @@
>  #include <asm/x86_init.h>
>  #include <asm/kvmclock.h>
>  
> -static int kvmclock __initdata = 1;
> +static int kvmclock __ro_after_init = 1;
>  static int kvmclock_vsyscall __initdata = 1;
>  static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
>  static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
> @@ -195,7 +195,12 @@ static void kvm_setup_secondary_clock(void)
>  
>  void kvmclock_disable(void)
>  {
> -	native_write_msr(msr_kvm_system_time, 0, 0);
> +	if (!kvm_para_available() || !kvmclock)
> +		return;
> +
> +	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE) ||
> +	    kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
> +		native_write_msr(msr_kvm_system_time, 0, 0);
>  }
>  
>  static void __init kvmclock_init_mem(void)

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
  2023-10-10 13:35     ` Kuppuswamy Sathyanarayanan
@ 2023-10-11 13:07       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-11 13:07 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

On Tue, Oct 10, 2023 at 06:35:59AM -0700, Kuppuswamy Sathyanarayanan wrote:
> 
> 
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > The function cpu_hotplug_not_supported() can be called to indicate that
> > CPU hotplug should be disabled. It does not prevent the initial bring up
> > of the CPU, but it stops subsequent offlining.
> > 
> > This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.
> > 
> 
> Looks good to me.
> 
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Thanks.

> > @@ -543,6 +546,18 @@ static void lockdep_release_cpus_lock(void)
> >  	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
> >  }
> >  
> > +/*
> > + * Declare CPU hotplug not supported.
> > + *
> > + * It doesn't prevent initial bring up of the CPU, but stops offlining.
> > + */
> > +void cpu_hotplug_not_supported(void)
> > +{
> > +	cpu_maps_update_begin();
> > +	cpu_hotplug_supported = false;
> > +	cpu_maps_update_done();
> > +}
> 
> Since this function is not used in this patch, do you need to add __maybe_unused to
> avoid warnings?

Hm? I don't think compiler complains about non-static unused functions. It
has no visibility if it is used.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
@ 2023-10-11 13:07       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-11 13:07 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

On Tue, Oct 10, 2023 at 06:35:59AM -0700, Kuppuswamy Sathyanarayanan wrote:
> 
> 
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > The function cpu_hotplug_not_supported() can be called to indicate that
> > CPU hotplug should be disabled. It does not prevent the initial bring up
> > of the CPU, but it stops subsequent offlining.
> > 
> > This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.
> > 
> 
> Looks good to me.
> 
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Thanks.

> > @@ -543,6 +546,18 @@ static void lockdep_release_cpus_lock(void)
> >  	rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_);
> >  }
> >  
> > +/*
> > + * Declare CPU hotplug not supported.
> > + *
> > + * It doesn't prevent initial bring up of the CPU, but stops offlining.
> > + */
> > +void cpu_hotplug_not_supported(void)
> > +{
> > +	cpu_maps_update_begin();
> > +	cpu_hotplug_supported = false;
> > +	cpu_maps_update_done();
> > +}
> 
> Since this function is not used in this patch, do you need to add __maybe_unused to
> avoid warnings?

Hm? I don't think compiler complains about non-static unused functions. It
has no visibility if it is used.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
  2023-10-05 13:13 ` [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported Kirill A. Shutemov
@ 2023-10-11 13:08     ` Thomas Gleixner
  2023-10-11 13:08     ` Thomas Gleixner
  1 sibling, 0 replies; 106+ messages in thread
From: Thomas Gleixner @ 2023-10-11 13:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Oct 05 2023 at 16:13, Kirill A. Shutemov wrote:
> The function cpu_hotplug_not_supported() can be called to indicate that
> CPU hotplug should be disabled. It does not prevent the initial bring up
> of the CPU, but it stops subsequent offlining.

This tells me what the patch is doing, but not the why.

> This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.

> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
>  extern void cpus_read_unlock(void);
>  extern int  cpus_read_trylock(void);
>  extern void lockdep_assert_cpus_held(void);
> +extern void cpu_hotplug_not_supported(void);

This function name sucks.

The point is as you explained to prevent offlining, but not onlining. So
can we please make this very clear? Something like:

    cpu_hotplug_disable_offlining()

> +/* Cleared if platform declares CPU hotplug not supported */
> +static bool cpu_hotplug_supported = true;

Again. This is not about disabling hotplug all together. Something like:

static bool cpu_hotplug_offline_disabled;

Which expresses clearly what this is about and does not require this
awkward negation.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported
@ 2023-10-11 13:08     ` Thomas Gleixner
  0 siblings, 0 replies; 106+ messages in thread
From: Thomas Gleixner @ 2023-10-11 13:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Oct 05 2023 at 16:13, Kirill A. Shutemov wrote:
> The function cpu_hotplug_not_supported() can be called to indicate that
> CPU hotplug should be disabled. It does not prevent the initial bring up
> of the CPU, but it stops subsequent offlining.

This tells me what the patch is doing, but not the why.

> This function is intended to replace CC_ATTR_HOTPLUG_DISABLED.

> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -132,6 +132,7 @@ extern void cpus_read_lock(void);
>  extern void cpus_read_unlock(void);
>  extern int  cpus_read_trylock(void);
>  extern void lockdep_assert_cpus_held(void);
> +extern void cpu_hotplug_not_supported(void);

This function name sucks.

The point is as you explained to prevent offlining, but not onlining. So
can we please make this very clear? Something like:

    cpu_hotplug_disable_offlining()

> +/* Cleared if platform declares CPU hotplug not supported */
> +static bool cpu_hotplug_supported = true;

Again. This is not about disabling hotplug all together. Something like:

static bool cpu_hotplug_offline_disabled;

Which expresses clearly what this is about and does not require this
awkward negation.

Thanks,

        tglx

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  2023-10-05 13:13   ` Kirill A. Shutemov
@ 2023-10-11 13:09     ` Thomas Gleixner
  -1 siblings, 0 replies; 106+ messages in thread
From: Thomas Gleixner @ 2023-10-11 13:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Oct 05 2023 at 16:13, Kirill A. Shutemov wrote:
>  
> +	/* Disable CPU onlining/offlining */

That's not what the function does.

> +	cpu_hotplug_not_supported();

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
@ 2023-10-11 13:09     ` Thomas Gleixner
  0 siblings, 0 replies; 106+ messages in thread
From: Thomas Gleixner @ 2023-10-11 13:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Oct 05 2023 at 16:13, Kirill A. Shutemov wrote:
>  
> +	/* Disable CPU onlining/offlining */

That's not what the function does.

> +	cpu_hotplug_not_supported();

Thanks,

        tglx

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
  2023-10-10 13:53     ` Kuppuswamy Sathyanarayanan
@ 2023-10-11 13:11       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-11 13:11 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

On Tue, Oct 10, 2023 at 06:53:27AM -0700, Kuppuswamy Sathyanarayanan wrote:
> 
> 
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > present in the VM. It leads to write to a MSR that doesn't exist on some
> > configurations, namely in TDX guest:
> > 
> > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > 
> > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > features.
> > 
> > Do not disable kvmclock if it was not enumerated or disabled by user
> > from kernel command line.
> 
> For the above warning,  check for CLOCKSOURCE and CLOCKSOURCE2
> feature is sufficient, right? Do we need to include user/command-line
> disable check here?

The command line disables kvmclock, even if it is enumerated, so disabling
it is not needed.

Anyway, I reworked the patch already based on Sean's feedback. No need in
taking parameter into account directly now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled
@ 2023-10-11 13:11       ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-11 13:11 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	kexec, linux-coco, linux-kernel

On Tue, Oct 10, 2023 at 06:53:27AM -0700, Kuppuswamy Sathyanarayanan wrote:
> 
> 
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
> > present in the VM. It leads to write to a MSR that doesn't exist on some
> > configurations, namely in TDX guest:
> > 
> > 	unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
> > 	at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
> > 
> > kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
> > features.
> > 
> > Do not disable kvmclock if it was not enumerated or disabled by user
> > from kernel command line.
> 
> For the above warning,  check for CLOCKSOURCE and CLOCKSOURCE2
> feature is sufficient, right? Do we need to include user/command-line
> disable check here?

The command line disables kvmclock, even if it is enumerated, so disabling
it is not needed.

Anyway, I reworked the patch already based on Sean's feedback. No need in
taking parameter into account directly now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 09/13] x86/tdx: Account shared memory
  2023-10-10 10:05     ` Huang, Kai
@ 2023-10-11 13:14       ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-11 13:14 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Tue, Oct 10, 2023 at 10:05:21AM +0000, Huang, Kai wrote:
> 
> > +#ifdef CONFIG_DEBUG_FS
> > +static int tdx_shared_memory_show(struct seq_file *m, void *p)
> > +{
> > +	unsigned long addr, end;
> > +	unsigned long found = 0;
> > +
> > +	addr = PAGE_OFFSET;
> > +	end  = PAGE_OFFSET + get_max_mapped();
> > +
> > +	while (addr < end) {
> > +		unsigned long size;
> > +		unsigned int level;
> > +		pte_t *pte;
> > +
> > +		pte = lookup_address(addr, &level);
> > +		size = page_level_size(level);
> > +
> > +		if (pte && pte_decrypted(*pte))
> > +			found += size / PAGE_SIZE;
> > +
> > +		addr += size;
> 
> This could be a long loop, perhaps add cond_resched() here?

Sure.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 09/13] x86/tdx: Account shared memory
@ 2023-10-11 13:14       ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-11 13:14 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Tue, Oct 10, 2023 at 10:05:21AM +0000, Huang, Kai wrote:
> 
> > +#ifdef CONFIG_DEBUG_FS
> > +static int tdx_shared_memory_show(struct seq_file *m, void *p)
> > +{
> > +	unsigned long addr, end;
> > +	unsigned long found = 0;
> > +
> > +	addr = PAGE_OFFSET;
> > +	end  = PAGE_OFFSET + get_max_mapped();
> > +
> > +	while (addr < end) {
> > +		unsigned long size;
> > +		unsigned int level;
> > +		pte_t *pte;
> > +
> > +		pte = lookup_address(addr, &level);
> > +		size = page_level_size(level);
> > +
> > +		if (pte && pte_decrypted(*pte))
> > +			found += size / PAGE_SIZE;
> > +
> > +		addr += size;
> 
> This could be a long loop, perhaps add cond_resched() here?

Sure.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  2023-10-05 13:14   ` Kirill A. Shutemov
@ 2023-10-20  3:29     ` Huang, Kai
  -1 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20  3:29 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
> kexec: target kernel won't be able to use more than one CPU.
> 
> Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> that the mailbox is not usable.
> 
> This is Linux-specific protocol and not reflected in ACPI spec.
> 
> Booting the target kernel with signle CPU is enough to cover the most
> common case for kexec -- kdump.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/acpi/madt_wakeup.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 15bdf10b1393..4e92d1d4a5fa 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>  
>  static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
>  {
> +	if (!acpi_mp_wake_mailbox_paddr) {
> +		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	/*
>  	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
>  	 *
> @@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  	/* Disable CPU onlining/offlining */
>  	cpu_hotplug_not_supported();
>  
> +	/*
> +	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
> +	 * It limits kexec: target kernel won't be able to use more than
> +	 * one CPU.
> +	 *
> +	 * Zero out mailbox address in the ACPI MADT wakeup structure to
> +	 * indicate that the mailbox is not usable.

Nit:

It is better to explicitly say that this will only impact the second kernel
because the current kernel has already detected the  mailbox address?

	Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
	The acpi_wakeup_cpu() will use it to bring up secondary cpus.

	Zero out mailbox address in the ACPI MADT wakeup structure to
	indicate that the mailbox is not usable.  This prevents the
	kexec()-ed kernel from reading a vaild mailbox, which in turn
	makes the kexec()-ed kernel only be able to use the boot CPU. 

> +	 *
> +	 * This is Linux-specific protocol and not reflected in ACPI spec.
> +	 */
> +	mp_wake->base_address = 0;
> + 
>  	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
@ 2023-10-20  3:29     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20  3:29 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
> kexec: target kernel won't be able to use more than one CPU.
> 
> Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> that the mailbox is not usable.
> 
> This is Linux-specific protocol and not reflected in ACPI spec.
> 
> Booting the target kernel with signle CPU is enough to cover the most
> common case for kexec -- kdump.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/acpi/madt_wakeup.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 15bdf10b1393..4e92d1d4a5fa 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>  
>  static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
>  {
> +	if (!acpi_mp_wake_mailbox_paddr) {
> +		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	/*
>  	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
>  	 *
> @@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
>  	/* Disable CPU onlining/offlining */
>  	cpu_hotplug_not_supported();
>  
> +	/*
> +	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
> +	 * It limits kexec: target kernel won't be able to use more than
> +	 * one CPU.
> +	 *
> +	 * Zero out mailbox address in the ACPI MADT wakeup structure to
> +	 * indicate that the mailbox is not usable.

Nit:

It is better to explicitly say that this will only impact the second kernel
because the current kernel has already detected the  mailbox address?

	Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
	The acpi_wakeup_cpu() will use it to bring up secondary cpus.

	Zero out mailbox address in the ACPI MADT wakeup structure to
	indicate that the mailbox is not usable.  This prevents the
	kexec()-ed kernel from reading a vaild mailbox, which in turn
	makes the kexec()-ed kernel only be able to use the boot CPU. 

> +	 *
> +	 * This is Linux-specific protocol and not reflected in ACPI spec.
> +	 */
> +	mp_wake->base_address = 0;
> + 
>  	apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
>  
>  	return 0;

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-06 19:24             ` Kalra, Ashish
@ 2023-10-20  9:21               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-20  9:21 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> 
> On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > +{
> > > > > > +	unsigned long addr, end;
> > > > > > +	long found = 0, shared;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Walk direct mapping and convert all shared memory back to private,
> > > > > > +	 */
> > > > > > +
> > > > > > +	addr = PAGE_OFFSET;
> > > > > > +	end  = PAGE_OFFSET + get_max_mapped();
> > > > > > +
> > > > > > +	while (addr < end) {
> > > > > > +		unsigned long size;
> > > > > > +		unsigned int level;
> > > > > > +		pte_t *pte;
> > > > > > +
> > > > > > +		pte = lookup_address(addr, &level);
> > > > > 
> > > > > IIRC, you were earlier walking the direct mapping using
> > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > instead ?
> > > > 
> > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > we run here from atomic context in case of crash.
> > > > 
> > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > 
> > > > > 
> > > > > > +		size = page_level_size(level);
> > > > > > +
> > > > > > +		if (pte && pte_decrypted(*pte)) {
> > > > > 
> > > > > Additionally need to add check for pte_none() here to handle physical memory
> > > > > holes in direct mapping.
> > > > 
> > > > lookup_address() returns NULL for none entries.
> > > > 
> > > 
> > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > pte_offset_kernel() and there does not seem to be a check for returning NULL
> > > if pte_none() ?
> > 
> > Hm. You are right.
> > 
> > I think it yet another quirk in how lookup_address() implemented. We need
> > to make it straight too.
> > 
> > There's two options: either make lookup_address() return pointer for entry
> > even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> > and return NULL if it is true.
> > 
> > I like the first option more as it allows caller to populate the entry if
> > it wants.
> 
> Yes, i like the first option.

I tried to this, but lookup_address() has to many callers. It gets beyond
the scope of the patchset. I will add pte_none() check on unshare side for
now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-20  9:21               ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-20  9:21 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> 
> On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > +{
> > > > > > +	unsigned long addr, end;
> > > > > > +	long found = 0, shared;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Walk direct mapping and convert all shared memory back to private,
> > > > > > +	 */
> > > > > > +
> > > > > > +	addr = PAGE_OFFSET;
> > > > > > +	end  = PAGE_OFFSET + get_max_mapped();
> > > > > > +
> > > > > > +	while (addr < end) {
> > > > > > +		unsigned long size;
> > > > > > +		unsigned int level;
> > > > > > +		pte_t *pte;
> > > > > > +
> > > > > > +		pte = lookup_address(addr, &level);
> > > > > 
> > > > > IIRC, you were earlier walking the direct mapping using
> > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > instead ?
> > > > 
> > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > we run here from atomic context in case of crash.
> > > > 
> > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > 
> > > > > 
> > > > > > +		size = page_level_size(level);
> > > > > > +
> > > > > > +		if (pte && pte_decrypted(*pte)) {
> > > > > 
> > > > > Additionally need to add check for pte_none() here to handle physical memory
> > > > > holes in direct mapping.
> > > > 
> > > > lookup_address() returns NULL for none entries.
> > > > 
> > > 
> > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > pte_offset_kernel() and there does not seem to be a check for returning NULL
> > > if pte_none() ?
> > 
> > Hm. You are right.
> > 
> > I think it yet another quirk in how lookup_address() implemented. We need
> > to make it straight too.
> > 
> > There's two options: either make lookup_address() return pointer for entry
> > even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> > and return NULL if it is true.
> > 
> > I like the first option more as it allows caller to populate the entry if
> > it wants.
> 
> Yes, i like the first option.

I tried to this, but lookup_address() has to many callers. It gets beyond
the scope of the patchset. I will add pte_none() check on unshare side for
now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
  2023-10-20  3:29     ` Huang, Kai
@ 2023-10-20  9:29       ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20  9:29 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 03:29:24AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> > ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
> > kexec: target kernel won't be able to use more than one CPU.
> > 
> > Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> > that the mailbox is not usable.
> > 
> > This is Linux-specific protocol and not reflected in ACPI spec.
> > 
> > Booting the target kernel with signle CPU is enough to cover the most
> > common case for kexec -- kdump.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/kernel/acpi/madt_wakeup.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 15bdf10b1393..4e92d1d4a5fa 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> >  
> >  static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> >  {
> > +	if (!acpi_mp_wake_mailbox_paddr) {
> > +		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> >  	/*
> >  	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> >  	 *
> > @@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> >  	/* Disable CPU onlining/offlining */
> >  	cpu_hotplug_not_supported();
> >  
> > +	/*
> > +	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
> > +	 * It limits kexec: target kernel won't be able to use more than
> > +	 * one CPU.
> > +	 *
> > +	 * Zero out mailbox address in the ACPI MADT wakeup structure to
> > +	 * indicate that the mailbox is not usable.
> 
> Nit:
> 
> It is better to explicitly say that this will only impact the second kernel
> because the current kernel has already detected the  mailbox address?
> 
> 	Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
> 	The acpi_wakeup_cpu() will use it to bring up secondary cpus.
> 
> 	Zero out mailbox address in the ACPI MADT wakeup structure to
> 	indicate that the mailbox is not usable.  This prevents the
> 	kexec()-ed kernel from reading a vaild mailbox, which in turn
> 	makes the kexec()-ed kernel only be able to use the boot CPU. 

Okay. Looks good.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
@ 2023-10-20  9:29       ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20  9:29 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 03:29:24AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> > ACPI MADT doesn't allow to offline CPU after it got woke up. It limits
> > kexec: target kernel won't be able to use more than one CPU.
> > 
> > Zero out mailbox address in the ACPI MADT wakeup structure to indicate
> > that the mailbox is not usable.
> > 
> > This is Linux-specific protocol and not reflected in ACPI spec.
> > 
> > Booting the target kernel with signle CPU is enough to cover the most
> > common case for kexec -- kdump.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/kernel/acpi/madt_wakeup.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 15bdf10b1393..4e92d1d4a5fa 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -9,6 +9,11 @@ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> >  
> >  static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> >  {
> > +	if (!acpi_mp_wake_mailbox_paddr) {
> > +		pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> >  	/*
> >  	 * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> >  	 *
> > @@ -78,6 +83,18 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> >  	/* Disable CPU onlining/offlining */
> >  	cpu_hotplug_not_supported();
> >  
> > +	/*
> > +	 * ACPI MADT doesn't allow to offline CPU after it got woke up.
> > +	 * It limits kexec: target kernel won't be able to use more than
> > +	 * one CPU.
> > +	 *
> > +	 * Zero out mailbox address in the ACPI MADT wakeup structure to
> > +	 * indicate that the mailbox is not usable.
> 
> Nit:
> 
> It is better to explicitly say that this will only impact the second kernel
> because the current kernel has already detected the  mailbox address?
> 
> 	Now acpi_mp_wake_mailbox_paddr already has the mailbox address.
> 	The acpi_wakeup_cpu() will use it to bring up secondary cpus.
> 
> 	Zero out mailbox address in the ACPI MADT wakeup structure to
> 	indicate that the mailbox is not usable.  This prevents the
> 	kexec()-ed kernel from reading a vaild mailbox, which in turn
> 	makes the kexec()-ed kernel only be able to use the boot CPU. 

Okay. Looks good.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
  2023-10-20  9:21               ` Kirill A. Shutemov
@ 2023-10-20  9:39                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-20  9:39 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Fri, Oct 20, 2023 at 12:21:11PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> > 
> > On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > > +{
> > > > > > > +	unsigned long addr, end;
> > > > > > > +	long found = 0, shared;
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Walk direct mapping and convert all shared memory back to private,
> > > > > > > +	 */
> > > > > > > +
> > > > > > > +	addr = PAGE_OFFSET;
> > > > > > > +	end  = PAGE_OFFSET + get_max_mapped();
> > > > > > > +
> > > > > > > +	while (addr < end) {
> > > > > > > +		unsigned long size;
> > > > > > > +		unsigned int level;
> > > > > > > +		pte_t *pte;
> > > > > > > +
> > > > > > > +		pte = lookup_address(addr, &level);
> > > > > > 
> > > > > > IIRC, you were earlier walking the direct mapping using
> > > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > > instead ?
> > > > > 
> > > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > > we run here from atomic context in case of crash.
> > > > > 
> > > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > > 
> > > > > > 
> > > > > > > +		size = page_level_size(level);
> > > > > > > +
> > > > > > > +		if (pte && pte_decrypted(*pte)) {
> > > > > > 
> > > > > > Additionally need to add check for pte_none() here to handle physical memory
> > > > > > holes in direct mapping.
> > > > > 
> > > > > lookup_address() returns NULL for none entries.
> > > > > 
> > > > 
> > > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > > pte_offset_kernel() and there does not seem to be a check for returning NULL
> > > > if pte_none() ?
> > > 
> > > Hm. You are right.
> > > 
> > > I think it yet another quirk in how lookup_address() implemented. We need
> > > to make it straight too.
> > > 
> > > There's two options: either make lookup_address() return pointer for entry
> > > even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> > > and return NULL if it is true.
> > > 
> > > I like the first option more as it allows caller to populate the entry if
> > > it wants.
> > 
> > Yes, i like the first option.
> 
> I tried to this, but lookup_address() has to many callers. It gets beyond
> the scope of the patchset. I will add pte_none() check on unshare side for
> now.

Ah. pte_none() is not need for TDX implementation, as pte_decrypted()
check will fail for it. SEV implementation would need an additional check.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec
@ 2023-10-20  9:39                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: Kirill A. Shutemov @ 2023-10-20  9:39 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, kexec, linux-coco, linux-kernel

On Fri, Oct 20, 2023 at 12:21:11PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> > 
> > On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > > +{
> > > > > > > +	unsigned long addr, end;
> > > > > > > +	long found = 0, shared;
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Walk direct mapping and convert all shared memory back to private,
> > > > > > > +	 */
> > > > > > > +
> > > > > > > +	addr = PAGE_OFFSET;
> > > > > > > +	end  = PAGE_OFFSET + get_max_mapped();
> > > > > > > +
> > > > > > > +	while (addr < end) {
> > > > > > > +		unsigned long size;
> > > > > > > +		unsigned int level;
> > > > > > > +		pte_t *pte;
> > > > > > > +
> > > > > > > +		pte = lookup_address(addr, &level);
> > > > > > 
> > > > > > IIRC, you were earlier walking the direct mapping using
> > > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > > instead ?
> > > > > 
> > > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > > we run here from atomic context in case of crash.
> > > > > 
> > > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > > 
> > > > > > 
> > > > > > > +		size = page_level_size(level);
> > > > > > > +
> > > > > > > +		if (pte && pte_decrypted(*pte)) {
> > > > > > 
> > > > > > Additionally need to add check for pte_none() here to handle physical memory
> > > > > > holes in direct mapping.
> > > > > 
> > > > > lookup_address() returns NULL for none entries.
> > > > > 
> > > > 
> > > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > > pte_offset_kernel() and there does not seem to be a check for returning NULL
> > > > if pte_none() ?
> > > 
> > > Hm. You are right.
> > > 
> > > I think it yet another quirk in how lookup_address() implemented. We need
> > > to make it straight too.
> > > 
> > > There's two options: either make lookup_address() return pointer for entry
> > > even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> > > and return NULL if it is true.
> > > 
> > > I like the first option more as it allows caller to populate the entry if
> > > it wants.
> > 
> > Yes, i like the first option.
> 
> I tried to this, but lookup_address() has to many callers. It gets beyond
> the scope of the patchset. I will add pte_none() check on unshare side for
> now.

Ah. pte_none() is not need for TDX implementation, as pte_decrypted()
check will fail for it. SEV implementation would need an additional check.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-10-05 13:14 ` [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
@ 2023-10-20  9:49     ` Huang, Kai
  2023-10-20 11:21     ` Huang, Kai
  1 sibling, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20  9:49 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
>  struct acpi_madt_multiproc_wakeup {
>  	struct acpi_subtable_header header;
> -	u16 mailbox_version;
> +	u16 version;
>  	u32 reserved;		/* reserved - must be zero */
> -	u64 base_address;
> +	u64 mailbox_address;
> +	u64 reset_vector;
>  };

I don't quite understand the connection between the renaming and what this patch
wants to achieve?  What's the reason to rename?  If needed, perhaps put into a
separate patch with proper justification?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
@ 2023-10-20  9:49     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20  9:49 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco

On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
>  struct acpi_madt_multiproc_wakeup {
>  	struct acpi_subtable_header header;
> -	u16 mailbox_version;
> +	u16 version;
>  	u32 reserved;		/* reserved - must be zero */
> -	u64 base_address;
> +	u64 mailbox_address;
> +	u64 reset_vector;
>  };

I don't quite understand the connection between the renaming and what this patch
wants to achieve?  What's the reason to rename?  If needed, perhaps put into a
separate patch with proper justification?

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-10-20  9:49     ` Huang, Kai
@ 2023-10-20 10:42       ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20 10:42 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 09:49:59AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> >  struct acpi_madt_multiproc_wakeup {
> >  	struct acpi_subtable_header header;
> > -	u16 mailbox_version;
> > +	u16 version;
> >  	u32 reserved;		/* reserved - must be zero */
> > -	u64 base_address;
> > +	u64 mailbox_address;
> > +	u64 reset_vector;
> >  };
> 
> I don't quite understand the connection between the renaming and what this patch
> wants to achieve?  What's the reason to rename?

Names are bad: the version field guides version of the structure, not
version of the mailbox. And it is not clear what base base_address
specifies.

> If needed, perhaps put into a separate patch with proper justification?

Hm. Okay...

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
@ 2023-10-20 10:42       ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20 10:42 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 09:49:59AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> >  struct acpi_madt_multiproc_wakeup {
> >  	struct acpi_subtable_header header;
> > -	u16 mailbox_version;
> > +	u16 version;
> >  	u32 reserved;		/* reserved - must be zero */
> > -	u64 base_address;
> > +	u64 mailbox_address;
> > +	u64 reset_vector;
> >  };
> 
> I don't quite understand the connection between the renaming and what this patch
> wants to achieve?  What's the reason to rename?

Names are bad: the version field guides version of the structure, not
version of the mailbox. And it is not clear what base base_address
specifies.

> If needed, perhaps put into a separate patch with proper justification?

Hm. Okay...

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-10-05 13:14 ` [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
@ 2023-10-20 11:21     ` Huang, Kai
  2023-10-20 11:21     ` Huang, Kai
  1 sibling, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20 11:21 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco


> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt.S
> @@ -0,0 +1,28 @@
> +#include <linux/linkage.h>
> +#include <asm/nospec-branch.h>
> +#include <asm/page_types.h>
> +#include <asm/processor-flags.h>
> +
> +	.text
> +	.align PAGE_SIZE
> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> +	/* Load address of reset vector into RCX to jump when kernel is ready */
> +	movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> +
> +	/* zero out flags, and disable interrupts */
> +	pushq	$0
> +	popfq
> +
> +	/* Turn off global entries. Following CR3 write will flush them. */
> +	movq	%cr4, %rdx
> +	andq	$~(X86_CR4_PGE), %rdx
> +	movq	%rdx, %cr4
> +
> +	/* Switch to identity mapping */
> +	movq	acpi_mp_pgd(%rip), %rax
> +	movq	%rax, %cr3
> +
> +	/* Jump to reset vector */
> +	ANNOTATE_RETPOLINE_SAFE
> +	jmp	*%rcx
> +SYM_FUNC_END(asm_acpi_mp_play_dead)
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 4e92d1d4a5fa..2cc8590ec7a5 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -1,12 +1,162 @@
>  #include <linux/acpi.h>
>  #include <linux/cpu.h>
> +#include <linux/delay.h>
> +#include <linux/memblock.h>
> +#include <linux/pgtable.h>
> +#include <linux/sched/hotplug.h>
>  #include <asm/apic.h>
> +#include <asm/init.h>
>  
>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
>  static u64 acpi_mp_wake_mailbox_paddr;
>  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
>  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>  
> +unsigned long acpi_mp_pgd;
> +u64 acpi_mp_reset_vector_paddr;

Nit: 

They are both physical address.  It's a little bit odd for one to use 'unsigned
long' and the other to use 'u64'.

> +
> +void asm_acpi_mp_play_dead(void);
> +
> +static void __init *alloc_pgt_page(void *context)
> +{
> +	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> +}

If I am reading correclty, @context is never used.  It's not used inside this
function, and all the callers call this with @context = NULL.


[...]

> +
> +static void acpi_mp_play_dead(void)
> +{
> +	idle_task_exit();
> +	cpuhp_ap_report_dead();
> +	asm_acpi_mp_play_dead();
> +}
> +

Looks you can use play_dead_common() here, if you take IRQ disable part out from
the assembly, because play_dead_common() does:

void play_dead_common(void)
{               
        idle_task_exit();
                
        cpuhp_ap_report_dead();                                                
        
        local_irq_disable();                                                   
}

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
@ 2023-10-20 11:21     ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20 11:21 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, x86, bp, dave.hansen
  Cc: Edgecombe, Rick P, Reshetova, Elena, Nakajima, Jun, rafael,
	peterz, sathyanarayanan.kuppuswamy, Hunter, Adrian,
	thomas.lendacky, linux-kernel, kexec, linux-coco


> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt.S
> @@ -0,0 +1,28 @@
> +#include <linux/linkage.h>
> +#include <asm/nospec-branch.h>
> +#include <asm/page_types.h>
> +#include <asm/processor-flags.h>
> +
> +	.text
> +	.align PAGE_SIZE
> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> +	/* Load address of reset vector into RCX to jump when kernel is ready */
> +	movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> +
> +	/* zero out flags, and disable interrupts */
> +	pushq	$0
> +	popfq
> +
> +	/* Turn off global entries. Following CR3 write will flush them. */
> +	movq	%cr4, %rdx
> +	andq	$~(X86_CR4_PGE), %rdx
> +	movq	%rdx, %cr4
> +
> +	/* Switch to identity mapping */
> +	movq	acpi_mp_pgd(%rip), %rax
> +	movq	%rax, %cr3
> +
> +	/* Jump to reset vector */
> +	ANNOTATE_RETPOLINE_SAFE
> +	jmp	*%rcx
> +SYM_FUNC_END(asm_acpi_mp_play_dead)
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 4e92d1d4a5fa..2cc8590ec7a5 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -1,12 +1,162 @@
>  #include <linux/acpi.h>
>  #include <linux/cpu.h>
> +#include <linux/delay.h>
> +#include <linux/memblock.h>
> +#include <linux/pgtable.h>
> +#include <linux/sched/hotplug.h>
>  #include <asm/apic.h>
> +#include <asm/init.h>
>  
>  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
>  static u64 acpi_mp_wake_mailbox_paddr;
>  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
>  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>  
> +unsigned long acpi_mp_pgd;
> +u64 acpi_mp_reset_vector_paddr;

Nit: 

They are both physical address.  It's a little bit odd for one to use 'unsigned
long' and the other to use 'u64'.

> +
> +void asm_acpi_mp_play_dead(void);
> +
> +static void __init *alloc_pgt_page(void *context)
> +{
> +	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> +}

If I am reading correclty, @context is never used.  It's not used inside this
function, and all the callers call this with @context = NULL.


[...]

> +
> +static void acpi_mp_play_dead(void)
> +{
> +	idle_task_exit();
> +	cpuhp_ap_report_dead();
> +	asm_acpi_mp_play_dead();
> +}
> +

Looks you can use play_dead_common() here, if you take IRQ disable part out from
the assembly, because play_dead_common() does:

void play_dead_common(void)
{               
        idle_task_exit();
                
        cpuhp_ap_report_dead();                                                
        
        local_irq_disable();                                                   
}
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  2023-10-10 10:24     ` Huang, Kai
@ 2023-10-20 11:58       ` Huang, Kai
  -1 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20 11:58 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, dave.hansen, x86, bp
  Cc: Edgecombe, Rick P, Reshetova, Elena, rafael, Nakajima, Jun,
	peterz, sathyanarayanan.kuppuswamy, thomas.lendacky, Hunter,
	Adrian, linux-kernel, kexec, linux-coco

On Tue, 2023-10-10 at 10:24 +0000, Huang, Kai wrote:
> >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> >  
> > 
> >  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> >  
> > 
> > +	/* Disable CPU onlining/offlining */
> > +	cpu_hotplug_not_supported();
> > +
> 
> Both onlining/offlining are prevented, or just offlining?
> 
> The previous patch says:
> 
> 	It does not prevent the initial bring up of the CPU, but it stops 
> 	subsequent offlining.
> 
> And ...
> 
> [...]
> 
> 
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
> >  	 * If the platform does not support hotplug, report it explicitly to
> >  	 * differentiate it from a transient offlining failure.
> >  	 */
> > -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> > +	if (!cpu_hotplug_supported)
> >  		return -EOPNOTSUPP;
> >  	if (cpu_hotplug_disabled)
> >  		return -EBUSY;
> 
> ... here cpu_down_maps_locked() only prevents offlining if I am reading
> correctly.
> 
> Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match the
> behaviour better?
> 
> Anyway, isn't it a little bit odd to have:
> 
> 	if (!cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;
> 
> ?

I probably have missed something important, but I don't quite understand what's
the reason to have the CC_ATTR_HOTPLUG_DISABLED at the beginning, and now
replace it with cpu_hotplug_not_supported()?

From the changelog:

	Currently hotplug prevented based on the confidential computing
	attribute which is set for Intel TDX. But TDX is not the only possible
	user of the wake up method.

CC_ATTR_HOTPLUG_DISABLED is only used by TDX guest, but MADT can be used by non-
TDX guest too.

Anyway, if the purpose is just to prevent CPU from going offline, can this be
done by registering a cpuhp callback?

	static int madt_wakeup_offline_cpu(unsigned int cpu)
	{
		return -EOPNOTSUPP;
	}

	...

	err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "madt-wakeup",
			NULL, madt_wakeup_offline_cpu);
	if (err) {
		pr_err("Register CPU hotplug callback failed.\n");
		/* BUG() ??? */
	}

This doesn't pollute the common CPU hotplug code, thus to me looks more clear?



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
@ 2023-10-20 11:58       ` Huang, Kai
  0 siblings, 0 replies; 106+ messages in thread
From: Huang, Kai @ 2023-10-20 11:58 UTC (permalink / raw)
  To: kirill.shutemov, tglx, mingo, dave.hansen, x86, bp
  Cc: Edgecombe, Rick P, Reshetova, Elena, rafael, Nakajima, Jun,
	peterz, sathyanarayanan.kuppuswamy, thomas.lendacky, Hunter,
	Adrian, linux-kernel, kexec, linux-coco

On Tue, 2023-10-10 at 10:24 +0000, Huang, Kai wrote:
> >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> >  
> > 
> >  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> >  
> > 
> > +	/* Disable CPU onlining/offlining */
> > +	cpu_hotplug_not_supported();
> > +
> 
> Both onlining/offlining are prevented, or just offlining?
> 
> The previous patch says:
> 
> 	It does not prevent the initial bring up of the CPU, but it stops 
> 	subsequent offlining.
> 
> And ...
> 
> [...]
> 
> 
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
> >  	 * If the platform does not support hotplug, report it explicitly to
> >  	 * differentiate it from a transient offlining failure.
> >  	 */
> > -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> > +	if (!cpu_hotplug_supported)
> >  		return -EOPNOTSUPP;
> >  	if (cpu_hotplug_disabled)
> >  		return -EBUSY;
> 
> ... here cpu_down_maps_locked() only prevents offlining if I am reading
> correctly.
> 
> Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match the
> behaviour better?
> 
> Anyway, isn't it a little bit odd to have:
> 
> 	if (!cpu_hotplug_supported)
>  		return -EOPNOTSUPP;
>  	if (cpu_hotplug_disabled)
>  		return -EBUSY;
> 
> ?

I probably have missed something important, but I don't quite understand what's
the reason to have the CC_ATTR_HOTPLUG_DISABLED at the beginning, and now
replace it with cpu_hotplug_not_supported()?

From the changelog:

	Currently hotplug prevented based on the confidential computing
	attribute which is set for Intel TDX. But TDX is not the only possible
	user of the wake up method.

CC_ATTR_HOTPLUG_DISABLED is only used by TDX guest, but MADT can be used by non-
TDX guest too.

Anyway, if the purpose is just to prevent CPU from going offline, can this be
done by registering a cpuhp callback?

	static int madt_wakeup_offline_cpu(unsigned int cpu)
	{
		return -EOPNOTSUPP;
	}

	...

	err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "madt-wakeup",
			NULL, madt_wakeup_offline_cpu);
	if (err) {
		pr_err("Register CPU hotplug callback failed.\n");
		/* BUG() ??? */
	}

This doesn't pollute the common CPU hotplug code, thus to me looks more clear?


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
  2023-10-20 11:21     ` Huang, Kai
@ 2023-10-20 12:34       ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20 12:34 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 11:21:34AM +0000, Huang, Kai wrote:
> 
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt.S
> > @@ -0,0 +1,28 @@
> > +#include <linux/linkage.h>
> > +#include <asm/nospec-branch.h>
> > +#include <asm/page_types.h>
> > +#include <asm/processor-flags.h>
> > +
> > +	.text
> > +	.align PAGE_SIZE
> > +SYM_FUNC_START(asm_acpi_mp_play_dead)
> > +	/* Load address of reset vector into RCX to jump when kernel is ready */
> > +	movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> > +
> > +	/* zero out flags, and disable interrupts */
> > +	pushq	$0
> > +	popfq
> > +
> > +	/* Turn off global entries. Following CR3 write will flush them. */
> > +	movq	%cr4, %rdx
> > +	andq	$~(X86_CR4_PGE), %rdx
> > +	movq	%rdx, %cr4
> > +
> > +	/* Switch to identity mapping */
> > +	movq	acpi_mp_pgd(%rip), %rax
> > +	movq	%rax, %cr3
> > +
> > +	/* Jump to reset vector */
> > +	ANNOTATE_RETPOLINE_SAFE
> > +	jmp	*%rcx
> > +SYM_FUNC_END(asm_acpi_mp_play_dead)
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 4e92d1d4a5fa..2cc8590ec7a5 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -1,12 +1,162 @@
> >  #include <linux/acpi.h>
> >  #include <linux/cpu.h>
> > +#include <linux/delay.h>
> > +#include <linux/memblock.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/sched/hotplug.h>
> >  #include <asm/apic.h>
> > +#include <asm/init.h>
> >  
> >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> >  static u64 acpi_mp_wake_mailbox_paddr;
> >  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> >  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> >  
> > +unsigned long acpi_mp_pgd;
> > +u64 acpi_mp_reset_vector_paddr;
> 
> Nit: 
> 
> They are both physical address.  It's a little bit odd for one to use 'unsigned
> long' and the other to use 'u64'.

Okay, I will make them both u64.

> > +
> > +void asm_acpi_mp_play_dead(void);
> > +
> > +static void __init *alloc_pgt_page(void *context)
> > +{
> > +	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> > +}
> 
> If I am reading correclty, @context is never used.  It's not used inside this
> function, and all the callers call this with @context = NULL.

Yes. We need the argument to conform to x86_mapping_info::alloc_pgt_page
interface.

> > +
> > +static void acpi_mp_play_dead(void)
> > +{
> > +	idle_task_exit();
> > +	cpuhp_ap_report_dead();
> > +	asm_acpi_mp_play_dead();
> > +}
> > +
> 
> Looks you can use play_dead_common() here, if you take IRQ disable part out from
> the assembly, because play_dead_common() does:
> 
> void play_dead_common(void)
> {               
>         idle_task_exit();
>                 
>         cpuhp_ap_report_dead();                                                
>         
>         local_irq_disable();                                                   
> }

Okay, fair enough.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
@ 2023-10-20 12:34       ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20 12:34 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, x86, bp, dave.hansen, Edgecombe, Rick P, Reshetova,
	Elena, Nakajima, Jun, rafael, peterz, sathyanarayanan.kuppuswamy,
	Hunter, Adrian, thomas.lendacky, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 11:21:34AM +0000, Huang, Kai wrote:
> 
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt.S
> > @@ -0,0 +1,28 @@
> > +#include <linux/linkage.h>
> > +#include <asm/nospec-branch.h>
> > +#include <asm/page_types.h>
> > +#include <asm/processor-flags.h>
> > +
> > +	.text
> > +	.align PAGE_SIZE
> > +SYM_FUNC_START(asm_acpi_mp_play_dead)
> > +	/* Load address of reset vector into RCX to jump when kernel is ready */
> > +	movq	acpi_mp_reset_vector_paddr(%rip), %rcx
> > +
> > +	/* zero out flags, and disable interrupts */
> > +	pushq	$0
> > +	popfq
> > +
> > +	/* Turn off global entries. Following CR3 write will flush them. */
> > +	movq	%cr4, %rdx
> > +	andq	$~(X86_CR4_PGE), %rdx
> > +	movq	%rdx, %cr4
> > +
> > +	/* Switch to identity mapping */
> > +	movq	acpi_mp_pgd(%rip), %rax
> > +	movq	%rax, %cr3
> > +
> > +	/* Jump to reset vector */
> > +	ANNOTATE_RETPOLINE_SAFE
> > +	jmp	*%rcx
> > +SYM_FUNC_END(asm_acpi_mp_play_dead)
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 4e92d1d4a5fa..2cc8590ec7a5 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -1,12 +1,162 @@
> >  #include <linux/acpi.h>
> >  #include <linux/cpu.h>
> > +#include <linux/delay.h>
> > +#include <linux/memblock.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/sched/hotplug.h>
> >  #include <asm/apic.h>
> > +#include <asm/init.h>
> >  
> >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> >  static u64 acpi_mp_wake_mailbox_paddr;
> >  /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> >  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> >  
> > +unsigned long acpi_mp_pgd;
> > +u64 acpi_mp_reset_vector_paddr;
> 
> Nit: 
> 
> They are both physical address.  It's a little bit odd for one to use 'unsigned
> long' and the other to use 'u64'.

Okay, I will make them both u64.

> > +
> > +void asm_acpi_mp_play_dead(void);
> > +
> > +static void __init *alloc_pgt_page(void *context)
> > +{
> > +	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> > +}
> 
> If I am reading correclty, @context is never used.  It's not used inside this
> function, and all the callers call this with @context = NULL.

Yes. We need the argument to conform to x86_mapping_info::alloc_pgt_page
interface.

> > +
> > +static void acpi_mp_play_dead(void)
> > +{
> > +	idle_task_exit();
> > +	cpuhp_ap_report_dead();
> > +	asm_acpi_mp_play_dead();
> > +}
> > +
> 
> Looks you can use play_dead_common() here, if you take IRQ disable part out from
> the assembly, because play_dead_common() does:
> 
> void play_dead_common(void)
> {               
>         idle_task_exit();
>                 
>         cpuhp_ap_report_dead();                                                
>         
>         local_irq_disable();                                                   
> }

Okay, fair enough.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
  2023-10-20 11:58       ` Huang, Kai
@ 2023-10-20 12:42         ` kirill.shutemov
  -1 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20 12:42 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, dave.hansen, x86, bp, Edgecombe, Rick P, Reshetova,
	Elena, rafael, Nakajima, Jun, peterz, sathyanarayanan.kuppuswamy,
	thomas.lendacky, Hunter, Adrian, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 11:58:58AM +0000, Huang, Kai wrote:
> On Tue, 2023-10-10 at 10:24 +0000, Huang, Kai wrote:
> > >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > > @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > >  
> > > 
> > >  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> > >  
> > > 
> > > +	/* Disable CPU onlining/offlining */
> > > +	cpu_hotplug_not_supported();
> > > +
> > 
> > Both onlining/offlining are prevented, or just offlining?
> > 
> > The previous patch says:
> > 
> > 	It does not prevent the initial bring up of the CPU, but it stops 
> > 	subsequent offlining.
> > 
> > And ...
> > 
> > [...]
> > 
> > 
> > > --- a/kernel/cpu.c
> > > +++ b/kernel/cpu.c
> > > @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
> > >  	 * If the platform does not support hotplug, report it explicitly to
> > >  	 * differentiate it from a transient offlining failure.
> > >  	 */
> > > -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> > > +	if (!cpu_hotplug_supported)
> > >  		return -EOPNOTSUPP;
> > >  	if (cpu_hotplug_disabled)
> > >  		return -EBUSY;
> > 
> > ... here cpu_down_maps_locked() only prevents offlining if I am reading
> > correctly.
> > 
> > Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match the
> > behaviour better?
> > 
> > Anyway, isn't it a little bit odd to have:
> > 
> > 	if (!cpu_hotplug_supported)
> >  		return -EOPNOTSUPP;
> >  	if (cpu_hotplug_disabled)
> >  		return -EBUSY;
> > 
> > ?
> 
> I probably have missed something important, but I don't quite understand what's
> the reason to have the CC_ATTR_HOTPLUG_DISABLED at the beginning, and now
> replace it with cpu_hotplug_not_supported()?

CC_ATTR_HOTPLUG_DISABLED was a mistake. And now obvious when we only need
to disable offlining dynamically, based on supported MADT MP WP version.

> From the changelog:
> 
> 	Currently hotplug prevented based on the confidential computing
> 	attribute which is set for Intel TDX. But TDX is not the only possible
> 	user of the wake up method.
> 
> CC_ATTR_HOTPLUG_DISABLED is only used by TDX guest, but MADT can be used by non-
> TDX guest too.
> 
> Anyway, if the purpose is just to prevent CPU from going offline, can this be
> done by registering a cpuhp callback?
> 
> 	static int madt_wakeup_offline_cpu(unsigned int cpu)
> 	{
> 		return -EOPNOTSUPP;
> 	}
> 
> 	...
> 
> 	err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "madt-wakeup",
> 			NULL, madt_wakeup_offline_cpu);
> 	if (err) {
> 		pr_err("Register CPU hotplug callback failed.\n");
> 		/* BUG() ??? */
> 	}
> 
> This doesn't pollute the common CPU hotplug code, thus to me looks more clear?

Thomas seems fine with cpu_hotplug_disable_offlining().

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
@ 2023-10-20 12:42         ` kirill.shutemov
  0 siblings, 0 replies; 106+ messages in thread
From: kirill.shutemov @ 2023-10-20 12:42 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, mingo, dave.hansen, x86, bp, Edgecombe, Rick P, Reshetova,
	Elena, rafael, Nakajima, Jun, peterz, sathyanarayanan.kuppuswamy,
	thomas.lendacky, Hunter, Adrian, linux-kernel, kexec, linux-coco

On Fri, Oct 20, 2023 at 11:58:58AM +0000, Huang, Kai wrote:
> On Tue, 2023-10-10 at 10:24 +0000, Huang, Kai wrote:
> > >  /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > > @@ -74,6 +75,9 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > >  
> > > 
> > >  	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> > >  
> > > 
> > > +	/* Disable CPU onlining/offlining */
> > > +	cpu_hotplug_not_supported();
> > > +
> > 
> > Both onlining/offlining are prevented, or just offlining?
> > 
> > The previous patch says:
> > 
> > 	It does not prevent the initial bring up of the CPU, but it stops 
> > 	subsequent offlining.
> > 
> > And ...
> > 
> > [...]
> > 
> > 
> > > --- a/kernel/cpu.c
> > > +++ b/kernel/cpu.c
> > > @@ -1522,7 +1522,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
> > >  	 * If the platform does not support hotplug, report it explicitly to
> > >  	 * differentiate it from a transient offlining failure.
> > >  	 */
> > > -	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) || !cpu_hotplug_supported)
> > > +	if (!cpu_hotplug_supported)
> > >  		return -EOPNOTSUPP;
> > >  	if (cpu_hotplug_disabled)
> > >  		return -EBUSY;
> > 
> > ... here cpu_down_maps_locked() only prevents offlining if I am reading
> > correctly.
> > 
> > Also, can we rename cpu_hotplug_supported to cpu_offline_supported to match the
> > behaviour better?
> > 
> > Anyway, isn't it a little bit odd to have:
> > 
> > 	if (!cpu_hotplug_supported)
> >  		return -EOPNOTSUPP;
> >  	if (cpu_hotplug_disabled)
> >  		return -EBUSY;
> > 
> > ?
> 
> I probably have missed something important, but I don't quite understand what's
> the reason to have the CC_ATTR_HOTPLUG_DISABLED at the beginning, and now
> replace it with cpu_hotplug_not_supported()?

CC_ATTR_HOTPLUG_DISABLED was a mistake. And now obvious when we only need
to disable offlining dynamically, based on supported MADT MP WP version.

> From the changelog:
> 
> 	Currently hotplug prevented based on the confidential computing
> 	attribute which is set for Intel TDX. But TDX is not the only possible
> 	user of the wake up method.
> 
> CC_ATTR_HOTPLUG_DISABLED is only used by TDX guest, but MADT can be used by non-
> TDX guest too.
> 
> Anyway, if the purpose is just to prevent CPU from going offline, can this be
> done by registering a cpuhp callback?
> 
> 	static int madt_wakeup_offline_cpu(unsigned int cpu)
> 	{
> 		return -EOPNOTSUPP;
> 	}
> 
> 	...
> 
> 	err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "madt-wakeup",
> 			NULL, madt_wakeup_offline_cpu);
> 	if (err) {
> 		pr_err("Register CPU hotplug callback failed.\n");
> 		/* BUG() ??? */
> 	}
> 
> This doesn't pollute the common CPU hotplug code, thus to me looks more clear?

Thomas seems fine with cpu_hotplug_disable_offlining().

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2023-10-20 12:42 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-05 13:13 [PATCH 00/13] x86/tdx: Add kexec support Kirill A. Shutemov
2023-10-05 13:13 ` Kirill A. Shutemov
2023-10-05 13:13 ` [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-06 10:22   ` Huang, Kai
2023-10-06 10:22     ` Huang, Kai
2023-10-06 11:59     ` kirill.shutemov
2023-10-06 11:59       ` kirill.shutemov
2023-10-06 18:33   ` Kuppuswamy Sathyanarayanan
2023-10-06 18:33     ` Kuppuswamy Sathyanarayanan
2023-10-09 13:32     ` Kirill A. Shutemov
2023-10-09 13:32       ` Kirill A. Shutemov
2023-10-05 13:13 ` [PATCH 02/13] kernel/cpu: Add support for declaring CPU hotplug not supported Kirill A. Shutemov
2023-10-10 13:35   ` Kuppuswamy Sathyanarayanan
2023-10-10 13:35     ` Kuppuswamy Sathyanarayanan
2023-10-11 13:07     ` Kirill A. Shutemov
2023-10-11 13:07       ` Kirill A. Shutemov
2023-10-11 13:08   ` Thomas Gleixner
2023-10-11 13:08     ` Thomas Gleixner
2023-10-05 13:13 ` [PATCH 03/13] cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-10 10:24   ` Huang, Kai
2023-10-10 10:24     ` Huang, Kai
2023-10-20 11:58     ` Huang, Kai
2023-10-20 11:58       ` Huang, Kai
2023-10-20 12:42       ` kirill.shutemov
2023-10-20 12:42         ` kirill.shutemov
2023-10-10 13:39   ` Kuppuswamy Sathyanarayanan
2023-10-10 13:39     ` Kuppuswamy Sathyanarayanan
2023-10-11 13:09   ` Thomas Gleixner
2023-10-11 13:09     ` Thomas Gleixner
2023-10-05 13:13 ` [PATCH 04/13] x86/kvm: Do not try to disable kvmclock if it was not enabled Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-06 14:36   ` Sean Christopherson
2023-10-06 14:36     ` Sean Christopherson
2023-10-06 14:50     ` Kirill A. Shutemov
2023-10-06 14:50       ` Kirill A. Shutemov
2023-10-10 13:53   ` Kuppuswamy Sathyanarayanan
2023-10-10 13:53     ` Kuppuswamy Sathyanarayanan
2023-10-11 13:11     ` Kirill A. Shutemov
2023-10-11 13:11       ` Kirill A. Shutemov
2023-10-05 13:13 ` [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-09 12:30   ` Huang, Kai
2023-10-09 12:30     ` Huang, Kai
2023-10-09 13:32     ` kirill.shutemov
2023-10-09 13:32       ` kirill.shutemov
2023-10-05 13:13 ` [PATCH 06/13] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno Kirill A. Shutemov
2023-10-05 13:13 ` [PATCH 07/13] x86/mm: Return correct level from lookup_address() if pte is none Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-05 13:13 ` [PATCH 08/13] KVM: x86: Add config option to gate emergency virt callback support Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-05 13:13 ` [PATCH 09/13] x86/tdx: Account shared memory Kirill A. Shutemov
2023-10-05 13:13   ` Kirill A. Shutemov
2023-10-10 10:05   ` Huang, Kai
2023-10-10 10:05     ` Huang, Kai
2023-10-11 13:14     ` kirill.shutemov
2023-10-11 13:14       ` kirill.shutemov
2023-10-05 13:13 ` [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
2023-10-05 18:41   ` Kalra, Ashish
2023-10-05 18:41     ` Kalra, Ashish
2023-10-05 21:28     ` Kirill A. Shutemov
2023-10-05 21:28       ` Kirill A. Shutemov
2023-10-05 22:01       ` Kalra, Ashish
2023-10-05 22:01         ` Kalra, Ashish
2023-10-05 22:28         ` Kirill A. Shutemov
2023-10-05 22:28           ` Kirill A. Shutemov
2023-10-06 19:24           ` Kalra, Ashish
2023-10-06 19:24             ` Kalra, Ashish
2023-10-20  9:21             ` Kirill A. Shutemov
2023-10-20  9:21               ` Kirill A. Shutemov
2023-10-20  9:39               ` Kirill A. Shutemov
2023-10-20  9:39                 ` Kirill A. Shutemov
2023-10-06 14:58   ` Sean Christopherson
2023-10-06 14:58     ` Sean Christopherson
2023-10-06 15:11     ` Kirill A. Shutemov
2023-10-06 15:11       ` Kirill A. Shutemov
2023-10-06 22:15       ` Kalra, Ashish
2023-10-06 22:15         ` Kalra, Ashish
2023-10-08  8:35   ` Baoquan He
2023-10-08  8:35     ` Baoquan He
2023-10-09 13:35     ` Kirill A. Shutemov
2023-10-09 13:35       ` Kirill A. Shutemov
2023-10-05 13:14 ` [PATCH 11/13] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges Kirill A. Shutemov
2023-10-05 13:14   ` Kirill A. Shutemov
2023-10-05 13:14 ` [PATCH 12/13] x86/acpi: Do not attempt to bring up secondary CPUs in kexec case Kirill A. Shutemov
2023-10-05 13:14   ` Kirill A. Shutemov
2023-10-20  3:29   ` Huang, Kai
2023-10-20  3:29     ` Huang, Kai
2023-10-20  9:29     ` kirill.shutemov
2023-10-20  9:29       ` kirill.shutemov
2023-10-05 13:14 ` [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Kirill A. Shutemov
2023-10-20  9:49   ` Huang, Kai
2023-10-20  9:49     ` Huang, Kai
2023-10-20 10:42     ` kirill.shutemov
2023-10-20 10:42       ` kirill.shutemov
2023-10-20 11:21   ` Huang, Kai
2023-10-20 11:21     ` Huang, Kai
2023-10-20 12:34     ` kirill.shutemov
2023-10-20 12:34       ` kirill.shutemov
2023-10-08 23:49 ` [PATCH 00/13] x86/tdx: Add kexec support Baoquan He
2023-10-08 23:49   ` Baoquan He
2023-10-09 13:36   ` Kirill A. Shutemov
2023-10-09 13:36     ` Kirill A. Shutemov
2023-10-09 14:13     ` Baoquan He
2023-10-09 14:13       ` Baoquan He

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.