* [PATCH v5 00/22] TDX host kernel support
@ 2022-06-22 11:15 Kai Huang
  2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
                   ` (22 more replies)
  0 siblings, 23 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, linux-acpi, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata, akpm, thomas.lendacky, Tianyu.Lan, rdunlap,
	Jason, juri.lelli, mark.rutland, frederic, yuehaibing,
	dongli.zhang, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  This series provides support for
initializing TDX in the host kernel.

KVM support for TDX is being developed separately[1].  A new fd-based
approach to supporting TDX private memory is also being developed[2].
KVM will only support the new fd-based approach as the TD guest memory
backend.

You can find TDX related specs here:
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

This series is rebased on the latest tip/x86/tdx.  You can also find
this series in the github repo below:
https://github.com/intel/tdx/tree/host-upstream

I would highly appreciate any help in reviewing this series.

Hi Dave (and Intel reviewers),
   
Please kindly help review, and I would appreciate Reviewed-by or
Acked-by tags if the patches look good to you.

Changelog history:

- v4 -> v5:

  This is essentially a resend of v4.  Sorry I forgot to consult
  get_maintainer.pl when sending out v4, so the linux-acpi and linux-mm
  mailing lists and the relevant people for the 4 new patches were missed.

  Since there was no feedback on v4, please skip reviewing v4 and
  compare v5 to v3 directly.  For changes compared to v3, please see the
  changes from v3 -> v4.  Also, in the changelog history of individual
  patches, I just used v3 -> v5.

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX keyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver managed memory
   hotplug.
 - Removed tdx_detect() and use a single tdx_init() instead.
 - Removed detecting TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the patch adding a boot-time command line option to disable TDX.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

- V2 -> v3:

 - Addressed comments from Isaku.
  - Fixed memory leak and unnecessary function argument in the patch to
    configure the key for the global keyid (patch 17).
  - Slightly enhanced the patch to get TDX module and CMR
    information (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add TDX host kernel support
   materials to Documentation/x86/tdx.rst together with the TDX guest
   stuff, instead of a standalone file (patch 21).
 - Very minor improvement in commit messages.

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues that are related to finding all RAM memory regions
    based on e820.
  - Minor improvement on comments and commit messages.

v3:
https://lore.kernel.org/lkml/68484e168226037c3a25b6fb983b052b26ab3ec1.camel@intel.com/T/

V2:
https://lore.kernel.org/lkml/cover.1647167475.git.kai.huang@intel.com/T/

RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  TDX introduces a new CPU mode called
Secure Arbitration Mode (SEAM) and a new isolated range pointed to by
the SEAM Range Register (SEAMRR).  A CPU-attested software module called
'the TDX module' implements the functionalities to manage and run
protected VMs.  The TDX module (and it's loader called the 'P-SEAMLDR')
runs inside the new isolated range and is protected from the untrusted
host VMM.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.
BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which use a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to the TDX module's.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized using SEAMCALLs defined by TDX architecture.
This series assumes the TDX module is loaded by BIOS before the kernel
boots.

There's no CPUID or MSR to detect whether the TDX module has been loaded.
The SEAMCALL instruction fails with VMfailInvalid if the target SEAM
software (either the P-SEAMLDR or the TDX module) is not loaded.  This
can be used to detect the TDX module directly.

The TDX module is initialized in multiple steps:

  1) Global initialization;
  2) Logical-CPU scope initialization;
  3) Enumerate information about the TDX module and TDX-capable memory;
  4) Configure the TDX module with TDX-usable memory ranges and a global
     TDX KeyID which protects the TDX module metadata;
  5) Package-scope configuration for the global TDX KeyID;
  6) Initialize TDX metadata for the usable memory ranges based on 4).

Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (i.e. in
the MADT) logical CPUs, otherwise step 4) will fail.  Step 5) requires
calling SEAMCALL on at least one CPU in each package.

Also, for a kexec()-ed kernel, the TDX module could already have been
initialized, or be in shutdown mode (see the kexec() section below).  In
either case, the first step of the above process will fail immediately.

The TDX module can also be shut down at any time during its lifetime, by
calling SEAMCALL on all "BIOS-enabled" logical CPUs.

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function tdx_init() to allow the caller to initialize
TDX at runtime:

        if (tdx_init())
                goto no_tdx;
        /* TDX is ready to create TD guests. */

This approach has the following advantages:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is only consumed
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done), otherwise it causes #UD.  So far KVM is the only user of
TDX, and it guarantees all online CPUs are in VMX operation when there's
any VM.  Letting KVM initialize TDX at runtime avoids handling
VMXON/VMXOFF in the core kernel.  Also, in the long term more kernel
components will need to use TDX, so a reference-based approach to
VMXON/VMXOFF will likely be needed in the core kernel.

3) It is more flexible for supporting "TDX module runtime update" (not
in this series).  After updating to a new module at runtime, the kernel
needs to go through the initialization process again.  It's possible the
metadata allocated for the old module cannot be reused for the new one
and must be re-allocated.

2. Kernel policy on TDX memory

The TDX architecture allows the VMM to designate specific memory as
usable for TDX private memory.  This series chooses to designate _all_
system RAM as TDX memory to avoid having to modify the page allocator to
distinguish between TDX-capable and non-TDX-capable memory.

3. CPU hotplug

TDX doesn't work with ACPI CPU hotplug.  To guarantee security, MCHECK
verifies all logical CPUs for all packages during platform boot.  Any
hot-added CPU is not verified and thus cannot support TDX.  A non-buggy
BIOS should never deliver an ACPI CPU hot-add event to the kernel.  Such
an event is reported as a BIOS bug and the hot-added CPU is rejected.

TDX requires all boot-time verified logical CPUs to remain present until
machine reset.  If the kernel receives an ACPI CPU hot-removal event,
assume the kernel cannot continue to work normally and just BUG().

Note TDX works with CPU logical online/offline, so the kernel still
allows offlining a logical CPU and onlining it again.

4. Memory Hotplug

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  These regions are
generated by BIOS and verified by MCHECK so that they are truly present
during platform boot and meet the security guarantees.

This means TDX doesn't work with ACPI memory hot-add.  A non-buggy BIOS
should never deliver an ACPI memory hot-add event to the kernel.  Such
an event is reported as a BIOS bug and the hot-added memory is rejected.

TDX also doesn't work with ACPI memory hot-removal.  If the kernel
receives an ACPI memory hot-removal event, assume the kernel cannot
continue to work normally, so just BUG().

Also, the kernel needs to choose which TDX-capable regions to use as TDX
memory and pass those regions to the TDX module when it gets initialized.
Once they are passed to the TDX module, the TDX-usable memory regions
are fixed for the module's lifetime.

This series guarantees all pages managed by the page allocator are TDX
memory.  Any memory hot-added to the page allocator would break this
guarantee and thus must be prevented.

There are basically two memory hot-add cases that need to be prevented:
ACPI memory hot-add and driver-managed memory hot-add.  This series
rejects driver-managed memory hot-add too when TDX is enabled by BIOS.

However, adding new memory to ZONE_DEVICE should not be prevented, as
those pages are not managed by the page allocator.  Therefore, the
memremap_pages() variants are still allowed, although they internally
also use memory hotplug functions.

5. Kexec()

TDX (and MKTME) doesn't guarantee cache coherency among different
KeyIDs.  If the TDX module is ever initialized, the kernel needs to
flush dirty cachelines associated with any TDX private KeyID, otherwise
they may silently corrupt the new kernel.

Similar to SME support, the kernel uses wbinvd() to flush cache in
stop_this_cpu().

The current TDX module architecture doesn't play nicely with kexec().
The TDX module can only be initialized once during its lifetime, and
there is no SEAMCALL to reset the module to give a new clean slate to
the new kernel.  Therefore, ideally, if the module is ever initialized,
it's better to shut down the module.  The new kernel won't be able to
use TDX anyway (as it needs to go through the TDX module initialization
process which will fail immediately at the first step).

However, there's no guarantee the CPU is in VMX operation during
kexec(), so it's impractical to shut down the module.  This series just
leaves the module in an open state.

Reference:
[1]: https://lore.kernel.org/lkml/cover.1651774250.git.isaku.yamahata@intel.com/T/
[2]: https://lore.kernel.org/linux-mm/YofeZps9YXgtP3f1@google.com/t/


Kai Huang (22):
  x86/virt/tdx: Detect TDX during kernel boot
  cc_platform: Add new attribute to prevent ACPI CPU hotplug
  cc_platform: Add new attribute to prevent ACPI memory hotplug
  x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  x86/virt/tdx: Prevent hot-add driver managed memory
  x86/virt/tdx: Add skeleton to initialize TDX on demand
  x86/virt/tdx: Implement SEAMCALL function
  x86/virt/tdx: Shut down TDX module in case of error
  x86/virt/tdx: Detect TDX module by doing module global initialization
  x86/virt/tdx: Do logical-cpu scope TDX module initialization
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs based on memblock
  x86/virt/tdx: Create TDMRs to cover all memblock memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Set up reserved areas for all TDMRs
  x86/virt/tdx: Reserve TDX module global KeyID
  x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Support kexec()
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  190 ++++-
 arch/x86/Kconfig                 |   16 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/core.c             |   34 +-
 arch/x86/include/asm/tdx.h       |    9 +
 arch/x86/kernel/process.c        |    9 +-
 arch/x86/mm/init_64.c            |   21 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1333 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  153 ++++
 drivers/acpi/acpi_memhotplug.c   |   23 +
 drivers/acpi/acpi_processor.c    |   23 +
 include/linux/cc_platform.h      |   25 +-
 include/linux/memory_hotplug.h   |    2 +
 kernel/cpu.c                     |    2 +-
 mm/memory_hotplug.c              |   15 +
 19 files changed, 1898 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

-- 
2.36.1



* [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
@ 2022-06-22 11:15 ` Kai Huang
  2022-06-23  5:57   ` Chao Gao
  2022-08-02  2:01   ` [PATCH v5 1/22] " Wu, Binbin
  2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
                   ` (21 subsequent siblings)
  22 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  TDX introduces a new CPU mode called
Secure Arbitration Mode (SEAM) and a new isolated range pointed to by
the SEAM Range Register (SEAMRR).  A CPU-attested software module called
'the TDX module' runs inside the new isolated range to implement the
functionalities to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture to provide crypto-protection to VMs.
BIOS is
responsible for partitioning the "KeyID" space between legacy MKTME and
TDX.  The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or
'TDX KeyIDs' for short.

To enable TDX, BIOS needs to configure SEAMRR (core-scope) and TDX
private KeyIDs (package-scope) consistently for all packages.  TDX
doesn't trust BIOS.  TDX ensures all BIOS configurations are correct,
and if not, refuses to enable SEAMRR on any core.  This means detecting
SEAMRR alone on the BSP is enough to check whether TDX has been enabled
by BIOS.

To start supporting TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support).  So far KVM is the only user of TDX, so make the new
config option depend on KVM_INTEL.

Use early_initcall() to detect whether TDX is enabled by BIOS during
kernel boot, and add a function to report that.  Use a function instead
of a new CPU feature bit.  This is because the TDX module needs to be
initialized before it can be used to run any TDX guests, and the TDX
module is initialized at runtime by the caller who wants to use TDX.

Explicitly detect SEAMRR rather than only detecting TDX private KeyIDs.
Theoretically, a misconfiguration of TDX private KeyIDs can result in
SEAMRR being disabled, but the BSP can still report the correct TDX
KeyIDs.  Such a BIOS bug can be caught when initializing the TDX module,
but it's better to do more detection during boot to provide a more
accurate result.

Also detect the TDX KeyIDs.  This allows userspace to know how many TDX
guests the platform can run without needing to wait until TDX is fully
functional.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig               |  13 ++++
 arch/x86/Makefile              |   2 +
 arch/x86/include/asm/tdx.h     |   7 +++
 arch/x86/virt/Makefile         |   2 +
 arch/x86/virt/vmx/Makefile     |   2 +
 arch/x86/virt/vmx/tdx/Makefile |   2 +
 arch/x86/virt/vmx/tdx/tdx.c    | 109 +++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    |  47 ++++++++++++++
 8 files changed, 184 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7021ec725dd3..23f21aa3a5c4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1967,6 +1967,19 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	default n
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	depends on KVM_INTEL
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 63d50f65b828..2ca3a2a36dc5 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -234,6 +234,8 @@ head-y += arch/x86/kernel/platform-quirks.o
 
 libs-y  += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI)            += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..97511b76c1ac 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -87,5 +87,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else	/* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..1bd688684716
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..8275007702e6
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trust Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+#include <asm/cpufeatures.h>
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+#include "tdx.h"
+
+static u32 tdx_keyid_start __ro_after_init;
+static u32 tdx_keyid_num __ro_after_init;
+
+/* Detect whether CPU supports SEAM */
+static int detect_seam(void)
+{
+	u64 mtrrcap, mask;
+
+	/* SEAMRR is reported via MTRRcap */
+	if (!boot_cpu_has(X86_FEATURE_MTRR))
+		return -ENODEV;
+
+	rdmsrl(MSR_MTRRcap, mtrrcap);
+	if (!(mtrrcap & MTRR_CAP_SEAMRR))
+		return -ENODEV;
+
+	/* The MASK MSR reports whether SEAMRR is enabled */
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+	if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS)
+		return -ENODEV;
+
+	pr_info("SEAMRR enabled.\n");
+	return 0;
+}
+
+static int detect_tdx_keyids(void)
+{
+	u64 keyid_part;
+
+	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
+
+	tdx_keyid_num = TDX_KEYID_NUM(keyid_part);
+	tdx_keyid_start = TDX_KEYID_START(keyid_part);
+
+	pr_info("TDX private KeyID range: [%u, %u).\n",
+			tdx_keyid_start, tdx_keyid_start + tdx_keyid_num);
+
+	/*
+	 * TDX guarantees at least two TDX KeyIDs are configured by
+	 * BIOS, otherwise SEAMRR is disabled.  Invalid TDX private
+	 * range means kernel bug (TDX is broken).
+	 */
+	if (WARN_ON(!tdx_keyid_start || tdx_keyid_num < 2)) {
+		tdx_keyid_start = tdx_keyid_num = 0;
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Detect TDX via detecting SEAMRR during kernel boot.
+ *
+ * To enable TDX, BIOS must configure SEAMRR consistently across all
+ * CPU cores.  TDX doesn't trust BIOS.  Instead, MCHECK verifies all
+ * configurations from BIOS are correct, and if not, it disables TDX
+ * (SEAMRR is disabled on all cores).  This means detecting SEAMRR on
+ * BSP is enough to determine whether TDX has been enabled by BIOS.
+ */
+static int __init tdx_early_detect(void)
+{
+	int ret;
+
+	ret = detect_seam();
+	if (ret)
+		return ret;
+
+	/*
+	 * TDX private KeyIDs are only accessible by SEAM software.
+	 * Only detect TDX KeyIDs when SEAMRR is enabled.
+	 */
+	ret = detect_tdx_keyids();
+	if (ret)
+		return ret;
+
+	pr_info("TDX enabled by BIOS.\n");
+	return 0;
+}
+early_initcall(tdx_early_detect);
+
+/**
+ * platform_tdx_enabled() - Return whether BIOS has enabled TDX
+ *
+ * Return whether BIOS has enabled TDX regardless whether the TDX module
+ * has been loaded or not.
+ */
+bool platform_tdx_enabled(void)
+{
+	return tdx_keyid_num >= 2;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..f16055cc25f4
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/bits.h>
+
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability.  The
+ * architectural definitions come first.
+ */
+
+/*
+ * Intel Trust Domain CPU Architecture Extension spec:
+ *
+ * IA32_MTRRCAP:
+ *   Bit 15:	The support of SEAMRR
+ *
+ * IA32_SEAMRR_PHYS_MASK (core-scope):
+ *   Bit 10:	Lock bit
+ *   Bit 11:	Enable bit
+ */
+#define MTRR_CAP_SEAMRR			BIT_ULL(15)
+
+#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
+
+#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
+#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
+#define SEAMRR_ENABLED_BITS	\
+	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
+
+/*
+ * IA32_MKTME_KEYID_PARTITIONING:
+ *   Bit [31:0]:	Number of MKTME KeyIDs.
+ *   Bit [63:32]:	Number of TDX private KeyIDs.
+ *
+ * MKTME KeyIDs start from KeyID 1. TDX private KeyIDs start
+ * after the last MKTME KeyID.
+ */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
+#define TDX_KEYID_START(_keyid_part)	\
+		((u32)(((_keyid_part) & 0xffffffffull) + 1))
+#define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
+
+#endif
-- 
2.36.1



* [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
  2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
@ 2022-06-22 11:15 ` Kai Huang
  2022-06-22 11:42   ` Rafael J. Wysocki
                     ` (3 more replies)
  2022-06-22 11:15 ` [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug Kai Huang
                   ` (20 subsequent siblings)
  22 siblings, 4 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang, kai.huang

Platforms with confidential computing technology may not support ACPI
CPU hotplug when such technology is enabled by the BIOS.  Examples
include Intel platforms which support Intel Trust Domain Extensions
(TDX).

If the kernel ever receives an ACPI CPU hotplug event, it is likely a
BIOS bug.  For ACPI CPU hot-add, the kernel should call out the BIOS bug
and reject the new CPU.  For hot-removal, for simplicity just assume the
kernel cannot continue to work normally, and BUG().

Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
platform doesn't support ACPI CPU hotplug, so that the kernel can handle
ACPI CPU hotplug events for such platforms.  The existing attribute
CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug and thus doesn't
fit.

In acpi_processor_{add|remove}(), add an early check against this
attribute and handle it accordingly if it is set.

Also take this chance to rename the existing CC_ATTR_HOTPLUG_DISABLED to
CC_ATTR_CPU_HOTPLUG_DISABLED, as it is for software CPU hotplug.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/coco/core.c          |  2 +-
 drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
 include/linux/cc_platform.h   | 15 +++++++++++++--
 kernel/cpu.c                  |  2 +-
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 4320fadae716..1bde1af75296 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
-	case CC_ATTR_HOTPLUG_DISABLED:
+	case CC_ATTR_CPU_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
 	case CC_ATTR_MEM_ENCRYPT:
 		return true;
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 6737b1cbf6d6..b960db864cd4 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -15,6 +15,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/cc_platform.h>
 
 #include <acpi/processor.h>
 
@@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
 	struct device *dev;
 	int result = 0;
 
+	/*
+	 * If the confidential computing platform doesn't support ACPI
+	 * CPU hotplug, the BIOS should never deliver such an event to
+	 * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
+	 * the new CPU.
+	 */
+	if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
+		dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI CPU hotplug.  New CPU ignored.\n");
+		return -EINVAL;
+	}
+
 	pr = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
 	if (!pr)
 		return -ENOMEM;
@@ -434,6 +446,17 @@ static void acpi_processor_remove(struct acpi_device *device)
 	if (!device || !acpi_driver_data(device))
 		return;
 
+	/*
+	 * The confidential computing platform is broken if ACPI CPU
+	 * hot-removal isn't supported but it happened anyway.  Assume
+	 * it's not guaranteed that the kernel can continue to work
+	 * normally.  Just BUG().
+	 */
+	if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
+		dev_err(&device->dev, "Platform doesn't support ACPI CPU hotplug. BUG().\n");
+		BUG();
+	}
+
 	pr = acpi_driver_data(device);
 	if (pr->id >= nr_cpu_ids)
 		goto out;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 691494bbaf5a..9ce9256facc8 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -74,14 +74,25 @@ enum cc_attr {
 	CC_ATTR_GUEST_UNROLL_STRING_IO,
 
 	/**
-	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+	 * @CC_ATTR_CPU_HOTPLUG_DISABLED: CPU hotplug is not supported or
+	 *				  disabled.
 	 *
 	 * The platform/OS is running as a guest/virtual machine does not
 	 * support CPU hotplug feature.
 	 *
 	 * Examples include TDX Guest.
 	 */
-	CC_ATTR_HOTPLUG_DISABLED,
+	CC_ATTR_CPU_HOTPLUG_DISABLED,
+
+	/**
+	 * @CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED: ACPI CPU hotplug is not
+	 *				       supported.
+	 *
+	 * The platform/OS does not support ACPI CPU hotplug.
+	 *
+	 * Examples include TDX platform.
+	 */
+	CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index edb8c199f6a3..966772cce063 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1191,7 +1191,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
 	 */
-	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+	if (cc_platform_has(CC_ATTR_CPU_HOTPLUG_DISABLED))
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-- 
2.36.1



* [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
  2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
  2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
@ 2022-06-22 11:15 ` Kai Huang
  2022-06-22 11:45   ` Rafael J. Wysocki
  2022-06-22 11:16 ` [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and " Kai Huang
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, kai.huang

Platforms with confidential computing technology may not support ACPI
memory hotplug when such technology is enabled by the BIOS.  Examples
include Intel platforms which support Intel Trust Domain Extensions
(TDX).

If the kernel ever receives an ACPI memory hotplug event, it is likely
a BIOS bug.  For ACPI memory hot-add, the kernel should call out the
BIOS bug and reject the new memory.  For hot-removal, for simplicity
just assume the kernel cannot continue to work normally, and BUG().

Add a new attribute CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED to indicate the
platform doesn't support ACPI memory hotplug, so that the kernel can
handle ACPI memory hotplug events for such platforms.

In acpi_memory_device_{add|remove}(), add early check against this
attribute and handle accordingly if it is set.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
 include/linux/cc_platform.h    | 10 ++++++++++
 2 files changed, 33 insertions(+)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 24f662d8bd39..94d6354ea453 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -15,6 +15,7 @@
 #include <linux/acpi.h>
 #include <linux/memory.h>
 #include <linux/memory_hotplug.h>
+#include <linux/cc_platform.h>
 
 #include "internal.h"
 
@@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
 	if (!device)
 		return -EINVAL;
 
+	/*
+	 * If the confidential computing platform doesn't support ACPI
+	 * memory hotplug, the BIOS should never deliver such an event
+	 * to the kernel.  Report ACPI memory hot-add as a BIOS bug and
+	 * ignore the memory device.
+	 */
+	if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
+		dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI memory hotplug. New memory device ignored.\n");
+		return -EINVAL;
+	}
+
 	mem_device = kzalloc(sizeof(struct acpi_memory_device), GFP_KERNEL);
 	if (!mem_device)
 		return -ENOMEM;
@@ -334,6 +346,17 @@ static void acpi_memory_device_remove(struct acpi_device *device)
 	if (!device || !acpi_driver_data(device))
 		return;
 
+	/*
+	 * The confidential computing platform is broken if ACPI memory
+	 * hot-removal isn't supported but happened anyway.  For
+	 * simplicity, assume the kernel cannot continue to work
+	 * normally and just BUG().
+	 */
+	if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
+		dev_err(&device->dev, "Platform doesn't support ACPI memory hotplug. BUG().\n");
+		BUG();
+	}
+
 	mem_device = acpi_driver_data(device);
 	acpi_memory_remove_memory(mem_device);
 	acpi_memory_device_free(mem_device);
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 9ce9256facc8..b831c24bd7f6 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -93,6 +93,16 @@ enum cc_attr {
 	 * Examples include TDX platform.
 	 */
 	CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED,
+
+	/**
+	 * @CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED: ACPI memory hotplug is
+	 *					  not supported.
+	 *
+	 * The platform/OS does not support ACPI memory hotplug.
+	 *
+	 * Examples include TDX platform.
+	 */
+	CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (2 preceding siblings ...)
  2022-06-22 11:15 ` [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-24  1:41   ` Chao Gao
  2022-06-22 11:16 ` [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory Kai Huang
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  To guarantee security, TDX imposes
additional requirements on both CPU and memory.

TDX doesn't work with ACPI CPU hotplug.  During platform boot, MCHECK
verifies that all logical CPUs on all packages are TDX compatible.  Any
CPU hot-added at runtime is not verified and thus cannot support TDX.
TDX also requires all boot-time verified CPUs to remain present during
the machine's runtime, so ACPI CPU hot-removal isn't supported either.

TDX doesn't work with ACPI memory hotplug either.  TDX provides
increased levels of memory confidentiality and integrity, and during
platform boot MCHECK also verifies that all TDX-capable memory regions
are physically present and meet TDX's security requirements.  Any
memory hot-added at runtime is not verified and thus cannot work with
TDX.  TDX also assumes all TDX-capable memory regions remain present
during the machine's runtime, so it doesn't support ACPI memory
hot-removal either.

Select ARCH_HAS_CC_PLATFORM when CONFIG_INTEL_TDX_HOST is on.  Set CC
vendor to CC_VENDOR_INTEL if TDX is enabled by BIOS, and report ACPI CPU
hotplug and ACPI memory hotplug attributes as disabled to prevent them.

Note TDX does allow a CPU to go offline and be brought up again, so the
software CPU hotplug attribute is not reported.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig            |  1 +
 arch/x86/coco/core.c        | 32 +++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c |  4 ++++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 23f21aa3a5c4..efa830853e98 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	depends on KVM_INTEL
+	select ARCH_HAS_CC_PLATFORM
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 1bde1af75296..e4c9e34c452f 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -12,11 +12,14 @@
 
 #include <asm/coco.h>
 #include <asm/processor.h>
+#include <asm/cpufeature.h>
+#include <asm/tdx.h>
 
 static enum cc_vendor vendor __ro_after_init;
 static u64 cc_mask __ro_after_init;
 
-static bool intel_cc_platform_has(enum cc_attr attr)
+#ifdef CONFIG_INTEL_TDX_GUEST
+static bool intel_tdx_guest_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
@@ -28,6 +31,33 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 		return false;
 	}
 }
+#endif
+
+#ifdef CONFIG_INTEL_TDX_HOST
+static bool intel_tdx_host_has(enum cc_attr attr)
+{
+	switch (attr) {
+	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
+	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
+		return true;
+	default:
+		return false;
+	}
+}
+#endif
+
+static bool intel_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_INTEL_TDX_GUEST
+	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+		return intel_tdx_guest_has(attr);
+#endif
+#ifdef CONFIG_INTEL_TDX_HOST
+	if (platform_tdx_enabled())
+		return intel_tdx_host_has(attr);
+#endif
+	return false;
+}
 
 /*
  * SME and SEV are very similar but they are not the same, so there are
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8275007702e6..eb3294bf1b0a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,7 @@
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
+#include <asm/coco.h>
 #include "tdx.h"
 
 static u32 tdx_keyid_start __ro_after_init;
@@ -92,6 +93,9 @@ static int __init tdx_early_detect(void)
 	if (ret)
 		return ret;
 
+	/* Set TDX enabled platform as confidential computing platform */
+	cc_set_vendor(CC_VENDOR_INTEL);
+
 	pr_info("TDX enabled by BIOS.\n");
 	return 0;
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (3 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and " Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-24  2:12   ` Chao Gao
  2022-06-24 19:01   ` Dave Hansen
  2022-06-22 11:16 ` [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
                   ` (17 subsequent siblings)
  22 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	akpm, kai.huang

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduces the concept of a "Convertible Memory
Region" (CMR).  During boot, the firmware builds a list of all of the
memory ranges which can provide the TDX security guarantees.  The list
of these ranges is available to the kernel by querying the TDX module.

However, those TDX-capable memory regions are not automatically usable
by the TDX module.  The kernel needs to choose which convertible memory
regions become the TDX-usable memory and pass those regions to the TDX
module when initializing it.  Once those ranges are passed to the TDX
module, the TDX-usable memory regions are fixed for the module's
lifetime.

To avoid having to modify the page allocator to distinguish TDX and
non-TDX memory allocations, this implementation guarantees that all
pages managed by the page allocator are TDX memory.  This means any
memory hot-added to the page allocator would break that guarantee and
thus must be prevented.

There are basically two memory hot-add cases that need to be prevented:
ACPI memory hot-add and driver managed memory hot-add.  However, adding
new memory to ZONE_DEVICE should not be prevented as those pages are not
managed by the page allocator.  Therefore memremap_pages() variants
should be allowed although they internally also use memory hotplug
functions.

ACPI memory hotplug is already prevented.  To prevent driver managed
memory and still allow memremap_pages() variants to work, add a __weak
hook to do arch-specific check in add_memory_resource().  Implement the
x86 version to prevent new memory region from being added when TDX is
enabled by BIOS.

A __weak arch-specific hook is used instead of a new CC_ATTR (similar
to the one used to disable software CPU hotplug) because some driver
managed memory resources may actually be TDX-capable (such as legacy
PMEM, which is indeed RAM underneath), and the arch-specific hook can
later be enhanced to allow those when needed.

Note an arch-specific hook for __remove_memory() is not required, as
neither ACPI hot-removal nor driver managed memory removal can reach it.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/mm/init_64.c          | 21 +++++++++++++++++++++
 include/linux/memory_hotplug.h |  2 ++
 mm/memory_hotplug.c            | 15 +++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 96d34ebb20a9..ce89cf88a818 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
 #include <asm/ftrace.h>
+#include <asm/tdx.h>
 
 #include "mm_internal.h"
 
@@ -972,6 +973,26 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return add_pages(nid, start_pfn, nr_pages, params);
 }
 
+int arch_memory_add_precheck(int nid, u64 start, u64 size, mhp_t mhp_flags)
+{
+	if (!platform_tdx_enabled())
+		return 0;
+
+	/*
+	 * TDX needs to guarantee all pages managed by the page allocator
+	 * are TDX memory in order to not have to distinguish TDX and
+	 * non-TDX memory allocation.  The kernel needs to pass the
+	 * TDX-usable memory regions to the TDX module when it gets
+	 * initialized.  After that, the TDX-usable memory regions are
+	 * fixed.  This means any memory hot-add to the page allocator
+	 * fixed.  This means any memory hot-add to the page allocator
+	 * would break the above guarantee and thus should be prevented.
+	pr_err("Unable to add memory [0x%llx, 0x%llx) on TDX enabled platform.\n",
+			start, start + size);
+
+	return -EINVAL;
+}
+
 static void __meminit free_pagetable(struct page *page, int order)
 {
 	unsigned long magic;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 1ce6f8044f1e..306ef4ceb419 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -325,6 +325,8 @@ extern int add_memory_resource(int nid, struct resource *resource,
 extern int add_memory_driver_managed(int nid, u64 start, u64 size,
 				     const char *resource_name,
 				     mhp_t mhp_flags);
+extern int arch_memory_add_precheck(int nid, u64 start, u64 size,
+				    mhp_t mhp_flags);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 				   unsigned long nr_pages,
 				   struct vmem_altmap *altmap, int migratetype);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 416b38ca8def..2ad4b2603c7c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1296,6 +1296,17 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 }
 
+/*
+ * Pre-check whether hot-add memory is allowed before arch_add_memory().
+ *
+ * Arch to provide replacement version if required.
+ */
+int __weak arch_memory_add_precheck(int nid, u64 start, u64 size,
+				    mhp_t mhp_flags)
+{
+	return 0;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1319,6 +1330,10 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	if (ret)
 		return ret;
 
+	ret = arch_memory_add_precheck(nid, start, size, mhp_flags);
+	if (ret)
+		return ret;
+
 	if (mhp_flags & MHP_NID_IS_MGID) {
 		group = memory_group_find_by_id(nid);
 		if (!group)
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (4 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-24  2:39   ` Chao Gao
  2022-06-22 11:16 ` [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function Kai Huang
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Before the TDX module can be used to create and run TD guests, it must
be loaded into the isolated region pointed to by the SEAMRR and properly
initialized.  The TDX module is expected to be loaded by BIOS before
booting to the kernel, and the kernel is expected to detect and
initialize it.

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
on-demand approach, deferring TDX initialization until there is a real
need (e.g. when requested by KVM).  This avoids consuming the memory
that must be allocated by the kernel and given to the TDX module as
metadata (~1/256th of the TDX-usable memory), and also saves the time
of initializing the TDX module (and the metadata) when TDX is not used
at all.  Initializing the TDX module at runtime on-demand is also more
flexible for supporting TDX module runtime updating in the future
(after updating the TDX module, it needs to be initialized again).

Add a placeholder tdx_init() to detect and initialize the TDX module on
demand, with a state machine protected by a mutex to support concurrent
calls from multiple callers.

The TDX module will be initialized in multi-steps defined by the TDX
architecture:

  1) Global initialization;
  2) Logical-CPU scope initialization;
  3) Enumerate the TDX module capabilities and platform configuration;
  4) Configure the TDX module about usable memory ranges and global
     KeyID information;
  5) Package-scope configuration for the global KeyID;
  6) Initialize usable memory ranges based on 4).

The TDX module can also be shut down at any time during its lifetime.
In case of any error during the initialization process, shut down the
module.  It's pointless to leave the module in any intermediate state
during the initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3->v5 (no feedback on v4):

 - Removed the check that SEAMRR and TDX KeyID have been detected on
   all present cpus.
 - Removed tdx_detect().
 - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
   hotplug lock and return early with error message.
 - Improved dmesg printing for TDX module detection and initialization.

---
 arch/x86/include/asm/tdx.h  |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 153 ++++++++++++++++++++++++++++++++++++
 2 files changed, 155 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 97511b76c1ac..801f6e10b2db 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -90,8 +90,10 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
+int tdx_init(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_init(void)  { return -ENODEV; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index eb3294bf1b0a..1f9d8108eeea 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -10,17 +10,39 @@
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/printk.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <asm/cpufeatures.h>
 #include <asm/cpufeature.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
+#include <asm/smp.h>
 #include <asm/tdx.h>
 #include <asm/coco.h>
 #include "tdx.h"
 
+/*
+ * TDX module status during initialization
+ */
+enum tdx_module_status_t {
+	/* TDX module hasn't been detected and initialized */
+	TDX_MODULE_UNKNOWN,
+	/* TDX module is not loaded */
+	TDX_MODULE_NONE,
+	/* TDX module is initialized */
+	TDX_MODULE_INITIALIZED,
+	/* TDX module is shut down due to initialization error */
+	TDX_MODULE_SHUTDOWN,
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 tdx_keyid_num __ro_after_init;
 
+static enum tdx_module_status_t tdx_module_status;
+/* Prevent concurrent attempts on TDX detection and initialization */
+static DEFINE_MUTEX(tdx_module_lock);
+
 /* Detect whether CPU supports SEAM */
 static int detect_seam(void)
 {
@@ -101,6 +123,84 @@ static int __init tdx_early_detect(void)
 }
 early_initcall(tdx_early_detect);
 
+/*
+ * Detect and initialize the TDX module.
+ *
+ * Return -ENODEV when the TDX module is not loaded, 0 when it
+ * is successfully initialized, or other error when it fails to
+ * initialize.
+ */
+static int init_tdx_module(void)
+{
+	/* The TDX module hasn't been detected */
+	return -ENODEV;
+}
+
+static void shutdown_tdx_module(void)
+{
+	/* TODO: Shut down the TDX module */
+	tdx_module_status = TDX_MODULE_SHUTDOWN;
+}
+
+static int __tdx_init(void)
+{
+	int ret;
+
+	/*
+	 * Initializing the TDX module requires running some code on
+	 * all MADT-enabled CPUs.  If not all MADT-enabled CPUs are
+	 * online, it's not possible to initialize the TDX module.
+	 *
+	 * For simplicity temporarily disable CPU hotplug to prevent
+	 * any CPU from going offline during the initialization.
+	 */
+	cpus_read_lock();
+
+	/*
+	 * Check whether all MADT-enabled CPUs are online and return
+	 * early with an explicit message so the user can be aware.
+	 *
+	 * Note ACPI CPU hotplug is prevented when TDX is enabled, so
+	 * num_processors always reflects all present MADT-enabled
+	 * CPUs during boot when disabled_cpus is 0.
+	 */
+	if (disabled_cpus || num_online_cpus() != num_processors) {
+		pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = init_tdx_module();
+	if (ret == -ENODEV) {
+		pr_info("TDX module is not loaded.\n");
+		goto out;
+	}
+
+	/*
+	 * Shut down the TDX module in case of any error during the
+	 * initialization process.  It's pointless to leave the TDX
+	 * module in any intermediate state.
+	 *
+	 * Shutting down the module also requires running some code on
+	 * all MADT-enabled CPUs.  Do it while CPU hotplug is disabled.
+	 *
+	 * Return all errors during initialization as -EFAULT as
+	 * the TDX module is always shut down in such cases.
+	 */
+	if (ret) {
+		pr_info("Failed to initialize TDX module.  Shut it down.\n");
+		shutdown_tdx_module();
+		ret = -EFAULT;
+		goto out;
+	}
+
+	pr_info("TDX module initialized.\n");
+out:
+	cpus_read_unlock();
+
+	return ret;
+}
+
 /**
  * platform_tdx_enabled() - Return whether BIOS has enabled TDX
  *
@@ -111,3 +211,56 @@ bool platform_tdx_enabled(void)
 {
 	return tdx_keyid_num >= 2;
 }
+
+/**
+ * tdx_init - Initialize the TDX module
+ *
+ * Initialize the TDX module to make it ready to run TD guests.
+ *
+ * The caller must ensure all CPUs are online before calling this
+ * function.  CPU hotplug is temporarily disabled internally to
+ * prevent any CPU from going offline.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * 0:		The TDX module has been successfully initialized.
+ * * -ENODEV:	The TDX module is not loaded, or TDX is not supported.
+ * * -EINVAL:	The TDX module cannot be initialized because certain
+ *		conditions are not met (i.e. not all MADT-enabled CPUs
+ *		are online).
+ * * -EFAULT:	Other internal fatal errors, or the TDX module is in
+ *		shutdown mode because it failed to initialize in a
+ *		previous attempt.
+ */
+int tdx_init(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -ENODEV;
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_init();
+		break;
+	case TDX_MODULE_NONE:
+		ret = -ENODEV;
+		break;
+	case TDX_MODULE_INITIALIZED:
+		ret = 0;
+		break;
+	default:
+		WARN_ON_ONCE(tdx_module_status != TDX_MODULE_SHUTDOWN);
+		ret = -EFAULT;
+		break;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_init);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (5 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-24 18:38   ` Dave Hansen
  2022-06-22 11:16 ` [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

The TDX module defines SEAMCALL leaf functions to allow the host to
initialize it, and to create and run protected VMs.  SEAMCALL leaf
functions use an ABI different from the x86-64 System V ABI.  Instead,
they share the same ABI with the TDCALL leaf functions.

Implement a function __seamcall() to allow the host to make SEAMCALL
to SEAM software using the TDX_MODULE_CALL macro which is the common
assembly for both SEAMCALL and TDCALL.

The SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD
when the CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't
handle SEAMCALL exceptions, so it is left to the caller to guarantee
those conditions are met before calling __seamcall().

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):

 - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
   SEAMCALL itself fails.
 - Improve the changelog.

---
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 11 +++++++
 3 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S

diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 1bd688684716..fd577619620e 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f322427e48c3
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *		  (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function.  Additional
+ * output operands are saved in @out (if it is provided by caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *			 stored temporarily in R12 (not
+ *			 used by the P-SEAMLDR or the TDX
+ *			 module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f16055cc25f4..f1a2dfb978b1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -2,6 +2,7 @@
 #ifndef _X86_VIRT_TDX_H
 #define _X86_VIRT_TDX_H
 
+#include <linux/types.h>
 #include <linux/bits.h>
 
 /*
@@ -44,4 +45,14 @@
 		((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+
+/*
+ * Do not put any hardware-defined TDX structure representations below this
+ * comment!
+ */
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+
 #endif
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (6 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-24 18:50   ` Dave Hansen
  2022-06-22 11:16 ` [PATCH v5 09/22] x86/virt/tdx: Detect TDX module by doing module global initialization Kai Huang
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX supports shutting down the TDX module at any time during its
lifetime.  After the module is shut down, no further TDX module SEAMCALL
leaf function calls can be made on any logical CPU.

Shut down the TDX module in case of any error during the initialization
process.  It's pointless to leave the TDX module in some middle state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
CPUs and use it to shut down the module.  Later logical-cpu scope module
initialization will use it too.

Also add a wrapper of __seamcall() which additionally prints out error
information if the SEAMCALL fails.  This is useful during the TDX module
initialization as it provides more error information to the user.

The SEAMCALL instruction causes #UD if the CPU is not in VMX operation
(i.e. VMXON has been done).  So far only KVM does VMXON, and it
guarantees all online CPUs are in VMX operation whenever any VM still
exists.  As KVM is also the only user of TDX so far, just let the caller
guarantee all CPUs are in VMX operation during tdx_init().

Adding support for VMXON/VMXOFF to the core kernel isn't trivial.  In
the long term, more kernel components will likely need to use TDX, so a
reference-based approach to doing VMXON/VMXOFF will likely be needed.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):

 - Added a wrapper of __seamcall() to print error code if SEAMCALL fails.
 - Made the seamcall_on_each_cpu() void.
 - Removed 'seamcall_ret' and 'tdx_module_out' from
   'struct seamcall_ctx', as they must be local variable.
 - Added the comments to tdx_init() and one paragraph to changelog to
   explain the caller should handle VMXON.
 - Called out after shut down, no "TDX module" SEAMCALL can be made.

---
 arch/x86/virt/vmx/tdx/tdx.c | 65 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  5 +++
 2 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1f9d8108eeea..31ce4522100a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,8 @@
 #include <linux/mutex.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
 #include <asm/cpufeatures.h>
 #include <asm/cpufeature.h>
 #include <asm/msr-index.h>
@@ -123,6 +125,61 @@ static int __init tdx_early_detect(void)
 }
 early_initcall(tdx_early_detect);
 
+/*
+ * Data structure to make SEAMCALL on multiple CPUs concurrently.
+ * @err is set to -EFAULT when SEAMCALL fails on any cpu.
+ */
+struct seamcall_ctx {
+	u64 fn;
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	atomic_t err;
+};
+
+/*
+ * Wrapper of __seamcall().  It additionally prints out the error
+ * information if __seamcall() fails.  It is useful during the TDX
+ * module initialization as it provides more information to the user.
+ */
+static u64 seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		    struct tdx_module_output *out)
+{
+	u64 ret;
+
+	ret = __seamcall(fn, rcx, rdx, r8, r9, out);
+	if (ret == TDX_SEAMCALL_VMFAILINVALID || !ret)
+		return ret;
+
+	pr_err("SEAMCALL failed: leaf: 0x%llx, error: 0x%llx\n", fn, ret);
+	if (out)
+		pr_err("SEAMCALL additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
+			out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
+
+	return ret;
+}
+
+static void seamcall_smp_call_function(void *data)
+{
+	struct seamcall_ctx *sc = data;
+	struct tdx_module_output out;
+	u64 ret;
+
+	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, &out);
+	if (ret)
+		atomic_set(&sc->err, -EFAULT);
+}
+
+/*
+ * Call the SEAMCALL on all online CPUs concurrently.  The caller should
+ * check @sc->err to determine whether any SEAMCALL failed on any cpu.
+ */
+static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+	on_each_cpu(seamcall_smp_call_function, sc, true);
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -138,7 +195,10 @@ static int init_tdx_module(void)
 
 static void shutdown_tdx_module(void)
 {
-	/* TODO: Shut down the TDX module */
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+	seamcall_on_each_cpu(&sc);
+
 	tdx_module_status = TDX_MODULE_SHUTDOWN;
 }
 
@@ -221,6 +281,9 @@ bool platform_tdx_enabled(void)
  * CPU hotplug is temporarily disabled internally to prevent any cpu
  * from going offline.
  *
+ * Caller also needs to guarantee all CPUs are in VMX operation during
+ * this function, otherwise Oops may be triggered.
+ *
  * This function can be called in parallel by multiple callers.
  *
  * Return:
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f1a2dfb978b1..95d4eb884134 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -46,6 +46,11 @@
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN	44
+
 /*
  * Do not put any hardware-defined TDX structure representations below this
  * comment!
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 09/22] x86/virt/tdx: Detect TDX module by doing module global initialization
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (7 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-22 11:16 ` [PATCH v5 10/22] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

So far the TDX module hasn't been detected yet.  __seamcall() returns
TDX_SEAMCALL_VMFAILINVALID when the target SEAM software module is not
loaded, so a SEAMCALL to the TDX module can be used to detect whether
the module is loaded.

The first step of initializing the module is to call TDH.SYS.INIT once
on any logical cpu to do module global initialization.  Just use it to
detect the module since it needs to be done anyway.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Added detecting the TDX module.

---
 arch/x86/virt/vmx/tdx/tdx.c | 39 +++++++++++++++++++++++++++++++++++--
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 31ce4522100a..de4efc16ed45 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -180,6 +180,21 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
 	on_each_cpu(seamcall_smp_call_function, sc, true);
 }
 
+/*
+ * Do TDX module global initialization.  It also detects whether the
+ * module has been loaded or not.
+ */
+static int tdx_module_init_global(void)
+{
+	u64 ret;
+
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL);
+	if (ret == TDX_SEAMCALL_VMFAILINVALID)
+		return -ENODEV;
+
+	return ret ? -EFAULT : 0;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -189,8 +204,28 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
  */
 static int init_tdx_module(void)
 {
-	/* The TDX module hasn't been detected */
-	return -ENODEV;
+	int ret;
+
+	/*
+	 * Whether the TDX module is loaded is still unknown.  SEAMCALL
+	 * instruction fails with VMfailInvalid if the target SEAM
+	 * software module is not loaded, so it can be used to detect the
+	 * module.
+	 *
+	 * The first step of initializing the TDX module is module global
+	 * initialization.  Just use it to detect the module.
+	 */
+	ret = tdx_module_init_global();
+	if (ret)
+		goto out;
+
+	/*
+	 * Return -EINVAL until all steps of TDX module initialization
+	 * process are done.
+	 */
+	ret = -EINVAL;
+out:
+	return ret;
 }
 
 static void shutdown_tdx_module(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 95d4eb884134..9e694789eb91 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -49,6 +49,7 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INIT		33
 #define TDH_SYS_LP_SHUTDOWN	44
 
 /*
-- 
2.36.1



* [PATCH v5 10/22] x86/virt/tdx: Do logical-cpu scope TDX module initialization
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (8 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 09/22] x86/virt/tdx: Detect TDX module by doing module global initialization Kai Huang
@ 2022-06-22 11:16 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 11/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:16 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

After the global module initialization, the next step is logical-cpu
scope module initialization.  Logical-cpu initialization requires
calling TDH.SYS.LP.INIT on all BIOS-enabled CPUs.  This SEAMCALL can run
concurrently on all CPUs.

Use the helper introduced for shutting down the module to do logical-cpu
scope initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 15 +++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index de4efc16ed45..f3f6e20aa30e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -195,6 +195,15 @@ static int tdx_module_init_global(void)
 	return ret ? -EFAULT : 0;
 }
 
+static int tdx_module_init_cpus(void)
+{
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
+
+	seamcall_on_each_cpu(&sc);
+
+	return atomic_read(&sc.err);
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -219,6 +228,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Logical-cpu scope initialization */
+	ret = tdx_module_init_cpus();
+	if (ret)
+		goto out;
+
+
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9e694789eb91..56164bf27378 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -50,6 +50,7 @@
  * TDX module SEAMCALL leaf functions
  */
 #define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
 
 /*
-- 
2.36.1



* [PATCH v5 11/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (9 preceding siblings ...)
  2022-06-22 11:16 ` [PATCH v5 10/22] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory Kai Huang
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges, along with TDX module information, is available to the kernel by
querying the TDX module via TDH.SYS.INFO SEAMCALL.

The host kernel can choose whether or not to use all convertible memory
regions as TDX-usable memory.  Before the TDX module is ready to create
any TDX guests, the kernel needs to configure the TDX-usable memory
regions by passing an array of "TD Memory Regions" (TDMRs) to the TDX
module.  Constructing the TDMR array requires information of both the
TDX module (TDSYSINFO_STRUCT) and the Convertible Memory Regions.  Call
TDH.SYS.INFO to get this information as preparation.

Use static variables for both TDSYSINFO_STRUCT and the CMR array to
avoid having to pass them as function arguments when constructing the
TDMR array.  They are also too big to be put on the stack anyway.
Also, KVM needs the TDSYSINFO_STRUCT to create TDX guests.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Renamed sanitize_cmrs() to check_cmrs().
 - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array
   actual size returned by TDH.SYS.INFO.
 - Changed -EFAULT to -EINVAL in couple places.
 - Added comments around tdx_sysinfo and tdx_cmr_array saying they are
   used by TDH.SYS.INFO ABI.
 - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function
   arguments in tdx_get_sysinfo().
 - Changed to only print BIOS-CMR when check_cmrs() fails.

---
 arch/x86/virt/vmx/tdx/tdx.c | 137 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  61 ++++++++++++++++
 2 files changed, 198 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f3f6e20aa30e..1bc97756bc0d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -45,6 +45,11 @@ static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX detection and initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* Below two are used in TDH.SYS.INFO SEAMCALL ABI */
+static struct tdsysinfo_struct tdx_sysinfo;
+static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
+static int tdx_cmr_num;
+
 /* Detect whether CPU supports SEAM */
 static int detect_seam(void)
 {
@@ -204,6 +209,135 @@ static int tdx_module_init_cpus(void)
 	return atomic_read(&sc.err);
 }
 
+static inline bool cmr_valid(struct cmr_info *cmr)
+{
+	return !!cmr->size;
+}
+
+static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
+		       const char *name)
+{
+	int i;
+
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		pr_info("%s : [0x%llx, 0x%llx)\n", name,
+				cmr->base, cmr->base + cmr->size);
+	}
+}
+
+/*
+ * Check the CMRs reported by TDH.SYS.INFO and update the actual number
+ * of CMRs.  The CMR array returned by TDH.SYS.INFO may contain invalid
+ * CMRs after the last valid CMR, but there should be no invalid CMRs
+ * between two valid CMRs.  Update the actual number of CMRs by
+ * dropping all tail empty CMRs.
+ */
+static int check_cmrs(struct cmr_info *cmr_array, int *actual_cmr_num)
+{
+	int cmr_num = *actual_cmr_num;
+	int i, j;
+
+	/*
+	 * Intel TDX module spec, 20.7.3 CMR_INFO:
+	 *
+	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
+	 *   array of CMR_INFO entries. The CMRs are sorted from the
+	 *   lowest base address to the highest base address, and they
+	 *   are non-overlapping.
+	 *
+	 * This implies that BIOS may generate invalid empty entries
+	 * if the total number of CMRs is less than 32.  Skip them manually.
+	 */
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+		struct cmr_info *prev_cmr = NULL;
+
+		/* Skip further invalid CMRs */
+		if (!cmr_valid(cmr))
+			break;
+
+		if (i > 0)
+			prev_cmr = &cmr_array[i - 1];
+
+		/*
+		 * It is a TDX firmware bug if CMRs are not
+		 * in address ascending order.
+		 */
+		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
+					cmr->base)) {
+			print_cmrs(cmr_array, cmr_num, "BIOS-CMR");
+			pr_err("Firmware bug: CMRs not in address ascending order.\n");
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * A sane BIOS should also never generate invalid CMR(s) between
+	 * two valid CMRs.  Sanity check this and simply return an error
+	 * in this case.
+	 *
+	 * When reaching here, @i is the index of the first invalid CMR
+	 * (or cmr_num).  Start with the next entry after @i since @i has
+	 * already been checked.
+	 */
+	for (j = i + 1; j < cmr_num; j++) {
+		if (cmr_valid(&cmr_array[j])) {
+			print_cmrs(cmr_array, cmr_num, "BIOS-CMR");
+			pr_err("Firmware bug: invalid CMR(s) before valid CMRs.\n");
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * Trim all tail invalid empty CMRs.  BIOS should generate at
+	 * least one valid CMR, otherwise it's a TDX firmware bug.
+	 */
+	if (i == 0) {
+		print_cmrs(cmr_array, cmr_num, "BIOS-CMR");
+		pr_err("Firmware bug: No valid CMR.\n");
+		return -EINVAL;
+	}
+
+	/* Update the actual number of CMRs */
+	*actual_cmr_num = i;
+
+	/* Print kernel checked CMRs */
+	print_cmrs(cmr_array, *actual_cmr_num, "Kernel-checked-CMR");
+
+	return 0;
+}
+
+static int tdx_get_sysinfo(struct tdsysinfo_struct *tdsysinfo,
+			   struct cmr_info *cmr_array,
+			   int *actual_cmr_num)
+{
+	struct tdx_module_output out;
+	u64 ret;
+
+	BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
+
+	ret = seamcall(TDH_SYS_INFO, __pa(tdsysinfo), TDSYSINFO_STRUCT_SIZE,
+			__pa(cmr_array), MAX_CMRS, &out);
+	if (ret)
+		return -EFAULT;
+
+	/* R9 contains the actual entries written to the CMR array. */
+	*actual_cmr_num = out.r9;
+
+	pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+		tdsysinfo->vendor_id, tdsysinfo->major_version,
+		tdsysinfo->minor_version, tdsysinfo->build_date,
+		tdsysinfo->build_num);
+
+	/*
+	 * check_cmrs() updates the actual number of CMRs by dropping all
+	 * tail invalid CMRs.
+	 */
+	return check_cmrs(cmr_array, actual_cmr_num);
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -233,6 +367,9 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	ret = tdx_get_sysinfo(&tdx_sysinfo, tdx_cmr_array, &tdx_cmr_num);
+	if (ret)
+		goto out;
 
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 56164bf27378..63b1edd11660 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -49,10 +49,71 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
 
+struct cmr_info {
+	u64	base;
+	u64	size;
+} __packed;
+
+#define MAX_CMRS			32
+#define CMR_INFO_ARRAY_ALIGNMENT	512
+
+struct cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
+	 * is 1024B defined by TDX architecture.  Use a union with
+	 * specific padding to make 'sizeof(struct tdsysinfo_struct)'
+	 * equal to 1024.
+	 */
+	union {
+		struct cpuid_config	cpuid_configs[0];
+		u8			reserved5[892];
+	};
+} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
+
 /*
  * Do not put any hardware-defined TDX structure representations below this
  * comment!
-- 
2.36.1



* [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (10 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 11/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-24 19:40   ` Dave Hansen
  2022-06-22 11:17 ` [PATCH v5 13/22] x86/virt/tdx: Add placeholder to construct TDMRs based on memblock Kai Huang
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

The TDX module reports a list of Convertible Memory Regions (CMR) to
identify which memory regions can be used as TDX memory, but they are
not automatically usable to the TDX module.  The kernel needs to choose
which convertible memory regions to be TDX memory and configure those
regions by passing an array of "TD Memory Regions" (TDMR) to the TDX
module.

To avoid having to modify the page allocator to distinguish TDX and
non-TDX memory allocation, convert all memory regions in the memblock to
TDX memory.  As the first step, sanity check that all memory regions in
memblock are fully covered by CMRs so the above conversion is guaranteed
to work.  This also works because both ACPI memory hotplug (reported as
a BIOS bug) and driver-managed memory hotplug are prevented when TDX is
enabled by BIOS, so no new non-TDX-convertible memory can end up in the
page allocator.

Select ARCH_KEEP_MEMBLOCK when CONFIG_INTEL_TDX_HOST to keep memblock
after boot so it can be used during the TDX module initialization.

Also, explicitly exclude memory regions below first 1MB as TDX memory
because those regions may not be reported as convertible memory.  This
is OK as the first 1MB is always reserved during kernel boot and won't
end up to the page allocator.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Changed to use memblock from e820.
 - Simplified changelog a lot.

---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 100 ++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index efa830853e98..4988a91d5283 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1974,6 +1974,7 @@ config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	select ARCH_HAS_CC_PLATFORM
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1bc97756bc0d..2b20d4a7a62b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,8 @@
 #include <linux/cpumask.h>
 #include <linux/smp.h>
 #include <linux/atomic.h>
+#include <linux/sizes.h>
+#include <linux/memblock.h>
 #include <asm/cpufeatures.h>
 #include <asm/cpufeature.h>
 #include <asm/msr-index.h>
@@ -338,6 +340,91 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *tdsysinfo,
 	return check_cmrs(cmr_array, actual_cmr_num);
 }
 
+/*
+ * Skip the memory region below 1MB.  Return true if the entire
+ * region is skipped.  Otherwise, the updated range is returned.
+ */
+static bool pfn_range_skip_lowmem(unsigned long *p_start_pfn,
+				  unsigned long *p_end_pfn)
+{
+	u64 start, end;
+
+	start = *p_start_pfn << PAGE_SHIFT;
+	end = *p_end_pfn << PAGE_SHIFT;
+
+	if (start < SZ_1M)
+		start = SZ_1M;
+
+	if (start >= end)
+		return true;
+
+	*p_start_pfn = (start >> PAGE_SHIFT);
+
+	return false;
+}
+
+/*
+ * Walks over all memblock memory regions that are intended to be
+ * converted to TDX memory.  Essentially, it is all memblock memory
+ * regions excluding the low memory below 1MB.
+ *
+ * This is because on some TDX platforms the low memory below 1MB is
+ * not included in CMRs.  Excluding the low 1MB can still guarantee
+ * that the pages managed by the page allocator are always TDX memory,
+ * as the low 1MB is reserved during kernel boot and won't end up to
+ * the ZONE_DMA (see reserve_real_mode()).
+ */
+#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
+	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
+		if (!pfn_range_skip_lowmem(p_start, p_end))
+
+/* Check whether first range is the subrange of the second */
+static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
+{
+	return r1_start >= r2_start && r1_end <= r2_end;
+}
+
+/* Check whether address range is covered by any CMR or not. */
+static bool range_covered_by_cmr(struct cmr_info *cmr_array, int cmr_num,
+				 u64 start, u64 end)
+{
+	int i;
+
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		if (is_subrange(start, end, cmr->base, cmr->base + cmr->size))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Check whether all memory regions in memblock are TDX convertible
+ * memory.  Return 0 if all memory regions are convertible, or error.
+ */
+static int check_memblock_tdx_convertible(void)
+{
+	unsigned long start_pfn, end_pfn;
+	int i;
+
+	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, NULL) {
+		u64 start, end;
+
+		start = start_pfn << PAGE_SHIFT;
+		end = end_pfn << PAGE_SHIFT;
+		if (!range_covered_by_cmr(tdx_cmr_array, tdx_cmr_num, start,
+					end)) {
+			pr_err("[0x%llx, 0x%llx) is not fully convertible memory\n",
+					start, end);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -371,6 +458,19 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * To avoid having to modify the page allocator to distinguish
+	 * TDX and non-TDX memory allocation, convert all memory regions
+	 * in memblock to TDX memory to make sure all pages managed by
+	 * the page allocator are TDX memory.
+	 *
+	 * Sanity check all memory regions are fully covered by CMRs to
+	 * make sure they are truly convertible.
+	 */
+	ret = check_memblock_tdx_convertible();
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
-- 
2.36.1



* [PATCH v5 13/22] x86/virt/tdx: Add placeholder to construct TDMRs based on memblock
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (11 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 14/22] x86/virt/tdx: Create TDMRs to cover all memblock memory regions Kai Huang
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory
Region" (CMR).  During boot, the firmware builds a list of all of the
memory ranges which can provide the TDX security guarantees.  The list
of these ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory.  This metadata essentially
serves as the 'struct page' for the TDX module.  The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be put into.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.
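
The "256G TDMR = ~1G PAMT" rule of thumb above can be checked with a
little arithmetic.  A minimal userspace sketch, assuming the 16-byte
PAMT entry size (the real value is reported by the TDX module in
TDSYSINFO_STRUCT's 'pamt_entry_size'; the helper names are made up):

```c
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL
#define SZ_1G	0x40000000ULL

/* PAMT bytes needed to track one page size across a whole TDMR. */
static uint64_t pamt_size(uint64_t tdmr_size, uint64_t page_size,
			  uint64_t pamt_entry_size)
{
	return (tdmr_size / page_size) * pamt_entry_size;
}

/* A TDMR needs one PAMT per supported page size (4K, 2M and 1G). */
static uint64_t tdmr_total_pamt_size(uint64_t tdmr_size,
				     uint64_t pamt_entry_size)
{
	return pamt_size(tdmr_size, SZ_4K, pamt_entry_size) +
	       pamt_size(tdmr_size, SZ_2M, pamt_entry_size) +
	       pamt_size(tdmr_size, SZ_1G, pamt_entry_size);
}
```

The 4K PAMT dominates: 16 bytes per 4K page is exactly 1/256th of the
TDMR, and the 2M and 1G PAMTs only add a tiny fraction on top.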

As one step of initializing the TDX module, the kernel configures
TDX-usable memory by passing an array of TDMRs to the TDX module.

Constructing the array of TDMRs consists of the following steps:

1) Create TDMRs to cover all memory regions that TDX module can use;
2) Allocate and set up PAMT for each TDMR;
3) Set up reserved areas for each TDMR.

Add a placeholder to construct TDMRs, doing the above steps, after all
memblock memory regions are verified to be convertible.  Always free
TDMRs at the end of the initialization (whether successful or not) as
TDMRs are only used during the initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Moved calculating TDMR size to this patch.
 - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs
   once, instead of allocating each TDMR individually.
 - Removed "crypto protection" in the changelog.
 - -EFAULT -> -EINVAL in couple of places.

---
 arch/x86/virt/vmx/tdx/tdx.c | 73 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2b20d4a7a62b..645addb1bea2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -17,6 +17,8 @@
 #include <linux/atomic.h>
 #include <linux/sizes.h>
 #include <linux/memblock.h>
+#include <linux/gfp.h>
+#include <linux/align.h>
 #include <asm/cpufeatures.h>
 #include <asm/cpufeature.h>
 #include <asm/msr-index.h>
@@ -425,6 +427,55 @@ static int check_memblock_tdx_convertible(void)
 	return 0;
 }
 
+/* Calculate the actual TDMR_INFO size */
+static inline int cal_tdmr_size(void)
+{
+	int tdmr_sz;
+
+	/*
+	 * The actual size of TDMR_INFO depends on the maximum number
+	 * of reserved areas.
+	 */
+	tdmr_sz = sizeof(struct tdmr_info);
+	tdmr_sz += sizeof(struct tdmr_reserved_area) *
+		   tdx_sysinfo.max_reserved_per_tdmr;
+
+	/*
+	 * TDX requires each TDMR_INFO to be 512-byte aligned.  Always
+	 * round up TDMR_INFO size to the 512-byte boundary.
+	 */
+	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+}
+
+static struct tdmr_info *alloc_tdmr_array(int *array_sz)
+{
+	/*
+	 * TDX requires each TDMR_INFO to be 512-byte aligned.
+	 * Use alloc_pages_exact() to allocate all TDMRs at once.
+	 * Each TDMR_INFO will still be 512-byte aligned since
+	 * cal_tdmr_size() always returns a 512-byte aligned size.
+	 */
+	*array_sz = cal_tdmr_size() * tdx_sysinfo.max_tdmrs;
+
+	/*
+	 * Zero the buffer so 'struct tdmr_info::size' can be
+	 * used to determine whether a TDMR is valid.
+	 */
+	return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
+}
+
+/*
+ * Construct an array of TDMRs to cover all memory regions in memblock.
+ * This makes sure all pages managed by the page allocator are TDX
+ * memory.  The actual number of TDMRs is kept in @tdmr_num.
+ */
+static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
+				     int *tdmr_num)
+{
+	/* Return -EINVAL until constructing TDMRs is done */
+	return -EINVAL;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -434,6 +485,9 @@ static int check_memblock_tdx_convertible(void)
  */
 static int init_tdx_module(void)
 {
+	struct tdmr_info *tdmr_array;
+	int tdmr_array_sz;
+	int tdmr_num;
 	int ret;
 
 	/*
@@ -471,11 +525,30 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Prepare enough space to construct TDMRs */
+	tdmr_array = alloc_tdmr_array(&tdmr_array_sz);
+	if (!tdmr_array) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Construct TDMRs to cover all memory regions in memblock */
+	ret = construct_tdmrs_memeblock(tdmr_array, &tdmr_num);
+	if (ret)
+		goto out_free_tdmrs;
+
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
 	 */
 	ret = -EINVAL;
+out_free_tdmrs:
+	/*
+	 * The array of TDMRs is freed whether the initialization is
+	 * successful or not, as the TDMRs are not needed anymore after
+	 * the module initialization.
+	 */
+	free_pages_exact(tdmr_array, tdmr_array_sz);
 out:
 	return ret;
 }
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 63b1edd11660..55d6c69ab900 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -114,6 +114,29 @@ struct tdsysinfo_struct {
 	};
 } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
 
+struct tdmr_reserved_area {
+	u64 offset;
+	u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT	512
+
+struct tdmr_info {
+	u64 base;
+	u64 size;
+	u64 pamt_1g_base;
+	u64 pamt_1g_size;
+	u64 pamt_2m_base;
+	u64 pamt_2m_size;
+	u64 pamt_4k_base;
+	u64 pamt_4k_size;
+	/*
+	 * Actual number of reserved areas depends on
+	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+	 */
+	struct tdmr_reserved_area reserved_areas[0];
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
 /*
  * Do not put any hardware-defined TDX structure representations below this
  * comment!
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 14/22] x86/virt/tdx: Create TDMRs to cover all memblock memory regions
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (12 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 13/22] x86/virt/tdx: Add placeholder to construct TDMRs based on memblock Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

The kernel configures TDX-usable memory regions by passing an array of
"TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains the
information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.

Create a number of TDMRs according to the memblock memory regions.  To
keep it simple, always try to create one TDMR for each memory region.
As the first step only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and the size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans the 1GB boundary and the former part is already
covered by the previous TDMR, just create a new TDMR for the remaining
part.

TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there are still memory regions to cover.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Removed allocating TDMR individually.
 - Improved changelog by using Dave's words.
 - Made TDMR_START() and TDMR_END() static inline functions.

---
 arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 645addb1bea2..fd9f449b5395 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -427,6 +427,24 @@ static int check_memblock_tdx_convertible(void)
 	return 0;
 }
 
+/* TDMR must be 1G aligned */
+#define TDMR_ALIGNMENT		BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
+
+/* Align up and down the address to TDMR boundary */
+#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
+
+static inline u64 tdmr_start(struct tdmr_info *tdmr)
+{
+	return tdmr->base;
+}
+
+static inline u64 tdmr_end(struct tdmr_info *tdmr)
+{
+	return tdmr->base + tdmr->size;
+}
+
 /* Calculate the actual TDMR_INFO size */
 static inline int cal_tdmr_size(void)
 {
@@ -464,6 +482,82 @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz)
 	return alloc_pages_exact(*array_sz, GFP_KERNEL | __GFP_ZERO);
 }
 
+static struct tdmr_info *tdmr_array_entry(struct tdmr_info *tdmr_array,
+					  int idx)
+{
+	return (struct tdmr_info *)((unsigned long)tdmr_array +
+			cal_tdmr_size() * idx);
+}
+
+/*
+ * Create TDMRs to cover all memory regions in memblock.  The actual
+ * number of TDMRs is returned in @tdmr_num.
+ */
+static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, nid, tdmr_idx = 0;
+
+	/*
+	 * Loop over all memory regions in memblock and create TDMRs to
+	 * cover them.  To keep it simple, always try to use one TDMR to
+	 * cover one memory region.
+	 */
+	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, &nid) {
+		struct tdmr_info *tdmr;
+		u64 start, end;
+
+		tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
+		start = TDMR_ALIGN_DOWN(start_pfn << PAGE_SHIFT);
+		end = TDMR_ALIGN_UP(end_pfn << PAGE_SHIFT);
+
+		/*
+		 * If the current TDMR's size hasn't been initialized,
+		 * it is a new TDMR to cover the new memory region.
+		 * Otherwise, the current TDMR has already covered the
+		 * previous memory region.  In the latter case, check
+		 * whether the current memory region has been fully or
+		 * partially covered by the current TDMR, since TDMR is
+		 * 1G aligned.
+		 */
+		if (tdmr->size) {
+			/*
+			 * Loop to the next memory region if the current
+			 * region has already been fully covered by the
+			 * current TDMR.
+			 */
+			if (end <= tdmr_end(tdmr))
+				continue;
+
+			/*
+			 * If part of the current memory region has
+			 * already been covered by the current TDMR,
+			 * skip the already covered part.
+			 */
+			if (start < tdmr_end(tdmr))
+				start = tdmr_end(tdmr);
+
+			/*
+			 * Create a new TDMR to cover the current memory
+			 * region, or the remaining part of it.
+			 */
+			tdmr_idx++;
+			if (tdmr_idx >= tdx_sysinfo.max_tdmrs)
+				return -E2BIG;
+
+			tdmr = tdmr_array_entry(tdmr_array, tdmr_idx);
+		}
+
+		tdmr->base = start;
+		tdmr->size = end - start;
+	}
+
+	/* @tdmr_idx is always the index of the last valid TDMR. */
+	*tdmr_num = tdmr_idx + 1;
+
+	return 0;
+}
+
 /*
  * Construct an array of TDMRs to cover all memory regions in memblock.
  * This makes sure all pages managed by the page allocator are TDX
@@ -472,8 +566,16 @@ static struct tdmr_info *alloc_tdmr_array(int *array_sz)
 static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
 				     int *tdmr_num)
 {
+	int ret;
+
+	ret = create_tdmrs(tdmr_array, tdmr_num);
+	if (ret)
+		goto err;
+
 	/* Return -EINVAL until constructing TDMRs is done */
-	return -EINVAL;
+	ret = -EINVAL;
+err:
+	return ret;
 }
 
 /*
-- 
2.36.1



* [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (13 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 14/22] x86/virt/tdx: Create TDMRs to cover all memblock memory regions Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-24 20:13   ` Dave Hansen
  2022-08-17 22:46   ` Sagi Shahar
  2022-06-22 11:17 ` [PATCH v5 16/22] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
                   ` (7 subsequent siblings)
  22 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred to as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is that alloc_contig_pages() may fail at runtime.  One
(bad) mitigation is to launch a TD guest early during system boot so
that those PAMTs are allocated early, but the proper fix is to add a
boot option to allocate or reserve PAMTs during kernel boot.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate the three PAMTs of a TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Used memblock to get the NUMA node for given TDMR.
 - Removed the tdmr_get_pamt_sz() helper and open-coded it instead.
 - Changed to use 'switch .. case..' for each TDX supported page size in
   tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
 - Added printing out memory used for PAMT allocation when TDX module is
   initialized successfully.
 - Explained downside of alloc_contig_pages() in changelog.
 - Addressed other minor comments.

---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 200 ++++++++++++++++++++++++++++++++++++
 2 files changed, 201 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4988a91d5283..ec496e96d120 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	depends on KVM_INTEL
+	depends on CONTIG_ALLOC
 	select ARCH_HAS_CC_PLATFORM
 	select ARCH_KEEP_MEMBLOCK
 	help
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index fd9f449b5395..36260dd7e69f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -558,6 +558,196 @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
 	return 0;
 }
 
+/* Page sizes supported by TDX */
+enum tdx_page_sz {
+	TDX_PG_4K,
+	TDX_PG_2M,
+	TDX_PG_1G,
+	TDX_PG_MAX,
+};
+
+/*
+ * Calculate PAMT size given a TDMR and a page size.  The returned
+ * PAMT size is always aligned up to the 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr,
+				      enum tdx_page_sz pgsz)
+{
+	unsigned long pamt_sz;
+	int pamt_entry_nr;
+
+	switch (pgsz) {
+	case TDX_PG_4K:
+		pamt_entry_nr = tdmr->size >> PAGE_SHIFT;
+		break;
+	case TDX_PG_2M:
+		pamt_entry_nr = tdmr->size >> PMD_SHIFT;
+		break;
+	case TDX_PG_1G:
+		pamt_entry_nr = tdmr->size >> PUD_SHIFT;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	pamt_sz = pamt_entry_nr * tdx_sysinfo.pamt_entry_size;
+	/* TDX requires the PAMT size to be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/*
+ * Pick a NUMA node on which to allocate this TDMR's metadata.
+ *
+ * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
+ * not be.  If the TDMR covers more than one node, just use the _first_
+ * one.  This can lead to small areas of off-node metadata for some
+ * memory.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, nid;
+
+	/* Find the first memory region covered by the TDMR */
+	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, &nid) {
+		if (end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
+			return nid;
+	}
+
+	/*
+	 * No memory region found for this TDMR.  This cannot happen
+	 * since each TDMR is created to cover at least one (possibly
+	 * partial) memory region.
+	 */
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_base[TDX_PG_MAX];
+	unsigned long pamt_size[TDX_PG_MAX];
+	unsigned long tdmr_pamt_base;
+	unsigned long tdmr_pamt_size;
+	enum tdx_page_sz pgsz;
+	struct page *pamt;
+	int nid;
+
+	nid = tdmr_get_nid(tdmr);
+
+	/*
+	 * Calculate the PAMT size for each TDX supported page size
+	 * and the total PAMT size.
+	 */
+	tdmr_pamt_size = 0;
+	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
+		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
+		tdmr_pamt_size += pamt_size[pgsz];
+	}
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
+			nid, &node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/* Calculate PAMT base and size for all supported page sizes. */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
+		pamt_base[pgsz] = tdmr_pamt_base;
+		tdmr_pamt_base += pamt_size[pgsz];
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
+	tdmr->pamt_4k_size = pamt_size[TDX_PG_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
+	tdmr->pamt_2m_size = pamt_size[TDX_PG_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
+	tdmr->pamt_1g_size = pamt_size[TDX_PG_1G];
+
+	return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
+			  unsigned long *pamt_npages)
+{
+	unsigned long pamt_base, pamt_sz;
+
+	/*
+	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
+	 * should always point to the beginning of that allocation.
+	 */
+	pamt_base = tdmr->pamt_4k_base;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	*pamt_pfn = pamt_base >> PAGE_SHIFT;
+	*pamt_npages = pamt_sz >> PAGE_SHIFT;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_pfn, pamt_npages;
+
+	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_npages)
+		return;
+
+	if (WARN_ON_ONCE(!pamt_pfn))
+		return;
+
+	free_contig_range(pamt_pfn, pamt_npages);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_free_pamt(tdmr_array_entry(tdmr_array, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < tdmr_num; i++) {
+		ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i));
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	return ret;
+}
+
+static unsigned long tdmrs_get_pamt_pages(struct tdmr_info *tdmr_array,
+					  int tdmr_num)
+{
+	unsigned long pamt_npages = 0;
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		unsigned long pfn, npages;
+
+		tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages);
+		pamt_npages += npages;
+	}
+
+	return pamt_npages;
+}
+
 /*
  * Construct an array of TDMRs to cover all memory regions in memblock.
  * This makes sure all pages managed by the page allocator are TDX
@@ -572,8 +762,13 @@ static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
 	if (ret)
 		goto err;
 
+	ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err;
+
 	/* Return -EINVAL until constructing TDMRs is done */
 	ret = -EINVAL;
+	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
 err:
 	return ret;
 }
@@ -644,6 +839,11 @@ static int init_tdx_module(void)
 	 * process are done.
 	 */
 	ret = -EINVAL;
+	if (ret)
+		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	else
+		pr_info("%lu pages allocated for PAMT.\n",
+				tdmrs_get_pamt_pages(tdmr_array, tdmr_num));
 out_free_tdmrs:
 	/*
 	 * The array of TDMRs is freed whether the initialization
-- 
2.36.1



* [PATCH v5 16/22] x86/virt/tdx: Set up reserved areas for all TDMRs
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (14 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 17/22] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

As the last step of constructing TDMRs, set up reserved areas for all
TDMRs.  For each TDMR, put all memory holes within this TDMR into its
reserved areas.  For all PAMTs which overlap with this TDMR, put the
overlapping parts into the reserved areas as well.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++-
 1 file changed, 158 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 36260dd7e69f..86d98c47bd37 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -19,6 +19,7 @@
 #include <linux/memblock.h>
 #include <linux/gfp.h>
 #include <linux/align.h>
+#include <linux/sort.h>
 #include <asm/cpufeatures.h>
 #include <asm/cpufeature.h>
 #include <asm/msr-index.h>
@@ -748,6 +749,157 @@ static unsigned long tdmrs_get_pamt_pages(struct tdmr_info *tdmr_array,
 	return pamt_npages;
 }
 
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx,
+			      u64 addr, u64 size)
+{
+	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+	int idx = *p_idx;
+
+	/* Reserved area must be 4K aligned in offset and size */
+	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+		return -EINVAL;
+
+	/* Cannot exceed maximum reserved areas supported by TDX */
+	if (idx >= tdx_sysinfo.max_reserved_per_tdmr)
+		return -E2BIG;
+
+	rsvd_areas[idx].offset = addr - tdmr->base;
+	rsvd_areas[idx].size = size;
+
+	*p_idx = idx + 1;
+
+	return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+	if (r1->offset + r1->size <= r2->offset)
+		return -1;
+	if (r1->offset >= r2->offset + r2->size)
+		return 1;
+
+	/* Reserved areas cannot overlap.  The caller must guarantee that. */
+	WARN_ON_ONCE(1);
+	return -1;
+}
+
+/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
+static int tdmr_set_up_rsvd_areas(struct tdmr_info *tdmr,
+				  struct tdmr_info *tdmr_array,
+				  int tdmr_num)
+{
+	unsigned long start_pfn, end_pfn;
+	int rsvd_idx, i, ret = 0;
+	u64 prev_end;
+
+	/* Mark holes between memory regions as reserved */
+	rsvd_idx = 0;
+	prev_end = tdmr_start(tdmr);
+	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, NULL) {
+		u64 start, end;
+
+		start = start_pfn << PAGE_SHIFT;
+		end = end_pfn << PAGE_SHIFT;
+
+		/* Break if this region is after the TDMR */
+		if (start >= tdmr_end(tdmr))
+			break;
+
+		/* Exclude regions before this TDMR */
+		if (end < tdmr_start(tdmr))
+			continue;
+
+		/*
+		 * Skip if no hole exists before this region. "<=" is
+		 * used because one memory region might span two TDMRs
+		 * (when the previous TDMR covers part of this region).
+		 * In this case the start address of this region is
+		 * smaller than the start address of the second TDMR.
+		 *
+		 * Update the prev_end to the end of this region where
+		 * the possible memory hole starts.
+		 */
+		if (start <= prev_end) {
+			prev_end = end;
+			continue;
+		}
+
+		/* Add the hole before this region */
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+				start - prev_end);
+		if (ret)
+			return ret;
+
+		prev_end = end;
+	}
+
+	/* Add the hole after the last region if it exists. */
+	if (prev_end < tdmr_end(tdmr)) {
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+				tdmr_end(tdmr) - prev_end);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * If any PAMT overlaps with this TDMR, the overlapping part
+	 * must also be put into the reserved areas.  Walk over all
+	 * TDMRs to find those overlapping PAMTs and put them into
+	 * the reserved areas.
+	 */
+	for (i = 0; i < tdmr_num; i++) {
+		struct tdmr_info *tmp = tdmr_array_entry(tdmr_array, i);
+		u64 pamt_start, pamt_end;
+
+		pamt_start = tmp->pamt_4k_base;
+		pamt_end = pamt_start + tmp->pamt_4k_size +
+			tmp->pamt_2m_size + tmp->pamt_1g_size;
+
+		/* Skip PAMTs outside of the given TDMR */
+		if ((pamt_end <= tdmr_start(tdmr)) ||
+				(pamt_start >= tdmr_end(tdmr)))
+			continue;
+
+		/* Only mark the part within the TDMR as reserved */
+		if (pamt_start < tdmr_start(tdmr))
+			pamt_start = tdmr_start(tdmr);
+		if (pamt_end > tdmr_end(tdmr))
+			pamt_end = tdmr_end(tdmr);
+
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, pamt_start,
+				pamt_end - pamt_start);
+		if (ret)
+			return ret;
+	}
+
+	/* TDX requires reserved areas listed in address ascending order */
+	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+			rsvd_area_cmp_func, NULL);
+
+	return 0;
+}
+
+static int tdmrs_set_up_rsvd_areas_all(struct tdmr_info *tdmr_array,
+				      int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		int ret;
+
+		ret = tdmr_set_up_rsvd_areas(tdmr_array_entry(tdmr_array, i),
+				tdmr_array, tdmr_num);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Construct an array of TDMRs to cover all memory regions in memblock.
  * This makes sure all pages managed by the page allocator are TDX
@@ -766,8 +918,12 @@ static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
 	if (ret)
 		goto err;
 
-	/* Return -EINVAL until constructing TDMRs is done */
-	ret = -EINVAL;
+	ret = tdmrs_set_up_rsvd_areas_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err_free_pamts;
+
+	return 0;
+err_free_pamts:
 	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
 err:
 	return ret;
-- 
2.36.1



* [PATCH v5 17/22] x86/virt/tdx: Reserve TDX module global KeyID
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (15 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 16/22] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 18/22] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX module initialization requires using one TDX private KeyID as the
global KeyID to protect the TDX module metadata.  The global KeyID is
passed to the TDX module along with the TDMRs.

Just reserve the first TDX private KeyID as the global KeyID.  Keep the
global KeyID as a static variable as KVM will need to use it too.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 86d98c47bd37..df87a9f9ee24 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -55,6 +55,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
 static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
 static int tdx_cmr_num;
 
+/* TDX module global KeyID.  Used in TDH.SYS.CONFIG ABI. */
+static u32 tdx_global_keyid;
+
 /* Detect whether CPU supports SEAM */
 static int detect_seam(void)
 {
@@ -990,6 +993,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_tdmrs;
 
+	/*
+	 * Reserve the first TDX KeyID as global KeyID to protect
+	 * TDX module metadata.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
-- 
2.36.1



* [PATCH v5 18/22] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (16 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 17/22] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 19/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

After the TDX-usable memory regions are constructed in an array of TDMRs
and the global KeyID is reserved, pass them to the TDX module using the
TDH.SYS.CONFIG SEAMCALL.  TDH.SYS.CONFIG can only be called once, and
can be done on any logical cpu.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 38 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index df87a9f9ee24..06e26379b632 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -18,6 +18,7 @@
 #include <linux/sizes.h>
 #include <linux/memblock.h>
 #include <linux/gfp.h>
+#include <linux/slab.h>
 #include <linux/align.h>
 #include <linux/sort.h>
 #include <asm/cpufeatures.h>
@@ -932,6 +933,37 @@ static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
 	return ret;
 }
 
+static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
+			     u64 global_keyid)
+{
+	u64 *tdmr_pa_array;
+	int i, array_sz;
+	u64 ret;
+
+	/*
+	 * TDMR_INFO entries are configured to the TDX module via an
+	 * array of the physical address of each TDMR_INFO.  TDX module
+	 * requires the array itself to be 512-byte aligned.  Round up
+	 * the array size to 512-byte aligned so the buffer allocated
+	 * by kzalloc() will meet the alignment requirement.
+	 */
+	array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
+	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
+	if (!tdmr_pa_array)
+		return -ENOMEM;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_pa_array[i] = __pa(tdmr_array_entry(tdmr_array, i));
+
+	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num,
+				global_keyid, 0, NULL);
+
+	/* Free the array as it is not required any more. */
+	kfree(tdmr_pa_array);
+
+	return ret ? -EFAULT : 0;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -999,11 +1031,17 @@ static int init_tdx_module(void)
 	 */
 	tdx_global_keyid = tdx_keyid_start;
 
+	/* Pass the TDMRs and the global KeyID to the TDX module */
+	ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid);
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
 	 */
 	ret = -EINVAL;
+out_free_pamts:
 	if (ret)
 		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
 	else
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 55d6c69ab900..b9bc499b965b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -53,6 +53,7 @@
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
+#define TDH_SYS_CONFIG		45
 
 struct cmr_info {
 	u64	base;
@@ -120,6 +121,7 @@ struct tdmr_reserved_area {
 } __packed;
 
 #define TDMR_INFO_ALIGNMENT	512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
 
 struct tdmr_info {
 	u64 base;
-- 
2.36.1



* [PATCH v5 19/22] x86/virt/tdx: Configure global KeyID on all packages
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (17 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 18/22] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 20/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

After the array of TDMRs and the global KeyID are passed to the TDX
module, use TDH.SYS.KEY.CONFIG to configure the key of the global KeyID
on all packages.

TDH.SYS.KEY.CONFIG must be done on one (any) cpu for each package, and
it cannot run concurrently on different CPUs.  Implement a helper to
run the SEAMCALL on one cpu for each package, one package at a time, and
use it to configure the global KeyID on all packages.

Intel hardware doesn't guarantee cache coherency across different
KeyIDs.  The kernel needs to flush PAMT's dirty cachelines (associated
with KeyID 0) before the TDX module uses the global KeyID to access the
PAMT.  Following the TDX module specification, flush cache before
configuring the global KeyID on all packages.

Given the PAMT size can be large (~1/256th of system RAM), just use
WBINVD on all CPUs to flush.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 83 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 06e26379b632..b9777a353835 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -194,6 +194,46 @@ static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
 	on_each_cpu(seamcall_smp_call_function, sc, true);
 }
 
+/*
+ * Call one SEAMCALL on one (any) cpu for each physical package in
+ * a serialized way.  Return immediately if the SEAMCALL fails on
+ * any cpu.
+ *
+ * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
+ * to be atomic, but for simplicity just reuse it instead of adding
+ * a new one.
+ */
+static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
+{
+	cpumask_var_t packages;
+	int cpu, ret = 0;
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+					packages))
+			continue;
+
+		ret = smp_call_function_single(cpu, seamcall_smp_call_function,
+				sc, true);
+		if (ret)
+			break;
+
+		/*
+		 * Doesn't have to use atomic_read(), but it doesn't
+		 * hurt either.
+		 */
+		ret = atomic_read(&sc->err);
+		if (ret)
+			break;
+	}
+
+	free_cpumask_var(packages);
+	return ret;
+}
+
 /*
  * Do TDX module global initialization.  It also detects whether the
  * module has been loaded or not.
@@ -964,6 +1004,21 @@ static int config_tdx_module(struct tdmr_info *tdmr_array, int tdmr_num,
 	return ret ? -EFAULT : 0;
 }
 
+static int config_global_keyid(void)
+{
+	struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
+
+	/*
+	 * Configure the key of the global KeyID on all packages by
+	 * calling TDH.SYS.KEY.CONFIG on all packages.
+	 *
+	 * TDH.SYS.KEY.CONFIG may fail due to lack of entropy (a
+	 * recoverable error).  Assume this is exceedingly rare and
+	 * just return an error instead of retrying.
+	 */
+	return seamcall_on_each_package_serialized(&sc);
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -1036,15 +1091,39 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * Hardware doesn't guarantee cache coherency across different
+	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
+	 * (associated with KeyID 0) before the TDX module can use the
+	 * global KeyID to access the PAMT.  Given PAMTs are potentially
+	 * large (~1/256th of system RAM), just use WBINVD on all cpus
+	 * to flush the cache.
+	 *
+	 * Follow the TDX spec to flush cache before configuring the
+	 * global KeyID on all packages.
+	 */
+	wbinvd_on_all_cpus();
+
+	/* Config the key of global KeyID on all packages */
+	ret = config_global_keyid();
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
 	 */
 	ret = -EINVAL;
 out_free_pamts:
-	if (ret)
+	if (ret) {
+		/*
+		 * Part of the PAMT may already have been initialized
+		 * by the TDX module.  Flush the cache before returning
+		 * the PAMT back to the kernel.
+		 */
+		wbinvd_on_all_cpus();
 		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
-	else
+	} else
 		pr_info("%lu pages allocated for PAMT.\n",
 				tdmrs_get_pamt_pages(tdmr_array, tdmr_num));
 out_free_tdmrs:
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index b9bc499b965b..2d25a93b89ef 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -49,6 +49,7 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH v5 20/22] x86/virt/tdx: Initialize all TDMRs
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (18 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 19/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 21/22] x86/virt/tdx: Support kexec() Kai Huang
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
TDX initialization.

All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
the memory pages can be used by the TDX module.  The time to initialize
TDMR is proportional to the size of the TDMR because TDH.SYS.TDMR.INIT
internally initializes the PAMT entries using the global KeyID.

To avoid a long latency in a single SEAMCALL, TDH.SYS.TDMR.INIT only
initializes an (implementation-specific) subset of PAMT entries of one
TDMR in one invocation.  The caller needs to call TDH.SYS.TDMR.INIT
iteratively until all PAMT entries of the given TDMR are initialized.

TDH.SYS.TDMR.INITs can run concurrently on multiple CPUs as long as they
are initializing different TDMRs.  To keep it simple, just initialize
all TDMRs one by one.  On a 2-socket machine with 2.2GHz CPUs and 64GB
of memory, each TDH.SYS.TDMR.INIT takes roughly ~7us on average, and it
takes roughly ~100ms to initialize all TDMRs while the system is idle.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 70 ++++++++++++++++++++++++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b9777a353835..da1af1b60c35 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1019,6 +1019,65 @@ static int config_global_keyid(void)
 	return seamcall_on_each_package_serialized(&sc);
 }
 
+/* Initialize one TDMR */
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+	u64 next;
+
+	/*
+	 * Initializing PAMT entries might be time-consuming (in
+	 * proportion to the size of the requested TDMR).  To avoid long
+	 * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
+	 * an (implementation-defined) subset of PAMT entries in one
+	 * invocation.
+	 *
+	 * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
+	 * of the requested TDMR are initialized (if next-to-initialize
+	 * address matches the end address of the TDMR).
+	 */
+	do {
+		struct tdx_module_output out;
+		u64 ret;
+
+		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, &out);
+		if (ret)
+			return -EFAULT;
+		/*
+		 * RDX contains 'next-to-initialize' address if
+		 * TDH.SYS.TDMR.INIT succeeded.
+		 */
+		next = out.rdx;
+		/* Allow scheduling when needed; cond_resched() checks */
+		cond_resched();
+	} while (next < tdmr->base + tdmr->size);
+
+	return 0;
+}
+
+/* Initialize all TDMRs */
+static int init_tdmrs(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+	int i;
+
+	/*
+	 * Initialize TDMRs one-by-one for simplicity, though the TDX
+	 * architecture does allow different TDMRs to be initialized in
+	 * parallel on multiple CPUs.  Parallel initialization could
+	 * be added later when the time spent in the serialized scheme
+	 * becomes a real concern.
+	 */
+	for (i = 0; i < tdmr_num; i++) {
+		int ret;
+
+		ret = init_tdmr(tdmr_array_entry(tdmr_array, i));
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -1109,11 +1168,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
-	/*
-	 * Return -EINVAL until all steps of TDX module initialization
-	 * process are done.
-	 */
-	ret = -EINVAL;
+	/* Initialize TDMRs to complete the TDX module initialization */
+	ret = init_tdmrs(tdmr_array, tdmr_num);
+	if (ret)
+		goto out_free_pamts;
+
+	tdx_module_status = TDX_MODULE_INITIALIZED;
 out_free_pamts:
 	if (ret) {
 		/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 2d25a93b89ef..e0309558be13 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -53,6 +53,7 @@
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
+#define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_LP_SHUTDOWN	44
 #define TDH_SYS_CONFIG		45
 
-- 
2.36.1



* [PATCH v5 21/22] x86/virt/tdx: Support kexec()
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (19 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 20/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-06-22 11:17 ` [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
  2022-06-24 19:47 ` [PATCH v5 00/22] TDX host kernel support Dave Hansen
  22 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

To support kexec(), if the TDX module is ever initialized, the kernel
needs to flush all dirty cachelines associated with any TDX private
KeyID, otherwise they may silently corrupt the new kernel.

Following SME support, use wbinvd() to flush the cache in stop_this_cpu().
Theoretically, the cache flush is only needed when the TDX module has been
initialized.  However, initializing the TDX module is done on demand at
runtime, and reading the module status requires taking a mutex.  Just
check whether TDX is enabled by the BIOS instead, and flush the cache if
so.

The current TDX module architecture doesn't play nicely with kexec().
The TDX module can only be initialized once during its lifetime, and
there is no SEAMCALL to reset the module to give a new clean slate to
the new kernel.  Therefore, ideally, if the module is ever initialized,
it's better to shut down the module.  The new kernel won't be able to
use TDX anyway (as it needs to go through the TDX module initialization
process which will fail immediately at the first step).

However, there's no guarantee the CPU is in VMX operation during kexec().
This means it's impractical to shut down the module.  Just do nothing
but leave the module open.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kernel/process.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dbaf12c43fe1..ff5449c23522 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -769,8 +769,15 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * Similar to SME, if the TDX module is ever initialized, the
+	 * cachelines associated with any TDX private KeyID must be
+	 * flushed before transitioning to the new kernel.  The TDX
+	 * module is initialized on demand, and reading its status
+	 * requires taking a mutex.  Just check whether TDX is enabled
+	 * by the BIOS instead, and flush the cache if so.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*
-- 
2.36.1



* [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (20 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 21/22] x86/virt/tdx: Support kexec() Kai Huang
@ 2022-06-22 11:17 ` Kai Huang
  2022-08-18  4:07   ` Bagas Sanjaya
  2022-06-24 19:47 ` [PATCH v5 00/22] TDX host kernel support Dave Hansen
  22 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-22 11:17 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Add documentation for TDX host kernel support.  There is already a
file, Documentation/x86/tdx.rst, containing documentation for TDX
guest internals.  Reuse it for TDX host kernel support too.

Introduce a new menu level "TDX Guest Support", move the existing
material under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 190 +++++++++++++++++++++++++++++++++++---
 1 file changed, 179 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index b8fa4329e1a5..6c6b09ca6ba4 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,174 @@ encrypting the guest memory. In TDX, a special module running in a special
 mode sits between the host and the guest and manages the guest/host
 separation.
 
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionality to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+To enable TDX, the BIOS configures SEAMRR and TDX private KeyIDs
+consistently across all CPU packages.  TDX doesn't trust the BIOS.
+MCHECK verifies that all configurations from the BIOS are correct and
+enables SEAMRR.
+
+After TDX is enabled in BIOS, the TDX module needs to be loaded into the
+SEAMRR range and properly initialized, before it can be used to create
+and run protected VMs.
+
+The TDX architecture doesn't require the BIOS to load the TDX module,
+but the current kernel assumes it has been loaded by the BIOS (either
+directly or via some UEFI shell tool) before booting the kernel.  The
+current kernel detects TDX and initializes the TDX module.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX and the TDX private KeyIDs during boot.  Users
+can see the dmesg output below if TDX is enabled by the BIOS:
+
+|  [..] tdx: SEAMRR enabled.
+|  [..] tdx: TDX private KeyID range: [16, 64).
+|  [..] tdx: TDX enabled by BIOS.
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect whether the TDX module has been
+loaded.  The kernel detects the TDX module by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly ~1/256th of system RAM to
+use as 'metadata' for the TDX memory.  It also takes additional CPU
+time to initialize that metadata along with the TDX module itself.
+Neither is trivial.  The current kernel doesn't always initialize the
+TDX module during kernel boot, but instead provides a function
+tdx_init() to allow the caller to initialize TDX when it truly wants
+to use TDX:
+
+        ret = tdx_init();
+        if (ret)
+                goto no_tdx;
+        // TDX is ready to use
+
+Initializing the TDX module requires all logical CPUs to be online and
+in VMX operation (a requirement for making SEAMCALLs) during tdx_init().
+Currently, KVM is the only user of TDX.  KVM always guarantees all
+online CPUs are in VMX operation when there's any VM.  The current
+kernel doesn't handle entering VMX operation in tdx_init() but leaves
+this to the caller.
+
+Users can consult dmesg to see whether the TDX module is present, and
+whether it has been initialized.
+
+If the TDX module is not loaded, dmesg shows the following:
+
+|  [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like the following:
+
+|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+|  [..] tdx: 65667 pages allocated for PAMT.
+|  [..] tdx: TDX module initialized.
+
+If the TDX module fails to initialize, dmesg shows the following:
+
+|  [..] tdx: Failed to initialize TDX module.  Shut it down.
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't work with ACPI CPU hotplug.  To guarantee security, MCHECK
+verifies all logical CPUs across all packages during platform boot.  A
+hot-added CPU is not verified and thus cannot support TDX.  A non-buggy
+BIOS should never deliver an ACPI CPU hot-add event to the kernel.  Such
+an event is reported as a BIOS bug and the hot-added CPU is rejected.
+
+TDX requires all boot-time verified logical CPUs to remain present until
+machine reset.  If the kernel receives an ACPI CPU hot-removal event, it
+assumes it cannot continue to work normally, so it just BUG()s.
+
+Note TDX works with CPU logical online/offline, so the kernel still
+allows offlining a logical CPU and onlining it again.
+
+Memory Hotplug
+~~~~~~~~~~~~~~
+
+The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
+indicate which memory regions are TDX-capable.  Those regions are
+generated by the BIOS and verified by MCHECK so that they are truly
+present during platform boot and meet the security guarantees.
+
+This means TDX doesn't work with ACPI memory hot-add.  A non-buggy BIOS
+should never deliver an ACPI memory hot-add event to the kernel.  Such
+an event is reported as a BIOS bug and the hot-added memory is rejected.
+
+TDX also doesn't work with ACPI memory hot-removal.  If the kernel
+receives an ACPI memory hot-removal event, it assumes it cannot
+continue to work normally, so it just BUG()s.
+
+Also, the kernel needs to choose which TDX-capable regions to use as TDX
+memory and pass those regions to the TDX module when it gets initialized.
+Once they are passed to the TDX module, the TDX-usable memory regions
+are fixed for the module's lifetime.
+
+To avoid having to modify the page allocator to distinguish TDX and
+non-TDX memory allocation, the current kernel guarantees all pages
+managed by the page allocator are TDX memory.  This means any memory
+hot-added to the page allocator would break that guarantee, and thus
+must be prevented.
+
+There are basically two memory hot-add cases that need to be prevented:
+ACPI memory hot-add and driver-managed memory hot-add.  The kernel
+rejects driver-managed memory hot-add too when TDX is enabled by the
+BIOS.  For instance, dmesg shows the error below when using the kmem
+driver to add legacy PMEM as system RAM:
+
+|  [..] tdx: Unable to add memory [0x580000000, 0x600000000) on TDX enabled platform.
+|  [..] kmem dax0.0: mapping0: 0x580000000-0x5ffffffff memory add failed
+
+However, adding new memory to ZONE_DEVICE should not be prevented as
+those pages are not managed by the page allocator.  Therefore, the
+memremap_pages() variants are still allowed, although they internally
+also use memory hotplug functions.
+
+Kexec()
+~~~~~~~
+
+TDX (and MKTME) doesn't guarantee cache coherency among different KeyIDs.
+If the TDX module is ever initialized, the kernel needs to flush dirty
+cachelines associated with any TDX private KeyID, otherwise they may
+silently corrupt the new kernel.
+
+Similar to SME support, the kernel uses wbinvd() to flush the cache in
+stop_this_cpu().
+
+The current TDX module architecture doesn't play nicely with kexec().
+The TDX module can only be initialized once during its lifetime, and
+there is no SEAMCALL to reset the module to give a new clean slate to
+the new kernel.  Therefore, ideally, if the module is ever initialized,
+it's better to shut down the module.  The new kernel won't be able to
+use TDX anyway (as it needs to go through the TDX module initialization
+process which will fail immediately at the first step).
+
+However, there's no guarantee the CPU is in VMX operation during
+kexec(), so it's impractical to shut down the module.  The current
+kernel just leaves the module open.
+
+TDX Guest Support
+=================
+
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
 implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +188,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +198,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +209,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +220,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +241,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +261,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
 value with a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -107,7 +275,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +295,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 An access to private mappings can also cause a #VE.  Since all kernel
 memory is also private memory, the kernel might theoretically need to
@@ -145,7 +313,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +335,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
 which is not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
 mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +357,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However, some kernel users like device
-- 
2.36.1



* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
@ 2022-06-22 11:42   ` Rafael J. Wysocki
  2022-06-23  0:01     ` Kai Huang
  2022-06-24 18:57   ` Dave Hansen
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 114+ messages in thread
From: Rafael J. Wysocki @ 2022-06-22 11:42 UTC (permalink / raw)
  To: Kai Huang
  Cc: Linux Kernel Mailing List, kvm-devel, ACPI Devel Maling List,
	Sean Christopherson, Paolo Bonzini, Dave Hansen, Len Brown,
	Tony Luck, Rafael Wysocki, Reinette Chatre, Dan Williams,
	Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, isaku.yamahata, Tom Lendacky,
	Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld, Juri Lelli,
	Mark Rutland, Frederic Weisbecker, Yue Haibing, dongli.zhang

On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
>
> Platforms with confidential computing technology may not support ACPI
> CPU hotplug when such technology is enabled by the BIOS.  Examples
> include Intel platforms which support Intel Trust Domain Extensions
> (TDX).
>
> If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> bug and reject the new CPU.  For hot-removal, for simplicity just assume
> the kernel cannot continue to work normally, and BUG().
>
> Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
> platform doesn't support ACPI CPU hotplug, so that kernel can handle
> ACPI CPU hotplug events for such platform.  The existing attribute
> CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug thus doesn't fit.
>
> In acpi_processor_{add|remove}(), add early check against this attribute
> and handle accordingly if it is set.
>
> Also take this chance to rename existing CC_ATTR_HOTPLUG_DISABLED to
> CC_ATTR_CPU_HOTPLUG_DISABLED as it is for software CPU hotplug.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/coco/core.c          |  2 +-
>  drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
>  include/linux/cc_platform.h   | 15 +++++++++++++--
>  kernel/cpu.c                  |  2 +-
>  4 files changed, 38 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index 4320fadae716..1bde1af75296 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
>  {
>         switch (attr) {
>         case CC_ATTR_GUEST_UNROLL_STRING_IO:
> -       case CC_ATTR_HOTPLUG_DISABLED:
> +       case CC_ATTR_CPU_HOTPLUG_DISABLED:
>         case CC_ATTR_GUEST_MEM_ENCRYPT:
>         case CC_ATTR_MEM_ENCRYPT:
>                 return true;
> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> index 6737b1cbf6d6..b960db864cd4 100644
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -15,6 +15,7 @@
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> +#include <linux/cc_platform.h>
>
>  #include <acpi/processor.h>
>
> @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
>         struct device *dev;
>         int result = 0;
>
> +       /*
> +        * If the confidential computing platform doesn't support ACPI
> +        * CPU hotplug, the BIOS should never deliver such event to
> +        * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> +        * the new CPU.
> +        */
> +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {

This will affect initialization, not just hotplug AFAICS.

You should reset the .hotplug.enabled flag in processor_handler to
false instead.

> +               dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI CPU hotplug.  New CPU ignored.\n");
> +               return -EINVAL;
> +       }
> +
>         pr = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
>         if (!pr)
>                 return -ENOMEM;
> @@ -434,6 +446,17 @@ static void acpi_processor_remove(struct acpi_device *device)
>         if (!device || !acpi_driver_data(device))
>                 return;
>
> +       /*
> +        * The confidential computing platform is broken if ACPI CPU
> +        * hot-removal isn't supported but it happened anyway.  Assume
> +        * it's not guaranteed that the kernel can continue to work
> +        * normally.  Just BUG().
> +        */
> +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
> +               dev_err(&device->dev, "Platform doesn't support ACPI CPU hotplug. BUG().\n");
> +               BUG();
> +       }
> +
>         pr = acpi_driver_data(device);
>         if (pr->id >= nr_cpu_ids)
>                 goto out;
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index 691494bbaf5a..9ce9256facc8 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -74,14 +74,25 @@ enum cc_attr {
>         CC_ATTR_GUEST_UNROLL_STRING_IO,
>
>         /**
> -        * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
> +        * @CC_ATTR_CPU_HOTPLUG_DISABLED: CPU hotplug is not supported or
> +        *                                disabled.
>          *
>          * The platform/OS is running as a guest/virtual machine does not
>          * support CPU hotplug feature.
>          *
>          * Examples include TDX Guest.
>          */
> -       CC_ATTR_HOTPLUG_DISABLED,
> +       CC_ATTR_CPU_HOTPLUG_DISABLED,
> +
> +       /**
> +        * @CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED: ACPI CPU hotplug is not
> +        *                                     supported.
> +        *
> +        * The platform/OS does not support ACPI CPU hotplug.
> +        *
> +        * Examples include TDX platform.
> +        */
> +       CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED,
>  };
>
>  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index edb8c199f6a3..966772cce063 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1191,7 +1191,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>          * If the platform does not support hotplug, report it explicitly to
>          * differentiate it from a transient offlining failure.
>          */
> -       if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
> +       if (cc_platform_has(CC_ATTR_CPU_HOTPLUG_DISABLED))
>                 return -EOPNOTSUPP;
>         if (cpu_hotplug_disabled)
>                 return -EBUSY;
> --
> 2.36.1
>


* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-22 11:15 ` [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug Kai Huang
@ 2022-06-22 11:45   ` Rafael J. Wysocki
  2022-06-23  0:08     ` Kai Huang
  2022-06-28 12:01     ` Igor Mammedov
  0 siblings, 2 replies; 114+ messages in thread
From: Rafael J. Wysocki @ 2022-06-22 11:45 UTC (permalink / raw)
  To: Kai Huang
  Cc: Linux Kernel Mailing List, kvm-devel, ACPI Devel Maling List,
	Sean Christopherson, Paolo Bonzini, Dave Hansen, Len Brown,
	Tony Luck, Rafael Wysocki, Reinette Chatre, Dan Williams,
	Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, isaku.yamahata, Tom Lendacky

On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
>
> Platforms with confidential computing technology may not support ACPI
> memory hotplug when such technology is enabled by the BIOS.  Examples
> include Intel platforms which support Intel Trust Domain Extensions
> (TDX).
>
> If the kernel ever receives ACPI memory hotplug event, it is likely a
> BIOS bug.  For ACPI memory hot-add, the kernel should speak out this is
> a BIOS bug and reject the new memory.  For hot-removal, for simplicity
> just assume the kernel cannot continue to work normally, and just BUG().
>
> Add a new attribute CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED to indicate the
> platform doesn't support ACPI memory hotplug, so that kernel can handle
> ACPI memory hotplug events for such platform.
>
> In acpi_memory_device_{add|remove}(), add early check against this
> attribute and handle accordingly if it is set.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
>  include/linux/cc_platform.h    | 10 ++++++++++
>  2 files changed, 33 insertions(+)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 24f662d8bd39..94d6354ea453 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -15,6 +15,7 @@
>  #include <linux/acpi.h>
>  #include <linux/memory.h>
>  #include <linux/memory_hotplug.h>
> +#include <linux/cc_platform.h>
>
>  #include "internal.h"
>
> @@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
>         if (!device)
>                 return -EINVAL;
>
> +       /*
> +        * If the confidential computing platform doesn't support ACPI
> +        * memory hotplug, the BIOS should never deliver such event to
> +        * the kernel.  Report ACPI memory hot-add as a BIOS bug and ignore
> +        * the memory device.
> +        */
> +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {

Same comment as for the acpi_processor driver: this will affect the
initialization too and it would be cleaner to reset the
.hotplug.enabled flag of the scan handler.

> +               dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI memory hotplug. New memory device ignored.\n");
> +               return -EINVAL;
> +       }
> +
>         mem_device = kzalloc(sizeof(struct acpi_memory_device), GFP_KERNEL);
>         if (!mem_device)
>                 return -ENOMEM;
> @@ -334,6 +346,17 @@ static void acpi_memory_device_remove(struct acpi_device *device)
>         if (!device || !acpi_driver_data(device))
>                 return;
>
> +       /*
> +        * The confidential computing platform is broken if ACPI memory
> +        * hot-removal isn't supported but it happened anyway.  Assume
> +        * it is not guaranteed that the kernel can continue to work
> +        * normally.  Just BUG().
> +        */
> +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
> +               dev_err(&device->dev, "Platform doesn't support ACPI memory hotplug. BUG().\n");
> +               BUG();
> +       }
> +
>         mem_device = acpi_driver_data(device);
>         acpi_memory_remove_memory(mem_device);
>         acpi_memory_device_free(mem_device);
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index 9ce9256facc8..b831c24bd7f6 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -93,6 +93,16 @@ enum cc_attr {
>          * Examples include TDX platform.
>          */
>         CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED,
> +
> +       /**
> +        * @CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED: ACPI memory hotplug is
> +        *                                        not supported.
> +        *
> +        * The platform/OS does not support ACPI memory hotplug.
> +        *
> +        * Examples include TDX platform.
> +        */
> +       CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED,
>  };
>
>  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-22 11:42   ` Rafael J. Wysocki
@ 2022-06-23  0:01     ` Kai Huang
  2022-06-27  8:01       ` Igor Mammedov
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-23  0:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, kvm-devel, ACPI Devel Maling List,
	Sean Christopherson, Paolo Bonzini, Dave Hansen, Len Brown,
	Tony Luck, Rafael Wysocki, Reinette Chatre, Dan Williams,
	Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, isaku.yamahata, Tom Lendacky,
	Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld, Juri Lelli,
	Mark Rutland, Frederic Weisbecker, Yue Haibing, dongli.zhang

On Wed, 2022-06-22 at 13:42 +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > Platforms with confidential computing technology may not support ACPI
> > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > include Intel platforms which support Intel Trust Domain Extensions
> > (TDX).
> > 
> > If the kernel ever receives an ACPI CPU hotplug event, it is likely a BIOS
> > bug.  For ACPI CPU hot-add, the kernel should call this out as a BIOS
> > bug and reject the new CPU.  For hot-removal, for simplicity, just assume
> > the kernel cannot continue to work normally, and BUG().
> > 
> > Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
> > platform doesn't support ACPI CPU hotplug, so that the kernel can handle
> > ACPI CPU hotplug events on such platforms.  The existing attribute
> > CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug and thus doesn't fit.
> > 
> > In acpi_processor_{add|remove}(), add early check against this attribute
> > and handle accordingly if it is set.
> > 
> > Also take this chance to rename existing CC_ATTR_HOTPLUG_DISABLED to
> > CC_ATTR_CPU_HOTPLUG_DISABLED as it is for software CPU hotplug.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/coco/core.c          |  2 +-
> >  drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
> >  include/linux/cc_platform.h   | 15 +++++++++++++--
> >  kernel/cpu.c                  |  2 +-
> >  4 files changed, 38 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> > index 4320fadae716..1bde1af75296 100644
> > --- a/arch/x86/coco/core.c
> > +++ b/arch/x86/coco/core.c
> > @@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> >  {
> >         switch (attr) {
> >         case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > -       case CC_ATTR_HOTPLUG_DISABLED:
> > +       case CC_ATTR_CPU_HOTPLUG_DISABLED:
> >         case CC_ATTR_GUEST_MEM_ENCRYPT:
> >         case CC_ATTR_MEM_ENCRYPT:
> >                 return true;
> > diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> > index 6737b1cbf6d6..b960db864cd4 100644
> > --- a/drivers/acpi/acpi_processor.c
> > +++ b/drivers/acpi/acpi_processor.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/kernel.h>
> >  #include <linux/module.h>
> >  #include <linux/pci.h>
> > +#include <linux/cc_platform.h>
> > 
> >  #include <acpi/processor.h>
> > 
> > @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
> >         struct device *dev;
> >         int result = 0;
> > 
> > +       /*
> > +        * If the confidential computing platform doesn't support ACPI
> > +        * CPU hotplug, the BIOS should never deliver such an event to
> > +        * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> > +        * the new CPU.
> > +        */
> > +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
> 
> This will affect initialization, not just hotplug AFAICS.
> 
> You should reset the .hotplug.enabled flag in processor_handler to
> false instead.

Hi Rafael,

Thanks for the review.  By "affect initialization" did you mean that this
acpi_processor_add() is also called during kernel boot when a logical CPU is
brought up?  Or do you mean ACPI CPU hotplug can also happen during kernel boot
(after acpi_processor_init())?

I see acpi_processor_init() calls acpi_processor_check_duplicates(), which calls
acpi_evaluate_object(), but I don't know the details of ACPI, so I don't know
whether this would trigger acpi_processor_add().

One thing to note: TDX not supporting ACPI CPU hotplug is architectural, so it
is illegal even if it happens during kernel boot.  Dave's idea is that the
kernel should speak out loudly if physical CPU hotplug indeed happened on
(BIOS) TDX-enabled platforms.  Otherwise, perhaps we can just give up
initializing the ACPI CPU hotplug support in acpi_processor_init(), something
like below?

--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -707,6 +707,10 @@ bool acpi_duplicate_processor_id(int proc_id)
 void __init acpi_processor_init(void)
 {
        acpi_processor_check_duplicates();
+
+       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED))
+               return;
+
        acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
        acpi_scan_add_handler(&processor_container_handler);
 }


> 
> > +               dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI CPU hotplug.  New CPU ignored.\n");
> > +               return -EINVAL;
> > +       }
> > +

-- 
Thanks,
-Kai




* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-22 11:45   ` Rafael J. Wysocki
@ 2022-06-23  0:08     ` Kai Huang
  2022-06-28 17:55       ` Rafael J. Wysocki
  2022-06-28 12:01     ` Igor Mammedov
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-23  0:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, kvm-devel, ACPI Devel Maling List,
	Sean Christopherson, Paolo Bonzini, Dave Hansen, Len Brown,
	Tony Luck, Rafael Wysocki, Reinette Chatre, Dan Williams,
	Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, isaku.yamahata, Tom Lendacky

On Wed, 2022-06-22 at 13:45 +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > Platforms with confidential computing technology may not support ACPI
> > memory hotplug when such technology is enabled by the BIOS.  Examples
> > include Intel platforms which support Intel Trust Domain Extensions
> > (TDX).
> > 
> > If the kernel ever receives an ACPI memory hotplug event, it is likely a
> > BIOS bug.  For ACPI memory hot-add, the kernel should call this out as a
> > BIOS bug and reject the new memory.  For hot-removal, for simplicity,
> > just assume the kernel cannot continue to work normally, and BUG().
> > 
> > Add a new attribute CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED to indicate the
> > platform doesn't support ACPI memory hotplug, so that the kernel can
> > handle ACPI memory hotplug events on such platforms.
> > 
> > In acpi_memory_device_{add|remove}(), add early check against this
> > attribute and handle accordingly if it is set.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
> >  include/linux/cc_platform.h    | 10 ++++++++++
> >  2 files changed, 33 insertions(+)
> > 
> > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > index 24f662d8bd39..94d6354ea453 100644
> > --- a/drivers/acpi/acpi_memhotplug.c
> > +++ b/drivers/acpi/acpi_memhotplug.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/acpi.h>
> >  #include <linux/memory.h>
> >  #include <linux/memory_hotplug.h>
> > +#include <linux/cc_platform.h>
> > 
> >  #include "internal.h"
> > 
> > @@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
> >         if (!device)
> >                 return -EINVAL;
> > 
> > +       /*
> > +        * If the confidential computing platform doesn't support ACPI
> > +        * memory hotplug, the BIOS should never deliver such an event to
> > +        * the kernel.  Report ACPI memory hot-add as a BIOS bug and ignore
> > +        * the memory device.
> > +        */
> > +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
> 
> Same comment as for the acpi_processor driver: this will affect the
> initialization too and it would be cleaner to reset the
> .hotplug.enabled flag of the scan handler.
> 
> 

Hi Rafael,

Thanks for the review.  As with the ACPI CPU hotplug handling, this is also
illegal during kernel boot.  If we just want to disable it, then perhaps
something like below?

--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -366,6 +366,9 @@ static bool __initdata acpi_no_memhotplug;
 
 void __init acpi_memory_hotplug_init(void)
 {
+       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED))
+               acpi_no_memhotplug = true;
+
        if (acpi_no_memhotplug) {
                memory_device_handler.attach = NULL;
                acpi_scan_add_handler(&memory_device_handler);


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot
  2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
@ 2022-06-23  5:57   ` Chao Gao
  2022-06-23  9:23     ` Kai Huang
  2022-08-02  2:01   ` [PATCH v5 1/22] " Wu, Binbin
  1 sibling, 1 reply; 114+ messages in thread
From: Chao Gao @ 2022-06-23  5:57 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata

On Wed, Jun 22, 2022 at 11:15:30PM +1200, Kai Huang wrote:
>Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>host and certain physical attacks.  TDX introduces a new CPU mode called
>Secure Arbitration Mode (SEAM) and a new isolated range pointed by the
						    ^ perhaps, range of memory

>SEAM Range Register (SEAMRR).  A CPU-attested software module called
>'the TDX module' runs inside the new isolated range to implement the
>functionalities to manage and run protected VMs.
>
>Pre-TDX Intel hardware has support for a memory encryption architecture
>called MKTME.  The memory encryption hardware underpinning MKTME is also
>used for Intel TDX.  TDX ends up "stealing" some of the physical address
>space from the MKTME architecture for crypto-protection to VMs.  BIOS is
>responsible for partitioning the "KeyID" space between legacy MKTME and
>TDX.  The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or
>'TDX KeyIDs' for short.
>
>To enable TDX, BIOS needs to configure SEAMRR (core-scope) and TDX
>private KeyIDs (package-scope) consistently for all packages.  TDX
>doesn't trust BIOS.  TDX ensures all BIOS configurations are correct,
>and if not, refuses to enable SEAMRR on any core.  This means detecting
>SEAMRR alone on BSP is enough to check whether TDX has been enabled by
>BIOS.
>
>To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
>TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
>to opt in to TDX host kernel support (to distinguish it from TDX guest
>kernel support).  So far KVM is the only user of TDX.  Make the new
>config option depend on KVM_INTEL.
>
>Use early_initcall() to detect whether TDX is enabled by BIOS during
>kernel boot, and add a function to report that.  Use a function instead
>of a new CPU feature bit.  This is because the TDX module needs to be
>initialized before it can be used to run any TDX guests, and the TDX
>module is initialized at runtime by the caller who wants to use TDX.
>
>Explicitly detect SEAMRR rather than only detecting TDX private KeyIDs.
>Theoretically, a misconfiguration of TDX private KeyIDs can result in
>SEAMRR being disabled, but the BSP can still report the correct TDX
>KeyIDs.  Such a BIOS bug can be caught when initializing the TDX module,
>but it's better to do more detection during boot to provide a more
>accurate result.
>
>Also detect the TDX KeyIDs.  This allows userspace to know how many TDX
>guests the platform can run w/o needing to wait until TDX is fully
>functional.
>
>Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Chao Gao <chao.gao@intel.com>

But some cosmetic comments below ...

>---
>+
>+static u32 tdx_keyid_start __ro_after_init;
>+static u32 tdx_keyid_num __ro_after_init;
>+
...

>+static int detect_tdx_keyids(void)
>+{
>+	u64 keyid_part;
>+
>+	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);

how about:
	rdmsr(MSR_IA32_MKTME_KEYID_PARTITIONING, tdx_keyid_start, tdx_keyid_num);
	tdx_keyid_start++;

Then TDX_KEYID_NUM/START can be dropped.

>+
>+	tdx_keyid_num = TDX_KEYID_NUM(keyid_part);
>+	tdx_keyid_start = TDX_KEYID_START(keyid_part);
>+
>+	pr_info("TDX private KeyID range: [%u, %u).\n",
>+			tdx_keyid_start, tdx_keyid_start + tdx_keyid_num);
>+
>+	/*
>+	 * TDX guarantees at least two TDX KeyIDs are configured by
>+	 * BIOS, otherwise SEAMRR is disabled.  Invalid TDX private
>+	 * range means kernel bug (TDX is broken).

Maybe it is better to have a comment explaining why TDX (or the kernel)
guarantees there should be at least 2 TDX KeyIDs.

>+
>+/*
>+ * This file contains both macros and data structures defined by the TDX
>+ * architecture and Linux defined software data structures and functions.
>+ * The two should not be mixed together for better readability.  The
>+ * architectural definitions come first.
>+ */
>+
>+/*
>+ * Intel Trusted Domain CPU Architecture Extension spec:
>+ *
>+ * IA32_MTRRCAP:
>+ *   Bit 15:	The support of SEAMRR
>+ *
>+ * IA32_SEAMRR_PHYS_MASK (core-scope):
>+ *   Bit 10:	Lock bit
>+ *   Bit 11:	Enable bit
>+ */
>+#define MTRR_CAP_SEAMRR			BIT_ULL(15)

Can you move this bit definition to arch/x86/include/asm/msr-index.h
right after MSR_MTRRcap definition there?


* Re: [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot
  2022-06-23  5:57   ` Chao Gao
@ 2022-06-23  9:23     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-23  9:23 UTC (permalink / raw)
  To: Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata

On Thu, 2022-06-23 at 13:57 +0800, Chao Gao wrote:
> On Wed, Jun 22, 2022 at 11:15:30PM +1200, Kai Huang wrote:
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks.  TDX introduces a new CPU mode called
> > Secure Arbitration Mode (SEAM) and a new isolated range pointed by the
> 						    ^ perhaps, range of memory

OK.  The spec indeed says "execute out of memory defined by the SEAM range
register (SEAMRR)".

> 
> > +static int detect_tdx_keyids(void)
> > +{
> > +	u64 keyid_part;
> > +
> > +	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
> 
> how about:
> 	rdmsr(MSR_IA32_MKTME_KEYID_PARTITIONING, tdx_keyid_start, tdx_keyid_num);
> 	tdx_keyid_start++;
> 
> Then TDX_KEYID_NUM/START can be dropped.

OK will do.

> 
> > +
> > +	tdx_keyid_num = TDX_KEYID_NUM(keyid_part);
> > +	tdx_keyid_start = TDX_KEYID_START(keyid_part);
> > +
> > +	pr_info("TDX private KeyID range: [%u, %u).\n",
> > +			tdx_keyid_start, tdx_keyid_start + tdx_keyid_num);
> > +
> > +	/*
> > +	 * TDX guarantees at least two TDX KeyIDs are configured by
> > +	 * BIOS, otherwise SEAMRR is disabled.  Invalid TDX private
> > +	 * range means kernel bug (TDX is broken).
> 
> Maybe it is better to have a comment for why TDX/kernel guarantees
> there should be at least 2 TDX keyIDs.

"TDX guarantees" means it is architectural behaviour.  Perhaps I can change it
to "the TDX architecture guarantees" to be more explicit.

This part is currently not in the public spec, but I am working with others to
add this to the public spec.

> 
> > +
> > +/*
> > + * This file contains both macros and data structures defined by the TDX
> > + * architecture and Linux defined software data structures and functions.
> > + * The two should not be mixed together for better readability.  The
> > + * architectural definitions come first.
> > + */
> > +
> > +/*
> > + * Intel Trusted Domain CPU Architecture Extension spec:
> > + *
> > + * IA32_MTRRCAP:
> > + *   Bit 15:	The support of SEAMRR
> > + *
> > + * IA32_SEAMRR_PHYS_MASK (core-scope):
> > + *   Bit 10:	Lock bit
> > + *   Bit 11:	Enable bit
> > + */
> > +#define MTRR_CAP_SEAMRR			BIT_ULL(15)
> 
> Can you move this bit definition to arch/x86/include/asm/msr-index.h
> right after MSR_MTRRcap definition there?

The comment at the beginning of this file says:

/*
 * CPU model specific register (MSR) numbers.
 *
 * Do not add new entries to this file unless the definitions are shared
 * between multiple compilation units.
 */

I am not sure whether adding a new bit to an already-defined MSR counts as
adding a "new entry".  Perhaps it does not.  But I'd like to leave it to the
maintainers.

-- 
Thanks,
-Kai




* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-22 11:16 ` [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and " Kai Huang
@ 2022-06-24  1:41   ` Chao Gao
  2022-06-24 11:21     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Chao Gao @ 2022-06-24  1:41 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata, thomas.lendacky, Tianyu.Lan

On Wed, Jun 22, 2022 at 11:16:07PM +1200, Kai Huang wrote:
>-static bool intel_cc_platform_has(enum cc_attr attr)
>+#ifdef CONFIG_INTEL_TDX_GUEST
>+static bool intel_tdx_guest_has(enum cc_attr attr)
> {
> 	switch (attr) {
> 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
>@@ -28,6 +31,33 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> 		return false;
> 	}
> }
>+#endif
>+
>+#ifdef CONFIG_INTEL_TDX_HOST
>+static bool intel_tdx_host_has(enum cc_attr attr)
>+{
>+	switch (attr) {
>+	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
>+	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
>+		return true;
>+	default:
>+		return false;
>+	}
>+}
>+#endif
>+
>+static bool intel_cc_platform_has(enum cc_attr attr)
>+{
>+#ifdef CONFIG_INTEL_TDX_GUEST
>+	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
>+		return intel_tdx_guest_has(attr);
>+#endif
>+#ifdef CONFIG_INTEL_TDX_HOST
>+	if (platform_tdx_enabled())
>+		return intel_tdx_host_has(attr);
>+#endif
>+	return false;
>+}

how about:

static bool intel_cc_platform_has(enum cc_attr attr)
{
	switch (attr) {
	/* attributes applied to TDX guest only */
	case CC_ATTR_GUEST_UNROLL_STRING_IO:
	...
		return boot_cpu_has(X86_FEATURE_TDX_GUEST);

	/* attributes applied to TDX host only */
	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
		return platform_tdx_enabled();

	default:
		return false;
	}
}

so that we can get rid of #ifdef/endif.


* Re: [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory
  2022-06-22 11:16 ` [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory Kai Huang
@ 2022-06-24  2:12   ` Chao Gao
  2022-06-24 11:23     ` Kai Huang
  2022-06-24 19:01   ` Dave Hansen
  1 sibling, 1 reply; 114+ messages in thread
From: Chao Gao @ 2022-06-24  2:12 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata, akpm

On Wed, Jun 22, 2022 at 11:16:19PM +1200, Kai Huang wrote:
>@@ -55,6 +55,7 @@
> #include <asm/uv/uv.h>
> #include <asm/setup.h>
> #include <asm/ftrace.h>
>+#include <asm/tdx.h>
> 
> #include "mm_internal.h"
> 
>@@ -972,6 +973,26 @@ int arch_add_memory(int nid, u64 start, u64 size,
> 	return add_pages(nid, start_pfn, nr_pages, params);
> }
> 
>+int arch_memory_add_precheck(int nid, u64 start, u64 size, mhp_t mhp_flags)
>+{
>+	if (!platform_tdx_enabled())
>+		return 0;

Add a new cc attribute (if the existing ones don't fit) for the TDX host
platform and check that attribute here, so that the code can be reused by
other cc platforms with the same requirement.


* Re: [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand
  2022-06-22 11:16 ` [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
@ 2022-06-24  2:39   ` Chao Gao
  2022-06-24 11:27     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Chao Gao @ 2022-06-24  2:39 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata

On Wed, Jun 22, 2022 at 11:16:29PM +1200, Kai Huang wrote:
>Before the TDX module can be used to create and run TD guests, it must
>be loaded into the isolated region pointed by the SEAMRR and properly
>initialized.  The TDX module is expected to be loaded by BIOS before
>booting to the kernel, and the kernel is expected to detect and
>initialize it.
>
>The TDX module can be initialized only once in its lifetime.  Instead
>of always initializing it at boot time, this implementation chooses an
>on-demand approach, deferring TDX initialization until there is a real
>need (e.g. when requested by KVM).  This avoids consuming the memory
>that must be allocated by the kernel and given to the TDX module as
>metadata (~1/256th of the TDX-usable memory), and also saves the time of
>initializing the TDX module (and the metadata) when TDX is not used at
>all.  Initializing the TDX module at runtime on demand is also more
>flexible for supporting TDX module runtime updates in the future (after
>updating the TDX module, it needs to be initialized again).
>
>Add a placeholder tdx_init() to detect and initialize the TDX module on
>demand, with a state machine protected by mutex to support concurrent
>calls from multiple callers.
>
>The TDX module will be initialized in multi-steps defined by the TDX
>architecture:
>
>  1) Global initialization;
>  2) Logical-CPU scope initialization;
>  3) Enumerate the TDX module capabilities and platform configuration;
>  4) Configure the TDX module about usable memory ranges and global
>     KeyID information;
>  5) Package-scope configuration for the global KeyID;
>  6) Initialize usable memory ranges based on 4).
>
>The TDX module can also be shut down at any time during its lifetime.
>In case of any error during the initialization process, shut down the
>module.  It's pointless to leave the module in any intermediate state
>during the initialization.
>
>Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Chao Gao <chao.gao@intel.com>

One nit below:

>+static int __tdx_init(void)
>+{
>+	int ret;
>+
>+	/*
>+	 * Initializing the TDX module requires running some code on
>+	 * all MADT-enabled CPUs.  If not all MADT-enabled CPUs are
>+	 * online, it's not possible to initialize the TDX module.
>+	 *
>+	 * For simplicity temporarily disable CPU hotplug to prevent
>+	 * any CPU from going offline during the initialization.
>+	 */
>+	cpus_read_lock();
>+
>+	/*
>+	 * Check whether all MADT-enabled CPUs are online and return
>+	 * early with an explicit message so the user can be aware.
>+	 *
>+	 * Note ACPI CPU hotplug is prevented when TDX is enabled, so
>+	 * num_processors always reflects all present MADT-enabled
>+	 * CPUs during boot when disabled_cpus is 0.
>+	 */
>+	if (disabled_cpus || num_online_cpus() != num_processors) {
>+		pr_err("Unable to initialize the TDX module when there's offline CPU(s).\n");
>+		ret = -EINVAL;
>+		goto out;
>+	}
>+
>+	ret = init_tdx_module();
>+	if (ret == -ENODEV) {
>+		pr_info("TDX module is not loaded.\n");

tdx_module_status should be set to TDX_MODULE_NONE here.

>+		goto out;
>+	}


* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-24  1:41   ` Chao Gao
@ 2022-06-24 11:21     ` Kai Huang
  2022-06-29  8:35       ` Yuan Yao
  2022-06-29 14:22       ` Dave Hansen
  0 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-24 11:21 UTC (permalink / raw)
  To: Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata, thomas.lendacky, Tianyu.Lan

On Fri, 2022-06-24 at 09:41 +0800, Chao Gao wrote:
> On Wed, Jun 22, 2022 at 11:16:07PM +1200, Kai Huang wrote:
> > -static bool intel_cc_platform_has(enum cc_attr attr)
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +static bool intel_tdx_guest_has(enum cc_attr attr)
> > {
> > 	switch (attr) {
> > 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > @@ -28,6 +31,33 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > 		return false;
> > 	}
> > }
> > +#endif
> > +
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +static bool intel_tdx_host_has(enum cc_attr attr)
> > +{
> > +	switch (attr) {
> > +	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> > +	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> > +		return true;
> > +	default:
> > +		return false;
> > +	}
> > +}
> > +#endif
> > +
> > +static bool intel_cc_platform_has(enum cc_attr attr)
> > +{
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> > +		return intel_tdx_guest_has(attr);
> > +#endif
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +	if (platform_tdx_enabled())
> > +		return intel_tdx_host_has(attr);
> > +#endif
> > +	return false;
> > +}
> 
> how about:
> 
> static bool intel_cc_platform_has(enum cc_attr attr)
> {
> 	switch (attr) {
> 	/* attributes applied to TDX guest only */
> 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> 	...
> 		return boot_cpu_has(X86_FEATURE_TDX_GUEST);
> 
> 	/* attributes applied to TDX host only */
> 	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> 	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> 		return platform_tdx_enabled();
> 
> 	default:
> 		return false;
> 	}
> }
> 
> so that we can get rid of #ifdef/endif.

Personally I don't quite like this approach.  To me, having separate functions
for host and guest is clearer and more flexible, and I don't think the
#ifdef/#endif is a problem.  I'd like to leave it to the maintainers.

-- 
Thanks,
-Kai




* Re: [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory
  2022-06-24  2:12   ` Chao Gao
@ 2022-06-24 11:23     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-24 11:23 UTC (permalink / raw)
  To: Chao Gao
  Cc: linux-kernel, kvm, linux-mm, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata, akpm

On Fri, 2022-06-24 at 10:12 +0800, Chao Gao wrote:
> On Wed, Jun 22, 2022 at 11:16:19PM +1200, Kai Huang wrote:
> > @@ -55,6 +55,7 @@
> > #include <asm/uv/uv.h>
> > #include <asm/setup.h>
> > #include <asm/ftrace.h>
> > +#include <asm/tdx.h>
> > 
> > #include "mm_internal.h"
> > 
> > @@ -972,6 +973,26 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > 	return add_pages(nid, start_pfn, nr_pages, params);
> > }
> > 
> > +int arch_memory_add_precheck(int nid, u64 start, u64 size, mhp_t mhp_flags)
> > +{
> > +	if (!platform_tdx_enabled())
> > +		return 0;
> 
> add a new cc attribute (if existing ones don't fit) for TDX host platform and
> check the attribute here. So that the code here can be reused by other cc
> platforms if they have the same requirement.

Please see my explanation in the commit message:

The __weak arch-specific hook is used instead of a new CC_ATTR similar
to the one that disables software CPU hotplug.  This is because some
driver-managed memory resources may actually be TDX-capable (such as
legacy PMEM, which is indeed RAM underneath), and the arch-specific hook
can be further enhanced to allow those when needed.


-- 
Thanks,
-Kai




* Re: [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand
  2022-06-24  2:39   ` Chao Gao
@ 2022-06-24 11:27     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-24 11:27 UTC (permalink / raw)
  To: Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata


> > +	ret = init_tdx_module();
> > +	if (ret == -ENODEV) {
> > +		pr_info("TDX module is not loaded.\n");
> 
> tdx_module_status should be set to TDX_MODULE_NONE here.

Thanks.  Will fix.

> 
> > +		goto out;
> > +	}

-- 
Thanks,
-Kai




* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-22 11:16 ` [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function Kai Huang
@ 2022-06-24 18:38   ` Dave Hansen
  2022-06-27  5:23     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 18:38 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/22/22 04:16, Kai Huang wrote:
> The SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD
> when the CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't
> handle SEAMCALL exceptions.  Leave it to the caller to guarantee those
> conditions before calling __seamcall().

I was trying to make the argument earlier that you don't need *ANY*
detection for TDX, other than the ability to make a SEAMCALL.
Basically, patch 01/22 could go away.

You are right that:

	The TDX_MODULE_CALL macro doesn't handle SEAMCALL exceptions.

But, it's also not hard to make it *able* to handle exceptions.

So what does patch 01/22 buy us?  One EXTABLE entry?

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-22 11:16 ` [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
@ 2022-06-24 18:50   ` Dave Hansen
  2022-06-27  5:26     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 18:50 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

So, the last patch was called:

	Implement SEAMCALL function

and yet, in this patch, we have a "seamcall()" function.  That's a bit
confusing and not covered at *all* in this subject.

Further, seamcall() is the *ONLY* caller of __seamcall() that I see in
this series.  That makes its presence here even more odd.

The seamcall() bits should either be in their own patch, or mashed in
with __seamcall().

> +/*
> + * Wrapper of __seamcall().  It additionally prints out the error
> + * information if __seamcall() fails normally.  It is useful during
> + * the module initialization by providing more information to the user.
> + */
> +static u64 seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +		    struct tdx_module_output *out)
> +{
> +	u64 ret;
> +
> +	ret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +	if (ret == TDX_SEAMCALL_VMFAILINVALID || !ret)
> +		return ret;
> +
> +	pr_err("SEAMCALL failed: leaf: 0x%llx, error: 0x%llx\n", fn, ret);
> +	if (out)
> +		pr_err("SEAMCALL additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> +			out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
> +
> +	return ret;
> +}
> +
> +static void seamcall_smp_call_function(void *data)
> +{
> +	struct seamcall_ctx *sc = data;
> +	struct tdx_module_output out;
> +	u64 ret;
> +
> +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, &out);
> +	if (ret)
> +		atomic_set(&sc->err, -EFAULT);
> +}
> +
> +/*
> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

You can get away with this three-liner seamcall_on_each_cpu() being in
this patch, but seamcall() itself doesn't belong here.

>  /*
>   * Detect and initialize the TDX module.
>   *
> @@ -138,7 +195,10 @@ static int init_tdx_module(void)
>  
>  static void shutdown_tdx_module(void)
>  {
> -	/* TODO: Shut down the TDX module */
> +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> +	seamcall_on_each_cpu(&sc);
> +
>  	tdx_module_status = TDX_MODULE_SHUTDOWN;
>  }
>  
> @@ -221,6 +281,9 @@ bool platform_tdx_enabled(void)
>   * CPU hotplug is temporarily disabled internally to prevent any cpu
>   * from going offline.
>   *
> + * Caller also needs to guarantee all CPUs are in VMX operation during
> + * this function, otherwise Oops may be triggered.

I would *MUCH* rather have this be a:

	if (!cpu_feature_enabled(X86_FEATURE_VMX))
		WARN_ONCE(1, "VMX should be on blah blah\n");

than just plain oops.  Even a pr_err() that preceded the oops would be
nicer than an oops that someone has to go decode and then grumble when
their binutils is too old that it can't disassemble the TDCALL.



>   * This function can be called in parallel by multiple callers.
>   *
>   * Return:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index f1a2dfb978b1..95d4eb884134 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -46,6 +46,11 @@
>  #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
>  
>  
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN	44
> +
>  /*
>   * Do not put any hardware-defined TDX structure representations below this
>   * comment!


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
  2022-06-22 11:42   ` Rafael J. Wysocki
@ 2022-06-24 18:57   ` Dave Hansen
  2022-06-27  5:05     ` Kai Huang
  2022-06-29  5:33   ` Christoph Hellwig
  2022-08-03  3:55   ` Binbin Wu
  3 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 18:57 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On 6/22/22 04:15, Kai Huang wrote:
> Platforms with confidential computing technology may not support ACPI
> CPU hotplug when such technology is enabled by the BIOS.  Examples
> include Intel platforms which support Intel Trust Domain Extensions
> (TDX).
> 
> If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> bug and reject the new CPU.  For hot-removal, for simplicity just assume
> the kernel cannot continue to work normally, and BUG().

So, the kernel is now declaring ACPI CPU hotplug and TDX to be
incompatible and even BUG()'ing if we see them together.  Has anyone
told the firmware guys about this?  Is this in a spec somewhere?  When
the kernel goes boom, are the firmware folks going to cry "Kernel bug!!"?

This doesn't seem like something the kernel should be doing unilaterally.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory
  2022-06-22 11:16 ` [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory Kai Huang
  2022-06-24  2:12   ` Chao Gao
@ 2022-06-24 19:01   ` Dave Hansen
  2022-06-27  5:27     ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 19:01 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	akpm

On 6/22/22 04:16, Kai Huang wrote:
> @@ -1319,6 +1330,10 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	if (ret)
>  		return ret;
>  
> +	ret = arch_memory_add_precheck(nid, start, size, mhp_flags);
> +	if (ret)
> +		return ret;

Shouldn't a patch that claims to be only for "driver managed memory" be
patching add_memory_driver_managed()?

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-06-22 11:17 ` [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory Kai Huang
@ 2022-06-24 19:40   ` Dave Hansen
  2022-06-27  6:16     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 19:40 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/22/22 04:17, Kai Huang wrote:
...
> Also, explicitly exclude memory regions below first 1MB as TDX memory
> because those regions may not be reported as convertible memory.  This
> is OK as the first 1MB is always reserved during kernel boot and won't
> end up in the page allocator.

Are you sure?  I wasn't for a few minutes until I found reserve_real_mode()

Could we point to that in this changelog, please?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index efa830853e98..4988a91d5283 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1974,6 +1974,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	select ARCH_HAS_CC_PLATFORM
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 1bc97756bc0d..2b20d4a7a62b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -15,6 +15,8 @@
>  #include <linux/cpumask.h>
>  #include <linux/smp.h>
>  #include <linux/atomic.h>
> +#include <linux/sizes.h>
> +#include <linux/memblock.h>
>  #include <asm/cpufeatures.h>
>  #include <asm/cpufeature.h>
>  #include <asm/msr-index.h>
> @@ -338,6 +340,91 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *tdsysinfo,
>  	return check_cmrs(cmr_array, actual_cmr_num);
>  }
>  
> +/*
> + * Skip the memory region below 1MB.  Return true if the entire
> + * region is skipped.  Otherwise, the updated range is returned.
> + */
> +static bool pfn_range_skip_lowmem(unsigned long *p_start_pfn,
> +				  unsigned long *p_end_pfn)
> +{
> +	u64 start, end;
> +
> +	start = *p_start_pfn << PAGE_SHIFT;
> +	end = *p_end_pfn << PAGE_SHIFT;
> +
> +	if (start < SZ_1M)
> +		start = SZ_1M;
> +
> +	if (start >= end)
> +		return true;
> +
> +	*p_start_pfn = (start >> PAGE_SHIFT);
> +
> +	return false;
> +}
> +
> +/*
> + * Walks over all memblock memory regions that are intended to be
> + * converted to TDX memory.  Essentially, it is all memblock memory
> + * regions excluding the low memory below 1MB.
> + *
> + * This is because on some TDX platforms the low memory below 1MB is
> + * not included in CMRs.  Excluding the low 1MB can still guarantee
> + * that the pages managed by the page allocator are always TDX memory,
> + * as the low 1MB is reserved during kernel boot and won't end up to
> + * the ZONE_DMA (see reserve_real_mode()).
> + */
> +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> +		if (!pfn_range_skip_lowmem(p_start, p_end))

Let's summarize where we are at this point:

1. All RAM is described in memblocks
2. Some memblocks are reserved and some are free
3. The lower 1MB is marked reserved
4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
   have to exclude the lower 1MB as a special case.

That seems superficially rather ridiculous.  Shouldn't we just pick a
memblock iterator that skips the 1MB?  Surely there is such a thing.
Or, should we be doing something different with the 1MB in the memblock
structure?

> +/* Check whether first range is the subrange of the second */
> +static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
> +{
> +	return r1_start >= r2_start && r1_end <= r2_end;
> +}
> +
> +/* Check whether address range is covered by any CMR or not. */
> +static bool range_covered_by_cmr(struct cmr_info *cmr_array, int cmr_num,
> +				 u64 start, u64 end)
> +{
> +	int i;
> +
> +	for (i = 0; i < cmr_num; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +
> +		if (is_subrange(start, end, cmr->base, cmr->base + cmr->size))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Check whether all memory regions in memblock are TDX convertible
> + * memory.  Return 0 if all memory regions are convertible, or error.
> + */
> +static int check_memblock_tdx_convertible(void)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i;
> +
> +	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, NULL) {
> +		u64 start, end;
> +
> +		start = start_pfn << PAGE_SHIFT;
> +		end = end_pfn << PAGE_SHIFT;
> +		if (!range_covered_by_cmr(tdx_cmr_array, tdx_cmr_num, start,
> +					end)) {
> +			pr_err("[0x%llx, 0x%llx) is not fully convertible memory\n",
> +					start, end);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Detect and initialize the TDX module.
>   *
> @@ -371,6 +458,19 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * To avoid having to modify the page allocator to distinguish
> +	 * TDX and non-TDX memory allocation, convert all memory regions
> +	 * in memblock to TDX memory to make sure all pages managed by
> +	 * the page allocator are TDX memory.
> +	 *
> +	 * Sanity check all memory regions are fully covered by CMRs to
> +	 * make sure they are truly convertible.
> +	 */
> +	ret = check_memblock_tdx_convertible();
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * Return -EINVAL until all steps of TDX module initialization
>  	 * process are done.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 00/22] TDX host kernel support
  2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
                   ` (21 preceding siblings ...)
  2022-06-22 11:17 ` [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
@ 2022-06-24 19:47 ` Dave Hansen
  2022-06-27  4:09   ` Kai Huang
  22 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 19:47 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	akpm, thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On 6/22/22 04:15, Kai Huang wrote:
> Please kindly help to review, and I would appreciate reviewed-by or
> acked-by tags if the patches look good to you.

Serious question: Is *ANYONE* looking at these patches other than you
and the maintainers?  I first saw this code (inside Intel) in early
2020.  In that time, not a single review tag has been acquired?

$ egrep -ic 'acked-by:|reviewed-by:' kais-patches.mbox
0

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-22 11:17 ` [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2022-06-24 20:13   ` Dave Hansen
  2022-06-27 10:31     ` Kai Huang
  2022-08-17 22:46   ` Sagi Shahar
  1 sibling, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-24 20:13 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4988a91d5283..ec496e96d120 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
>  	depends on CPU_SUP_INTEL
>  	depends on X86_64
>  	depends on KVM_INTEL
> +	depends on CONTIG_ALLOC
>  	select ARCH_HAS_CC_PLATFORM
>  	select ARCH_KEEP_MEMBLOCK
>  	help
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fd9f449b5395..36260dd7e69f 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -558,6 +558,196 @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
>  	return 0;
>  }
>  
> +/* Page sizes supported by TDX */
> +enum tdx_page_sz {
> +	TDX_PG_4K,
> +	TDX_PG_2M,
> +	TDX_PG_1G,
> +	TDX_PG_MAX,
> +};

Are these the same constants as the magic numbers in Kirill's
try_accept_one()?

> +/*
> + * Calculate PAMT size given a TDMR and a page size.  The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> +				      enum tdx_page_sz pgsz)
> +{
> +	unsigned long pamt_sz;
> +	int pamt_entry_nr;

'nr_pamt_entries', please.

> +	switch (pgsz) {
> +	case TDX_PG_4K:
> +		pamt_entry_nr = tdmr->size >> PAGE_SHIFT;
> +		break;
> +	case TDX_PG_2M:
> +		pamt_entry_nr = tdmr->size >> PMD_SHIFT;
> +		break;
> +	case TDX_PG_1G:
> +		pamt_entry_nr = tdmr->size >> PUD_SHIFT;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return 0;
> +	}
> +
> +	pamt_sz = pamt_entry_nr * tdx_sysinfo.pamt_entry_size;
> +	/* TDX requires PAMT size must be 4K aligned */
> +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> +	return pamt_sz;
> +}
> +
> +/*
> + * Pick a NUMA node on which to allocate this TDMR's metadata.
> + *
> + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> + * not be.  If the TDMR covers more than one node, just use the _first_
> + * one.  This can lead to small areas of off-node metadata for some
> + * memory.
> + */
> +static int tdmr_get_nid(struct tdmr_info *tdmr)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, nid;
> +
> +	/* Find the first memory region covered by the TDMR */
> +	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, &nid) {
> +		if (end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> +			return nid;
> +	}
> +
> +	/*
> +	 * No memory region found for this TDMR.  It cannot happen since
> +	 * when one TDMR is created, it must cover at least one (or
> +	 * partial) memory region.
> +	 */
> +	WARN_ON_ONCE(1);
> +	return 0;
> +}

You should really describe what you are doing.  At first glance "return
0;" looks like "declare success".  How about something like this?

	/*
	 * Fall back to allocating the TDMR from node 0 when no memblock
	 * can be found.  This should never happen since TDMRs originate
	 * from the memblocks.
	 */

Does that miss any of the points you were trying to make?

> +static int tdmr_set_up_pamt(struct tdmr_info *tdmr)
> +{
> +	unsigned long pamt_base[TDX_PG_MAX];
> +	unsigned long pamt_size[TDX_PG_MAX];
> +	unsigned long tdmr_pamt_base;
> +	unsigned long tdmr_pamt_size;
> +	enum tdx_page_sz pgsz;
> +	struct page *pamt;
> +	int nid;
> +
> +	nid = tdmr_get_nid(tdmr);
> +
> +	/*
> +	 * Calculate the PAMT size for each TDX supported page size
> +	 * and the total PAMT size.
> +	 */
> +	tdmr_pamt_size = 0;
> +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> +		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
> +		tdmr_pamt_size += pamt_size[pgsz];
> +	}
> +
> +	/*
> +	 * Allocate one chunk of physically contiguous memory for all
> +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> +	 * in overlapped TDMRs.
> +	 */
> +	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> +			nid, &node_online_map);
> +	if (!pamt)
> +		return -ENOMEM;

I'm not sure it's worth mentioning, but this doesn't really need to be
GFP_KERNEL.  __GFP_HIGHMEM would actually be just fine.  But,
considering that this is 64-bit only, that's just a technicality.

> +	/* Calculate PAMT base and size for all supported page sizes. */

That comment isn't doing much good.  If you say anything here it should be:

	/*
	 * Break the contiguous allocation back up into
	 * the individual PAMTs for each page size:
	 */

Also, this is *not* "calculating size".  That's done above.

> +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> +		pamt_base[pgsz] = tdmr_pamt_base;
> +		tdmr_pamt_base += pamt_size[pgsz];
> +	}
> +
> +	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> +	tdmr->pamt_4k_size = pamt_size[TDX_PG_4K];
> +	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> +	tdmr->pamt_2m_size = pamt_size[TDX_PG_2M];
> +	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> +	tdmr->pamt_1g_size = pamt_size[TDX_PG_1G];
> +
> +	return 0;
> +}
>
> +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
> +			  unsigned long *pamt_npages)
> +{
> +	unsigned long pamt_base, pamt_sz;
> +
> +	/*
> +	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
> +	 * should always point to the beginning of that allocation.
> +	 */
> +	pamt_base = tdmr->pamt_4k_base;
> +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> +	*pamt_pfn = pamt_base >> PAGE_SHIFT;
> +	*pamt_npages = pamt_sz >> PAGE_SHIFT;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> +	unsigned long pamt_pfn, pamt_npages;
> +
> +	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
> +
> +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> +	if (!pamt_npages)
> +		return;
> +
> +	if (WARN_ON_ONCE(!pamt_pfn))
> +		return;
> +
> +	free_contig_range(pamt_pfn, pamt_npages);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_num; i++)
> +		tdmr_free_pamt(tdmr_array_entry(tdmr_array, i));
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
> +{
> +	int i, ret = 0;
> +
> +	for (i = 0; i < tdmr_num; i++) {
> +		ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i));
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> +	return ret;
> +}
> +
> +static unsigned long tdmrs_get_pamt_pages(struct tdmr_info *tdmr_array,
> +					  int tdmr_num)

"get" is for refcounting.  tdmrs_count_pamt_pages() would be preferable.

> +{
> +	unsigned long pamt_npages = 0;
> +	int i;
> +
> +	for (i = 0; i < tdmr_num; i++) {
> +		unsigned long pfn, npages;
> +
> +		tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages);
> +		pamt_npages += npages;
> +	}
> +
> +	return pamt_npages;
> +}
> +
>  /*
>   * Construct an array of TDMRs to cover all memory regions in memblock.
>   * This makes sure all pages managed by the page allocator are TDX
> @@ -572,8 +762,13 @@ static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
>  	if (ret)
>  		goto err;
>  
> +	ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
> +	if (ret)
> +		goto err;
> +
>  	/* Return -EINVAL until constructing TDMRs is done */
>  	ret = -EINVAL;
> +	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
>  err:
>  	return ret;
>  }
> @@ -644,6 +839,11 @@ static int init_tdx_module(void)
>  	 * process are done.
>  	 */
>  	ret = -EINVAL;
> +	if (ret)
> +		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> +	else
> +		pr_info("%lu pages allocated for PAMT.\n",
> +				tdmrs_get_pamt_pages(tdmr_array, tdmr_num));
>  out_free_tdmrs:
>  	/*
>  	 * The array of TDMRs is freed no matter the initialization is

The rest looks OK.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 00/22] TDX host kernel support
  2022-06-24 19:47 ` [PATCH v5 00/22] TDX host kernel support Dave Hansen
@ 2022-06-27  4:09   ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-27  4:09 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: linux-mm, linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	akpm, thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On Fri, 2022-06-24 at 12:47 -0700, Dave Hansen wrote:
> On 6/22/22 04:15, Kai Huang wrote:
> > Please kindly help to review, and I would appreciate reviewed-by or
> > acked-by tags if the patches look good to you.
> 
> Serious question: Is *ANYONE* looking at these patches other than you
> and the maintainers?  I first saw this code (inside Intel) in early
> 2020.  In that time, not a single review tag has been acquired?
> 
> $ egrep -ic 'acked-by:|reviewed-by:' kais-patches.mbox
> 0

Hi Dave,

There were big design changes in the history of this series (i.e. we originally
supported loading both the NP-SEAMLDR ACM and the TDX module during boot, and we
changed from initializing the module during kernel boot to doing it at runtime),
but yes some other Linux/KVM TDX developers in our team have been reviewing this
series the whole time, at least to some extent.  They just didn't give
Reviewed-by or Acked-by tags.

Especially, after we had agreed that this series in general should enable TDX
with minimal code change, Kevin helped to review this series intensively and
helped to simplify the code to the current shape (i.e. TDMR part).  He didn't
give any of tags either (only said this series is ready for you to review),
perhaps because he was _helping_ to get this series to the shape that is ready
for you and other Intel reviewers to review.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-24 18:57   ` Dave Hansen
@ 2022-06-27  5:05     ` Kai Huang
  2022-07-13 11:09       ` Kai Huang
  2022-08-03  3:40       ` Binbin Wu
  0 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-27  5:05 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On Fri, 2022-06-24 at 11:57 -0700, Dave Hansen wrote:
> On 6/22/22 04:15, Kai Huang wrote:
> > Platforms with confidential computing technology may not support ACPI
> > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > include Intel platforms which support Intel Trust Domain Extensions
> > (TDX).
> > 
> > If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> > bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> > bug and reject the new CPU.  For hot-removal, for simplicity just assume
> > the kernel cannot continue to work normally, and BUG().
> 
> So, the kernel is now declaring ACPI CPU hotplug and TDX to be
> incompatible and even BUG()'ing if we see them together.  Has anyone
> told the firmware guys about this?  Is this in a spec somewhere?  When
> the kernel goes boom, are the firmware folks going to cry "Kernel bug!!"?
> 
> This doesn't seem like something the kernel should be doing unilaterally.

That TDX doesn't support ACPI CPU hotplug (both hot-add and hot-removal) is
architectural behaviour.  The public specs don't explicitly say it, but it is
implied:

1) During platform boot MCHECK verifies that all logical CPUs on all packages
are TDX compatible, and it keeps some information, such as the total number of
CPU packages and logical CPUs, at some location in SEAMRR so it can later be
used by the P-SEAMLDR and the TDX module.  Please see "3.4 SEAMLDR_SEAMINFO" in
the P-SEAMLDR spec:

https://cdrdv2.intel.com/v1/dl/getContent/733584

2) Also some SEAMCALLs must be called on all logical CPUs or CPU packages that
the platform has (such as TDH.SYS.INIT.LP and TDH.SYS.KEY.CONFIG),
otherwise the further steps of TDX module initialization will fail.

Unfortunately there's no public spec mentioning the behaviour of ACPI CPU
hotplug on a TDX-enabled platform.  For instance, whether the BIOS will ever get
the ACPI CPU hot-plug event, or, if it does, whether it will suppress it.  What
I got from Intel internally is that a non-buggy BIOS should never report such an
event to the kernel, so if the kernel receives one, it should be fair enough to
treat it as a BIOS bug.

But theoretically, the BIOS isn't in TDX's TCB, and can be from 3rd party..

Also, I was told "CPU hot-plug is a system feature, not a CPU feature or Intel
architecture feature", so Intel doesn't have an architectural specification for
CPU hot-plug. 

In the meantime, I am pushing Intel internally to add some statements regarding
the TDX and CPU hotplug interaction to the BIOS Writer's Guide and make it
public.  I guess this is the best thing we can do.

Regarding the code change, I agree the BUG() isn't good.  I used it because:
1) this is basically a theoretical problem and shouldn't happen in practice; 2)
there's no architectural specification of the behaviour of TDX on CPU
hot-removal, so I just used BUG() on the assumption that TDX isn't safe to
use anymore.

But Rafael doesn't like the current code change either.  I think maybe we can
just disable the ACPI CPU hotplug code when TDX is enabled by the BIOS
(something like below):

--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -707,6 +707,10 @@ bool acpi_duplicate_processor_id(int proc_id)
 void __init acpi_processor_init(void)
 {
        acpi_processor_check_duplicates();
+
+       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED))
+               return;
+
        acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
        acpi_scan_add_handler(&processor_container_handler);
 }

This approach is cleaner I think, but we won't be able to report a "BIOS bug"
when ACPI CPU hotplug happens.  To me that's OK, as it's arguable whether it
should be treated as a BIOS bug at all (theoretically the BIOS can come from a
3rd party).

What's your opinion?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-24 18:38   ` Dave Hansen
@ 2022-06-27  5:23     ` Kai Huang
  2022-06-27 20:58       ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-27  5:23 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> On 6/22/22 04:16, Kai Huang wrote:
> > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
> > SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
> > before calling __seamcall().
> 
> I was trying to make the argument earlier that you don't need *ANY*
> detection for TDX, other than the ability to make a SEAMCALL.
> Basically, patch 01/22 could go away.
> 
> You are right that:
> 
> 	The TDX_MODULE_CALL macro doesn't handle SEAMCALL exceptions.
> 
> But, it's also not hard to make it *able* to handle exceptions.
> 
> So what does patch 01/22 buy us?  One EXTABLE entry?

There are several benefits if we can detect whether TDX is enabled by the BIOS
during boot, before initializing the TDX module:

1) There are requirements from customers to report whether the platform
supports TDX and the number of TDX KeyIDs before initializing the TDX module,
so the userspace cloud software can use this information.  Sorry I cannot find
the lore link now.

Isaku, if you see this, could you provide more info?

2) As you can see, it can be used to handle ACPI CPU/memory hotplug and
driver-managed memory hotplug.  The kexec() support patch can also use it.

In concept, ACPI CPU/memory hotplug is only related to whether TDX is enabled
by the BIOS, not to whether the TDX module is loaded or to the result of
initializing the TDX module.  So I think we should have some code to detect TDX
during boot.


Also, adding an EXTABLE entry to TDX_MODULE_CALL doesn't seem to result in
significantly less code than detecting TDX during boot:

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4b75c930fa1b..4a97ca8eb14c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,7 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>

+#ifdef CONFIG_INTEL_TDX_HOST
 /*
  * SW-defined error codes.
  *
@@ -18,6 +19,21 @@
 #define TDX_SW_ERROR                   (TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID     (TDX_SW_ERROR | _UL(0xFFFF0000))

+/*
+ * Special error codes to indicate SEAMCALL #GP and #UD.
+ *
+ * SEAMCALL causes #GP when SEAMRR is not properly enabled by BIOS, and
+ * causes #UD when CPU is not in VMX operation.  Define two separate
+ * error codes to distinguish the two cases so caller can be aware of
+ * what caused the SEAMCALL to fail.
+ *
+ * Bits 61:48 are reserved bits which will never be set by the TDX
+ * module.  Borrow 2 reserved bits to represent #GP and #UD.
+ */
+#define TDX_SEAMCALL_GP                (TDX_ERROR | GENMASK_ULL(48, 48))
+#define TDX_SEAMCALL_UD                (TDX_ERROR | GENMASK_ULL(49, 49))
+#endif
+
 #ifndef __ASSEMBLY__

 /*
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..7431c47258d9 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>

 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
        /* Leave input param 2 in RDX */

        .if \host
+1:
        seamcall
        /*
         * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,9 +59,25 @@
         * This value will never be used as actual SEAMCALL error code as
         * it is from the Reserved status code class.
         */
-       jnc .Lno_vmfailinvalid
+       jnc .Lseamcall_out
        mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+       jmp .Lseamcall_out
+2:
+       /*
+        * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
+        * the trap number.  Check the trap number and set up the return
+        * value to %rax.
+        */
+       cmp $X86_TRAP_GP, %eax
+       je .Lseamcall_gp
+       mov $TDX_SEAMCALL_UD, %rax
+       jmp .Lseamcall_out
+.Lseamcall_gp:
+       mov $TDX_SEAMCALL_GP, %rax
+       jmp .Lseamcall_out
+
+       _ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:





^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-24 18:50   ` Dave Hansen
@ 2022-06-27  5:26     ` Kai Huang
  2022-06-27 20:46       ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-27  5:26 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-06-24 at 11:50 -0700, Dave Hansen wrote:
> So, the last patch was called:
> 
> 	Implement SEAMCALL function
> 
> and yet, in this patch, we have a "seamcall()" function.  That's a bit
> confusing and not covered at *all* in this subject.
> 
> Further, seamcall() is the *ONLY* caller of __seamcall() that I see in
> this series.  That makes its presence here even more odd.
> 
> The seamcall() bits should either be in their own patch, or mashed in
> with __seamcall().

Right.  The reason I didn't put seamcall() into the previous patch is that it is
only used in this tdx.c, so it should be static.  But adding a static function
without using it in the previous patch would trigger a compile warning.  So I
introduced it here, where it is first used.

One option is to introduce seamcall() as a static inline function in tdx.h in the
previous patch so there won't be a warning.  I'll switch to this approach.
Please let me know if you have any comments.

> 
> > +/*
> > + * Wrapper of __seamcall().  It additionally prints out the error
> > + * information if __seamcall() fails normally.  It is useful during
> > + * the module initialization by providing more information to the user.
> > + */
> > +static u64 seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > +		    struct tdx_module_output *out)
> > +{
> > +	u64 ret;
> > +
> > +	ret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +	if (ret == TDX_SEAMCALL_VMFAILINVALID || !ret)
> > +		return ret;
> > +
> > +	pr_err("SEAMCALL failed: leaf: 0x%llx, error: 0x%llx\n", fn, ret);
> > +	if (out)
> > +		pr_err("SEAMCALL additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > +			out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
> > +
> > +	return ret;
> > +}
> > +
> > +static void seamcall_smp_call_function(void *data)
> > +{
> > +	struct seamcall_ctx *sc = data;
> > +	struct tdx_module_output out;
> > +	u64 ret;
> > +
> > +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, &out);
> > +	if (ret)
> > +		atomic_set(&sc->err, -EFAULT);
> > +}
> > +
> > +/*
> > + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> > + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> > + */
> > +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > +{
> > +	on_each_cpu(seamcall_smp_call_function, sc, true);
> > +}
> 
> You can get away with this three-liner seamcall_on_each_cpu() being in
> this patch, but seamcall() itself doesn't belong here.

Right.  Please see above reply.

> 
> >  /*
> >   * Detect and initialize the TDX module.
> >   *
> > @@ -138,7 +195,10 @@ static int init_tdx_module(void)
> >  
> >  static void shutdown_tdx_module(void)
> >  {
> > -	/* TODO: Shut down the TDX module */
> > +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> > +
> > +	seamcall_on_each_cpu(&sc);
> > +
> >  	tdx_module_status = TDX_MODULE_SHUTDOWN;
> >  }
> >  
> > @@ -221,6 +281,9 @@ bool platform_tdx_enabled(void)
> >   * CPU hotplug is temporarily disabled internally to prevent any cpu
> >   * from going offline.
> >   *
> > + * Caller also needs to guarantee all CPUs are in VMX operation during
> > + * this function, otherwise Oops may be triggered.
> 
> I would *MUCH* rather have this be a:
> 
> 	if (!cpu_feature_enabled(X86_FEATURE_VMX))
> 		WARN_ONCE("VMX should be on blah blah\n");
> 
> than just plain oops.  Even a pr_err() that preceded the oops would be
> nicer than an oops that someone has to go decode and then grumble when
> their binutils is too old that it can't disassemble the TDCALL.

I can add this to seamcall():

	/*
	 * SEAMCALL requires the CPU to be in VMX operation, otherwise it
	 * causes #UD.  Sanity check and return early to avoid an Oops.
	 * Note cpu_vmx_enabled() only checks whether VMX is enabled, but
	 * doesn't check whether the CPU is in VMX operation (i.e. whether
	 * VMXON has been done).  There's no way to check whether VMXON has
	 * been done, but currently enabling VMX and doing VMXON are always
	 * done together.
	 */
	if (!cpu_vmx_enabled()) {
		WARN_ONCE(1, "CPU is not in VMX operation before making SEAMCALL");
		return -EINVAL;
	}

The reason I didn't do this is that I'd like to keep seamcall() simple: it only
returns TDX_SEAMCALL_VMFAILINVALID or the actual SEAMCALL leaf error.  With the
above, this function would also return a kernel error code, which isn't good.

Alternatively, we can always add an EXTABLE to the TDX_MODULE_CALL macro to handle
#UD and #GP by returning dedicated error codes (please also see my reply to the
previous patch for the code needed to handle this), in which case we don't need
such a check here.

Always handling #UD in the TDX_MODULE_CALL macro also has another advantage: there
will be no Oops for #UD regardless of the issue that "there's no way to check
whether VMXON has been done" mentioned in the above comment.

What's your opinion?


-- 
Thanks,
-Kai




* Re: [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory
  2022-06-24 19:01   ` Dave Hansen
@ 2022-06-27  5:27     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-27  5:27 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: linux-mm, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	akpm

On Fri, 2022-06-24 at 12:01 -0700, Dave Hansen wrote:
> On 6/22/22 04:16, Kai Huang wrote:
> > @@ -1319,6 +1330,10 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> >  	if (ret)
> >  		return ret;
> >  
> > +	ret = arch_memory_add_precheck(nid, start, size, mhp_flags);
> > +	if (ret)
> > +		return ret;
> 
> Shouldn't a patch that claims to be only for "driver managed memory" be
> patching add_memory_driver_managed()?

Right, given that ACPI memory hotplug is handled in a separate patch.  I will move
the check to add_memory_driver_managed().


-- 
Thanks,
-Kai




* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-06-24 19:40   ` Dave Hansen
@ 2022-06-27  6:16     ` Kai Huang
  2022-07-07  2:37       ` Kai Huang
  2022-07-07 14:26       ` Dave Hansen
  0 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-27  6:16 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-06-24 at 12:40 -0700, Dave Hansen wrote:
> On 6/22/22 04:17, Kai Huang wrote:
> ...
> > Also, explicitly exclude memory regions below first 1MB as TDX memory
> > because those regions may not be reported as convertible memory.  This
> > is OK as the first 1MB is always reserved during kernel boot and won't
> > end up to the page allocator.
> 
> Are you sure?  I wasn't for a few minutes until I found reserve_real_mode()
> 
> Could we point to that in this changelog, please?

OK, I will explicitly point out reserve_real_mode().

> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index efa830853e98..4988a91d5283 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1974,6 +1974,7 @@ config INTEL_TDX_HOST
> >  	depends on X86_64
> >  	depends on KVM_INTEL
> >  	select ARCH_HAS_CC_PLATFORM
> > +	select ARCH_KEEP_MEMBLOCK
> >  	help
> >  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> >  	  host and certain physical attacks.  This option enables necessary TDX
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 1bc97756bc0d..2b20d4a7a62b 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -15,6 +15,8 @@
> >  #include <linux/cpumask.h>
> >  #include <linux/smp.h>
> >  #include <linux/atomic.h>
> > +#include <linux/sizes.h>
> > +#include <linux/memblock.h>
> >  #include <asm/cpufeatures.h>
> >  #include <asm/cpufeature.h>
> >  #include <asm/msr-index.h>
> > @@ -338,6 +340,91 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *tdsysinfo,
> >  	return check_cmrs(cmr_array, actual_cmr_num);
> >  }
> >  
> > +/*
> > + * Skip the memory region below 1MB.  Return true if the entire
> > + * region is skipped.  Otherwise, the updated range is returned.
> > + */
> > +static bool pfn_range_skip_lowmem(unsigned long *p_start_pfn,
> > +				  unsigned long *p_end_pfn)
> > +{
> > +	u64 start, end;
> > +
> > +	start = *p_start_pfn << PAGE_SHIFT;
> > +	end = *p_end_pfn << PAGE_SHIFT;
> > +
> > +	if (start < SZ_1M)
> > +		start = SZ_1M;
> > +
> > +	if (start >= end)
> > +		return true;
> > +
> > +	*p_start_pfn = (start >> PAGE_SHIFT);
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * Walks over all memblock memory regions that are intended to be
> > + * converted to TDX memory.  Essentially, it is all memblock memory
> > + * regions excluding the low memory below 1MB.
> > + *
> > + * This is because on some TDX platforms the low memory below 1MB is
> > + * not included in CMRs.  Excluding the low 1MB can still guarantee
> > + * that the pages managed by the page allocator are always TDX memory,
> > + * as the low 1MB is reserved during kernel boot and won't end up to
> > + * the ZONE_DMA (see reserve_real_mode()).
> > + */
> > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> > +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> > +		if (!pfn_range_skip_lowmem(p_start, p_end))
> 
> Let's summarize where we are at this point:
> 
> 1. All RAM is described in memblocks
> 2. Some memblocks are reserved and some are free
> 3. The lower 1MB is marked reserved
> 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
>    have to exclude the lower 1MB as a special case.
> 
> That seems superficially rather ridiculous.  Shouldn't we just pick a
> memblock iterator that skips the 1MB?  Surely there is such a thing.

Perhaps you are suggesting we should always loop over the _free_ ranges so we don't
need to care about the first 1MB, which is reserved?

The problem is that some reserved memory regions are actually freed to the page
allocator later, for example, the initrd.  So to cover all those 'late-freed-
reserved-regions', I used for_each_mem_pfn_range() instead of
for_each_free_mem_range().

Btw, I do have a checkpatch warning around this code:

ERROR: Macros with complex values should be enclosed in parentheses
#109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
+#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
+	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
+		if (!pfn_range_skip_lowmem(p_start, p_end))

But it looks like a false positive to me.

> Or, should we be doing something different with the 1MB in the memblock
> structure?

memblock APIs are used by other kernel components.  I don't think we should
modify memblock code behaviour for TDX.  Do you have any specific suggestion?

One possible option I can think of is to explicitly "register" memory regions as
TDX memory when they are first freed to the page allocator.  Those regions include:

1) memblock_free_all()

Where the majority of pages are freed to the page allocator from memblock.

2) memblock_free_late()

Which covers regions freed to the page allocator after 1).

3) free_init_pages()

Which is explicitly used for some reserved areas such as the initrd and part of the
kernel image.

This will require new data structures to represent a TDX memblock, and the code to
create, insert and merge contiguous TDX memblocks, etc.  The advantage is we can
just iterate over those TDX memblocks when constructing TDMRs.


-- 
Thanks,
-Kai




* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-23  0:01     ` Kai Huang
@ 2022-06-27  8:01       ` Igor Mammedov
  2022-06-28 10:04         ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Igor Mammedov @ 2022-06-27  8:01 UTC (permalink / raw)
  To: Kai Huang
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky, Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld,
	Juri Lelli, Mark Rutland, Frederic Weisbecker, Yue Haibing,
	dongli.zhang

On Thu, 23 Jun 2022 12:01:48 +1200
Kai Huang <kai.huang@intel.com> wrote:

> On Wed, 2022-06-22 at 13:42 +0200, Rafael J. Wysocki wrote:
> > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:  
> > > 
> > > Platforms with confidential computing technology may not support ACPI
> > > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > > include Intel platforms which support Intel Trust Domain Extensions
> > > (TDX).
> > > 
> > > If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> > > bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> > > bug and reject the new CPU.  For hot-removal, for simplicity just assume
> > > the kernel cannot continue to work normally, and BUG().
> > > 
> > > Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
> > > platform doesn't support ACPI CPU hotplug, so that kernel can handle
> > > ACPI CPU hotplug events for such platform.  The existing attribute
> > > CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug thus doesn't fit.
> > > 
> > > In acpi_processor_{add|remove}(), add early check against this attribute
> > > and handle accordingly if it is set.
> > > 
> > > Also take this chance to rename existing CC_ATTR_HOTPLUG_DISABLED to
> > > CC_ATTR_CPU_HOTPLUG_DISABLED as it is for software CPU hotplug.
> > > 
> > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > ---
> > >  arch/x86/coco/core.c          |  2 +-
> > >  drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
> > >  include/linux/cc_platform.h   | 15 +++++++++++++--
> > >  kernel/cpu.c                  |  2 +-
> > >  4 files changed, 38 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> > > index 4320fadae716..1bde1af75296 100644
> > > --- a/arch/x86/coco/core.c
> > > +++ b/arch/x86/coco/core.c
> > > @@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > >  {
> > >         switch (attr) {
> > >         case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > -       case CC_ATTR_HOTPLUG_DISABLED:
> > > +       case CC_ATTR_CPU_HOTPLUG_DISABLED:
> > >         case CC_ATTR_GUEST_MEM_ENCRYPT:
> > >         case CC_ATTR_MEM_ENCRYPT:
> > >                 return true;
> > > diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> > > index 6737b1cbf6d6..b960db864cd4 100644
> > > --- a/drivers/acpi/acpi_processor.c
> > > +++ b/drivers/acpi/acpi_processor.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/kernel.h>
> > >  #include <linux/module.h>
> > >  #include <linux/pci.h>
> > > +#include <linux/cc_platform.h>
> > > 
> > >  #include <acpi/processor.h>
> > > 
> > > @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
> > >         struct device *dev;
> > >         int result = 0;
> > > 
> > > +       /*
> > > +        * If the confidential computing platform doesn't support ACPI
> > > +        * memory hotplug, the BIOS should never deliver such event to
> > > +        * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> > > +        * the new CPU.
> > > +        */
> > > +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {  
> > 
> > This will affect initialization, not just hotplug AFAICS.
> > 
> > You should reset the .hotplug.enabled flag in processor_handler to
> > false instead.  
> 
> Hi Rafael,
> 
> Thanks for the review.  By "affect initialization" did you mean this
> acpi_processor_add() is also called during kernel boot when any logical cpu is
> brought up?  Or do you mean ACPI CPU hotplug can also happen during kernel boot
> (after acpi_processor_init())?
> 
> I see acpi_processor_init() calls acpi_processor_check_duplicates(), which calls
> acpi_evaluate_object(), but I don't know the details of ACPI so I don't know
> whether this would trigger acpi_processor_add().
> 
> One thing to note is that TDX not supporting ACPI CPU hotplug is an architectural
> restriction, so it is illegal even if it happens during kernel boot.  Dave's idea
> is that the kernel should speak out loudly if physical CPU hotplug indeed happened
> on (BIOS) TDX-enabled platforms.  Otherwise perhaps we can just give up
> initializing the ACPI CPU hotplug handlers in acpi_processor_init(), something
> like below?

The thing is that by the time the ACPI machinery kicks in, physical hotplug has
already happened, and in the case of the kvm+qemu+ovmf hypervisor combo the
firmware has already handled it somehow and handed it over to ACPI.
If you say it's an architectural thing, then CPU hotplug is a platform/firmware
bug and should be disabled there instead of being worked around in the kernel.

Perhaps instead of 'preventing' hotplug, complain/panic and be done with it.
 
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -707,6 +707,10 @@ bool acpi_duplicate_processor_id(int proc_id)
>  void __init acpi_processor_init(void)
>  {
>         acpi_processor_check_duplicates();
> +
> +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED))
> +               return;
> +
>         acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
>         acpi_scan_add_handler(&processor_container_handler);
>  }
> 
> 
> >   
> > > +               dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI CPU hotplug.  New CPU ignored.\n");
> > > +               return -EINVAL;
> > > +       }
> > > +  
> 



* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-24 20:13   ` Dave Hansen
@ 2022-06-27 10:31     ` Kai Huang
  2022-06-27 20:41       ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-27 10:31 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-06-24 at 13:13 -0700, Dave Hansen wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 4988a91d5283..ec496e96d120 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
> >  	depends on CPU_SUP_INTEL
> >  	depends on X86_64
> >  	depends on KVM_INTEL
> > +	depends on CONTIG_ALLOC
> >  	select ARCH_HAS_CC_PLATFORM
> >  	select ARCH_KEEP_MEMBLOCK
> >  	help
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index fd9f449b5395..36260dd7e69f 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -558,6 +558,196 @@ static int create_tdmrs(struct tdmr_info *tdmr_array,
> > int *tdmr_num)
> >  	return 0;
> >  }
> >  
> > +/* Page sizes supported by TDX */
> > +enum tdx_page_sz {
> > +	TDX_PG_4K,
> > +	TDX_PG_2M,
> > +	TDX_PG_1G,
> > +	TDX_PG_MAX,
> > +};
> 
> Are these the same constants as the magic numbers in Kirill's
> try_accept_one()?

try_accept_one() uses 'enum pg_level' PG_LEVEL_{4K,2M,1G} directly.  They could be
used here directly too, but 'enum pg_level' has more than we need:

enum pg_level {
        PG_LEVEL_NONE,
        PG_LEVEL_4K,
        PG_LEVEL_2M,
        PG_LEVEL_1G,
        PG_LEVEL_512G,
        PG_LEVEL_NUM
};

It has PG_LEVEL_NONE, so PG_LEVEL_4K starts at 1.

Below in tdmr_set_up_pamt(), I have two local arrays to store the base/size for
all TDX supported page sizes:

	unsigned long pamt_base[TDX_PG_MAX];
	unsigned long pamt_size[TDX_PG_MAX]; 

And a loop to calculate the size of PAMT for each page size:

	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
		...
	}

And later a similar loop to get the PAMT base of each page size too.

I can change them to:

	/*
	 * TDX only supports 4K, 2M and 1G pages, but doesn't
	 * support the 512G page size.
	 */
#define TDX_PG_LEVEL_MAX	PG_LEVEL_512G

	unsigned long pamt_base[TDX_PG_LEVEL_MAX];
	unsigned long pamt_size[TDX_PG_LEVEL_MAX];

And change the loop to:

	for (pgsz = PG_LEVEL_4K; pgsz < TDX_PG_LEVEL_MAX; pgsz++) {
		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
		...
	}

This would waste one 'unsigned long' in both the pamt_base and pamt_size arrays, as
entry 0 isn't used in either of them.  Or we can explicitly subtract 1 from the
array index:

	for (pgsz = PG_LEVEL_4K; pgsz < TDX_PG_LEVEL_MAX; pgsz++) {
		pamt_size[pgsz - 1] = tdmr_get_pamt_sz(tdmr, pgsz);
		...
	}

What's your opinion? 

> > +/*
> > + * Calculate PAMT size given a TDMR and a page size.  The returned
> > + * PAMT size is always aligned up to 4K page boundary.
> > + */
> > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> > +				      enum tdx_page_sz pgsz)
> > +{
> > +	unsigned long pamt_sz;
> > +	int pamt_entry_nr;
> 
> 'nr_pamt_entries', please.

OK.

> 
> > +	switch (pgsz) {
> > +	case TDX_PG_4K:
> > +		pamt_entry_nr = tdmr->size >> PAGE_SHIFT;
> > +		break;
> > +	case TDX_PG_2M:
> > +		pamt_entry_nr = tdmr->size >> PMD_SHIFT;
> > +		break;
> > +	case TDX_PG_1G:
> > +		pamt_entry_nr = tdmr->size >> PUD_SHIFT;
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		return 0;
> > +	}
> > +
> > +	pamt_sz = pamt_entry_nr * tdx_sysinfo.pamt_entry_size;
> > +	/* TDX requires PAMT size must be 4K aligned */
> > +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> > +
> > +	return pamt_sz;
> > +}
> > +
> > +/*
> > + * Pick a NUMA node on which to allocate this TDMR's metadata.
> > + *
> > + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> > + * not be.  If the TDMR covers more than one node, just use the _first_
> > + * one.  This can lead to small areas of off-node metadata for some
> > + * memory.
> > + */
> > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long start_pfn, end_pfn;
> > +	int i, nid;
> > +
> > +	/* Find the first memory region covered by the TDMR */
> > +	memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, &nid)
> > {
> > +		if (end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> > +			return nid;
> > +	}
> > +
> > +	/*
> > +	 * No memory region found for this TDMR.  It cannot happen since
> > +	 * when one TDMR is created, it must cover at least one (or
> > +	 * partial) memory region.
> > +	 */
> > +	WARN_ON_ONCE(1);
> > +	return 0;
> > +}
> 
> You should really describe what you are doing.  At first glance "return
> 0;" looks like "declare success".  How about something like this?
> 
> 	/*
> 	 * Fall back to allocating the TDMR from node 0 when no memblock
> 	 * can be found.  This should never happen since TDMRs originate
> 	 * from the memblocks.
> 	 */
> 
> Does that miss any of the points you were trying to make?

No.  Your comment looks better and I will use it.  Thanks.

> 
> > +static int tdmr_set_up_pamt(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long pamt_base[TDX_PG_MAX];
> > +	unsigned long pamt_size[TDX_PG_MAX];
> > +	unsigned long tdmr_pamt_base;
> > +	unsigned long tdmr_pamt_size;
> > +	enum tdx_page_sz pgsz;
> > +	struct page *pamt;
> > +	int nid;
> > +
> > +	nid = tdmr_get_nid(tdmr);
> > +
> > +	/*
> > +	 * Calculate the PAMT size for each TDX supported page size
> > +	 * and the total PAMT size.
> > +	 */
> > +	tdmr_pamt_size = 0;
> > +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> > +		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
> > +		tdmr_pamt_size += pamt_size[pgsz];
> > +	}
> > +
> > +	/*
> > +	 * Allocate one chunk of physically contiguous memory for all
> > +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> > +	 * in overlapped TDMRs.
> > +	 */
> > +	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> > +			nid, &node_online_map);
> > +	if (!pamt)
> > +		return -ENOMEM;
> 
> I'm not sure it's worth mentioning, but this doesn't really need to be
> GFP_KERNEL.  __GFP_HIGHMEM would actually be just fine.  But,
> considering that this is 64-bit only, that's just a technicality.



> 
> > +	/* Calculate PAMT base and size for all supported page sizes. */
> 
> That comment isn't doing much good.  If you say anything here it should be:
> 
> 	/*
> 	 * Break the contiguous allocation back up into
> 	 * the individual PAMTs for each page size:
> 	 */
> 
> Also, this is *not* "calculating size".  That's done above.

Thanks, I will use this comment.

> 
> > +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> > +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> > +		pamt_base[pgsz] = tdmr_pamt_base;
> > +		tdmr_pamt_base += pamt_size[pgsz];
> > +	}
> > +
> > +	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> > +	tdmr->pamt_4k_size = pamt_size[TDX_PG_4K];
> > +	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> > +	tdmr->pamt_2m_size = pamt_size[TDX_PG_2M];
> > +	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> > +	tdmr->pamt_1g_size = pamt_size[TDX_PG_1G];
> > +
> > +	return 0;
> > +}
> > 
> > +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
> > +			  unsigned long *pamt_npages)
> > +{
> > +	unsigned long pamt_base, pamt_sz;
> > +
> > +	/*
> > +	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
> > +	 * should always point to the beginning of that allocation.
> > +	 */
> > +	pamt_base = tdmr->pamt_4k_base;
> > +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr-
> > >pamt_1g_size;
> > +
> > +	*pamt_pfn = pamt_base >> PAGE_SHIFT;
> > +	*pamt_npages = pamt_sz >> PAGE_SHIFT;
> > +}
> > +
> > +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long pamt_pfn, pamt_npages;
> > +
> > +	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
> > +
> > +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> > +	if (!pamt_npages)
> > +		return;
> > +
> > +	if (WARN_ON_ONCE(!pamt_pfn))
> > +		return;
> > +
> > +	free_contig_range(pamt_pfn, pamt_npages);
> > +}
> > +
> > +static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < tdmr_num; i++)
> > +		tdmr_free_pamt(tdmr_array_entry(tdmr_array, i));
> > +}
> > +
> > +/* Allocate and set up PAMTs for all TDMRs */
> > +static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int
> > tdmr_num)
> > +{
> > +	int i, ret = 0;
> > +
> > +	for (i = 0; i < tdmr_num; i++) {
> > +		ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i));
> > +		if (ret)
> > +			goto err;
> > +	}
> > +
> > +	return 0;
> > +err:
> > +	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> > +	return ret;
> > +}
> > +
> > +static unsigned long tdmrs_get_pamt_pages(struct tdmr_info *tdmr_array,
> > +					  int tdmr_num)
> 
> "get" is for refcounting.  tdmrs_count_pamt_pages() would be preferable.

Will use 'count'.  Thanks.

> 
> > +{
> > +	unsigned long pamt_npages = 0;
> > +	int i;
> > +
> > +	for (i = 0; i < tdmr_num; i++) {
> > +		unsigned long pfn, npages;
> > +
> > +		tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn,
> > &npages);
> > +		pamt_npages += npages;
> > +	}
> > +
> > +	return pamt_npages;
> > +}
> > +
> >  /*
> >   * Construct an array of TDMRs to cover all memory regions in memblock.
> >   * This makes sure all pages managed by the page allocator are TDX
> > @@ -572,8 +762,13 @@ static int construct_tdmrs_memeblock(struct tdmr_info
> > *tdmr_array,
> >  	if (ret)
> >  		goto err;
> >  
> > +	ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
> > +	if (ret)
> > +		goto err;
> > +
> >  	/* Return -EINVAL until constructing TDMRs is done */
> >  	ret = -EINVAL;
> > +	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
> >  err:
> >  	return ret;
> >  }
> > @@ -644,6 +839,11 @@ static int init_tdx_module(void)
> >  	 * process are done.
> >  	 */
> >  	ret = -EINVAL;
> > +	if (ret)
> > +		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> > +	else
> > +		pr_info("%lu pages allocated for PAMT.\n",
> > +				tdmrs_get_pamt_pages(tdmr_array,
> > tdmr_num));
> >  out_free_tdmrs:
> >  	/*
> >  	 * The array of TDMRs is freed no matter the initialization is
> 
> The rest looks OK.

Thanks.

-- 
Thanks,
-Kai




* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-27 10:31     ` Kai Huang
@ 2022-06-27 20:41       ` Dave Hansen
  2022-06-27 22:50         ` Kai Huang
  2022-06-28  0:48         ` Xiaoyao Li
  0 siblings, 2 replies; 114+ messages in thread
From: Dave Hansen @ 2022-06-27 20:41 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/27/22 03:31, Kai Huang wrote:
>>> +/* Page sizes supported by TDX */
>>> +enum tdx_page_sz {
>>> +	TDX_PG_4K,
>>> +	TDX_PG_2M,
>>> +	TDX_PG_1G,
>>> +	TDX_PG_MAX,
>>> +};
>> Are these the same constants as the magic numbers in Kirill's
>> try_accept_one()?
> try_accept_once() uses 'enum pg_level' PG_LEVEL_{4K,2M,1G} directly.  They can
> be used directly too, but 'enum pg_level' has more than we need here:

I meant this:

+       switch (level) {
+       case PG_LEVEL_4K:
+               page_size = 0;
+               break;

Because TDX_PG_4K==page_size==0, and for this:

+       case PG_LEVEL_2M:
+               page_size = 1;

where TDX_PG_2M==page_size==1

See?

Are Kirill's magic 0/1/2 numbers the same as

	TDX_PG_4K,
	TDX_PG_2M,
	TDX_PG_1G,

?


* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-27  5:26     ` Kai Huang
@ 2022-06-27 20:46       ` Dave Hansen
  2022-06-27 22:34         ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-27 20:46 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/26/22 22:26, Kai Huang wrote:
> On Fri, 2022-06-24 at 11:50 -0700, Dave Hansen wrote:
>> So, the last patch was called:
>>
>> 	Implement SEAMCALL function
>>
>> and yet, in this patch, we have a "seamcall()" function.  That's a bit
>> confusing and not covered at *all* in this subject.
>>
>> Further, seamcall() is the *ONLY* caller of __seamcall() that I see in
>> this series.  That makes its presence here even more odd.
>>
>> The seamcall() bits should either be in their own patch, or mashed in
>> with __seamcall().
> 
> Right.  The reason I didn't put the seamcall() into previous patch was it is
> only used in this tdx.c, so it should be static.  But adding a static function
> w/o using it in previous patch will trigger a compile warning.  So I introduced
> here where it is first used.
> 
> One option is I can introduce seamcall() as a static inline function in tdx.h in
> previous patch so there won't be a warning.  I'll change to use this approach.
> Please let me know if you have any comments.

Does a temporary __unused get rid of the warning?

>>>  /*
>>>   * Detect and initialize the TDX module.
>>>   *
>>> @@ -138,7 +195,10 @@ static int init_tdx_module(void)
>>>  
>>>  static void shutdown_tdx_module(void)
>>>  {
>>> -	/* TODO: Shut down the TDX module */
>>> +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
>>> +
>>> +	seamcall_on_each_cpu(&sc);
>>> +
>>>  	tdx_module_status = TDX_MODULE_SHUTDOWN;
>>>  }
>>>  
>>> @@ -221,6 +281,9 @@ bool platform_tdx_enabled(void)
>>>   * CPU hotplug is temporarily disabled internally to prevent any cpu
>>>   * from going offline.
>>>   *
>>> + * Caller also needs to guarantee all CPUs are in VMX operation during
>>> + * this function, otherwise Oops may be triggered.
>>
>> I would *MUCH* rather have this be a:
>>
>> 	if (!cpu_feature_enabled(X86_FEATURE_VMX))
>> 		WARN_ONCE("VMX should be on blah blah\n");
>>
>> than just plain oops.  Even a pr_err() that preceded the oops would be
>> nicer than an oops that someone has to go decode and then grumble when
>> their binutils is too old that it can't disassemble the TDCALL.
> 
> I can add this to seamcall():
> 
> 	/*
> 	 * SEAMCALL requires CPU being in VMX operation otherwise it causes
> #UD.
> 	 * Sanity check and return early to avoid Oops.  Note cpu_vmx_enabled()
> 	 * actually only checks whether VMX is enabled but doesn't check
> whether
> 	 * CPU is in VMX operation (VMXON is done).  There's no way to check
> 	 * whether VMXON has been done, but currently enabling VMX and doing
> 	 * VMXON are always done together.
> 	 */
> 	if (!cpu_vmx_enabled())	 {
> 		WARN_ONCE("CPU is not in VMX operation before making
> SEAMCALL");
> 		return -EINVAL;
> 	}
> 
> The reason I didn't do is I'd like to make seamcall() simple, that it only
> returns TDX_SEAMCALL_VMFAILINVALID or the actual SEAMCALL leaf error.  With
> above, this function also returns kernel error code, which isn't good.

I think you're missing the point.  You wasted two lines of code on a
*COMMENT* that doesn't actually help anyone decode an oops.  You could
have, instead, spent two lines on actual code that would have been just
as good or better than a comment *AND* help folks looking at an oops.

It's almost always better to do something actionable in code than to
comment it, unless it's in some crazy fast path.

> Alternatively, we can always add EXTABLE to TDX_MODULE_CALL macro to handle #UD
> and #GP by returning dedicated error codes (please also see my reply to previous
> patch for the code needed to handle), in which case we don't need such check
> here.
> 
> Always handling #UD in TDX_MODULE_CALL macro also has another advantage:  there
> will be no Oops for #UD regardless the issue that "there's no way to check
> whether VMXON has been done" in the above comment.
> 
> What's your opinion?

I think you should explore using the EXTABLE.  Let's see how it looks.


* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-27  5:23     ` Kai Huang
@ 2022-06-27 20:58       ` Dave Hansen
  2022-06-27 22:10         ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-27 20:58 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/26/22 22:23, Kai Huang wrote:
> On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
>> On 6/22/22 04:16, Kai Huang wrote:
>>> SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
>>> CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
>>> SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
>>> before calling __seamcall().
>>
>> I was trying to make the argument earlier that you don't need *ANY*
>> detection for TDX, other than the ability to make a SEAMCALL.
>> Basically, patch 01/22 could go away.
...
>> So what does patch 01/22 buy us?  One EXTABLE entry?
> 
> There are below pros if we can detect whether TDX is enabled by BIOS during boot
> before initializing the TDX Module:
> 
> 1) There are requirements from customers to report whether platform supports TDX
> and the TDX keyID numbers before initializing the TDX module so the userspace
> cloud software can use this information to do something.  Sorry I cannot find
> the lore link now.

<sigh>

Never listen to customers literally.  It'll just lead you down the wrong
path.  They told you, "we need $FOO in dmesg" and you ran with it
without understanding why.  The fact that you even *need* to find the
lore link is because you didn't bother to realize what they really needed.

dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
do it with a *REAL* ABI.  Not dmesg.

> 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> managed memory hotplug.  Kexec() support patch also can use it.
> 
> Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> is enabled by BIOS, but not whether TDX module is loaded, or the result of
> initializing the TDX module.  So I think we should have some code to detect TDX
> during boot.

This is *EXACTLY* why our colleagues at Intel need to tell us about
what the OS and firmware should do when TDX is in varying states of decay.

Does the mere presence of the TDX module prevent hotplug?  Or, if a
system has the TDX module loaded but no intent to ever use TDX, why
can't it just use hotplug like a normal system which is not addled with
the TDX albatross around its neck?

> Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have significantly less
> code comparing to detecting TDX during boot:

It depends on a bunch of things.  It might only be a line or two of
assembly.

If you actually went and tried it, you might be able to convince me it's
a bad idea.


* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-27 20:58       ` Dave Hansen
@ 2022-06-27 22:10         ` Kai Huang
  2022-07-19 19:39           ` Dan Williams
  2022-07-20 10:18           ` Kai Huang
  0 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-27 22:10 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote:
> On 6/26/22 22:23, Kai Huang wrote:
> > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> > > On 6/22/22 04:16, Kai Huang wrote:
> > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > > > CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
> > > > SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
> > > > before calling __seamcall().
> > > 
> > > I was trying to make the argument earlier that you don't need *ANY*
> > > detection for TDX, other than the ability to make a SEAMCALL.
> > > Basically, patch 01/22 could go away.
> ...
> > > So what does patch 01/22 buy us?  One EXTABLE entry?
> > 
> > There are below pros if we can detect whether TDX is enabled by BIOS during boot
> > before initializing the TDX Module:
> > 
> > 1) There are requirements from customers to report whether platform supports TDX
> > and the TDX keyID numbers before initializing the TDX module so the userspace
> > cloud software can use this information to do something.  Sorry I cannot find
> > the lore link now.
> 
> <sigh>
> 
> Never listen to customers literally.  It'll just lead you down the wrong
> path.  They told you, "we need $FOO in dmesg" and you ran with it
> without understanding why.  The fact that you even *need* to find the
> lore link is because you didn't bother to realize what they really needed.
> 
> dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
> do it with a *REAL* ABI.  Not dmesg.

Showing it in dmesg is the first step, but later we plan to expose keyID
info via /sysfs.  Of course, it's arguable whether such a customer
requirement is absolutely needed, but to me it's still a good thing to have
code to detect TDX during boot.  The code isn't complicated, as you can see.

> 
> > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> > managed memory hotplug.  Kexec() support patch also can use it.
> > 
> > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> > is enabled by BIOS, but not whether TDX module is loaded, or the result of
> > initializing the TDX module.  So I think we should have some code to detect TDX
> > during boot.
> 
> This is *EXACTLY* why our colleagues at Intel need to tell us about
> what the OS and firmware should do when TDX is in varying states of decay.

Yes I am working on it to make it public.

> 
> Does the mere presence of the TDX module prevent hotplug?  
> 

For ACPI CPU hotplug, yes.  The TDX module doesn't even need to be loaded;
whether SEAMRR is enabled is what determines it.

For ACPI memory hotplug, in practice yes.  For architectural behaviour, I'll
work with others internally to get some public statement.

> Or, if a
> system has the TDX module loaded but no intent to ever use TDX, why
> can't it just use hotplug like a normal system which is not addled with
> the TDX albatross around its neck?

I think if a machine has enabled TDX in the BIOS, the user of the machine very
likely has intention to actually use TDX.

Yes, for driver-managed memory hotplug, if the user doesn't want to use
TDX then it makes sense not to disable it.  But to me it's also not a
disaster if we just disable driver-managed memory hotplug when TDX is
enabled by BIOS.

For ACPI memory hotplug, I think in practice we can treat it as BIOS bug, but
I'll get some public statement around this.

> 
> > Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have significantly less
> > code comparing to detecting TDX during boot:
> 
> It depends on a bunch of things.  It might only be a line or two of
> assembly.
> 
> If you actually went and tried it, you might be able to convince me it's
> a bad idea.

The code I showed is basically the patch we need to call SEAMCALL at runtime w/o
detecting TDX at first.  #GP must be handled as it is what SEAMCALL triggers if
TDX is not enabled.  #UD happens when CPU isn't in VMX operation, and we should
distinguish it from #GP if we already want to handle #GP.


-- 
Thanks,
-Kai




* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-27 20:46       ` Dave Hansen
@ 2022-06-27 22:34         ` Kai Huang
  2022-06-27 22:56           ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-27 22:34 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-06-27 at 13:46 -0700, Dave Hansen wrote:
> On 6/26/22 22:26, Kai Huang wrote:
> > On Fri, 2022-06-24 at 11:50 -0700, Dave Hansen wrote:
> > > So, the last patch was called:
> > > 
> > > 	Implement SEAMCALL function
> > > 
> > > and yet, in this patch, we have a "seamcall()" function.  That's a bit
> > > confusing and not covered at *all* in this subject.
> > > 
> > > Further, seamcall() is the *ONLY* caller of __seamcall() that I see in
> > > this series.  That makes its presence here even more odd.
> > > 
> > > The seamcall() bits should either be in their own patch, or mashed in
> > > with __seamcall().
> > 
> > Right.  The reason I didn't put the seamcall() into previous patch was it is
> > only used in this tdx.c, so it should be static.  But adding a static function
> > w/o using it in previous patch will trigger a compile warning.  So I introduced
> > here where it is first used.
> > 
> > One option is I can introduce seamcall() as a static inline function in tdx.h in
> > previous patch so there won't be a warning.  I'll change to use this approach.
> > Please let me know if you have any comments.
> 
> Does a temporary __unused get rid of the warning?

Yes, and both __maybe_unused and __always_unused get rid of the warning
too.

__unused is not defined in compiler_attributes.h, so we need to use
__attribute__((__unused__)) explicitly, or have __unused defined to it as a
macro.

I think I can just use __always_unused for this purpose?

So I think we put seamcall() implementation to the patch which implements
__seamcall().  And we can inline for seamcall() and put it in either tdx.h or
tdx.c, or we can use __always_unused  (or the one you prefer) to get rid of the
warning.

What's your opinion?
> 
> > > >  /*
> > > >   * Detect and initialize the TDX module.
> > > >   *
> > > > @@ -138,7 +195,10 @@ static int init_tdx_module(void)
> > > >  
> > > >  static void shutdown_tdx_module(void)
> > > >  {
> > > > -	/* TODO: Shut down the TDX module */
> > > > +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> > > > +
> > > > +	seamcall_on_each_cpu(&sc);
> > > > +
> > > >  	tdx_module_status = TDX_MODULE_SHUTDOWN;
> > > >  }
> > > >  
> > > > @@ -221,6 +281,9 @@ bool platform_tdx_enabled(void)
> > > >   * CPU hotplug is temporarily disabled internally to prevent any cpu
> > > >   * from going offline.
> > > >   *
> > > > + * Caller also needs to guarantee all CPUs are in VMX operation during
> > > > + * this function, otherwise Oops may be triggered.
> > > 
> > > I would *MUCH* rather have this be a:
> > > 
> > > 	if (!cpu_feature_enabled(X86_FEATURE_VMX))
> > > 		WARN_ONCE("VMX should be on blah blah\n");
> > > 
> > > than just plain oops.  Even a pr_err() that preceded the oops would be
> > > nicer than an oops that someone has to go decode and then grumble when
> > > their binutils is too old that it can't disassemble the TDCALL.
> > 
> > I can add this to seamcall():
> > 
> > 	/*
> > 	 * SEAMCALL requires CPU being in VMX operation otherwise it causes
> > #UD.
> > 	 * Sanity check and return early to avoid Oops.  Note cpu_vmx_enabled()
> > 	 * actually only checks whether VMX is enabled but doesn't check
> > whether
> > 	 * CPU is in VMX operation (VMXON is done).  There's no way to check
> > 	 * whether VMXON has been done, but currently enabling VMX and doing
> > 	 * VMXON are always done together.
> > 	 */
> > 	if (!cpu_vmx_enabled())	 {
> > 		WARN_ONCE("CPU is not in VMX operation before making
> > SEAMCALL");
> > 		return -EINVAL;
> > 	}
> > 
> > The reason I didn't do is I'd like to make seamcall() simple, that it only
> > returns TDX_SEAMCALL_VMFAILINVALID or the actual SEAMCALL leaf error.  With
> > above, this function also returns kernel error code, which isn't good.
> 
> I think you're missing the point.  You wasted two lines of code on a
> *COMMENT* that doesn't actually help anyone decode an oops.  You could
> have, instead, spent two lines on actual code that would have been just
> as good or better than a comment *AND* help folks looking at an oops.
> 
> It's almost always better to do something actionable in code than to
> comment it, unless it's in some crazy fast path.

Agreed.  Thanks.

> 
> > Alternatively, we can always add EXTABLE to TDX_MODULE_CALL macro to handle #UD
> > and #GP by returning dedicated error codes (please also see my reply to previous
> > patch for the code needed to handle), in which case we don't need such check
> > here.
> > 
> > Always handling #UD in TDX_MODULE_CALL macro also has another advantage:  there
> > will be no Oops for #UD regardless the issue that "there's no way to check
> > whether VMXON has been done" in the above comment.
> > 
> > What's your opinion?
> 
> I think you should explore using the EXTABLE.  Let's see how it looks.

I tried to write the code before.  I didn't test it, but it should look
something like the below.  Any comments?

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4b75c930fa1b..4a97ca8eb14c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,7 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>

+#ifdef CONFIG_INTEL_TDX_HOST
 /*
  * SW-defined error codes.
  *
@@ -18,6 +19,21 @@
 #define TDX_SW_ERROR                   (TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID     (TDX_SW_ERROR | _UL(0xFFFF0000))

+/*
+ * Special error codes to indicate SEAMCALL #GP and #UD.
+ *
+ * SEAMCALL causes #GP when SEAMRR is not properly enabled by BIOS, and
+ * causes #UD when CPU is not in VMX operation.  Define two separate
+ * error codes to distinguish the two cases so caller can be aware of
+ * what caused the SEAMCALL to fail.
+ *
+ * Bits 61:48 are reserved bits which will never be set by the TDX
+ * module.  Borrow 2 reserved bits to represent #GP and #UD.
+ */
+#define TDX_SEAMCALL_GP                (TDX_ERROR | GENMASK_ULL(48, 48))
+#define TDX_SEAMCALL_UD                (TDX_ERROR | GENMASK_ULL(49, 49))
+#endif
+
 #ifndef __ASSEMBLY__

 /*
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..7431c47258d9 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>

 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
        /* Leave input param 2 in RDX */

        .if \host
+1:
        seamcall
        /*
         * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,9 +59,25 @@
         * This value will never be used as actual SEAMCALL error code as
         * it is from the Reserved status code class.
         */
-       jnc .Lno_vmfailinvalid
+       jnc .Lseamcall_out
        mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+       jmp .Lseamcall_out
+2:
+       /*
+        * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
+        * the trap number.  Check the trap number and set up the return
+        * value to %rax.
+        */
+       cmp $X86_TRAP_GP, %eax
+       je .Lseamcall_gp
+       mov $TDX_SEAMCALL_UD, %rax
+       jmp .Lseamcall_out
+.Lseamcall_gp:
+       mov $TDX_SEAMCALL_GP, %rax
+       jmp .Lseamcall_out
+
+       _ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:







* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-27 20:41       ` Dave Hansen
@ 2022-06-27 22:50         ` Kai Huang
  2022-06-27 22:57           ` Dave Hansen
  2022-06-28  0:48         ` Xiaoyao Li
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-27 22:50 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-06-27 at 13:41 -0700, Dave Hansen wrote:
> On 6/27/22 03:31, Kai Huang wrote:
> > > > +/* Page sizes supported by TDX */
> > > > +enum tdx_page_sz {
> > > > +	TDX_PG_4K,
> > > > +	TDX_PG_2M,
> > > > +	TDX_PG_1G,
> > > > +	TDX_PG_MAX,
> > > > +};
> > > Are these the same constants as the magic numbers in Kirill's
> > > try_accept_one()?
> > try_accept_once() uses 'enum pg_level' PG_LEVEL_{4K,2M,1G} directly.  They can
> > be used directly too, but 'enum pg_level' has more than we need here:
> 
> I meant this:
> 
> +       switch (level) {
> +       case PG_LEVEL_4K:
> +               page_size = 0;
> +               break;
> 
> Because TDX_PG_4K==page_size==0, and for this:
> 
> +       case PG_LEVEL_2M:
> +               page_size = 1;
> 
> where TDX_PG_2M==page_size==1
> 
> See?
> 
> Are Kirill's magic 0/1/2 numbers the same as
> 
> 	TDX_PG_4K,
> 	TDX_PG_2M,
> 	TDX_PG_1G,
> 
> ?

Yes they are the same.  Kirill uses 0/1/2 as input of TDX_ACCEPT_PAGE TDCALL. 
Here I only need them to distinguish different page sizes.

Do you mean we should put TDX_PG_4K/2M/1G definition to asm/tdx.h, and
try_accept_one() should use them instead of magic 0/1/2?


-- 
Thanks,
-Kai




* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-27 22:34         ` Kai Huang
@ 2022-06-27 22:56           ` Dave Hansen
  2022-06-27 23:59             ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-27 22:56 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/27/22 15:34, Kai Huang wrote:
> On Mon, 2022-06-27 at 13:46 -0700, Dave Hansen wrote:
> I think I can just use __always_unused for this purpose?
> 
> So I think we put seamcall() implementation to the patch which implements
> __seamcall().  And we can inline for seamcall() and put it in either tdx.h or
> tdx.c, or we can use __always_unused  (or the one you prefer) to get rid of the
> warning.
> 
> What's your opinion?

A temporary __always_unused seems fine to me.

>>> Alternatively, we can always add EXTABLE to TDX_MODULE_CALL macro to handle #UD
>>> and #GP by returning dedicated error codes (please also see my reply to previous
>>> patch for the code needed to handle), in which case we don't need such check
>>> here.
>>>
>>> Always handling #UD in TDX_MODULE_CALL macro also has another advantage:  there
>>> will be no Oops for #UD regardless the issue that "there's no way to check
>>> whether VMXON has been done" in the above comment.
>>>
>>> What's your opinion?
>>
>> I think you should explore using the EXTABLE.  Let's see how it looks.
> 
> I tried to write the code before.  I didn't test it, but it should look
> something like the below.  Any comments?
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4b75c930fa1b..4a97ca8eb14c 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,6 +8,7 @@
>  #include <asm/ptrace.h>
>  #include <asm/shared/tdx.h>
> 
> +#ifdef CONFIG_INTEL_TDX_HOST
>  /*
>   * SW-defined error codes.
>   *
> @@ -18,6 +19,21 @@
>  #define TDX_SW_ERROR                   (TDX_ERROR | GENMASK_ULL(47, 40))
>  #define TDX_SEAMCALL_VMFAILINVALID     (TDX_SW_ERROR | _UL(0xFFFF0000))
> 
> +/*
> + * Special error codes to indicate SEAMCALL #GP and #UD.
> + *
> + * SEAMCALL causes #GP when SEAMRR is not properly enabled by BIOS, and
> + * causes #UD when CPU is not in VMX operation.  Define two separate
> + * error codes to distinguish the two cases so caller can be aware of
> + * what caused the SEAMCALL to fail.
> + *
> + * Bits 61:48 are reserved bits which will never be set by the TDX
> + * module.  Borrow 2 reserved bits to represent #GP and #UD.
> + */
> +#define TDX_SEAMCALL_GP                (TDX_ERROR | GENMASK_ULL(48, 48))
> +#define TDX_SEAMCALL_UD                (TDX_ERROR | GENMASK_ULL(49, 49))
> +#endif
> +
>  #ifndef __ASSEMBLY__
> 
>  /*
> diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> index 49a54356ae99..7431c47258d9 100644
> --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #include <asm/asm-offsets.h>
>  #include <asm/tdx.h>
> +#include <asm/asm.h>
> 
>  /*
>   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> @@ -45,6 +46,7 @@
>         /* Leave input param 2 in RDX */
> 
>         .if \host
> +1:
>         seamcall
>         /*
>          * SEAMCALL instruction is essentially a VMExit from VMX root
> @@ -57,9 +59,25 @@
>          * This value will never be used as actual SEAMCALL error code as
>          * it is from the Reserved status code class.
>          */
> -       jnc .Lno_vmfailinvalid
> +       jnc .Lseamcall_out
>         mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> -.Lno_vmfailinvalid:
> +       jmp .Lseamcall_out
> +2:
> +       /*
> +        * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
> +        * the trap number.  Check the trap number and set up the return
> +        * value to %rax.
> +        */
> +       cmp $X86_TRAP_GP, %eax
> +       je .Lseamcall_gp
> +       mov $TDX_SEAMCALL_UD, %rax
> +       jmp .Lseamcall_out
> +.Lseamcall_gp:
> +       mov $TDX_SEAMCALL_GP, %rax
> +       jmp .Lseamcall_out
> +
> +       _ASM_EXTABLE_FAULT(1b, 2b)
> +.Lseamcall_out:

Not too bad, although the end of that is a bit ugly.  It would be nicer
if you could just return the %rax value in the exception section instead
of having to do the transform there.  Maybe have a TDX_ERROR code with
enough bits to hold any X86_TRAP_FOO.

It'd be nice if Peter Z or Andy L has a sec to look at this.  Seems like
the kind of thing they'd have good ideas about.


* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-27 22:50         ` Kai Huang
@ 2022-06-27 22:57           ` Dave Hansen
  2022-06-27 23:05             ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-27 22:57 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/27/22 15:50, Kai Huang wrote:
>> Are Kirill's magic 0/1/2 numbers the same as
>>
>> 	TDX_PG_4K,
>> 	TDX_PG_2M,
>> 	TDX_PG_1G,
>>
>> ?
> Yes they are the same.  Kirill uses 0/1/2 as input of TDX_ACCEPT_PAGE TDCALL. 
> Here I only need them to distinguish different page sizes.
> 
> Do you mean we should put TDX_PG_4K/2M/1G definition to asm/tdx.h, and
> try_accept_one() should use them instead of magic 0/1/2?

I honestly don't care how you do it as long as the magic numbers go away
(within reason).


* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-27 22:57           ` Dave Hansen
@ 2022-06-27 23:05             ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-27 23:05 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-06-27 at 15:57 -0700, Dave Hansen wrote:
> On 6/27/22 15:50, Kai Huang wrote:
> > > Are Kirill's magic 0/1/2 numbers the same as
> > > 
> > > 	TDX_PG_4K,
> > > 	TDX_PG_2M,
> > > 	TDX_PG_1G,
> > > 
> > > ?
> > Yes they are the same.  Kirill uses 0/1/2 as input of TDX_ACCEPT_PAGE TDCALL. 
> > Here I only need them to distinguish different page sizes.
> > 
> > Do you mean we should put TDX_PG_4K/2M/1G definition to asm/tdx.h, and
> > try_accept_one() should use them instead of magic 0/1/2?
> 
> I honestly don't care how you do it as long as the magic numbers go away
> (within reason).

OK.  I'll write a patch to replace 0/1/2 magic numbers in try_accept_one().

-- 
Thanks,
-Kai




* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-27 22:56           ` Dave Hansen
@ 2022-06-27 23:59             ` Kai Huang
  2022-06-28  0:03               ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-27 23:59 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-06-27 at 15:56 -0700, Dave Hansen wrote:
> On 6/27/22 15:34, Kai Huang wrote:
> > On Mon, 2022-06-27 at 13:46 -0700, Dave Hansen wrote:
> > I think I can just use __always_unused for this purpose?
> > 
> > So I think we put seamcall() implementation to the patch which implements
> > __seamcall().  And we can inline for seamcall() and put it in either tdx.h or
> > tdx.c, or we can use __always_unused  (or the one you prefer) to get rid of the
> > warning.
> > 
> > What's your opinion?
> 
> A temporary __always_unused seems fine to me.

Thanks will do.

> 
> > > > Alternatively, we can always add EXTABLE to TDX_MODULE_CALL macro to handle #UD
> > > > and #GP by returning dedicated error codes (please also see my reply to previous
> > > > patch for the code needed to handle), in which case we don't need such check
> > > > here.
> > > > 
> > > > Always handling #UD in TDX_MODULE_CALL macro also has another advantage:  there
> > > > will be no Oops for #UD regardless the issue that "there's no way to check
> > > > whether VMXON has been done" in the above comment.
> > > > 
> > > > What's your opinion?
> > > 
> > > I think you should explore using the EXTABLE.  Let's see how it looks.
> > 
> > I tried to wrote the code before.  I didn't test but it should look like to
> > something below.  Any comments?
> > 
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 4b75c930fa1b..4a97ca8eb14c 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -8,6 +8,7 @@
> >  #include <asm/ptrace.h>
> >  #include <asm/shared/tdx.h>
> > 
> > +#ifdef CONFIG_INTEL_TDX_HOST
> >  /*
> >   * SW-defined error codes.
> >   *
> > @@ -18,6 +19,21 @@
> >  #define TDX_SW_ERROR                   (TDX_ERROR | GENMASK_ULL(47, 40))
> >  #define TDX_SEAMCALL_VMFAILINVALID     (TDX_SW_ERROR | _UL(0xFFFF0000))
> > 
> > +/*
> > + * Special error codes to indicate SEAMCALL #GP and #UD.
> > + *
> > + * SEAMCALL causes #GP when SEAMRR is not properly enabled by BIOS, and
> > + * causes #UD when CPU is not in VMX operation.  Define two separate
> > + * error codes to distinguish the two cases so caller can be aware of
> > + * what caused the SEAMCALL to fail.
> > + *
> > + * Bits 61:48 are reserved bits which will never be set by the TDX
> > + * module.  Borrow 2 reserved bits to represent #GP and #UD.
> > + */
> > +#define TDX_SEAMCALL_GP                (TDX_ERROR | GENMASK_ULL(48, 48))
> > +#define TDX_SEAMCALL_UD                (TDX_ERROR | GENMASK_ULL(49, 49))
> > +#endif
> > +
> >  #ifndef __ASSEMBLY__
> > 
> >  /*
> > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > index 49a54356ae99..7431c47258d9 100644
> > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > @@ -1,6 +1,7 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #include <asm/asm-offsets.h>
> >  #include <asm/tdx.h>
> > +#include <asm/asm.h>
> > 
> >  /*
> >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > @@ -45,6 +46,7 @@
> >         /* Leave input param 2 in RDX */
> > 
> >         .if \host
> > +1:
> >         seamcall
> >         /*
> >          * SEAMCALL instruction is essentially a VMExit from VMX root
> > @@ -57,9 +59,25 @@
> >          * This value will never be used as actual SEAMCALL error code as
> >          * it is from the Reserved status code class.
> >          */
> > -       jnc .Lno_vmfailinvalid
> > +       jnc .Lseamcall_out
> >         mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> > -.Lno_vmfailinvalid:
> > +       jmp .Lseamcall_out
> > +2:
> > +       /*
> > +        * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
> > +        * the trap number.  Check the trap number and set up the return
> > +        * value to %rax.
> > +        */
> > +       cmp $X86_TRAP_GP, %eax
> > +       je .Lseamcall_gp
> > +       mov $TDX_SEAMCALL_UD, %rax
> > +       jmp .Lseamcall_out
> > +.Lseamcall_gp:
> > +       mov $TDX_SEAMCALL_GP, %rax
> > +       jmp .Lseamcall_out
> > +
> > +       _ASM_EXTABLE_FAULT(1b, 2b)
> > +.Lseamcall_out:
> 
> Not too bad, although the end of that is a bit ugly.  It would be nicer
> if you could just return the %rax value in the exception section instead
> of having to do the transform there.  Maybe have a TDX_ERROR code with
> enough bits to hold any X86_TRAP_FOO.

We have already declared that bits 47:40 == 0xFF are never used by the TDX module:

/*
 * SW-defined error codes.
 *
 * Bits 47:40 == 0xFF indicate Reserved status code class that never used by
 * TDX module.
 */
#define TDX_ERROR                       _BITUL(63)
#define TDX_SW_ERROR                    (TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID      (TDX_SW_ERROR | _UL(0xFFFF0000))

So how about just putting the X86_TRAP_FOO in the low 32 bits?  We only have 32
traps, so 32 bits is more than enough.

#define TDX_SEAMCALL_GP		(TDX_SW_ERROR | X86_TRAP_GP)
#define TDX_SEAMCALL_UD		(TDX_SW_ERROR | X86_TRAP_UD)

If so, in the assembly, I think we can just XOR TDX_SW_ERROR into %rax and
return %rax:

2:
        /*
	 * SEAMCALL caused #GP or #UD.  When we reach here, %eax contains
	 * the trap number.  Convert the trap number to a TDX error code by
	 * setting TDX_SW_ERROR in the high 32 bits of %rax.
	 */
	xorq	$TDX_SW_ERROR, %rax

How does this look?



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-27 23:59             ` Kai Huang
@ 2022-06-28  0:03               ` Dave Hansen
  2022-06-28  0:11                 ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-28  0:03 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/27/22 16:59, Kai Huang wrote:
> If so,  in the assembly, I think we can just XOR TDX_SW_ERROR to the %rax and
> return %rax:
> 
> 2:
>         /*
> 	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
> 	 * the trap number.  Convert trap number to TDX error code by setting
> 	 * TDX_SW_ERROR to the high 32-bits of %rax.
> 	 */
> 	xorq	$TDX_SW_ERROR, %rax
> 
> How does this look?

I guess it doesn't matter if you know the things being masked together
are padded correctly, but I probably would have done a straight OR, not XOR.

Otherwise, I think that looks OK.  Simplifies the assembly for sure.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error
  2022-06-28  0:03               ` Dave Hansen
@ 2022-06-28  0:11                 ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-28  0:11 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-06-27 at 17:03 -0700, Dave Hansen wrote:
> On 6/27/22 16:59, Kai Huang wrote:
> > If so,  in the assembly, I think we can just XOR TDX_SW_ERROR to the %rax and
> > return %rax:
> > 
> > 2:
> >         /*
> > 	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
> > 	 * the trap number.  Convert trap number to TDX error code by setting
> > 	 * TDX_SW_ERROR to the high 32-bits of %rax.
> > 	 */
> > 	xorq	$TDX_SW_ERROR, %rax
> > 
> > How does this look?
> 
> I guess it doesn't matter if you know the things being masked together
> are padded correctly, but I probably would have done a straight OR, not XOR.
> 
> Otherwise, I think that looks OK.  Simplifies the assembly for sure.

Right, a straight OR is better.  Thanks.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-27 20:41       ` Dave Hansen
  2022-06-27 22:50         ` Kai Huang
@ 2022-06-28  0:48         ` Xiaoyao Li
  2022-06-28 17:03           ` Dave Hansen
  1 sibling, 1 reply; 114+ messages in thread
From: Xiaoyao Li @ 2022-06-28  0:48 UTC (permalink / raw)
  To: Dave Hansen, Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/28/2022 4:41 AM, Dave Hansen wrote:
> On 6/27/22 03:31, Kai Huang wrote:
>>>> +/* Page sizes supported by TDX */
>>>> +enum tdx_page_sz {
>>>> +	TDX_PG_4K,
>>>> +	TDX_PG_2M,
>>>> +	TDX_PG_1G,
>>>> +	TDX_PG_MAX,
>>>> +};
>>> Are these the same constants as the magic numbers in Kirill's
>>> try_accept_one()?
>> try_accept_once() uses 'enum pg_level' PG_LEVEL_{4K,2M,1G} directly.  They can
>> be used directly too, but 'enum pg_level' has more than we need here:
> 
> I meant this:
> 
> +       switch (level) {
> +       case PG_LEVEL_4K:
> +               page_size = 0;
> +               break;
> 
> Because TDX_PG_4K==page_size==0, and for this:
> 
> +       case PG_LEVEL_2M:
> +               page_size = 1;

here we can just do

	page_size = level - 1;

or
	
	tdx_page_level = level - 1;

Yes, TDX's page-level definition is one level smaller than Linux's definition.

> where TDX_PG_2M==page_size==1
> 
> See?
> 
> Are Kirill's magic 0/1/2 numbers the same as
> 
> 	TDX_PG_4K,
> 	TDX_PG_2M,
> 	TDX_PG_1G,
> 
> ?




^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-27  8:01       ` Igor Mammedov
@ 2022-06-28 10:04         ` Kai Huang
  2022-06-28 11:52           ` Igor Mammedov
  2022-06-28 17:33           ` Rafael J. Wysocki
  0 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-28 10:04 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky, Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld,
	Juri Lelli, Mark Rutland, Frederic Weisbecker, Yue Haibing,
	dongli.zhang

On Mon, 2022-06-27 at 10:01 +0200, Igor Mammedov wrote:
> On Thu, 23 Jun 2022 12:01:48 +1200
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > On Wed, 2022-06-22 at 13:42 +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:  
> > > > 
> > > > Platforms with confidential computing technology may not support ACPI
> > > > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > > > include Intel platforms which support Intel Trust Domain Extensions
> > > > (TDX).
> > > > 
> > > > If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> > > > bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> > > > bug and reject the new CPU.  For hot-removal, for simplicity just assume
> > > > the kernel cannot continue to work normally, and BUG().
> > > > 
> > > > Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
> > > > platform doesn't support ACPI CPU hotplug, so that kernel can handle
> > > > ACPI CPU hotplug events for such platform.  The existing attribute
> > > > CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug thus doesn't fit.
> > > > 
> > > > In acpi_processor_{add|remove}(), add early check against this attribute
> > > > and handle accordingly if it is set.
> > > > 
> > > > Also take this chance to rename existing CC_ATTR_HOTPLUG_DISABLED to
> > > > CC_ATTR_CPU_HOTPLUG_DISABLED as it is for software CPU hotplug.
> > > > 
> > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > ---
> > > >  arch/x86/coco/core.c          |  2 +-
> > > >  drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
> > > >  include/linux/cc_platform.h   | 15 +++++++++++++--
> > > >  kernel/cpu.c                  |  2 +-
> > > >  4 files changed, 38 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> > > > index 4320fadae716..1bde1af75296 100644
> > > > --- a/arch/x86/coco/core.c
> > > > +++ b/arch/x86/coco/core.c
> > > > @@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > > >  {
> > > >         switch (attr) {
> > > >         case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > > -       case CC_ATTR_HOTPLUG_DISABLED:
> > > > +       case CC_ATTR_CPU_HOTPLUG_DISABLED:
> > > >         case CC_ATTR_GUEST_MEM_ENCRYPT:
> > > >         case CC_ATTR_MEM_ENCRYPT:
> > > >                 return true;
> > > > diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> > > > index 6737b1cbf6d6..b960db864cd4 100644
> > > > --- a/drivers/acpi/acpi_processor.c
> > > > +++ b/drivers/acpi/acpi_processor.c
> > > > @@ -15,6 +15,7 @@
> > > >  #include <linux/kernel.h>
> > > >  #include <linux/module.h>
> > > >  #include <linux/pci.h>
> > > > +#include <linux/cc_platform.h>
> > > > 
> > > >  #include <acpi/processor.h>
> > > > 
> > > > @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
> > > >         struct device *dev;
> > > >         int result = 0;
> > > > 
> > > > +       /*
> > > > +        * If the confidential computing platform doesn't support ACPI
> > > > +        * CPU hotplug, the BIOS should never deliver such an event to
> > > > +        * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> > > > +        * the new CPU.
> > > > +        */
> > > > +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {  
> > > 
> > > This will affect initialization, not just hotplug AFAICS.
> > > 
> > > You should reset the .hotplug.enabled flag in processor_handler to
> > > false instead.  
> > 
> > Hi Rafael,
> > 
> > Thanks for the review.  By "affect initialization" did you mean this
> > acpi_processor_add() is also called during kernel boot when any logical cpu is
> > brought up?  Or do you mean ACPI CPU hotplug can also happen during kernel boot
> > (after acpi_processor_init())?
> > 
> > I see acpi_processor_init() calls acpi_processor_check_duplicates() which calls
> > acpi_evaluate_object() but I don't know details of ACPI so I don't know whether
> > this would trigger acpi_processor_add().
> > 
> > One thing is that TDX not supporting ACPI CPU hotplug is an architectural
> > thing, so it is illegal even if it happens during kernel boot.  Dave's idea is that the kernel
> > should  speak out loudly if physical CPU hotplug indeed happened on (BIOS) TDX-
> > enabled platforms.  Otherwise perhaps we can just give up initializing the ACPI
> > CPU hotplug in acpi_processor_init(), something like below?
> 
> The thing is that by the time ACPI machinery kicks in, physical hotplug
> has already happened and in case of (kvm+qemu+ovmf hypervisor combo)
> firmware has already handled it somehow and handed it over to ACPI.
> If you say it's architectural thing then cpu hotplug is platform/firmware
> bug and should be disabled there instead of working around it in the kernel.
> 
> Perhaps instead of 'preventing' hotplug, complain/panic and be done with it.

Hi Igor,

Thanks for the feedback.  Yes, the current implementation reports CPU hot-add
as a BIOS bug.  I think I can report a BIOS bug for hot-removal too.  Currently
I use BUG() for the hot-removal case; for hot-add I don't use BUG() but reject
the new CPU, as the latter is more conservative.

Hi Rafael,

I am not sure I understand what you mean by "This will affect initialization,
not just hotplug AFAICS".  Could you elaborate a little bit?  Thanks.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-28 10:04         ` Kai Huang
@ 2022-06-28 11:52           ` Igor Mammedov
  2022-06-28 17:33           ` Rafael J. Wysocki
  1 sibling, 0 replies; 114+ messages in thread
From: Igor Mammedov @ 2022-06-28 11:52 UTC (permalink / raw)
  To: Kai Huang
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky, Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld,
	Juri Lelli, Mark Rutland, Frederic Weisbecker, Yue Haibing,
	dongli.zhang

On Tue, 28 Jun 2022 22:04:43 +1200
Kai Huang <kai.huang@intel.com> wrote:

> On Mon, 2022-06-27 at 10:01 +0200, Igor Mammedov wrote:
> > On Thu, 23 Jun 2022 12:01:48 +1200
> > Kai Huang <kai.huang@intel.com> wrote:
> >   
> > > On Wed, 2022-06-22 at 13:42 +0200, Rafael J. Wysocki wrote:  
> > > > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:    
> > > > > 
> > > > > Platforms with confidential computing technology may not support ACPI
> > > > > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > > > > include Intel platforms which support Intel Trust Domain Extensions
> > > > > (TDX).
> > > > > 
> > > > > If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> > > > > bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> > > > > bug and reject the new CPU.  For hot-removal, for simplicity just assume
> > > > > the kernel cannot continue to work normally, and BUG().
> > > > > 
> > > > > Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
> > > > > platform doesn't support ACPI CPU hotplug, so that kernel can handle
> > > > > ACPI CPU hotplug events for such platform.  The existing attribute
> > > > > CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug thus doesn't fit.
> > > > > 
> > > > > In acpi_processor_{add|remove}(), add early check against this attribute
> > > > > and handle accordingly if it is set.
> > > > > 
> > > > > Also take this chance to rename existing CC_ATTR_HOTPLUG_DISABLED to
> > > > > CC_ATTR_CPU_HOTPLUG_DISABLED as it is for software CPU hotplug.
> > > > > 
> > > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > > ---
> > > > >  arch/x86/coco/core.c          |  2 +-
> > > > >  drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
> > > > >  include/linux/cc_platform.h   | 15 +++++++++++++--
> > > > >  kernel/cpu.c                  |  2 +-
> > > > >  4 files changed, 38 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> > > > > index 4320fadae716..1bde1af75296 100644
> > > > > --- a/arch/x86/coco/core.c
> > > > > +++ b/arch/x86/coco/core.c
> > > > > @@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > > > >  {
> > > > >         switch (attr) {
> > > > >         case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > > > -       case CC_ATTR_HOTPLUG_DISABLED:
> > > > > +       case CC_ATTR_CPU_HOTPLUG_DISABLED:
> > > > >         case CC_ATTR_GUEST_MEM_ENCRYPT:
> > > > >         case CC_ATTR_MEM_ENCRYPT:
> > > > >                 return true;
> > > > > diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> > > > > index 6737b1cbf6d6..b960db864cd4 100644
> > > > > --- a/drivers/acpi/acpi_processor.c
> > > > > +++ b/drivers/acpi/acpi_processor.c
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include <linux/kernel.h>
> > > > >  #include <linux/module.h>
> > > > >  #include <linux/pci.h>
> > > > > +#include <linux/cc_platform.h>
> > > > > 
> > > > >  #include <acpi/processor.h>
> > > > > 
> > > > > @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
> > > > >         struct device *dev;
> > > > >         int result = 0;
> > > > > 
> > > > > +       /*
> > > > > +        * If the confidential computing platform doesn't support ACPI
> > > > > +        * CPU hotplug, the BIOS should never deliver such an event to
> > > > > +        * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> > > > > +        * the new CPU.
> > > > > +        */
> > > > > +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {    
> > > > 
> > > > This will affect initialization, not just hotplug AFAICS.
> > > > 
> > > > You should reset the .hotplug.enabled flag in processor_handler to
> > > > false instead.    
> > > 
> > > Hi Rafael,
> > > 
> > > Thanks for the review.  By "affect initialization" did you mean this
> > > acpi_processor_add() is also called during kernel boot when any logical cpu is
> > > brought up?  Or do you mean ACPI CPU hotplug can also happen during kernel boot
> > > (after acpi_processor_init())?
> > > 
> > > I see acpi_processor_init() calls acpi_processor_check_duplicates() which calls
> > > acpi_evaluate_object() but I don't know details of ACPI so I don't know whether
> > > this would trigger acpi_processor_add().
> > > 
> > > One thing is TDX doesn't support ACPI CPU hotplug is an architectural thing, so
> > > it is illegal even if it happens during kernel boot.  Dave's idea is the kernel
> > > should  speak out loudly if physical CPU hotplug indeed happened on (BIOS) TDX-
> > > enabled platforms.  Otherwise perhaps we can just give up initializing the ACPI
> > > CPU hotplug in acpi_processor_init(), something like below?  
> > 
> > The thing is that by the time ACPI machinery kicks in, physical hotplug
> > has already happened and in case of (kvm+qemu+ovmf hypervisor combo)
> > firmware has already handled it somehow and handed it over to ACPI.
> > If you say it's architectural thing then cpu hotplug is platform/firmware
> > bug and should be disabled there instead of working around it in the kernel.
> > 
> > Perhaps instead of 'preventing' hotplug, complain/panic and be done with it.  
> 
> Hi Igor,
> 
> Thanks for feedback.  Yes the current implementation actually reports CPU hot-
> add as BIOS bug.  I think I can report BIOS bug for hot-removal too.  And
> currently I actually used BUG() for the hot-removal case.  For hot-add I didn't
> use BUG() but rejected the new CPU as the latter is more conservative. 

Is it safe to ignore a CPU that is not properly initialized for TDX while it
sits there (it may wake up on IRQs: at minimum SMI, but maybe IPIs as well,
depending on what state the firmware left it in)?

For hypervisors, one should disable CPU hotplug there
(e.g. in QEMU, you can disable CPU hotplug completely
when TDX is enabled, so it never comes to a 'physical'
CPU being added to the guest, and no CPU-hotplug-related
ACPI AML code is generated).

> Hi Rafael,
> 
> I am not sure I got what you mean by "This will affect initialization, not just
> hotplug AFAICS", could you elaborate a little bit?  Thanks.
> 
> 


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-22 11:45   ` Rafael J. Wysocki
  2022-06-23  0:08     ` Kai Huang
@ 2022-06-28 12:01     ` Igor Mammedov
  2022-06-28 23:49       ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Igor Mammedov @ 2022-06-28 12:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Kai Huang, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky

On Wed, 22 Jun 2022 13:45:01 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
> >
> > Platforms with confidential computing technology may not support ACPI
> > memory hotplug when such technology is enabled by the BIOS.  Examples
> > include Intel platforms which support Intel Trust Domain Extensions
> > (TDX).
> >
> > If the kernel ever receives ACPI memory hotplug event, it is likely a
> > BIOS bug.  For ACPI memory hot-add, the kernel should speak out this is
> > a BIOS bug and reject the new memory.  For hot-removal, for simplicity
> > just assume the kernel cannot continue to work normally, and just BUG().
> >
> > Add a new attribute CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED to indicate the
> > platform doesn't support ACPI memory hotplug, so that kernel can handle
> > ACPI memory hotplug events for such platform.
> >
> > In acpi_memory_device_{add|remove}(), add early check against this
> > attribute and handle accordingly if it is set.
> >
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
> >  include/linux/cc_platform.h    | 10 ++++++++++
> >  2 files changed, 33 insertions(+)
> >
> > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > index 24f662d8bd39..94d6354ea453 100644
> > --- a/drivers/acpi/acpi_memhotplug.c
> > +++ b/drivers/acpi/acpi_memhotplug.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/acpi.h>
> >  #include <linux/memory.h>
> >  #include <linux/memory_hotplug.h>
> > +#include <linux/cc_platform.h>
> >
> >  #include "internal.h"
> >
> > @@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
> >         if (!device)
> >                 return -EINVAL;
> >
> > +       /*
> > +        * If the confidential computing platform doesn't support ACPI
> > +        * memory hotplug, the BIOS should never deliver such an event to
> > +        * the kernel.  Report ACPI memory hot-add as a BIOS bug and ignore
> > +        * the memory device.
> > +        */
> > +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {  
> 
> Same comment as for the acpi_processor driver: this will affect the
> initialization too and it would be cleaner to reset the
> .hotplug.enabled flag of the scan handler.

With QEMU, this is likely broken when memory is added as
  '-device pc-dimm'
on the CLI, since it is advertised only as a device node in the DSDT.

> 
> > +               dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI memory hotplug. New memory device ignored.\n");
> > +               return -EINVAL;
> > +       }
> > +
> >         mem_device = kzalloc(sizeof(struct acpi_memory_device), GFP_KERNEL);
> >         if (!mem_device)
> >                 return -ENOMEM;
> > @@ -334,6 +346,17 @@ static void acpi_memory_device_remove(struct acpi_device *device)
> >         if (!device || !acpi_driver_data(device))
> >                 return;
> >
> > +       /*
> > +        * The confidential computing platform is broken if ACPI memory
> > +        * hot-removal isn't supported but it happened anyway.  Assume
> > +        * it is not guaranteed that the kernel can continue to work
> > +        * normally.  Just BUG().
> > +        */
> > +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
> > +               dev_err(&device->dev, "Platform doesn't support ACPI memory hotplug. BUG().\n");
> > +               BUG();
> > +       }
> > +
> >         mem_device = acpi_driver_data(device);
> >         acpi_memory_remove_memory(mem_device);
> >         acpi_memory_device_free(mem_device);
> > diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> > index 9ce9256facc8..b831c24bd7f6 100644
> > --- a/include/linux/cc_platform.h
> > +++ b/include/linux/cc_platform.h
> > @@ -93,6 +93,16 @@ enum cc_attr {
> >          * Examples include TDX platform.
> >          */
> >         CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED,
> > +
> > +       /**
> > +        * @CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED: ACPI memory hotplug is
> > +        *                                        not supported.
> > +        *
> > +        * The platform/os does not support ACPI memory hotplug.
> > +        *
> > +        * Examples include TDX platform.
> > +        */
> > +       CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED,
> >  };
> >
> >  #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> > --
> > 2.36.1
> >  
> 


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-28  0:48         ` Xiaoyao Li
@ 2022-06-28 17:03           ` Dave Hansen
  0 siblings, 0 replies; 114+ messages in thread
From: Dave Hansen @ 2022-06-28 17:03 UTC (permalink / raw)
  To: Xiaoyao Li, Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/27/22 17:48, Xiaoyao Li wrote:
>>
>> I meant this:
>>
>> +       switch (level) {
>> +       case PG_LEVEL_4K:
>> +               page_size = 0;
>> +               break;
>>
>> Because TDX_PG_4K==page_size==0, and for this:
>>
>> +       case PG_LEVEL_2M:
>> +               page_size = 1;
> 
> here we can just do
> 
>     page_size = level - 1;
> 
> or
>     
>     tdx_page_level = level - 1;
> 
> yes, TDX's page level definition is one level smaller of Linux's
> definition.

Uhh.  No.

The 'page_size' is in the kernel/TDX-module ABI.  It can't change.
PG_LEVEL_* is just some random internal Linux enum.  It *CAN* change.

There's a *MASSIVE* difference between the two.  What you suggest will
probably actually work.  But, it will work accidentally and may break in
horribly confusing ways in the future.

It's the difference between hacking something together and actually
writing code that will keep working for a long time.  Please, take a
minute and reflect on this.  Please.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-28 10:04         ` Kai Huang
  2022-06-28 11:52           ` Igor Mammedov
@ 2022-06-28 17:33           ` Rafael J. Wysocki
  2022-06-28 23:41             ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Rafael J. Wysocki @ 2022-06-28 17:33 UTC (permalink / raw)
  To: Kai Huang
  Cc: Igor Mammedov, Rafael J. Wysocki, Linux Kernel Mailing List,
	kvm-devel, ACPI Devel Maling List, Sean Christopherson,
	Paolo Bonzini, Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky, Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld,
	Juri Lelli, Mark Rutland, Frederic Weisbecker, Yue Haibing,
	dongli.zhang

On Tue, Jun 28, 2022 at 12:04 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Mon, 2022-06-27 at 10:01 +0200, Igor Mammedov wrote:
> > On Thu, 23 Jun 2022 12:01:48 +1200
> > Kai Huang <kai.huang@intel.com> wrote:
> >
> > > On Wed, 2022-06-22 at 13:42 +0200, Rafael J. Wysocki wrote:
> > > > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > >
> > > > > Platforms with confidential computing technology may not support ACPI
> > > > > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > > > > include Intel platforms which support Intel Trust Domain Extensions
> > > > > (TDX).
> > > > >
> > > > > If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> > > > > bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> > > > > bug and reject the new CPU.  For hot-removal, for simplicity just assume
> > > > > the kernel cannot continue to work normally, and BUG().
> > > > >
> > > > > Add a new attribute CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED to indicate the
> > > > > platform doesn't support ACPI CPU hotplug, so that kernel can handle
> > > > > ACPI CPU hotplug events for such platform.  The existing attribute
> > > > > CC_ATTR_HOTPLUG_DISABLED is for software CPU hotplug thus doesn't fit.
> > > > >
> > > > > In acpi_processor_{add|remove}(), add early check against this attribute
> > > > > and handle accordingly if it is set.
> > > > >
> > > > > Also take this chance to rename existing CC_ATTR_HOTPLUG_DISABLED to
> > > > > CC_ATTR_CPU_HOTPLUG_DISABLED as it is for software CPU hotplug.
> > > > >
> > > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > > ---
> > > > >  arch/x86/coco/core.c          |  2 +-
> > > > >  drivers/acpi/acpi_processor.c | 23 +++++++++++++++++++++++
> > > > >  include/linux/cc_platform.h   | 15 +++++++++++++--
> > > > >  kernel/cpu.c                  |  2 +-
> > > > >  4 files changed, 38 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> > > > > index 4320fadae716..1bde1af75296 100644
> > > > > --- a/arch/x86/coco/core.c
> > > > > +++ b/arch/x86/coco/core.c
> > > > > @@ -20,7 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > > > >  {
> > > > >         switch (attr) {
> > > > >         case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > > > -       case CC_ATTR_HOTPLUG_DISABLED:
> > > > > +       case CC_ATTR_CPU_HOTPLUG_DISABLED:
> > > > >         case CC_ATTR_GUEST_MEM_ENCRYPT:
> > > > >         case CC_ATTR_MEM_ENCRYPT:
> > > > >                 return true;
> > > > > diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> > > > > index 6737b1cbf6d6..b960db864cd4 100644
> > > > > --- a/drivers/acpi/acpi_processor.c
> > > > > +++ b/drivers/acpi/acpi_processor.c
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include <linux/kernel.h>
> > > > >  #include <linux/module.h>
> > > > >  #include <linux/pci.h>
> > > > > +#include <linux/cc_platform.h>
> > > > >
> > > > >  #include <acpi/processor.h>
> > > > >
> > > > > @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
> > > > >         struct device *dev;
> > > > >         int result = 0;
> > > > >
> > > > > +       /*
> > > > > +        * If the confidential computing platform doesn't support ACPI
> > > > > +        * CPU hotplug, the BIOS should never deliver such an event to
> > > > > +        * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> > > > > +        * the new CPU.
> > > > > +        */
> > > > > +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
> > > >
> > > > This will affect initialization, not just hotplug AFAICS.
> > > >
> > > > You should reset the .hotplug.enabled flag in processor_handler to
> > > > false instead.
> > >
> > > Hi Rafael,
> > >
> > > Thanks for the review.  By "affect initialization" did you mean that
> > > acpi_processor_add() is also called during kernel boot when any logical CPU is
> > > brought up?  Or do you mean ACPI CPU hotplug can also happen during kernel boot
> > > (after acpi_processor_init())?
> > >
> > > I see acpi_processor_init() calls acpi_processor_check_duplicates(), which calls
> > > acpi_evaluate_object(), but I don't know the details of ACPI, so I don't know
> > > whether this would trigger acpi_processor_add().
> > >
> > > One thing to note is that TDX's lack of ACPI CPU hotplug support is
> > > architectural, so hotplug is illegal even if it happens during kernel boot.
> > > Dave's idea is that the kernel should complain loudly if physical CPU hotplug
> > > actually happens on (BIOS) TDX-enabled platforms.  Otherwise, perhaps we can
> > > just give up initializing ACPI CPU hotplug in acpi_processor_init(), something
> > > like below?
> >
> > The thing is that by the time the ACPI machinery kicks in, physical hotplug
> > has already happened, and in the case of a kvm+qemu+ovmf hypervisor combo the
> > firmware has already handled it somehow and handed it over to ACPI.
> > If you say it's an architectural thing, then CPU hotplug is a platform/firmware
> > bug and should be disabled there instead of being worked around in the kernel.
> >
> > Perhaps instead of 'preventing' hotplug, complain/panic and be done with it.
>
> Hi Igor,
>
> Thanks for the feedback.  Yes, the current implementation actually reports CPU
> hot-add as a BIOS bug.  I think I can report a BIOS bug for hot-removal too.
> Currently I actually use BUG() for the hot-removal case.  For hot-add I didn't
> use BUG() but rejected the new CPU, as the latter is more conservative.
>
> Hi Rafael,
>
> I am not sure I got what you mean by "This will affect initialization, not just
> hotplug AFAICS", could you elaborate a little bit?  Thanks.

So acpi_processor_add() is called for CPUs that are already present at
init time, not just for the hot-added ones.

One of the things it does is to associate an ACPI companion with the given CPU.

Don't you need that to happen?

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-23  0:08     ` Kai Huang
@ 2022-06-28 17:55       ` Rafael J. Wysocki
  0 siblings, 0 replies; 114+ messages in thread
From: Rafael J. Wysocki @ 2022-06-28 17:55 UTC (permalink / raw)
  To: Kai Huang
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky

On Thu, Jun 23, 2022 at 2:09 AM Kai Huang <kai.huang@intel.com> wrote:
>
> On Wed, 2022-06-22 at 13:45 +0200, Rafael J. Wysocki wrote:
> > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
> > >
> > > Platforms with confidential computing technology may not support ACPI
> > > memory hotplug when such technology is enabled by the BIOS.  Examples
> > > include Intel platforms which support Intel Trust Domain Extensions
> > > (TDX).
> > >
> > > If the kernel ever receives an ACPI memory hotplug event, it is likely a
> > > BIOS bug.  For ACPI memory hot-add, the kernel should call out that this is
> > > a BIOS bug and reject the new memory.  For hot-removal, for simplicity
> > > just assume the kernel cannot continue to work normally, and just BUG().
> > >
> > > Add a new attribute, CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED, to indicate that
> > > the platform doesn't support ACPI memory hotplug, so that the kernel can
> > > handle ACPI memory hotplug events for such platforms.
> > >
> > > In acpi_memory_device_{add|remove}(), add early check against this
> > > attribute and handle accordingly if it is set.
> > >
> > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > ---
> > >  drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
> > >  include/linux/cc_platform.h    | 10 ++++++++++
> > >  2 files changed, 33 insertions(+)
> > >
> > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > index 24f662d8bd39..94d6354ea453 100644
> > > --- a/drivers/acpi/acpi_memhotplug.c
> > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/acpi.h>
> > >  #include <linux/memory.h>
> > >  #include <linux/memory_hotplug.h>
> > > +#include <linux/cc_platform.h>
> > >
> > >  #include "internal.h"
> > >
> > > @@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
> > >         if (!device)
> > >                 return -EINVAL;
> > >
> > > +       /*
> > > +        * If the confidential computing platform doesn't support ACPI
> > > +        * memory hotplug, the BIOS should never deliver such event to
> > > +        * the kernel.  Report ACPI memory hot-add as a BIOS bug and ignore
> > > +        * the memory device.
> > > +        */
> > > +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
> >
> > Same comment as for the acpi_processor driver: this will affect the
> > initialization too and it would be cleaner to reset the
> > .hotplug.enabled flag of the scan handler.
> >
> >
>
> Hi Rafael,
>
> Thanks for the review.  As with the ACPI CPU hotplug handling, this is also
> illegal during kernel boot.

What do you mean?

Is it not correct to enumerate any memory device through ACPI at all?

>  If we just want to disable, then perhaps something like below?
>
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -366,6 +366,9 @@ static bool __initdata acpi_no_memhotplug;
>
>  void __init acpi_memory_hotplug_init(void)
>  {
> +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED))
> +               acpi_no_memhotplug = true;
> +

This looks fine to me if the above is the case, but you need to modify
the changelog to match.

>         if (acpi_no_memhotplug) {
>                 memory_device_handler.attach = NULL;
>                 acpi_scan_add_handler(&memory_device_handler);
>
>
> --


* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-28 17:33           ` Rafael J. Wysocki
@ 2022-06-28 23:41             ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-28 23:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Igor Mammedov, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky, Tianyu.Lan, Randy Dunlap, Jason A. Donenfeld,
	Juri Lelli, Mark Rutland, Frederic Weisbecker, Yue Haibing,
	dongli.zhang

On Tue, 2022-06-28 at 19:33 +0200, Rafael J. Wysocki wrote:
> > Hi Rafael,
> > 
> > I am not sure I got what you mean by "This will affect initialization, not
> > just
> > hotplug AFAICS", could you elaborate a little bit?  Thanks.
> 
> So acpi_processor_add() is called for CPUs that are already present at
> init time, not just for the hot-added ones.
> 
> One of the things it does is to associate an ACPI companion with the given
> CPU.
> 
> Don't you need that to happen?

You are right.  I tested again, and yes, it is also called after the boot-time
present CPUs are up (after smp_init()).  I didn't check this carefully in my
previous test.  Thanks for catching it.

-- 
Thanks,
-Kai




* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-28 12:01     ` Igor Mammedov
@ 2022-06-28 23:49       ` Kai Huang
  2022-06-29  8:48         ` Igor Mammedov
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-28 23:49 UTC (permalink / raw)
  To: Igor Mammedov, Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, kvm-devel, ACPI Devel Maling List,
	Sean Christopherson, Paolo Bonzini, Dave Hansen, Len Brown,
	Tony Luck, Rafael Wysocki, Reinette Chatre, Dan Williams,
	Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, isaku.yamahata, Tom Lendacky

On Tue, 2022-06-28 at 14:01 +0200, Igor Mammedov wrote:
> On Wed, 22 Jun 2022 13:45:01 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> 
> > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:
> > > 
> > > Platforms with confidential computing technology may not support ACPI
> > > memory hotplug when such technology is enabled by the BIOS.  Examples
> > > include Intel platforms which support Intel Trust Domain Extensions
> > > (TDX).
> > > 
> > > If the kernel ever receives an ACPI memory hotplug event, it is likely a
> > > BIOS bug.  For ACPI memory hot-add, the kernel should call out that this is
> > > a BIOS bug and reject the new memory.  For hot-removal, for simplicity
> > > just assume the kernel cannot continue to work normally, and just BUG().
> > >
> > > Add a new attribute, CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED, to indicate that
> > > the platform doesn't support ACPI memory hotplug, so that the kernel can
> > > handle ACPI memory hotplug events for such platforms.
> > > 
> > > In acpi_memory_device_{add|remove}(), add early check against this
> > > attribute and handle accordingly if it is set.
> > > 
> > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > ---
> > >  drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
> > >  include/linux/cc_platform.h    | 10 ++++++++++
> > >  2 files changed, 33 insertions(+)
> > > 
> > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > index 24f662d8bd39..94d6354ea453 100644
> > > --- a/drivers/acpi/acpi_memhotplug.c
> > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/acpi.h>
> > >  #include <linux/memory.h>
> > >  #include <linux/memory_hotplug.h>
> > > +#include <linux/cc_platform.h>
> > > 
> > >  #include "internal.h"
> > > 
> > > @@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
> > >         if (!device)
> > >                 return -EINVAL;
> > > 
> > > +       /*
> > > +        * If the confidential computing platform doesn't support ACPI
> > > +        * memory hotplug, the BIOS should never deliver such event to
> > > +        * the kernel.  Report ACPI memory hot-add as a BIOS bug and ignore
> > > +        * the memory device.
> > > +        */
> > > +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
> > 
> > Same comment as for the acpi_processor driver: this will affect the
> > initialization too and it would be cleaner to reset the
> > .hotplug.enabled flag of the scan handler.
> 
> with QEMU, it is likely broken when memory is added as
>   '-device pc-dimm'
> on CLI since it's advertised only as device node in DSDT.
> 
> 

Hi Rafael,  Igor,

On my test machine, acpi_memory_device_add() is not called for system
memory.  That is probably because my machine doesn't have a memory device in
ACPI.

I don't know whether there can be any memory device in ACPI when such memory is
present during boot.  Any comments here?

And CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED is only true on a TDX bare-metal
system, and cannot be true in a QEMU guest.  But yes, if this flag ever becomes
true in a guest, then I think we may have a problem here.  I will study the
ACPI side more.  Thanks for the comments!

-- 
Thanks,
-Kai




* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
  2022-06-22 11:42   ` Rafael J. Wysocki
  2022-06-24 18:57   ` Dave Hansen
@ 2022-06-29  5:33   ` Christoph Hellwig
  2022-06-29  9:09     ` Kai Huang
  2022-08-03  3:55   ` Binbin Wu
  3 siblings, 1 reply; 114+ messages in thread
From: Christoph Hellwig @ 2022-06-29  5:33 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-acpi, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata, thomas.lendacky,
	Tianyu.Lan, rdunlap, Jason, juri.lelli, mark.rutland, frederic,
	yuehaibing, dongli.zhang

On Wed, Jun 22, 2022 at 11:15:43PM +1200, Kai Huang wrote:
> Platforms with confidential computing technology may not support ACPI
> CPU hotplug when such technology is enabled by the BIOS.  Examples
> include Intel platforms which support Intel Trust Domain Extensions
> (TDX).

What does this have to do with the cc_platform abstraction?  This is
just an Intel implementation bug because they hastened so much in
implementing this.  So the quirks should not overload the cc_platform
abstraction.



* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-24 11:21     ` Kai Huang
@ 2022-06-29  8:35       ` Yuan Yao
  2022-06-29  9:17         ` Kai Huang
  2022-06-29 14:22       ` Dave Hansen
  1 sibling, 1 reply; 114+ messages in thread
From: Yuan Yao @ 2022-06-29  8:35 UTC (permalink / raw)
  To: Kai Huang
  Cc: Chao Gao, linux-kernel, kvm, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata, thomas.lendacky,
	Tianyu.Lan

On Fri, Jun 24, 2022 at 11:21:59PM +1200, Kai Huang wrote:
> On Fri, 2022-06-24 at 09:41 +0800, Chao Gao wrote:
> > On Wed, Jun 22, 2022 at 11:16:07PM +1200, Kai Huang wrote:
> > > -static bool intel_cc_platform_has(enum cc_attr attr)
> > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > +static bool intel_tdx_guest_has(enum cc_attr attr)
> > > {
> > > 	switch (attr) {
> > > 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > @@ -28,6 +31,33 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > > 		return false;
> > > 	}
> > > }
> > > +#endif
> > > +
> > > +#ifdef CONFIG_INTEL_TDX_HOST
> > > +static bool intel_tdx_host_has(enum cc_attr attr)
> > > +{
> > > +	switch (attr) {
> > > +	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> > > +	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> > > +		return true;
> > > +	default:
> > > +		return false;
> > > +	}
> > > +}
> > > +#endif
> > > +
> > > +static bool intel_cc_platform_has(enum cc_attr attr)
> > > +{
> > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > +	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> > > +		return intel_tdx_guest_has(attr);
> > > +#endif
> > > +#ifdef CONFIG_INTEL_TDX_HOST
> > > +	if (platform_tdx_enabled())
> > > +		return intel_tdx_host_has(attr);
> > > +#endif
> > > +	return false;
> > > +}
> >
> > how about:
> >
> > static bool intel_cc_platform_has(enum cc_attr attr)
> > {
> > 	switch (attr) {
> > 	/* attributes applied to TDX guest only */
> > 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > 	...
> > 		return boot_cpu_has(X86_FEATURE_TDX_GUEST);
> >
> > 	/* attributes applied to TDX host only */
> > 	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> > 	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> > 		return platform_tdx_enabled();
> >
> > 	default:
> > 		return false;
> > 	}
> > }
> >
> > so that we can get rid of #ifdef/endif.
>
> > Personally I don't quite like this approach.  To me, having separate functions
> > for the host and guest is clearer and more flexible.  And I don't think having
> > #ifdef/#endif is a problem.  I would like to leave it to the maintainers.

I see the below statement, for your reference:

"Wherever possible, don't use preprocessor conditionals (#if, #ifdef) in .c"
From Documentation/process/coding-style.rst, 21) Conditional Compilation.

>
> --
> Thanks,
> -Kai
>
>


* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-28 23:49       ` Kai Huang
@ 2022-06-29  8:48         ` Igor Mammedov
  2022-06-29  9:13           ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Igor Mammedov @ 2022-06-29  8:48 UTC (permalink / raw)
  To: Kai Huang
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky

On Wed, 29 Jun 2022 11:49:14 +1200
Kai Huang <kai.huang@intel.com> wrote:

> On Tue, 2022-06-28 at 14:01 +0200, Igor Mammedov wrote:
> > On Wed, 22 Jun 2022 13:45:01 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >   
> > > On Wed, Jun 22, 2022 at 1:16 PM Kai Huang <kai.huang@intel.com> wrote:  
> > > > 
> > > > Platforms with confidential computing technology may not support ACPI
> > > > memory hotplug when such technology is enabled by the BIOS.  Examples
> > > > include Intel platforms which support Intel Trust Domain Extensions
> > > > (TDX).
> > > > 
> > > > If the kernel ever receives an ACPI memory hotplug event, it is likely a
> > > > BIOS bug.  For ACPI memory hot-add, the kernel should call out that this is
> > > > a BIOS bug and reject the new memory.  For hot-removal, for simplicity
> > > > just assume the kernel cannot continue to work normally, and just BUG().
> > > >
> > > > Add a new attribute, CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED, to indicate that
> > > > the platform doesn't support ACPI memory hotplug, so that the kernel can
> > > > handle ACPI memory hotplug events for such platforms.
> > > > 
> > > > In acpi_memory_device_{add|remove}(), add early check against this
> > > > attribute and handle accordingly if it is set.
> > > > 
> > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > ---
> > > >  drivers/acpi/acpi_memhotplug.c | 23 +++++++++++++++++++++++
> > > >  include/linux/cc_platform.h    | 10 ++++++++++
> > > >  2 files changed, 33 insertions(+)
> > > > 
> > > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > > index 24f662d8bd39..94d6354ea453 100644
> > > > --- a/drivers/acpi/acpi_memhotplug.c
> > > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > > @@ -15,6 +15,7 @@
> > > >  #include <linux/acpi.h>
> > > >  #include <linux/memory.h>
> > > >  #include <linux/memory_hotplug.h>
> > > > +#include <linux/cc_platform.h>
> > > > 
> > > >  #include "internal.h"
> > > > 
> > > > @@ -291,6 +292,17 @@ static int acpi_memory_device_add(struct acpi_device *device,
> > > >         if (!device)
> > > >                 return -EINVAL;
> > > > 
> > > > +       /*
> > > > +        * If the confidential computing platform doesn't support ACPI
> > > > +        * memory hotplug, the BIOS should never deliver such event to
> > > > +        * the kernel.  Report ACPI memory hot-add as a BIOS bug and ignore
> > > > +        * the memory device.
> > > > +        */
> > > > +       if (cc_platform_has(CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED)) {
> > > 
> > > Same comment as for the acpi_processor driver: this will affect the
> > > initialization too and it would be cleaner to reset the
> > > .hotplug.enabled flag of the scan handler.  
> > 
> > with QEMU, it is likely broken when memory is added as
> >   '-device pc-dimm'
> > on CLI since it's advertised only as device node in DSDT.
> > 
> >   
> 
> Hi Rafael,  Igor,
> 
> On my test machine, acpi_memory_device_add() is not called for system
> memory.  That is probably because my machine doesn't have a memory device in
> ACPI.
> 
> I don't know whether there can be any memory device in ACPI when such memory
> is present during boot.  Any comments here?

I don't see anything in the ACPI spec that forbids a memory device being present
at boot.  Such memory may also be present in E820, but QEMU doesn't do that,
since Linux used to online all E820 memory as normal memory, which breaks
hotplug.  And I don't know if that is still true.

NVDIMMs also use the memory device, so they may be affected by this patch as well.

> 
> And CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED is only true on TDX bare-metal system,
> but cannot be true in Qemu guest.  But yes if this flag ever becomes true in

That's temporary; once TDX support lands in KVM/QEMU, this patch will silently
break the use case.

> guest, then I think we may have problem here.  I will do more study around ACPI.
> Thanks for comments!
> 



* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-29  5:33   ` Christoph Hellwig
@ 2022-06-29  9:09     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-29  9:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, kvm, linux-acpi, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata, thomas.lendacky,
	Tianyu.Lan, rdunlap, Jason, juri.lelli, mark.rutland, frederic,
	yuehaibing, dongli.zhang

On Tue, 2022-06-28 at 22:33 -0700, Christoph Hellwig wrote:
> On Wed, Jun 22, 2022 at 11:15:43PM +1200, Kai Huang wrote:
> > Platforms with confidential computing technology may not support ACPI
> > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > include Intel platforms which support Intel Trust Domain Extensions
> > (TDX).
> 
> What does this have to do with the cc_platform abstraction?  This is
> just an Intel implementation bug because they hastened so much in
> implementing this.  So the quirks should not overload the cc_platform
> abstraction.
> 

Thanks for the feedback.  I thought there might be similar technologies, so it
would be better to have a common attribute.  I'll give up on this approach and
change to an arch-specific check.

-- 
Thanks,
-Kai




* Re: [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug
  2022-06-29  8:48         ` Igor Mammedov
@ 2022-06-29  9:13           ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-29  9:13 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, kvm-devel,
	ACPI Devel Maling List, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, Len Brown, Tony Luck, Rafael Wysocki,
	Reinette Chatre, Dan Williams, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, isaku.yamahata,
	Tom Lendacky

On Wed, 2022-06-29 at 10:48 +0200, Igor Mammedov wrote:
> > Hi Rafael,  Igor,
> > 
> > On my test machine, acpi_memory_device_add() is not called for system
> > memory.  That is probably because my machine doesn't have a memory device
> > in ACPI.
> > 
> > I don't know whether there can be any memory device in ACPI when such
> > memory is present during boot.  Any comments here?
> 
> I don't see anything in the ACPI spec that forbids a memory device being
> present at boot.  Such memory may also be present in E820, but QEMU doesn't
> do that, since Linux used to online all E820 memory as normal memory, which
> breaks hotplug.  And I don't know if that is still true.
> 
> NVDIMMs also use the memory device, so they may be affected by this patch as
> well.

AFAICT NVDIMM uses a different device ID, so it won't be impacted.  But you're
right that there's no specification around whether firmware will create an ACPI
memory device for boot-time present memory, so I guess we need to treat it as
possible.  So I agree that having the check at the beginning of
acpi_memory_device_add() looks incorrect.

Also, as Christoph commented, I'll give up on introducing a new CC attribute.

> 
> > 
> > And CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED is only true on TDX bare-metal
> > system,
> > but cannot be true in Qemu guest.  But yes if this flag ever becomes true in
> 
> That's temporary; once TDX support lands in KVM/QEMU, this patch will
> silently break the use case.

I don't think so.  KVM/QEMU won't expose host-side TDX to the guest, so this
attribute won't be true in a guest.

-- 
Thanks,
-Kai




* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-29  8:35       ` Yuan Yao
@ 2022-06-29  9:17         ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-29  9:17 UTC (permalink / raw)
  To: Yuan Yao
  Cc: Chao Gao, linux-kernel, kvm, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata, thomas.lendacky,
	Tianyu.Lan

On Wed, 2022-06-29 at 16:35 +0800, Yuan Yao wrote:
> On Fri, Jun 24, 2022 at 11:21:59PM +1200, Kai Huang wrote:
> > On Fri, 2022-06-24 at 09:41 +0800, Chao Gao wrote:
> > > On Wed, Jun 22, 2022 at 11:16:07PM +1200, Kai Huang wrote:
> > > > -static bool intel_cc_platform_has(enum cc_attr attr)
> > > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > > +static bool intel_tdx_guest_has(enum cc_attr attr)
> > > > {
> > > > 	switch (attr) {
> > > > 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > > @@ -28,6 +31,33 @@ static bool intel_cc_platform_has(enum cc_attr attr)
> > > > 		return false;
> > > > 	}
> > > > }
> > > > +#endif
> > > > +
> > > > +#ifdef CONFIG_INTEL_TDX_HOST
> > > > +static bool intel_tdx_host_has(enum cc_attr attr)
> > > > +{
> > > > +	switch (attr) {
> > > > +	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> > > > +	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> > > > +		return true;
> > > > +	default:
> > > > +		return false;
> > > > +	}
> > > > +}
> > > > +#endif
> > > > +
> > > > +static bool intel_cc_platform_has(enum cc_attr attr)
> > > > +{
> > > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > > +	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> > > > +		return intel_tdx_guest_has(attr);
> > > > +#endif
> > > > +#ifdef CONFIG_INTEL_TDX_HOST
> > > > +	if (platform_tdx_enabled())
> > > > +		return intel_tdx_host_has(attr);
> > > > +#endif
> > > > +	return false;
> > > > +}
> > > 
> > > how about:
> > > 
> > > static bool intel_cc_platform_has(enum cc_attr attr)
> > > {
> > > 	switch (attr) {
> > > 	/* attributes applied to TDX guest only */
> > > 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > > 	...
> > > 		return boot_cpu_has(X86_FEATURE_TDX_GUEST);
> > > 
> > > 	/* attributes applied to TDX host only */
> > > 	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> > > 	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> > > 		return platform_tdx_enabled();
> > > 
> > > 	default:
> > > 		return false;
> > > 	}
> > > }
> > > 
> > > so that we can get rid of #ifdef/endif.
> > 
> > Personally I don't quite like this approach.  To me, having separate functions
> > for the host and guest is clearer and more flexible.  And I don't think having
> > #ifdef/#endif is a problem.  I would like to leave it to the maintainers.
> 
> I see the below statement, for your reference:
> 
> "Wherever possible, don't use preprocessor conditionals (#if, #ifdef) in .c"
> From Documentation/process/coding-style.rst, 21) Conditional Compilation.
> 
> > 

This is perhaps a general rule.  If you take a look at the existing code, you
will immediately find that AMD has an #ifdef too:

static bool amd_cc_platform_has(enum cc_attr attr)
{
#ifdef CONFIG_AMD_MEM_ENCRYPT
        switch (attr) {
        case CC_ATTR_MEM_ENCRYPT:
                return sme_me_mask;

        case CC_ATTR_HOST_MEM_ENCRYPT:
                return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);

        case CC_ATTR_GUEST_MEM_ENCRYPT:
                return sev_status & MSR_AMD64_SEV_ENABLED;

        case CC_ATTR_GUEST_STATE_ENCRYPT:
                return sev_status & MSR_AMD64_SEV_ES_ENABLED;

        /*
         * With SEV, the rep string I/O instructions need to be unrolled
         * but SEV-ES supports them through the #VC handler.
         */
        case CC_ATTR_GUEST_UNROLL_STRING_IO:
                return (sev_status & MSR_AMD64_SEV_ENABLED) &&
                        !(sev_status & MSR_AMD64_SEV_ES_ENABLED);

        default:
                return false;
        }
#else
        return false;
#endif
}

So I'll leave it to the maintainers.

Anyway, as Christoph commented, I'll give up on introducing new CC attributes,
so it doesn't matter anymore.

-- 
Thanks,
-Kai




* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-24 11:21     ` Kai Huang
  2022-06-29  8:35       ` Yuan Yao
@ 2022-06-29 14:22       ` Dave Hansen
  2022-06-29 23:02         ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-29 14:22 UTC (permalink / raw)
  To: Kai Huang, Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan

On 6/24/22 04:21, Kai Huang wrote:
> Personally I don't quite like this approach.  To me, having separate functions
> for the host and guest is clearer and more flexible.  And I don't think having
> #ifdef/#endif is a problem.  I would like to leave it to the maintainers.

It has problems.

Let's go through some of them.  First, this:

> +#ifdef CONFIG_INTEL_TDX_HOST
> +static bool intel_tdx_host_has(enum cc_attr attr)
> +{
> +	switch (attr) {
> +	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> +	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +#endif

What does that #ifdef get us?  I suspect you're back to trying to
silence compiler warnings with #ifdefs.  The compiler *knows* that it's
only used in this file.  It's also used all of once.  If you make it
'static inline', you'll likely get the same code generation, no
warnings, and don't need an #ifdef.

The other option is to totally lean on the compiler to figure things
out.  Compile this program, then disassemble it and see what main() does.

#include <stdio.h>

static void func(void)
{
	printf("I am func()\n");
}

int main(int argc, char **argv)
{
	if (0)
		func();
}

Then, do:

-	if (0)
+	if (argc)

and run it again.  What changed in the disassembly?

> +static bool intel_cc_platform_has(enum cc_attr attr)
> +{
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> +		return intel_tdx_guest_has(attr);
> +#endif

Make this check cpu_feature_enabled(X86_FEATURE_TDX_GUEST).  That has an
#ifdef built in to it.  That gets rid of this #ifdef.  You have

> +#ifdef CONFIG_INTEL_TDX_HOST
> +	if (platform_tdx_enabled())
> +		return intel_tdx_host_has(attr);
> +#endif
> +	return false;
> +}

Now, let's turn our attention to platform_tdx_enabled().  Here's its
stub and declaration:

> +#ifdef CONFIG_INTEL_TDX_HOST
> +bool platform_tdx_enabled(void);
> +#else  /* !CONFIG_INTEL_TDX_HOST */
> +static inline bool platform_tdx_enabled(void) { return false; }
> +#endif /* CONFIG_INTEL_TDX_HOST */

It already has an #ifdef CONFIG_INTEL_TDX_HOST, so that #ifdef can just
go away.

Kai, the reason that we have the rule that Yuan cited:

> "Wherever possible, don't use preprocessor conditionals (#if, #ifdef) in .c"
> From Documentation/process/coding-style.rst, 21) Conditional Compilation.

is not because there are *ZERO* #ifdefs in .c files.  It's because
#ifdefs in .c files hurt readability and are usually avoidable.  How do
you avoid them?  Well, you take a moment and look at the code and see
how other folks have made it readable.  It takes refactoring of code to
banish #ifdefs to headers or replace them with compiler constructs so
that the compiler can do the work behind the scenes.

Kai, could you please take the information I gave you in this message
and try to apply it across this series?  Heck, can you please take it
and use it to review others' code to make sure they don't encounter the
same pitfalls?


* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-29 14:22       ` Dave Hansen
@ 2022-06-29 23:02         ` Kai Huang
  2022-06-30 15:44           ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-06-29 23:02 UTC (permalink / raw)
  To: Dave Hansen, Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan

On Wed, 2022-06-29 at 07:22 -0700, Dave Hansen wrote:
> On 6/24/22 04:21, Kai Huang wrote:
> > Personally I don't quite like this way.  To me having separate function for host
> > and guest is more clear and more flexible.  And I don't think having
> > #ifdef/endif has any problem.  I would like to leave to maintainers.
> 
> It has problems.
> 
> Let's go through some of them.  First, this:
> 
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +static bool intel_tdx_host_has(enum cc_attr attr)
> > +{
> > +	switch (attr) {
> > +	case CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED:
> > +	case CC_ATTR_ACPI_MEMORY_HOTPLUG_DISABLED:
> > +		return true;
> > +	default:
> > +		return false;
> > +	}
> > +}
> > +#endif
> 
> What does that #ifdef get us?  I suspect you're back to trying to
> silence compiler warnings with #ifdefs.  The compiler *knows* that it's
> only used in this file.  It's also used all of once.  If you make it
> 'static inline', you'll likely get the same code generation, no
> warnings, and don't need an #ifdef.

The purpose is not to avoid a warning, but to keep intel_cc_platform_has(enum
cc_attr attr) simple: when neither TDX host nor TDX guest code is turned on,
it reduces to:

	static bool intel_cc_platform_has(enum cc_attr attr)
	{
		return false;
	}

That way I don't need to depend on how the internal functions are implemented
in the header files, and I don't need to guess how the compiler generates code.

Also, I personally believe it doesn't hurt readability.

> 
> The other option is to totally lean on the compiler to figure things
> out.  Compile this program, then disassemble it and see what main() does.
> 
> static void func(void)
> {
> 	printf("I am func()\n");
> }
> 
> void main(int argc, char **argv)
> {
> 	if (0)
> 		func();
> }
> 
> Then, do:
> 
> -	if (0)
> +	if (argc)
> 
> and run it again.  What changed in the disassembly?

You mean compile it again?  I have to confess I've never tried it and don't
know.  I'll try when I get some spare time.  Thanks for the info.

> 
> > +static bool intel_cc_platform_has(enum cc_attr attr)
> > +{
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> > +		return intel_tdx_guest_has(attr);
> > +#endif
> 
> Make this check cpu_feature_enabled(X86_FEATURE_TDX_GUEST).  That has an
> #ifdef built in to it.  That gets rid of this #ifdef.  You have
> 
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +	if (platform_tdx_enabled())
> > +		return intel_tdx_host_has(attr);
> > +#endif
> > +	return false;
> > +}
> 
> Now, let's turn our attention to platform_tdx_enabled().  Here's its
> stub and declaration:
> 
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +bool platform_tdx_enabled(void);
> > +#else  /* !CONFIG_INTEL_TDX_HOST */
> > +static inline bool platform_tdx_enabled(void) { return false; }
> > +#endif /* CONFIG_INTEL_TDX_HOST */
> 
> It already has an #ifdef CONFIG_INTEL_TDX_HOST, so that #ifdef can just
> go away.
> 
> Kai, the reason that we have the rule that Yuan cited:
> 
> > "Wherever possible, don't use preprocessor conditionals (#if, #ifdef) in .c"
> > From Documentation/process/coding-style.rst, 21) Conditional Compilation.
> 
> is not because there are *ZERO* #ifdefs in .c files.  It's because
> #ifdefs in .c files hurt readability and are usually avoidable.  How do
> you avoid them?  Well, you take a moment and look at the code and see
> how other folks have made it readable.  It takes refactoring of code to
> banish #ifdefs to headers or replace them with compiler constructs so
> that the compiler can do the work behind the scenes.

Yes, I understand the purpose of this rule.  Thanks for explaining.
 
> 
> Kai, could you please take the information I gave you in this message
> and try to apply it across this series?  Heck, can you please take it
> and use it to review others' code to make sure they don't encounter the
> same pitfalls?

Thanks for the tip.  Will do.

Btw, this patch is the only one in this series with this #ifdef problem, and
it will be gone in the next version based on the feedback I received.  But I'll
check again.

-- 
Thanks,
-Kai




* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-29 23:02         ` Kai Huang
@ 2022-06-30 15:44           ` Dave Hansen
  2022-06-30 22:45             ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-06-30 15:44 UTC (permalink / raw)
  To: Kai Huang, Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan

On 6/29/22 16:02, Kai Huang wrote:
> On Wed, 2022-06-29 at 07:22 -0700, Dave Hansen wrote:
>> On 6/24/22 04:21, Kai Huang wrote:
>> What does that #ifdef get us?  I suspect you're back to trying to
>> silence compiler warnings with #ifdefs.  The compiler *knows* that it's
>> only used in this file.  It's also used all of once.  If you make it
>> 'static inline', you'll likely get the same code generation, no
>> warnings, and don't need an #ifdef.
> 
> The purpose is not to avoid a warning, but to keep intel_cc_platform_has(enum
> cc_attr attr) simple: when neither TDX host nor TDX guest code is turned on,
> it reduces to:
> 
> 	static bool intel_cc_platform_has(enum cc_attr attr)
> 	{
> 		return false;
> 	}
> 
> That way I don't need to depend on how the internal functions are implemented
> in the header files, and I don't need to guess how the compiler generates code.

I hate to break it to you, but you actually need to know how the
compiler works for you to be able to write good code.  Ignoring all the
great stuff that the compiler does for you makes your code worse.

> Also, I personally believe it doesn't hurt readability.

Are you saying that you're ignoring long-established kernel coding style
conventions because of your personal beliefs?  That seems, um, like an
approach that's unlikely to help your code get accepted.

>> The other option is to totally lean on the compiler to figure things
>> out.  Compile this program, then disassemble it and see what main() does.
>>
>> static void func(void)
>> {
>> 	printf("I am func()\n");
>> }
>>
>> void main(int argc, char **argv)
>> {
>> 	if (0)
>> 		func();
>> }
>>
>> Then, do:
>>
>> -	if (0)
>> +	if (argc)
>>
>> and run it again.  What changed in the disassembly?
> 
> You mean compile it again?  I have to confess I've never tried it and don't
> know.  I'll try when I get some spare time.  Thanks for the info.

Yes, compile it again and run it again.

But, seriously, it's a quick exercise.  I can help make you some spare
time if you wish.  Just let me know.


* Re: [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and ACPI memory hotplug
  2022-06-30 15:44           ` Dave Hansen
@ 2022-06-30 22:45             ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-06-30 22:45 UTC (permalink / raw)
  To: Dave Hansen, Chao Gao
  Cc: linux-kernel, kvm, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan

On Thu, 2022-06-30 at 08:44 -0700, Dave Hansen wrote:
> On 6/29/22 16:02, Kai Huang wrote:
> > On Wed, 2022-06-29 at 07:22 -0700, Dave Hansen wrote:
> > > On 6/24/22 04:21, Kai Huang wrote:
> > > What does that #ifdef get us?  I suspect you're back to trying to
> > > silence compiler warnings with #ifdefs.  The compiler *knows* that it's
> > > only used in this file.  It's also used all of once.  If you make it
> > > 'static inline', you'll likely get the same code generation, no
> > > warnings, and don't need an #ifdef.
> > 
> > The purpose is not to avoid a warning, but to keep intel_cc_platform_has(enum
> > cc_attr attr) simple: when neither TDX host nor TDX guest code is turned on,
> > it reduces to:
> > 
> > 	static bool intel_cc_platform_has(enum cc_attr attr)
> > 	{
> > 		return false;
> > 	}
> > 
> > That way I don't need to depend on how the internal functions are implemented
> > in the header files, and I don't need to guess how the compiler generates code.
> 
> I hate to break it to you, but you actually need to know how the
> compiler works for you to be able to write good code.  Ignoring all the
> great stuff that the compiler does for you makes your code worse.

Agreed.

> 
> > Also, I personally believe it doesn't hurt readability.
> 
> Are you saying that you're ignoring long-established kernel coding style
> conventions because of your personal beliefs?  That seems, um, like an
> approach that's unlikely to help your code get accepted.

Agreed.  Will keep this in mind.  Thanks.

> 
> > > The other option is to totally lean on the compiler to figure things
> > > out.  Compile this program, then disassemble it and see what main() does.
> > > 
> > > static void func(void)
> > > {
> > > 	printf("I am func()\n");
> > > }
> > > 
> > > void main(int argc, char **argv)
> > > {
> > > 	if (0)
> > > 		func();
> > > }
> > > 
> > > Then, do:
> > > 
> > > -	if (0)
> > > +	if (argc)
> > > 
> > > and run it again.  What changed in the disassembly?
> > 
> > You mean compile it again?  I have to confess I've never tried it and don't
> > know.  I'll try when I get some spare time.  Thanks for the info.
> 
> Yes, compile it again and run it again.
> 
> But, seriously, it's a quick exercise.  I can help make you some spare
> time if you wish.  Just let me know.

So I tried.  Took me less than 5 minutes. :)

The

	if (0)
		func();

never generates the code to actually call func():

0000000000401137 <main>: 
  401137:       55                      push   %rbp      
  401138:       48 89 e5                mov    %rsp,%rbp 
  40113b:       89 7d fc                mov    %edi,-0x4(%rbp)
  40113e:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
  401142:       90                      nop
  401143:       5d                      pop    %rbp    
  401144:       c3                      ret    

While
	if (argc)
		func();

generates the code to check argc and call func():

0000000000401137 <main>: 
  401137:       55                      push   %rbp     
  401138:       48 89 e5                mov    %rsp,%rbp  
  40113b:       48 83 ec 10             sub    $0x10,%rsp
  40113f:       89 7d fc                mov    %edi,-0x4(%rbp)
  401142:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
  401146:       83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  40114a:       74 05                   je     401151 <main+0x1a>
  40114c:       e8 d5 ff ff ff          call   401126 <func>
  401151:       90                      nop                      
  401152:       c9                      leave                
  401153:       c3                      ret   

This is not really a surprise.

Were you trying to make the point that

	if (false)
		func();

doesn't generate any additional code?

I get your point now.  Thanks :)


* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-06-27  6:16     ` Kai Huang
@ 2022-07-07  2:37       ` Kai Huang
  2022-07-07 14:26       ` Dave Hansen
  1 sibling, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-07-07  2:37 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata


> > > +/*
> > > + * Walks over all memblock memory regions that are intended to be
> > > + * converted to TDX memory.  Essentially, it is all memblock memory
> > > + * regions excluding the low memory below 1MB.
> > > + *
> > > + * This is because on some TDX platforms the low memory below 1MB is
> > > + * not included in CMRs.  Excluding the low 1MB can still guarantee
> > > + * that the pages managed by the page allocator are always TDX memory,
> > > + * as the low 1MB is reserved during kernel boot and won't end up to
> > > + * the ZONE_DMA (see reserve_real_mode()).
> > > + */
> > > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> > > +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> > > +		if (!pfn_range_skip_lowmem(p_start, p_end))
> > 
> > Let's summarize where we are at this point:
> > 
> > 1. All RAM is described in memblocks
> > 2. Some memblocks are reserved and some are free
> > 3. The lower 1MB is marked reserved
> > 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
> >    have to exclude the lower 1MB as a special case.
> > 
> > That seems superficially rather ridiculous.  Shouldn't we just pick a
> > memblock iterator that skips the 1MB?  Surely there is such a thing.
> 
> Perhaps you are suggesting we should always loop the _free_ ranges so we don't
> need to care about the first 1MB which is reserved?
> 
> The problem is some reserved memory regions are actually later freed to the page
> allocator, for example, initrd.  So to cover all those 'late-freed-reserved-
> regions', I used for_each_mem_pfn_range(), instead of for_each_free_mem_range().
> 
> Btw, I do have a checkpatch warning around this code:
> 
> ERROR: Macros with complex values should be enclosed in parentheses
> #109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
> +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> +		if (!pfn_range_skip_lowmem(p_start, p_end))
> 
> But it looks like a false positive to me.

Hi Dave,

Sorry to ping.  Just to double-check: any comments around here, ..

> 
> > Or, should we be doing something different with the 1MB in the memblock
> > structure?
> 
> memblock APIs are used by other kernel components.  I don't think we should
> modify memblock code behaviour for TDX.  Do you have any specific suggestion?
> 
> One possible option I can think is explicitly "register" memory regions as TDX
> memory when they are firstly freed to the page allocator.  

[...]

> 
> This will require new data structures to represent TDX memblock and the code to
> create, insert and merge contiguous TDX memblocks, etc.  The advantage is we can
> just iterate those TDX memblocks when constructing TDMRs.
> 
> 

And here?



* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-06-27  6:16     ` Kai Huang
  2022-07-07  2:37       ` Kai Huang
@ 2022-07-07 14:26       ` Dave Hansen
  2022-07-07 14:36         ` Juergen Gross
  2022-07-07 23:34         ` Kai Huang
  1 sibling, 2 replies; 114+ messages in thread
From: Dave Hansen @ 2022-07-07 14:26 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 6/26/22 23:16, Kai Huang wrote:
> On Fri, 2022-06-24 at 12:40 -0700, Dave Hansen wrote:
>>> +/*
>>> + * Walks over all memblock memory regions that are intended to be
>>> + * converted to TDX memory.  Essentially, it is all memblock memory
>>> + * regions excluding the low memory below 1MB.
>>> + *
>>> + * This is because on some TDX platforms the low memory below 1MB is
>>> + * not included in CMRs.  Excluding the low 1MB can still guarantee
>>> + * that the pages managed by the page allocator are always TDX memory,
>>> + * as the low 1MB is reserved during kernel boot and won't end up to
>>> + * the ZONE_DMA (see reserve_real_mode()).
>>> + */
>>> +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
>>> +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
>>> +		if (!pfn_range_skip_lowmem(p_start, p_end))
>>
>> Let's summarize where we are at this point:
>>
>> 1. All RAM is described in memblocks
>> 2. Some memblocks are reserved and some are free
>> 3. The lower 1MB is marked reserved
>> 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
>>    have to exclude the lower 1MB as a special case.
>>
>> That seems superficially rather ridiculous.  Shouldn't we just pick a
>> memblock iterator that skips the 1MB?  Surely there is such a thing.
> 
> Perhaps you are suggesting we should always loop the _free_ ranges so we don't
> need to care about the first 1MB which is reserved?
> 
> The problem is some reserved memory regions are actually later freed to the page
> allocator, for example, initrd.  So to cover all those 'late-freed-reserved-
> regions', I used for_each_mem_pfn_range(), instead of for_each_free_mem_range().

Why not just entirely remove the lower 1MB from the memblock structure
on TDX systems?  Do something equivalent to adding this on the kernel
command line:

	memmap=1M$0x0
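As a concrete sketch of that idea, the in-kernel equivalent could be to trim
memblock where the low 1MB gets reserved (the hunk below is purely
illustrative: the hook point next to reserve_real_mode() and the exact guard
are assumptions, not a reviewed patch):

```
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ (somewhere after e820/memblock setup)
 	reserve_real_mode();
+
+	/* Sketch: TDX cannot convert the low 1MB anyway; drop it from
+	 * memblock entirely so the stock iterators never return it. */
+	if (platform_tdx_enabled())
+		memblock_remove(0, SZ_1M);
```

With the low 1MB gone from memblock, the plain for_each_mem_pfn_range() would
suffice and the filtered wrapper macro could be dropped.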

> Btw, I do have a checkpatch warning around this code:
> 
> ERROR: Macros with complex values should be enclosed in parentheses
> #109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
> +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> +		if (!pfn_range_skip_lowmem(p_start, p_end))
> 
> But it looks like a false positive to me.

I think it doesn't like the if().


* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-07-07 14:26       ` Dave Hansen
@ 2022-07-07 14:36         ` Juergen Gross
  2022-07-07 23:42           ` Kai Huang
  2022-07-07 23:34         ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Juergen Gross @ 2022-07-07 14:36 UTC (permalink / raw)
  To: Dave Hansen, Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata


On 07.07.22 16:26, Dave Hansen wrote:
> On 6/26/22 23:16, Kai Huang wrote:
>> On Fri, 2022-06-24 at 12:40 -0700, Dave Hansen wrote:
>>>> +/*
>>>> + * Walks over all memblock memory regions that are intended to be
>>>> + * converted to TDX memory.  Essentially, it is all memblock memory
>>>> + * regions excluding the low memory below 1MB.
>>>> + *
>>>> + * This is because on some TDX platforms the low memory below 1MB is
>>>> + * not included in CMRs.  Excluding the low 1MB can still guarantee
>>>> + * that the pages managed by the page allocator are always TDX memory,
>>>> + * as the low 1MB is reserved during kernel boot and won't end up to
>>>> + * the ZONE_DMA (see reserve_real_mode()).
>>>> + */
>>>> +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
>>>> +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
>>>> +		if (!pfn_range_skip_lowmem(p_start, p_end))
>>>
>>> Let's summarize where we are at this point:
>>>
>>> 1. All RAM is described in memblocks
>>> 2. Some memblocks are reserved and some are free
>>> 3. The lower 1MB is marked reserved
>>> 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
>>>     have to exclude the lower 1MB as a special case.
>>>
>>> That seems superficially rather ridiculous.  Shouldn't we just pick a
>>> memblock iterator that skips the 1MB?  Surely there is such a thing.
>>
>> Perhaps you are suggesting we should always loop the _free_ ranges so we don't
>> need to care about the first 1MB which is reserved?
>>
>> The problem is some reserved memory regions are actually later freed to the page
>> allocator, for example, initrd.  So to cover all those 'late-freed-reserved-
>> regions', I used for_each_mem_pfn_range(), instead of for_each_free_mem_range().
> 
> Why not just entirely remove the lower 1MB from the memblock structure
> on TDX systems?  Do something equivalent to adding this on the kernel
> command line:
> 
> 	memmap=1M$0x0
> 
>> Btw, I do have a checkpatch warning around this code:
>>
>> ERROR: Macros with complex values should be enclosed in parentheses
>> #109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
>> +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
>> +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
>> +		if (!pfn_range_skip_lowmem(p_start, p_end))
>>
>> But it looks like a false positive to me.
> 
> I think it doesn't like the if().

I think it is right.

Consider:

if (a)
     memblock_for_each_tdx_mem_pfn_range(...)
         func();
else
     other_func();


Juergen


* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-07-07 14:26       ` Dave Hansen
  2022-07-07 14:36         ` Juergen Gross
@ 2022-07-07 23:34         ` Kai Huang
  2022-08-03  1:30           ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-07-07 23:34 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-07-07 at 07:26 -0700, Dave Hansen wrote:
> On 6/26/22 23:16, Kai Huang wrote:
> > On Fri, 2022-06-24 at 12:40 -0700, Dave Hansen wrote:
> > > > +/*
> > > > + * Walks over all memblock memory regions that are intended to be
> > > > + * converted to TDX memory.  Essentially, it is all memblock memory
> > > > + * regions excluding the low memory below 1MB.
> > > > + *
> > > > + * This is because on some TDX platforms the low memory below 1MB is
> > > > + * not included in CMRs.  Excluding the low 1MB can still guarantee
> > > > + * that the pages managed by the page allocator are always TDX memory,
> > > > + * as the low 1MB is reserved during kernel boot and won't end up to
> > > > + * the ZONE_DMA (see reserve_real_mode()).
> > > > + */
> > > > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> > > > +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> > > > +		if (!pfn_range_skip_lowmem(p_start, p_end))
> > > 
> > > Let's summarize where we are at this point:
> > > 
> > > 1. All RAM is described in memblocks
> > > 2. Some memblocks are reserved and some are free
> > > 3. The lower 1MB is marked reserved
> > > 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
> > >    have to exclude the lower 1MB as a special case.
> > > 
> > > That seems superficially rather ridiculous.  Shouldn't we just pick a
> > > memblock iterator that skips the 1MB?  Surely there is such a thing.
> > 
> > Perhaps you are suggesting we should always loop the _free_ ranges so we don't
> > need to care about the first 1MB which is reserved?
> > 
> > The problem is some reserved memory regions are actually later freed to the page
> > allocator, for example, initrd.  So to cover all those 'late-freed-reserved-
> > regions', I used for_each_mem_pfn_range(), instead of for_each_free_mem_range().
> 
> Why not just entirely remove the lower 1MB from the memblock structure
> on TDX systems?  Do something equivalent to adding this on the kernel
> command line:
> 
> 	memmap=1M$0x0

I will explore this option.  Thanks!

> 
> > Btw, I do have a checkpatch warning around this code:
> > 
> > ERROR: Macros with complex values should be enclosed in parentheses
> > #109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
> > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> > +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> > +		if (!pfn_range_skip_lowmem(p_start, p_end))
> > 
> > But it looks like a false positive to me.
> 
> I think it doesn't like the if().

Yes. I'll explore your suggestion above and I hope this can be avoided.


-- 
Thanks,
-Kai




* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-07-07 14:36         ` Juergen Gross
@ 2022-07-07 23:42           ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-07-07 23:42 UTC (permalink / raw)
  To: Juergen Gross, Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-07-07 at 16:36 +0200, Juergen Gross wrote:
> On 07.07.22 16:26, Dave Hansen wrote:
> > On 6/26/22 23:16, Kai Huang wrote:
> > > On Fri, 2022-06-24 at 12:40 -0700, Dave Hansen wrote:
> > > > > +/*
> > > > > + * Walks over all memblock memory regions that are intended to be
> > > > > + * converted to TDX memory.  Essentially, it is all memblock memory
> > > > > + * regions excluding the low memory below 1MB.
> > > > > + *
> > > > > + * This is because on some TDX platforms the low memory below 1MB is
> > > > > + * not included in CMRs.  Excluding the low 1MB can still guarantee
> > > > > + * that the pages managed by the page allocator are always TDX memory,
> > > > > + * as the low 1MB is reserved during kernel boot and won't end up to
> > > > > + * the ZONE_DMA (see reserve_real_mode()).
> > > > > + */
> > > > > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> > > > > +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> > > > > +		if (!pfn_range_skip_lowmem(p_start, p_end))
> > > > 
> > > > Let's summarize where we are at this point:
> > > > 
> > > > 1. All RAM is described in memblocks
> > > > 2. Some memblocks are reserved and some are free
> > > > 3. The lower 1MB is marked reserved
> > > > 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
> > > >     have to exclude the lower 1MB as a special case.
> > > > 
> > > > That seems superficially rather ridiculous.  Shouldn't we just pick a
> > > > memblock iterator that skips the 1MB?  Surely there is such a thing.
> > > 
> > > Perhaps you are suggesting we should always loop the _free_ ranges so we don't
> > > need to care about the first 1MB which is reserved?
> > > 
> > > The problem is some reserved memory regions are actually later freed to the page
> > > allocator, for example, initrd.  So to cover all those 'late-freed-reserved-
> > > regions', I used for_each_mem_pfn_range(), instead of for_each_free_mem_range().
> > 
> > Why not just entirely remove the lower 1MB from the memblock structure
> > on TDX systems?  Do something equivalent to adding this on the kernel
> > command line:
> > 
> > 	memmap=1M$0x0
> > 
> > > Btw, I do have a checkpatch warning around this code:
> > > 
> > > ERROR: Macros with complex values should be enclosed in parentheses
> > > #109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
> > > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid)	\
> > > +	for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid)	\
> > > +		if (!pfn_range_skip_lowmem(p_start, p_end))
> > > 
> > > But it looks like a false positive to me.
> > 
> > I think it doesn't like the if().
> 
> I think it is right.
> 
> Consider:
> 
> if (a)
>      memblock_for_each_tdx_mem_pfn_range(...)
>          func();
> else
>      other_func();
> 
> 
> 
Interesting case.  Thanks.

Yes, we would need explicit { } around memblock_for_each_tdx_mem_pfn_range()
in this case.

-- 
Thanks,
-Kai




* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-27  5:05     ` Kai Huang
@ 2022-07-13 11:09       ` Kai Huang
  2022-07-19 17:46         ` Dave Hansen
  2022-08-03  3:40       ` Binbin Wu
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-07-13 11:09 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On Mon, 2022-06-27 at 17:05 +1200, Kai Huang wrote:
> On Fri, 2022-06-24 at 11:57 -0700, Dave Hansen wrote:
> > On 6/22/22 04:15, Kai Huang wrote:
> > > Platforms with confidential computing technology may not support ACPI
> > > CPU hotplug when such technology is enabled by the BIOS.  Examples
> > > include Intel platforms which support Intel Trust Domain Extensions
> > > (TDX).
> > > 
> > > If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
> > > bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
> > > bug and reject the new CPU.  For hot-removal, for simplicity just assume
> > > the kernel cannot continue to work normally, and BUG().
> > 
> > So, the kernel is now declaring ACPI CPU hotplug and TDX to be
> > incompatible and even BUG()'ing if we see them together.  Has anyone
> > told the firmware guys about this?  Is this in a spec somewhere?  When
> > the kernel goes boom, are the firmware folks going to cry "Kernel bug!!"?
> > 
> > This doesn't seem like something the kernel should be doing unilaterally.
> 
That TDX doesn't support ACPI CPU hotplug (both hot-add and hot-removal) is an
architectural behaviour.  The public specs don't explicitly say it, but it is
implied:
> 
1) During platform boot, MCHECK verifies that all logical CPUs on all packages
are TDX compatible, and it keeps some information, such as the total number of
CPU packages and logical CPUs, at some location in SEAMRR so it can later be
used by the P-SEAMLDR and the TDX module.  Please see "3.4 SEAMLDR_SEAMINFO"
in the P-SEAMLDR spec:
> 
> https://cdrdv2.intel.com/v1/dl/getContent/733584
> 
2) Also, some SEAMCALLs must be called on all logical CPUs or all CPU packages
that the platform has (such as TDH.SYS.INIT.LP and TDH.SYS.KEY.CONFIG),
otherwise the further steps of TDX module initialization will fail.
> 
Unfortunately there's no public spec mentioning the behaviour of ACPI CPU
hotplug on a TDX-enabled platform.  For instance, whether the BIOS will ever
get the ACPI CPU hot-plug event, or whether it will suppress the event if it
does.  What I got from Intel internally is that a non-buggy BIOS should never
report such an event to the kernel, so if the kernel receives one, it is fair
enough to treat it as a BIOS bug.
> 
> But theoretically, the BIOS isn't in TDX's TCB, and can be from 3rd party..
> 
> Also, I was told "CPU hot-plug is a system feature, not a CPU feature or Intel
> architecture feature", so Intel doesn't have an architectural specification for
> CPU hot-plug. 
> 
In the meantime, I am pushing Intel internally to add some statements about
the TDX and CPU hotplug interaction to the BIOS writer's guide and make it
public.  I guess this is the best we can do.
> 
Regarding the code change, I agree the BUG() isn't good.  I used it because:
1) this is basically a theoretical problem and shouldn't happen in practice;
2) there's no architectural specification of TDX behaviour on CPU hot-removal,
so I just used BUG() on the assumption that TDX isn't safe to use anymore.

Hi Dave,

Trying to close on how to handle ACPI CPU hotplug for TDX.  Could you give some
suggestions?

After discussion with TDX guys, they have agreed they will add below to either
the TDX module spec or TDX whitepaper:

"TDX doesn’t support adding or removing CPUs from TDX security perimeter. The
BIOS should prevent CPUs from being hot-added or hot-removed after platform
boots."

This means that if TDX is enabled in the BIOS, a non-buggy BIOS should never
deliver an ACPI CPU hotplug event to the kernel; otherwise it is a BIOS bug.
And this is only related to whether TDX is enabled in the BIOS, regardless of
whether the TDX module has been loaded/initialized or not.

So I think the proper way to handle this is: we still have code to detect
whether TDX is enabled by the BIOS (patch 01 in this series), and when ACPI CPU
hotplug happens on a TDX enabled platform, we print an error message saying it
is a BIOS bug.

Specifically, for CPU hot-add, we can print an error message and reject the new
CPU.  For CPU hot-removal, when the kernel receives this event, the CPU hot-
removal has already been handled by the BIOS so the kernel cannot reject it.  So
I think we can either BUG(), or say "TDX is broken and please reboot the
machine".

I guess BUG() would catch a lot of eyeballs, so how about choosing the latter,
like below?

--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -799,6 +799,7 @@ static void __init acpi_set_irq_model_ioapic(void)
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include <acpi/processor.h>
+#include <asm/tdx.h>
 
 static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
@@ -819,6 +820,12 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id,
 {
        int cpu;
 
+       if (platform_tdx_enabled()) {
+               pr_err("BIOS bug: CPU (physid %u) hot-added on TDX enabled platform. Reject it.\n",
+                               physid);
+               return -EINVAL;
+       }
+
        cpu = acpi_register_lapic(physid, acpi_id, ACPI_MADT_ENABLED);
        if (cpu < 0) {
                pr_info("Unable to map lapic to logical cpu number\n");
@@ -835,6 +842,10 @@ EXPORT_SYMBOL(acpi_map_cpu);
 
 int acpi_unmap_cpu(int cpu)
 {
+       if (platform_tdx_enabled())
+               pr_err("BIOS bug: CPU %d hot-removed on TDX enabled platform. TDX is broken. Please reboot the machine.\n",
+                               cpu);
+
 #ifdef CONFIG_ACPI_NUMA
        set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
 #endif


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-07-13 11:09       ` Kai Huang
@ 2022-07-19 17:46         ` Dave Hansen
  2022-07-19 23:54           ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-07-19 17:46 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On 7/13/22 04:09, Kai Huang wrote:
...
> "TDX doesn’t support adding or removing CPUs from TDX security perimeter. The
> BIOS should prevent CPUs from being hot-added or hot-removed after platform
> boots."

That's a start.  It also probably needs to say that the security
perimeter includes all logical CPUs, though.

>  static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  {
> @@ -819,6 +820,12 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id,
>  {
>         int cpu;
>  
> +       if (platform_tdx_enabled()) {
> +               pr_err("BIOS bug: CPU (physid %u) hot-added on TDX enabled platform. Reject it.\n",
> +                               physid);
> +               return -EINVAL;
> +       }

Is this the right place?  There are other sanity checks in
acpi_processor_hotadd_init() and it seems like a better spot.

>         cpu = acpi_register_lapic(physid, acpi_id, ACPI_MADT_ENABLED);
>         if (cpu < 0) {
>                 pr_info("Unable to map lapic to logical cpu number\n");
> @@ -835,6 +842,10 @@ EXPORT_SYMBOL(acpi_map_cpu);
>  
>  int acpi_unmap_cpu(int cpu)
>  {
> +       if (platform_tdx_enabled())
> +       pr_err("BIOS bug: CPU %d hot-removed on TDX enabled platform. TDX is broken. Please reboot the machine.\n",
> +                               cpu);
> +
>  #ifdef CONFIG_ACPI_NUMA
>         set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
>  #endif


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-27 22:10         ` Kai Huang
@ 2022-07-19 19:39           ` Dan Williams
  2022-07-19 23:28             ` Kai Huang
  2022-07-20 10:18           ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Dan Williams @ 2022-07-19 19:39 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

Kai Huang wrote:
> On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote:
> > On 6/26/22 22:23, Kai Huang wrote:
> > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> > > > On 6/22/22 04:16, Kai Huang wrote:
> > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > > > > CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
> > > > > SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
> > > > > before calling __seamcall().
> > > > 
> > > > I was trying to make the argument earlier that you don't need *ANY*
> > > > detection for TDX, other than the ability to make a SEAMCALL.
> > > > Basically, patch 01/22 could go away.
> > ...
> > > > So what does patch 01/22 buy us?  One EXTABLE entry?
> > > 
> > > There are below pros if we can detect whether TDX is enabled by BIOS during boot
> > > before initializing the TDX Module:
> > > 
> > > 1) There are requirements from customers to report whether platform supports TDX
> > > and the TDX keyID numbers before initializing the TDX module so the userspace
> > > cloud software can use this information to do something.  Sorry I cannot find
> > > the lore link now.
> > 
> > <sigh>
> > 
> > Never listen to customers literally.  It'll just lead you down the wrong
> > path.  They told you, "we need $FOO in dmesg" and you ran with it
> > without understanding why.  The fact that you even *need* to find the
> > lore link is because you didn't bother to realize what they really needed.
> > 
> > dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
> > do it with a *REAL* ABI.  Not dmesg.
> 
> Showing in the dmesg is the first step, but later we have plan to expose keyID
> info via /sysfs.  Of course, it's always arguable customer's such requirement is
> absolutely needed, but to me it's still a good thing to have code to detect TDX
> during boot.  The code isn't complicated as you can see.
> 
> > 
> > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> > > managed memory hotplug.  Kexec() support patch also can use it.
> > > 
> > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> > > is enabled by BIOS, but not whether TDX module is loaded, or the result of
> > > initializing the TDX module.  So I think we should have some code to detect TDX
> > > during boot.
> > 
> > This is *EXACTLY* why our colleagues at Intel needs to tell us about
> > what the OS and firmware should do when TDX is in varying states of decay.
> 
> Yes I am working on it to make it public.
> 
> > 
> > Does the mere presence of the TDX module prevent hotplug?  
> > 
> 
> For ACPI CPU hotplug, yes.  The TDX module even doesn't need to be loaded. 
> Whether SEAMRR is enabled determines.
> 
> For ACPI memory hotplug, in practice yes.  For architectural behaviour, I'll
> work with others internally to get some public statement.
> 
> > Or, if a
> > system has the TDX module loaded but no intent to ever use TDX, why
> > can't it just use hotplug like a normal system which is not addled with
> > the TDX albatross around its neck?
> 
> I think if a machine has enabled TDX in the BIOS, the user of the machine very
> likely has intention to actually use TDX.
> 
> Yes for driver-managed memory hotplug, it makes sense if user doesn't want to
> use TDX, it's better to not disable it.  But to me it's also not a disaster if
> we just disable driver-managed memory hotplug if TDX is enabled by BIOS.

No, driver-managed memory hotplug is how Linux handles "dedicated
memory" management. The architecture needs to comprehend that end users
may want to move address ranges into and out of Linux core-mm management
independently of whether those address ranges are also covered by a SEAM
range.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-19 19:39           ` Dan Williams
@ 2022-07-19 23:28             ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-07-19 23:28 UTC (permalink / raw)
  To: Dan Williams, Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-07-19 at 12:39 -0700, Dan Williams wrote:
> Kai Huang wrote:
> > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote:
> > > On 6/26/22 22:23, Kai Huang wrote:
> > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> > > > > On 6/22/22 04:16, Kai Huang wrote:
> > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > > > > > CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
> > > > > > SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
> > > > > > before calling __seamcall().
> > > > > 
> > > > > I was trying to make the argument earlier that you don't need *ANY*
> > > > > detection for TDX, other than the ability to make a SEAMCALL.
> > > > > Basically, patch 01/22 could go away.
> > > ...
> > > > > So what does patch 01/22 buy us?  One EXTABLE entry?
> > > > 
> > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot
> > > > before initializing the TDX Module:
> > > > 
> > > > 1) There are requirements from customers to report whether platform supports TDX
> > > > and the TDX keyID numbers before initializing the TDX module so the userspace
> > > > cloud software can use this information to do something.  Sorry I cannot find
> > > > the lore link now.
> > > 
> > > <sigh>
> > > 
> > > Never listen to customers literally.  It'll just lead you down the wrong
> > > path.  They told you, "we need $FOO in dmesg" and you ran with it
> > > without understanding why.  The fact that you even *need* to find the
> > > lore link is because you didn't bother to realize what they really needed.
> > > 
> > > dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
> > > do it with a *REAL* ABI.  Not dmesg.
> > 
> > Showing in the dmesg is the first step, but later we have plan to expose keyID
> > info via /sysfs.  Of course, it's always arguable customer's such requirement is
> > absolutely needed, but to me it's still a good thing to have code to detect TDX
> > during boot.  The code isn't complicated as you can see.
> > 
> > > 
> > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> > > > managed memory hotplug.  Kexec() support patch also can use it.
> > > > 
> > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of
> > > > initializing the TDX module.  So I think we should have some code to detect TDX
> > > > during boot.
> > > 
> > > This is *EXACTLY* why our colleagues at Intel needs to tell us about
> > > what the OS and firmware should do when TDX is in varying states of decay.
> > 
> > Yes I am working on it to make it public.
> > 
> > > 
> > > Does the mere presence of the TDX module prevent hotplug?  
> > > 
> > 
> > For ACPI CPU hotplug, yes.  The TDX module even doesn't need to be loaded. 
> > Whether SEAMRR is enabled determines.
> > 
> > For ACPI memory hotplug, in practice yes.  For architectural behaviour, I'll
> > work with others internally to get some public statement.
> > 
> > > Or, if a
> > > system has the TDX module loaded but no intent to ever use TDX, why
> > > can't it just use hotplug like a normal system which is not addled with
> > > the TDX albatross around its neck?
> > 
> > I think if a machine has enabled TDX in the BIOS, the user of the machine very
> > likely has intention to actually use TDX.
> > 
> > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to
> > use TDX, it's better to not disable it.  But to me it's also not a disaster if
> > we just disable driver-managed memory hotplug if TDX is enabled by BIOS.
> 
> No, driver-managed memory hotplug is how Linux handles "dedicated
> memory" management. The architecture needs to comprehend that end users
> may want to move address ranges into and out of Linux core-mm management
> independently of whether those address ranges are also covered by a SEAM
> range.

But to avoid GFP_TDX (and ZONE_TDX) stuff, we need to guarantee all memory pages
in the page allocator are TDX pages.  To me it's at least quite fair that the
user needs to *choose* between driver-managed memory hotplug and TDX.

If automatically disabling driver-managed memory hotplug on a TDX BIOS enabled
platform isn't desired, how about we introduce a kernel command line option
(i.e. use_tdx={on|off}) to let the user choose?

If the user specifies use_tdx=on, then the user cannot use driver-managed memory
hotplug.  If use_tdx=off, then the user cannot use TDX even if it is enabled by
the BIOS.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-07-19 17:46         ` Dave Hansen
@ 2022-07-19 23:54           ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-07-19 23:54 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On Tue, 2022-07-19 at 10:46 -0700, Dave Hansen wrote:
> On 7/13/22 04:09, Kai Huang wrote:
> ...
> > "TDX doesn’t support adding or removing CPUs from TDX security perimeter. The
> > BIOS should prevent CPUs from being hot-added or hot-removed after platform
> > boots."
> 
> That's a start.  It also probably needs to say that the security
> perimeter includes all logical CPUs, though.

To me it is kinda implied.  But I have sent an email to the TDX spec owner to
see whether we can say it more explicitly.

> 
> >  static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> >  {
> > @@ -819,6 +820,12 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id,
> >  {
> >         int cpu;
> >  
> > +       if (platform_tdx_enabled()) {
> > +               pr_err("BIOS bug: CPU (physid %u) hot-added on TDX enabled platform. Reject it.\n",
> > +                               physid);
> > +               return -EINVAL;
> > +       }
> 
> Is this the right place?  There are other sanity checks in
> acpi_processor_hotadd_init() and it seems like a better spot.

It has the below additional checks:

        if (invalid_phys_cpuid(pr->phys_id))
                return -ENODEV;
        
        status = acpi_evaluate_integer(pr->handle, "_STA", NULL, &sta);
        if (ACPI_FAILURE(status) || !(sta & ACPI_STA_DEVICE_PRESENT))
                return -ENODEV;


I don't know exactly when the first "invalid_phys_cpuid()" case will happen, but
the CPU is enumerated as "present" only after the second check.  I.e. if the
BIOS is buggy and somehow sends an ACPI CPU hot-add event to the kernel w/o the
CPU actually being hot-added, the kernel just returns -ENODEV here.

So to me, adding the check to acpi_map_cpu() is more reasonable, because by
reaching here, it is certain that a real CPU is being hot-added.


> 
> >         cpu = acpi_register_lapic(physid, acpi_id, ACPI_MADT_ENABLED);
> >         if (cpu < 0) {
> >                 pr_info("Unable to map lapic to logical cpu number\n");
> > @@ -835,6 +842,10 @@ EXPORT_SYMBOL(acpi_map_cpu);
> >  
> >  int acpi_unmap_cpu(int cpu)
> >  {
> > +       if (platform_tdx_enabled())
> > +       pr_err("BIOS bug: CPU %d hot-removed on TDX enabled platform. TDX is broken. Please reboot the machine.\n",
> > +                               cpu);
> > +
> >  #ifdef CONFIG_ACPI_NUMA
> >         set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> >  #endif
> 


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-06-27 22:10         ` Kai Huang
  2022-07-19 19:39           ` Dan Williams
@ 2022-07-20 10:18           ` Kai Huang
  2022-07-20 16:48             ` Dave Hansen
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-07-20 10:18 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-06-28 at 10:10 +1200, Kai Huang wrote:
> On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote:
> > On 6/26/22 22:23, Kai Huang wrote:
> > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> > > > On 6/22/22 04:16, Kai Huang wrote:
> > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > > > > CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
> > > > > SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
> > > > > before calling __seamcall().
> > > > 
> > > > I was trying to make the argument earlier that you don't need *ANY*
> > > > detection for TDX, other than the ability to make a SEAMCALL.
> > > > Basically, patch 01/22 could go away.
> > ...
> > > > So what does patch 01/22 buy us?  One EXTABLE entry?
> > > 
> > > There are below pros if we can detect whether TDX is enabled by BIOS during boot
> > > before initializing the TDX Module:
> > > 
> > > 1) There are requirements from customers to report whether platform supports TDX
> > > and the TDX keyID numbers before initializing the TDX module so the userspace
> > > cloud software can use this information to do something.  Sorry I cannot find
> > > the lore link now.
> > 
> > <sigh>
> > 
> > Never listen to customers literally.  It'll just lead you down the wrong
> > path.  They told you, "we need $FOO in dmesg" and you ran with it
> > without understanding why.  The fact that you even *need* to find the
> > lore link is because you didn't bother to realize what they really needed.
> > 
> > dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
> > do it with a *REAL* ABI.  Not dmesg.
> 
> Showing in the dmesg is the first step, but later we have plan to expose keyID
> info via /sysfs.  Of course, it's always arguable customer's such requirement is
> absolutely needed, but to me it's still a good thing to have code to detect TDX
> during boot.  The code isn't complicated as you can see.
> 
> > 
> > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> > > managed memory hotplug.  Kexec() support patch also can use it.
> > > 
> > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> > > is enabled by BIOS, but not whether TDX module is loaded, or the result of
> > > initializing the TDX module.  So I think we should have some code to detect TDX
> > > during boot.
> > 
> > This is *EXACTLY* why our colleagues at Intel needs to tell us about
> > what the OS and firmware should do when TDX is in varying states of decay.
> 
> Yes I am working on it to make it public.
> 
> > 
> > Does the mere presence of the TDX module prevent hotplug?  
> > 
> 
> For ACPI CPU hotplug, yes.  The TDX module even doesn't need to be loaded. 
> Whether SEAMRR is enabled determines.
> 
> For ACPI memory hotplug, in practice yes.  For architectural behaviour, I'll
> work with others internally to get some public statement.
> 
> > Or, if a
> > system has the TDX module loaded but no intent to ever use TDX, why
> > can't it just use hotplug like a normal system which is not addled with
> > the TDX albatross around its neck?
> 
> I think if a machine has enabled TDX in the BIOS, the user of the machine very
> likely has intention to actually use TDX.
> 
> Yes for driver-managed memory hotplug, it makes sense if user doesn't want to
> use TDX, it's better to not disable it.  But to me it's also not a disaster if
> we just disable driver-managed memory hotplug if TDX is enabled by BIOS.
> 
> For ACPI memory hotplug, I think in practice we can treat it as BIOS bug, but
> I'll get some public statement around this.
> 

Hi Dave,

Try to close on how to handle memory hotplug.  After discussion, below will be
architectural behaviour of TDX in terms of ACPI memory hotplug:

1) During platform boot, CMRs must be physically present. MCHECK verifies all
CMRs are physically present and are actually TDX convertible memory.
2) CMRs are static after platform boots and don't change at runtime.  TDX
architecture doesn't support hot-add or hot-removal of CMR memory.
3) TDX architecture doesn't forbid non-CMR memory hotplug.

Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS
should prevent CMR memory from being hot-removed.  If kernel ever receives such
event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack.

As a result, the kernel should also never receive event of hot-add CMR memory. 
It is very much likely TDX is under attack (physical attack) in such case, i.e.
someone is trying to physically replace any CMR memory.

In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the
kernel can get the CMRs during kernel boot when detecting whether TDX is enabled
by BIOS, we can do below:

- For memory hot-removal, if the removed memory falls into any CMR, then kernel
can speak loudly it is a BIOS bug.  But when this happens, the hot-removal has
been handled by BIOS thus kernel cannot actually prevent, so kernel can either
BUG(), or just print error message.  If the removed memory doesn't fall into
CMR, we do nothing.

- For memory hot-add, if the new memory falls into any CMR, then kernel should
speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only
possible when CMR memory has been previously hot-removed.  And kernel should
reject the new memory for security reason.  If the new memory doesn't fall into
any CMR, then we (also) just reject the new memory, as we want to guarantee all
memory in page allocator are TDX pages.  But this is basically due to kernel
policy but not due to TDX architecture.

BUT, since as the first step, we cannot get the CMR during kernel boot (as it
requires additional code to put CPU into VMX operation), I think for now we can
handle ACPI memory hotplug in below way:

- For memory hot-removal, we do nothing.
- For memory hot-add, we simply reject the new memory when TDX is enabled by
BIOS.  This not only prevents the potential "physical attack of replacing any
CMR memory", but also makes sure no non-CMR memory will be added to page
allocator during runtime via ACPI memory hot-add.

We can improve this in next stage when we can get CMRs during kernel boot.

For the concern that on a TDX BIOS enabled system, people may not want to use
TDX at all but just use it as normal system, as I replied to Dan regarding to
the driver-managed memory hotplug, we can provide a kernel commandline, i.e.
use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug. 
When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug
as normal but refuse to initialize TDX module.

Any comments?


 



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-20 10:18           ` Kai Huang
@ 2022-07-20 16:48             ` Dave Hansen
  2022-07-21  1:52               ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-07-20 16:48 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 7/20/22 03:18, Kai Huang wrote:
> Try to close on how to handle memory hotplug.  After discussion, below will be
> architectural behaviour of TDX in terms of ACPI memory hotplug:
> 
> 1) During platform boot, CMRs must be physically present. MCHECK verifies all
> CMRs are physically present and are actually TDX convertible memory.

I doubt this is strictly true.  This makes it sound like MCHECK is doing
*ACTUAL* verification that the memory is, in practice, convertible.
That would mean actually writing to it, which would take a long time for
a large system.

Does it *ACTUALLY* verify this?

Also, it's very odd to say that "CMRs must be physically present".  A
CMR itself is a logical construct.  The physical memory *backing* a CMR
is, something else entirely.

> 2) CMRs are static after platform boots and don't change at runtime.  TDX
> architecture doesn't support hot-add or hot-removal of CMR memory.
> 3) TDX architecture doesn't forbid non-CMR memory hotplug.
> 
> Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS
> should prevent CMR memory from being hot-removed.  If kernel ever receives such
> event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack.
> 
> As a result, the kernel should also never receive event of hot-add CMR memory. 
> It is very much likely TDX is under attack (physical attack) in such case, i.e.
> someone is trying to physically replace any CMR memory.
> 
> In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the
> kernel can get the CMRs during kernel boot when detecting whether TDX is enabled
> by BIOS, we can do below:
> 
> - For memory hot-removal, if the removed memory falls into any CMR, then kernel
> can speak loudly it is a BIOS bug.  But when this happens, the hot-removal has
> been handled by BIOS thus kernel cannot actually prevent, so kernel can either
> BUG(), or just print error message.  If the removed memory doesn't fall into
> CMR, we do nothing.

Hold on a sec.  Hot-removal is a two-step process.  The kernel *MUST*
know in advance that the removal is going to occur.  It follows that up
with evacuating the memory, giving the "all clear", then the actual
physical removal can occur.

I'm not sure what you're getting at with the "kernel cannot actually
prevent" bit.  No sane system actively destroys perfect good memory
content and tells the kernel about it after the fact.

> - For memory hot-add, if the new memory falls into any CMR, then kernel should
> speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only
> possible when CMR memory has been previously hot-removed.

I don't think this is strictly true.  It's totally possible to get a
hot-add *event* for memory which is in a CMR.  It would be another BIOS
bug, of course, but hot-remove is not a prerequisite purely for an event.

> And kernel should
> reject the new memory for security reason.  If the new memory doesn't fall into
> any CMR, then we (also) just reject the new memory, as we want to guarantee all
> memory in page allocator are TDX pages.  But this is basically due to kernel
> policy but not due to TDX architecture.

Agreed.

> BUT, since as the first step, we cannot get the CMR during kernel boot (as it
> requires additional code to put CPU into VMX operation), I think for now we can
> handle ACPI memory hotplug in below way:
> 
> - For memory hot-removal, we do nothing.

This doesn't seem right to me.  *If* we get a known-bogus hot-remove
event, we need to reject it.  Remember, removal is a two-step process.

> - For memory hot-add, we simply reject the new memory when TDX is enabled by
> BIOS.  This not only prevents the potential "physical attack of replacing any
> CMR memory",

I don't think there's *any* meaningful attack mitigation here.  Even if
someone managed to replace the physical address space that backed some
private memory, the integrity checksums won't match.  Memory integrity
mitigates physical replacement, not software.

> but also makes sure no non-CMR memory will be added to page
> allocator during runtime via ACPI memory hot-add.

Agreed.  This one _is_ important and since it supports an existing
policy, it makes sense to enforce this in the kernel.

> We can improve this in next stage when we can get CMRs during kernel boot.
> 
> For the concern that on a TDX BIOS enabled system, people may not want to use
> TDX at all but just use it as normal system, as I replied to Dan regarding to
> the driver-managed memory hotplug, we can provide a kernel commandline, i.e.
> use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug. 
> When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug
> as normal but refuse to initialize TDX module.

That doesn't sound like a good resolution to me.

It conflates pure "software" hotplug operations like transitioning
memory ownership from the core mm to a driver (like device DAX).

TDX should not have *ANY* impact on purely software operations.  Period.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-20 16:48             ` Dave Hansen
@ 2022-07-21  1:52               ` Kai Huang
  2022-07-27  0:34                 ` Kai Huang
  2022-08-03  2:37                 ` Kai Huang
  0 siblings, 2 replies; 114+ messages in thread
From: Kai Huang @ 2022-07-21  1:52 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-07-20 at 09:48 -0700, Dave Hansen wrote:
> On 7/20/22 03:18, Kai Huang wrote:
> > Try to close on how to handle memory hotplug.  After discussion, below will be
> > architectural behaviour of TDX in terms of ACPI memory hotplug:
> > 
> > 1) During platform boot, CMRs must be physically present. MCHECK verifies all
> > CMRs are physically present and are actually TDX convertible memory.
> 
> I doubt this is strictly true.  This makes it sound like MCHECK is doing
> *ACTUAL* verification that the memory is, in practice, convertible.
> That would mean actually writing to it, which would take a long time for
> a large system.

The word "verify" is used in many places in the specs.  In the public TDX module
spec, it is also used:

Table 1.1: Intel TDX Glossary
CMR: A range of physical memory configured by BIOS and verified by MCHECK

I guess that to verify, MCHECK doesn't need to actually write anything to
memory.  For example, when memory is present, it can know what type the memory
is, and so it can make the determination.

> 
> Does it *ACTUALLY* verify this?

Yes.  That is what the spec says, and what Intel colleagues confirmed.

> 
> Also, it's very odd to say that "CMRs must be physically present".  A
> CMR itself is a logical construct.  The physical memory *backing* a CMR
> is, something else entirely.

OK.  But I think it is easy to interpret this as meaning the physical memory
*backing* a CMR.

> 
> > 2) CMRs are static after platform boots and don't change at runtime.  TDX
> > architecture doesn't support hot-add or hot-removal of CMR memory.
> > 3) TDX architecture doesn't forbid non-CMR memory hotplug.
> > 
> > Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS
> > should prevent CMR memory from being hot-removed.  If kernel ever receives such
> > event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack.
> > 
> > As a result, the kernel should also never receive event of hot-add CMR memory. 
> > It is very much likely TDX is under attack (physical attack) in such case, i.e.
> > someone is trying to physically replace any CMR memory.
> > 
> > In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the
> > kernel can get the CMRs during kernel boot when detecting whether TDX is enabled
> > by BIOS, we can do below:
> > 
> > - For memory hot-removal, if the removed memory falls into any CMR, then kernel
> > can speak loudly it is a BIOS bug.  But when this happens, the hot-removal has
> > been handled by BIOS thus kernel cannot actually prevent, so kernel can either
> > BUG(), or just print error message.  If the removed memory doesn't fall into
> > CMR, we do nothing.
> 
> Hold on a sec.  Hot-removal is a two-step process.  The kernel *MUST*
> know in advance that the removal is going to occur.  It follows that up
> with evacuating the memory, giving the "all clear", then the actual
> physical removal can occur.

After looking more, it seems "the hot-removal has been handled by BIOS" is
wrong.  And you are right that a previous step must be done first (the device
offline).  But by "kernel cannot actually prevent" I meant that in the device
removal callback, the kernel cannot prevent the device from being removed.

This is my understanding by reading the ACPI spec and the code:

First, the BIOS sends an "Eject Request" notification to the kernel.  Upon
receiving this event, the kernel first tries to offline the device (which can
fail with -EBUSY, etc.).  If the offline succeeds, the kernel calls the
device's remove callback to remove the device.  This remove callback doesn't
return an error code (i.e. it cannot fail).  Instead, after the remove
callback completes, the kernel calls the _EJ0 ACPI method to do the actual
ejection.

> 
> I'm not sure what you're getting at with the "kernel cannot actually
> prevent" bit.  No sane system actively destroys perfect good memory
> content and tells the kernel about it after the fact.

The kernel will offline the device first.  This guarantees all good memory
content has been migrated.

> 
> > - For memory hot-add, if the new memory falls into any CMR, then kernel should
> > speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only
> > possible when CMR memory has been previously hot-removed.
> 
> I don't think this is strictly true.  It's totally possible to get a
> hot-add *event* for memory which is in a CMR.  It would be another BIOS
> bug, of course, but hot-remove is not a prerequisite purely for an event.

OK.

> 
> > And kernel should
> > reject the new memory for security reason.  If the new memory doesn't fall into
> > any CMR, then we (also) just reject the new memory, as we want to guarantee all
> > memory in page allocator are TDX pages.  But this is basically due to kernel
> > policy but not due to TDX architecture.
> 
> Agreed.
> 
> > BUT, since as the first step, we cannot get the CMR during kernel boot (as it
> > requires additional code to put CPU into VMX operation), I think for now we can
> > handle ACPI memory hotplug in below way:
> > 
> > - For memory hot-removal, we do nothing.
> 
> This doesn't seem right to me.  *If* we get a known-bogus hot-remove
> event, we need to reject it.  Remember, removal is a two-step process.

If so, we need to reject the (CMR) memory offline.  Or should we just BUG() in
the ACPI memory removal callback?

But either way this requires us to get the CMRs during kernel boot.

Do you think we need to add this support in the first series?

> 
> > - For memory hot-add, we simply reject the new memory when TDX is enabled by
> > BIOS.  This not only prevents the potential "physical attack of replacing any
> > CMR memory",
> 
> I don't think there's *any* meaningful attack mitigation here.  Even if
> someone managed to replace the physical address space that backed some
> private memory, the integrity checksums won't match.  Memory integrity
> mitigates physical replacement, not software.

My thinking is that rejecting the new memory is a more aggressive defence than
waiting for an integrity checksum failure.

Btw, integrity checksum support isn't a mandatory requirement of the TDX
architecture.  In fact, TDX also supports a mode which doesn't require the
integrity check (for instance, TDX on client machines).

> 
> > but also makes sure no non-CMR memory will be added to page
> > allocator during runtime via ACPI memory hot-add.
> 
> Agreed.  This one _is_ important and since it supports an existing
> policy, it makes sense to enforce this in the kernel.
> 
> > We can improve this in next stage when we can get CMRs during kernel boot.
> > 
> > For the concern that on a TDX BIOS enabled system, people may not want to use
> > TDX at all but just use it as normal system, as I replied to Dan regarding to
> > the driver-managed memory hotplug, we can provide a kernel commandline, i.e.
> > use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug. 
> > When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug
> > as normal but refuse to initialize TDX module.
> 
> That doesn't sound like a good resolution to me.
> 
> It conflates pure "software" hotplug operations like transitioning
> memory ownership from the core mm to a driver (like device DAX).
> 
> TDX should not have *ANY* impact on purely software operations.  Period.

The hard requirement is: once the TDX module gets initialized, we cannot add
any *new* memory to core-mm.

But if some memory block is included in TDX memory when the module gets
initialized, then we should be able to move it from core-mm to a driver or
vice versa.  In this case, we can select all memory regions that the kernel
wants to use as TDX memory at some point (during kernel boot, I guess).

Adding any non-selected-TDX memory region to core-mm should always be rejected
(therefore there's no removal of it from core-mm either), although it is
"software" hotplug.  If the user wants this, he/she cannot use TDX.  This is
what I mean by providing a command line option to allow the user to *choose*.

Also, if I understand the above correctly, your suggestion is that we want to
prevent any CMR memory from going offline so it won't be hot-removed (assuming
we can get CMRs during boot).  This looks to contradict the requirement of
being able to move memory from core-mm to a driver.  When we offline the
memory, we cannot know whether it will be used by a driver, or later
hot-removed.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-21  1:52               ` Kai Huang
@ 2022-07-27  0:34                 ` Kai Huang
  2022-07-27  0:50                   ` Dave Hansen
  2022-08-03  2:37                 ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-07-27  0:34 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> > > BUT, since as the first step, we cannot get the CMR during kernel boot (as
> > > it
> > > requires additional code to put CPU into VMX operation), I think for now
> > > we can
> > > handle ACPI memory hotplug in below way:
> > > 
> > > - For memory hot-removal, we do nothing.
> > 
> > This doesn't seem right to me.  *If* we get a known-bogus hot-remove
> > event, we need to reject it.  Remember, removal is a two-step process.
> 
> If so, we need to reject the (CMR) memory offline.  Or we just BUG() in the
> ACPI
> memory removal  callback?
> 
> But either way this will requires us to get the CMRs during kernel boot.
> 
> Do you think we need to add this support in the first series?

Hi Dave,

In terms of whether we should get CMRs during kernel boot (which requires
doing VMXON/VMXOFF around SEAMCALL during kernel boot), I forgot one thing:

Technically, ACPI memory hotplug is related to whether TDX is enabled in BIOS,
not to whether the TDX module is loaded.  With VMXON/VMXOFF, we can get CMRs
during kernel boot by calling the P-SEAMLDR's SEAMCALL.  But theoretically,
from the TDX architecture's point of view, the P-SEAMLDR may not be loaded
even when TDX is enabled by BIOS (in practice, the P-SEAMLDR is always loaded
by BIOS when TDX is enabled), in which case there's no way to get CMRs.  But
in this case, I think we can just treat TDX as not enabled by BIOS, as the
kernel should never try to load the P-SEAMLDR itself.

Other advantages of being able to do VMXON/VMXOFF and getting CMRs during kernel
boot:

1) We can just shut down the TDX module in kexec();
2) We can choose to trim any non-CMR memory out of memblock.memory instead of
having to manually verify all memory regions in memblock are CMR memory.

Comments?

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-27  0:34                 ` Kai Huang
@ 2022-07-27  0:50                   ` Dave Hansen
  2022-07-27 12:46                     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-07-27  0:50 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 7/26/22 17:34, Kai Huang wrote:
>> This doesn't seem right to me.  *If* we get a known-bogus
>> hot-remove event, we need to reject it.  Remember, removal is a
>> two-step process.
> If so, we need to reject the (CMR) memory offline.  Or we just BUG()
> in the ACPI memory removal  callback?
> 
> But either way this will requires us to get the CMRs during kernel boot.

I don't get the link there between CMRs at boot and handling hotplug.

We don't need to go to extreme measures just to get a message out of the
kernel that the BIOS is bad.  If we don't have the data to do it
already, then I don't really see the need to warn about it.

Think of a system that has TDX enabled in the BIOS, but is running an
old kernel.  It will have *ZERO* idea that hotplug doesn't work.  It'll
run blissfully along.  I don't see any reason that a kernel with TDX
support, but where TDX is disabled should actively go out and try to be
better than those old pre-TDX kernels.

Further, there's nothing to stop non-CMR memory from being added to a
system with TDX enabled in the BIOS but where the kernel is not using
it.  If we actively go out and keep good old DRAM from being added, then
we unnecessarily addle those systems.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-27  0:50                   ` Dave Hansen
@ 2022-07-27 12:46                     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-07-27 12:46 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-07-26 at 17:50 -0700, Dave Hansen wrote:
> On 7/26/22 17:34, Kai Huang wrote:
> > > This doesn't seem right to me.  *If* we get a known-bogus
> > > hot-remove event, we need to reject it.  Remember, removal is a
> > > two-step process.
> > If so, we need to reject the (CMR) memory offline.  Or we just BUG()
> > in the ACPI memory removal  callback?
> > 
> > But either way this will requires us to get the CMRs during kernel boot.
> 
> I don't get the link there between CMRs at boot and handling hotplug.
> 
> We don't need to go to extreme measures just to get a message out of the
> kernel that the BIOS is bad.  If we don't have the data to do it
> already, then I don't really see the need to warn about it.
> 
> Think of a system that has TDX enabled in the BIOS, but is running an
> old kernel.  It will have *ZERO* idea that hotplug doesn't work.  It'll
> run blissfully along.  I don't see any reason that a kernel with TDX
> support, but where TDX is disabled should actively go out and try to be
> better than those old pre-TDX kernels.

Agreed, assuming by "where TDX is disabled" you mean TDX isn't usable (i.e.
the TDX module isn't loaded, or won't be initialized at all).

> 
> Further, there's nothing to stop non-CMR memory from being added to a
> system with TDX enabled in the BIOS but where the kernel is not using
> it.  If we actively go out and keep good old DRAM from being added, then
> we unnecessarily addle those systems.
> 

OK.

Then for memory hot-add, perhaps we can just go with the "winner-take-all"
approach you mentioned before?

For memory hot-removal, as I replied previously, it looks like the kernel
cannot reject the removal once it has allowed the memory to go offline.  Any
suggestion on this?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 1/22] x86/virt/tdx: Detect TDX during kernel boot
  2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
  2022-06-23  5:57   ` Chao Gao
@ 2022-08-02  2:01   ` Wu, Binbin
  2022-08-03  9:25     ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Wu, Binbin @ 2022-08-02  2:01 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata


On 2022/6/22 19:15, Kai Huang wrote:
> +	/*
> +	 * TDX guarantees at least two TDX KeyIDs are configured by
> +	 * BIOS, otherwise SEAMRR is disabled.  Invalid TDX private
> +	 * range means kernel bug (TDX is broken).
> +	 */
> +	if (WARN_ON(!tdx_keyid_start || tdx_keyid_num < 2)) {
Do you think it's better to define a meaningful macro instead of the 
number here and below?
> +		tdx_keyid_start = tdx_keyid_num = 0;
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +

> +
> +/**
> + * platform_tdx_enabled() - Return whether BIOS has enabled TDX
> + *
> + * Return whether BIOS has enabled TDX regardless whether the TDX module
> + * has been loaded or not.
> + */
> +bool platform_tdx_enabled(void)
> +{
> +	return tdx_keyid_num >= 2;
> +}
>
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-07-07 23:34         ` Kai Huang
@ 2022-08-03  1:30           ` Kai Huang
  2022-08-03 14:22             ` Dave Hansen
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-08-03  1:30 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-07-08 at 11:34 +1200, Kai Huang wrote:
> > Why not just entirely remove the lower 1MB from the memblock structure
> > on TDX systems?  Do something equivalent to adding this on the kernel
> > command line:
> > 
> >  	memmap=1M$0x0
> 
> I will explore this option.  Thanks!

Hi Dave,

After investigating and testing, we cannot simply remove the first 1MB from
the e820 table (which is similar to what 'memmap=1M$0x0' does), as the kernel
needs low memory as the trampoline to bring up all APs.

Currently I am doing below:

--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -65,6 +65,17 @@ void __init reserve_real_mode(void)
         * setup_arch().
         */
        memblock_reserve(0, SZ_1M);
+
+       /*
+        * As one step of initializing the TDX module (on-demand), the
+        * kernel will later verify all memory regions in memblock are
+        * truly TDX-capable and convert all of them to TDX memory.
+        * The first 1MB may not be enumerated as TDX-capable memory.
+        * To avoid failure to verify, explicitly remove the first 1MB
+        * from memblock for a TDX (BIOS) enabled system.
+        */
+       if (platform_tdx_enabled())
+               memblock_remove(0, SZ_1M);

I tested and it worked (I didn't observe any problem), but am I missing
something?

Also, regarding whether we can remove platform_tdx_enabled() at all, I looked
into the spec again, and there's no MSR or CPUID bit from which we can check
that TDX is enabled by BIOS -- except the SEAMRR_MASK MSR, which is basically
what platform_tdx_enabled() already checks.

Checking the MSR_MTRRcap.SEAMRR bit isn't enough, as it will be true as long
as the hardware supports SEAMRR, but it doesn't tell whether SEAMRR (TDX) is
enabled by BIOS.

So if the above code is reasonable, I think we can still detect TDX during
boot and keep platform_tdx_enabled().

It also detects TDX KeyIDs, which isn't necessary for removing the first 1MB
here (nor for kexec() support), but detecting TDX KeyIDs must be done anyway,
either during kernel boot or during TDX module initialization.

Detecting TDX KeyIDs at boot time also has the advantage that in the future we
can expose the KeyIDs via sysfs, and userspace can know how many TDs the
machine can support without having to initialize the TDX module first (we
received such a requirement from a customer, but yes, it is arguable).

Any comments?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-07-21  1:52               ` Kai Huang
  2022-07-27  0:34                 ` Kai Huang
@ 2022-08-03  2:37                 ` Kai Huang
  2022-08-03 14:20                   ` Dave Hansen
  1 sibling, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-08-03  2:37 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> Also, if I understand correctly above, your suggestion is we want to prevent any
> CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
> during boot).  This looks contradicts to the requirement of being able to allow
> moving memory from core-mm to driver.  When we offline the memory, we cannot
> know whether the memory will be used by driver, or later hot-removed.

Hi Dave,

The high level flow of device hot-removal is:

acpi_scan_hot_remove()
	-> acpi_scan_try_to_offline()
		-> acpi_bus_offline()
			-> device_offline()
				-> memory_subsys_offline()
	-> acpi_bus_trim()
		-> acpi_memory_device_remove()


And memory_subsys_offline() can also be triggered via /sysfs:

	echo 0 > /sys/devices/system/memory/memory30/online

After the memory block is offline, my understanding is that the kernel can
theoretically move it to, e.g., ZONE_DEVICE via memremap_pages().

As you can see, memory_subsys_offline() is the entry point of memory device
offline (before it the code is generic for all ACPI devices), and it cannot
distinguish whether the removal came from an ACPI event or from sysfs, so it
seems we are unable to refuse to offline memory in memory_subsys_offline()
when it is called from an ACPI event.

Any comments?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-27  5:05     ` Kai Huang
  2022-07-13 11:09       ` Kai Huang
@ 2022-08-03  3:40       ` Binbin Wu
  2022-08-03  9:20         ` Kai Huang
  1 sibling, 1 reply; 114+ messages in thread
From: Binbin Wu @ 2022-08-03  3:40 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang


On 2022/6/27 13:05, Kai Huang wrote:
> On Fri, 2022-06-24 at 11:57 -0700, Dave Hansen wrote:
>> On 6/22/22 04:15, Kai Huang wrote:
>>> Platforms with confidential computing technology may not support ACPI
>>> CPU hotplug when such technology is enabled by the BIOS.  Examples
>>> include Intel platforms which support Intel Trust Domain Extensions
>>> (TDX).
>>>
>>> If the kernel ever receives ACPI CPU hotplug event, it is likely a BIOS
>>> bug.  For ACPI CPU hot-add, the kernel should speak out this is a BIOS
>>> bug and reject the new CPU.  For hot-removal, for simplicity just assume
>>> the kernel cannot continue to work normally, and BUG().
>> So, the kernel is now declaring ACPI CPU hotplug and TDX to be
>> incompatible and even BUG()'ing if we see them together.  Has anyone
>> told the firmware guys about this?  Is this in a spec somewhere?  When
>> the kernel goes boom, are the firmware folks going to cry "Kernel bug!!"?
>>
>> This doesn't seem like something the kernel should be doing unilaterally.
> That TDX doesn't support ACPI CPU hotplug (both hot-add and hot-removal) is an
> architectural behaviour.  The public specs don't explicitly say it, but it is
> implied:
>
> 1) During platform boot MCHECK verifies all logical CPUs on all packages that
> they are TDX compatible, and it keeps some information, such as total CPU
> packages and total logical cpus at some location of SEAMRR so it can later be
> used by P-SEAMLDR and TDX module.  Please see "3.4 SEAMLDR_SEAMINFO" in the P-
> SEAMLDR spec:
>
> https://cdrdv2.intel.com/v1/dl/getContent/733584
>
> 2) Also some SEAMCALLs must be called on all logical CPUs or CPU packages that
> the platform has (such as TDH.SYS.INIT.LP and TDH.SYS.KEY.CONFIG),
> otherwise the further step of TDX module initialization will fail.
>
> Unfortunately there's no public spec mentioning the behaviour of ACPI CPU
> hotplug on a TDX enabled platform.  For instance, whether the BIOS will ever
> get the ACPI CPU hot-plug event, or, if it does, whether it will suppress it.
> What I got from Intel internally is that a non-buggy BIOS should never report
> such an event to the kernel, so if the kernel receives such an event, it
> should be fair enough to treat it as a BIOS bug.
>
> But theoretically, the BIOS isn't in TDX's TCB, and can be from 3rd party..
>
> Also, I was told "CPU hot-plug is a system feature, not a CPU feature or Intel
> architecture feature", so Intel doesn't have an architectural specification for
> CPU hot-plug.
>
> In the meantime, I am pushing Intel internally to add some statements about
> the TDX and CPU hotplug interaction to the BIOS writer's guide and make it
> public.  I guess this is the best thing we can do.
>
> Regarding the code change, I agree the BUG() isn't good.  I used it because:
> 1) this is basically a theoretical problem and shouldn't happen in practice;
> 2) there's no architectural specification regarding the behaviour of TDX on
> CPU hot-removal, so I just used BUG() on the assumption that TDX isn't safe
> to use anymore.

The host kernel is not in TDX's TCB either.  What would happen if the kernel
doesn't do anything in case of a buggy BIOS?  How does TDX handle that case to
enforce the security of TDs?


>
> But Rafael doesn't like current code change either. I think maybe we can just
> disable CPU hotplug code when TDX is enabled by BIOS (something like below):
>
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -707,6 +707,10 @@ bool acpi_duplicate_processor_id(int proc_id)
>   void __init acpi_processor_init(void)
>   {
>          acpi_processor_check_duplicates();
> +
> +       if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED))
> +               return;
> +
>          acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
>          acpi_scan_add_handler(&processor_container_handler);
>   }
>
> This approach is cleaner I think, but we won't be able to report a "BIOS bug"
> when ACPI CPU hotplug happens.  But to me it's OK, as it's perhaps arguable
> whether to treat it as a BIOS bug (since theoretically the BIOS can be from a
> 3rd party).
>
> What's your opinion?
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
                     ` (2 preceding siblings ...)
  2022-06-29  5:33   ` Christoph Hellwig
@ 2022-08-03  3:55   ` Binbin Wu
  2022-08-03  9:21     ` Kai Huang
  3 siblings, 1 reply; 114+ messages in thread
From: Binbin Wu @ 2022-08-03  3:55 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang


On 2022/6/22 19:15, Kai Huang wrote:
>   
> @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
>   	struct device *dev;
>   	int result = 0;
>   
> +	/*
> +	 * If the confidential computing platform doesn't support ACPI
> +	 * memory hotplug, the BIOS should never deliver such event to
memory or cpu hotplug?


> +	 * the kernel.  Report ACPI CPU hot-add as a BIOS bug and ignore
> +	 * the new CPU.
> +	 */
> +	if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
> +		dev_err(&device->dev, "[BIOS bug]: Platform doesn't support ACPI CPU hotplug.  New CPU ignored.\n");
> +		return -EINVAL;
> +	}
> +
>   	pr = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
>   	if (!pr)
>   		return -ENOMEM;
> @@ -434,6 +446,17 @@ static void acpi_processor_remove(struct acpi_device *device)
>   	if (!device || !acpi_driver_data(device))
>   		return;
>   
> +	/*
> +	 * The confidential computing platform is broken if ACPI memory
ditto


> +	 * hot-removal isn't supported but it happened anyway.  Assume
> +	 * it's not guaranteed that the kernel can continue to work
> +	 * normally.  Just BUG().
> +	 */
> +	if (cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED)) {
> +		dev_err(&device->dev, "Platform doesn't support ACPI CPU hotplug. BUG().\n");
> +		BUG();
> +	}
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-08-03  3:40       ` Binbin Wu
@ 2022-08-03  9:20         ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-08-03  9:20 UTC (permalink / raw)
  To: Binbin Wu, Dave Hansen, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On Wed, 2022-08-03 at 11:40 +0800, Binbin Wu wrote:
> host kernel is also not in TDX's TCB either, what would happen if kernel 
> doesn't
> do anything in case of buggy BIOS? How does TDX handle the case to 
> enforce the
> secure of TDs?

TDX doesn't support hot-adding or hot-removing CPUs from TDX's security
perimeter at runtime.  Even if the BIOS/kernel could ever bring up new CPUs at
runtime, the new CPUs cannot run within TDX's security domain, in which case
TDX's security isn't compromised.  If the kernel schedules a TD onto a newly
added CPU, then AFAICT the behaviour is TDX module implementation specific,
not architectural.  A reasonable behaviour would be that TDENTER refuses to
run on a CPU that wasn't verified by TDX during boot.

If any CPU is hot-removed, then TDX's security isn't compromised, but TDX is
no longer guaranteed to work functionally.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug
  2022-08-03  3:55   ` Binbin Wu
@ 2022-08-03  9:21     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-08-03  9:21 UTC (permalink / raw)
  To: Binbin Wu, linux-kernel, kvm
  Cc: linux-acpi, seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	thomas.lendacky, Tianyu.Lan, rdunlap, Jason, juri.lelli,
	mark.rutland, frederic, yuehaibing, dongli.zhang

On Wed, 2022-08-03 at 11:55 +0800, Binbin Wu wrote:
> On 2022/6/22 19:15, Kai Huang wrote:
> >   
> > @@ -357,6 +358,17 @@ static int acpi_processor_add(struct acpi_device *device,
> >   	struct device *dev;
> >   	int result = 0;
> >   
> > +	/*
> > +	 * If the confidential computing platform doesn't support ACPI
> > +	 * memory hotplug, the BIOS should never deliver such event to
> memory or cpu hotplug?

Sorry, typo.  It should be CPU.

Anyway, this patch will be dropped in the next version.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 1/22] x86/virt/tdx: Detect TDX during kernel boot
  2022-08-02  2:01   ` [PATCH v5 1/22] " Wu, Binbin
@ 2022-08-03  9:25     ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-08-03  9:25 UTC (permalink / raw)
  To: Wu, Binbin, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-08-02 at 10:01 +0800, Wu, Binbin wrote:
> On 2022/6/22 19:15, Kai Huang wrote:
> > +	/*
> > +	 * TDX guarantees at least two TDX KeyIDs are configured by
> > +	 * BIOS, otherwise SEAMRR is disabled.  Invalid TDX private
> > +	 * range means kernel bug (TDX is broken).
> > +	 */
> > +	if (WARN_ON(!tdx_keyid_start || tdx_keyid_num < 2)) {
> Do you think it's better to define a meaningful macro instead of the 
> number here and below?
> > 

Personally I don't think we need a macro.  The comment already says "two", so
having a macro doesn't help readability here (and below).  But I am open to
this.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-08-03  2:37                 ` Kai Huang
@ 2022-08-03 14:20                   ` Dave Hansen
  2022-08-03 22:35                     ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-08-03 14:20 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 8/2/22 19:37, Kai Huang wrote:
> On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
>> Also, if I understand correctly above, your suggestion is we want to prevent any
>> CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
>> during boot).  This looks contradicts to the requirement of being able to allow
>> moving memory from core-mm to driver.  When we offline the memory, we cannot
>> know whether the memory will be used by driver, or later hot-removed.
> Hi Dave,
> 
> The high level flow of device hot-removal is:
> 
> acpi_scan_hot_remove()
> 	-> acpi_scan_try_to_offline()
> 		-> acpi_bus_offline()
> 			-> device_offline()
> 				-> memory_subsys_offline()
> 	-> acpi_bus_trim()
> 		-> acpi_memory_device_remove()
> 
> 
> And memory_subsys_offline() can also be triggered via /sysfs:
> 
> 	echo 0 > /sys/devices/system/memory/memory30/online
> 
> After the memory block is offline, my understanding is kernel can theoretically
> move it to, i.e. ZONE_DEVICE via memremap_pages().
> 
> As you can see memory_subsys_offline() is the entry point of memory device
> offline (before it the code is generic for all ACPI device), and it cannot
> distinguish whether the removal is from ACPI event, or from /sysfs, so it seems
> we are unable to refuse to offline memory in  memory_subsys_offline() when it is
> called from ACPI event.
> 
> Any comments?

I suggest refactoring the code in a way that makes it possible to
distinguish the two cases.

It's not like you have some binary kernel.  You have the source code for
the whole thing and can propose changes *ANYWHERE* you need.  Even better:

$ grep -A2 ^ACPI\$ MAINTAINERS
ACPI
M:	"Rafael J. Wysocki" <rafael@kernel.org>
R:	Len Brown <lenb@kernel.org>

The maintainer of ACPI works for our employer.  Plus, he's a nice
helpful guy that you can go ask how you might refactor this or
approaches you might take.  Have you talked to Rafael about this issue?

Also, from a two-minute grepping session, I noticed this:

> static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data,
>                                     void **ret_p)
> {
...
>         if (device->handler && !device->handler->hotplug.enabled) {
>                 *ret_p = &device->dev;
>                 return AE_SUPPORT;
>         }

It looks to me like if you simply set:

	memory_device_handler->hotplug.enabled = false;

you'll get most of the behavior you want.  ACPI memory hotplug would not
work and the changes would be confined to the ACPI world.  The
"lower-level" bus-based hotplug would be unaffected.

Now, I don't know what kind of locking would be needed to muck with a
global structure like that.  But, it's a start.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-08-03  1:30           ` Kai Huang
@ 2022-08-03 14:22             ` Dave Hansen
  2022-08-03 22:14               ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Hansen @ 2022-08-03 14:22 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 8/2/22 18:30, Kai Huang wrote:
> On Fri, 2022-07-08 at 11:34 +1200, Kai Huang wrote:
>>> Why not just entirely remove the lower 1MB from the memblock structure
>>> on TDX systems?  Do something equivalent to adding this on the kernel
>>> command line:
>>>
>>>  	memmap=1M$0x0
>> I will explore this option.  Thanks!
> Hi Dave,
> 
> After investigating and testing, we cannot simply remove first 1MB from e820
> table which is similar to what 'memmap=1M$0x0' does, as the kernel needs low
> memory as trampoline to bring up all APs.

OK, so don't remove it, but reserve it so that the trampoline code can
use it.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory
  2022-08-03 14:22             ` Dave Hansen
@ 2022-08-03 22:14               ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-08-03 22:14 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-08-03 at 07:22 -0700, Dave Hansen wrote:
> On 8/2/22 18:30, Kai Huang wrote:
> > On Fri, 2022-07-08 at 11:34 +1200, Kai Huang wrote:
> > > > Why not just entirely remove the lower 1MB from the memblock structure
> > > > on TDX systems?  Do something equivalent to adding this on the kernel
> > > > command line:
> > > > 
> > > >  	memmap=1M$0x0
> > > I will explore this option.  Thanks!
> > Hi Dave,
> > 
> > After investigating and testing, we cannot simply remove first 1MB from e820
> > table which is similar to what 'memmap=1M$0x0' does, as the kernel needs low
> > memory as trampoline to bring up all APs.
> 
> OK, so don't remove it, but reserve it so that the trampoline code can
> use it.

It's already reserved in the existing reserve_real_mode().  What we need is to
*remove* the first 1MB from memblock.memory, so that
for_each_mem_pfn_range() simply never returns any memory below 1MB.  Otherwise
we need to explicitly skip the first 1MB in TDX code, as I did in this
series.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-08-03 14:20                   ` Dave Hansen
@ 2022-08-03 22:35                     ` Kai Huang
  2022-08-04 10:06                       ` Kai Huang
  0 siblings, 1 reply; 114+ messages in thread
From: Kai Huang @ 2022-08-03 22:35 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote:
> On 8/2/22 19:37, Kai Huang wrote:
> > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> > > Also, if I understand correctly above, your suggestion is we want to prevent any
> > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
> > > during boot).  This looks contradicts to the requirement of being able to allow
> > > moving memory from core-mm to driver.  When we offline the memory, we cannot
> > > know whether the memory will be used by driver, or later hot-removed.
> > Hi Dave,
> > 
> > The high level flow of device hot-removal is:
> > 
> > acpi_scan_hot_remove()
> > 	-> acpi_scan_try_to_offline()
> > 		-> acpi_bus_offline()
> > 			-> device_offline()
> > 				-> memory_subsys_offline()
> > 	-> acpi_bus_trim()
> > 		-> acpi_memory_device_remove()
> > 
> > 
> > And memory_subsys_offline() can also be triggered via /sysfs:
> > 
> > 	echo 0 > /sys/devices/system/memory/memory30/online
> > 
> > After the memory block is offline, my understanding is kernel can theoretically
> > move it to, i.e. ZONE_DEVICE via memremap_pages().
> > 
> > As you can see memory_subsys_offline() is the entry point of memory device
> > offline (before it the code is generic for all ACPI device), and it cannot
> > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems
> > we are unable to refuse to offline memory in  memory_subsys_offline() when it is
> > called from ACPI event.
> > 
> > Any comments?
> 
> I suggest refactoring the code in a way that makes it possible to
> distinguish the two cases.
> 
> It's not like you have some binary kernel.  You have the source code for
> the whole thing and can propose changes *ANYWHERE* you need.  Even better:
> 
> $ grep -A2 ^ACPI\$ MAINTAINERS
> ACPI
> M:	"Rafael J. Wysocki" <rafael@kernel.org>
> R:	Len Brown <lenb@kernel.org>
> 
> The maintainer of ACPI works for our employer.  Plus, he's a nice
> helpful guy that you can go ask how you might refactor this or
> approaches you might take.  Have you talked to Rafael about this issue?

Rafael once also suggested setting hotplug.enabled to 0, as your code below
shows, but we only recently got the TDX architectural behaviour of memory
hotplug clarified by the Intel TDX folks.

> Also, from a two-minute grepping session, I noticed this:
> 
> > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data,
> >                                     void **ret_p)
> > {
> ...
> >         if (device->handler && !device->handler->hotplug.enabled) {
> >                 *ret_p = &device->dev;
> >                 return AE_SUPPORT;
> >         }
> 
> It looks to me like if you simply set:
> 
> 	memory_device_handler->hotplug.enabled = false;
> 
> you'll get most of the behavior you want.  ACPI memory hotplug would not
> work and the changes would be confined to the ACPI world.  The
> "lower-level" bus-based hotplug would be unaffected.
> 
> Now, I don't know what kind of locking would be needed to muck with a
> global structure like that.  But, it's a start.

This has two problems:

1) This approach cannot distinguish non-CMR memory hotplug from CMR memory
hotplug, as it disables ACPI memory hotplug for both.  But this is fine, as we
want to reject non-CMR memory hotplug anyway.  We just need to explain this
clearly in the changelog.

2) This won't allow the kernel to call out a "BIOS bug" when CMR memory hotplug
actually happens.  Instead, we can only print out "hotplug is disabled because
TDX is enabled by BIOS." when we set hotplug.enabled to false.

Assuming the above is OK, I'll explore this option.  I'll also do some research
to see whether it's still possible to call out the "BIOS bug" with this
approach, but that's no longer a mandatory requirement to me.

Also, if printing out "BIOS bug" for CMR memory hotplug isn't mandatory, then
we can just detect TDX during kernel boot and disable hotplug when TDX is
enabled by BIOS, without needing the "winner-take-all" approach.  The former is
clearer and easier to implement.  I'll go with the former approach if I don't
hear an objection from you.

And ACPI CPU hotplug can be handled the same way.

Please let me know any comments.  Thanks!

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
  2022-08-03 22:35                     ` Kai Huang
@ 2022-08-04 10:06                       ` Kai Huang
  0 siblings, 0 replies; 114+ messages in thread
From: Kai Huang @ 2022-08-04 10:06 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-08-04 at 10:35 +1200, Kai Huang wrote:
> On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote:
> > On 8/2/22 19:37, Kai Huang wrote:
> > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> > > > Also, if I understand correctly above, your suggestion is we want to prevent any
> > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
> > > > during boot).  This looks contradicts to the requirement of being able to allow
> > > > moving memory from core-mm to driver.  When we offline the memory, we cannot
> > > > know whether the memory will be used by driver, or later hot-removed.
> > > Hi Dave,
> > > 
> > > The high level flow of device hot-removal is:
> > > 
> > > acpi_scan_hot_remove()
> > > 	-> acpi_scan_try_to_offline()
> > > 		-> acpi_bus_offline()
> > > 			-> device_offline()
> > > 				-> memory_subsys_offline()
> > > 	-> acpi_bus_trim()
> > > 		-> acpi_memory_device_remove()
> > > 
> > > 
> > > And memory_subsys_offline() can also be triggered via /sysfs:
> > > 
> > > 	echo 0 > /sys/devices/system/memory/memory30/online
> > > 
> > > After the memory block is offline, my understanding is kernel can theoretically
> > > move it to, i.e. ZONE_DEVICE via memremap_pages().
> > > 
> > > As you can see memory_subsys_offline() is the entry point of memory device
> > > offline (before it the code is generic for all ACPI device), and it cannot
> > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems
> > > we are unable to refuse to offline memory in  memory_subsys_offline() when it is
> > > called from ACPI event.
> > > 
> > > Any comments?
> > 
> > I suggest refactoring the code in a way that makes it possible to
> > distinguish the two cases.
> > 
> > It's not like you have some binary kernel.  You have the source code for
> > the whole thing and can propose changes *ANYWHERE* you need.  Even better:
> > 
> > $ grep -A2 ^ACPI\$ MAINTAINERS
> > ACPI
> > M:	"Rafael J. Wysocki" <rafael@kernel.org>
> > R:	Len Brown <lenb@kernel.org>
> > 
> > The maintainer of ACPI works for our employer.  Plus, he's a nice
> > helpful guy that you can go ask how you might refactor this or
> > approaches you might take.  Have you talked to Rafael about this issue?
> 
> Rafael once also suggested to set hotplug.enabled to 0 as your code shows below,
> but we just got the TDX architecture behaviour of memory hotplug clarified from
> Intel TDX guys recently. 
> 
> > Also, from a two-minute grepping session, I noticed this:
> > 
> > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data,
> > >                                     void **ret_p)
> > > {
> > ...
> > >         if (device->handler && !device->handler->hotplug.enabled) {
> > >                 *ret_p = &device->dev;
> > >                 return AE_SUPPORT;
> > >         }
> > 
> > It looks to me like if you simply set:
> > 
> > 	memory_device_handler->hotplug.enabled = false;
> > 
> > you'll get most of the behavior you want.  ACPI memory hotplug would not
> > work and the changes would be confined to the ACPI world.  The
> > "lower-level" bus-based hotplug would be unaffected.
> > 
> > Now, I don't know what kind of locking would be needed to muck with a
> > global structure like that.  But, it's a start.
> 
> This has two problems:
> 
> 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory
> hotplug, as it disables ACPI memory hotplug for all.  But this is fine as we
> want to reject non-CMR memory hotplug anyway.  We just need to explain clearly
> in changelog.
> 
> 2) This won't allow the kernel to speak out "BIOS  bug" when CMR memory hotplug
> actually happens.  Instead, we can only print out "hotplug is disabled due to
> TDX is enabled by BIOS." when we set hotplug.enable to false.
> 
> Assuming above is OK, I'll explore this option.  I'll also do some research to
> see if it's still possible to speak out "BIOS bug" in this approach but it's not
> a mandatory requirement to me now.
> 
> Also, if print out "BIOS bug" for CMR memory hotplug isn't mandatory, then we
> can just detect TDX during kernel boot, and disable hotplug when TDX is enabled
> by BIOS, but don't need to use "winner-take-all" approach.  The former is
> clearer and easier to implement.  I'll go with the former approach if I don't
> hear objection from you.
> 
> And ACPI CPU hotplug can also use the same way.
> 
> Please let me know any comments.  Thanks!
> 

One more reason why the "winner-take-all" approach doesn't work:

If we allow ACPI memory hotplug to happen but choose to disable it in the
handler using "winner-take-all", then at the beginning the ACPI code will
actually create a /sysfs entry for hotplug.enabled to allow userspace to change
it:

	/sys/firmware/acpi/hotplug/memory/enabled

This means that even if we set hotplug.enabled to false at some point,
userspace can turn it on again.  The only way out is to not create this /sysfs
entry in the first place.

With the "winner-take-all" approach, I don't think we should avoid creating the
/sysfs entry.  Nor should we introduce an arch-specific hook to, e.g., prevent
the /sysfs entry from being changed by userspace.

So instead of the "winner-take-all" approach, I'll introduce a new kernel
command line option to allow the user to choose between ACPI CPU/memory hotplug
and TDX.  This option should not impact "software" CPU/memory hotplug even when
the user chooses TDX.  In that case, this is similar to "winner-take-all"
anyway.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-06-22 11:17 ` [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
  2022-06-24 20:13   ` Dave Hansen
@ 2022-08-17 22:46   ` Sagi Shahar
  2022-08-17 23:43     ` Huang, Kai
  1 sibling, 1 reply; 114+ messages in thread
From: Sagi Shahar @ 2022-08-17 22:46 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dave Hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, Yamahata, Isaku

On Wed, Jun 22, 2022 at 4:19 AM Kai Huang <kai.huang@intel.com> wrote:
>
> The TDX module uses additional metadata to record things like which
> guest "owns" a given page of memory.  This metadata, referred as
> Physical Address Metadata Table (PAMT), essentially serves as the
> 'struct page' for the TDX module.  PAMTs are not reserved by hardware
> up front.  They must be allocated by the kernel and then given to the
> TDX module.
>
> TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
> be a physically contiguous area from a Convertible Memory Region (CMR).
> However, the PAMTs which track pages in one TDMR do not need to reside
> within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
> any TDMR, the overlapping part must be reported as a reserved area in
> that particular TDMR.
>
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).
> The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> mitigation is to launch a TD guest early during system boot to get those
> PAMTs allocated at early time, but the only way to fix is to add a boot
> option to allocate or reserve PAMTs during kernel boot.
>
> TDX only supports a limited number of reserved areas per TDMR to cover
> both PAMTs and memory holes within the given TDMR.  If many PAMTs are
> allocated within a single TDMR, the reserved areas may not be sufficient
> to cover all of them.
>
> Adopt the following policies when allocating PAMTs for a given TDMR:
>
>   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
>     the total number of reserved areas consumed for PAMTs.
>   - Try to first allocate PAMT from the local node of the TDMR for better
>     NUMA locality.
>
> Also dump out how many pages are allocated for PAMTs when the TDX module
> is initialized successfully.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> - v3 -> v5 (no feedback on v4):
>  - Used memblock to get the NUMA node for given TDMR.
>  - Removed tdmr_get_pamt_sz() helper but use open-code instead.
>  - Changed to use 'switch .. case..' for each TDX supported page size in
>    tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
>  - Added printing out memory used for PAMT allocation when TDX module is
>    initialized successfully.
>  - Explained downside of alloc_contig_pages() in changelog.
>  - Addressed other minor comments.
>
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/virt/vmx/tdx/tdx.c | 200 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 201 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4988a91d5283..ec496e96d120 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
>         depends on CPU_SUP_INTEL
>         depends on X86_64
>         depends on KVM_INTEL
> +       depends on CONTIG_ALLOC
>         select ARCH_HAS_CC_PLATFORM
>         select ARCH_KEEP_MEMBLOCK
>         help
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fd9f449b5395..36260dd7e69f 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -558,6 +558,196 @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
>         return 0;
>  }
>
> +/* Page sizes supported by TDX */
> +enum tdx_page_sz {
> +       TDX_PG_4K,
> +       TDX_PG_2M,
> +       TDX_PG_1G,
> +       TDX_PG_MAX,
> +};
> +
> +/*
> + * Calculate PAMT size given a TDMR and a page size.  The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> +                                     enum tdx_page_sz pgsz)
> +{
> +       unsigned long pamt_sz;
> +       int pamt_entry_nr;
            ^
This should be an 'unsigned long'. Otherwise you get an integer
overflow for large memory machines.

> +
> +       switch (pgsz) {
> +       case TDX_PG_4K:
> +               pamt_entry_nr = tdmr->size >> PAGE_SHIFT;
> +               break;
> +       case TDX_PG_2M:
> +               pamt_entry_nr = tdmr->size >> PMD_SHIFT;
> +               break;
> +       case TDX_PG_1G:
> +               pamt_entry_nr = tdmr->size >> PUD_SHIFT;
> +               break;
> +       default:
> +               WARN_ON_ONCE(1);
> +               return 0;
> +       }
> +
> +       pamt_sz = pamt_entry_nr * tdx_sysinfo.pamt_entry_size;
> +       /* TDX requires PAMT size must be 4K aligned */
> +       pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> +       return pamt_sz;
> +}
> +
> +/*
> + * Pick a NUMA node on which to allocate this TDMR's metadata.
> + *
> + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> + * not be.  If the TDMR covers more than one node, just use the _first_
> + * one.  This can lead to small areas of off-node metadata for some
> + * memory.
> + */
> +static int tdmr_get_nid(struct tdmr_info *tdmr)
> +{
> +       unsigned long start_pfn, end_pfn;
> +       int i, nid;
> +
> +       /* Find the first memory region covered by the TDMR */
> +       memblock_for_each_tdx_mem_pfn_range(i, &start_pfn, &end_pfn, &nid) {
> +               if (end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> +                       return nid;
> +       }
> +
> +       /*
> +        * No memory region found for this TDMR.  It cannot happen since
> +        * when one TDMR is created, it must cover at least one (or
> +        * partial) memory region.
> +        */
> +       WARN_ON_ONCE(1);
> +       return 0;
> +}
> +
> +static int tdmr_set_up_pamt(struct tdmr_info *tdmr)
> +{
> +       unsigned long pamt_base[TDX_PG_MAX];
> +       unsigned long pamt_size[TDX_PG_MAX];
> +       unsigned long tdmr_pamt_base;
> +       unsigned long tdmr_pamt_size;
> +       enum tdx_page_sz pgsz;
> +       struct page *pamt;
> +       int nid;
> +
> +       nid = tdmr_get_nid(tdmr);
> +
> +       /*
> +        * Calculate the PAMT size for each TDX supported page size
> +        * and the total PAMT size.
> +        */
> +       tdmr_pamt_size = 0;
> +       for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> +               pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
> +               tdmr_pamt_size += pamt_size[pgsz];
> +       }
> +
> +       /*
> +        * Allocate one chunk of physically contiguous memory for all
> +        * PAMTs.  This helps minimize the PAMT's use of reserved areas
> +        * in overlapped TDMRs.
> +        */
> +       pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> +                       nid, &node_online_map);
> +       if (!pamt)
> +               return -ENOMEM;
> +
> +       /* Calculate PAMT base and size for all supported page sizes. */
> +       tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> +       for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> +               pamt_base[pgsz] = tdmr_pamt_base;
> +               tdmr_pamt_base += pamt_size[pgsz];
> +       }
> +
> +       tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> +       tdmr->pamt_4k_size = pamt_size[TDX_PG_4K];
> +       tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> +       tdmr->pamt_2m_size = pamt_size[TDX_PG_2M];
> +       tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> +       tdmr->pamt_1g_size = pamt_size[TDX_PG_1G];
> +
> +       return 0;
> +}
> +
> +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
> +                         unsigned long *pamt_npages)
> +{
> +       unsigned long pamt_base, pamt_sz;
> +
> +       /*
> +        * The PAMT was allocated in one contiguous unit.  The 4K PAMT
> +        * should always point to the beginning of that allocation.
> +        */
> +       pamt_base = tdmr->pamt_4k_base;
> +       pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> +       *pamt_pfn = pamt_base >> PAGE_SHIFT;
> +       *pamt_npages = pamt_sz >> PAGE_SHIFT;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> +       unsigned long pamt_pfn, pamt_npages;
> +
> +       tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
> +
> +       /* Do nothing if PAMT hasn't been allocated for this TDMR */
> +       if (!pamt_npages)
> +               return;
> +
> +       if (WARN_ON_ONCE(!pamt_pfn))
> +               return;
> +
> +       free_contig_range(pamt_pfn, pamt_npages);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
> +{
> +       int i;
> +
> +       for (i = 0; i < tdmr_num; i++)
> +               tdmr_free_pamt(tdmr_array_entry(tdmr_array, i));
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
> +{
> +       int i, ret = 0;
> +
> +       for (i = 0; i < tdmr_num; i++) {
> +               ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i));
> +               if (ret)
> +                       goto err;
> +       }
> +
> +       return 0;
> +err:
> +       tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> +       return ret;
> +}
> +
> +static unsigned long tdmrs_get_pamt_pages(struct tdmr_info *tdmr_array,
> +                                         int tdmr_num)
> +{
> +       unsigned long pamt_npages = 0;
> +       int i;
> +
> +       for (i = 0; i < tdmr_num; i++) {
> +               unsigned long pfn, npages;
> +
> +               tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages);
> +               pamt_npages += npages;
> +       }
> +
> +       return pamt_npages;
> +}
> +
>  /*
>   * Construct an array of TDMRs to cover all memory regions in memblock.
>   * This makes sure all pages managed by the page allocator are TDX
> @@ -572,8 +762,13 @@ static int construct_tdmrs_memeblock(struct tdmr_info *tdmr_array,
>         if (ret)
>                 goto err;
>
> +       ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
> +       if (ret)
> +               goto err;
> +
>         /* Return -EINVAL until constructing TDMRs is done */
>         ret = -EINVAL;
> +       tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
>  err:
>         return ret;
>  }
> @@ -644,6 +839,11 @@ static int init_tdx_module(void)
>          * process are done.
>          */
>         ret = -EINVAL;
> +       if (ret)
> +               tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> +       else
> +               pr_info("%lu pages allocated for PAMT.\n",
> +                               tdmrs_get_pamt_pages(tdmr_array, tdmr_num));
>  out_free_tdmrs:
>         /*
>          * The array of TDMRs is freed no matter the initialization is
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-08-17 22:46   ` Sagi Shahar
@ 2022-08-17 23:43     ` Huang, Kai
  0 siblings, 0 replies; 114+ messages in thread
From: Huang, Kai @ 2022-08-17 23:43 UTC (permalink / raw)
  To: Shahar, Sagi
  Cc: Brown, Len, Hansen, Dave, Christopherson,,
	Sean, ak, peterz, Chatre, Reinette, linux-kernel, Williams,
	Dan J, Luck, Tony, kvm, pbonzini, Wysocki, Rafael J,
	sathyanarayanan.kuppuswamy, kirill.shutemov, Yamahata, Isaku

On Wed, 2022-08-17 at 15:46 -0700, Sagi Shahar wrote:
> > +/*
> > + * Calculate PAMT size given a TDMR and a page size.  The returned
> > + * PAMT size is always aligned up to 4K page boundary.
> > + */
> > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> > +                                     enum tdx_page_sz pgsz)
> > +{
> > +       unsigned long pamt_sz;
> > +       int pamt_entry_nr;
>             ^
> This should be an 'unsigned long'. Otherwise you get an integer
> overflow for large memory machines.

Agreed.  Thanks.

> 
> > +
> > +       switch (pgsz) {
> > +       case TDX_PG_4K:
> > +               pamt_entry_nr = tdmr->size >> PAGE_SHIFT;
> > +               break;
> > +       case TDX_PG_2M:
> > +               pamt_entry_nr = tdmr->size >> PMD_SHIFT;
> > +               break;
> > +       case TDX_PG_1G:
> > +               pamt_entry_nr = tdmr->size >> PUD_SHIFT;
> > +               break;
> > +       default:
> > +               WARN_ON_ONCE(1);
> > +               return 0;
> > +       }
> > +
> > +       pamt_sz = pamt_entry_nr * tdx_sysinfo.pamt_entry_size;
> > +       /* TDX requires PAMT size must be 4K aligned */
> > +       pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> > +
> > +       return pamt_sz;
> > +}
> > +


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support
  2022-06-22 11:17 ` [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
@ 2022-08-18  4:07   ` Bagas Sanjaya
  2022-08-18  9:33     ` Huang, Kai
  0 siblings, 1 reply; 114+ messages in thread
From: Bagas Sanjaya @ 2022-08-18  4:07 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, linux-doc, kvm, seanjc, pbonzini, dave.hansen,
	len.brown, tony.luck, rafael.j.wysocki, reinette.chatre,
	dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

[-- Attachment #1: Type: text/plain, Size: 5866 bytes --]

On Wed, Jun 22, 2022 at 11:17:50PM +1200, Kai Huang wrote:
> +Kernel detects TDX and the TDX private KeyIDs during kernel boot.  User
> +can see below dmesg if TDX is enabled by BIOS:
> +
> +|  [..] tdx: SEAMRR enabled.
> +|  [..] tdx: TDX private KeyID range: [16, 64).
> +|  [..] tdx: TDX enabled by BIOS.
> +
<snipped>
> +Initializing the TDX module consumes roughly ~1/256th system RAM size to
> +use it as 'metadata' for the TDX memory.  It also takes additional CPU
> +time to initialize those metadata along with the TDX module itself.  Both
> +are not trivial.  Current kernel doesn't choose to always initialize the
> +TDX module during kernel boot, but provides a function tdx_init() to
> +allow the caller to initialize TDX when it truly wants to use TDX:
> +
> +        ret = tdx_init();
> +        if (ret)
> +                goto no_tdx;
> +        // TDX is ready to use
> +

Hi,

The code block above produces Sphinx warnings:

Documentation/x86/tdx.rst:69: WARNING: Unexpected indentation.
Documentation/x86/tdx.rst:70: WARNING: Block quote ends without a blank line; unexpected unindent.

I have applied the fixup:

---- >8 ----

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index 6c6b09ca6ba407..4430912a2e4f05 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -62,7 +62,7 @@ use it as 'metadata' for the TDX memory.  It also takes additional CPU
 time to initialize those metadata along with the TDX module itself.  Both
 are not trivial.  Current kernel doesn't choose to always initialize the
 TDX module during kernel boot, but provides a function tdx_init() to
-allow the caller to initialize TDX when it truly wants to use TDX:
+allow the caller to initialize TDX when it truly wants to use TDX::
 
         ret = tdx_init();
         if (ret)

> +If the TDX module is not loaded, dmesg shows below:
> +
> +|  [..] tdx: TDX module is not loaded.
> +
> +If the TDX module is initialized successfully, dmesg shows something
> +like below:
> +
> +|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
> +|  [..] tdx: 65667 pages allocated for PAMT.
> +|  [..] tdx: TDX module initialized.
> +
> +If the TDX module failed to initialize, dmesg shows below:
> +
> +|  [..] tdx: Failed to initialize TDX module.  Shut it down.
<snipped>
> +There are basically two memory hot-add cases that need to be prevented:
> +ACPI memory hot-add and driver managed memory hot-add.  The kernel
> +rejects the driver managed memory hot-add too when TDX is enabled by
> +BIOS.  For instance, dmesg shows below error when using kmem driver to
> +add a legacy PMEM as system RAM:
> +
> +|  [..] tdx: Unable to add memory [0x580000000, 0x600000000) on TDX enabled platform.
> +|  [..] kmem dax0.0: mapping0: 0x580000000-0x5ffffffff memory add failed
> +

For dmesg output, use a literal code block instead of line blocks, like:

---- >8 ----

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index 4430912a2e4f05..1eaeb7cd14d76f 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -41,11 +41,11 @@ TDX boot-time detection
 -----------------------
 
 Kernel detects TDX and the TDX private KeyIDs during kernel boot.  User
-can see below dmesg if TDX is enabled by BIOS:
+can see below dmesg if TDX is enabled by BIOS::
 
-|  [..] tdx: SEAMRR enabled.
-|  [..] tdx: TDX private KeyID range: [16, 64).
-|  [..] tdx: TDX enabled by BIOS.
+  [..] tdx: SEAMRR enabled.
+  [..] tdx: TDX private KeyID range: [16, 64).
+  [..] tdx: TDX enabled by BIOS.
 
 TDX module detection and initialization
 ---------------------------------------
@@ -79,20 +79,20 @@ caller.
 User can consult dmesg to see the presence of the TDX module, and whether
 it has been initialized.
 
-If the TDX module is not loaded, dmesg shows below:
+If the TDX module is not loaded, dmesg shows below::
 
-|  [..] tdx: TDX module is not loaded.
+  [..] tdx: TDX module is not loaded.
 
 If the TDX module is initialized successfully, dmesg shows something
-like below:
+like below::
 
-|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
-|  [..] tdx: 65667 pages allocated for PAMT.
-|  [..] tdx: TDX module initialized.
+  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+  [..] tdx: 65667 pages allocated for PAMT.
+  [..] tdx: TDX module initialized.
 
-If the TDX module failed to initialize, dmesg shows below:
+If the TDX module failed to initialize, dmesg shows below::
 
-|  [..] tdx: Failed to initialize TDX module.  Shut it down.
+  [..] tdx: Failed to initialize TDX module.  Shut it down.
 
 TDX Interaction to Other Kernel Components
 ------------------------------------------
@@ -143,10 +143,10 @@ There are basically two memory hot-add cases that need to be prevented:
 ACPI memory hot-add and driver managed memory hot-add.  The kernel
 rejects the driver managed memory hot-add too when TDX is enabled by
 BIOS.  For instance, dmesg shows below error when using kmem driver to
-add a legacy PMEM as system RAM:
+add a legacy PMEM as system RAM::
 
-|  [..] tdx: Unable to add memory [0x580000000, 0x600000000) on TDX enabled platform.
-|  [..] kmem dax0.0: mapping0: 0x580000000-0x5ffffffff memory add failed
+  [..] tdx: Unable to add memory [0x580000000, 0x600000000) on TDX enabled platform.
+  [..] kmem dax0.0: mapping0: 0x580000000-0x5ffffffff memory add failed
 
 However, adding new memory to ZONE_DEVICE should not be prevented as
 those pages are not managed by the page allocator.  Therefore,

Thanks.

-- 
An old man doll... just what I always wanted! - Clara



* Re: [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support
  2022-08-18  4:07   ` Bagas Sanjaya
@ 2022-08-18  9:33     ` Huang, Kai
  0 siblings, 0 replies; 114+ messages in thread
From: Huang, Kai @ 2022-08-18  9:33 UTC (permalink / raw)
  To: bagasdotme
  Cc: kvm, Hansen, Dave, Luck, Tony, ak, Wysocki, Rafael J,
	linux-kernel, Christopherson, Sean, Chatre, Reinette, pbonzini, Yamahata, Isaku,
	kirill.shutemov, linux-doc, peterz, Brown, Len,
	sathyanarayanan.kuppuswamy, Williams, Dan J

On Thu, 2022-08-18 at 11:07 +0700, Bagas Sanjaya wrote:
> On Wed, Jun 22, 2022 at 11:17:50PM +1200, Kai Huang wrote:
> > +Kernel detects TDX and the TDX private KeyIDs during kernel boot.  User
> > +can see below dmesg if TDX is enabled by BIOS:
> > +
> > +|  [..] tdx: SEAMRR enabled.
> > +|  [..] tdx: TDX private KeyID range: [16, 64).
> > +|  [..] tdx: TDX enabled by BIOS.
> > +
> <snipped>
> > +Initializing the TDX module consumes roughly ~1/256th system RAM size to
> > +use it as 'metadata' for the TDX memory.  It also takes additional CPU
> > +time to initialize those metadata along with the TDX module itself.  Both
> > +are not trivial.  Current kernel doesn't choose to always initialize the
> > +TDX module during kernel boot, but provides a function tdx_init() to
> > +allow the caller to initialize TDX when it truly wants to use TDX:
> > +
> > +        ret = tdx_init();
> > +        if (ret)
> > +                goto no_tdx;
> > +        // TDX is ready to use
> > +
> 
> Hi,
> 
> The code block above produces Sphinx warnings:
> 
> Documentation/x86/tdx.rst:69: WARNING: Unexpected indentation.
> Documentation/x86/tdx.rst:70: WARNING: Block quote ends without a blank line; unexpected unindent.
> 
> I have applied the fixup:
> 

Thank you! will fix in next version.


-- 
Thanks,
-Kai




end of thread, other threads:[~2022-08-18  9:33 UTC | newest]

Thread overview: 114+ messages
2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2022-06-23  5:57   ` Chao Gao
2022-06-23  9:23     ` Kai Huang
2022-08-02  2:01   ` [PATCH v5 1/22] " Wu, Binbin
2022-08-03  9:25     ` Kai Huang
2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
2022-06-22 11:42   ` Rafael J. Wysocki
2022-06-23  0:01     ` Kai Huang
2022-06-27  8:01       ` Igor Mammedov
2022-06-28 10:04         ` Kai Huang
2022-06-28 11:52           ` Igor Mammedov
2022-06-28 17:33           ` Rafael J. Wysocki
2022-06-28 23:41             ` Kai Huang
2022-06-24 18:57   ` Dave Hansen
2022-06-27  5:05     ` Kai Huang
2022-07-13 11:09       ` Kai Huang
2022-07-19 17:46         ` Dave Hansen
2022-07-19 23:54           ` Kai Huang
2022-08-03  3:40       ` Binbin Wu
2022-08-03  9:20         ` Kai Huang
2022-06-29  5:33   ` Christoph Hellwig
2022-06-29  9:09     ` Kai Huang
2022-08-03  3:55   ` Binbin Wu
2022-08-03  9:21     ` Kai Huang
2022-06-22 11:15 ` [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug Kai Huang
2022-06-22 11:45   ` Rafael J. Wysocki
2022-06-23  0:08     ` Kai Huang
2022-06-28 17:55       ` Rafael J. Wysocki
2022-06-28 12:01     ` Igor Mammedov
2022-06-28 23:49       ` Kai Huang
2022-06-29  8:48         ` Igor Mammedov
2022-06-29  9:13           ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and " Kai Huang
2022-06-24  1:41   ` Chao Gao
2022-06-24 11:21     ` Kai Huang
2022-06-29  8:35       ` Yuan Yao
2022-06-29  9:17         ` Kai Huang
2022-06-29 14:22       ` Dave Hansen
2022-06-29 23:02         ` Kai Huang
2022-06-30 15:44           ` Dave Hansen
2022-06-30 22:45             ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory Kai Huang
2022-06-24  2:12   ` Chao Gao
2022-06-24 11:23     ` Kai Huang
2022-06-24 19:01   ` Dave Hansen
2022-06-27  5:27     ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
2022-06-24  2:39   ` Chao Gao
2022-06-24 11:27     ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function Kai Huang
2022-06-24 18:38   ` Dave Hansen
2022-06-27  5:23     ` Kai Huang
2022-06-27 20:58       ` Dave Hansen
2022-06-27 22:10         ` Kai Huang
2022-07-19 19:39           ` Dan Williams
2022-07-19 23:28             ` Kai Huang
2022-07-20 10:18           ` Kai Huang
2022-07-20 16:48             ` Dave Hansen
2022-07-21  1:52               ` Kai Huang
2022-07-27  0:34                 ` Kai Huang
2022-07-27  0:50                   ` Dave Hansen
2022-07-27 12:46                     ` Kai Huang
2022-08-03  2:37                 ` Kai Huang
2022-08-03 14:20                   ` Dave Hansen
2022-08-03 22:35                     ` Kai Huang
2022-08-04 10:06                       ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
2022-06-24 18:50   ` Dave Hansen
2022-06-27  5:26     ` Kai Huang
2022-06-27 20:46       ` Dave Hansen
2022-06-27 22:34         ` Kai Huang
2022-06-27 22:56           ` Dave Hansen
2022-06-27 23:59             ` Kai Huang
2022-06-28  0:03               ` Dave Hansen
2022-06-28  0:11                 ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 09/22] x86/virt/tdx: Detect TDX module by doing module global initialization Kai Huang
2022-06-22 11:16 ` [PATCH v5 10/22] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
2022-06-22 11:17 ` [PATCH v5 11/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2022-06-22 11:17 ` [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory Kai Huang
2022-06-24 19:40   ` Dave Hansen
2022-06-27  6:16     ` Kai Huang
2022-07-07  2:37       ` Kai Huang
2022-07-07 14:26       ` Dave Hansen
2022-07-07 14:36         ` Juergen Gross
2022-07-07 23:42           ` Kai Huang
2022-07-07 23:34         ` Kai Huang
2022-08-03  1:30           ` Kai Huang
2022-08-03 14:22             ` Dave Hansen
2022-08-03 22:14               ` Kai Huang
2022-06-22 11:17 ` [PATCH v5 13/22] x86/virt/tdx: Add placeholder to construct TDMRs based on memblock Kai Huang
2022-06-22 11:17 ` [PATCH v5 14/22] x86/virt/tdx: Create TDMRs to cover all memblock memory regions Kai Huang
2022-06-22 11:17 ` [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2022-06-24 20:13   ` Dave Hansen
2022-06-27 10:31     ` Kai Huang
2022-06-27 20:41       ` Dave Hansen
2022-06-27 22:50         ` Kai Huang
2022-06-27 22:57           ` Dave Hansen
2022-06-27 23:05             ` Kai Huang
2022-06-28  0:48         ` Xiaoyao Li
2022-06-28 17:03           ` Dave Hansen
2022-08-17 22:46   ` Sagi Shahar
2022-08-17 23:43     ` Huang, Kai
2022-06-22 11:17 ` [PATCH v5 16/22] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
2022-06-22 11:17 ` [PATCH v5 17/22] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
2022-06-22 11:17 ` [PATCH v5 18/22] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
2022-06-22 11:17 ` [PATCH v5 19/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2022-06-22 11:17 ` [PATCH v5 20/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
2022-06-22 11:17 ` [PATCH v5 21/22] x86/virt/tdx: Support kexec() Kai Huang
2022-06-22 11:17 ` [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
2022-08-18  4:07   ` Bagas Sanjaya
2022-08-18  9:33     ` Huang, Kai
2022-06-24 19:47 ` [PATCH v5 00/22] TDX host kernel support Dave Hansen
2022-06-27  4:09   ` Kai Huang
