linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/21] TDX host kernel support
@ 2022-04-06  4:49 Kai Huang
  2022-04-06  4:49 ` [PATCH v3 01/21] x86/virt/tdx: Detect SEAM Kai Huang
                   ` (22 more replies)
  0 siblings, 23 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  This series provides support for
initializing the TDX module in the host kernel.  KVM support for TDX is
being developed separately[1].

The code has been tested on couple of TDX-capable machines.  I would
consider it as ready for review. I highly appreciate if anyone can help
to review this series (from high level design to detail implementations).
For Intel reviewers (CC'ed), please help to review, and I would
appreciate Reviewed-by or Acked-by tags if the patches look good to you.

Thanks in advance.

This series is based on Kirill's TDX guest series[2]. The reason is host
side SEAMCALL implementation can share TDCALL's implementation which is
implemented in TDX guest series.

You can find TDX related specs here:
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

You can also find this series in below repo in github:
https://github.com/intel/tdx/tree/host-upstream

Changelog history:

- V2 -> v3:

 - Rebased to latest TDX guest code, which is based on 5.18-rc1.
 - Addressed comments from Isaku.
  - Fixed memory leak and unnecessary function argument in the patch to
    configure the key for the global keyid (patch 17).
  - Enhanced a little bit to the patch to get TDX module and CMR
    information (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slightly improvement on commit message to patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed documentation patch to add TDX host kernel support materials
   to Documentation/x86/tdx.rst together with TDX guest staff, instead
   of a standalone file (patch 21)

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues that are related to finding all RAM memory regions
    based on e820.
  - Minor improvement on comments and commit messages.

V2:
https://lore.kernel.org/lkml/cover.1647167475.git.kai.huang@intel.com/T/
RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  To support TDX, a new CPU mode called
Secure Arbitration Mode (SEAM) is added to Intel processors.

SEAM is an extension to the existing VMX architecture.  It defines a new
VMX root operation (SEAM VMX root) and a new VMX non-root operation (SEAM
VMX non-root).

SEAM VMX root operation is designed to host a CPU-attested, software
module called the 'TDX module' which implements functions to manage
crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
designed to host a CPU-attested, software module called the 'Intel
Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.

Host kernel transits to either the P-SEAMLDR or the TDX module via a new
SEAMCALL instruction.  SEAMCALLs are host-side interface functions
defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
instruction.  They are similar to a hypercall, except they are made by
host kernel to the SEAM software modules.

TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
KeyIDs, which can only be used by software runs in SEAM.  The physical
address bits for encoding TDX private KeyID are treated as reserved bits
when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
private KeyIDs is configured by BIOS.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized using SEAMCALLs defined by TDX architecture.
This series assumes both the P-SEAMLDR and the TDX module are loaded by
BIOS before the kernel boots.

There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
SEAMCALL to detect P-SEAMLDR.  The success of this SEAMCALL means the
P-SEAMLDR is loaded.  The P-SEAMLDR information returned by this
SEAMCALL further tells whether TDX module is loaded.

The TDX module is initialized in multiple steps:

        1) Global initialization;
        2) Logical-CPU scope initialization;
        3) Enumerate the TDX module capabilities;
        4) Configure the TDX module about usable memory ranges and
           global KeyID information;
        5) Package-scope configuration for the global KeyID;
        6) Initialize TDX metadata for usable memory ranges based on 4).

Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
table) logical cpus, otherwise step 4) will fail.  Step 5) requires
calling SEAMCALL on at least one cpu on all packages.

TDX module can also be shut down at any time during module's lifetime, by
calling SEAMCALL on all "BIOS-enabled" logical cpus.

== Design Considerations ==

1. Lazy TDX module initialization on-demand by caller

None of the steps in the TDX module initialization process must be done
during kernel boot.  This series doesn't initialize TDX at boot time, but
instead, provides two functions to allow caller to detect and initialize
TDX on demand:

        if (tdx_detect())
                goto no_tdx;
        if (tdx_init())
                goto no_tdx;

This approach has below pros:

1) Initializing the TDX module requires to reserve ~1/256th system RAM as
metadata.  Enabling TDX on demand allows only to consume this memory when
TDX is truly needed (i.e. when KVM wants to create TD guests).

2) Both detecting and initializing the TDX module require calling
SEAMCALL.  However, SEAMCALL requires CPU being already in VMX operation
(VMXON has been done).  So far, KVM is the only user of TDX, and it
already handles VMXON/VMXOFF.  Therefore, letting KVM to initialize TDX
on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
core-kernel.  Also, in long term, likely a reference based VMXON/VMXOFF
approach is needed since more kernel components will need to handle
VMXON/VMXONFF.

3) It is more flexible to support "TDX module runtime update" (not in
this series).  After updating to the new module at runtime, kernel needs
to go through the initialization process again.  For the new module,
it's possible the metadata allocated for the old module cannot be reused
for the new module, and needs to be re-allocated again.

2. Kernel policy on TDX memory

Host kernel is responsible for choosing which memory regions can be used
as TDX memory, and configuring those memory regions to the TDX module by
using an array of "TD Memory Regions" (TDMR), which is a data structure
defined by TDX architecture.

The first generation of TDX essentially guarantees that all system RAM
memory regions (excluding the memory below 1MB) can be used as TDX
memory.  To avoid having to modify the page allocator to distinguish TDX
and non-TDX allocation, this series chooses to use all system RAM as TDX
memory.

E820 table is used to find all system RAM entries.  Following
e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
types are treated as TDX memory, and contiguous ranges in the same NUMA
node are merged together (similar to memblock_add()) before trimming the
non-page-aligned part.

X86 Legacy PMEMs (E820_TYPE_PRAM) also unconditionally treated as TDX
memory as underneath they are RAM and can be potentially used as TD guest
memory.

Memblock is not used to find all RAM regions as: 1) it is gone after
kernel boots; 2) it doesn't have legacy PMEM.

3. Memory hotplug

The first generation of TDX architecturally doesn't support memory
hotplug.  And the first generation of TDX-capable platforms don't support
physical memory hotplug.  Since it physically cannot happen, this series
doesn't add any check in ACPI memory hotplug code path to disable it.

A special case of memory hotplug is adding NVDIMM as system RAM using
kmem driver.  However the first generation of TDX-capable platforms
cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
happen either.

Another case is admin can use 'memmap' kernel command line to create
legacy PMEMs and use them as TD guest memory, or theoretically, can use
kmem driver to add them as system RAM.  To avoid having to change memory
hotplug code to prevent this from happening, this series always include
legacy PMEMs when constructing TDMRs so they are also TDX memory.

4. CPU hotplug

The first generation of TDX architecturally doesn't support ACPI CPU
hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
first generation of TDX-capable platforms don't support ACPI CPU hotplug
either.  Since this physically cannot happen, this series doesn't add any
check in ACPI CPU hotplug code path to disable it.

Also, only TDX module initialization requires all BIOS-enabled cpus are
online.  After the initialization, any logical cpu can be brought down
and brought up to online again later.  Therefore this series doesn't
change logical CPU hotplug either.

5. TDX interaction with kexec()

If TDX is ever enabled and/or used to run any TD guests, the cachelines
of TDX private memory, including PAMTs, used by TDX module need to be
flushed before transiting to the new kernel otherwise they may silently
corrupt the new kernel.  Similar to SME, this series flushes cache in
stop_this_cpu().

The TDX module can be initialized only once during its lifetime.  The
first generation of TDX doesn't have interface to reset TDX module to
uninitialized state so it can be initialized again.

This implies:

  - If the old kernel fails to initialize TDX, the new kernel cannot
    use TDX too unless the new kernel fixes the bug which leads to
    initialization failure in the old kernel and can resume from where
    the old kernel stops. This requires certain coordination between
    the two kernels.

  - If the old kernel has initialized TDX successfully, the new kernel
    may be able to use TDX if the two kernels have the exactly same
    configurations on the TDX module. It further requires the new kernel
    to reserve the TDX metadata pages (allocated by the old kernel) in
    its page allocator. It also requires coordination between the two
    kernels.  Furthermore, if kexec() is done when there are active TD
    guests running, the new kernel cannot use TDX because it's extremely
    hard for the old kernel to pass all TDX private pages to the new
    kernel.

Given that, this series doesn't support TDX after kexec() (except the
old kernel doesn't attempt to initialize TDX at all).

And this series doesn't shut down TDX module but leaves it open during
kexec().  It is because shutting down TDX module requires CPU being in
VMX operation but there's no guarantee of this during kexec().  Leaving
the TDX module open is not the best case, but it is OK since the new
kernel won't be able to use TDX anyway (therefore TDX module won't run
at all).

[1] https://lore.kernel.org/lkml/772b20e270b3451aea9714260f2c40ddcc4afe80.1646422845.git.isaku.yamahata@intel.com/T/
[2] https://github.com/intel/tdx/tree/guest-upstream


Kai Huang (21):
  x86/virt/tdx: Detect SEAM
  x86/virt/tdx: Detect TDX private KeyIDs
  x86/virt/tdx: Implement the SEAMCALL base function
  x86/virt/tdx: Add skeleton for detecting and initializing TDX on
    demand
  x86/virt/tdx: Detect P-SEAMLDR and TDX module
  x86/virt/tdx: Shut down TDX module in case of error
  x86/virt/tdx: Do TDX module global initialization
  x86/virt/tdx: Do logical-cpu scope TDX module initialization
  x86/virt/tdx: Get information about TDX module and convertible memory
  x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  x86/virt/tdx: Choose to use all system RAM as TDX memory
  x86/virt/tdx: Create TDMRs to cover all system RAM
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Set up reserved areas for all TDMRs
  x86/virt/tdx: Reserve TDX module global KeyID
  x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86: Flush cache of TDX private memory during kexec()
  x86/virt/tdx: Add kernel command line to opt-in TDX host support
  Documentation/x86: Add documentation for TDX host support

 .../admin-guide/kernel-parameters.txt         |    6 +
 Documentation/x86/tdx.rst                     |  326 +++-
 arch/x86/Kconfig                              |   14 +
 arch/x86/Makefile                             |    2 +
 arch/x86/include/asm/tdx.h                    |   15 +
 arch/x86/kernel/cpu/intel.c                   |    3 +
 arch/x86/kernel/process.c                     |   15 +-
 arch/x86/virt/Makefile                        |    2 +
 arch/x86/virt/vmx/Makefile                    |    2 +
 arch/x86/virt/vmx/tdx/Makefile                |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S              |   52 +
 arch/x86/virt/vmx/tdx/tdx.c                   | 1717 +++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h                   |  137 ++
 13 files changed, 2279 insertions(+), 14 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

-- 
2.35.1


^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-18 22:29   ` Sathyanarayanan Kuppuswamy
  2022-04-26 20:21   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs Kai Huang
                   ` (21 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  To support TDX, a new CPU mode called
Secure Arbitration Mode (SEAM) is added to Intel processors.

SEAM is an extension to the VMX architecture to define a new VMX root
operation (SEAM VMX root) and a new VMX non-root operation (SEAM VMX
non-root).  SEAM VMX root operation is designed to host a CPU-attested
software module called the 'TDX module' which implements functions to
manage crypto-protected VMs called Trust Domains (TD).  It is also
designed to additionally host a CPU-attested software module called the
'Intel Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX
module.

Software modules in SEAM VMX root run in a memory region defined by the
SEAM range register (SEAMRR).  So the first thing of detecting Intel TDX
is to detect the validity of SEAMRR.

The presence of SEAMRR is reported via a new SEAMRR bit (15) of the
IA32_MTRRCAP MSR.  The SEAMRR range registers consist of a pair of MSRs:

        IA32_SEAMRR_PHYS_BASE and IA32_SEAMRR_PHYS_MASK

BIOS is expected to configure SEAMRR with the same value across all
cores.  In case of BIOS misconfiguration, detect and compare SEAMRR
on all cpus.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
crypto-protect TD guests.  Part of MKTME KeyIDs are reserved as "TDX
private KeyID" or "TDX KeyIDs" for short.  Similar to detecting SEAMRR,
detecting TDX private KeyIDs also needs to be done on all cpus to detect
any BIOS misconfiguration.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a function to detect all TDX preliminaries
(SEAMRR, TDX private KeyIDs) for a given cpu when it is brought up.  As
the first step, detect the validity of SEAMRR.

Also add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host
kernel support (to distinguish with TDX guest kernel support).

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig               |  12 ++++
 arch/x86/Makefile              |   2 +
 arch/x86/include/asm/tdx.h     |   9 +++
 arch/x86/kernel/cpu/intel.c    |   3 +
 arch/x86/virt/Makefile         |   2 +
 arch/x86/virt/vmx/Makefile     |   2 +
 arch/x86/virt/vmx/tdx/Makefile |   2 +
 arch/x86/virt/vmx/tdx/tdx.c    | 102 +++++++++++++++++++++++++++++++++
 8 files changed, 134 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7021ec725dd3..9113bf09f358 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1967,6 +1967,18 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	default n
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 63d50f65b828..2ca3a2a36dc5 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -234,6 +234,8 @@ head-y += arch/x86/kernel/platform-quirks.o
 
 libs-y  += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI)            += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..1f29813b1646 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,8 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
+
 /*
  * Used to gather the output registers values of the TDCALL and SEAMCALL
  * instructions when requesting services from the TDX module.
@@ -87,5 +89,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+void tdx_detect_cpu(struct cpuinfo_x86 *c);
+#else
+static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..b142a640fb8e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -26,6 +26,7 @@
 #include <asm/resctrl.h>
 #include <asm/numa.h>
 #include <asm/thermal.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <linux/topology.h>
@@ -715,6 +716,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 	if (cpu_has(c, X86_FEATURE_TME))
 		detect_tme(c);
 
+	tdx_detect_cpu(c);
+
 	init_intel_misc_features(c);
 
 	if (tsx_ctrl_state == TSX_CTRL_ENABLE)
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..1e1fcd7d3bd1
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..1bd688684716
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..03f35c75f439
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cpumask.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/cpufeature.h>
+#include <asm/cpufeatures.h>
+#include <asm/tdx.h>
+
+/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
+#define MTRR_CAP_SEAMRR			BIT(15)
+
+/* Core-scope Intel SEAMRR base and mask registers. */
+#define MSR_IA32_SEAMRR_PHYS_BASE	0x00001400
+#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
+
+#define SEAMRR_PHYS_BASE_CONFIGURED	BIT_ULL(3)
+#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
+#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
+
+#define SEAMRR_ENABLED_BITS	\
+	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
+
+/* BIOS must configure SEAMRR registers for all cores consistently */
+static u64 seamrr_base, seamrr_mask;
+
+static bool __seamrr_enabled(void)
+{
+	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
+}
+
+static void detect_seam_bsp(struct cpuinfo_x86 *c)
+{
+	u64 mtrrcap, base, mask;
+
+	/* SEAMRR is reported via MTRRcap */
+	if (!boot_cpu_has(X86_FEATURE_MTRR))
+		return;
+
+	rdmsrl(MSR_MTRRcap, mtrrcap);
+	if (!(mtrrcap & MTRR_CAP_SEAMRR))
+		return;
+
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
+	if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
+		pr_info("SEAMRR base is not configured by BIOS\n");
+		return;
+	}
+
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+	if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
+		pr_info("SEAMRR is not enabled by BIOS\n");
+		return;
+	}
+
+	seamrr_base = base;
+	seamrr_mask = mask;
+}
+
+static void detect_seam_ap(struct cpuinfo_x86 *c)
+{
+	u64 base, mask;
+
+	/*
+	 * Don't bother to detect this AP if SEAMRR is not
+	 * enabled after earlier detections.
+	 */
+	if (!__seamrr_enabled())
+		return;
+
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+
+	if (base == seamrr_base && mask == seamrr_mask)
+		return;
+
+	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
+	/* Mark SEAMRR as disabled. */
+	seamrr_base = 0;
+	seamrr_mask = 0;
+}
+
+static void detect_seam(struct cpuinfo_x86 *c)
+{
+	if (c == &boot_cpu_data)
+		detect_seam_bsp(c);
+	else
+		detect_seam_ap(c);
+}
+
+void tdx_detect_cpu(struct cpuinfo_x86 *c)
+{
+	detect_seam(c);
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
  2022-04-06  4:49 ` [PATCH v3 01/21] x86/virt/tdx: Detect SEAM Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-19  5:39   ` Sathyanarayanan Kuppuswamy
  2022-04-19  5:42   ` Sathyanarayanan Kuppuswamy
  2022-04-06  4:49 ` [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function Kai Huang
                   ` (20 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.

A new MSR (IA32_MKTME_KEYID_PARTITIONING) helps to enumerate how MKTME-
enumerated "KeyID" space is distributed between TDX and legacy MKTME.
KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs'
for short.

The new MSR is per package and BIOS is responsible for partitioning
MKTME KeyIDs and TDX KeyIDs consistently among all packages.

Detect TDX private KeyIDs as a preparation to initialize TDX.  Similar
to detecting SEAMRR, detect on all cpus to detect any potential BIOS
misconfiguration among packages.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 72 +++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 03f35c75f439..ba2210001ea8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -29,9 +29,28 @@
 #define SEAMRR_ENABLED_BITS	\
 	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
 
+/*
+ * Intel Trusted Domain CPU Architecture Extension spec:
+ *
+ * IA32_MKTME_KEYID_PARTIONING:
+ *
+ *   Bit [31:0]: number of MKTME KeyIDs.
+ *   Bit [63:32]: number of TDX private KeyIDs.
+ *
+ * TDX private KeyIDs start after the last MKTME KeyID.
+ */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
+#define TDX_KEYID_START(_keyid_part)	\
+		((u32)(((_keyid_part) & 0xffffffffull) + 1))
+#define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
+
 /* BIOS must configure SEAMRR registers for all cores consistently */
 static u64 seamrr_base, seamrr_mask;
 
+static u32 tdx_keyid_start;
+static u32 tdx_keyid_num;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -96,7 +115,60 @@ static void detect_seam(struct cpuinfo_x86 *c)
 		detect_seam_ap(c);
 }
 
+static void detect_tdx_keyids_bsp(struct cpuinfo_x86 *c)
+{
+	u64 keyid_part;
+
+	/* TDX is built on MKTME, which is based on TME */
+	if (!boot_cpu_has(X86_FEATURE_TME))
+		return;
+
+	if (rdmsrl_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &keyid_part))
+		return;
+
+	/* If MSR value is 0, TDX is not enabled by BIOS. */
+	if (!keyid_part)
+		return;
+
+	tdx_keyid_num = TDX_KEYID_NUM(keyid_part);
+	if (!tdx_keyid_num)
+		return;
+
+	tdx_keyid_start = TDX_KEYID_START(keyid_part);
+}
+
+static void detect_tdx_keyids_ap(struct cpuinfo_x86 *c)
+{
+	u64 keyid_part;
+
+	/*
+	 * Don't bother to detect this AP if TDX KeyIDs are
+	 * not detected or cleared after earlier detections.
+	 */
+	if (!tdx_keyid_num)
+		return;
+
+	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
+
+	if ((tdx_keyid_start == TDX_KEYID_START(keyid_part)) &&
+			(tdx_keyid_num == TDX_KEYID_NUM(keyid_part)))
+		return;
+
+	pr_err("Inconsistent TDX KeyID configuration among packages by BIOS\n");
+	tdx_keyid_start = 0;
+	tdx_keyid_num = 0;
+}
+
+static void detect_tdx_keyids(struct cpuinfo_x86 *c)
+{
+	if (c == &boot_cpu_data)
+		detect_tdx_keyids_bsp(c);
+	else
+		detect_tdx_keyids_ap(c);
+}
+
 void tdx_detect_cpu(struct cpuinfo_x86 *c)
 {
 	detect_seam(c);
+	detect_tdx_keyids(c);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
  2022-04-06  4:49 ` [PATCH v3 01/21] x86/virt/tdx: Detect SEAM Kai Huang
  2022-04-06  4:49 ` [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-19 14:07   ` Sathyanarayanan Kuppuswamy
  2022-04-26 20:37   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand Kai Huang
                   ` (19 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are isolated from legacy VMX root
and VMX non-root mode.

A CPU-attested software module (called the 'TDX module') runs in SEAM
VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
SEAM VMX root is also used to host another CPU-attested software module
(called the 'P-SEAMLDR') to load and update the TDX module.

Host kernel transits to either the P-SEAMLDR or the TDX module via the
new SEAMCALL instruction.  SEAMCALL leaf functions are host-side
interface functions defined by the P-SEAMLDR and the TDX module around
the new SEAMCALL instruction.  They are similar to a hypercall, except
they are made by host kernel to the SEAM software.

SEAMCALL leaf functions use an ABI different from the x86-64 system-v
ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
%rax is used to carry both the SEAMCALL leaf function number (input) and
the completion status code (output).  Additional GPRs (%rcx, %rdx,
%r8->%r11) may be further used as both input and output operands in
individual leaf functions.

Implement a C function __seamcall() to do SEAMCALL leaf functions using
the assembly macro used by __tdx_module_call() (the implementation of
TDCALL leaf functions).  The only exception not covered here is TDENTER
leaf function which takes all GPRs and XMM0-XMM15 as both input and
output.  The caller of TDENTER should implement its own logic to call
TDENTER directly instead of using this function.

SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
root, and it can fail with VMfailInvalid, for instance, when the SEAM
software module is not loaded.  The C function __seamcall() returns
TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
code of SEAMCALLs, to uniquely represent this case.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 11 +++++++
 3 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 1bd688684716..fd577619620e 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..327961b2dd5a
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall()  - Host-side interface functions to SEAM software module
+ *		   (the P-SEAMLDR or the TDX module)
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
+ * the SEAMCALL.  Additional output operands are saved in @out (if it is
+ * provided by caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *			 stored temporarily in R12 (not
+ *			 used by the P-SEAMLDR or the TDX
+ *			 module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	ret
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..9d5b6f554c20
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/types.h>
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+
+#endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (2 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-19 14:53   ` Sathyanarayanan Kuppuswamy
  2022-04-26 20:53   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module Kai Huang
                   ` (18 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

The TDX module is essentially a CPU-attested software module running
in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
host and certain physical attacks.  The TDX module implements the
functions to build, tear down and start execution of the protected VMs
called Trusted Domains (TD).  Before the TDX module can be used to
create and run TD guests, it must be loaded into the SEAM Range Register
(SEAMRR) and properly initialized.  The TDX module is expected to be
loaded by BIOS before booting to the kernel, and the kernel is expected
to detect and initialize it, using the SEAMCALLs defined by TDX
architecture.

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
on-demand approach to initialize TDX until there is a real need (e.g
when requested by KVM).  This avoids consuming the memory that must be
allocated by kernel and given to the TDX module as metadata (~1/256th of
the TDX-usable memory), and also saves the time of initializing the TDX
module (and the metadata) when TDX is not used at all.  Initializing the
TDX module at runtime on-demand also is more flexible to support TDX
module runtime updating in the future (after updating the TDX module, it
needs to be initialized again).

Introduce two placeholders tdx_detect() and tdx_init() to detect and
initialize the TDX module on demand, with a state machine introduced to
orchestrate the entire process (in case of multiple callers).

To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs.  The
TDX module is reported as not loaded if either SEAMRR is not enabled, or
there are no enough TDX private KeyIDs to create any TD guest.  The TDX
module itself requires one global TDX private KeyID to crypto protect
its metadata.

And tdx_init() is currently empty.  The TDX module will be initialized
in multi-steps defined by the TDX architecture:

  1) Global initialization;
  2) Logical-CPU scope initialization;
  3) Enumerate the TDX module capabilities and platform configuration;
  4) Configure the TDX module about usable memory ranges and global
     KeyID information;
  5) Package-scope configuration for the global KeyID;
  6) Initialize usable memory ranges based on 4).

The TDX module can also be shut down at any time during its lifetime.
In case of any error during the initialization process, shut down the
module.  It's pointless to leave the module in any intermediate state
during the initialization.

SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
operation (VMXON has been done), otherwise it generates #UD.  So far
only KVM handles VMXON/VMXOFF.  Choose to not handle VMXON/VMXOFF in
tdx_detect() and tdx_init() but depend on the caller to guarantee that,
since so far KVM is the only user of TDX.  In the long term, more kernel
components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
module runtime update), so a reference-based approach to do VMXON/VMXOFF
is likely needed.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1f29813b1646..c8af2ba6bb8a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 void tdx_detect_cpu(struct cpuinfo_x86 *c);
+int tdx_detect(void);
+int tdx_init(void);
 #else
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
+static inline int tdx_detect(void) { return -ENODEV; }
+static inline int tdx_init(void) { return -ENODEV; }
 #endif /* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ba2210001ea8..53093d4ad458 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -9,6 +9,8 @@
 
 #include <linux/types.h>
 #include <linux/cpumask.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -45,12 +47,33 @@
 		((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+/*
+ * TDX module status during initialization
+ */
+enum tdx_module_status_t {
+	/* TDX module status is unknown */
+	TDX_MODULE_UNKNOWN,
+	/* TDX module is not loaded */
+	TDX_MODULE_NONE,
+	/* TDX module is loaded, but not initialized */
+	TDX_MODULE_LOADED,
+	/* TDX module is fully initialized */
+	TDX_MODULE_INITIALIZED,
+	/* TDX module is shutdown due to error during initialization */
+	TDX_MODULE_SHUTDOWN,
+};
+
 /* BIOS must configure SEAMRR registers for all cores consistently */
 static u64 seamrr_base, seamrr_mask;
 
 static u32 tdx_keyid_start;
 static u32 tdx_keyid_num;
 
+static enum tdx_module_status_t tdx_module_status;
+
+/* Prevent concurrent attempts on TDX detection and initialization */
+static DEFINE_MUTEX(tdx_module_lock);
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
 	detect_seam(c);
 	detect_tdx_keyids(c);
 }
+
+static bool seamrr_enabled(void)
+{
+	/*
+	 * To detect any BIOS misconfiguration among cores, all logical
+	 * cpus must have been brought up at least once.  This is true
+	 * unless 'maxcpus' kernel command line is used to limit the
+	 * number of cpus to be brought up during boot time.  However
+	 * 'maxcpus' is basically an invalid operation mode due to the
+	 * MCE broadcast problem, and it should not be used on a TDX
+	 * capable machine.  Just do paranoid check here and do not
+	 * report SEAMRR as enabled in this case.
+	 */
+	if (!cpumask_equal(&cpus_booted_once_mask,
+					cpu_present_mask))
+		return false;
+
+	return __seamrr_enabled();
+}
+
+static bool tdx_keyid_sufficient(void)
+{
+	if (!cpumask_equal(&cpus_booted_once_mask,
+					cpu_present_mask))
+		return false;
+
+	/*
+	 * TDX requires at least two KeyIDs: one global KeyID to
+	 * protect the metadata of the TDX module and one or more
+	 * KeyIDs to run TD guests.
+	 */
+	return tdx_keyid_num >= 2;
+}
+
+static int __tdx_detect(void)
+{
+	/* The TDX module is not loaded if SEAMRR is disabled */
+	if (!seamrr_enabled()) {
+		pr_info("SEAMRR not enabled.\n");
+		goto no_tdx_module;
+	}
+
+	/*
+	 * Also do not report the TDX module as loaded if there's
+	 * no enough TDX private KeyIDs to run any TD guests.
+	 */
+	if (!tdx_keyid_sufficient()) {
+		pr_info("Number of TDX private KeyIDs too small: %u.\n",
+				tdx_keyid_num);
+		goto no_tdx_module;
+	}
+
+	/* Return -ENODEV until the TDX module is detected */
+no_tdx_module:
+	tdx_module_status = TDX_MODULE_NONE;
+	return -ENODEV;
+}
+
+static int init_tdx_module(void)
+{
+	/*
+	 * Return -EFAULT until all steps of TDX module
+	 * initialization are done.
+	 */
+	return -EFAULT;
+}
+
+static void shutdown_tdx_module(void)
+{
+	/* TODO: Shut down the TDX module */
+	tdx_module_status = TDX_MODULE_SHUTDOWN;
+}
+
+static int __tdx_init(void)
+{
+	int ret;
+
+	/*
+	 * Logical-cpu scope initialization requires calling one SEAMCALL
+	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
+	 * module also has such requirement.  Further more, configuring
+	 * the key of the global KeyID requires calling one SEAMCALL for
+	 * each package.  For simplicity, disable CPU hotplug in the whole
+	 * initialization process.
+	 *
+	 * It's perhaps better to check whether all BIOS-enabled cpus are
+	 * online before starting initializing, and return early if not.
+	 * But none of 'possible', 'present' and 'online' CPU masks
+	 * represents BIOS-enabled cpus.  For example, 'possible' mask is
+	 * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
+	 * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
+	 * online.
+	 */
+	cpus_read_lock();
+
+	ret = init_tdx_module();
+
+	/*
+	 * Shut down the TDX module in case of any error during the
+	 * initialization process.  It's meaningless to leave the TDX
+	 * module in any middle state of the initialization process.
+	 */
+	if (ret)
+		shutdown_tdx_module();
+
+	cpus_read_unlock();
+
+	return ret;
+}
+
+/**
+ * tdx_detect - Detect whether the TDX module has been loaded
+ *
+ * Detect whether the TDX module has been loaded and ready for
+ * initialization.  Only call this function when all cpus are
+ * already in VMX operation.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * -0:	The TDX module has been loaded and ready for
+ *		initialization.
+ * * -ENODEV:	The TDX module is not loaded.
+ * * -EPERM:	CPU is not in VMX operation.
+ * * -EFAULT:	Other internal fatal errors.
+ */
+int tdx_detect(void)
+{
+	int ret;
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_detect();
+		break;
+	case TDX_MODULE_NONE:
+		ret = -ENODEV;
+		break;
+	case TDX_MODULE_LOADED:
+	case TDX_MODULE_INITIALIZED:
+		ret = 0;
+		break;
+	case TDX_MODULE_SHUTDOWN:
+		ret = -EFAULT;
+		break;
+	default:
+		WARN_ON(1);
+		ret = -EFAULT;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_detect);
+
+/**
+ * tdx_init - Initialize the TDX module
+ *
+ * Initialize the TDX module to make it ready to run TD guests.  This
+ * function should be called after tdx_detect() returns successful.
+ * Only call this function when all cpus are online and are in VMX
+ * operation.  CPU hotplug is temporarily disabled internally.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * -0:	The TDX module has been successfully initialized.
+ * * -ENODEV:	The TDX module is not loaded.
+ * * -EPERM:	The CPU which does SEAMCALL is not in VMX operation.
+ * * -EFAULT:	Other internal fatal errors.
+ */
+int tdx_init(void)
+{
+	int ret;
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_NONE:
+		ret = -ENODEV;
+		break;
+	case TDX_MODULE_LOADED:
+		ret = __tdx_init();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		ret = 0;
+		break;
+	default:
+		ret = -EFAULT;
+		break;
+	}
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_init);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (3 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-26 20:56   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

The P-SEAMLDR (persistent SEAM loader) is the first software module that
runs in SEAM VMX root, responsible for loading and updating the TDX
module.  Both the P-SEAMLDR and the TDX module are expected to be loaded
before host kernel boots.

There is no CPUID or MSR to detect whether the P-SEAMLDR or the TDX
module has been loaded.  SEAMCALL instruction fails with VMfailInvalid
when the target SEAM software module is not loaded, so SEAMCALL can be
used to detect whether the P-SEAMLDR and the TDX module are loaded.

Detect the P-SEAMLDR and the TDX module by calling SEAMLDR.INFO SEAMCALL
to get the P-SEAMLDR information.  If the SEAMCALL succeeds, the
P-SEAMLDR information further tells whether the TDX module is loaded or
not.

Also add a wrapper of __seamcall() to make SEAMCALL to the P-SEAMLDR and
the TDX module with additional defensive check on SEAMRR and CR4.VMXE,
since both detecting and initializing TDX module require the caller of
TDX to handle VMXON.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 175 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  31 +++++++
 2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 53093d4ad458..674867bccc14 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,7 +15,9 @@
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
 #include <asm/cpufeatures.h>
+#include <asm/virtext.h>
 #include <asm/tdx.h>
+#include "tdx.h"
 
 /* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
 #define MTRR_CAP_SEAMRR			BIT(15)
@@ -74,6 +76,8 @@ static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX detection and initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+static struct p_seamldr_info p_seamldr_info;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -229,6 +233,160 @@ static bool tdx_keyid_sufficient(void)
 	return tdx_keyid_num >= 2;
 }
 
+/*
+ * All error codes of both the P-SEAMLDR and the TDX module SEAMCALLs
+ * have bit 63 set if SEAMCALL fails.
+ */
+#define SEAMCALL_LEAF_ERROR(_ret)	((_ret) & BIT_ULL(63))
+
+/**
+ * seamcall - make SEAMCALL to the P-SEAMLDR or the TDX module with
+ *	      additional check on SEAMRR and CR4.VMXE
+ *
+ * @fn:			SEAMCALL leaf number.
+ * @rcx:		Input operand RCX.
+ * @rdx:		Input operand RDX.
+ * @r8:			Input operand R8.
+ * @r9:			Input operand R9.
+ * @seamcall_ret:	SEAMCALL completion status (can be NULL).
+ * @out:		Additional output operands (can be NULL).
+ *
+ * Wrapper of __seamcall() to make SEAMCALL to the P-SEAMLDR or the TDX
+ * module with additional defensive check on SEAMRR and CR4.VMXE.  Caller
+ * to make sure SEAMRR is enabled and CPU is already in VMX operation
+ * before calling this function.
+ *
+ * Unlike __seamcall(), it returns kernel error code instead of SEAMCALL
+ * completion status, which is returned via @seamcall_ret if desired.
+ *
+ * Return:
+ *
+ * * -ENODEV:	SEAMCALL failed with VMfailInvalid, or SEAMRR is not enabled.
+ * * -EPERM:	CR4.VMXE is not enabled
+ * * -EFAULT:	SEAMCALL failed
+ * * -0:	SEAMCALL succeeded
+ */
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		    u64 *seamcall_ret, struct tdx_module_output *out)
+{
+	u64 ret;
+
+	if (WARN_ON_ONCE(!seamrr_enabled()))
+		return -ENODEV;
+
+	/*
+	 * SEAMCALL instruction requires CPU being already in VMX
+	 * operation (VMXON has been done), otherwise it causes #UD.
+	 * Sanity check whether CR4.VMXE has been enabled.
+	 *
+	 * Note VMX being enabled in CR4 doesn't mean CPU is already
+	 * in VMX operation, but unfortunately there's no way to do
+	 * such check.  However in practice enabling CR4.VMXE and
+	 * doing VMXON are done together (for now) so in practice it
+	 * checks whether VMXON has been done.
+	 *
+	 * Preemption is disabled during the CR4.VMXE check and the
+	 * actual SEAMCALL so VMX doesn't get disabled by other threads
+	 * due to scheduling.
+	 */
+	preempt_disable();
+	if (WARN_ON_ONCE(!cpu_vmx_enabled())) {
+		preempt_enable_no_resched();
+		return -EPERM;
+	}
+
+	ret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+	preempt_enable_no_resched();
+
+	/*
+	 * Convert SEAMCALL error code to kernel error code:
+	 *  - -ENODEV:	VMfailInvalid
+	 *  - -EFAULT:	SEAMCALL failed
+	 *  - 0:	SEAMCALL was successful
+	 */
+	if (ret == TDX_SEAMCALL_VMFAILINVALID)
+		return -ENODEV;
+
+	/* Save the completion status if caller wants to use it */
+	if (seamcall_ret)
+		*seamcall_ret = ret;
+
+	/*
+	 * TDX module SEAMCALLs may also return non-zero completion
+	 * status codes but w/o bit 63 set.  Those codes are treated
+	 * as additional information/warning while the SEAMCALL is
+	 * treated as completed successfully.  Return 0 in this case.
+	 * Caller can use @seamcall_ret to get the additional code
+	 * when it is desired.
+	 */
+	if (SEAMCALL_LEAF_ERROR(ret)) {
+		pr_err("SEAMCALL leaf %llu failed: 0x%llx\n", fn, ret);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static inline bool p_seamldr_ready(void)
+{
+	return !!p_seamldr_info.p_seamldr_ready;
+}
+
+static inline bool tdx_module_ready(void)
+{
+	/*
+	 * SEAMLDR_INFO.SEAM_READY indicates whether TDX module
+	 * is (loaded and) ready for SEAMCALL.
+	 */
+	return p_seamldr_ready() && !!p_seamldr_info.seam_ready;
+}
+
+/*
+ * Detect whether the P-SEAMLDR has been loaded by calling SEAMLDR.INFO
+ * SEAMCALL to get the P-SEAMLDR information, which further tells whether
+ * the TDX module has been loaded and ready for SEAMCALL.  Caller to make
+ * sure only calling this function when CPU is already in VMX operation.
+ */
+static int detect_p_seamldr(void)
+{
+	int ret;
+
+	/*
+	 * SEAMCALL fails with VMfailInvalid when SEAM software is not
+	 * loaded, in which case seamcall() returns -ENODEV.  Use this
+	 * to detect the P-SEAMLDR.
+	 *
+	 * Note the P-SEAMLDR SEAMCALL also fails with VMfailInvalid when
+	 * the P-SEAMLDR is already busy with another SEAMCALL.  But this
+	 * won't happen here as this function is only called once.
+	 */
+	ret = seamcall(P_SEAMCALL_SEAMLDR_INFO, __pa(&p_seamldr_info),
+			0, 0, 0, NULL, NULL);
+	if (ret) {
+		if (ret == -ENODEV)
+			pr_info("P-SEAMLDR is not loaded.\n");
+		else
+			pr_info("Failed to detect P-SEAMLDR.\n");
+
+		return ret;
+	}
+
+	/*
+	 * If SEAMLDR.INFO was successful, it must be ready for SEAMCALL.
+	 * Otherwise it's either kernel or firmware bug.
+	 */
+	if (WARN_ON_ONCE(!p_seamldr_ready()))
+		return -ENODEV;
+
+	pr_info("P-SEAMLDR: version 0x%x, vendor_id: 0x%x, build_date: %u, build_num %u, major %u, minor %u\n",
+		p_seamldr_info.version, p_seamldr_info.vendor_id,
+		p_seamldr_info.build_date, p_seamldr_info.build_num,
+		p_seamldr_info.major, p_seamldr_info.minor);
+
+	return 0;
+}
+
 static int __tdx_detect(void)
 {
 	/* The TDX module is not loaded if SEAMRR is disabled */
@@ -247,7 +405,22 @@ static int __tdx_detect(void)
 		goto no_tdx_module;
 	}
 
-	/* Return -ENODEV until the TDX module is detected */
+	/*
+	 * For simplicity any error during detect_p_seamldr() marks
+	 * TDX module as not loaded.
+	 */
+	if (detect_p_seamldr())
+		goto no_tdx_module;
+
+	if (!tdx_module_ready()) {
+		pr_info("TDX module is not loaded.\n");
+		goto no_tdx_module;
+	}
+
+	pr_info("TDX module detected.\n");
+	tdx_module_status = TDX_MODULE_LOADED;
+	return 0;
+
 no_tdx_module:
 	tdx_module_status = TDX_MODULE_NONE;
 	return -ENODEV;
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9d5b6f554c20..6990c93198b3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -3,6 +3,37 @@
 #define _X86_VIRT_TDX_H
 
 #include <linux/types.h>
+#include <linux/compiler.h>
+
+/*
+ * TDX architectural data structures
+ */
+
+#define P_SEAMLDR_INFO_ALIGNMENT	256
+
+struct p_seamldr_info {
+	u32	version;
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor;
+	u16	major;
+	u8	reserved0[2];
+	u32	acm_x2apicid;
+	u8	reserved1[4];
+	u8	seaminfo[128];
+	u8	seam_ready;
+	u8	seam_debug;
+	u8	p_seamldr_ready;
+	u8	reserved2[88];
+} __packed __aligned(P_SEAMLDR_INFO_ALIGNMENT);
+
+/*
+ * P-SEAMLDR SEAMCALL leaf function
+ */
+#define P_SEAMLDR_SEAMCALL_BASE		BIT_ULL(63)
+#define P_SEAMCALL_SEAMLDR_INFO		(P_SEAMLDR_SEAMCALL_BASE | 0x0)
 
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (4 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-23 15:39   ` Sathyanarayanan Kuppuswamy
  2022-04-26 20:59   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization Kai Huang
                   ` (16 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX supports shutting down the TDX module at any time during its
lifetime.  After TDX module is shut down, no further SEAMCALL can be
made on any logical cpu.

Shut down the TDX module in case of any error happened during the
initialization process.  It's pointless to leave the TDX module in some
middle state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled cpus, and the SEMACALL can run concurrently on different
cpus.  Implement a mechanism to run SEAMCALL concurrently on all online
cpus.  Logical-cpu scope initialization will use it too.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 674867bccc14..faf8355965a5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -11,6 +11,8 @@
 #include <linux/cpumask.h>
 #include <linux/mutex.h>
 #include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	return 0;
 }
 
+/* Data structure to make SEAMCALL on multiple CPUs concurrently */
+struct seamcall_ctx {
+	u64 fn;
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	atomic_t err;
+	u64 seamcall_ret;
+	struct tdx_module_output out;
+};
+
+static void seamcall_smp_call_function(void *data)
+{
+	struct seamcall_ctx *sc = data;
+	int ret;
+
+	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
+			&sc->seamcall_ret, &sc->out);
+	if (ret)
+		atomic_set(&sc->err, ret);
+}
+
+/*
+ * Call the SEAMCALL on all online cpus concurrently.
+ * Return error if SEAMCALL fails on any cpu.
+ */
+static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+	on_each_cpu(seamcall_smp_call_function, sc, true);
+	return atomic_read(&sc->err);
+}
+
 static inline bool p_seamldr_ready(void)
 {
 	return !!p_seamldr_info.p_seamldr_ready;
@@ -437,7 +472,10 @@ static int init_tdx_module(void)
 
 static void shutdown_tdx_module(void)
 {
-	/* TODO: Shut down the TDX module */
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+	seamcall_on_each_cpu(&sc);
+
 	tdx_module_status = TDX_MODULE_SHUTDOWN;
 }
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 6990c93198b3..dcc1f6dfe378 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -35,6 +35,11 @@ struct p_seamldr_info {
 #define P_SEAMLDR_SEAMCALL_BASE		BIT_ULL(63)
 #define P_SEAMCALL_SEAMLDR_INFO		(P_SEAMLDR_SEAMCALL_BASE | 0x0)
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN	44
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (5 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-20 22:27   ` Sathyanarayanan Kuppuswamy
  2022-04-06  4:49 ` [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Do the TDX module global initialization which requires calling
TDH.SYS.INIT once on any logical cpu.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index faf8355965a5..5c2f3a30be2f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -463,11 +463,20 @@ static int __tdx_detect(void)
 
 static int init_tdx_module(void)
 {
+	int ret;
+
+	/* TDX module global initialization */
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
 	 */
-	return -EFAULT;
+	ret = -EFAULT;
+out:
+	return ret;
 }
 
 static void shutdown_tdx_module(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index dcc1f6dfe378..f0983b1936d8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -38,6 +38,7 @@ struct p_seamldr_info {
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INIT		33
 #define TDH_SYS_LP_SHUTDOWN	44
 
 struct tdx_module_output;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (6 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-24  1:27   ` Sathyanarayanan Kuppuswamy
  2022-04-06  4:49 ` [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory Kai Huang
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.
TDH.SYS.LP.INIT can be called concurrently on all cpus.

Following global initialization, do the logical-cpu scope initialization
by calling TDH.SYS.LP.INIT on all online cpus.  Whether all BIOS-enabled
cpus are online is not checked here for simplicity.  The caller of
tdx_init() should guarantee all BIOS-enabled cpus are online.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5c2f3a30be2f..ef2718423f0f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -461,6 +461,13 @@ static int __tdx_detect(void)
 	return -ENODEV;
 }
 
+static int tdx_module_init_cpus(void)
+{
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
+
+	return seamcall_on_each_cpu(&sc);
+}
+
 static int init_tdx_module(void)
 {
 	int ret;
@@ -470,6 +477,11 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Logical-cpu scope initialization */
+	ret = tdx_module_init_cpus();
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f0983b1936d8..b8cfdd6e12f3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -39,6 +39,7 @@ struct p_seamldr_info {
  * TDX module SEAMCALL leaf functions
  */
 #define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
 
 struct tdx_module_output;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (7 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-25  2:58   ` Sathyanarayanan Kuppuswamy
  2022-04-27 22:15   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory Kai Huang
                   ` (13 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges, along with TDX module information, is available to the kernel by
querying the TDX module via TDH.SYS.INFO SEAMCALL.

Host kernel can choose whether or not to use all convertible memory
regions as TDX memory.  Before TDX module is ready to create any TD
guests, all TDX memory regions that host kernel intends to use must be
configured to the TDX module, using specific data structures defined by
TDX architecture.  Constructing those structures requires information of
both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
to get this information as preparation to construct those structures.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
 2 files changed, 192 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ef2718423f0f..482e6d858181 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
 
 static struct p_seamldr_info p_seamldr_info;
 
+/* Base address of CMR array needs to be 512 bytes aligned. */
+static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
+static int tdx_cmr_num;
+static struct tdsysinfo_struct tdx_sysinfo;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -468,6 +473,127 @@ static int tdx_module_init_cpus(void)
 	return seamcall_on_each_cpu(&sc);
 }
 
+static inline bool cmr_valid(struct cmr_info *cmr)
+{
+	return !!cmr->size;
+}
+
+static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
+		       const char *name)
+{
+	int i;
+
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		pr_info("%s : [0x%llx, 0x%llx)\n", name,
+				cmr->base, cmr->base + cmr->size);
+	}
+}
+
+static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
+{
+	int i, j;
+
+	/*
+	 * Intel TDX module spec, 20.7.3 CMR_INFO:
+	 *
+	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
+	 *   array of CMR_INFO entries. The CMRs are sorted from the
+	 *   lowest base address to the highest base address, and they
+	 *   are non-overlapping.
+	 *
+	 * This implies that BIOS may generate invalid empty entries
+	 * if total CMRs are less than 32.  Skip them manually.
+	 */
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+		struct cmr_info *prev_cmr = NULL;
+
+		/* Skip further invalid CMRs */
+		if (!cmr_valid(cmr))
+			break;
+
+		if (i > 0)
+			prev_cmr = &cmr_array[i - 1];
+
+		/*
+		 * It is a TDX firmware bug if CMRs are not
+		 * in address ascending order.
+		 */
+		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
+					cmr->base)) {
+			pr_err("Firmware bug: CMRs not in address ascending order.\n");
+			return -EFAULT;
+		}
+	}
+
+	/*
+	 * Also a sane BIOS should never generate invalid CMR(s) between
+	 * two valid CMRs.  Sanity check this and simply return error in
+	 * this case.
+	 *
+	 * By reaching here @i is the index of the first invalid CMR (or
+	 * cmr_num).  Starting with next entry of @i since it has already
+	 * been checked.
+	 */
+	for (j = i + 1; j < cmr_num; j++)
+		if (cmr_valid(&cmr_array[j])) {
+			pr_err("Firmware bug: invalid CMR(s) among valid CMRs.\n");
+			return -EFAULT;
+		}
+
+	/*
+	 * Trim all tail invalid empty CMRs.  BIOS should generate at
+	 * least one valid CMR, otherwise it's a TDX firmware bug.
+	 */
+	tdx_cmr_num = i;
+	if (!tdx_cmr_num) {
+		pr_err("Firmware bug: No valid CMR.\n");
+		return -EFAULT;
+	}
+
+	/* Print kernel sanitized CMRs */
+	print_cmrs(tdx_cmr_array, tdx_cmr_num, "Kernel-sanitized-CMR");
+
+	return 0;
+}
+
+static int tdx_get_sysinfo(void)
+{
+	struct tdx_module_output out;
+	u64 tdsysinfo_sz, cmr_num;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
+
+	ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
+			__pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
+	if (ret)
+		return ret;
+
+	/*
+	 * If TDH.SYS.CONFIG succeeds, RDX contains the actual bytes
+	 * written to @tdx_sysinfo and R9 contains the actual entries
+	 * written to @tdx_cmr_array.  Sanity check them.
+	 */
+	tdsysinfo_sz = out.rdx;
+	cmr_num = out.r9;
+	if (WARN_ON_ONCE((tdsysinfo_sz > sizeof(tdx_sysinfo)) || !tdsysinfo_sz ||
+				(cmr_num > MAX_CMRS) || !cmr_num))
+		return -EFAULT;
+
+	pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+		tdx_sysinfo.vendor_id, tdx_sysinfo.major_version,
+		tdx_sysinfo.minor_version, tdx_sysinfo.build_date,
+		tdx_sysinfo.build_num);
+
+	/* Print BIOS provided CMRs */
+	print_cmrs(tdx_cmr_array, cmr_num, "BIOS-CMR");
+
+	return sanitize_cmrs(tdx_cmr_array, cmr_num);
+}
+
 static int init_tdx_module(void)
 {
 	int ret;
@@ -482,6 +608,11 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Get TDX module information and CMRs */
+	ret = tdx_get_sysinfo();
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index b8cfdd6e12f3..2f21c45df6ac 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -29,6 +29,66 @@ struct p_seamldr_info {
 	u8	reserved2[88];
 } __packed __aligned(P_SEAMLDR_INFO_ALIGNMENT);
 
+struct cmr_info {
+	u64	base;
+	u64	size;
+} __packed;
+
+#define MAX_CMRS			32
+#define CMR_INFO_ARRAY_ALIGNMENT	512
+
+struct cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
+	 * is 1024B defined by TDX architecture.  Use a union with
+	 * specific padding to make 'sizeof(struct tdsysinfo_struct)'
+	 * equal to 1024.
+	 */
+	union {
+		struct cpuid_config	cpuid_configs[0];
+		u8			reserved5[892];
+	};
+} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
+
 /*
  * P-SEAMLDR SEAMCALL leaf function
  */
@@ -38,6 +98,7 @@ struct p_seamldr_info {
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (8 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-20 20:48   ` Isaku Yamahata
  2022-04-27 22:24   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 11/21] x86/virt/tdx: Choose to use " Kai Huang
                   ` (12 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges, along with TDX module information, is available to the kernel by
querying the TDX module.

In order to provide crypto protection to TD guests, the TDX architecture
also needs additional metadata to record things like which TD guest
"owns" a given page of memory.  This metadata essentially serves as the
'struct page' for the TDX module.  The space for this metadata is not
reserved by the hardware upfront and must be allocated by the kernel
and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be put into.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.

As one step of initializing the TDX module, the memory regions that TDX
module can use must be configured to the TDX module via an array of
TDMRs.

Constructing TDMRs to build the TDX memory consists below steps:

1) Create TDMRs to cover all memory regions that TDX module can use;
2) Allocate and set up PAMT for each TDMR;
3) Set up reserved areas for each TDMR.

Add a placeholder right after getting TDX module and CMRs information to
construct TDMRs to do the above steps, as the preparation to configure
the TDX module.  Always free TDMRs at the end of the initialization (no
matter successful or not), as TDMRs are only used during the
initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 47 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 482e6d858181..ec27350d53c1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,7 @@
 #include <linux/cpu.h>
 #include <linux/smp.h>
 #include <linux/atomic.h>
+#include <linux/slab.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -594,8 +595,29 @@ static int tdx_get_sysinfo(void)
 	return sanitize_cmrs(tdx_cmr_array, cmr_num);
 }
 
+static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		struct tdmr_info *tdmr = tdmr_array[i];
+
+		/* kfree() works with NULL */
+		kfree(tdmr);
+		tdmr_array[i] = NULL;
+	}
+}
+
+static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
+{
+	/* Return -EFAULT until constructing TDMRs is done */
+	return -EFAULT;
+}
+
 static int init_tdx_module(void)
 {
+	struct tdmr_info **tdmr_array;
+	int tdmr_num;
 	int ret;
 
 	/* TDX module global initialization */
@@ -613,11 +635,36 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * Prepare enough space to hold pointers of TDMRs (TDMR_INFO).
+	 * TDX requires TDMR_INFO being 512 aligned.  Each TDMR is
+	 * allocated individually within construct_tdmrs() to meet
+	 * this requirement.
+	 */
+	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
+			GFP_KERNEL);
+	if (!tdmr_array) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Construct TDMRs to build TDX memory */
+	ret = construct_tdmrs(tdmr_array, &tdmr_num);
+	if (ret)
+		goto out_free_tdmrs;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
 	 */
 	ret = -EFAULT;
+out_free_tdmrs:
+	/*
+	 * TDMRs are only used during initializing TDX module.  Always
+	 * free them no matter the initialization was successful or not.
+	 */
+	free_tdmrs(tdmr_array, tdmr_num);
+	kfree(tdmr_array);
 out:
 	return ret;
 }
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 2f21c45df6ac..05bf9fe6bd00 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -89,6 +89,29 @@ struct tdsysinfo_struct {
 	};
 } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
 
+struct tdmr_reserved_area {
+	u64 offset;
+	u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT	512
+
+struct tdmr_info {
+	u64 base;
+	u64 size;
+	u64 pamt_1g_base;
+	u64 pamt_1g_size;
+	u64 pamt_2m_base;
+	u64 pamt_2m_size;
+	u64 pamt_4k_base;
+	u64 pamt_4k_size;
+	/*
+	 * Actual number of reserved areas depends on
+	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+	 */
+	struct tdmr_reserved_area reserved_areas[0];
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
 /*
  * P-SEAMLDR SEAMCALL leaf function
  */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 11/21] x86/virt/tdx: Choose to use all system RAM as TDX memory
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (9 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-20 20:55   ` Isaku Yamahata
  2022-04-28 15:54   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM Kai Huang
                   ` (11 subsequent siblings)
  22 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

As one step of initializing the TDX module, the memory regions that the
TDX module can use must be configured to it via an array of 'TD Memory
Regions' (TDMR).  The kernel is responsible for choosing which memory
regions to be used as TDX memory and building the array of TDMRs to
cover those memory regions.

The first generation of TDX-capable platforms basically guarantees all
system RAM regions during machine boot are Convertible Memory Regions
(excluding the memory below 1MB) and can be used by TDX.  The memory
pages allocated to TD guests can be any pages managed by the page
allocator.  To avoid having to modify the page allocator to distinguish
TDX and non-TDX memory allocation, adopt a simple policy to use all
system RAM regions as TDX memory.  The low 1MB pages are excluded from
TDX memory since they are not in CMRs in some platforms (those pages are
reserved at boot time and won't be managed by page allocator anyway).

This policy could be revised later if future TDX generations break
the guarantee or when the size of the metadata (~1/256th of the size of
the TDX usable memory) becomes a concern.  At that time a CMR-aware
page allocator may be necessary.

Also, on the first generation of TDX-capable machine, the system RAM
ranges discovered during boot time are all memory regions that kernel
can use during its runtime.  This is because the first generation of TDX
architecturally doesn't support ACPI memory hotplug (CMRs are generated
during machine boot and are static during machine's runtime).  Also, the
first generation of TDX-capable platform doesn't support TDX and ACPI
memory hotplug at the same time on a single machine.  Another case of
memory hotplug is user may use NVDIMM as system RAM via kmem driver.
But the first generation of TDX-capable machine doesn't support TDX and
NVDIMM simultaneously, therefore in practice it cannot happen.  One
special case is user may use 'memmap' kernel command line to reserve
part of system RAM as x86 legacy PMEMs, and user can theoretically add
them as system RAM via kmem driver.  This can be resolved by always
treating legacy PMEMs as TDX memory.

Implement a helper to loop over all RAM entries in e820 table to find
all system RAM ranges, as a preparation to covert all of them to TDX
memory.  Use 'e820_table', rather than 'e820_table_firmware' to honor
'mem' and 'memmap' command lines.  Following e820__memblock_setup(),
both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN types are treated as TDX
memory, and contiguous ranges in the same NUMA node are merged together.

One difference is, as mentioned above, x86 legacy PMEMs (E820_TYPE_PRAM)
are also always treated as TDX memory.  They are underneath RAM, and
they could be used as TD guest memory.  Always including them as TDX
memory also avoids having to modify memory hotplug code to handle adding
them as system RAM via kmem driver.

To begin with, sanity check all memory regions found in e820 are fully
covered by any CMR and can be used as TDX memory.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 228 +++++++++++++++++++++++++++++++++++-
 2 files changed, 228 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9113bf09f358..7414625b938f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1972,6 +1972,7 @@ config INTEL_TDX_HOST
 	default n
 	depends on CPU_SUP_INTEL
 	depends on X86_64
+	select NUMA_KEEP_MEMINFO if NUMA
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ec27350d53c1..6b0c51aaa7f2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -14,11 +14,13 @@
 #include <linux/smp.h>
 #include <linux/atomic.h>
 #include <linux/slab.h>
+#include <linux/math.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
 #include <asm/cpufeatures.h>
 #include <asm/virtext.h>
+#include <asm/e820/api.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -595,6 +597,222 @@ static int tdx_get_sysinfo(void)
 	return sanitize_cmrs(tdx_cmr_array, cmr_num);
 }
 
+/* Check whether one e820 entry is RAM and could be used as TDX memory */
+static bool e820_entry_is_ram(struct e820_entry *entry)
+{
+	/*
+	 * Besides E820_TYPE_RAM, E820_TYPE_RESERVED_KERN type entries
+	 * are also treated as TDX memory as they are also added to
+	 * memblock.memory in e820__memblock_setup().
+	 *
+	 * E820_TYPE_SOFT_RESERVED type entries are excluded as they are
+	 * marked as reserved and are not later freed to page allocator
+	 * (only part of kernel image, initrd, etc are freed to page
+	 * allocator).
+	 *
+	 * Also unconditionally treat x86 legacy PMEMs (E820_TYPE_PRAM)
+	 * as TDX memory since they are RAM underneath, and could be used
+	 * as TD guest memory.
+	 */
+	return (entry->type == E820_TYPE_RAM) ||
+		(entry->type == E820_TYPE_RESERVED_KERN) ||
+		(entry->type == E820_TYPE_PRAM);
+}
+
+/*
+ * The low memory below 1MB is not covered by CMRs on some TDX platforms.
+ * In practice, this range cannot be used for guest memory because it is
+ * not managed by the page allocator due to boot-time reservation.  Just
+ * skip the low 1MB so this range won't be treated as TDX memory.
+ *
+ * Return true if the e820 entry is completely skipped, in which case
+ * caller should ignore this entry.  Otherwise the actual memory range
+ * after skipping the low 1MB is returned via @start and @end.
+ */
+static bool e820_entry_skip_lowmem(struct e820_entry *entry, u64 *start,
+				   u64 *end)
+{
+	u64 _start = entry->addr;
+	u64 _end = entry->addr + entry->size;
+
+	if (_start < SZ_1M)
+		_start = SZ_1M;
+
+	*start = _start;
+	*end = _end;
+
+	return _start >= _end;
+}
+
+/*
+ * Trim away non-page-aligned memory at the beginning and the end for a
+ * given region.  Return true when there are still pages remaining after
+ * trimming, and the trimmed region is returned via @start and @end.
+ */
+static bool e820_entry_trim(u64 *start, u64 *end)
+{
+	u64 s, e;
+
+	s = round_up(*start, PAGE_SIZE);
+	e = round_down(*end, PAGE_SIZE);
+
+	if (s >= e)
+		return false;
+
+	*start = s;
+	*end = e;
+
+	return true;
+}
+
+/*
+ * Get the next memory region (excluding low 1MB) in e820.  @idx points
+ * to the entry to start to walk with.  Multiple memory regions in the
+ * same NUMA node that are contiguous are merged together (following
+ * e820__memblock_setup()).  The merged range is returned via @start and
+ * @end.  After return, @idx points to the next entry of the last RAM
+ * entry that has been walked, or table->nr_entries (indicating all
+ * entries in the e820 table have been walked).
+ */
+static void e820_next_mem(struct e820_table *table, int *idx, u64 *start,
+			  u64 *end)
+{
+	u64 rs, re;
+	int rnid, i;
+
+again:
+	rs = re = 0;
+	for (i = *idx; i < table->nr_entries; i++) {
+		struct e820_entry *entry = &table->entries[i];
+		u64 s, e;
+		int nid;
+
+		if (!e820_entry_is_ram(entry))
+			continue;
+
+		if (e820_entry_skip_lowmem(entry, &s, &e))
+			continue;
+
+		/*
+		 * Found the first RAM entry.  Record it and keep
+		 * looping to find other RAM entries that can be
+		 * merged.
+		 */
+		if (!rs) {
+			rs = s;
+			re = e;
+			rnid = phys_to_target_node(rs);
+			if (WARN_ON_ONCE(rnid == NUMA_NO_NODE))
+				rnid = 0;
+			continue;
+		}
+
+		/*
+		 * Try to merge with previous RAM entry.  E820 entries
+		 * are not necessarily page aligned.  For instance, the
+		 * setup_data elements in boot_params are marked as
+		 * E820_TYPE_RESERVED_KERN, and they may not be page
+		 * aligned.  In e820__memblock_setup() all adjancent
+		 * memory regions within the same NUMA node are merged to
+		 * a single one, and the non-page-aligned parts (at the
+		 * beginning and the end) are trimmed.  Follow the same
+		 * rule here.
+		 */
+		nid = phys_to_target_node(s);
+		if (WARN_ON_ONCE(nid == NUMA_NO_NODE))
+			nid = 0;
+		if ((nid == rnid) && (s == re)) {
+			/* Merge with previous range and update the end */
+			re = e;
+			continue;
+		}
+
+		/*
+		 * Stop if current entry cannot be merged with previous
+		 * one (or more) entries.
+		 */
+		break;
+	}
+
+	/*
+	 * @i is either the RAM entry that cannot be merged with previous
+	 * one (or more) entries, or table->nr_entries.
+	 */
+	*idx = i;
+	/*
+	 * Trim non-page-aligned parts of [@rs, @re), which is either a
+	 * valid memory region, or empty.  If there's nothing left after
+	 * trimming and there are still entries that have not been
+	 * walked, continue to walk.
+	 */
+	if (!e820_entry_trim(&rs, &re) && i < table->nr_entries)
+		goto again;
+
+	*start = rs;
+	*end = re;
+}
+
+/*
+ * Helper to loop all e820 RAM entries with low 1MB excluded
+ * in a given e820 table.
+ */
+#define _e820_for_each_mem(_table, _i, _start, _end)				\
+	for ((_i) = 0, e820_next_mem((_table), &(_i), &(_start), &(_end));	\
+		(_start) < (_end);						\
+		e820_next_mem((_table), &(_i), &(_start), &(_end)))
+
+/*
+ * Helper to loop all e820 RAM entries with low 1MB excluded
+ * in kernel modified 'e820_table' to honor 'mem' and 'memmap' kernel
+ * command lines.
+ */
+#define e820_for_each_mem(_i, _start, _end)	\
+	_e820_for_each_mem(e820_table, _i, _start, _end)
+
+/* Check whether first range is the subrange of the second */
+static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
+{
+	return (r1_start >= r2_start && r1_end <= r2_end) ? true : false;
+}
+
+/* Check whether address range is covered by any CMR or not. */
+static bool range_covered_by_cmr(struct cmr_info *cmr_array, int cmr_num,
+				 u64 start, u64 end)
+{
+	int i;
+
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		if (is_subrange(start, end, cmr->base, cmr->base + cmr->size))
+			return true;
+	}
+
+	return false;
+}
+
+/* Sanity check whether all e820 RAM entries are fully covered by CMRs. */
+static int e820_check_against_cmrs(void)
+{
+	u64 start, end;
+	int i;
+
+	/*
+	 * Loop over e820_table to find all RAM entries and check
+	 * whether they are all fully covered by any CMR.
+	 */
+	e820_for_each_mem(i, start, end) {
+		if (!range_covered_by_cmr(tdx_cmr_array, tdx_cmr_num,
+					start, end)) {
+			pr_err("[0x%llx, 0x%llx) is not fully convertible memory\n",
+					start, end);
+			return -EFAULT;
+		}
+	}
+
+	return 0;
+}
+
 static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 {
 	int i;
@@ -610,8 +828,16 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
+	int ret;
+
+	ret = e820_check_against_cmrs();
+	if (ret)
+		goto err;
+
 	/* Return -EFAULT until constructing TDMRs is done */
-	return -EFAULT;
+	ret = -EFAULT;
+err:
+	return ret;
 }
 
 static int init_tdx_module(void)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (10 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 11/21] x86/virt/tdx: Choose to use " Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-28 16:22   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

The kernel configures TDX usable memory regions to the TDX module via
an array of "TD Memory Region" (TDMR).  Each TDMR entry (TDMR_INFO)
contains the information of the base/size of a memory region, the
base/size of the associated Physical Address Metadata Table (PAMT) and
a list of reserved areas in the region.

Create a number of TDMRs according to the verified e820 RAM entries.
As the first step only set up the base/size information for each TDMR.

TDMR must be 1G aligned and the size must be in 1G granularity.  This
implies that one TDMR could cover multiple e820 RAM entries.  If a RAM
entry spans the 1GB boundary and the former part is already covered by
the previous TDMR, just create a new TDMR for the latter part.

TDX only supports a limited number of TDMRs (currently 64).  Abort the
TDMR construction process when the number of TDMRs exceeds this
limitation.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6b0c51aaa7f2..82534e70df96 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -54,6 +54,18 @@
 		((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+/* TDMR must be 1gb aligned */
+#define TDMR_ALIGNMENT		BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
+
+/* Align up and down the address to TDMR boundary */
+#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
+
+/* TDMR's start and end address */
+#define TDMR_START(_tdmr)	((_tdmr)->base)
+#define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
+
 /*
  * TDX module status during initialization
  */
@@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
 	return 0;
 }
 
+/* The starting offset of reserved areas within TDMR_INFO */
+#define TDMR_RSVD_START		64
+
+static struct tdmr_info *__alloc_tdmr(void)
+{
+	int tdmr_sz;
+
+	/*
+	 * TDMR_INFO's actual size depends on maximum number of reserved
+	 * areas that one TDMR supports.
+	 */
+	tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
+		sizeof(struct tdmr_reserved_area);
+
+	/*
+	 * TDX requires TDMR_INFO to be 512 aligned.  Always align up
+	 * TDMR_INFO size to 512 so the memory allocated via kzalloc()
+	 * can meet the alignment requirement.
+	 */
+	tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+
+	return kzalloc(tdmr_sz, GFP_KERNEL);
+}
+
+/* Create a new TDMR at given index in the TDMR array */
+static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
+{
+	struct tdmr_info *tdmr;
+
+	if (WARN_ON_ONCE(tdmr_array[idx]))
+		return NULL;
+
+	tdmr = __alloc_tdmr();
+	tdmr_array[idx] = tdmr;
+
+	return tdmr;
+}
+
 static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 {
 	int i;
@@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 	}
 }
 
+/*
+ * Create TDMRs to cover all RAM entries in e820_table.  The created
+ * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
+ * number of TDMRs.  All entries in @tdmr_array must be initially NULL.
+ */
+static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
+{
+	struct tdmr_info *tdmr;
+	u64 start, end;
+	int i, tdmr_idx;
+	int ret = 0;
+
+	tdmr_idx = 0;
+	tdmr = alloc_tdmr(tdmr_array, 0);
+	if (!tdmr)
+		return -ENOMEM;
+	/*
+	 * Loop over all RAM entries in e820 and create TDMRs to cover
+	 * them.  To keep it simple, always try to use one TDMR to cover
+	 * one RAM entry.
+	 */
+	e820_for_each_mem(i, start, end) {
+		start = TDMR_ALIGN_DOWN(start);
+		end = TDMR_ALIGN_UP(end);
+
+		/*
+		 * If the current TDMR's size hasn't been initialized, it
+		 * is a new allocated TDMR to cover the new RAM entry.
+		 * Otherwise the current TDMR already covers the previous
+		 * RAM entry.  In the latter case, check whether the
+		 * current RAM entry has been fully or partially covered
+		 * by the current TDMR, since TDMR is 1G aligned.
+		 */
+		if (tdmr->size) {
+			/*
+			 * Loop to next RAM entry if the current entry
+			 * is already fully covered by the current TDMR.
+			 */
+			if (end <= TDMR_END(tdmr))
+				continue;
+
+			/*
+			 * If part of current RAM entry has already been
+			 * covered by current TDMR, skip the already
+			 * covered part.
+			 */
+			if (start < TDMR_END(tdmr))
+				start = TDMR_END(tdmr);
+
+			/*
+			 * Create a new TDMR to cover the current RAM
+			 * entry, or the remaining part of it.
+			 */
+			tdmr_idx++;
+			if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
+				ret = -E2BIG;
+				goto err;
+			}
+			tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
+			if (!tdmr) {
+				ret = -ENOMEM;
+				goto err;
+			}
+		}
+
+		tdmr->base = start;
+		tdmr->size = end - start;
+	}
+
+	/* @tdmr_idx is always the index of last valid TDMR. */
+	*tdmr_num = tdmr_idx + 1;
+
+	return 0;
+err:
+	/*
+	 * Clean up already allocated TDMRs in case of error.  @tdmr_idx
+	 * indicates the last TDMR that wasn't created successfully,
+	 * therefore only needs to free @tdmr_idx TDMRs.
+	 */
+	free_tdmrs(tdmr_array, tdmr_idx);
+	return ret;
+}
+
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
 	int ret;
@@ -834,8 +967,13 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err;
 
+	ret = create_tdmrs(tdmr_array, tdmr_num);
+	if (ret)
+		goto err;
+
 	/* Return -EFAULT until constructing TDMRs is done */
 	ret = -EFAULT;
+	free_tdmrs(tdmr_array, *tdmr_num);
 err:
 	return ret;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (11 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-28 17:12   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 14/21] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

In order to provide crypto protection to guests, the TDX module uses
additional metadata to record things like which guest "owns" a given
page of memory.  This metadata, referred as Physical Address Metadata
Table (PAMT), essentially serves as the 'struct page' for the TDX
module.  PAMTs are not reserved by hardware upfront.  They must be
allocated by the kernel and then given to the TDX module.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.
Each PAMT must be a physically contiguous area from the Convertible
Memory Regions (CMR).  However, the PAMTs which track pages in one TDMR
do not need to reside within that TDMR but can be anywhere in CMRs.
If one PAMT overlaps with any TDMR, the overlapping part must be
reported as a reserved area in that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).

The current version of TDX supports at most 16 reserved areas per TDMR
to cover both PAMTs and potential memory holes within the TDMR.  If many
PAMTs are allocated within a single TDMR, 16 reserved areas may not be
sufficient to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
 2 files changed, 166 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7414625b938f..ff68d0829bd7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	select NUMA_KEEP_MEMINFO if NUMA
+	depends on CONTIG_ALLOC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 82534e70df96..1b807dcbc101 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -21,6 +21,7 @@
 #include <asm/cpufeatures.h>
 #include <asm/virtext.h>
 #include <asm/e820/api.h>
+#include <asm/pgtable.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -66,6 +67,16 @@
 #define TDMR_START(_tdmr)	((_tdmr)->base)
 #define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
 
+/* Page sizes supported by TDX */
+enum tdx_page_sz {
+	TDX_PG_4K = 0,
+	TDX_PG_2M,
+	TDX_PG_1G,
+	TDX_PG_MAX,
+};
+
+#define TDX_HPAGE_SHIFT	9
+
 /*
  * TDX module status during initialization
  */
@@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	return ret;
 }
 
+/* Calculate PAMT size given a TDMR and a page size */
+static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
+					enum tdx_page_sz pgsz)
+{
+	unsigned long pamt_sz;
+
+	pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
+		tdx_sysinfo.pamt_entry_size;
+	/* PAMT size must be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/* Calculate the size of all PAMTs for a TDMR */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
+{
+	enum tdx_page_sz pgsz;
+	unsigned long pamt_sz;
+
+	pamt_sz = 0;
+	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
+		pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
+
+	return pamt_sz;
+}
+
+/*
+ * Locate the NUMA node containing the start of the given TDMR's first
+ * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr)
+{
+	u64 start, end;
+	int i;
+
+	/* Find the first RAM entry covered by the TDMR */
+	e820_for_each_mem(i, start, end)
+		if (end > TDMR_START(tdmr))
+			break;
+
+	/*
+	 * One TDMR must cover at least one (or partial) RAM entry,
+	 * otherwise it is kernel bug.  WARN_ON() in this case.
+	 */
+	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
+		return 0;
+
+	/*
+	 * The first RAM entry may be partially covered by the previous
+	 * TDMR.  In this case, use TDMR's start to find the NUMA node.
+	 */
+	if (start < TDMR_START(tdmr))
+		start = TDMR_START(tdmr);
+
+	return phys_to_target_node(start);
+}
+
+static int tdmr_setup_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
+	unsigned long pamt_sz[TDX_PG_MAX];
+	unsigned long pamt_npages;
+	struct page *pamt;
+	enum tdx_page_sz pgsz;
+	int nid;
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	nid = tdmr_get_nid(tdmr);
+	pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
+	pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
+			&node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/* Calculate PAMT base and size for all supported page sizes. */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
+		unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
+
+		pamt_base[pgsz] = tdmr_pamt_base;
+		pamt_sz[pgsz] = sz;
+
+		tdmr_pamt_base += sz;
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
+	tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
+	tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
+	tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];
+
+	return 0;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_pfn, pamt_sz;
+
+	pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_sz)
+		return;
+
+	if (WARN_ON(!pamt_pfn))
+		return;
+
+	free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_free_pamt(tdmr_array[i]);
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i, ret;
+
+	for (i = 0; i < tdmr_num; i++) {
+		ret = tdmr_setup_pamt(tdmr_array[i]);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	return -ENOMEM;
+}
+
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
 	int ret;
@@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err;
 
+	ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err_free_tdmrs;
+
 	/* Return -EFAULT until constructing TDMRs is done */
 	ret = -EFAULT;
+	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
+err_free_tdmrs:
 	free_tdmrs(tdmr_array, *tdmr_num);
 err:
 	return ret;
@@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
 	 * initialization are done.
 	 */
 	ret = -EFAULT;
+	/*
+	 * Free PAMTs allocated in construct_tdmrs() when TDX module
+	 * initialization fails.
+	 */
+	if (ret)
+		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
 out_free_tdmrs:
 	/*
 	 * TDMRs are only used during initializing TDX module.  Always
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 14/21] x86/virt/tdx: Set up reserved areas for all TDMRs
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (12 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-06  4:49 ` [PATCH v3 15/21] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

As the last step of constructing TDMRs, create reserved area information
for the memory region holes in each TDMR.  If any PAMT (or part of it)
resides within a particular TDMR, also mark it as reserved.

All reserved areas in each TDMR must be in address ascending order,
required by TDX architecture.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 148 +++++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1b807dcbc101..bf0d13644898 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,7 @@
 #include <linux/atomic.h>
 #include <linux/slab.h>
 #include <linux/math.h>
+#include <linux/sort.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -1112,6 +1113,145 @@ static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
 	return -ENOMEM;
 }
 
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx,
+			      u64 addr, u64 size)
+{
+	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+	int idx = *p_idx;
+
+	/* Reserved area must be 4K aligned in offset and size */
+	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+		return -EINVAL;
+
+	/* Cannot exceed maximum reserved areas supported by TDX */
+	if (idx >= tdx_sysinfo.max_reserved_per_tdmr)
+		return -E2BIG;
+
+	rsvd_areas[idx].offset = addr - tdmr->base;
+	rsvd_areas[idx].size = size;
+
+	*p_idx = idx + 1;
+
+	return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+	if (r1->offset + r1->size <= r2->offset)
+		return -1;
+	if (r1->offset >= r2->offset + r2->size)
+		return 1;
+
+	/* Reserved areas cannot overlap.  Caller should guarantee. */
+	WARN_ON(1);
+	return -1;
+}
+
+/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
+static int tdmr_setup_rsvd_areas(struct tdmr_info *tdmr,
+				     struct tdmr_info **tdmr_array,
+				     int tdmr_num)
+{
+	u64 start, end, prev_end;
+	int rsvd_idx, i, ret = 0;
+
+	/* Mark holes between e820 RAM entries as reserved */
+	rsvd_idx = 0;
+	prev_end = TDMR_START(tdmr);
+	e820_for_each_mem(i, start, end) {
+		/* Break if this entry is after the TDMR */
+		if (start >= TDMR_END(tdmr))
+			break;
+
+		/* Exclude entries before this TDMR */
+		if (end < TDMR_START(tdmr))
+			continue;
+
+		/*
+		 * Skip if no hole exists before this entry. "<=" is
+		 * used because one e820 entry might span two TDMRs.
+		 * In that case the start address of this entry is
+		 * smaller then the start address of the second TDMR.
+		 */
+		if (start <= prev_end) {
+			prev_end = end;
+			continue;
+		}
+
+		/* Add the hole before this e820 entry */
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+				start - prev_end);
+		if (ret)
+			return ret;
+
+		prev_end = end;
+	}
+
+	/* Add the hole after the last RAM entry if it exists. */
+	if (prev_end < TDMR_END(tdmr)) {
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+				TDMR_END(tdmr) - prev_end);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Walk over all TDMRs to find out whether any PAMT falls into
+	 * the given TDMR. If yes, mark it as reserved too.
+	 */
+	for (i = 0; i < tdmr_num; i++) {
+		struct tdmr_info *tmp = tdmr_array[i];
+		u64 pamt_start, pamt_end;
+
+		pamt_start = tmp->pamt_4k_base;
+		pamt_end = pamt_start + tmp->pamt_4k_size +
+			tmp->pamt_2m_size + tmp->pamt_1g_size;
+
+		/* Skip PAMTs outside of the given TDMR */
+		if ((pamt_end <= TDMR_START(tdmr)) ||
+				(pamt_start >= TDMR_END(tdmr)))
+			continue;
+
+		/* Only mark the part within the TDMR as reserved */
+		if (pamt_start < TDMR_START(tdmr))
+			pamt_start = TDMR_START(tdmr);
+		if (pamt_end > TDMR_END(tdmr))
+			pamt_end = TDMR_END(tdmr);
+
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, pamt_start,
+				pamt_end - pamt_start);
+		if (ret)
+			return ret;
+	}
+
+	/* TDX requires reserved areas listed in address ascending order */
+	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+			rsvd_area_cmp_func, NULL);
+
+	return 0;
+}
+
+static int tdmrs_setup_rsvd_areas_all(struct tdmr_info **tdmr_array,
+				      int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		int ret;
+
+		ret = tdmr_setup_rsvd_areas(tdmr_array[i], tdmr_array,
+				tdmr_num);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
 	int ret;
@@ -1128,8 +1268,12 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err_free_tdmrs;
 
-	/* Return -EFAULT until constructing TDMRs is done */
-	ret = -EFAULT;
+	ret = tdmrs_setup_rsvd_areas_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err_free_pamts;
+
+	return 0;
+err_free_pamts:
 	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
 err_free_tdmrs:
 	free_tdmrs(tdmr_array, *tdmr_num);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 15/21] x86/virt/tdx: Reserve TDX module global KeyID
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (13 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 14/21] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-06  4:49 ` [PATCH v3 16/21] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

TDX module initialization requires to use one TDX private KeyID as the
global KeyID to crypto protect TDX metadata.  The global KeyID is
configured to the TDX module along with TDMRs.

Just reserve the first TDX private KeyID as the global KeyID.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bf0d13644898..ecd65f7014e2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -112,6 +112,9 @@ static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMEN
 static int tdx_cmr_num;
 static struct tdsysinfo_struct tdx_sysinfo;
 
+/* TDX global KeyID to protect TDX metadata */
+static u32 tdx_global_keyid;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -1320,6 +1323,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_tdmrs;
 
+	/*
+	 * Reserve the first TDX KeyID as global KeyID to protect
+	 * TDX module metadata.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 16/21] x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (14 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 15/21] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-06  4:49 ` [PATCH v3 17/21] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

After the TDX usable memory regions are constructed in an array of TDMRs
and the global KeyID is reserved, configure them to the TDX module.  The
configuration is done via TDH.SYS.CONFIG, which is one call and can be
done on any logical cpu.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 2 files changed, 44 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ecd65f7014e2..2bf49d3d7cfe 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1284,6 +1284,42 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	return ret;
 }
 
+static int config_tdx_module(struct tdmr_info **tdmr_array, int tdmr_num,
+			     u64 global_keyid)
+{
+	u64 *tdmr_pa_array;
+	int i, array_sz;
+	int ret;
+
+	/*
+	 * TDMR_INFO entries are configured to the TDX module via an
+	 * array of the physical address of each TDMR_INFO.  TDX requires
+	 * the array itself must be 512 aligned.  Round up the array size
+	 * to 512 aligned so the buffer allocated by kzalloc() meets the
+	 * alignment requirement.
+	 */
+	array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
+	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
+	if (!tdmr_pa_array)
+		return -ENOMEM;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_pa_array[i] = __pa(tdmr_array[i]);
+
+	/*
+	 * TDH.SYS.CONFIG fails when TDH.SYS.LP.INIT is not done on all
+	 * BIOS-enabled cpus.  tdx_init() only disables CPU hotplug but
+	 * doesn't do early check whether all BIOS-enabled cpus are
+	 * online, so TDH.SYS.CONFIG can fail here.
+	 */
+	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num,
+				global_keyid, 0, NULL, NULL);
+	/* Free the array as it is not required any more. */
+	kfree(tdmr_pa_array);
+
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdmr_info **tdmr_array;
@@ -1329,11 +1365,17 @@ static int init_tdx_module(void)
 	 */
 	tdx_global_keyid = tdx_keyid_start;
 
+	/* Config the TDX module with TDMRs and global KeyID */
+	ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid);
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
 	 */
 	ret = -EFAULT;
+out_free_pamts:
 	/*
 	 * Free PAMTs allocated in construct_tdmrs() when TDX module
 	 * initialization fails.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 05bf9fe6bd00..d8e2800397af 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -95,6 +95,7 @@ struct tdmr_reserved_area {
 } __packed;
 
 #define TDMR_INFO_ALIGNMENT	512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
 
 struct tdmr_info {
 	u64 base;
@@ -125,6 +126,7 @@ struct tdmr_info {
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
+#define TDH_SYS_CONFIG		45
 
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 17/21] x86/virt/tdx: Configure global KeyID on all packages
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (15 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 16/21] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-06  4:49 ` [PATCH v3 18/21] x86/virt/tdx: Initialize all TDMRs Kai Huang
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Before TDX module can use the global KeyID to access TDX metadata, the
key of the global KeyID must be configured on all physical packages via
TDH.SYS.KEY.CONFIG.  This SEAMCALL cannot run concurrently on different
cpus since it exclusively acquires the TDX module.

Implement a helper to run SEAMCALL on one (any) cpu for all packages in
serialized way, and run TDH.SYS.KEY.CONFIG on all packages using the
helper.

The TDX module uses the global KeyID to initialize its metadata (PAMTs).
Before TDX module can do that, all cachelines of PAMTs must be flushed.
Otherwise, they may silently corrupt the PAMTs later initialized by the
TDX module.

Use wbinvd to flush cache as PAMTs can be potentially large (~1/256th of
system RAM).

Flush cache before configuring the global KeyID on all packages, as
suggested by TDX specification.  In practice, the current generation of
TDX doesn't use the global KeyID in TDH.SYS.KEY.CONFIG.  Therefore in
practice flushing cache can be done after configuring the global KeyID
is done on all packages.  But the future generation of TDX may change
this behaviour, so just follow TDX specification's suggestion to flush
cache before configuring the global KeyID on all packages.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 96 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2bf49d3d7cfe..bb15122fb8bd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -23,6 +23,7 @@
 #include <asm/virtext.h>
 #include <asm/e820/api.h>
 #include <asm/pgtable.h>
+#include <asm/smp.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -398,6 +399,46 @@ static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
 	return atomic_read(&sc->err);
 }
 
+/*
+ * Call the SEAMCALL on one (any) cpu for each physical package in
+ * serialized way.  Return immediately in case of any error while
+ * calling SEAMCALL on any cpu.
+ *
+ * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
+ * to be atomic, but for simplicity just reuse it instead of adding
+ * a new one.
+ */
+static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
+{
+	cpumask_var_t packages;
+	int cpu, ret = 0;
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+					packages))
+			continue;
+
+		ret = smp_call_function_single(cpu, seamcall_smp_call_function,
+				sc, true);
+		if (ret)
+			break;
+
+		/*
+		 * Doesn't have to use atomic_read(), but it doesn't
+		 * hurt either.
+		 */
+		ret = atomic_read(&sc->err);
+		if (ret)
+			break;
+	}
+
+	free_cpumask_var(packages);
+	return ret;
+}
+
 static inline bool p_seamldr_ready(void)
 {
 	return !!p_seamldr_info.p_seamldr_ready;
@@ -1320,6 +1361,21 @@ static int config_tdx_module(struct tdmr_info **tdmr_array, int tdmr_num,
 	return ret;
 }
 
+static int config_global_keyid(void)
+{
+	struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
+
+	/*
+	 * Configure the key of the global KeyID on all packages by
+	 * calling TDH.SYS.KEY.CONFIG on all packages.
+	 *
+	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
+	 * a recoverable error).  Assume this is exceedingly rare and
+	 * just return error if encountered instead of retrying.
+	 */
+	return seamcall_on_each_package_serialized(&sc);
+}
+
 static int init_tdx_module(void)
 {
 	struct tdmr_info **tdmr_array;
@@ -1370,6 +1426,37 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * The same physical address associated with different KeyIDs
+	 * has separate cachelines.  Before using the new KeyID to access
+	 * some memory, the cachelines associated with the old KeyID must
+	 * be flushed, otherwise they may later silently corrupt the data
+	 * written with the new KeyID.  After cachelines associated with
+	 * the old KeyID are flushed, CPU speculative fetch using the old
+	 * KeyID is OK since the prefetched cachelines won't be consumed
+	 * by CPU core.
+	 *
+	 * TDX module initializes PAMTs using the global KeyID to crypto
+	 * protect them from malicious host.  Before that, the PAMTs are
+	 * used by kernel (with KeyID 0) and the cachelines associated
+	 * with the PAMTs must be flushed.  Given PAMTs are potentially
+	 * large (~1/256th of system RAM), just use WBINVD on all cpus to
+	 * flush the cache.
+	 *
+	 * In practice, the current generation of TDX doesn't use the
+	 * global KeyID in TDH.SYS.KEY.CONFIG.  Therefore in practice,
+	 * the cachelines can be flushed after configuring the global
+	 * KeyID on all pkgs is done.  But the future generation of TDX
+	 * may change this, so just follow the suggestion of TDX spec to
+	 * flush cache before TDH.SYS.KEY.CONFIG.
+	 */
+	wbinvd_on_all_cpus();
+
+	/* Config the key of global KeyID on all packages */
+	ret = config_global_keyid();
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
@@ -1380,8 +1467,15 @@ static int init_tdx_module(void)
 	 * Free PAMTs allocated in construct_tdmrs() when TDX module
 	 * initialization fails.
 	 */
-	if (ret)
+	if (ret) {
+		/*
+		 * Part of PAMTs may already have been initialized by
+		 * TDX module.  Flush cache before returning them back
+		 * to kernel.
+		 */
+		wbinvd_on_all_cpus();
 		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	}
 out_free_tdmrs:
 	/*
 	 * TDMRs are only used during initializing TDX module.  Always
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d8e2800397af..bba8cabea4bb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -122,6 +122,7 @@ struct tdmr_info {
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 18/21] x86/virt/tdx: Initialize all TDMRs
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (16 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 17/21] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-06  4:49 ` [PATCH v3 19/21] x86: Flush cache of TDX private memory during kexec() Kai Huang
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
TDX initialization.

All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
the TDX memory can be used to run any TD guest.  The SEAMCALL internally
uses the global KeyID to initialize PAMTs in order to crypto protect
them from malicious host kernel.  TDH.SYS.TDMR.INIT can be done any cpu.

The time of initializing TDMR is proportional to the size of the TDMR.
To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only
initializes an (implementation-specific) subset of PAMT entries of one
TDMR in one invocation.  The caller is responsible for calling
TDH.SYS.TDMR.INIT iteratively until all PAMT entries of the requested
TDMR are initialized.

Current implementation initializes TDMRs one by one.  It takes ~100ms on
a 2-socket machine with 2.2GHz CPUs and 64GB memory when the system is
idle.  Each TDH.SYS.TDMR.INIT takes ~7us on average.

TDX does allow different TDMRs to be initialized concurrently on
multiple CPUs. This parallel scheme could be introduced later when the
total initialization time becomes a real concern, e.g. on a platform
with a much bigger memory size.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 75 ++++++++++++++++++++++++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bb15122fb8bd..11bd1daffee3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1376,6 +1376,65 @@ static int config_global_keyid(void)
 	return seamcall_on_each_package_serialized(&sc);
 }
 
+/* Initialize one TDMR */
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+	u64 next;
+
+	/*
+	 * Initializing PAMT entries might be time-consuming (in
+	 * proportion to the size of the requested TDMR).  To avoid long
+	 * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
+	 * an (implementation-defined) subset of PAMT entries in one
+	 * invocation.
+	 *
+	 * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
+	 * of the requested TDMR are initialized (if next-to-initialize
+	 * address matches the end address of the TDMR).
+	 */
+	do {
+		struct tdx_module_output out;
+		int ret;
+
+		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0,
+				NULL, &out);
+		if (ret)
+			return ret;
+		/*
+		 * RDX contains 'next-to-initialize' address if
+		 * TDH.SYS.TDMR.INT succeeded.
+		 */
+		next = out.rdx;
+		if (need_resched())
+			cond_resched();
+	} while (next < tdmr->base + tdmr->size);
+
+	return 0;
+}
+
+/* Initialize all TDMRs */
+static int init_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i;
+
+	/*
+	 * Initialize TDMRs one-by-one for simplicity, though the TDX
+	 * architecture does allow different TDMRs to be initialized in
+	 * parallel on multiple CPUs.  Parallel initialization could
+	 * be added later when the time spent in the serialized scheme
+	 * becomes a real concern.
+	 */
+	for (i = 0; i < tdmr_num; i++) {
+		int ret;
+
+		ret = init_tdmr(tdmr_array[i]);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdmr_info **tdmr_array;
@@ -1457,11 +1516,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
-	/*
-	 * Return -EFAULT until all steps of TDX module
-	 * initialization are done.
-	 */
-	ret = -EFAULT;
+	/* Initialize TDMRs to complete the TDX module initialization */
+	ret = init_tdmrs(tdmr_array, tdmr_num);
+	if (ret)
+		goto out_free_pamts;
+
+	tdx_module_status = TDX_MODULE_INITIALIZED;
 out_free_pamts:
 	/*
 	 * Free PAMTs allocated in construct_tdmrs() when TDX module
@@ -1484,6 +1544,11 @@ static int init_tdx_module(void)
 	free_tdmrs(tdmr_array, tdmr_num);
 	kfree(tdmr_array);
 out:
+	if (ret)
+		pr_info("Failed to initialize TDX module.\n");
+	else
+		pr_info("TDX module initialized.\n");
+
 	return ret;
 }
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index bba8cabea4bb..212f83374c0a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -126,6 +126,7 @@ struct tdmr_info {
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
+#define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_LP_SHUTDOWN	44
 #define TDH_SYS_CONFIG		45
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 19/21] x86: Flush cache of TDX private memory during kexec()
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (17 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 18/21] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-06  4:49 ` [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX host support Kai Huang
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

If TDX is ever enabled and/or used to run any TD guests, the cachelines
of TDX private memory, including PAMTs, used by TDX module need to be
flushed before transiting to the new kernel otherwise they may silently
corrupt the new kernel.

TDX module can only be initialized once during its lifetime.  TDX does
not have interface to reset TDX module to an uninitialized state so it
could be initialized again.  If the old kernel has enabled TDX, the new
kernel won't be able to use TDX again.  Therefore, ideally the old
kernel should shut down the TDX module if it is ever initialized so that
no SEAMCALLs can be made to it again.

However, shutting down the TDX module requires calling SEAMCALL, which
requires cpu being in VMX operation (VMXON has been done).  Currently,
only KVM does entering/leaving VMX operation, so there's no guarantee
that all cpus are in VMX operation during kexec().  Therefore, this
implementation doesn't shut down the TDX module, but only does cache
flush and leaves the TDX module open.

And it's fine to leave the module open.  If the new kernel wants to use
TDX, it needs to go through the initialization process, and it will fail
at the first SEAMCALL due to the TDX module is not in the uninitialized
state.  If the new kernel doesn't want to use TDX, then the TDX module
won't run at all.

Following the implementation of SME support, use wbinvd() to flush cache
in stop_this_cpu().  Introduce a new function platform_has_tdx() to only
check whether the platform is TDX-capable and do wbinvd() when it is
true.  platform_has_tdx() returns true when SEAMRR is enabled and there
are enough TDX private KeyIDs to run at least one TD guest (both of
which are detected at boot time).  TDX is enabled on demand at runtime
and it has a state machine with mutex to protect multiple callers to
initialize TDX in parallel.  Getting TDX module state needs to hold the
mutex but stop_this_cpu() runs in interrupt context, so just check
whether platform supports TDX and flush cache.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kernel/process.c   | 15 ++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c | 14 ++++++++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c8af2ba6bb8a..513b9ce9a870 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,10 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 void tdx_detect_cpu(struct cpuinfo_x86 *c);
 int tdx_detect(void);
 int tdx_init(void);
+bool platform_has_tdx(void);
 #else
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
 static inline int tdx_detect(void) { return -ENODEV; }
 static inline int tdx_init(void) { return -ENODEV; }
+static inline bool platform_has_tdx(void) { return false; }
 #endif /* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dbaf12c43fe1..0238bd29af8a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -769,8 +769,21 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * In case of kexec, similar to SME, if TDX is ever enabled, the
+	 * cachelines of TDX private memory (including PAMTs) used by TDX
+	 * module need to be flushed before transiting to the new kernel,
+	 * otherwise they may silently corrupt the new kernel.
+	 *
+	 * Note TDX is enabled on demand at runtime, and enabling TDX has a
+	 * state machine protected with a mutex to prevent concurrent calls
+	 * from multiple callers.  Holding the mutex is required to get the
+	 * TDX enabling status, but this function runs in interrupt context.
+	 * So to make it simple, always flush cache when platform supports
+	 * TDX (detected at boot time), regardless whether TDX is truly
+	 * enabled by kernel.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if ((cpuid_eax(0x8000001f) & BIT(0)) || platform_has_tdx())
 		native_wbinvd();
 	for (;;) {
 		/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 11bd1daffee3..031af7b83cea 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1687,3 +1687,17 @@ int tdx_init(void)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdx_init);
+
+/**
+ * platform_has_tdx - Whether platform supports TDX
+ *
+ * Check whether platform supports TDX (i.e. TDX is enabled in BIOS),
+ * regardless whether TDX is truly enabled by kernel.
+ *
+ * Return true if SEAMRR is enabled, and there are sufficient TDX private
+ * KeyIDs to run TD guests.
+ */
+bool platform_has_tdx(void)
+{
+	return seamrr_enabled() && tdx_keyid_sufficient();
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX host support
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (18 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 19/21] x86: Flush cache of TDX private memory during kexec() Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-28 17:25   ` Dave Hansen
  2022-04-06  4:49 ` [PATCH v3 21/21] Documentation/x86: Add documentation for " Kai Huang
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Enabling TDX consumes additional memory (used by TDX as metadata) and
additional initialization time.  Introduce a kernel command line to
allow to opt-in TDX host kernel support when user truly wants to use
TDX.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 arch/x86/virt/vmx/tdx/tdx.c                     | 14 ++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..cfa5b36890ea 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5790,6 +5790,12 @@
 
 	tdfx=		[HW,DRM]
 
+	tdx_host=	[X86-64, TDX]
+			Format: {on|off}
+			on: Enable TDX host kernel support
+			off: Disable TDX host kernel support
+			Default is off.
+
 	test_suspend=	[SUSPEND][,N]
 			Specify "mem" (for Suspend-to-RAM) or "standby" (for
 			standby suspend) or "freeze" (for suspend type freeze)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 031af7b83cea..fee243cd454f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -116,6 +116,16 @@ static struct tdsysinfo_struct tdx_sysinfo;
 /* TDX global KeyID to protect TDX metadata */
 static u32 tdx_global_keyid;
 
+static bool enable_tdx_host;
+
+static int __init tdx_host_setup(char *s)
+{
+	if (!strcmp(s, "on"))
+		enable_tdx_host = true;
+	return 1;
+}
+__setup("tdx_host=", tdx_host_setup);
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -500,6 +510,10 @@ static int detect_p_seamldr(void)
 
 static int __tdx_detect(void)
 {
+	/* Disabled by kernel command line */
+	if (!enable_tdx_host)
+		goto no_tdx_module;
+
 	/* The TDX module is not loaded if SEAMRR is disabled */
 	if (!seamrr_enabled()) {
 		pr_info("SEAMRR not enabled.\n");
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v3 21/21] Documentation/x86: Add documentation for TDX host support
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (19 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX host support Kai Huang
@ 2022-04-06  4:49 ` Kai Huang
  2022-04-14 10:19 ` [PATCH v3 00/21] TDX host kernel support Kai Huang
  2022-04-26 20:13 ` Dave Hansen
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-06  4:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata,
	kai.huang

Add documentation for TDX host kernel support.  There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals.  Also reuse it for TDX host kernel support.

Introduce a new level menu "TDX Guest Internals" and move existing
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 326 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 313 insertions(+), 13 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index 8ca60256511b..d52ba7cf982d 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -7,8 +7,308 @@ Intel Trust Domain Extensions (TDX)
 Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
 the host and physical attacks by isolating the guest register state and by
 encrypting the guest memory. In TDX, a special TDX module sits between the
-host and the guest, and runs in a special mode and manages the guest/host
-separation.
+host and the guest, and runs in a special Secure Arbitration Mode (SEAM)
+and manages the guest/host separation.
+
+TDX Host Kernel Support
+=======================
+
+SEAM is an extension to the VMX architecture to define a new VMX root
+operation called 'SEAM VMX root' and a new VMX non-root operation called
+'VMX non-root'. Collectively, the SEAM VMX root and SEAM VMX non-root
+execution modes are called operation in SEAM.
+
+SEAM VMX root operation is designed to host a CPU-attested, software
+module called 'Intel TDX module' to manage virtual machine (VM) guests
+called Trust Domains (TD). The TDX module implements the functions to
+build, tear down, and start execution of TD VMs. SEAM VMX root is also
+designed to additionally host a CPU-attested, software module called the
+'Intel Persistent SEAMLDR (Intel P-SEAMLDR)' module to load and update
+the Intel TDX module.
+
+The software in SEAM VMX root runs in the memory region defined by the
+SEAM range register (SEAMRR). Access to this range is restricted to SEAM
+VMX root operation. Code fetches outside of SEAMRR when in SEAM VMX root
+operation are meant to be disallowed and lead to an unbreakable shutdown.
+
+TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
+protect TD guests. TDX reserves part of MKTME KeyID space as TDX private
+KeyIDs, which can only be used by software runs in SEAM. The physical
+address bits reserved for encoding TDX private KeyID are treated as
+reserved bits when not in SEAM operation. The partitioning of MKTME
+KeyIDs and TDX private KeyIDs is configured by BIOS.
+
+Host kernel transits to either the P-SEAMLDR or the TDX module via the
+new SEAMCALL instruction. SEAMCALL leaf functions are host-side interface
+functions defined by the P-SEAMLDR and the TDX module around the new
+SEAMCALL instruction. They are similar to a hypercall, except they are
+made by host kernel to the SEAM software modules.
+
+Before being able to manage TD guests, the TDX module must be loaded
+into SEAMRR and properly initialized using SEAMCALLs defined by TDX
+architecture. The current implementation assumes both P-SEAMLDR and
+TDX module are loaded by BIOS before the kernel boots.
+
+Detection and Initialization
+----------------------------
+
+The presence of SEAMRR is reported via a new SEAMRR bit (15) of the
+IA32_MTRRCAP MSR. The SEAMRR range registers consist of a pair of MSRs:
+IA32_SEAMRR_PHYS_BASE (0x1400) and IA32_SEAMRR_PHYS_MASK (0x1401).
+SEAMRR is enabled when bit 3 of IA32_SEAMRR_PHYS_BASE is set and
+bit 10/11 of IA32_SEAMRR_PHYS_MASK are set.
+
+However, there is no CPUID or MSR for querying the presence of the TDX
+module or the P-SEAMLDR. SEAMCALL fails with VMfailInvalid when SEAM
+software is not loaded, so SEAMCALL can be used to detect P-SEAMLDR and
+TDX module. SEAMLDR.INFO SEAMCALL is used to detect both P-SEAMLDR and
+TDX module.  Success of the SEAMCALL means P-SEAMLDR is loaded, and the
+P-SEAMLDR information returned by the SEAMCALL further tells whether TDX
+module is loaded or not.
+
+User can check whether the TDX module is initialized via dmesg:
+
+|  [..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209, build_num 160, major 1, minor 0
+|  [..] tdx: TDX module detected.
+|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+|  [..] tdx: TDX module initialized.
+
+Initializing TDX takes time (in seconds) and additional memory space (for
+metadata). Both are affected by the size of total usable memory which the
+TDX module is configured with. In particular, the TDX metadata consumes
+~1/256 of TDX usable memory. This leads to a non-negligible burden as the
+current implementation simply treats all E820 RAM ranges as TDX usable
+memory (all system RAM meets the security requirements on the first
+generation of TDX-capable platforms).
+
+Therefore, kernel uses lazy TDX initialization to avoid such burden for
+all users on a TDX-capable platform. The software component (e.g. KVM)
+which wants to use TDX is expected to call two helpers below to detect
+and initialize the TDX module until TDX is truly needed:
+
+        if (tdx_detect())
+                goto no_tdx;
+        if (tdx_init())
+                goto no_tdx;
+
+TDX detection and initialization are done via SEAMCALLs which require the
+CPU in VMX operation. The caller of the above two helpers should ensure
+that condition.
+
+Currently, only KVM is the only user of TDX and KVM already handles
+entering/leaving VMX operation. Letting KVM initialize TDX on demand
+avoids handling entering/leaving VMX operation, which isn't trivial, in
+core-kernel.
+
+In addition, a new kernel parameter 'tdx_host={on/off}' can be used to
+force disabling the TDX capability by the admin.
+
+TDX initialization includes a step where certain SEAMCALL must be called
+on every BIOS-enabled CPU (with a ACPI MADT entry marked as enabled).  As
+a result, CPU hotplug is temporarily disabled during initializing the TDX
+module.  Also, user should avoid using kernel command lines which impact
+kernel usable cpus and/or online cpus (such as 'maxcpus', 'nr_cpus' and
+'possible_cpus'), or offlining CPUs before initializing TDX. Doing so
+will lead to the mismatch between online CPUs and BIOS-enabled CPUs,
+resulting TDX module initialization failure.
+
+TDX Memory Management
+---------------------
+
+TDX architecture manages TDX memory via below data structures:
+
+- Convertible Memory Regions (CMRs)
+
+TDX provides increased levels of memory confidentiality and integrity.
+This requires special hardware support for features like memory
+encryption and storage of memory integrity checksums. A CMR represents a
+memory range that meets those requirements and can be used as TDX memory.
+The list of CMRs can be queried from TDX module.
+
+- TD Memory Regions (TDMRs)
+
+The TDX module manages TDX usable memory via TD Memory Regions (TDMR).
+Each TDMR has information of its base and size, its metadata (PAMT)'s
+base and size, and an array of reserved areas to hold the memory region
+address holes and PAMTs. TDMR must be 1G aligned and in 1G granularity.
+
+Host kernel is responsible for choosing which convertible memory regions
+(reside in CMRs) to use as TDX memory, and constructing a list of TDMRs
+to cover all those memory regions, and configure the TDMRs to TDX module.
+
+- Physical Address Metadata Tables (PAMTs)
+
+This metadata essentially serves as the 'struct page' for the TDX module,
+recording things like which TD guest 'owns' a given page of memory. Each
+TDMR has a dedicated PAMT.
+
+PAMT is not reserved by the hardware upfront and must be allocated by the
+kernel and given to the TDX module. PAMT for a given TDMR doesn't have
+to be within that TDMR, but a PAMT must be within one CMR.  Additionally,
+if a PAMT overlaps with a TDMR, the overlapping part must be marked as
+reserved in that particular TDMR.
+
+Kernel Policy of TDX Memory
+---------------------------
+
+The first generation of TDX essentially guarantees that all system RAM
+memory regions (excluding the memory below 1MB) are covered by CMRs.
+Currently, to avoid having to modify the page allocator to support both
+TDX and non-TDX allocation, the kernel choose to use all system RAM as
+TDX memory. A list of TDMRs are constructed based on all RAM entries in
+e820 table and configured to the TDX module.
+
+Limitations
+-----------
+
+Constructing TDMRs
+~~~~~~~~~~~~~~~~~~
+
+Currently, the kernel tries to create one TDMR for each RAM entry in
+e820. 'e820_table' is used to find all RAM entries to honor 'mem' and
+'memmap' kernel command line. However, 'memmap' command line may also
+result in many discrete RAM entries. TDX architecturally only supports a
+limited number of TDMRs (currently 64). In this case, constructing TDMRs
+may fail due to exceeding the maximum number of TDMRs. The user is
+responsible for not doing so otherwise TDX may not be available. This
+can be further enhanced by supporting merging adjacent TDMRs.
+
+PAMT allocation
+~~~~~~~~~~~~~~~
+
+Currently, the kernel allocates PAMT for each TDMR separately using
+alloc_contig_pages(). alloc_contig_pages() only guarantees the PAMT is
+allocated from a given NUMA node, but doesn't have control over
+allocating PAMT from a given TDMR range. This may result in all PAMTs
+on one NUMA node being within one single TDMR. PAMTs overlapping with
+a given TDMR must be put into the TDMR's reserved areas too. However TDX
+only supports a limited number of reserved areas per TDMR (currently 16),
+thus too many PAMTs in one NUMA node may result in constructing TDMR
+failure due to exceeding TDMR's maximum reserved areas.
+
+The user is responsible for not creating too many discrete RAM entries
+on one NUMA node, which may result in having too many TDMRs on one node,
+which eventually results in constructing TDMR failure due to exceeding
+the maximum reserved areas. This can be further enhanced to support
+per-NUMA-node PAMT allocation, which could reduce the number of PAMT to
+1 for each node.
+
+TDMR initialization
+~~~~~~~~~~~~~~~~~~~
+
+Currently, the kernel initialize TDMRs one by one. This may take couple
+of seconds to finish on large memory systems (TBs). This can be further
+enhanced by allowing initializing different TDMRs in parallel on multiple
+cpus.
+
+CPU hotplug
+~~~~~~~~~~~
+
+The first generation of TDX architecturally doesn't support ACPI CPU
+hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
+first generation of TDX-capable platforms don't support ACPI CPU hotplug
+either. Since this physically cannot happen, currently kernel doesn't
+have any check in ACPI CPU hotplug code path to disable it.
+
+Also, only TDX module initialization requires all BIOS-enabled cpus are
+online. After the initialization, any logical cpu can be brought down
+and brought up to online again later. Therefore this series doesn't
+change logical CPU hotplug either.
+
+This can be enhanced when any future generation of TDX starts to support
+ACPI cpu hotplug.
+
+Memory hotplug
+~~~~~~~~~~~~~~
+
+The first generation of TDX architecturally doesn't support memory
+hotplug. The CMRs are generated by BIOS during boot and it is fixed
+during machine's runtime.
+
+However, the first generation of TDX-capable platforms don't support ACPI
+memory hotplug. Since it physically cannot happen, currently kernel
+doesn't have any check in ACPI memory hotplug code path to disable it.
+
+A special case of memory hotplug is adding NVDIMM as system RAM using
+kmem driver. However the first generation of TDX-capable platforms
+cannot turn on TDX and NVDIMM simultaneously, so in practice this cannot
+happen either.
+
+Another case is admin can use 'memmap' kernel command line to create
+legacy PMEMs and use them as TD guest memory, or theoretically, can use
+kmem driver to add them as system RAM. Current implementation always
+includes legacy PMEMs when constructing TDMRs so they are also TDX memory.
+So legacy PMEMs can either be used as TD guest memory directly or can be
+converted to system RAM via kmem driver.
+
+This can be enhanced when future generation of TDX starts to support ACPI
+memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
+same platform.
+
+Kexec interaction
+~~~~~~~~~~~~~~~~~
+
+The TDX module can be initialized only once during its lifetime. The
+first generation of TDX doesn't have interface to reset TDX module to
+uninitialized state so it can be initialized again.
+
+This implies:
+
+  - If the old kernel fails to initialize TDX, the new kernel cannot
+    use TDX too unless the new kernel fixes the bug which leads to
+    initialization failure in the old kernel and can resume from where
+    the old kernel stops. This requires certain coordination between
+    the two kernels.
+
+  - If the old kernel has initialized TDX successfully, the new kernel
+    may be able to use TDX if the two kernels have exactly the same
+    configurations on the TDX module. It further requires the new kernel
+    to reserve the TDX metadata pages (allocated by the old kernel) in
+    its page allocator. It also requires coordination between the two
+    kernels. Furthermore, if kexec() is done when there are active TD
+    guests running, the new kernel cannot use TDX because it's extremely
+    hard for the old kernel to pass all TDX private pages to the new
+    kernel.
+
+Given that, the current implementation doesn't support TDX after kexec()
+(except the old kernel hasn't initialized TDX at all).
+
+The current implementation doesn't shut down TDX module but leaves it
+open during kexec().  This is because shutting down TDX module requires
+CPU being in VMX operation but there's no guarantee of this during
+kexec(). Leaving the TDX module open is not the best case, but it is OK
+since the new kernel won't be able to use TDX anyway (therefore TDX
+module won't run at all).
+
+This can be further enhanced when core-kernele (non-KVM) can handle
+VMXON.
+
+If TDX is ever enabled and/or used to run any TD guests, the cachelines
+of TDX private memory, including PAMTs, used by TDX module need to be
+flushed before transiting to the new kernel otherwise they may silently
+corrupt the new kernel. Similar to SME, the current implementation
+flushes cache in stop_this_cpu().
+
+Initialization errors
+~~~~~~~~~~~~~~~~~~~~~
+
+Currently, any error happened during TDX initialization moves the TDX
+module to the SHUTDOWN state. No SEAMCALL is allowed in this state, and
+the TDX module cannot be re-initialized without a hard reset.
+
+This can be further enhanced to treat some errors as recoverable errors
+and let the caller retry later. A more detailed state machine can be
+added to record the internal state of TDX module, and the initialization
+can resume from that state in the next try.
+
+Specifically, there are three cases that can be treated as recoverable
+error: 1) -ENOMEM (i.e. due to PAMT allocation); 2) TDH.SYS.CONFIG error
+due to TDH.SYS.LP.INIT is not called on all cpus (i.e. due to offline
+cpus); 3) -EPERM when the caller doesn't guarantee all cpus are in VMX
+operation.
+
+TDX Guest Internals
+===================
 
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
@@ -20,7 +320,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +330,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +341,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +352,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +373,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -91,7 +391,7 @@ how to handle. The guest kernel may ask the hypervisor for the value with
 a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -104,7 +404,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -124,7 +424,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Accesses to private mappings can also cause #VEs.  Since all kernel memory
 is also private memory, the kernel might theoretically need to handle a
@@ -142,7 +442,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, unhandled userspace #VE's result in a SIGSEGV.
@@ -163,7 +463,7 @@ While the block is in place, #VE's are elevated to double faults (#DF)
 which are not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to
 a mapping which will cause a VMEXIT on access, and then the hypervisor emulates
@@ -185,7 +485,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However some kernel users like device
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (20 preceding siblings ...)
  2022-04-06  4:49 ` [PATCH v3 21/21] Documentation/x86: Add documentation for " Kai Huang
@ 2022-04-14 10:19 ` Kai Huang
  2022-04-26 20:13 ` Dave Hansen
  22 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-14 10:19 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-06 at 16:49 +1200, Kai Huang wrote:
> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  This series provides support for
> initializing the TDX module in the host kernel.  KVM support for TDX is
> being developed separately[1].
> 
> The code has been tested on couple of TDX-capable machines.  I would
> consider it as ready for review. I highly appreciate if anyone can help
> to review this series (from high level design to detail implementations).
> For Intel reviewers (CC'ed), please help to review, and I would
> appreciate Reviewed-by or Acked-by tags if the patches look good to you.

Hi Intel reviewers,

Kindly ping.  Could you help to review?

--
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-06  4:49 ` [PATCH v3 01/21] x86/virt/tdx: Detect SEAM Kai Huang
@ 2022-04-18 22:29   ` Sathyanarayanan Kuppuswamy
  2022-04-18 22:50     ` Sean Christopherson
  2022-04-19  3:38     ` Kai Huang
  2022-04-26 20:21   ` Dave Hansen
  1 sibling, 2 replies; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-18 22:29 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> +/* BIOS must configure SEAMRR registers for all cores consistently */
> +static u64 seamrr_base, seamrr_mask;
> +
> +static bool __seamrr_enabled(void)
> +{
> +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> +}
> +
> +static void detect_seam_bsp(struct cpuinfo_x86 *c)
> +{
> +	u64 mtrrcap, base, mask;
> +
> +	/* SEAMRR is reported via MTRRcap */
> +	if (!boot_cpu_has(X86_FEATURE_MTRR))
> +		return;
> +
> +	rdmsrl(MSR_MTRRcap, mtrrcap);
> +	if (!(mtrrcap & MTRR_CAP_SEAMRR))
> +		return;
> +
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> +	if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
> +		pr_info("SEAMRR base is not configured by BIOS\n");
> +		return;
> +	}
> +
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> +	if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
> +		pr_info("SEAMRR is not enabled by BIOS\n");
> +		return;
> +	}
> +
> +	seamrr_base = base;
> +	seamrr_mask = mask;
> +}
> +
> +static void detect_seam_ap(struct cpuinfo_x86 *c)
> +{
> +	u64 base, mask;
> +
> +	/*
> +	 * Don't bother to detect this AP if SEAMRR is not
> +	 * enabled after earlier detections.
> +	 */
> +	if (!__seamrr_enabled())
> +		return;
> +
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> +
> +	if (base == seamrr_base && mask == seamrr_mask)
> +		return;
> +
> +	pr_err("Inconsistent SEAMRR configuration by BIOS\n");

Do we need to panic for SEAM config issue (for security)?

> +	/* Mark SEAMRR as disabled. */
> +	seamrr_base = 0;
> +	seamrr_mask = 0
> +}
> +
> +static void detect_seam(struct cpuinfo_x86 *c)
> +{

why not do this check directly in tdx_detect_cpu()?

> +	if (c == &boot_cpu_data)
> +		detect_seam_bsp(c);
> +	else
> +		detect_seam_ap(c);
> +}
> +
> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> +{
> +	detect_seam(c);
> +}

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-18 22:29   ` Sathyanarayanan Kuppuswamy
@ 2022-04-18 22:50     ` Sean Christopherson
  2022-04-19  3:38     ` Kai Huang
  1 sibling, 0 replies; 156+ messages in thread
From: Sean Christopherson @ 2022-04-18 22:50 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Kai Huang, linux-kernel, kvm, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, isaku.yamahata

On Mon, Apr 18, 2022, Sathyanarayanan Kuppuswamy wrote:
> > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > +{
> > +	u64 base, mask;
> > +
> > +	/*
> > +	 * Don't bother to detect this AP if SEAMRR is not
> > +	 * enabled after earlier detections.
> > +	 */
> > +	if (!__seamrr_enabled())
> > +		return;
> > +
> > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > +
> > +	if (base == seamrr_base && mask == seamrr_mask)
> > +		return;
> > +
> > +	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> 
> Do we need to panic for SEAM config issue (for security)?

No, clearing seamrr_mask will effectively prevent the kernel from attempting to
use TDX or any other feature that might depend on SEAM.  Panicking because the
user's BIOS is crappy would be to kicking them while they're down. 

As for security, it's the TDX Module's responsibility to validate the security
properties of the system, the kernel only cares about not dying/crashing.

> > +	/* Mark SEAMRR as disabled. */
> > +	seamrr_base = 0;
> > +	seamrr_mask = 0
> > +}

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-18 22:29   ` Sathyanarayanan Kuppuswamy
  2022-04-18 22:50     ` Sean Christopherson
@ 2022-04-19  3:38     ` Kai Huang
  1 sibling, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-19  3:38 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata


> > +
> > +static void detect_seam(struct cpuinfo_x86 *c)
> > +{
> 
> why not do this check directly in tdx_detect_cpu()?

The second patch will detect TDX KeyID too.  I suppose you are saying below is
better?

void tdx_detect_cpu(struct cpuinfo_x86 *c)
{
	if (c == &boot_cpu_data) {
		detect_seam_bsp(c);
		detect_tdx_keyids_bsp(c);
	} else {
		detect_seam_ap(c);
		detect_tdx_keyids_ap(c);
	}
}

I personally don't see how above is better than the current way.  Instead, I
think having SEAM and TDX KeyID detection code in single function respectively
is more flexible for future extension (if needed).


> 
> > +	if (c == &boot_cpu_data)
> > +		detect_seam_bsp(c);
> > +	else
> > +		detect_seam_ap(c);
> > +}
> > +
> > +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > +{
> > +	detect_seam(c);
> > +}
> 


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs
  2022-04-06  4:49 ` [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs Kai Huang
@ 2022-04-19  5:39   ` Sathyanarayanan Kuppuswamy
  2022-04-19  9:41     ` Kai Huang
  2022-04-19  5:42   ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-19  5:39 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> + * Intel Trusted Domain CPU Architecture Extension spec:

In TDX guest code, we have been using TDX as "Intel Trust Domain
Extensions". It also aligns with spec. Maybe you should change
your patch set to use the same.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs
  2022-04-06  4:49 ` [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs Kai Huang
  2022-04-19  5:39   ` Sathyanarayanan Kuppuswamy
@ 2022-04-19  5:42   ` Sathyanarayanan Kuppuswamy
  2022-04-19 10:07     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-19  5:42 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
>   	detect_seam(c);
> +	detect_tdx_keyids(c);

Do you want to add some return value to detect_seam() and not
proceed if it fails?

In case if this function is going to be extended by future
patch set, maybe do the same for detect_tdx_keyids()?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs
  2022-04-19  5:39   ` Sathyanarayanan Kuppuswamy
@ 2022-04-19  9:41     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-19  9:41 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Mon, 2022-04-18 at 22:39 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > + * Intel Trusted Domain CPU Architecture Extension spec:
> 
> In TDX guest code, we have been using TDX as "Intel Trust Domain
> Extensions". It also aligns with spec. Maybe you should change
> your patch set to use the same.
> 

Yeah will change to use "Intel Trust Domain ...".  Thanks. 

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs
  2022-04-19  5:42   ` Sathyanarayanan Kuppuswamy
@ 2022-04-19 10:07     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-19 10:07 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Mon, 2022-04-18 at 22:42 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> >   	detect_seam(c);
> > +	detect_tdx_keyids(c);
> 
> Do you want to add some return value to detect_seam() and not
> proceed if it fails?

I don't think this function should return.  However it may make sense to stop
detecting TDX KeyIDs when on some cpu SEAMRR is detected as not enabled (i.e. on
BSP when SEAMRR is not enabled by BIOS, or on any AP when there's BIOS bug that
BIOS doesn't configure SEAMRR consistently on all cpus).  The reason is TDX
KeyIDs can only be accessed by software runs in SEAM mode.  So if SEAMRR
configuration is broken, TDX KeyID configuration probably is broken too.

However detect_tdx_keyids() essentially only uses rdmsr_safe() to read some MSR,
so if there's any problem, rdmsr_safe() will catch it.  And SEAMRR is always
checked before doing any TDX related staff later, therefore in practice there
will be no problem.  But anyway I guess there's no harm to add additional SEAMRR
check in detect_tdx_keyids().  I'll think more on this.  Thanks.

> 
> In case if this function is going to be extended by future
> patch set, maybe do the same for detect_tdx_keyids()?
> 

I'd prefer to leaving this in current way until there's a real need.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-06  4:49 ` [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function Kai Huang
@ 2022-04-19 14:07   ` Sathyanarayanan Kuppuswamy
  2022-04-20  4:16     ` Kai Huang
  2022-04-26 20:37   ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-19 14:07 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> ABI.  Instead, they share the same ABI with the TDCALL leaf functions.

TDCALL is a new term for this patch set. Maybe add some detail about
it in ()?.

> %rax is used to carry both the SEAMCALL leaf function number (input) and
> the completion status code (output).  Additional GPRs (%rcx, %rdx,
> %r8->%r11) may be further used as both input and output operands in
> individual leaf functions.



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-06  4:49 ` [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand Kai Huang
@ 2022-04-19 14:53   ` Sathyanarayanan Kuppuswamy
  2022-04-20  4:37     ` Kai Huang
  2022-04-26 20:53   ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-19 14:53 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> The TDX module is essentially a CPU-attested software module running
> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> host and certain physical attacks.  The TDX module implements the

/s/host/hosts

> functions to build, tear down and start execution of the protected VMs
> called Trusted Domains (TD).  Before the TDX module can be used to

/s/Trusted/Trust

> create and run TD guests, it must be loaded into the SEAM Range Register
> (SEAMRR) and properly initialized.  The TDX module is expected to be
> loaded by BIOS before booting to the kernel, and the kernel is expected
> to detect and initialize it, using the SEAMCALLs defined by TDX
> architecture.
> 
> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> on-demand approach to initialize TDX until there is a real need (e.g
> when requested by KVM).  This avoids consuming the memory that must be
> allocated by kernel and given to the TDX module as metadata (~1/256th of

allocated by the kernel

> the TDX-usable memory), and also saves the time of initializing the TDX
> module (and the metadata) when TDX is not used at all.  Initializing the
> TDX module at runtime on-demand also is more flexible to support TDX
> module runtime updating in the future (after updating the TDX module, it
> needs to be initialized again).
> 
> Introduce two placeholders tdx_detect() and tdx_init() to detect and
> initialize the TDX module on demand, with a state machine introduced to
> orchestrate the entire process (in case of multiple callers).
> 
> To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs.  The
> TDX module is reported as not loaded if either SEAMRR is not enabled, or
> there are no enough TDX private KeyIDs to create any TD guest.  The TDX
> module itself requires one global TDX private KeyID to crypto protect
> its metadata.
> 
> And tdx_init() is currently empty.  The TDX module will be initialized
> in multi-steps defined by the TDX architecture:
> 
>    1) Global initialization;
>    2) Logical-CPU scope initialization;
>    3) Enumerate the TDX module capabilities and platform configuration;
>    4) Configure the TDX module about usable memory ranges and global
>       KeyID information;
>    5) Package-scope configuration for the global KeyID;
>    6) Initialize usable memory ranges based on 4).
> 
> The TDX module can also be shut down at any time during its lifetime.
> In case of any error during the initialization process, shut down the
> module.  It's pointless to leave the module in any intermediate state
> during the initialization.
> 
> SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
> operation (VMXON has been done), otherwise it generates #UD.  So far
> only KVM handles VMXON/VMXOFF.  Choose to not handle VMXON/VMXOFF in
> tdx_detect() and tdx_init() but depend on the caller to guarantee that,
> since so far KVM is the only user of TDX.  In the long term, more kernel
> components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
> module runtime update), so a reference-based approach to do VMXON/VMXOFF
> is likely needed.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>   arch/x86/include/asm/tdx.h  |   4 +
>   arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 226 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 1f29813b1646..c8af2ba6bb8a 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>   
>   #ifdef CONFIG_INTEL_TDX_HOST
>   void tdx_detect_cpu(struct cpuinfo_x86 *c);
> +int tdx_detect(void);
> +int tdx_init(void);
>   #else
>   static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
> +static inline int tdx_detect(void) { return -ENODEV; }
> +static inline int tdx_init(void) { return -ENODEV; }
>   #endif /* CONFIG_INTEL_TDX_HOST */
>   
>   #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ba2210001ea8..53093d4ad458 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -9,6 +9,8 @@
>   
>   #include <linux/types.h>
>   #include <linux/cpumask.h>
> +#include <linux/mutex.h>
> +#include <linux/cpu.h>
>   #include <asm/msr-index.h>
>   #include <asm/msr.h>
>   #include <asm/cpufeature.h>
> @@ -45,12 +47,33 @@
>   		((u32)(((_keyid_part) & 0xffffffffull) + 1))
>   #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
>   
> +/*
> + * TDX module status during initialization
> + */
> +enum tdx_module_status_t {
> +	/* TDX module status is unknown */
> +	TDX_MODULE_UNKNOWN,
> +	/* TDX module is not loaded */
> +	TDX_MODULE_NONE,
> +	/* TDX module is loaded, but not initialized */
> +	TDX_MODULE_LOADED,
> +	/* TDX module is fully initialized */
> +	TDX_MODULE_INITIALIZED,
> +	/* TDX module is shutdown due to error during initialization */
> +	TDX_MODULE_SHUTDOWN,
> +};
> +

May be adding these states when you really need will make
more sense. Currently this patch only uses SHUTDOWN and
NONE states. Other state usage is not very clear.

>   /* BIOS must configure SEAMRR registers for all cores consistently */
>   static u64 seamrr_base, seamrr_mask;
>   
>   static u32 tdx_keyid_start;
>   static u32 tdx_keyid_num;
>   
> +static enum tdx_module_status_t tdx_module_status;
> +
> +/* Prevent concurrent attempts on TDX detection and initialization */
> +static DEFINE_MUTEX(tdx_module_lock);

Any possible concurrent usage models?

> +
>   static bool __seamrr_enabled(void)
>   {
>   	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> @@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
>   	detect_seam(c);
>   	detect_tdx_keyids(c);
>   }
> +
> +static bool seamrr_enabled(void)
> +{
> +	/*
> +	 * To detect any BIOS misconfiguration among cores, all logical
> +	 * cpus must have been brought up at least once.  This is true
> +	 * unless 'maxcpus' kernel command line is used to limit the
> +	 * number of cpus to be brought up during boot time.  However
> +	 * 'maxcpus' is basically an invalid operation mode due to the
> +	 * MCE broadcast problem, and it should not be used on a TDX
> +	 * capable machine.  Just do paranoid check here and do not

a paranoid check

> +	 * report SEAMRR as enabled in this case.
> +	 */
> +	if (!cpumask_equal(&cpus_booted_once_mask,
> +					cpu_present_mask))
> +		return false;
> +
> +	return __seamrr_enabled();
> +}
> +
> +static bool tdx_keyid_sufficient(void)
> +{
> +	if (!cpumask_equal(&cpus_booted_once_mask,
> +					cpu_present_mask))
> +		return false;
> +
> +	/*
> +	 * TDX requires at least two KeyIDs: one global KeyID to
> +	 * protect the metadata of the TDX module and one or more
> +	 * KeyIDs to run TD guests.
> +	 */
> +	return tdx_keyid_num >= 2;
> +}
> +
> +static int __tdx_detect(void)
> +{
> +	/* The TDX module is not loaded if SEAMRR is disabled */
> +	if (!seamrr_enabled()) {
> +		pr_info("SEAMRR not enabled.\n");
> +		goto no_tdx_module;
> +	}
> +
> +	/*
> +	 * Also do not report the TDX module as loaded if there's
> +	 * no enough TDX private KeyIDs to run any TD guests.
> +	 */

You are not returning TDX_MODULE_LOADED under any current
scenarios. So think above comment is not accurate.

> +	if (!tdx_keyid_sufficient()) {
> +		pr_info("Number of TDX private KeyIDs too small: %u.\n",
> +				tdx_keyid_num);
> +		goto no_tdx_module;
> +	}
> +
> +	/* Return -ENODEV until the TDX module is detected */
> +no_tdx_module:
> +	tdx_module_status = TDX_MODULE_NONE;
> +	return -ENODEV;
> +}
> +
> +static int init_tdx_module(void)
> +{
> +	/*
> +	 * Return -EFAULT until all steps of TDX module
> +	 * initialization are done.
> +	 */
> +	return -EFAULT;
> +}
> +
> +static void shutdown_tdx_module(void)
> +{
> +	/* TODO: Shut down the TDX module */
> +	tdx_module_status = TDX_MODULE_SHUTDOWN;
> +}
> +
> +static int __tdx_init(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * Logical-cpu scope initialization requires calling one SEAMCALL
> +	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
> +	 * module also has such requirement.  Further more, configuring

such a requirement

> +	 * the key of the global KeyID requires calling one SEAMCALL for
> +	 * each package.  For simplicity, disable CPU hotplug in the whole
> +	 * initialization process.
> +	 *
> +	 * It's perhaps better to check whether all BIOS-enabled cpus are
> +	 * online before starting initializing, and return early if not.
> +	 * But none of 'possible', 'present' and 'online' CPU masks
> +	 * represents BIOS-enabled cpus.  For example, 'possible' mask is
> +	 * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> +	 * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> +	 * online.
> +	 */
> +	cpus_read_lock();
> +
> +	ret = init_tdx_module();
> +
> +	/*
> +	 * Shut down the TDX module in case of any error during the
> +	 * initialization process.  It's meaningless to leave the TDX
> +	 * module in any middle state of the initialization process.
> +	 */
> +	if (ret)
> +		shutdown_tdx_module();
> +
> +	cpus_read_unlock();
> +
> +	return ret;
> +}
> +
> +/**
> + * tdx_detect - Detect whether the TDX module has been loaded
> + *
> + * Detect whether the TDX module has been loaded and ready for
> + * initialization.  Only call this function when all cpus are
> + * already in VMX operation.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0:	The TDX module has been loaded and ready for
> + *		initialization.
> + * * -ENODEV:	The TDX module is not loaded.
> + * * -EPERM:	CPU is not in VMX operation.
> + * * -EFAULT:	Other internal fatal errors.
> + */
> +int tdx_detect(void)

Will this function be used separately or always along with
tdx_init()?

> +{
> +	int ret;
> +
> +	mutex_lock(&tdx_module_lock);
> +
> +	switch (tdx_module_status) {
> +	case TDX_MODULE_UNKNOWN:
> +		ret = __tdx_detect();
> +		break;
> +	case TDX_MODULE_NONE:
> +		ret = -ENODEV;
> +		break;
> +	case TDX_MODULE_LOADED:
> +	case TDX_MODULE_INITIALIZED:
> +		ret = 0;
> +		break;
> +	case TDX_MODULE_SHUTDOWN:
> +		ret = -EFAULT;
> +		break;
> +	default:
> +		WARN_ON(1);
> +		ret = -EFAULT;
> +	}
> +
> +	mutex_unlock(&tdx_module_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_detect);
> +
> +/**
> + * tdx_init - Initialize the TDX module

If it for tdx module initialization, why not call it
tdx_module_init()? If not, update the description
appropriately.

> + *
> + * Initialize the TDX module to make it ready to run TD guests.  This
> + * function should be called after tdx_detect() returns successful.
> + * Only call this function when all cpus are online and are in VMX
> + * operation.  CPU hotplug is temporarily disabled internally.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0:	The TDX module has been successfully initialized.
> + * * -ENODEV:	The TDX module is not loaded.
> + * * -EPERM:	The CPU which does SEAMCALL is not in VMX operation.
> + * * -EFAULT:	Other internal fatal errors.
> + */

You return differnt error values just for debug prints or there are
other uses for it?

> +int tdx_init(void)
> +{
> +	int ret;
> +
> +	mutex_lock(&tdx_module_lock);
> +
> +	switch (tdx_module_status) {
> +	case TDX_MODULE_NONE:
> +		ret = -ENODEV;
> +		break;
> +	case TDX_MODULE_LOADED:

> +		ret = __tdx_init();
> +		break;
> +	case TDX_MODULE_INITIALIZED:
> +		ret = 0;
> +		break;
> +	default:
> +		ret = -EFAULT;
> +		break;
> +	}
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_init);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-19 14:07   ` Sathyanarayanan Kuppuswamy
@ 2022-04-20  4:16     ` Kai Huang
  2022-04-20  7:29       ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-20  4:16 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Tue, 2022-04-19 at 07:07 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> > ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
> 
> TDCALL is a new term for this patch set. Maybe add some detail about
> it in ()?.
> 
> > 

TDCALL implementation is already in tip/tdx.  This series will be rebased to it.
I don't think we need to explain more about something that is already in the tip
tree?


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-19 14:53   ` Sathyanarayanan Kuppuswamy
@ 2022-04-20  4:37     ` Kai Huang
  2022-04-20  5:21       ` Dave Hansen
  2022-04-20 14:30       ` Sathyanarayanan Kuppuswamy
  0 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-20  4:37 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Tue, 2022-04-19 at 07:53 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > The TDX module is essentially a CPU-attested software module running
> > in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> > host and certain physical attacks.  The TDX module implements the
> 
> /s/host/hosts

I don't quite get.  Could you explain why there are multiple hosts?

> 
> > functions to build, tear down and start execution of the protected VMs
> > called Trusted Domains (TD).  Before the TDX module can be used to
> 
> /s/Trusted/Trust

Thanks.

> 
> > create and run TD guests, it must be loaded into the SEAM Range Register
> > (SEAMRR) and properly initialized.  The TDX module is expected to be
> > loaded by BIOS before booting to the kernel, and the kernel is expected
> > to detect and initialize it, using the SEAMCALLs defined by TDX
> > architecture.
> > 
> > The TDX module can be initialized only once in its lifetime.  Instead
> > of always initializing it at boot time, this implementation chooses an
> > on-demand approach to initialize TDX until there is a real need (e.g
> > when requested by KVM).  This avoids consuming the memory that must be
> > allocated by kernel and given to the TDX module as metadata (~1/256th of
> 
> allocated by the kernel

Ok.

> 
> > the TDX-usable memory), and also saves the time of initializing the TDX
> > module (and the metadata) when TDX is not used at all.  Initializing the
> > TDX module at runtime on-demand also is more flexible to support TDX
> > module runtime updating in the future (after updating the TDX module, it
> > needs to be initialized again).
> > 
> > Introduce two placeholders tdx_detect() and tdx_init() to detect and
> > initialize the TDX module on demand, with a state machine introduced to
> > orchestrate the entire process (in case of multiple callers).
> > 
> > To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs.  The
> > TDX module is reported as not loaded if either SEAMRR is not enabled, or
> > there are no enough TDX private KeyIDs to create any TD guest.  The TDX
> > module itself requires one global TDX private KeyID to crypto protect
> > its metadata.
> > 
> > And tdx_init() is currently empty.  The TDX module will be initialized
> > in multi-steps defined by the TDX architecture:
> > 
> >    1) Global initialization;
> >    2) Logical-CPU scope initialization;
> >    3) Enumerate the TDX module capabilities and platform configuration;
> >    4) Configure the TDX module about usable memory ranges and global
> >       KeyID information;
> >    5) Package-scope configuration for the global KeyID;
> >    6) Initialize usable memory ranges based on 4).
> > 
> > The TDX module can also be shut down at any time during its lifetime.
> > In case of any error during the initialization process, shut down the
> > module.  It's pointless to leave the module in any intermediate state
> > during the initialization.
> > 
> > SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
> > operation (VMXON has been done), otherwise it generates #UD.  So far
> > only KVM handles VMXON/VMXOFF.  Choose to not handle VMXON/VMXOFF in
> > tdx_detect() and tdx_init() but depend on the caller to guarantee that,
> > since so far KVM is the only user of TDX.  In the long term, more kernel
> > components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
> > module runtime update), so a reference-based approach to do VMXON/VMXOFF
> > is likely needed.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >   arch/x86/include/asm/tdx.h  |   4 +
> >   arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
> >   2 files changed, 226 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 1f29813b1646..c8af2ba6bb8a 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> >   
> >   #ifdef CONFIG_INTEL_TDX_HOST
> >   void tdx_detect_cpu(struct cpuinfo_x86 *c);
> > +int tdx_detect(void);
> > +int tdx_init(void);
> >   #else
> >   static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
> > +static inline int tdx_detect(void) { return -ENODEV; }
> > +static inline int tdx_init(void) { return -ENODEV; }
> >   #endif /* CONFIG_INTEL_TDX_HOST */
> >   
> >   #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index ba2210001ea8..53093d4ad458 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -9,6 +9,8 @@
> >   
> >   #include <linux/types.h>
> >   #include <linux/cpumask.h>
> > +#include <linux/mutex.h>
> > +#include <linux/cpu.h>
> >   #include <asm/msr-index.h>
> >   #include <asm/msr.h>
> >   #include <asm/cpufeature.h>
> > @@ -45,12 +47,33 @@
> >   		((u32)(((_keyid_part) & 0xffffffffull) + 1))
> >   #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
> >   
> > +/*
> > + * TDX module status during initialization
> > + */
> > +enum tdx_module_status_t {
> > +	/* TDX module status is unknown */
> > +	TDX_MODULE_UNKNOWN,
> > +	/* TDX module is not loaded */
> > +	TDX_MODULE_NONE,
> > +	/* TDX module is loaded, but not initialized */
> > +	TDX_MODULE_LOADED,
> > +	/* TDX module is fully initialized */
> > +	TDX_MODULE_INITIALIZED,
> > +	/* TDX module is shutdown due to error during initialization */
> > +	TDX_MODULE_SHUTDOWN,
> > +};
> > +
> 
> May be adding these states when you really need will make
> more sense. Currently this patch only uses SHUTDOWN and
> NONE states. Other state usage is not very clear.

They are all used in tdx_detect() and tdx_init(), no?

> 
> >   /* BIOS must configure SEAMRR registers for all cores consistently */
> >   static u64 seamrr_base, seamrr_mask;
> >   
> >   static u32 tdx_keyid_start;
> >   static u32 tdx_keyid_num;
> >   
> > +static enum tdx_module_status_t tdx_module_status;
> > +
> > +/* Prevent concurrent attempts on TDX detection and initialization */
> > +static DEFINE_MUTEX(tdx_module_lock);
> 
> Any possible concurrent usage models?

tdx_detect() and tdx_init() are called on demand by callers, so it's possible
multiple callers can call into them concurrently.

> 
> > +
> >   static bool __seamrr_enabled(void)
> >   {
> >   	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > @@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
> >   	detect_seam(c);
> >   	detect_tdx_keyids(c);
> >   }
> > +
> > +static bool seamrr_enabled(void)
> > +{
> > +	/*
> > +	 * To detect any BIOS misconfiguration among cores, all logical
> > +	 * cpus must have been brought up at least once.  This is true
> > +	 * unless 'maxcpus' kernel command line is used to limit the
> > +	 * number of cpus to be brought up during boot time.  However
> > +	 * 'maxcpus' is basically an invalid operation mode due to the
> > +	 * MCE broadcast problem, and it should not be used on a TDX
> > +	 * capable machine.  Just do paranoid check here and do not
> 
> a paranoid check

Ok.

> 
> > +	 * report SEAMRR as enabled in this case.
> > +	 */
> > +	if (!cpumask_equal(&cpus_booted_once_mask,
> > +					cpu_present_mask))
> > +		return false;
> > +
> > +	return __seamrr_enabled();
> > +}
> > +
> > +static bool tdx_keyid_sufficient(void)
> > +{
> > +	if (!cpumask_equal(&cpus_booted_once_mask,
> > +					cpu_present_mask))
> > +		return false;
> > +
> > +	/*
> > +	 * TDX requires at least two KeyIDs: one global KeyID to
> > +	 * protect the metadata of the TDX module and one or more
> > +	 * KeyIDs to run TD guests.
> > +	 */
> > +	return tdx_keyid_num >= 2;
> > +}
> > +
> > +static int __tdx_detect(void)
> > +{
> > +	/* The TDX module is not loaded if SEAMRR is disabled */
> > +	if (!seamrr_enabled()) {
> > +		pr_info("SEAMRR not enabled.\n");
> > +		goto no_tdx_module;
> > +	}
> > +
> > +	/*
> > +	 * Also do not report the TDX module as loaded if there's
> > +	 * no enough TDX private KeyIDs to run any TD guests.
> > +	 */
> 
> You are not returning TDX_MODULE_LOADED under any current
> scenarios. So think above comment is not accurate.

This comment is to explain the logic behind of below TDX KeyID check.  I don't
see how is it related to your comments?

This patch is pretty much a placeholder to express the idea of how are
tdx_detect() and tdx_init() going to be implemented.  In below after the
tdx_keyid_sufficient() check, I also have a comment to explain the module hasn't
been detected yet which means there will be code to detect the module here, and
at that time, logically this function will return TDX_MODULE_LOADED.  I don't
see this is hard to understand?

> 
> > +	if (!tdx_keyid_sufficient()) {
> > +		pr_info("Number of TDX private KeyIDs too small: %u.\n",
> > +				tdx_keyid_num);
> > +		goto no_tdx_module;
> > +	}
> > +
> > +	/* Return -ENODEV until the TDX module is detected */
> > +no_tdx_module:
> > +	tdx_module_status = TDX_MODULE_NONE;
> > +	return -ENODEV;
> > +}
> > +
> > +static int init_tdx_module(void)
> > +{
> > +	/*
> > +	 * Return -EFAULT until all steps of TDX module
> > +	 * initialization are done.
> > +	 */
> > +	return -EFAULT;
> > +}
> > +
> > +static void shutdown_tdx_module(void)
> > +{
> > +	/* TODO: Shut down the TDX module */
> > +	tdx_module_status = TDX_MODULE_SHUTDOWN;
> > +}
> > +
> > +static int __tdx_init(void)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * Logical-cpu scope initialization requires calling one SEAMCALL
> > +	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
> > +	 * module also has such requirement.  Further more, configuring
> 
> such a requirement

Thanks.

> 
> > +	 * the key of the global KeyID requires calling one SEAMCALL for
> > +	 * each package.  For simplicity, disable CPU hotplug in the whole
> > +	 * initialization process.
> > +	 *
> > +	 * It's perhaps better to check whether all BIOS-enabled cpus are
> > +	 * online before starting initializing, and return early if not.
> > +	 * But none of 'possible', 'present' and 'online' CPU masks
> > +	 * represents BIOS-enabled cpus.  For example, 'possible' mask is
> > +	 * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> > +	 * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> > +	 * online.
> > +	 */
> > +	cpus_read_lock();
> > +
> > +	ret = init_tdx_module();
> > +
> > +	/*
> > +	 * Shut down the TDX module in case of any error during the
> > +	 * initialization process.  It's meaningless to leave the TDX
> > +	 * module in any middle state of the initialization process.
> > +	 */
> > +	if (ret)
> > +		shutdown_tdx_module();
> > +
> > +	cpus_read_unlock();
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * tdx_detect - Detect whether the TDX module has been loaded
> > + *
> > + * Detect whether the TDX module has been loaded and ready for
> > + * initialization.  Only call this function when all cpus are
> > + * already in VMX operation.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0:	The TDX module has been loaded and ready for
> > + *		initialization.
> > + * * -ENODEV:	The TDX module is not loaded.
> > + * * -EPERM:	CPU is not in VMX operation.
> > + * * -EFAULT:	Other internal fatal errors.
> > + */
> > +int tdx_detect(void)
> 
> Will this function be used separately or always along with
> tdx_init()?

The caller should first use tdx_detect() and then use tdx_init().  If caller
only uses tdx_detect(), then TDX module won't be initialized (unless other
caller does this).  If caller calls tdx_init() before tdx_detect(),  it will get
error.

> 
> > +{
> > +	int ret;
> > +
> > +	mutex_lock(&tdx_module_lock);
> > +
> > +	switch (tdx_module_status) {
> > +	case TDX_MODULE_UNKNOWN:
> > +		ret = __tdx_detect();
> > +		break;
> > +	case TDX_MODULE_NONE:
> > +		ret = -ENODEV;
> > +		break;
> > +	case TDX_MODULE_LOADED:
> > +	case TDX_MODULE_INITIALIZED:
> > +		ret = 0;
> > +		break;
> > +	case TDX_MODULE_SHUTDOWN:
> > +		ret = -EFAULT;
> > +		break;
> > +	default:
> > +		WARN_ON(1);
> > +		ret = -EFAULT;
> > +	}
> > +
> > +	mutex_unlock(&tdx_module_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_detect);
> > +
> > +/**
> > + * tdx_init - Initialize the TDX module
> 
> If it for tdx module initialization, why not call it
> tdx_module_init()? If not, update the description
> appropriately.

Besides do the actual module initialization, it also has a state machine.

But point taken, and I'll try to refine the description.  Thanks.

> 
> > + *
> > + * Initialize the TDX module to make it ready to run TD guests.  This
> > + * function should be called after tdx_detect() returns successful.
> > + * Only call this function when all cpus are online and are in VMX
> > + * operation.  CPU hotplug is temporarily disabled internally.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0:	The TDX module has been successfully initialized.
> > + * * -ENODEV:	The TDX module is not loaded.
> > + * * -EPERM:	The CPU which does SEAMCALL is not in VMX operation.
> > + * * -EFAULT:	Other internal fatal errors.
> > + */
> 
> You return differnt error values just for debug prints or there are
> other uses for it?

Caller can distinguish them and act differently.  Even w/o any purpose, I think
it's better to return different error codes to reflect different error reasons.



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-20  4:37     ` Kai Huang
@ 2022-04-20  5:21       ` Dave Hansen
  2022-04-20 14:30       ` Sathyanarayanan Kuppuswamy
  1 sibling, 0 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-20  5:21 UTC (permalink / raw)
  To: Kai Huang, Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	isaku.yamahata

On 4/19/22 21:37, Kai Huang wrote:
> On Tue, 2022-04-19 at 07:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> The TDX module is essentially a CPU-attested software module running
>>> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
>>> host and certain physical attacks.  The TDX module implements the
>> /s/host/hosts
> I don't quite get.  Could you explain why there are multiple hosts?

This one is an arbitrary language tweak.  This:

	to protect VMs from malicious host and certain physical attacks.

could also be written:

	to protect VMs from malicious host attacks and certain physical
	attacks.

But, it's somewhat more compact to do what was writen.  I agree that the
language is a bit clumsy and could be cleaned up, but just doing
s/host/hosts/ doesn't really improve anything.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-20  4:16     ` Kai Huang
@ 2022-04-20  7:29       ` Sathyanarayanan Kuppuswamy
  2022-04-20 10:39         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-20  7:29 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/19/22 9:16 PM, Kai Huang wrote:
> On Tue, 2022-04-19 at 07:07 -0700, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> SEAMCALL leaf functions use an ABI different from the x86-64 system-v
>>> ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
>>
>> TDCALL is a new term for this patch set. Maybe add some detail about
>> it in ()?.
>>
>>>
> 
> TDCALL implementation is already in tip/tdx.  This series will be rebased to it.
> I don't think we need to explain more about something that is already in the tip
> tree?

Since you have already expanded terms like TD,TDX and SEAM in this patch
set, I thought you wanted to explain TDX terms to make it easy for new 
readers. So to keep it uniform, I have suggested adding some brief 
details about the TDCALL.

But I am fine either way.

> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-20  7:29       ` Sathyanarayanan Kuppuswamy
@ 2022-04-20 10:39         ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-20 10:39 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Wed, 2022-04-20 at 00:29 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/19/22 9:16 PM, Kai Huang wrote:
> > On Tue, 2022-04-19 at 07:07 -0700, Sathyanarayanan Kuppuswamy wrote:
> > > 
> > > On 4/5/22 9:49 PM, Kai Huang wrote:
> > > > SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> > > > ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
> > > 
> > > TDCALL is a new term for this patch set. Maybe add some detail about
> > > it in ()?.
> > > 
> > > > 
> > 
> > TDCALL implementation is already in tip/tdx.  This series will be rebased to it.
> > I don't think we need to explain more about something that is already in the tip
> > tree?
> 
> Since you have already expanded terms like TD,TDX and SEAM in this patch
> set, I thought you wanted to explain TDX terms to make it easy for new 
> readers. So to keep it uniform, I have suggested adding some brief 
> details about the TDCALL.
> 
> 

All right.  I can add one sentence to explain it.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-20  4:37     ` Kai Huang
  2022-04-20  5:21       ` Dave Hansen
@ 2022-04-20 14:30       ` Sathyanarayanan Kuppuswamy
  2022-04-20 22:35         ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-20 14:30 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/19/22 9:37 PM, Kai Huang wrote:
> On Tue, 2022-04-19 at 07:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> The TDX module is essentially a CPU-attested software module running
>>> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
>>> host and certain physical attacks.  The TDX module implements the
>>
>> /s/host/hosts
> 
> I don't quite get.  Could you explain why there are multiple hosts?

Sorry, I misread it. It is correct, so ignore it.

> 
>>

>>> +
>>> +/**
>>> + * tdx_detect - Detect whether the TDX module has been loaded
>>> + *
>>> + * Detect whether the TDX module has been loaded and ready for
>>> + * initialization.  Only call this function when all cpus are
>>> + * already in VMX operation.
>>> + *
>>> + * This function can be called in parallel by multiple callers.
>>> + *
>>> + * Return:
>>> + *
>>> + * * -0:	The TDX module has been loaded and ready for
>>> + *		initialization.
>>> + * * -ENODEV:	The TDX module is not loaded.
>>> + * * -EPERM:	CPU is not in VMX operation.
>>> + * * -EFAULT:	Other internal fatal errors.
>>> + */
>>> +int tdx_detect(void)
>>
>> Will this function be used separately or always along with
>> tdx_init()?
> 
> The caller should first use tdx_detect() and then use tdx_init().  If caller
> only uses tdx_detect(), then TDX module won't be initialized (unless other
> caller does this).  If caller calls tdx_init() before tdx_detect(),  it will get
> error.
> 

I just checked your patch set to understand where you are using
tdx_detect()/tdx_init(). But I did not find any callers. Did I miss it? 
or it is not used in your patch set?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-06  4:49 ` [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory Kai Huang
@ 2022-04-20 20:48   ` Isaku Yamahata
  2022-04-20 22:38     ` Kai Huang
  2022-04-27 22:24   ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Isaku Yamahata @ 2022-04-20 20:48 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata, isaku.yamahata

> Subject: Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory

Nitpick: coveret => convert

Thanks,

On Wed, Apr 06, 2022 at 04:49:22PM +1200,
Kai Huang <kai.huang@intel.com> wrote:

> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums.  Not all memory
> satisfies these requirements.
> 
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR).  During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees.  The list of these
> ranges, along with TDX module information, is available to the kernel by
> querying the TDX module.
> 
> In order to provide crypto protection to TD guests, the TDX architecture
> also needs additional metadata to record things like which TD guest
> "owns" a given page of memory.  This metadata essentially serves as the
> 'struct page' for the TDX module.  The space for this metadata is not
> reserved by the hardware upfront and must be allocated by the kernel
> and given to the TDX module.
> 
> Since this metadata consumes space, the VMM can choose whether or not to
> allocate it for a given area of convertible memory.  If it chooses not
> to, the memory cannot receive TDX protections and can not be used by TDX
> guests as private memory.
> 
> For every memory region that the VMM wants to use as TDX memory, it sets
> up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
> contiguous convertible range and must also have its own physically
> contiguous metadata table, referred to as a Physical Address Metadata
> Table (PAMT), to track status for each page in the TDMR range.
> 
> Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
> support physical RAM areas that don't meet those strict requirements,
> each TDMR permits a number of internal "reserved areas" which can be
> placed over memory holes.  If PAMT metadata is placed within a TDMR it
> must be covered by one of these reserved areas.
> 
> Let's summarize the concepts:
> 
>  CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
>        4K aligned.
> TDMR - Physical address range which is chosen by the kernel to support
>        TDX.  1G granularity and alignment required.  Each TDMR has
>        reserved areas where TDX memory holes and overlapping PAMTs can
>        be put into.
> PAMT - Physically contiguous TDX metadata.  One table for each page size
>        per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
>        PAMT.
> 
> As one step of initializing the TDX module, the memory regions that TDX
> module can use must be configured to the TDX module via an array of
> TDMRs.
> 
> Constructing TDMRs to build the TDX memory consists below steps:
> 
> 1) Create TDMRs to cover all memory regions that TDX module can use;
> 2) Allocate and set up PAMT for each TDMR;
> 3) Set up reserved areas for each TDMR.
> 
> Add a placeholder right after getting TDX module and CMRs information to
> construct TDMRs to do the above steps, as the preparation to configure
> the TDX module.  Always free TDMRs at the end of the initialization (no
> matter successful or not), as TDMRs are only used during the
> initialization.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 47 +++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++++++++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 482e6d858181..ec27350d53c1 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,7 @@
>  #include <linux/cpu.h>
>  #include <linux/smp.h>
>  #include <linux/atomic.h>
> +#include <linux/slab.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/cpufeature.h>
> @@ -594,8 +595,29 @@ static int tdx_get_sysinfo(void)
>  	return sanitize_cmrs(tdx_cmr_array, cmr_num);
>  }
>  
> +static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_num; i++) {
> +		struct tdmr_info *tdmr = tdmr_array[i];
> +
> +		/* kfree() works with NULL */
> +		kfree(tdmr);
> +		tdmr_array[i] = NULL;
> +	}
> +}
> +
> +static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> +{
> +	/* Return -EFAULT until constructing TDMRs is done */
> +	return -EFAULT;
> +}
> +
>  static int init_tdx_module(void)
>  {
> +	struct tdmr_info **tdmr_array;
> +	int tdmr_num;
>  	int ret;
>  
>  	/* TDX module global initialization */
> @@ -613,11 +635,36 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * Prepare enough space to hold pointers of TDMRs (TDMR_INFO).
> +	 * TDX requires TDMR_INFO being 512 aligned.  Each TDMR is
> +	 * allocated individually within construct_tdmrs() to meet
> +	 * this requirement.
> +	 */
> +	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
> +			GFP_KERNEL);
> +	if (!tdmr_array) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Construct TDMRs to build TDX memory */
> +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
> +	if (ret)
> +		goto out_free_tdmrs;
> +
>  	/*
>  	 * Return -EFAULT until all steps of TDX module
>  	 * initialization are done.
>  	 */
>  	ret = -EFAULT;
> +out_free_tdmrs:
> +	/*
> +	 * TDMRs are only used during initializing TDX module.  Always
> +	 * free them no matter the initialization was successful or not.
> +	 */
> +	free_tdmrs(tdmr_array, tdmr_num);
> +	kfree(tdmr_array);
>  out:
>  	return ret;
>  }
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 2f21c45df6ac..05bf9fe6bd00 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -89,6 +89,29 @@ struct tdsysinfo_struct {
>  	};
>  } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
>  
> +struct tdmr_reserved_area {
> +	u64 offset;
> +	u64 size;
> +} __packed;
> +
> +#define TDMR_INFO_ALIGNMENT	512
> +
> +struct tdmr_info {
> +	u64 base;
> +	u64 size;
> +	u64 pamt_1g_base;
> +	u64 pamt_1g_size;
> +	u64 pamt_2m_base;
> +	u64 pamt_2m_size;
> +	u64 pamt_4k_base;
> +	u64 pamt_4k_size;
> +	/*
> +	 * Actual number of reserved areas depends on
> +	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
> +	 */
> +	struct tdmr_reserved_area reserved_areas[0];
> +} __packed __aligned(TDMR_INFO_ALIGNMENT);
> +
>  /*
>   * P-SEAMLDR SEAMCALL leaf function
>   */
> -- 
> 2.35.1
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 11/21] x86/virt/tdx: Choose to use all system RAM as TDX memory
  2022-04-06  4:49 ` [PATCH v3 11/21] x86/virt/tdx: Choose to use " Kai Huang
@ 2022-04-20 20:55   ` Isaku Yamahata
  2022-04-20 22:39     ` Kai Huang
  2022-04-28 15:54   ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Isaku Yamahata @ 2022-04-20 20:55 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata, isaku.yamahata

On Wed, Apr 06, 2022 at 04:49:23PM +1200,
Kai Huang <kai.huang@intel.com> wrote:

> +/*
> + * Helper to loop all e820 RAM entries with low 1MB excluded
> + * in a given e820 table.
> + */
> +#define _e820_for_each_mem(_table, _i, _start, _end)				\
> +	for ((_i) = 0, e820_next_mem((_table), &(_i), &(_start), &(_end));	\
> +		(_start) < (_end);						\
> +		e820_next_mem((_table), &(_i), &(_start), &(_end)))
> +
> +/*
> + * Helper to loop all e820 RAM entries with low 1MB excluded
> + * in kernel modified 'e820_table' to honor 'mem' and 'memmap' kernel
> + * command lines.
> + */
> +#define e820_for_each_mem(_i, _start, _end)	\
> +	_e820_for_each_mem(e820_table, _i, _start, _end)
> +
> +/* Check whether first range is the subrange of the second */
> +static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
> +{
> +	return (r1_start >= r2_start && r1_end <= r2_end) ? true : false;

nitpick:
Just "return (r1_start >= r2_start && r1_end <= r2_end)"
Maybe this is a matter of preference, though.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization
  2022-04-06  4:49 ` [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization Kai Huang
@ 2022-04-20 22:27   ` Sathyanarayanan Kuppuswamy
  2022-04-20 22:37     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-20 22:27 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> Do the TDX module global initialization which requires calling
> TDH.SYS.INIT once on any logical cpu.

IMO, you could add some more background details to this commit log. Like
why you are doing it and what it does?. I know that you already 
explained some background in previous patches. But including brief
details here will help to review the commit without checking the
previous commits.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-20 14:30       ` Sathyanarayanan Kuppuswamy
@ 2022-04-20 22:35         ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-20 22:35 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

> 
> > > > +
> > > > +/**
> > > > + * tdx_detect - Detect whether the TDX module has been loaded
> > > > + *
> > > > + * Detect whether the TDX module has been loaded and ready for
> > > > + * initialization.  Only call this function when all cpus are
> > > > + * already in VMX operation.
> > > > + *
> > > > + * This function can be called in parallel by multiple callers.
> > > > + *
> > > > + * Return:
> > > > + *
> > > > + * * -0:	The TDX module has been loaded and ready for
> > > > + *		initialization.
> > > > + * * -ENODEV:	The TDX module is not loaded.
> > > > + * * -EPERM:	CPU is not in VMX operation.
> > > > + * * -EFAULT:	Other internal fatal errors.
> > > > + */
> > > > +int tdx_detect(void)
> > > 
> > > Will this function be used separately or always along with
> > > tdx_init()?
> > 
> > The caller should first use tdx_detect() and then use tdx_init().  If caller
> > only uses tdx_detect(), then TDX module won't be initialized (unless other
> > caller does this).  If caller calls tdx_init() before tdx_detect(),  it will get
> > error.
> > 
> 
> I just checked your patch set to understand where you are using
> tdx_detect()/tdx_init(). But I did not find any callers. Did I miss it? 
> or it is not used in your patch set?
> 

No you didn't.  They are not called in this series.  KVM series which is under
upstream process by Isaku will call them.  Dave once said w/o caller is fine as
for this particular case people know KVM is going to use them.  In cover letter
I also mentioned KVM support is under development by another series.  Next
version in cover letter, I'll explicitly call out this series doesn't have
caller of them but depends on KVM to call them.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization
  2022-04-20 22:27   ` Sathyanarayanan Kuppuswamy
@ 2022-04-20 22:37     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-20 22:37 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Wed, 2022-04-20 at 15:27 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > Do the TDX module global initialization which requires calling
> > TDH.SYS.INIT once on any logical cpu.
> 
> IMO, you could add some more background details to this commit log. Like
> why you are doing it and what it does?. I know that you already 
> explained some background in previous patches. But including brief
> details here will help to review the commit without checking the
> previous commits.
> 

OK I guess I can add "the first step is global initialization", etc.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-20 20:48   ` Isaku Yamahata
@ 2022-04-20 22:38     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-20 22:38 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata

On Wed, 2022-04-20 at 13:48 -0700, Isaku Yamahata wrote:
> > Subject: Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all
> > system RAM as TDX memory
> 
> Nitpick: coveret => convert
> 
> Thanks,

Thanks!

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 11/21] x86/virt/tdx: Choose to use all system RAM as TDX memory
  2022-04-20 20:55   ` Isaku Yamahata
@ 2022-04-20 22:39     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-20 22:39 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: linux-kernel, kvm, seanjc, pbonzini, dave.hansen, len.brown,
	tony.luck, rafael.j.wysocki, reinette.chatre, dan.j.williams,
	peterz, ak, kirill.shutemov, sathyanarayanan.kuppuswamy,
	isaku.yamahata

On Wed, 2022-04-20 at 13:55 -0700, Isaku Yamahata wrote:
> > +/* Check whether first range is the subrange of the second */
> > +static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
> > +{
> > +	return (r1_start >= r2_start && r1_end <= r2_end) ? true : false;
> 
> nitpick:
> Just "return (r1_start >= r2_start && r1_end <= r2_end)"
> Maybe this is a matter of preference, though.

Will use yours.  Thanks!


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-06  4:49 ` [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
@ 2022-04-23 15:39   ` Sathyanarayanan Kuppuswamy
  2022-04-25 23:41     ` Kai Huang
  2022-04-26 20:59   ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-23 15:39 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime.  After TDX module is shut down, no further SEAMCALL can be
> made on any logical cpu.
> 
> Shut down the TDX module in case of any error happened during the
> initialization process.  It's pointless to leave the TDX module in some
> middle state.
> 
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all

May be adding specification reference will help.

> BIOS-enabled cpus, and the SEMACALL can run concurrently on different
> cpus.  Implement a mechanism to run SEAMCALL concurrently on all online

 From TDX Module spec, sec 13.4.1 titled "Shutdown Initiated by the Host
VMM (as Part of Module Update)",

TDH.SYS.LP.SHUTDOWN is designed to set state variables to block all
SEAMCALLs on the current LP and all SEAMCALL leaf functions except
TDH.SYS.LP.SHUTDOWN on the other LPs.

As per above spec reference, executing TDH.SYS.LP.SHUTDOWN in
one LP prevent all SEAMCALL leaf function on all other LPs. If so,
why execute it on all CPUs?

> cpus.  Logical-cpu scope initialization will use it too.

Concurrent SEAMCALL support seem to be useful for other SEAMCALL
types as well. If you agree, I think it would be better if you move
it out to a separate common patch.

> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>   arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
>   arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
>   2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 674867bccc14..faf8355965a5 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -11,6 +11,8 @@
>   #include <linux/cpumask.h>
>   #include <linux/mutex.h>
>   #include <linux/cpu.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
>   #include <asm/msr-index.h>
>   #include <asm/msr.h>
>   #include <asm/cpufeature.h>
> @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>   	return 0;
>   }
>   
> +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> +struct seamcall_ctx {
> +	u64 fn;
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	atomic_t err;
> +	u64 seamcall_ret;
> +	struct tdx_module_output out;
> +};
> +
> +static void seamcall_smp_call_function(void *data)
> +{
> +	struct seamcall_ctx *sc = data;
> +	int ret;
> +
> +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> +			&sc->seamcall_ret, &sc->out);
> +	if (ret)
> +		atomic_set(&sc->err, ret);
> +}
> +
> +/*
> + * Call the SEAMCALL on all online cpus concurrently.
> + * Return error if SEAMCALL fails on any cpu.
> + */
> +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> +	return atomic_read(&sc->err);
> +}
> +
>   static inline bool p_seamldr_ready(void)
>   {
>   	return !!p_seamldr_info.p_seamldr_ready;
> @@ -437,7 +472,10 @@ static int init_tdx_module(void)
>   
>   static void shutdown_tdx_module(void)
>   {
> -	/* TODO: Shut down the TDX module */
> +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> +	seamcall_on_each_cpu(&sc);

May be check the error and WARN_ON on failure?

> +
>   	tdx_module_status = TDX_MODULE_SHUTDOWN;
>   }
>   
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 6990c93198b3..dcc1f6dfe378 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -35,6 +35,11 @@ struct p_seamldr_info {
>   #define P_SEAMLDR_SEAMCALL_BASE		BIT_ULL(63)
>   #define P_SEAMCALL_SEAMLDR_INFO		(P_SEAMLDR_SEAMCALL_BASE | 0x0)
>   
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN	44
> +
>   struct tdx_module_output;
>   u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>   	       struct tdx_module_output *out);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization
  2022-04-06  4:49 ` [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
@ 2022-04-24  1:27   ` Sathyanarayanan Kuppuswamy
  2022-04-25 23:55     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-24  1:27 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
> BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.

IIUC, this change handles logical CPU initialization part of TDX module
initialization. So why talk about TDH.SYS.CONFIG failure here? Are they
related?

> TDH.SYS.LP.INIT can be called concurrently on all cpus.

IMO, if you move the following paragraph to the beginning, it is easier
to understand "what" and "why" part of this change.
> 
> Following global initialization, do the logical-cpu scope initialization
> by calling TDH.SYS.LP.INIT on all online cpus.  Whether all BIOS-enabled
> cpus are online is not checked here for simplicity.  The caller of
> tdx_init() should guarantee all BIOS-enabled cpus are online.

Include specification reference for TDX module initialization and
TDH.SYS.LP.INIT.

In TDX module spec, section 22.2.35 (TDH.SYS.LP.INIT Leaf), mentions
some environment requirements. I don't see you checking here for it?
Is this already met?



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-06  4:49 ` [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory Kai Huang
@ 2022-04-25  2:58   ` Sathyanarayanan Kuppuswamy
  2022-04-26  0:05     ` Kai Huang
  2022-04-27 22:15   ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-25  2:58 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/5/22 9:49 PM, Kai Huang wrote:
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums.  Not all memory
> satisfies these requirements.
> 
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR).  During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees.  The list of these
> ranges, along with TDX module information, is available to the kernel by
> querying the TDX module via TDH.SYS.INFO SEAMCALL.
> 
> Host kernel can choose whether or not to use all convertible memory
> regions as TDX memory.  Before TDX module is ready to create any TD
> guests, all TDX memory regions that host kernel intends to use must be
> configured to the TDX module, using specific data structures defined by
> TDX architecture.  Constructing those structures requires information of
> both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
> to get this information as preparation to construct those structures.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---

Looks good. Some minor comments.

>   arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
>   arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
>   2 files changed, 192 insertions(+)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ef2718423f0f..482e6d858181 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
>   
>   static struct p_seamldr_info p_seamldr_info;
>   
> +/* Base address of CMR array needs to be 512 bytes aligned. */
> +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> +static int tdx_cmr_num;
> +static struct tdsysinfo_struct tdx_sysinfo;
> +
>   static bool __seamrr_enabled(void)
>   {
>   	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> @@ -468,6 +473,127 @@ static int tdx_module_init_cpus(void)
>   	return seamcall_on_each_cpu(&sc);
>   }
>   
> +static inline bool cmr_valid(struct cmr_info *cmr)
> +{
> +	return !!cmr->size;
> +}
> +
> +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
> +		       const char *name)
> +{
> +	int i;
> +
> +	for (i = 0; i < cmr_num; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +
> +		pr_info("%s : [0x%llx, 0x%llx)\n", name,
> +				cmr->base, cmr->base + cmr->size);
> +	}

I am not sure if it is ok to print this info by default or pr_debug
would be better. I will let maintainers decide about it.

> +}
> +
> +static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)

Since this function only deals with tdx_cmr_array, why pass it
as argument?

> +{
> +	int i, j;
> +
> +	/*
> +	 * Intel TDX module spec, 20.7.3 CMR_INFO:
> +	 *
> +	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> +	 *   array of CMR_INFO entries. The CMRs are sorted from the
> +	 *   lowest base address to the highest base address, and they
> +	 *   are non-overlapping.
> +	 *
> +	 * This implies that BIOS may generate invalid empty entries
> +	 * if total CMRs are less than 32.  Skip them manually.
> +	 */
> +	for (i = 0; i < cmr_num; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +		struct cmr_info *prev_cmr = NULL;

Why not keep declarations together at the top of the function?

> +
> +		/* Skip further invalid CMRs */
> +		if (!cmr_valid(cmr))
> +			break;
> +
> +		if (i > 0)
> +			prev_cmr = &cmr_array[i - 1];
> +
> +		/*
> +		 * It is a TDX firmware bug if CMRs are not
> +		 * in address ascending order.
> +		 */
> +		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
> +					cmr->base)) {
> +			pr_err("Firmware bug: CMRs not in address ascending order.\n");
> +			return -EFAULT;
> +		}

Since above condition is only true for i > 0 case, why not combine them
together if (i > 0) {...}

> +	}
> +
> +	/*
> +	 * Also a sane BIOS should never generate invalid CMR(s) between
> +	 * two valid CMRs.  Sanity check this and simply return error in
> +	 * this case.
> +	 *
> +	 * By reaching here @i is the index of the first invalid CMR (or
> +	 * cmr_num).  Starting with next entry of @i since it has already
> +	 * been checked.
> +	 */
> +	for (j = i + 1; j < cmr_num; j++)
> +		if (cmr_valid(&cmr_array[j])) {
> +			pr_err("Firmware bug: invalid CMR(s) among valid CMRs.\n");
> +			return -EFAULT;
> +		}
> +
> +	/*
> +	 * Trim all tail invalid empty CMRs.  BIOS should generate at
> +	 * least one valid CMR, otherwise it's a TDX firmware bug.
> +	 */
> +	tdx_cmr_num = i;
> +	if (!tdx_cmr_num) {
> +		pr_err("Firmware bug: No valid CMR.\n");
> +		return -EFAULT;
> +	}
> +
> +	/* Print kernel sanitized CMRs */
> +	print_cmrs(tdx_cmr_array, tdx_cmr_num, "Kernel-sanitized-CMR");
> +
> +	return 0;
> +}
> +


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-23 15:39   ` Sathyanarayanan Kuppuswamy
@ 2022-04-25 23:41     ` Kai Huang
  2022-04-26  1:48       ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-25 23:41 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Sat, 2022-04-23 at 08:39 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > TDX supports shutting down the TDX module at any time during its
> > lifetime.  After TDX module is shut down, no further SEAMCALL can be
> > made on any logical cpu.
> > 
> > Shut down the TDX module in case of any error happened during the
> > initialization process.  It's pointless to leave the TDX module in some
> > middle state.
> > 
> > Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> 
> May be adding specification reference will help.

How about adding the reference to the code comment?  Here we just need some fact
description.  Adding reference to the code comment also allows people to find
the relative part in the spec easily when they are looking at the actual code
(i.e. after the code is merged to upstream).  Otherwise people needs to do a git
blame and find the exact commit message for that.
 
> 
> > BIOS-enabled cpus, and the SEMACALL can run concurrently on different
> > cpus.  Implement a mechanism to run SEAMCALL concurrently on all online
> 
>  From TDX Module spec, sec 13.4.1 titled "Shutdown Initiated by the Host
> VMM (as Part of Module Update)",
> 
> TDH.SYS.LP.SHUTDOWN is designed to set state variables to block all
> SEAMCALLs on the current LP and all SEAMCALL leaf functions except
> TDH.SYS.LP.SHUTDOWN on the other LPs.
> 
> As per above spec reference, executing TDH.SYS.LP.SHUTDOWN in
> one LP prevent all SEAMCALL leaf function on all other LPs. If so,
> why execute it on all CPUs?

Prevent all SEAMCALLs on other LPs except TDH.SYS.LP.SHUTDOWN.  The spec defnies
shutting down the TDX module as running this SEAMCALl on all LPs, so why just
run on a single cpu?  What's the benefit?

Also, the spec also mentions for runtime update, "SEAMLDR can check that
TDH.SYS.SHUTDOWN has been executed on all LPs".  Runtime update isn't supported
in this series, but it can leverage the existing code if we run SEAMCALL on all
LPs to shutdown the module as spec suggested.  Why just run on a single cpu?

> 
> > cpus.  Logical-cpu scope initialization will use it too.
> 
> Concurrent SEAMCALL support seem to be useful for other SEAMCALL
> types as well. If you agree, I think it would be better if you move
> it out to a separate common patch.

There are couple of problems of doing that:

- All the functions are static in this tdx.c.  Introducing them separately in
dedicated patch would result in compile warning about those static functions are
not used.
- I have received comments from others I can add those functions when they are
firstly used.  Given those functions is not large, so I prefer this way too.

> 
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >   arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
> >   arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
> >   2 files changed, 44 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 674867bccc14..faf8355965a5 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -11,6 +11,8 @@
> >   #include <linux/cpumask.h>
> >   #include <linux/mutex.h>
> >   #include <linux/cpu.h>
> > +#include <linux/smp.h>
> > +#include <linux/atomic.h>
> >   #include <asm/msr-index.h>
> >   #include <asm/msr.h>
> >   #include <asm/cpufeature.h>
> > @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> >   	return 0;
> >   }
> >   
> > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > +struct seamcall_ctx {
> > +	u64 fn;
> > +	u64 rcx;
> > +	u64 rdx;
> > +	u64 r8;
> > +	u64 r9;
> > +	atomic_t err;
> > +	u64 seamcall_ret;
> > +	struct tdx_module_output out;
> > +};
> > +
> > +static void seamcall_smp_call_function(void *data)
> > +{
> > +	struct seamcall_ctx *sc = data;
> > +	int ret;
> > +
> > +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > +			&sc->seamcall_ret, &sc->out);
> > +	if (ret)
> > +		atomic_set(&sc->err, ret);
> > +}
> > +
> > +/*
> > + * Call the SEAMCALL on all online cpus concurrently.
> > + * Return error if SEAMCALL fails on any cpu.
> > + */
> > +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > +{
> > +	on_each_cpu(seamcall_smp_call_function, sc, true);
> > +	return atomic_read(&sc->err);
> > +}
> > +
> >   static inline bool p_seamldr_ready(void)
> >   {
> >   	return !!p_seamldr_info.p_seamldr_ready;
> > @@ -437,7 +472,10 @@ static int init_tdx_module(void)
> >   
> >   static void shutdown_tdx_module(void)
> >   {
> > -	/* TODO: Shut down the TDX module */
> > +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> > +
> > +	seamcall_on_each_cpu(&sc);
> 
> May be check the error and WARN_ON on failure?

When SEAMCALL fails, the error code will be printed out actually (please see
previous patch), so I thought there's no need to WARN_ON() here (and some other
similar places).  I am not sure the additional WARN_ON() will do any help?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization
  2022-04-24  1:27   ` Sathyanarayanan Kuppuswamy
@ 2022-04-25 23:55     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-25 23:55 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

On Sat, 2022-04-23 at 18:27 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 4/5/22 9:49 PM, Kai Huang wrote:
> > Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
> > BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.
> 
> IIUC, this change handles logical CPU initialization part of TDX module
> initialization. So why talk about TDH.SYS.CONFIG failure here? Are they
> related?

They are a little  bit related but I think I can remove it.  Thanks.

> 
> > TDH.SYS.LP.INIT can be called concurrently on all cpus.
> 
> IMO, if you move the following paragraph to the beginning, it is easier
> to understand "what" and "why" part of this change.

OK.

> > 
> > Following global initialization, do the logical-cpu scope initialization
> > by calling TDH.SYS.LP.INIT on all online cpus.  Whether all BIOS-enabled
> > cpus are online is not checked here for simplicity.  The caller of
> > tdx_init() should guarantee all BIOS-enabled cpus are online.
> 
> Include specification reference for TDX module initialization and
> TDH.SYS.LP.INIT.
> 
> In TDX module spec, section 22.2.35 (TDH.SYS.LP.INIT Leaf), mentions
> some environment requirements. I don't see you checking here for it?
> Is this already met?
> 

Good catch.  I missed it, and I'll look into it.  Thanks.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-25  2:58   ` Sathyanarayanan Kuppuswamy
@ 2022-04-26  0:05     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-26  0:05 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata

> 
> > +}
> > +
> > +static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
> 
> Since this function only deals with tdx_cmr_array, why pass it
> as argument?

I received comments to use cmr_num as argument and pass tdx_cmr_num to
sanitize_cmrs() and finalize it at the end of this function.  In this case I
think it's better to pass tdx_cmr_array as argument.  It also saves some typing
(tdx_cmr_array vs cmr_array) in sanitize_cmrs().

> 
> > +{
> > +	int i, j;
> > +
> > +	/*
> > +	 * Intel TDX module spec, 20.7.3 CMR_INFO:
> > +	 *
> > +	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> > +	 *   array of CMR_INFO entries. The CMRs are sorted from the
> > +	 *   lowest base address to the highest base address, and they
> > +	 *   are non-overlapping.
> > +	 *
> > +	 * This implies that BIOS may generate invalid empty entries
> > +	 * if total CMRs are less than 32.  Skip them manually.
> > +	 */
> > +	for (i = 0; i < cmr_num; i++) {
> > +		struct cmr_info *cmr = &cmr_array[i];
> > +		struct cmr_info *prev_cmr = NULL;
> 
> Why not keep declarations together at the top of the function?

Why? They are only used in this for-loop.

> 
> > +
> > +		/* Skip further invalid CMRs */
> > +		if (!cmr_valid(cmr))
> > +			break;
> > +
> > +		if (i > 0)
> > +			prev_cmr = &cmr_array[i - 1];
> > +
> > +		/*
> > +		 * It is a TDX firmware bug if CMRs are not
> > +		 * in address ascending order.
> > +		 */
> > +		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
> > +					cmr->base)) {
> > +			pr_err("Firmware bug: CMRs not in address ascending order.\n");
> > +			return -EFAULT;
> > +		}
> 
> Since above condition is only true for i > 0 case, why not combine them
> together if (i > 0) {...}

It will make an additional ident for the above if() {} to check prev_cmr and
cmr.  I don't see it is better?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-25 23:41     ` Kai Huang
@ 2022-04-26  1:48       ` Sathyanarayanan Kuppuswamy
  2022-04-26  2:12         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-04-26  1:48 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata



On 4/25/22 4:41 PM, Kai Huang wrote:
> On Sat, 2022-04-23 at 08:39 -0700, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 4/5/22 9:49 PM, Kai Huang wrote:
>>> TDX supports shutting down the TDX module at any time during its
>>> lifetime.  After TDX module is shut down, no further SEAMCALL can be
>>> made on any logical cpu.
>>>
>>> Shut down the TDX module in case of any error happened during the
>>> initialization process.  It's pointless to leave the TDX module in some
>>> middle state.
>>>
>>> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>>
>> May be adding specification reference will help.
> 
> How about adding the reference to the code comment?  Here we just need some fact
> description.  Adding reference to the code comment also allows people to find
> the relative part in the spec easily when they are looking at the actual code
> (i.e. after the code is merged to upstream).  Otherwise people needs to do a git
> blame and find the exact commit message for that.

If it is not a hassle, you can add references both in code and at the
end of the commit log. Adding two more lines to the commit log should
not be difficult.

I think it is fine either way. Your choice.

>   
>>
>>> BIOS-enabled cpus, and the SEMACALL can run concurrently on different
>>> cpus.  Implement a mechanism to run SEAMCALL concurrently on all online
>>
>>   From TDX Module spec, sec 13.4.1 titled "Shutdown Initiated by the Host
>> VMM (as Part of Module Update)",
>>
>> TDH.SYS.LP.SHUTDOWN is designed to set state variables to block all
>> SEAMCALLs on the current LP and all SEAMCALL leaf functions except
>> TDH.SYS.LP.SHUTDOWN on the other LPs.
>>
>> As per above spec reference, executing TDH.SYS.LP.SHUTDOWN in
>> one LP prevent all SEAMCALL leaf function on all other LPs. If so,
>> why execute it on all CPUs?
> 
> Prevent all SEAMCALLs on other LPs except TDH.SYS.LP.SHUTDOWN.  The spec defnies
> shutting down the TDX module as running this SEAMCALl on all LPs, so why just
> run on a single cpu?  What's the benefit?

If executing it in one LP prevents SEAMCALLs on all other LPs, I am
trying to understand why spec recommends running it in all LPs?

But the following explanation answers my query. I recommend making a
note about  it in commit log or comments.

> 
> Also, the spec also mentions for runtime update, "SEAMLDR can check that
> TDH.SYS.SHUTDOWN has been executed on all LPs".  Runtime update isn't supported
> in this series, but it can leverage the existing code if we run SEAMCALL on all
> LPs to shutdown the module as spec suggested.  Why just run on a single cpu?
> 
>>
>>> cpus.  Logical-cpu scope initialization will use it too.
>>
>> Concurrent SEAMCALL support seem to be useful for other SEAMCALL
>> types as well. If you agree, I think it would be better if you move
>> it out to a separate common patch.
> 
> There are couple of problems of doing that:
> 
> - All the functions are static in this tdx.c.  Introducing them separately in
> dedicated patch would result in compile warning about those static functions are
> not used.
> - I have received comments from others I can add those functions when they are
> firstly used.  Given those functions is not large, so I prefer this way too.

Ok

> 
>>
>>>
>>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>>> ---
>>>    arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
>>>    arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
>>>    2 files changed, 44 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> index 674867bccc14..faf8355965a5 100644
>>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -11,6 +11,8 @@
>>>    #include <linux/cpumask.h>
>>>    #include <linux/mutex.h>
>>>    #include <linux/cpu.h>
>>> +#include <linux/smp.h>
>>> +#include <linux/atomic.h>
>>>    #include <asm/msr-index.h>
>>>    #include <asm/msr.h>
>>>    #include <asm/cpufeature.h>
>>> @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>>>    	return 0;
>>>    }
>>>    
>>> +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
>>> +struct seamcall_ctx {
>>> +	u64 fn;
>>> +	u64 rcx;
>>> +	u64 rdx;
>>> +	u64 r8;
>>> +	u64 r9;
>>> +	atomic_t err;
>>> +	u64 seamcall_ret;
>>> +	struct tdx_module_output out;
>>> +};
>>> +
>>> +static void seamcall_smp_call_function(void *data)
>>> +{
>>> +	struct seamcall_ctx *sc = data;
>>> +	int ret;
>>> +
>>> +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
>>> +			&sc->seamcall_ret, &sc->out);
>>> +	if (ret)
>>> +		atomic_set(&sc->err, ret);
>>> +}
>>> +
>>> +/*
>>> + * Call the SEAMCALL on all online cpus concurrently.
>>> + * Return error if SEAMCALL fails on any cpu.
>>> + */
>>> +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
>>> +{
>>> +	on_each_cpu(seamcall_smp_call_function, sc, true);
>>> +	return atomic_read(&sc->err);
>>> +}
>>> +
>>>    static inline bool p_seamldr_ready(void)
>>>    {
>>>    	return !!p_seamldr_info.p_seamldr_ready;
>>> @@ -437,7 +472,10 @@ static int init_tdx_module(void)
>>>    
>>>    static void shutdown_tdx_module(void)
>>>    {
>>> -	/* TODO: Shut down the TDX module */
>>> +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
>>> +
>>> +	seamcall_on_each_cpu(&sc);
>>
>> May be check the error and WARN_ON on failure?
> 
> When SEAMCALL fails, the error code will be printed out actually (please see
> previous patch), so I thought there's no need to WARN_ON() here (and some other
> similar places).  I am not sure the additional WARN_ON() will do any help?

OK. I missed that part.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-26  1:48       ` Sathyanarayanan Kuppuswamy
@ 2022-04-26  2:12         ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-26  2:12 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, linux-kernel, kvm
  Cc: seanjc, pbonzini, dave.hansen, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, isaku.yamahata


> > 
> > Prevent all SEAMCALLs on other LPs except TDH.SYS.LP.SHUTDOWN.  The spec defnies
> > shutting down the TDX module as running this SEAMCALl on all LPs, so why just
> > run on a single cpu?  What's the benefit?
> 
> If executing it in one LP prevents SEAMCALLs on all other LPs, I am
> trying to understand why spec recommends running it in all LPs?

Please see 3.1.2 Intel TDX Module Shutdown and Update

The "shutdown" case requires "Execute On" on "Each LP".

Also, TDH.SYS.LP.SHUTDOWN describe this is shutdown on *current* LP.
 
> 
> But the following explanation answers my query. I recommend making a
> note about  it in commit log or comments.

Is above enough to address your question?



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
                   ` (21 preceding siblings ...)
  2022-04-14 10:19 ` [PATCH v3 00/21] TDX host kernel support Kai Huang
@ 2022-04-26 20:13 ` Dave Hansen
  2022-04-27  1:15   ` Kai Huang
  2022-04-28  1:01   ` Dan Williams
  22 siblings, 2 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 20:13 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> SEAM VMX root operation is designed to host a CPU-attested, software
> module called the 'TDX module' which implements functions to manage
> crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also

"crypto protected"?  What the heck is that?

> designed to host a CPU-attested, software module called the 'Intel
> Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> 
> Host kernel transits to either the P-SEAMLDR or the TDX module via a new

 ^ The

> SEAMCALL instruction.  SEAMCALLs are host-side interface functions
> defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> instruction.  They are similar to a hypercall, except they are made by
> host kernel to the SEAM software modules.

This is still missing some important high-level things, like that the
TDX module is protected from the untrusted VMM.  Heck, it forgets to
mention that the VMM itself is untrusted and the TDX module replaces
things that the VMM usually does.

It would also be nice to mention here how this compares with SEV-SNP.
Where is the TDX module in that design?  Why doesn't SEV need all this code?

> TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
> KeyIDs, which can only be used by software runs in SEAM.  The physical

					    ^ which

> address bits for encoding TDX private KeyID are treated as reserved bits
> when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
> private KeyIDs is configured by BIOS.
> 
> Before being able to manage TD guests, the TDX module must be loaded
> and properly initialized using SEAMCALLs defined by TDX architecture.
> This series assumes both the P-SEAMLDR and the TDX module are loaded by
> BIOS before the kernel boots.
> 
> There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> SEAMCALL to detect P-SEAMLDR.  The success of this SEAMCALL means the
> P-SEAMLDR is loaded.  The P-SEAMLDR information returned by this
> SEAMCALL further tells whether TDX module is loaded.

There's a bit of information missing here.  The kernel might not know
the state of things being loaded.  A previous kernel might have loaded
it and left it in an unknown state.

> The TDX module is initialized in multiple steps:
> 
>         1) Global initialization;
>         2) Logical-CPU scope initialization;
>         3) Enumerate the TDX module capabilities;
>         4) Configure the TDX module about usable memory ranges and
>            global KeyID information;
>         5) Package-scope configuration for the global KeyID;
>         6) Initialize TDX metadata for usable memory ranges based on 4).
> 
> Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> table) logical cpus, otherwise step 4) will fail.  Step 5) requires
> calling SEAMCALL on at least one cpu on all packages.
> 
> TDX module can also be shut down at any time during module's lifetime, by
> calling SEAMCALL on all "BIOS-enabled" logical cpus.
> 
> == Design Considerations ==
> 
> 1. Lazy TDX module initialization on-demand by caller

This doesn't really tell us what "lazy" is or what the alternatives are.

There are basically two ways the TDX module could be loaded.  Either:
  * In early boot
or
  * At runtime just before the first TDX guest is run

This series implements the runtime loading.

> None of the steps in the TDX module initialization process must be done
> during kernel boot.  This series doesn't initialize TDX at boot time, but
> instead, provides two functions to allow caller to detect and initialize
> TDX on demand:
> 
>         if (tdx_detect())
>                 goto no_tdx;
>         if (tdx_init())
>                 goto no_tdx;
> 
> This approach has below pros:
> 
> 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> metadata.  Enabling TDX on demand allows only to consume this memory when
> TDX is truly needed (i.e. when KVM wants to create TD guests).
> 
> 2) Both detecting and initializing the TDX module require calling
> SEAMCALL.  However, SEAMCALL requires CPU being already in VMX operation
> (VMXON has been done).  So far, KVM is the only user of TDX, and it
> already handles VMXON/VMXOFF.  Therefore, letting KVM to initialize TDX
> on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> core-kernel.  Also, in long term, likely a reference based VMXON/VMXOFF
> approach is needed since more kernel components will need to handle
> VMXON/VMXONFF.
> 
> 3) It is more flexible to support "TDX module runtime update" (not in
> this series).  After updating to the new module at runtime, kernel needs
> to go through the initialization process again.  For the new module,
> it's possible the metadata allocated for the old module cannot be reused
> for the new module, and needs to be re-allocated again.
> 
> 2. Kernel policy on TDX memory
> 
> Host kernel is responsible for choosing which memory regions can be used
> as TDX memory, and configuring those memory regions to the TDX module by
> using an array of "TD Memory Regions" (TDMR), which is a data structure
> defined by TDX architecture.


This is putting the cart before the horse.  Don't define the details up
front.

	The TDX architecture allows the VMM to designate specific memory
	as usable for TDX private memory.  This series chooses to
	designate _all_ system RAM as TDX to avoid having to modify the
	page allocator to distinguish TDX and non-TDX-capable memory

... then go on to explain the details.

> The first generation of TDX essentially guarantees that all system RAM
> memory regions (excluding the memory below 1MB) can be used as TDX
> memory.  To avoid having to modify the page allocator to distinguish TDX
> and non-TDX allocation, this series chooses to use all system RAM as TDX
> memory.
> 
> E820 table is used to find all system RAM entries.  Following
> e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> types are treated as TDX memory, and contiguous ranges in the same NUMA
> node are merged together (similar to memblock_add()) before trimming the
> non-page-aligned part.

This e820 cruft is too much detail for a cover letter.  In general, once
you start talking about individual functions, you've gone too far in the
cover letter.

> 3. Memory hotplug
> 
> The first generation of TDX architecturally doesn't support memory
> hotplug.  And the first generation of TDX-capable platforms don't support
> physical memory hotplug.  Since it physically cannot happen, this series
> doesn't add any check in ACPI memory hotplug code path to disable it.
> 
> A special case of memory hotplug is adding NVDIMM as system RAM using
> kmem driver.  However the first generation of TDX-capable platforms
> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> happen either.

What prevents this code from today's code being run on tomorrow's
platforms and breaking these assumptions?

> Another case is admin can use 'memmap' kernel command line to create
> legacy PMEMs and use them as TD guest memory, or theoretically, can use
> kmem driver to add them as system RAM.  To avoid having to change memory
> hotplug code to prevent this from happening, this series always include
> legacy PMEMs when constructing TDMRs so they are also TDX memory.
> 
> 4. CPU hotplug
> 
> The first generation of TDX architecturally doesn't support ACPI CPU
> hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> first generation of TDX-capable platforms don't support ACPI CPU hotplug
> either.  Since this physically cannot happen, this series doesn't add any
> check in ACPI CPU hotplug code path to disable it.
> 
> Also, only TDX module initialization requires all BIOS-enabled cpus are
> online.  After the initialization, any logical cpu can be brought down
> and brought up to online again later.  Therefore this series doesn't
> change logical CPU hotplug either.
> 
> 5. TDX interaction with kexec()
> 
> If TDX is ever enabled and/or used to run any TD guests, the cachelines
> of TDX private memory, including PAMTs, used by TDX module need to be
> flushed before transiting to the new kernel otherwise they may silently
> corrupt the new kernel.  Similar to SME, this series flushes cache in
> stop_this_cpu().

What does this have to do with kexec()?  What's a PAMT?

> The TDX module can be initialized only once during its lifetime.  The
> first generation of TDX doesn't have interface to reset TDX module to

				      ^ an

> uninitialized state so it can be initialized again.
> 
> This implies:
> 
>   - If the old kernel fails to initialize TDX, the new kernel cannot
>     use TDX too unless the new kernel fixes the bug which leads to
>     initialization failure in the old kernel and can resume from where
>     the old kernel stops. This requires certain coordination between
>     the two kernels.

OK, but what does this *MEAN*?

>   - If the old kernel has initialized TDX successfully, the new kernel
>     may be able to use TDX if the two kernels have the exactly same
>     configurations on the TDX module. It further requires the new kernel
>     to reserve the TDX metadata pages (allocated by the old kernel) in
>     its page allocator. It also requires coordination between the two
>     kernels.  Furthermore, if kexec() is done when there are active TD
>     guests running, the new kernel cannot use TDX because it's extremely
>     hard for the old kernel to pass all TDX private pages to the new
>     kernel.
> 
> Given that, this series doesn't support TDX after kexec() (except the
> old kernel doesn't attempt to initialize TDX at all).
> 
> And this series doesn't shut down TDX module but leaves it open during
> kexec().  It is because shutting down TDX module requires CPU being in
> VMX operation but there's no guarantee of this during kexec().  Leaving
> the TDX module open is not the best case, but it is OK since the new
> kernel won't be able to use TDX anyway (therefore TDX module won't run
> at all).

tl;dr: kexec() doesn't work with this code.

Right?

That doesn't seem good.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-06  4:49 ` [PATCH v3 01/21] x86/virt/tdx: Detect SEAM Kai Huang
  2022-04-18 22:29   ` Sathyanarayanan Kuppuswamy
@ 2022-04-26 20:21   ` Dave Hansen
  2022-04-26 23:12     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 20:21 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

> +config INTEL_TDX_HOST
> +	bool "Intel Trust Domain Extensions (TDX) host support"
> +	default n
> +	depends on CPU_SUP_INTEL
> +	depends on X86_64
> +	help
> +	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> +	  host and certain physical attacks.  This option enables necessary TDX
> +	  support in host kernel to run protected VMs.
> +
> +	  If unsure, say N.

Nothing about KVM?

...
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..03f35c75f439
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2022 Intel Corporation.
> + *
> + * Intel Trusted Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt)	"tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/cpumask.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/cpufeature.h>
> +#include <asm/cpufeatures.h>
> +#include <asm/tdx.h>
> +
> +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
> +#define MTRR_CAP_SEAMRR			BIT(15)
> +
> +/* Core-scope Intel SEAMRR base and mask registers. */
> +#define MSR_IA32_SEAMRR_PHYS_BASE	0x00001400
> +#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
> +
> +#define SEAMRR_PHYS_BASE_CONFIGURED	BIT_ULL(3)
> +#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
> +#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
> +
> +#define SEAMRR_ENABLED_BITS	\
> +	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
> +
> +/* BIOS must configure SEAMRR registers for all cores consistently */
> +static u64 seamrr_base, seamrr_mask;
> +
> +static bool __seamrr_enabled(void)
> +{
> +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> +}

But there's no case where seamrr_mask is non-zero and where
_seamrr_enabled().  Why bother checking the SEAMRR_ENABLED_BITS?

> +static void detect_seam_bsp(struct cpuinfo_x86 *c)
> +{
> +	u64 mtrrcap, base, mask;
> +
> +	/* SEAMRR is reported via MTRRcap */
> +	if (!boot_cpu_has(X86_FEATURE_MTRR))
> +		return;
> +
> +	rdmsrl(MSR_MTRRcap, mtrrcap);
> +	if (!(mtrrcap & MTRR_CAP_SEAMRR))
> +		return;
> +
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> +	if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
> +		pr_info("SEAMRR base is not configured by BIOS\n");
> +		return;
> +	}
> +
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> +	if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
> +		pr_info("SEAMRR is not enabled by BIOS\n");
> +		return;
> +	}
> +
> +	seamrr_base = base;
> +	seamrr_mask = mask;
> +}

Comment, please.

	/*
	 * Stash the boot CPU's MSR values so that AP values
	 * can can be checked for consistency.
	 */


> +static void detect_seam_ap(struct cpuinfo_x86 *c)
> +{
> +	u64 base, mask;
> +
> +	/*
> +	 * Don't bother to detect this AP if SEAMRR is not
> +	 * enabled after earlier detections.
> +	 */
> +	if (!__seamrr_enabled())
> +		return;
> +
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> +

This is the place for a comment about why the values have to be equal.

> +	if (base == seamrr_base && mask == seamrr_mask)
> +		return;
> +
> +	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> +	/* Mark SEAMRR as disabled. */
> +	seamrr_base = 0;
> +	seamrr_mask = 0;
> +}
> +
> +static void detect_seam(struct cpuinfo_x86 *c)
> +{
> +	if (c == &boot_cpu_data)
> +		detect_seam_bsp(c);
> +	else
> +		detect_seam_ap(c);
> +}
> +
> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> +{
> +	detect_seam(c);
> +}

The extra function looks a bit silly here now.  Maybe this gets filled
out later, but it's goofy-looking here.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-06  4:49 ` [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function Kai Huang
  2022-04-19 14:07   ` Sathyanarayanan Kuppuswamy
@ 2022-04-26 20:37   ` Dave Hansen
  2022-04-26 23:29     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 20:37 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
> defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
> operation (SEAM VMX non-root) which are isolated from legacy VMX root
> and VMX non-root mode.

I feel like this is too much detail for an opening paragraph.

> A CPU-attested software module (called the 'TDX module') runs in SEAM
> VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
> SEAM VMX root is also used to host another CPU-attested software module
> (called the 'P-SEAMLDR') to load and update the TDX module.
>> Host kernel transits to either the P-SEAMLDR or the TDX module via the
> new SEAMCALL instruction.  SEAMCALL leaf functions are host-side
> interface functions defined by the P-SEAMLDR and the TDX module around
> the new SEAMCALL instruction.  They are similar to a hypercall, except
> they are made by host kernel to the SEAM software.

I think you can get rid of about half of this changelog so farand make
it more clear in the process with this:

	TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).
	This mode runs only the TDX module itself or other code needed
	to load the TDX module.

	The host kernel communicates with SEAM software via a new
	SEAMCALL instruction.  This is conceptually similar to
	a guest->host hypercall, except it is made from the host to SEAM
	software instead.

This is a technical document, but you're writing too technically for my
taste and focusing on the low-level details rather than the high-level
concepts.  What do I care that SEAM is two modes and what their names
are at this juncture?  Are those details necesarry to get me to
understand what a SEAMCALL is or what this patch implements?

> SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
> %rax is used to carry both the SEAMCALL leaf function number (input) and
> the completion status code (output).  Additional GPRs (%rcx, %rdx,
> %r8->%r11) may be further used as both input and output operands in
> individual leaf functions.
> 
> Implement a C function __seamcall()

Your "C function" looks a bit like assembly to me.

> to do SEAMCALL leaf functions using
> the assembly macro used by __tdx_module_call() (the implementation of
> TDCALL leaf functions).  The only exception not covered here is TDENTER
> leaf function which takes all GPRs and XMM0-XMM15 as both input and
> output.  The caller of TDENTER should implement its own logic to call
> TDENTER directly instead of using this function.

I have no idea why this paragraph is here or what it is trying to tell me.

> SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
> root, and it can fail with VMfailInvalid, for instance, when the SEAM
> software module is not loaded.  The C function __seamcall() returns
> TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
> code of SEAMCALLs, to uniquely represent this case.

Again, I'm lost.  Why is this detail here?  I don't even see
TDX_SEAMCALL_VMFAILINVALID in the patch.

> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 1bd688684716..fd577619620e 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,2 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
> +obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
> diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> new file mode 100644
> index 000000000000..327961b2dd5a
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/frame.h>
> +
> +#include "tdxcall.S"
> +
> +/*
> + * __seamcall()  - Host-side interface functions to SEAM software module
> + *		   (the P-SEAMLDR or the TDX module)
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI.  Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
> + * the SEAMCALL.  Additional output operands are saved in @out (if it is
> + * provided by caller).

This needs to say:

	Returns TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails.

> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX                 - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX                 - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __seamcall() function ABI:
> + *
> + * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI)          - Input parameter 1, moved to RCX
> + * @rdx (RDX)          - Input parameter 2, moved to RDX
> + * @r8  (RCX)          - Input parameter 3, moved to R8
> + * @r9  (R8)           - Input parameter 4, moved to R9
> + *
> + * @out (R9)           - struct tdx_module_output pointer
> + *			 stored temporarily in R12 (not
> + *			 used by the P-SEAMLDR or the TDX
> + *			 module). It can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL, or
> + * TDX_SEAMCALL_VMFAILINVALID.
> + */
> +SYM_FUNC_START(__seamcall)
> +	FRAME_BEGIN
> +	TDX_MODULE_CALL host=1
> +	FRAME_END
> +	ret
> +SYM_FUNC_END(__seamcall)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> new file mode 100644
> index 000000000000..9d5b6f554c20
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +#include <linux/types.h>
> +
> +struct tdx_module_output;
> +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +	       struct tdx_module_output *out);
> +
> +#endif


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-06  4:49 ` [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand Kai Huang
  2022-04-19 14:53   ` Sathyanarayanan Kuppuswamy
@ 2022-04-26 20:53   ` Dave Hansen
  2022-04-27  0:43     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 20:53 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> The TDX module is essentially a CPU-attested software module running
> in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> host and certain physical attacks.  The TDX module implements the
> functions to build, tear down and start execution of the protected VMs
> called Trusted Domains (TD).  Before the TDX module can be used to
> create and run TD guests, it must be loaded into the SEAM Range Register
> (SEAMRR) and properly initialized.

The module isn't loaded into a register, right?

It's loaded into a memory area pointed to *by* the register.

>  The TDX module is expected to be
> loaded by BIOS before booting to the kernel, and the kernel is expected
> to detect and initialize it, using the SEAMCALLs defined by TDX
> architecture.

Wait a sec...  So, what was all this gobleygook about TDX module loading
and SEAMRR's if the kernel just has the TDX module *handed* to it
already loaded?

It looks to me like you wrote all of this before the TDX module was
being loaded by the BIOS and neglected to go and update these changelogs.

> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> on-demand approach to initialize TDX until there is a real need (e.g
> when requested by KVM).  This avoids consuming the memory that must be
> allocated by kernel and given to the TDX module as metadata (~1/256th of
> the TDX-usable memory), and also saves the time of initializing the TDX
> module (and the metadata) when TDX is not used at all.  Initializing the
> TDX module at runtime on-demand also is more flexible to support TDX
> module runtime updating in the future (after updating the TDX module, it
> needs to be initialized again).
> 
> Introduce two placeholders tdx_detect() and tdx_init() to detect and
> initialize the TDX module on demand, with a state machine introduced to
> orchestrate the entire process (in case of multiple callers).
> 
> To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs.  The
> TDX module is reported as not loaded if either SEAMRR is not enabled, or
> there are no enough TDX private KeyIDs to create any TD guest.  The TDX
> module itself requires one global TDX private KeyID to crypto protect
> its metadata.

This is stepping over the line into telling me what the code does
instead of why.

> And tdx_init() is currently empty.  The TDX module will be initialized
> in multi-steps defined by the TDX architecture:
> 
>   1) Global initialization;
>   2) Logical-CPU scope initialization;
>   3) Enumerate the TDX module capabilities and platform configuration;
>   4) Configure the TDX module about usable memory ranges and global
>      KeyID information;
>   5) Package-scope configuration for the global KeyID;
>   6) Initialize usable memory ranges based on 4).
> 
> The TDX module can also be shut down at any time during its lifetime.
> In case of any error during the initialization process, shut down the
> module.  It's pointless to leave the module in any intermediate state
> during the initialization.
> 
> SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
> operation (VMXON has been done), otherwise it generates #UD.  So far
> only KVM handles VMXON/VMXOFF.  Choose to not handle VMXON/VMXOFF in
> tdx_detect() and tdx_init() but depend on the caller to guarantee that,
> since so far KVM is the only user of TDX.  In the long term, more kernel
> components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
> module runtime update), so a reference-based approach to do VMXON/VMXOFF
> is likely needed.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/include/asm/tdx.h  |   4 +
>  arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 226 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 1f29813b1646..c8af2ba6bb8a 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  void tdx_detect_cpu(struct cpuinfo_x86 *c);
> +int tdx_detect(void);
> +int tdx_init(void);
>  #else
>  static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
> +static inline int tdx_detect(void) { return -ENODEV; }
> +static inline int tdx_init(void) { return -ENODEV; }
>  #endif /* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ba2210001ea8..53093d4ad458 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -9,6 +9,8 @@
>  
>  #include <linux/types.h>
>  #include <linux/cpumask.h>
> +#include <linux/mutex.h>
> +#include <linux/cpu.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/cpufeature.h>
> @@ -45,12 +47,33 @@
>  		((u32)(((_keyid_part) & 0xffffffffull) + 1))
>  #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
>  
> +/*
> + * TDX module status during initialization
> + */
> +enum tdx_module_status_t {
> +	/* TDX module status is unknown */
> +	TDX_MODULE_UNKNOWN,
> +	/* TDX module is not loaded */
> +	TDX_MODULE_NONE,
> +	/* TDX module is loaded, but not initialized */
> +	TDX_MODULE_LOADED,
> +	/* TDX module is fully initialized */
> +	TDX_MODULE_INITIALIZED,
> +	/* TDX module is shutdown due to error during initialization */
> +	TDX_MODULE_SHUTDOWN,
> +};
> +
>  /* BIOS must configure SEAMRR registers for all cores consistently */
>  static u64 seamrr_base, seamrr_mask;
>  
>  static u32 tdx_keyid_start;
>  static u32 tdx_keyid_num;
>  
> +static enum tdx_module_status_t tdx_module_status;
> +
> +/* Prevent concurrent attempts on TDX detection and initialization */
> +static DEFINE_MUTEX(tdx_module_lock);
> +
>  static bool __seamrr_enabled(void)
>  {
>  	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> @@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
>  	detect_seam(c);
>  	detect_tdx_keyids(c);
>  }
> +
> +static bool seamrr_enabled(void)
> +{
> +	/*
> +	 * To detect any BIOS misconfiguration among cores, all logical
> +	 * cpus must have been brought up at least once.  This is true
> +	 * unless 'maxcpus' kernel command line is used to limit the
> +	 * number of cpus to be brought up during boot time.  However
> +	 * 'maxcpus' is basically an invalid operation mode due to the
> +	 * MCE broadcast problem, and it should not be used on a TDX
> +	 * capable machine.  Just do paranoid check here and do not
> +	 * report SEAMRR as enabled in this case.
> +	 */
> +	if (!cpumask_equal(&cpus_booted_once_mask,
> +					cpu_present_mask))
> +		return false;
> +
> +	return __seamrr_enabled();
> +}
> +
> +static bool tdx_keyid_sufficient(void)
> +{
> +	if (!cpumask_equal(&cpus_booted_once_mask,
> +					cpu_present_mask))
> +		return false;

I'd move this cpumask_equal() to a helper.

> +	/*
> +	 * TDX requires at least two KeyIDs: one global KeyID to
> +	 * protect the metadata of the TDX module and one or more
> +	 * KeyIDs to run TD guests.
> +	 */
> +	return tdx_keyid_num >= 2;
> +}
> +
> +static int __tdx_detect(void)
> +{
> +	/* The TDX module is not loaded if SEAMRR is disabled */
> +	if (!seamrr_enabled()) {
> +		pr_info("SEAMRR not enabled.\n");
> +		goto no_tdx_module;
> +	}

Why even bother with the SEAMRR stuff?  It sounded like you can "ping"
the module with SEAMCALL.  Why not just use that directly?

> +	/*
> +	 * Also do not report the TDX module as loaded if there's
> +	 * no enough TDX private KeyIDs to run any TD guests.
> +	 */
> +	if (!tdx_keyid_sufficient()) {
> +		pr_info("Number of TDX private KeyIDs too small: %u.\n",
> +				tdx_keyid_num);
> +		goto no_tdx_module;
> +	}
> +
> +	/* Return -ENODEV until the TDX module is detected */
> +no_tdx_module:
> +	tdx_module_status = TDX_MODULE_NONE;
> +	return -ENODEV;
> +}
> +
> +static int init_tdx_module(void)
> +{
> +	/*
> +	 * Return -EFAULT until all steps of TDX module
> +	 * initialization are done.
> +	 */
> +	return -EFAULT;
> +}
> +
> +static void shutdown_tdx_module(void)
> +{
> +	/* TODO: Shut down the TDX module */
> +	tdx_module_status = TDX_MODULE_SHUTDOWN;
> +}
> +
> +static int __tdx_init(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * Logical-cpu scope initialization requires calling one SEAMCALL
> +	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
> +	 * module also has such requirement.  Further more, configuring
> +	 * the key of the global KeyID requires calling one SEAMCALL for
> +	 * each package.  For simplicity, disable CPU hotplug in the whole
> +	 * initialization process.
> +	 *
> +	 * It's perhaps better to check whether all BIOS-enabled cpus are
> +	 * online before starting initializing, and return early if not.

But you did some of this cpumask checking above.  Right?

> +	 * But none of 'possible', 'present' and 'online' CPU masks
> +	 * represents BIOS-enabled cpus.  For example, 'possible' mask is
> +	 * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> +	 * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> +	 * online.
> +	 */
> +	cpus_read_lock();
> +
> +	ret = init_tdx_module();
> +
> +	/*
> +	 * Shut down the TDX module in case of any error during the
> +	 * initialization process.  It's meaningless to leave the TDX
> +	 * module in any middle state of the initialization process.
> +	 */
> +	if (ret)
> +		shutdown_tdx_module();
> +
> +	cpus_read_unlock();
> +
> +	return ret;
> +}
> +
> +/**
> + * tdx_detect - Detect whether the TDX module has been loaded
> + *
> + * Detect whether the TDX module has been loaded and ready for
> + * initialization.  Only call this function when all cpus are
> + * already in VMX operation.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0:	The TDX module has been loaded and ready for
> + *		initialization.

"-0", eh?

> + * * -ENODEV:	The TDX module is not loaded.
> + * * -EPERM:	CPU is not in VMX operation.
> + * * -EFAULT:	Other internal fatal errors.
> + */
> +int tdx_detect(void)
> +{
> +	int ret;
> +
> +	mutex_lock(&tdx_module_lock);
> +
> +	switch (tdx_module_status) {
> +	case TDX_MODULE_UNKNOWN:
> +		ret = __tdx_detect();
> +		break;
> +	case TDX_MODULE_NONE:
> +		ret = -ENODEV;
> +		break;
> +	case TDX_MODULE_LOADED:
> +	case TDX_MODULE_INITIALIZED:
> +		ret = 0;
> +		break;
> +	case TDX_MODULE_SHUTDOWN:
> +		ret = -EFAULT;
> +		break;
> +	default:
> +		WARN_ON(1);
> +		ret = -EFAULT;
> +	}
> +
> +	mutex_unlock(&tdx_module_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_detect);
> +
> +/**
> + * tdx_init - Initialize the TDX module
> + *
> + * Initialize the TDX module to make it ready to run TD guests.  This
> + * function should be called after tdx_detect() returns successful.
> + * Only call this function when all cpus are online and are in VMX
> + * operation.  CPU hotplug is temporarily disabled internally.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return:
> + *
> + * * -0:	The TDX module has been successfully initialized.
> + * * -ENODEV:	The TDX module is not loaded.
> + * * -EPERM:	The CPU which does SEAMCALL is not in VMX operation.
> + * * -EFAULT:	Other internal fatal errors.
> + */
> +int tdx_init(void)
> +{
> +	int ret;
> +
> +	mutex_lock(&tdx_module_lock);
> +
> +	switch (tdx_module_status) {
> +	case TDX_MODULE_NONE:
> +		ret = -ENODEV;
> +		break;
> +	case TDX_MODULE_LOADED:
> +		ret = __tdx_init();
> +		break;
> +	case TDX_MODULE_INITIALIZED:
> +		ret = 0;
> +		break;
> +	default:
> +		ret = -EFAULT;
> +		break;
> +	}
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_init);

Why does this need both a tdx_detect() and a tdx_init()?  Shouldn't the
interface from outside just be "get TDX up and running, please?"

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module
  2022-04-06  4:49 ` [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module Kai Huang
@ 2022-04-26 20:56   ` Dave Hansen
  2022-04-27  0:01     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 20:56 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> The P-SEAMLDR (persistent SEAM loader) is the first software module that
> runs in SEAM VMX root, responsible for loading and updating the TDX
> module.  Both the P-SEAMLDR and the TDX module are expected to be loaded
> before host kernel boots.

Why bother with the P-SEAMLDR here at all?  The kernel isn't loading the
TDX module in this series.  Why not just call into the TDX module directly?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-06  4:49 ` [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
  2022-04-23 15:39   ` Sathyanarayanan Kuppuswamy
@ 2022-04-26 20:59   ` Dave Hansen
  2022-04-27  0:06     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 20:59 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime.  After TDX module is shut down, no further SEAMCALL can be
> made on any logical cpu.

Is this strictly true?

I thought SEAMCALLs were used for the P-SEAMLDR too.

> Shut down the TDX module in case of any error happened during the
> initialization process.  It's pointless to leave the TDX module in some
> middle state.
> 
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled cpus, and the SEMACALL can run concurrently on different
> cpus.  Implement a mechanism to run SEAMCALL concurrently on all online
> cpus.  Logical-cpu scope initialization will use it too.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 674867bccc14..faf8355965a5 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -11,6 +11,8 @@
>  #include <linux/cpumask.h>
>  #include <linux/mutex.h>
>  #include <linux/cpu.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/cpufeature.h>
> @@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	return 0;
>  }
>  
> +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> +struct seamcall_ctx {
> +	u64 fn;
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	atomic_t err;
> +	u64 seamcall_ret;
> +	struct tdx_module_output out;
> +};
> +
> +static void seamcall_smp_call_function(void *data)
> +{
> +	struct seamcall_ctx *sc = data;
> +	int ret;
> +
> +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> +			&sc->seamcall_ret, &sc->out);
> +	if (ret)
> +		atomic_set(&sc->err, ret);
> +}
> +
> +/*
> + * Call the SEAMCALL on all online cpus concurrently.
> + * Return error if SEAMCALL fails on any cpu.
> + */
> +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> +	return atomic_read(&sc->err);
> +}

Why bother returning something that's not read?

>  static inline bool p_seamldr_ready(void)
>  {
>  	return !!p_seamldr_info.p_seamldr_ready;
> @@ -437,7 +472,10 @@ static int init_tdx_module(void)
>  
>  static void shutdown_tdx_module(void)
>  {
> -	/* TODO: Shut down the TDX module */
> +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> +	seamcall_on_each_cpu(&sc);
> +
>  	tdx_module_status = TDX_MODULE_SHUTDOWN;
>  }
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 6990c93198b3..dcc1f6dfe378 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -35,6 +35,11 @@ struct p_seamldr_info {
>  #define P_SEAMLDR_SEAMCALL_BASE		BIT_ULL(63)
>  #define P_SEAMCALL_SEAMLDR_INFO		(P_SEAMLDR_SEAMCALL_BASE | 0x0)
>  
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN	44
> +
>  struct tdx_module_output;
>  u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	       struct tdx_module_output *out);


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-26 20:21   ` Dave Hansen
@ 2022-04-26 23:12     ` Kai Huang
  2022-04-26 23:28       ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-26 23:12 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

Hi Dave,

Thanks for review!

On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > +config INTEL_TDX_HOST
> > +	bool "Intel Trust Domain Extensions (TDX) host support"
> > +	default n
> > +	depends on CPU_SUP_INTEL
> > +	depends on X86_64
> > +	help
> > +	  Intel Trust Domain Extensions (TDX) protects guest VMs from
> > malicious
> > +	  host and certain physical attacks.  This option enables necessary
> > TDX
> > +	  support in host kernel to run protected VMs.
> > +
> > +	  If unsure, say N.
> 
> Nothing about KVM?

I'll add KVM into the context. How about below?

"Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  This option enables necessary TDX
support in host kernel to allow KVM to run protected VMs called Trust
Domains (TD)."

> 
> ...
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > new file mode 100644
> > index 000000000000..03f35c75f439
> > --- /dev/null
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright(c) 2022 Intel Corporation.
> > + *
> > + * Intel Trusted Domain Extensions (TDX) support
> > + */
> > +
> > +#define pr_fmt(fmt)	"tdx: " fmt
> > +
> > +#include <linux/types.h>
> > +#include <linux/cpumask.h>
> > +#include <asm/msr-index.h>
> > +#include <asm/msr.h>
> > +#include <asm/cpufeature.h>
> > +#include <asm/cpufeatures.h>
> > +#include <asm/tdx.h>
> > +
> > +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
> > +#define MTRR_CAP_SEAMRR			BIT(15)
> > +
> > +/* Core-scope Intel SEAMRR base and mask registers. */
> > +#define MSR_IA32_SEAMRR_PHYS_BASE	0x00001400
> > +#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
> > +
> > +#define SEAMRR_PHYS_BASE_CONFIGURED	BIT_ULL(3)
> > +#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
> > +#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
> > +
> > +#define SEAMRR_ENABLED_BITS	\
> > +	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
> > +
> > +/* BIOS must configure SEAMRR registers for all cores consistently */
> > +static u64 seamrr_base, seamrr_mask;
> > +
> > +static bool __seamrr_enabled(void)
> > +{
> > +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > +}
> 
> But there's no case where seamrr_mask is non-zero and where
> _seamrr_enabled().  Why bother checking the SEAMRR_ENABLED_BITS?

seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
is 0.  It will also be cleared when BIOS mis-configuration is detected on any
AP.  SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.

> 
> > +static void detect_seam_bsp(struct cpuinfo_x86 *c)
> > +{
> > +	u64 mtrrcap, base, mask;
> > +
> > +	/* SEAMRR is reported via MTRRcap */
> > +	if (!boot_cpu_has(X86_FEATURE_MTRR))
> > +		return;
> > +
> > +	rdmsrl(MSR_MTRRcap, mtrrcap);
> > +	if (!(mtrrcap & MTRR_CAP_SEAMRR))
> > +		return;
> > +
> > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > +	if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
> > +		pr_info("SEAMRR base is not configured by BIOS\n");
> > +		return;
> > +	}
> > +
> > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > +	if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
> > +		pr_info("SEAMRR is not enabled by BIOS\n");
> > +		return;
> > +	}
> > +
> > +	seamrr_base = base;
> > +	seamrr_mask = mask;
> > +}
> 
> Comment, please.
> 
> 	/*
> 	 * Stash the boot CPU's MSR values so that AP values
> 	 * can can be checked for consistency.
> 	 */
> 

Thanks. Will add.

> 
> > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > +{
> > +	u64 base, mask;
> > +
> > +	/*
> > +	 * Don't bother to detect this AP if SEAMRR is not
> > +	 * enabled after earlier detections.
> > +	 */
> > +	if (!__seamrr_enabled())
> > +		return;
> > +
> > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > +
> 
> This is the place for a comment about why the values have to be equal.

I'll add below:

/* BIOS must configure SEAMRR consistently across all cores */

> 
> > +	if (base == seamrr_base && mask == seamrr_mask)
> > +		return;
> > +
> > +	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> > +	/* Mark SEAMRR as disabled. */
> > +	seamrr_base = 0;
> > +	seamrr_mask = 0;
> > +}
> > +
> > +static void detect_seam(struct cpuinfo_x86 *c)
> > +{
> > +	if (c == &boot_cpu_data)
> > +		detect_seam_bsp(c);
> > +	else
> > +		detect_seam_ap(c);
> > +}
> > +
> > +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > +{
> > +	detect_seam(c);
> > +}
> 
> The extra function looks a bit silly here now.  Maybe this gets filled
> out later, but it's goofy-looking here.

Thomas suggested to put all TDX detection related in one function call, so I
added tdx_detect_cpu().  I'll move this to the next patch when detecting TDX
KeyIDs.


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-26 23:12     ` Kai Huang
@ 2022-04-26 23:28       ` Dave Hansen
  2022-04-26 23:49         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-26 23:28 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/26/22 16:12, Kai Huang wrote:
> Hi Dave,
> 
> Thanks for review!
> 
> On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
>>> +config INTEL_TDX_HOST
>>> +	bool "Intel Trust Domain Extensions (TDX) host support"
>>> +	default n
>>> +	depends on CPU_SUP_INTEL
>>> +	depends on X86_64
>>> +	help
>>> +	  Intel Trust Domain Extensions (TDX) protects guest VMs from
>>> malicious
>>> +	  host and certain physical attacks.  This option enables necessary
>>> TDX
>>> +	  support in host kernel to run protected VMs.
>>> +
>>> +	  If unsure, say N.
>>
>> Nothing about KVM?
> 
> I'll add KVM into the context. How about below?
> 
> "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  This option enables necessary TDX
> support in host kernel to allow KVM to run protected VMs called Trust
> Domains (TD)."

What about a dependency?  Isn't this dead code without CONFIG_KVM=y/m?

>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> new file mode 100644
>>> index 000000000000..03f35c75f439
>>> --- /dev/null
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -0,0 +1,102 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Copyright(c) 2022 Intel Corporation.
>>> + *
>>> + * Intel Trusted Domain Extensions (TDX) support
>>> + */
>>> +
>>> +#define pr_fmt(fmt)	"tdx: " fmt
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/cpumask.h>
>>> +#include <asm/msr-index.h>
>>> +#include <asm/msr.h>
>>> +#include <asm/cpufeature.h>
>>> +#include <asm/cpufeatures.h>
>>> +#include <asm/tdx.h>
>>> +
>>> +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
>>> +#define MTRR_CAP_SEAMRR			BIT(15)
>>> +
>>> +/* Core-scope Intel SEAMRR base and mask registers. */
>>> +#define MSR_IA32_SEAMRR_PHYS_BASE	0x00001400
>>> +#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
>>> +
>>> +#define SEAMRR_PHYS_BASE_CONFIGURED	BIT_ULL(3)
>>> +#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
>>> +#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
>>> +
>>> +#define SEAMRR_ENABLED_BITS	\
>>> +	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
>>> +
>>> +/* BIOS must configure SEAMRR registers for all cores consistently */
>>> +static u64 seamrr_base, seamrr_mask;
>>> +
>>> +static bool __seamrr_enabled(void)
>>> +{
>>> +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
>>> +}
>>
>> But there's no case where seamrr_mask is non-zero and where
>> _seamrr_enabled().  Why bother checking the SEAMRR_ENABLED_BITS?
> 
> seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
> is 0.  It will also be cleared when BIOS mis-configuration is detected on any
> AP.  SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.

The point is that this could be:

	return !!seamrr_mask;


>>> +static void detect_seam_ap(struct cpuinfo_x86 *c)
>>> +{
>>> +	u64 base, mask;
>>> +
>>> +	/*
>>> +	 * Don't bother to detect this AP if SEAMRR is not
>>> +	 * enabled after earlier detections.
>>> +	 */
>>> +	if (!__seamrr_enabled())
>>> +		return;
>>> +
>>> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
>>> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
>>> +
>>
>> This is the place for a comment about why the values have to be equal.
> 
> I'll add below:
> 
> /* BIOS must configure SEAMRR consistently across all cores */

What happens if the BIOS doesn't do this?  What actually breaks?  In
other words, do we *NEED* error checking here?

>>> +	if (base == seamrr_base && mask == seamrr_mask)
>>> +		return;
>>> +
>>> +	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
>>> +	/* Mark SEAMRR as disabled. */
>>> +	seamrr_base = 0;
>>> +	seamrr_mask = 0;
>>> +}
>>> +
>>> +static void detect_seam(struct cpuinfo_x86 *c)
>>> +{
>>> +	if (c == &boot_cpu_data)
>>> +		detect_seam_bsp(c);
>>> +	else
>>> +		detect_seam_ap(c);
>>> +}
>>> +
>>> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
>>> +{
>>> +	detect_seam(c);
>>> +}
>>
>> The extra function looks a bit silly here now.  Maybe this gets filled
>> out later, but it's goofy-looking here.
> 
> Thomas suggested to put all TDX detection related in one function call, so I
> added tdx_detect_cpu().  I'll move this to the next patch when detecting TDX
> KeyIDs.

That's fine, or just add a comment or a changelog sentence about this
being filled out later.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
  2022-04-26 20:37   ` Dave Hansen
@ 2022-04-26 23:29     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-26 23:29 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-04-26 at 13:37 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
> > defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
> > operation (SEAM VMX non-root) which are isolated from legacy VMX root
> > and VMX non-root mode.
> 
> I feel like this is too much detail for an opening paragraph.
> 
> > A CPU-attested software module (called the 'TDX module') runs in SEAM
> > VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
> > SEAM VMX root is also used to host another CPU-attested software module
> > (called the 'P-SEAMLDR') to load and update the TDX module.
> > > Host kernel transits to either the P-SEAMLDR or the TDX module via the
> > new SEAMCALL instruction.  SEAMCALL leaf functions are host-side
> > interface functions defined by the P-SEAMLDR and the TDX module around
> > the new SEAMCALL instruction.  They are similar to a hypercall, except
> > they are made by host kernel to the SEAM software.
> 
> I think you can get rid of about half of this changelog so farand make
> it more clear in the process with this:
> 
> 	TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).
> 	This mode runs only the TDX module itself or other code needed
> 	to load the TDX module.
> 
> 	The host kernel communicates with SEAM software via a new
> 	SEAMCALL instruction.  This is conceptually similar to
> 	a guest->host hypercall, except it is made from the host to SEAM
> 	software instead.

Thank you!

> 
> This is a technical document, but you're writing too technically for my
> taste and focusing on the low-level details rather than the high-level
> concepts.  What do I care that SEAM is two modes and what their names
> are at this juncture?  Are those details necesarry to get me to
> understand what a SEAMCALL is or what this patch implements?

Thanks for the point. I'll revisit this series based on this in next version.

> 
> > SEAMCALL leaf functions use an ABI different from the x86-64 system-v
> > ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
> > %rax is used to carry both the SEAMCALL leaf function number (input) and
> > the completion status code (output).  Additional GPRs (%rcx, %rdx,
> > %r8->%r11) may be further used as both input and output operands in
> > individual leaf functions.
> > 
> > Implement a C function __seamcall()
> 
> Your "C function" looks a bit like assembly to me.

Will change to (I saw TDX guest patch used similar way):

	Add a generic interface to do SEAMCALL leaf functions, using the
	assembly macro used by __tdx_module_call().

> 
> > to do SEAMCALL leaf functions using
> > the assembly macro used by __tdx_module_call() (the implementation of
> > TDCALL leaf functions).  The only exception not covered here is TDENTER
> > leaf function which takes all GPRs and XMM0-XMM15 as both input and
> > output.  The caller of TDENTER should implement its own logic to call
> > TDENTER directly instead of using this function.
> 
> I have no idea why this paragraph is here or what it is trying to tell me.

Will get rid of the rest staff.

> 
> > SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
> > root, and it can fail with VMfailInvalid, for instance, when the SEAM
> > software module is not loaded.  The C function __seamcall() returns
> > TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
> > code of SEAMCALLs, to uniquely represent this case.
> 
> Again, I'm lost.  Why is this detail here?  I don't even see
> TDX_SEAMCALL_VMFAILINVALID in the patch.

Will remove.

> 
> > diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> > index 1bd688684716..fd577619620e 100644
> > --- a/arch/x86/virt/vmx/tdx/Makefile
> > +++ b/arch/x86/virt/vmx/tdx/Makefile
> > @@ -1,2 +1,2 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> > -obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
> > +obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
> > diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
> > new file mode 100644
> > index 000000000000..327961b2dd5a
> > --- /dev/null
> > +++ b/arch/x86/virt/vmx/tdx/seamcall.S
> > @@ -0,0 +1,52 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#include <linux/linkage.h>
> > +#include <asm/frame.h>
> > +
> > +#include "tdxcall.S"
> > +
> > +/*
> > + * __seamcall()  - Host-side interface functions to SEAM software module
> > + *		   (the P-SEAMLDR or the TDX module)
> > + *
> > + * Transform function call register arguments into the SEAMCALL register
> > + * ABI.  Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
> > + * the SEAMCALL.  Additional output operands are saved in @out (if it is
> > + * provided by caller).
> 
> This needs to say:
> 
> 	Returns TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails.

OK.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-26 23:28       ` Dave Hansen
@ 2022-04-26 23:49         ` Kai Huang
  2022-04-27  0:22           ` Sean Christopherson
  2022-04-27 14:22           ` Dave Hansen
  0 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-26 23:49 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> On 4/26/22 16:12, Kai Huang wrote:
> > Hi Dave,
> > 
> > Thanks for review!
> > 
> > On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > > > +config INTEL_TDX_HOST
> > > > +	bool "Intel Trust Domain Extensions (TDX) host support"
> > > > +	default n
> > > > +	depends on CPU_SUP_INTEL
> > > > +	depends on X86_64
> > > > +	help
> > > > +	  Intel Trust Domain Extensions (TDX) protects guest VMs from
> > > > malicious
> > > > +	  host and certain physical attacks.  This option enables necessary
> > > > TDX
> > > > +	  support in host kernel to run protected VMs.
> > > > +
> > > > +	  If unsure, say N.
> > > 
> > > Nothing about KVM?
> > 
> > I'll add KVM into the context. How about below?
> > 
> > "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks.  This option enables necessary TDX
> > support in host kernel to allow KVM to run protected VMs called Trust
> > Domains (TD)."
> 
> What about a dependency?  Isn't this dead code without CONFIG_KVM=y/m?

Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM.  But so far KVM is the only
user of TDX, so in practice the code is dead w/o KVM.

What's your opinion?

> 
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > new file mode 100644
> > > > index 000000000000..03f35c75f439
> > > > --- /dev/null
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -0,0 +1,102 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * Copyright(c) 2022 Intel Corporation.
> > > > + *
> > > > + * Intel Trusted Domain Extensions (TDX) support
> > > > + */
> > > > +
> > > > +#define pr_fmt(fmt)	"tdx: " fmt
> > > > +
> > > > +#include <linux/types.h>
> > > > +#include <linux/cpumask.h>
> > > > +#include <asm/msr-index.h>
> > > > +#include <asm/msr.h>
> > > > +#include <asm/cpufeature.h>
> > > > +#include <asm/cpufeatures.h>
> > > > +#include <asm/tdx.h>
> > > > +
> > > > +/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
> > > > +#define MTRR_CAP_SEAMRR			BIT(15)
> > > > +
> > > > +/* Core-scope Intel SEAMRR base and mask registers. */
> > > > +#define MSR_IA32_SEAMRR_PHYS_BASE	0x00001400
> > > > +#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
> > > > +
> > > > +#define SEAMRR_PHYS_BASE_CONFIGURED	BIT_ULL(3)
> > > > +#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
> > > > +#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
> > > > +
> > > > +#define SEAMRR_ENABLED_BITS	\
> > > > +	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
> > > > +
> > > > +/* BIOS must configure SEAMRR registers for all cores consistently */
> > > > +static u64 seamrr_base, seamrr_mask;
> > > > +
> > > > +static bool __seamrr_enabled(void)
> > > > +{
> > > > +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > > > +}
> > > 
> > > But there's no case where seamrr_mask is non-zero and where
> > > _seamrr_enabled().  Why bother checking the SEAMRR_ENABLED_BITS?
> > 
> > seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
> > is 0.  It will also be cleared when BIOS mis-configuration is detected on any
> > AP.  SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.
> 
> The point is that this could be:
> 
> 	return !!seamrr_mask;

The definition of this SEAMRR_MASK MSR defines "ENABLED" and "LOCKED" bits. 
Explicitly checking the two bits, instead of !!seamrr_mask roles out other
incorrect configurations.  For instance, we should not treat SEAMRR being
enabled if we only have "ENABLED" bit set or "LOCKED" bit set.

> 
> 
> > > > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > > > +{
> > > > +	u64 base, mask;
> > > > +
> > > > +	/*
> > > > +	 * Don't bother to detect this AP if SEAMRR is not
> > > > +	 * enabled after earlier detections.
> > > > +	 */
> > > > +	if (!__seamrr_enabled())
> > > > +		return;
> > > > +
> > > > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > > > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > > > +
> > > 
> > > This is the place for a comment about why the values have to be equal.
> > 
> > I'll add below:
> > 
> > /* BIOS must configure SEAMRR consistently across all cores */
> 
> What happens if the BIOS doesn't do this?  What actually breaks?  In
> other words, do we *NEED* error checking here?

AFAICT the spec doesn't explicitly mention what will happen if BIOS doesn't
configure them consistently among cores.  But for safety I think it's better to
detect.

> 
> > > > +	if (base == seamrr_base && mask == seamrr_mask)
> > > > +		return;
> > > > +
> > > > +	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
> > > > +	/* Mark SEAMRR as disabled. */
> > > > +	seamrr_base = 0;
> > > > +	seamrr_mask = 0;
> > > > +}
> > > > +
> > > > +static void detect_seam(struct cpuinfo_x86 *c)
> > > > +{
> > > > +	if (c == &boot_cpu_data)
> > > > +		detect_seam_bsp(c);
> > > > +	else
> > > > +		detect_seam_ap(c);
> > > > +}
> > > > +
> > > > +void tdx_detect_cpu(struct cpuinfo_x86 *c)
> > > > +{
> > > > +	detect_seam(c);
> > > > +}
> > > 
> > > The extra function looks a bit silly here now.  Maybe this gets filled
> > > out later, but it's goofy-looking here.
> > 
> > Thomas suggested to put all TDX detection related in one function call, so I
> > added tdx_detect_cpu().  I'll move this to the next patch when detecting TDX
> > KeyIDs.
> 
> That's fine, or just add a comment or a changelog sentence about this
> being filled out later.

There's already one sentence in the changelog:

"......Add a function to detect all TDX preliminaries (SEAMRR, TDX private
KeyIDs) for a given cpu when it is brought up.  As the first step, detect the
validity of SEAMRR."

Does this look good to you?


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module
  2022-04-26 20:56   ` Dave Hansen
@ 2022-04-27  0:01     ` Kai Huang
  2022-04-27 14:24       ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-27  0:01 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-04-26 at 13:56 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > The P-SEAMLDR (persistent SEAM loader) is the first software module that
> > runs in SEAM VMX root, responsible for loading and updating the TDX
> > module.  Both the P-SEAMLDR and the TDX module are expected to be loaded
> > before host kernel boots.
> 
> Why bother with the P-SEAMLDR here at all?  The kernel isn't loading the
> TDX module in this series.  Why not just call into the TDX module directly?

It's not absolutely needed in this series.  I choose to detect P-SEAMLDR because
detecting it can also detect the TDX module, and eventually we will need to
support P-SEAMLDR because the TDX module runtime update uses P-SEAMLDR's
SEAMCALL to do that.

Also, even for this series, detecting the P-SEAMLDR allows us to provide the P-
SEAMLDR information to user at a basic level in dmesg:

[..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209,
build_num 160, major 1, minor 0

This may be useful to users, but it's not a hard requirement for this series.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-26 20:59   ` Dave Hansen
@ 2022-04-27  0:06     ` Kai Huang
  2022-05-18 16:19       ` Sagi Shahar
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-27  0:06 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-04-26 at 13:59 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > TDX supports shutting down the TDX module at any time during its
> > lifetime.  After TDX module is shut down, no further SEAMCALL can be
> > made on any logical cpu.
> 
> Is this strictly true?
> 
> I thought SEAMCALLs were used for the P-SEAMLDR too.

Sorry will change to no TDX module SEAMCALL can be made on any logical cpu.

[...]

> >  
> > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > +struct seamcall_ctx {
> > +	u64 fn;
> > +	u64 rcx;
> > +	u64 rdx;
> > +	u64 r8;
> > +	u64 r9;
> > +	atomic_t err;
> > +	u64 seamcall_ret;
> > +	struct tdx_module_output out;
> > +};
> > +
> > +static void seamcall_smp_call_function(void *data)
> > +{
> > +	struct seamcall_ctx *sc = data;
> > +	int ret;
> > +
> > +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > +			&sc->seamcall_ret, &sc->out);
> > +	if (ret)
> > +		atomic_set(&sc->err, ret);
> > +}
> > +
> > +/*
> > + * Call the SEAMCALL on all online cpus concurrently.
> > + * Return error if SEAMCALL fails on any cpu.
> > + */
> > +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > +{
> > +	on_each_cpu(seamcall_smp_call_function, sc, true);
> > +	return atomic_read(&sc->err);
> > +}
> 
> Why bother returning something that's not read?

It's not needed.  I'll make it void.

Caller can check seamcall_ctx::err directly if they want to know whether any
error happened.



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-26 23:49         ` Kai Huang
@ 2022-04-27  0:22           ` Sean Christopherson
  2022-04-27  0:44             ` Kai Huang
  2022-04-27 14:22           ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Sean Christopherson @ 2022-04-27  0:22 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, linux-kernel, kvm, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, Apr 27, 2022, Kai Huang wrote:
> On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> > On 4/26/22 16:12, Kai Huang wrote:
> > > Hi Dave,
> > > 
> > > Thanks for review!
> > > 
> > > On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > > > > +config INTEL_TDX_HOST
> > > > > +	bool "Intel Trust Domain Extensions (TDX) host support"
> > > > > +	default n
> > > > > +	depends on CPU_SUP_INTEL
> > > > > +	depends on X86_64
> > > > > +	help
> > > > > +	  Intel Trust Domain Extensions (TDX) protects guest VMs from
> > > > > malicious
> > > > > +	  host and certain physical attacks.  This option enables necessary
> > > > > TDX
> > > > > +	  support in host kernel to run protected VMs.
> > > > > +
> > > > > +	  If unsure, say N.
> > > > 
> > > > Nothing about KVM?
> > > 
> > > I'll add KVM into the context. How about below?
> > > 
> > > "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > host and certain physical attacks.  This option enables necessary TDX
> > > support in host kernel to allow KVM to run protected VMs called Trust
> > > Domains (TD)."
> > 
> > What about a dependency?  Isn't this dead code without CONFIG_KVM=y/m?
> 
> Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM.  But so far KVM is the only
> user of TDX, so in practice the code is dead w/o KVM.
> 
> What's your opinion?

Take a dependency on CONFIG_KVM_INTEL, there's already precedence for this specific
case of a feature that can't possibly have an in-kernel user.  See
arch/x86/kernel/cpu/feat_ctl.c, which in the (very) unlikely event IA32_FEATURE_CONTROL
is left unlocked by BIOS, will deliberately disable VMX if CONFIG_KVM_INTEL=n.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-26 20:53   ` Dave Hansen
@ 2022-04-27  0:43     ` Kai Huang
  2022-04-27 14:49       ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-27  0:43 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-04-26 at 13:53 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > The TDX module is essentially a CPU-attested software module running
> > in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
> > host and certain physical attacks.  The TDX module implements the
> > functions to build, tear down and start execution of the protected VMs
> > called Trusted Domains (TD).  Before the TDX module can be used to
> > create and run TD guests, it must be loaded into the SEAM Range Register
> > (SEAMRR) and properly initialized.
> 
> The module isn't loaded into a register, right?
> 
> It's loaded into a memory area pointed to *by* the register.

Yes.  Should be below:

"..., it must be loaded into the memory area pointed to by the SEAM Ranger
Register (SEAMRR) ...".

> 
> >  The TDX module is expected to be
> > loaded by BIOS before booting to the kernel, and the kernel is expected
> > to detect and initialize it, using the SEAMCALLs defined by TDX
> > architecture.
> 
> Wait a sec...  So, what was all this gobleygook about TDX module loading
> and SEAMRR's if the kernel just has the TDX module *handed* to it
> already loaded?
> 
> It looks to me like you wrote all of this before the TDX module was
> being loaded by the BIOS and neglected to go and update these changelogs.

Those were written on purpose after we changed to loading the TDX module in the
BIOS.  In the code, I checks seamrr_enabled() as the first step to detect the
TDX module, so I thought it would be better to talk a little bit about "the TDX
module needs to be loaded to SEAMRR" thing.

> 
> > The TDX module can be initialized only once in its lifetime.  Instead
> > of always initializing it at boot time, this implementation chooses an
> > on-demand approach to initialize TDX until there is a real need (e.g
> > when requested by KVM).  This avoids consuming the memory that must be
> > allocated by kernel and given to the TDX module as metadata (~1/256th of
> > the TDX-usable memory), and also saves the time of initializing the TDX
> > module (and the metadata) when TDX is not used at all.  Initializing the
> > TDX module at runtime on-demand also is more flexible to support TDX
> > module runtime updating in the future (after updating the TDX module, it
> > needs to be initialized again).
> > 
> > Introduce two placeholders tdx_detect() and tdx_init() to detect and
> > initialize the TDX module on demand, with a state machine introduced to
> > orchestrate the entire process (in case of multiple callers).
> > 
> > To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs.  The
> > TDX module is reported as not loaded if either SEAMRR is not enabled, or
> > there are no enough TDX private KeyIDs to create any TD guest.  The TDX
> > module itself requires one global TDX private KeyID to crypto protect
> > its metadata.
> 
> This is stepping over the line into telling me what the code does
> instead of why.

Will remove.

[...]

> > +
> > +static bool seamrr_enabled(void)
> > +{
> > +	/*
> > +	 * To detect any BIOS misconfiguration among cores, all logical
> > +	 * cpus must have been brought up at least once.  This is true
> > +	 * unless 'maxcpus' kernel command line is used to limit the
> > +	 * number of cpus to be brought up during boot time.  However
> > +	 * 'maxcpus' is basically an invalid operation mode due to the
> > +	 * MCE broadcast problem, and it should not be used on a TDX
> > +	 * capable machine.  Just do paranoid check here and do not
> > +	 * report SEAMRR as enabled in this case.
> > +	 */
> > +	if (!cpumask_equal(&cpus_booted_once_mask,
> > +					cpu_present_mask))
> > +		return false;
> > +
> > +	return __seamrr_enabled();
> > +}
> > +
> > +static bool tdx_keyid_sufficient(void)
> > +{
> > +	if (!cpumask_equal(&cpus_booted_once_mask,
> > +					cpu_present_mask))
> > +		return false;
> 
> I'd move this cpumask_equal() to a helper.

Sorry to double confirm, do you want something like:

static bool tdx_detected_on_all_cpus(void)
{
	/*
	 * To detect any BIOS misconfiguration among cores, all logical
	 * cpus must have been brought up at least once.  This is true
	 * unless 'maxcpus' kernel command line is used to limit the
	 * number of cpus to be brought up during boot time.  However
	 * 'maxcpus' is basically an invalid operation mode due to the
	 * MCE broadcast problem, and it should not be used on a TDX
	 * capable machine.  Just do paranoid check here and do not
	 * report SEAMRR as enabled in this case.
	 */
	return cpumask_equal(&cpus_booted_once_mask, cpu_present_mask);
}

static bool seamrr_enabled(void)
{
	if (!tdx_detected_on_all_cpus())
		return false;

	return __seamrr_enabled();
}

static bool tdx_keyid_sufficient()
{
	if (!tdx_detected_on_all_cpus())
		return false;

	...
}

> 
> > +	/*
> > +	 * TDX requires at least two KeyIDs: one global KeyID to
> > +	 * protect the metadata of the TDX module and one or more
> > +	 * KeyIDs to run TD guests.
> > +	 */
> > +	return tdx_keyid_num >= 2;
> > +}
> > +
> > +static int __tdx_detect(void)
> > +{
> > +	/* The TDX module is not loaded if SEAMRR is disabled */
> > +	if (!seamrr_enabled()) {
> > +		pr_info("SEAMRR not enabled.\n");
> > +		goto no_tdx_module;
> > +	}
> 
> Why even bother with the SEAMRR stuff?  It sounded like you can "ping"
> the module with SEAMCALL.  Why not just use that directly?

SEAMCALL will cause #GP if SEAMRR is not enabled.  We should check whether
SEAMRR is enabled before making SEAMCALL.

> 
> > +	/*
> > +	 * Also do not report the TDX module as loaded if there's
> > +	 * no enough TDX private KeyIDs to run any TD guests.
> > +	 */
> > +	if (!tdx_keyid_sufficient()) {
> > +		pr_info("Number of TDX private KeyIDs too small: %u.\n",
> > +				tdx_keyid_num);
> > +		goto no_tdx_module;
> > +	}
> > +
> > +	/* Return -ENODEV until the TDX module is detected */
> > +no_tdx_module:
> > +	tdx_module_status = TDX_MODULE_NONE;
> > +	return -ENODEV;
> > +}
> > +
> > +static int init_tdx_module(void)
> > +{
> > +	/*
> > +	 * Return -EFAULT until all steps of TDX module
> > +	 * initialization are done.
> > +	 */
> > +	return -EFAULT;
> > +}
> > +
> > +static void shutdown_tdx_module(void)
> > +{
> > +	/* TODO: Shut down the TDX module */
> > +	tdx_module_status = TDX_MODULE_SHUTDOWN;
> > +}
> > +
> > +static int __tdx_init(void)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * Logical-cpu scope initialization requires calling one SEAMCALL
> > +	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
> > +	 * module also has such requirement.  Further more, configuring
> > +	 * the key of the global KeyID requires calling one SEAMCALL for
> > +	 * each package.  For simplicity, disable CPU hotplug in the whole
> > +	 * initialization process.
> > +	 *
> > +	 * It's perhaps better to check whether all BIOS-enabled cpus are
> > +	 * online before starting initializing, and return early if not.
> 
> But you did some of this cpumask checking above.  Right?

Above check only guarantees SEAMRR/TDX KeyID has been detected on all presnet
cpus.  the 'present' cpumask doesn't equal to all BIOS-enabled CPUs.

> 
> > +	 * But none of 'possible', 'present' and 'online' CPU masks
> > +	 * represents BIOS-enabled cpus.  For example, 'possible' mask is
> > +	 * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
> > +	 * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
> > +	 * online.
> > +	 */
> > +	cpus_read_lock();
> > +
> > +	ret = init_tdx_module();
> > +
> > +	/*
> > +	 * Shut down the TDX module in case of any error during the
> > +	 * initialization process.  It's meaningless to leave the TDX
> > +	 * module in any middle state of the initialization process.
> > +	 */
> > +	if (ret)
> > +		shutdown_tdx_module();
> > +
> > +	cpus_read_unlock();
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * tdx_detect - Detect whether the TDX module has been loaded
> > + *
> > + * Detect whether the TDX module has been loaded and ready for
> > + * initialization.  Only call this function when all cpus are
> > + * already in VMX operation.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0:	The TDX module has been loaded and ready for
> > + *		initialization.
> 
> "-0", eh?

Sorry.  Originally I meant to have below:

	- 0:
	- -ENODEV:
	...

I changed to want to have below:

	0:
	-ENODEV:
	...

But forgot to remove the '-' before 0. I'll remove it.

> 
> > + * * -ENODEV:	The TDX module is not loaded.
> > + * * -EPERM:	CPU is not in VMX operation.
> > + * * -EFAULT:	Other internal fatal errors.
> > + */
> > +int tdx_detect(void)
> > +{
> > +	int ret;
> > +
> > +	mutex_lock(&tdx_module_lock);
> > +
> > +	switch (tdx_module_status) {
> > +	case TDX_MODULE_UNKNOWN:
> > +		ret = __tdx_detect();
> > +		break;
> > +	case TDX_MODULE_NONE:
> > +		ret = -ENODEV;
> > +		break;
> > +	case TDX_MODULE_LOADED:
> > +	case TDX_MODULE_INITIALIZED:
> > +		ret = 0;
> > +		break;
> > +	case TDX_MODULE_SHUTDOWN:
> > +		ret = -EFAULT;
> > +		break;
> > +	default:
> > +		WARN_ON(1);
> > +		ret = -EFAULT;
> > +	}
> > +
> > +	mutex_unlock(&tdx_module_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_detect);
> > +
> > +/**
> > + * tdx_init - Initialize the TDX module
> > + *
> > + * Initialize the TDX module to make it ready to run TD guests.  This
> > + * function should be called after tdx_detect() returns successful.
> > + * Only call this function when all cpus are online and are in VMX
> > + * operation.  CPU hotplug is temporarily disabled internally.
> > + *
> > + * This function can be called in parallel by multiple callers.
> > + *
> > + * Return:
> > + *
> > + * * -0:	The TDX module has been successfully initialized.
> > + * * -ENODEV:	The TDX module is not loaded.
> > + * * -EPERM:	The CPU which does SEAMCALL is not in VMX operation.
> > + * * -EFAULT:	Other internal fatal errors.
> > + */
> > +int tdx_init(void)
> > +{
> > +	int ret;
> > +
> > +	mutex_lock(&tdx_module_lock);
> > +
> > +	switch (tdx_module_status) {
> > +	case TDX_MODULE_NONE:
> > +		ret = -ENODEV;
> > +		break;
> > +	case TDX_MODULE_LOADED:
> > +		ret = __tdx_init();
> > +		break;
> > +	case TDX_MODULE_INITIALIZED:
> > +		ret = 0;
> > +		break;
> > +	default:
> > +		ret = -EFAULT;
> > +		break;
> > +	}
> > +	mutex_unlock(&tdx_module_lock);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_init);
> 
> Why does this need both a tdx_detect() and a tdx_init()?  Shouldn't the
> interface from outside just be "get TDX up and running, please?"

We can have a single tdx_init().  However tdx_init() can be heavy, and having a
separate non-heavy tdx_detect() may be useful if caller wants to separate
"detecting the TDX module" and "initializing the TDX module", i.e. to do
something in the middle.

However tdx_detect() basically only detects P-SEAMLDR.  If we move P-SEAMLDR
detection to tdx_init(), or we git rid of P-SEAMLDR completely, then we don't
need tdx_detect() anymore.  We can expose seamrr_enabled() and TDX KeyID
variables or functions so caller can use them to see whether it should do TDX
related staff and then call tdx_init().


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-27  0:22           ` Sean Christopherson
@ 2022-04-27  0:44             ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-27  0:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, linux-kernel, kvm, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 00:22 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Kai Huang wrote:
> > On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> > > On 4/26/22 16:12, Kai Huang wrote:
> > > > Hi Dave,
> > > > 
> > > > Thanks for review!
> > > > 
> > > > On Tue, 2022-04-26 at 13:21 -0700, Dave Hansen wrote:
> > > > > > +config INTEL_TDX_HOST
> > > > > > +	bool "Intel Trust Domain Extensions (TDX) host support"
> > > > > > +	default n
> > > > > > +	depends on CPU_SUP_INTEL
> > > > > > +	depends on X86_64
> > > > > > +	help
> > > > > > +	  Intel Trust Domain Extensions (TDX) protects guest VMs from
> > > > > > malicious
> > > > > > +	  host and certain physical attacks.  This option enables necessary
> > > > > > TDX
> > > > > > +	  support in host kernel to run protected VMs.
> > > > > > +
> > > > > > +	  If unsure, say N.
> > > > > 
> > > > > Nothing about KVM?
> > > > 
> > > > I'll add KVM into the context. How about below?
> > > > 
> > > > "Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > > host and certain physical attacks.  This option enables necessary TDX
> > > > support in host kernel to allow KVM to run protected VMs called Trust
> > > > Domains (TD)."
> > > 
> > > What about a dependency?  Isn't this dead code without CONFIG_KVM=y/m?
> > 
> > Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> > make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM.  But so far KVM is the only
> > user of TDX, so in practice the code is dead w/o KVM.
> > 
> > What's your opinion?
> 
> Take a dependency on CONFIG_KVM_INTEL, there's already precedence for this specific
> case of a feature that can't possibly have an in-kernel user.  See
> arch/x86/kernel/cpu/feat_ctl.c, which in the (very) unlikely event IA32_FEATURE_CONTROL
> is left unlocked by BIOS, will deliberately disable VMX if CONFIG_KVM_INTEL=n.

Thanks.  Fine to me.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-26 20:13 ` Dave Hansen
@ 2022-04-27  1:15   ` Kai Huang
  2022-04-27 21:59     ` Dave Hansen
  2022-04-28  1:01   ` Dan Williams
  1 sibling, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-27  1:15 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > SEAM VMX root operation is designed to host a CPU-attested, software
> > module called the 'TDX module' which implements functions to manage
> > crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
> 
> "crypto protected"?  What the heck is that?

How about "crypto-protected"?  I googled and it seems it is used by someone
else.

> 
> > designed to host a CPU-attested, software module called the 'Intel
> > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> > 
> > Host kernel transits to either the P-SEAMLDR or the TDX module via a new
> 
>  ^ The

Thanks.

> 
> > SEAMCALL instruction.  SEAMCALLs are host-side interface functions
> > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> > instruction.  They are similar to a hypercall, except they are made by
> > host kernel to the SEAM software modules.
> 
> This is still missing some important high-level things, like that the
> TDX module is protected from the untrusted VMM.  Heck, it forgets to
> mention that the VMM itself is untrusted and the TDX module replaces
> things that the VMM usually does.
> 
> It would also be nice to mention here how this compares with SEV-SNP.
> Where is the TDX module in that design?  Why doesn't SEV need all this code?
> 
> > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> > protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
> > KeyIDs, which can only be used by software runs in SEAM.  The physical
> 
> 					    ^ which

Thanks.

> 
> > address bits for encoding TDX private KeyID are treated as reserved bits
> > when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
> > private KeyIDs is configured by BIOS.
> > 
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized using SEAMCALLs defined by TDX architecture.
> > This series assumes both the P-SEAMLDR and the TDX module are loaded by
> > BIOS before the kernel boots.
> > 
> > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> > SEAMCALL to detect P-SEAMLDR.  The success of this SEAMCALL means the
> > P-SEAMLDR is loaded.  The P-SEAMLDR information returned by this
> > SEAMCALL further tells whether TDX module is loaded.
> 
> There's a bit of information missing here.  The kernel might not know
> the state of things being loaded.  A previous kernel might have loaded
> it and left it in an unknown state.
> 
> > The TDX module is initialized in multiple steps:
> > 
> >         1) Global initialization;
> >         2) Logical-CPU scope initialization;
> >         3) Enumerate the TDX module capabilities;
> >         4) Configure the TDX module about usable memory ranges and
> >            global KeyID information;
> >         5) Package-scope configuration for the global KeyID;
> >         6) Initialize TDX metadata for usable memory ranges based on 4).
> > 
> > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> > table) logical cpus, otherwise step 4) will fail.  Step 5) requires
> > calling SEAMCALL on at least one cpu on all packages.
> > 
> > TDX module can also be shut down at any time during module's lifetime, by
> > calling SEAMCALL on all "BIOS-enabled" logical cpus.
> > 
> > == Design Considerations ==
> > 
> > 1. Lazy TDX module initialization on-demand by caller
> 
> This doesn't really tell us what "lazy" is or what the alternatives are.
> 
> There are basically two ways the TDX module could be loaded.  Either:
>   * In early boot
> or
>   * At runtime just before the first TDX guest is run
> 
> This series implements the runtime loading.

OK will do.

> 
> > None of the steps in the TDX module initialization process must be done
> > during kernel boot.  This series doesn't initialize TDX at boot time, but
> > instead, provides two functions to allow caller to detect and initialize
> > TDX on demand:
> > 
> >         if (tdx_detect())
> >                 goto no_tdx;
> >         if (tdx_init())
> >                 goto no_tdx;
> > 
> > This approach has below pros:
> > 
> > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> > metadata.  Enabling TDX on demand allows only to consume this memory when
> > TDX is truly needed (i.e. when KVM wants to create TD guests).
> > 
> > 2) Both detecting and initializing the TDX module require calling
> > SEAMCALL.  However, SEAMCALL requires CPU being already in VMX operation
> > (VMXON has been done).  So far, KVM is the only user of TDX, and it
> > already handles VMXON/VMXOFF.  Therefore, letting KVM to initialize TDX
> > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> > core-kernel.  Also, in long term, likely a reference based VMXON/VMXOFF
> > approach is needed since more kernel components will need to handle
> > VMXON/VMXONFF.
> > 
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series).  After updating to the new module at runtime, kernel needs
> > to go through the initialization process again.  For the new module,
> > it's possible the metadata allocated for the old module cannot be reused
> > for the new module, and needs to be re-allocated again.
> > 
> > 2. Kernel policy on TDX memory
> > 
> > Host kernel is responsible for choosing which memory regions can be used
> > as TDX memory, and configuring those memory regions to the TDX module by
> > using an array of "TD Memory Regions" (TDMR), which is a data structure
> > defined by TDX architecture.
> 
> 
> This is putting the cart before the horse.  Don't define the details up
> front.
> 
> 	The TDX architecture allows the VMM to designate specific memory
> 	as usable for TDX private memory.  This series chooses to
> 	designate _all_ system RAM as TDX to avoid having to modify the
> 	page allocator to distinguish TDX and non-TDX-capable memory
> 
> ... then go on to explain the details.

Thanks.  Will update.

> 
> > The first generation of TDX essentially guarantees that all system RAM
> > memory regions (excluding the memory below 1MB) can be used as TDX
> > memory.  To avoid having to modify the page allocator to distinguish TDX
> > and non-TDX allocation, this series chooses to use all system RAM as TDX
> > memory.
> > 
> > E820 table is used to find all system RAM entries.  Following
> > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> > types are treated as TDX memory, and contiguous ranges in the same NUMA
> > node are merged together (similar to memblock_add()) before trimming the
> > non-page-aligned part.
> 
> This e820 cruft is too much detail for a cover letter.  In general, once
> you start talking about individual functions, you've gone too far in the
> cover letter.

Will remove.

> 
> > 3. Memory hotplug
> > 
> > The first generation of TDX architecturally doesn't support memory
> > hotplug.  And the first generation of TDX-capable platforms don't support
> > physical memory hotplug.  Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> > 
> > A special case of memory hotplug is adding NVDIMM as system RAM using
> > kmem driver.  However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
> 
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

I forgot to add below (which is in the documentation patch):

"This can be enhanced when future generation of TDX starts to support ACPI
memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
same platform."

Is this acceptable?

> 
> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM.  To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > 
> > 4. CPU hotplug
> > 
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either.  Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.
> > 
> > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > online.  After the initialization, any logical cpu can be brought down
> > and brought up to online again later.  Therefore this series doesn't
> > change logical CPU hotplug either.
> > 
> > 5. TDX interaction with kexec()
> > 
> > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > of TDX private memory, including PAMTs, used by TDX module need to be
> > flushed before transiting to the new kernel otherwise they may silently
> > corrupt the new kernel.  Similar to SME, this series flushes cache in
> > stop_this_cpu().
> 
> What does this have to do with kexec()?  What's a PAMT?

The point is the dirty cachelines of TDX private memory must be flushed
otherwise they may slightly corrupt the new kexec()-ed kernel.

Will use "TDX metadata" instead of "PAMT".  The former has already been
mentioned above.

> 
> > The TDX module can be initialized only once during its lifetime.  The
> > first generation of TDX doesn't have interface to reset TDX module to
> 
> 				      ^ an

Thanks.

> 
> > uninitialized state so it can be initialized again.
> > 
> > This implies:
> > 
> >   - If the old kernel fails to initialize TDX, the new kernel cannot
> >     use TDX too unless the new kernel fixes the bug which leads to
> >     initialization failure in the old kernel and can resume from where
> >     the old kernel stops. This requires certain coordination between
> >     the two kernels.
> 
> OK, but what does this *MEAN*?

This means we need to extend the information which the old kernel passes to the
new kernel.  But I don't think it's feasible.  I'll refine this kexec() section
to make it more concise next version.

> 
> >   - If the old kernel has initialized TDX successfully, the new kernel
> >     may be able to use TDX if the two kernels have the exactly same
> >     configurations on the TDX module. It further requires the new kernel
> >     to reserve the TDX metadata pages (allocated by the old kernel) in
> >     its page allocator. It also requires coordination between the two
> >     kernels.  Furthermore, if kexec() is done when there are active TD
> >     guests running, the new kernel cannot use TDX because it's extremely
> >     hard for the old kernel to pass all TDX private pages to the new
> >     kernel.
> > 
> > Given that, this series doesn't support TDX after kexec() (except the
> > old kernel doesn't attempt to initialize TDX at all).
> > 
> > And this series doesn't shut down TDX module but leaves it open during
> > kexec().  It is because shutting down TDX module requires CPU being in
> > VMX operation but there's no guarantee of this during kexec().  Leaving
> > the TDX module open is not the best case, but it is OK since the new
> > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > at all).
> 
> tl;dr: kexec() doesn't work with this code.
> 
> Right?
> 
> That doesn't seem good.

It can work in my understanding.  We just need to flush cache before booting to
the new kernel.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-26 23:49         ` Kai Huang
  2022-04-27  0:22           ` Sean Christopherson
@ 2022-04-27 14:22           ` Dave Hansen
  2022-04-27 22:39             ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-27 14:22 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/26/22 16:49, Kai Huang wrote:
> On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
>> What about a dependency?  Isn't this dead code without CONFIG_KVM=y/m?
> 
> Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM.  But so far KVM is the only
> user of TDX, so in practice the code is dead w/o KVM.
> 
> What's your opinion?

You're stuck in some really weird fantasy world.  Sure, we can dream up
more than one user of the TDX module.  But, in the real world, there's
only one.  Plus, code can have multiple dependencies!

	depends on FOO || BAR

This TDX cruft is dead code in today's real-world kernel without KVM.
You should add a dependency.

>>>>> +static bool __seamrr_enabled(void)
>>>>> +{
>>>>> +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
>>>>> +}
>>>>
>>>> But there's no case where seamrr_mask is non-zero and where
>>>> _seamrr_enabled().  Why bother checking the SEAMRR_ENABLED_BITS?
>>>
>>> seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
>>> is 0.  It will also be cleared when BIOS mis-configuration is detected on any
>>> AP.  SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.
>>
>> The point is that this could be:
>>
>> 	return !!seamrr_mask;
> 
> The definition of this SEAMRR_MASK MSR defines "ENABLED" and "LOCKED" bits. 
> Explicitly checking the two bits, instead of !!seamrr_mask roles out other
> incorrect configurations.  For instance, we should not treat SEAMRR being
> enabled if we only have "ENABLED" bit set or "LOCKED" bit set.

You're confusing two different things:
 * The state of the variable
 * The actual correct hardware state

The *VARIABLE* can't be non-zero and also denote that SEAMRR is enabled.
 Does this *CODE* ever set ENABLED or LOCKED without each other?

>>>>> +static void detect_seam_ap(struct cpuinfo_x86 *c)
>>>>> +{
>>>>> +	u64 base, mask;
>>>>> +
>>>>> +	/*
>>>>> +	 * Don't bother to detect this AP if SEAMRR is not
>>>>> +	 * enabled after earlier detections.
>>>>> +	 */
>>>>> +	if (!__seamrr_enabled())
>>>>> +		return;
>>>>> +
>>>>> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
>>>>> +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
>>>>> +
>>>>
>>>> This is the place for a comment about why the values have to be equal.
>>>
>>> I'll add below:
>>>
>>> /* BIOS must configure SEAMRR consistently across all cores */
>>
>> What happens if the BIOS doesn't do this?  What actually breaks?  In
>> other words, do we *NEED* error checking here?
> 
> AFAICT the spec doesn't explicitly mention what will happen if BIOS doesn't
> configure them consistently among cores.  But for safety I think it's better to
> detect.

Safety?  Safety of what?

>>>>> +void tdx_detect_cpu(struct cpuinfo_x86 *c)
>>>>> +{
>>>>> +	detect_seam(c);
>>>>> +}
>>>>
>>>> The extra function looks a bit silly here now.  Maybe this gets filled
>>>> out later, but it's goofy-looking here.
>>>
>>> Thomas suggested to put all TDX detection related in one function call, so I
>>> added tdx_detect_cpu().  I'll move this to the next patch when detecting TDX
>>> KeyIDs.
>>
>> That's fine, or just add a comment or a changelog sentence about this
>> being filled out later.
> 
> There's already one sentence in the changelog:
> 
> "......Add a function to detect all TDX preliminaries (SEAMRR, TDX private
> KeyIDs) for a given cpu when it is brought up.  As the first step, detect the
> validity of SEAMRR."
> 
> Does this look good to you?

No, that doesn't provide enough context.

There are two single-line wrapper functions.  One calls the other.  That
looks entirely silly in this patch.  You need to explain the silliness,
explicitly.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module
  2022-04-27  0:01     ` Kai Huang
@ 2022-04-27 14:24       ` Dave Hansen
  2022-04-27 21:30         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-27 14:24 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/26/22 17:01, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:56 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> The P-SEAMLDR (persistent SEAM loader) is the first software module that
>>> runs in SEAM VMX root, responsible for loading and updating the TDX
>>> module.  Both the P-SEAMLDR and the TDX module are expected to be loaded
>>> before host kernel boots.
>>
>> Why bother with the P-SEAMLDR here at all?  The kernel isn't loading the
>> TDX module in this series.  Why not just call into the TDX module directly?
> 
> It's not absolutely needed in this series.  I choose to detect P-SEAMLDR because
> detecting it can also detect the TDX module, and eventually we will need to
> support P-SEAMLDR because the TDX module runtime update uses P-SEAMLDR's
> SEAMCALL to do that.
> 
> Also, even for this series, detecting the P-SEAMLDR allows us to provide the P-
> SEAMLDR information to user at a basic level in dmesg:
> 
> [..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209,
> build_num 160, major 1, minor 0
> 
> This may be useful to users, but it's not a hard requirement for this series.

We've had a lot of problems in general with this code trying to do too
much at once.  I thought we agreed that this was going to only contain
the minimum code to make TDX functional.  It seems to be creeping to
grow bigger and bigger.

Am I remembering this wrong?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-27  0:43     ` Kai Huang
@ 2022-04-27 14:49       ` Dave Hansen
  2022-04-28  0:00         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-27 14:49 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/26/22 17:43, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:53 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
...
>>> +static bool tdx_keyid_sufficient(void)
>>> +{
>>> +	if (!cpumask_equal(&cpus_booted_once_mask,
>>> +					cpu_present_mask))
>>> +		return false;
>>
>> I'd move this cpumask_equal() to a helper.
> 
> Sorry to double confirm, do you want something like:
> 
> static bool tdx_detected_on_all_cpus(void)
> {
> 	/*
> 	 * To detect any BIOS misconfiguration among cores, all logical
> 	 * cpus must have been brought up at least once.  This is true
> 	 * unless 'maxcpus' kernel command line is used to limit the
> 	 * number of cpus to be brought up during boot time.  However
> 	 * 'maxcpus' is basically an invalid operation mode due to the
> 	 * MCE broadcast problem, and it should not be used on a TDX
> 	 * capable machine.  Just do paranoid check here and do not
> 	 * report SEAMRR as enabled in this case.
> 	 */
> 	return cpumask_equal(&cpus_booted_once_mask, cpu_present_mask);
> }

That's logically the right idea, but I hate the name since the actual
test has nothing to do with TDX being detected.  The comment is also
rather verbose and rambling.

It should be named something like:

	all_cpus_booted()

and with a comment like this:

/*
 * To initialize TDX, the kernel needs to run some code on every
 * present CPU.  Detect cases where present CPUs have not been
 * booted, like when maxcpus=N is used.
 */

> static bool seamrr_enabled(void)
> {
> 	if (!tdx_detected_on_all_cpus())
> 		return false;
> 
> 	return __seamrr_enabled();
> }
> 
> static bool tdx_keyid_sufficient()
> {
> 	if (!tdx_detected_on_all_cpus())
> 		return false;
> 
> 	...
> }

Although, looking at those, it's *still* unclear why you need this.  I
assume it's because some later TDX SEAMCALL will fail if you get this
wrong, and you want to be able to provide a better error message.

*BUT* this code doesn't actually provide halfway reasonable error
messages.  If someone uses maxcpus=99, then this code will report:

	pr_info("SEAMRR not enabled.\n");

right?  That's bonkers.

>>> +	/*
>>> +	 * TDX requires at least two KeyIDs: one global KeyID to
>>> +	 * protect the metadata of the TDX module and one or more
>>> +	 * KeyIDs to run TD guests.
>>> +	 */
>>> +	return tdx_keyid_num >= 2;
>>> +}
>>> +
>>> +static int __tdx_detect(void)
>>> +{
>>> +	/* The TDX module is not loaded if SEAMRR is disabled */
>>> +	if (!seamrr_enabled()) {
>>> +		pr_info("SEAMRR not enabled.\n");
>>> +		goto no_tdx_module;
>>> +	}
>>
>> Why even bother with the SEAMRR stuff?  It sounded like you can "ping"
>> the module with SEAMCALL.  Why not just use that directly?
> 
> SEAMCALL will cause #GP if SEAMRR is not enabled.  We should check whether
> SEAMRR is enabled before making SEAMCALL.

So...  You could actually get rid of all this code.  if SEAMCALL #GP's,
then you say, "Whoops, the firmware didn't load the TDX module
correctly, sorry."

Why is all this code here?  What is it for?

>>> +	/*
>>> +	 * Also do not report the TDX module as loaded if there's
>>> +	 * no enough TDX private KeyIDs to run any TD guests.
>>> +	 */
>>> +	if (!tdx_keyid_sufficient()) {
>>> +		pr_info("Number of TDX private KeyIDs too small: %u.\n",
>>> +				tdx_keyid_num);
>>> +		goto no_tdx_module;
>>> +	}
>>> +
>>> +	/* Return -ENODEV until the TDX module is detected */
>>> +no_tdx_module:
>>> +	tdx_module_status = TDX_MODULE_NONE;
>>> +	return -ENODEV;
>>> +}

Again, if someone uses maxcpus=1234 and we get down here, then it
reports to the user:
	
	Number of TDX private KeyIDs too small: ...

????  When the root of the problem has nothing to do with KeyIDs.

>>> +static int init_tdx_module(void)
>>> +{
>>> +	/*
>>> +	 * Return -EFAULT until all steps of TDX module
>>> +	 * initialization are done.
>>> +	 */
>>> +	return -EFAULT;
>>> +}
>>> +
>>> +static void shutdown_tdx_module(void)
>>> +{
>>> +	/* TODO: Shut down the TDX module */
>>> +	tdx_module_status = TDX_MODULE_SHUTDOWN;
>>> +}
>>> +
>>> +static int __tdx_init(void)
>>> +{
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * Logical-cpu scope initialization requires calling one SEAMCALL
>>> +	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
>>> +	 * module also has such requirement.  Further more, configuring
>>> +	 * the key of the global KeyID requires calling one SEAMCALL for
>>> +	 * each package.  For simplicity, disable CPU hotplug in the whole
>>> +	 * initialization process.
>>> +	 *
>>> +	 * It's perhaps better to check whether all BIOS-enabled cpus are
>>> +	 * online before starting initializing, and return early if not.
>>
>> But you did some of this cpumask checking above.  Right?
> 
> Above check only guarantees SEAMRR/TDX KeyID has been detected on all presnet
> cpus.  the 'present' cpumask doesn't equal to all BIOS-enabled CPUs.

I have no idea what this is saying.  In general, I have no idea what the
comment is saying.  It makes zero sense.  The locking pattern for stuff
like this is:

	cpus_read_lock();

	for_each_online_cpu()
		do_something();

	cpus_read_unlock();

because you need to make sure that you don't miss "do_something()" on a
CPU that comes online during the loop.

But, now that I think about it, all of the checks I've seen so far are
for *booted* CPUs.  While the lock (I assume) would keep new CPUs from
booting, it doesn't do any good really since the "cpus_booted_once_mask"
bits are only set and not cleared.  A CPU doesn't un-become booted once.

Again, we seem to have a long, verbose comment that says very little and
only confuses me.

...
>> Why does this need both a tdx_detect() and a tdx_init()?  Shouldn't the
>> interface from outside just be "get TDX up and running, please?"
> 
> We can have a single tdx_init().  However tdx_init() can be heavy, and having a
> separate non-heavy tdx_detect() may be useful if caller wants to separate
> "detecting the TDX module" and "initializing the TDX module", i.e. to do
> something in the middle.

<Sigh>  So, this "design" went unmentioned, *and* I can't review if the
actual callers of this need the functionality or not because they're not
in this series.

> However tdx_detect() basically only detects P-SEAMLDR.  If we move P-SEAMLDR
> detection to tdx_init(), or we git rid of P-SEAMLDR completely, then we don't
> need tdx_detect() anymore.  We can expose seamrr_enabled() and TDX KeyID
> variables or functions so caller can use them to see whether it should do TDX
> related staff and then call tdx_init().

I don't think you've made a strong case for why P-SEAMLDR detection is
even necessary in this series.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module
  2022-04-27 14:24       ` Dave Hansen
@ 2022-04-27 21:30         ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-27 21:30 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 07:24 -0700, Dave Hansen wrote:
> On 4/26/22 17:01, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:56 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > The P-SEAMLDR (persistent SEAM loader) is the first software module that
> > > > runs in SEAM VMX root, responsible for loading and updating the TDX
> > > > module.  Both the P-SEAMLDR and the TDX module are expected to be loaded
> > > > before host kernel boots.
> > > 
> > > Why bother with the P-SEAMLDR here at all?  The kernel isn't loading the
> > > TDX module in this series.  Why not just call into the TDX module directly?
> > 
> > It's not absolutely needed in this series.  I choose to detect P-SEAMLDR because
> > detecting it can also detect the TDX module, and eventually we will need to
> > support P-SEAMLDR because the TDX module runtime update uses P-SEAMLDR's
> > SEAMCALL to do that.
> > 
> > Also, even for this series, detecting the P-SEAMLDR allows us to provide the P-
> > SEAMLDR information to user at a basic level in dmesg:
> > 
> > [..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209,
> > build_num 160, major 1, minor 0
> > 
> > This may be useful to users, but it's not a hard requirement for this series.
> 
> We've had a lot of problems in general with this code trying to do too
> much at once.  I thought we agreed that this was going to only contain
> the minimum code to make TDX functional.  It seems to be creeping to
> grow bigger and bigger.
> 
> Am I remembering this wrong?

OK. I'll remove the P-SEAMLDR related code.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-27  1:15   ` Kai Huang
@ 2022-04-27 21:59     ` Dave Hansen
  2022-04-28  0:37       ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-27 21:59 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/26/22 18:15, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> SEAM VMX root operation is designed to host a CPU-attested, software
>>> module called the 'TDX module' which implements functions to manage
>>> crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
>>
>> "crypto protected"?  What the heck is that?
> 
> How about "crypto-protected"?  I googled and it seems it is used by someone
> else.

Cryptography itself doesn't provide (much) protection in the TDX
architecture.  TDX guests are isolated from the VMM in ways that
traditional guests are not, but that has almost nothing to do with
cryptography.

Is it cryptography that keeps the host from reading guest private data
in the clear?  Is it cryptography that keeps the host from reading guest
ciphertext?  Does cryptography enforce the extra rules of Secure-EPT?

>>> 3. Memory hotplug
>>>
>>> The first generation of TDX architecturally doesn't support memory
>>> hotplug.  And the first generation of TDX-capable platforms don't support
>>> physical memory hotplug.  Since it physically cannot happen, this series
>>> doesn't add any check in ACPI memory hotplug code path to disable it.
>>>
>>> A special case of memory hotplug is adding NVDIMM as system RAM using
>>> kmem driver.  However the first generation of TDX-capable platforms
>>> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
>>> happen either.
>>
>> What prevents this code from today's code being run on tomorrow's
>> platforms and breaking these assumptions?
> 
> I forgot to add below (which is in the documentation patch):
> 
> "This can be enhanced when future generation of TDX starts to support ACPI
> memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> same platform."
> 
> Is this acceptable?

No, Kai.

You're basically saying: *this* code doesn't work with feature A, B and
C.  Then, you're pivoting to say that it doesn't matter because one
version of Intel's hardware doesn't support A, B, or C.

I don't care about this *ONE* version of the hardware.  I care about
*ALL* the hardware that this code will ever support.  *ALL* the hardware
on which this code will run.

In 5 years, if someone takes this code and runs it on Intel hardware
with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

You can't just ignore the problems because they're not present on one
version of the hardware.

>>> Another case is admin can use 'memmap' kernel command line to create
>>> legacy PMEMs and use them as TD guest memory, or theoretically, can use
>>> kmem driver to add them as system RAM.  To avoid having to change memory
>>> hotplug code to prevent this from happening, this series always include
>>> legacy PMEMs when constructing TDMRs so they are also TDX memory.
>>>
>>> 4. CPU hotplug
>>>
>>> The first generation of TDX architecturally doesn't support ACPI CPU
>>> hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
>>> first generation of TDX-capable platforms don't support ACPI CPU hotplug
>>> either.  Since this physically cannot happen, this series doesn't add any
>>> check in ACPI CPU hotplug code path to disable it.
>>>
>>> Also, only TDX module initialization requires all BIOS-enabled cpus are
>>> online.  After the initialization, any logical cpu can be brought down
>>> and brought up to online again later.  Therefore this series doesn't
>>> change logical CPU hotplug either.
>>>
>>> 5. TDX interaction with kexec()
>>>
>>> If TDX is ever enabled and/or used to run any TD guests, the cachelines
>>> of TDX private memory, including PAMTs, used by TDX module need to be
>>> flushed before transiting to the new kernel otherwise they may silently
>>> corrupt the new kernel.  Similar to SME, this series flushes cache in
>>> stop_this_cpu().
>>
>> What does this have to do with kexec()?  What's a PAMT?
> 
> The point is the dirty cachelines of TDX private memory must be flushed
> otherwise they may slightly corrupt the new kexec()-ed kernel.
> 
> Will use "TDX metadata" instead of "PAMT".  The former has already been
> mentioned above.

Longer description for the patch itself:

TDX memory encryption is built on top of MKTME which uses physical
address aliases to designate encryption keys.  This architecture is not
cache coherent.  Software is responsible for flushing the CPU caches
when memory changes keys.  When kexec()'ing, memory can be repurposed
from TDX use to non-TDX use, changing the effective encryption key.

Cover-letter-level description:

Just like SME, TDX hosts require special cache flushing before kexec().

>>> uninitialized state so it can be initialized again.
>>>
>>> This implies:
>>>
>>>   - If the old kernel fails to initialize TDX, the new kernel cannot
>>>     use TDX too unless the new kernel fixes the bug which leads to
>>>     initialization failure in the old kernel and can resume from where
>>>     the old kernel stops. This requires certain coordination between
>>>     the two kernels.
>>
>> OK, but what does this *MEAN*?
> 
> This means we need to extend the information which the old kernel passes to the
> new kernel.  But I don't think it's feasible.  I'll refine this kexec() section
> to make it more concise next version.
> 
>>
>>>   - If the old kernel has initialized TDX successfully, the new kernel
>>>     may be able to use TDX if the two kernels have the exactly same
>>>     configurations on the TDX module. It further requires the new kernel
>>>     to reserve the TDX metadata pages (allocated by the old kernel) in
>>>     its page allocator. It also requires coordination between the two
>>>     kernels.  Furthermore, if kexec() is done when there are active TD
>>>     guests running, the new kernel cannot use TDX because it's extremely
>>>     hard for the old kernel to pass all TDX private pages to the new
>>>     kernel.
>>>
>>> Given that, this series doesn't support TDX after kexec() (except the
>>> old kernel doesn't attempt to initialize TDX at all).
>>>
>>> And this series doesn't shut down TDX module but leaves it open during
>>> kexec().  It is because shutting down TDX module requires CPU being in
>>> VMX operation but there's no guarantee of this during kexec().  Leaving
>>> the TDX module open is not the best case, but it is OK since the new
>>> kernel won't be able to use TDX anyway (therefore TDX module won't run
>>> at all).
>>
>> tl;dr: kexec() doesn't work with this code.
>>
>> Right?
>>
>> That doesn't seem good.
> 
> It can work in my understanding.  We just need to flush cache before booting to
> the new kernel.

What about all the concerns about TDX module configuration changing?


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-06  4:49 ` [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory Kai Huang
  2022-04-25  2:58   ` Sathyanarayanan Kuppuswamy
@ 2022-04-27 22:15   ` Dave Hansen
  2022-04-28  0:15     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-27 22:15 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums.  Not all memory
> satisfies these requirements.
> 
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR).  During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees.  The list of these
> ranges, along with TDX module information, is available to the kernel by
> querying the TDX module via TDH.SYS.INFO SEAMCALL.
> 
> Host kernel can choose whether or not to use all convertible memory
> regions as TDX memory.  Before TDX module is ready to create any TD
> guests, all TDX memory regions that host kernel intends to use must be
> configured to the TDX module, using specific data structures defined by
> TDX architecture.  Constructing those structures requires information of
> both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
> to get this information as preparation to construct those structures.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
>  2 files changed, 192 insertions(+)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ef2718423f0f..482e6d858181 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
>  
>  static struct p_seamldr_info p_seamldr_info;
>  
> +/* Base address of CMR array needs to be 512 bytes aligned. */
> +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> +static int tdx_cmr_num;
> +static struct tdsysinfo_struct tdx_sysinfo;

I really dislike mixing hardware and software structures.  Please make
it clear which of these are fully software-defined and which are part of
the hardware ABI.

>  static bool __seamrr_enabled(void)
>  {
>  	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> @@ -468,6 +473,127 @@ static int tdx_module_init_cpus(void)
>  	return seamcall_on_each_cpu(&sc);
>  }
>  
> +static inline bool cmr_valid(struct cmr_info *cmr)
> +{
> +	return !!cmr->size;
> +}
> +
> +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
> +		       const char *name)
> +{
> +	int i;
> +
> +	for (i = 0; i < cmr_num; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +
> +		pr_info("%s : [0x%llx, 0x%llx)\n", name,
> +				cmr->base, cmr->base + cmr->size);
> +	}
> +}
> +
> +static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
> +{
> +	int i, j;
> +
> +	/*
> +	 * Intel TDX module spec, 20.7.3 CMR_INFO:
> +	 *
> +	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> +	 *   array of CMR_INFO entries. The CMRs are sorted from the
> +	 *   lowest base address to the highest base address, and they
> +	 *   are non-overlapping.
> +	 *
> +	 * This implies that BIOS may generate invalid empty entries
> +	 * if total CMRs are less than 32.  Skip them manually.
> +	 */
> +	for (i = 0; i < cmr_num; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +		struct cmr_info *prev_cmr = NULL;
> +
> +		/* Skip further invalid CMRs */
> +		if (!cmr_valid(cmr))
> +			break;
> +
> +		if (i > 0)
> +			prev_cmr = &cmr_array[i - 1];
> +
> +		/*
> +		 * It is a TDX firmware bug if CMRs are not
> +		 * in address ascending order.
> +		 */
> +		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
> +					cmr->base)) {
> +			pr_err("Firmware bug: CMRs not in address ascending order.\n");
> +			return -EFAULT;

-EFAULT is a really weird return code to use for this.  I'd use -EINVAL.

> +		}
> +	}
> +
> +	/*
> +	 * Also a sane BIOS should never generate invalid CMR(s) between
> +	 * two valid CMRs.  Sanity check this and simply return error in
> +	 * this case.
> +	 *
> +	 * By reaching here @i is the index of the first invalid CMR (or
> +	 * cmr_num).  Starting with next entry of @i since it has already
> +	 * been checked.
> +	 */
> +	for (j = i + 1; j < cmr_num; j++)
> +		if (cmr_valid(&cmr_array[j])) {
> +			pr_err("Firmware bug: invalid CMR(s) among valid CMRs.\n");
> +			return -EFAULT;
> +		}

Please add brackets for the for().

> +	/*
> +	 * Trim all tail invalid empty CMRs.  BIOS should generate at
> +	 * least one valid CMR, otherwise it's a TDX firmware bug.
> +	 */
> +	tdx_cmr_num = i;
> +	if (!tdx_cmr_num) {
> +		pr_err("Firmware bug: No valid CMR.\n");
> +		return -EFAULT;
> +	}
> +
> +	/* Print kernel sanitized CMRs */
> +	print_cmrs(tdx_cmr_array, tdx_cmr_num, "Kernel-sanitized-CMR");
> +
> +	return 0;
> +}
> +
> +static int tdx_get_sysinfo(void)
> +{
> +	struct tdx_module_output out;
> +	u64 tdsysinfo_sz, cmr_num;
> +	int ret;
> +
> +	BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
> +
> +	ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
> +			__pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * If TDH.SYS.CONFIG succeeds, RDX contains the actual bytes
> +	 * written to @tdx_sysinfo and R9 contains the actual entries
> +	 * written to @tdx_cmr_array.  Sanity check them.
> +	 */
> +	tdsysinfo_sz = out.rdx;
> +	cmr_num = out.r9;

Please vertically align things like this:

	tdsysinfo_sz = out.rdx;
	cmr_num	     = out.r9;

> +	if (WARN_ON_ONCE((tdsysinfo_sz > sizeof(tdx_sysinfo)) || !tdsysinfo_sz ||
> +				(cmr_num > MAX_CMRS) || !cmr_num))
> +		return -EFAULT;

Sanity checking is good, but this makes me wonder how much is too much.
 I don't see a lot of code for instance checking if sys_write() writes
more than how much it was supposed to.

Why are these sanity checks necessary here?  Is the TDX module expected
to be *THAT* buggy?  The thing that's providing, oh, basically all of
the security guarantees of this architecture.  It's overflowing the
buffers you hand it?

> +	pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> +		tdx_sysinfo.vendor_id, tdx_sysinfo.major_version,
> +		tdx_sysinfo.minor_version, tdx_sysinfo.build_date,
> +		tdx_sysinfo.build_num);
> +
> +	/* Print BIOS provided CMRs */
> +	print_cmrs(tdx_cmr_array, cmr_num, "BIOS-CMR");
> +
> +	return sanitize_cmrs(tdx_cmr_array, cmr_num);
> +}

Does sanitize_cmrs() sanitize anything?  It looks to me like it *checks*
the CMRs.  But, sanitizing is an active operation that writes to the
data being sanitized.  This looks read-only to me.  check_cmrs() would
be a better name for a passive check.

>  static int init_tdx_module(void)
>  {
>  	int ret;
> @@ -482,6 +608,11 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/* Get TDX module information and CMRs */
> +	ret = tdx_get_sysinfo();
> +	if (ret)
> +		goto out;

Couldn't we get rid of that comment if you did something like:

	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);

and preferably make the variables function-local.

>  	/*
>  	 * Return -EFAULT until all steps of TDX module
>  	 * initialization are done.
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index b8cfdd6e12f3..2f21c45df6ac 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -29,6 +29,66 @@ struct p_seamldr_info {
>  	u8	reserved2[88];
>  } __packed __aligned(P_SEAMLDR_INFO_ALIGNMENT);
>  
> +struct cmr_info {
> +	u64	base;
> +	u64	size;
> +} __packed;
> +
> +#define MAX_CMRS			32
> +#define CMR_INFO_ARRAY_ALIGNMENT	512
> +
> +struct cpuid_config {
> +	u32	leaf;
> +	u32	sub_leaf;
> +	u32	eax;
> +	u32	ebx;
> +	u32	ecx;
> +	u32	edx;
> +} __packed;
> +
> +#define TDSYSINFO_STRUCT_SIZE		1024
> +#define TDSYSINFO_STRUCT_ALIGNMENT	1024
> +
> +struct tdsysinfo_struct {
> +	/* TDX-SEAM Module Info */
> +	u32	attributes;
> +	u32	vendor_id;
> +	u32	build_date;
> +	u16	build_num;
> +	u16	minor_version;
> +	u16	major_version;
> +	u8	reserved0[14];
> +	/* Memory Info */
> +	u16	max_tdmrs;
> +	u16	max_reserved_per_tdmr;
> +	u16	pamt_entry_size;
> +	u8	reserved1[10];
> +	/* Control Struct Info */
> +	u16	tdcs_base_size;
> +	u8	reserved2[2];
> +	u16	tdvps_base_size;
> +	u8	tdvps_xfam_dependent_size;
> +	u8	reserved3[9];
> +	/* TD Capabilities */
> +	u64	attributes_fixed0;
> +	u64	attributes_fixed1;
> +	u64	xfam_fixed0;
> +	u64	xfam_fixed1;
> +	u8	reserved4[32];
> +	u32	num_cpuid_config;
> +	/*
> +	 * The actual number of CPUID_CONFIG depends on above
> +	 * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
> +	 * is 1024B defined by TDX architecture.  Use a union with
> +	 * specific padding to make 'sizeof(struct tdsysinfo_struct)'
> +	 * equal to 1024.
> +	 */
> +	union {
> +		struct cpuid_config	cpuid_configs[0];
> +		u8			reserved5[892];
> +	};
> +} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
> +
>  /*
>   * P-SEAMLDR SEAMCALL leaf function
>   */
> @@ -38,6 +98,7 @@ struct p_seamldr_info {
>  /*
>   * TDX module SEAMCALL leaf functions
>   */
> +#define TDH_SYS_INFO		32
>  #define TDH_SYS_INIT		33
>  #define TDH_SYS_LP_INIT		35
>  #define TDH_SYS_LP_SHUTDOWN	44


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-06  4:49 ` [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory Kai Huang
  2022-04-20 20:48   ` Isaku Yamahata
@ 2022-04-27 22:24   ` Dave Hansen
  2022-04-28  0:53     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-27 22:24 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> TDX provides increased levels of memory confidentiality and integrity.
> This requires special hardware support for features like memory
> encryption and storage of memory integrity checksums.  Not all memory
> satisfies these requirements.
> 
> As a result, TDX introduced the concept of a "Convertible Memory Region"
> (CMR).  During boot, the firmware builds a list of all of the memory
> ranges which can provide the TDX security guarantees.  The list of these
> ranges, along with TDX module information, is available to the kernel by
> querying the TDX module.
> 
> In order to provide crypto protection to TD guests, the TDX architecture

There's that "crypto protection" thing again.  I'm not really a fan of
the changes made to this changelog since I wrote it. :)

> also needs additional metadata to record things like which TD guest
> "owns" a given page of memory.  This metadata essentially serves as the
> 'struct page' for the TDX module.  The space for this metadata is not
> reserved by the hardware upfront and must be allocated by the kernel

			    ^ "up front"

...
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 482e6d858181..ec27350d53c1 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,7 @@
>  #include <linux/cpu.h>
>  #include <linux/smp.h>
>  #include <linux/atomic.h>
> +#include <linux/slab.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/cpufeature.h>
> @@ -594,8 +595,29 @@ static int tdx_get_sysinfo(void)
>  	return sanitize_cmrs(tdx_cmr_array, cmr_num);
>  }
>  
> +static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_num; i++) {
> +		struct tdmr_info *tdmr = tdmr_array[i];
> +
> +		/* kfree() works with NULL */
> +		kfree(tdmr);
> +		tdmr_array[i] = NULL;
> +	}
> +}
> +
> +static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> +{
> +	/* Return -EFAULT until constructing TDMRs is done */
> +	return -EFAULT;
> +}
> +
>  static int init_tdx_module(void)
>  {
> +	struct tdmr_info **tdmr_array;
> +	int tdmr_num;
>  	int ret;
>  
>  	/* TDX module global initialization */
> @@ -613,11 +635,36 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * Prepare enough space to hold pointers of TDMRs (TDMR_INFO).
> +	 * TDX requires TDMR_INFO being 512 aligned.  Each TDMR is

					 ^ "512-byte aligned"

Right?

> +	 * allocated individually within construct_tdmrs() to meet
> +	 * this requirement.
> +	 */
> +	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
> +			GFP_KERNEL);

Where, exactly is that alignment provided?  A 'struct tdmr_info *' is 8
bytes so a tdx_sysinfo.max_tdmrs=8 kcalloc() would only guarantee
64-byte alignment.

Also, I'm surprised that this is an array of virtual address pointers.
The previous interactions with the TDX module seemed to all take
physical addresses.  How is it that this hardware structure which has
hardware alignment constraints is holding virtual addresses?

> +	if (!tdmr_array) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Construct TDMRs to build TDX memory */
> +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
> +	if (ret)
> +		goto out_free_tdmrs;
> +
>  	/*
>  	 * Return -EFAULT until all steps of TDX module
>  	 * initialization are done.
>  	 */
>  	ret = -EFAULT;

There's the -EFAULT again.  I'd replace these with a better error code.

> +out_free_tdmrs:
> +	/*
> +	 * TDMRs are only used during initializing TDX module.  Always
> +	 * free them no matter the initialization was successful or not.
> +	 */
> +	free_tdmrs(tdmr_array, tdmr_num);
> +	kfree(tdmr_array);
>  out:
>  	return ret;
>  }
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 2f21c45df6ac..05bf9fe6bd00 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -89,6 +89,29 @@ struct tdsysinfo_struct {
>  	};
>  } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
>  
> +struct tdmr_reserved_area {
> +	u64 offset;
> +	u64 size;
> +} __packed;
> +
> +#define TDMR_INFO_ALIGNMENT	512
> +
> +struct tdmr_info {
> +	u64 base;
> +	u64 size;
> +	u64 pamt_1g_base;
> +	u64 pamt_1g_size;
> +	u64 pamt_2m_base;
> +	u64 pamt_2m_size;
> +	u64 pamt_4k_base;
> +	u64 pamt_4k_size;
> +	/*
> +	 * Actual number of reserved areas depends on
> +	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
> +	 */
> +	struct tdmr_reserved_area reserved_areas[0];
> +} __packed __aligned(TDMR_INFO_ALIGNMENT);
> +
>  /*
>   * P-SEAMLDR SEAMCALL leaf function
>   */


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
  2022-04-27 14:22           ` Dave Hansen
@ 2022-04-27 22:39             ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-27 22:39 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 07:22 -0700, Dave Hansen wrote:
> On 4/26/22 16:49, Kai Huang wrote:
> > On Tue, 2022-04-26 at 16:28 -0700, Dave Hansen wrote:
> > > What about a dependency?  Isn't this dead code without CONFIG_KVM=y/m?
> > 
> > Conceptually, KVM is one user of the TDX module, so it doesn't seem correct to
> > make CONFIG_INTEL_TDX_HOST depend on CONFIG_KVM.  But so far KVM is the only
> > user of TDX, so in practice the code is dead w/o KVM.
> > 
> > What's your opinion?
> 
> You're stuck in some really weird fantasy world.  Sure, we can dream up
> more than one user of the TDX module.  But, in the real world, there's
> only one.  Plus, code can have multiple dependencies!
> 
> 	depends on FOO || BAR
> 
> This TDX cruft is dead code in today's real-world kernel without KVM.
> You should add a dependency.

Will add a dependency on CONFIG_KVM_INTEL.

> 
> > > > > > +static bool __seamrr_enabled(void)
> > > > > > +{
> > > > > > +	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > > > > > +}
> > > > > 
> > > > > But there's no case where seamrr_mask is non-zero and where
> > > > > _seamrr_enabled().  Why bother checking the SEAMRR_ENABLED_BITS?
> > > > 
> > > > seamrr_mask will only be non-zero when SEAMRR is enabled by BIOS, otherwise it
> > > > is 0.  It will also be cleared when BIOS mis-configuration is detected on any
> > > > AP.  SEAMRR_ENABLED_BITS is used to check whether SEAMRR is enabled.
> > > 
> > > The point is that this could be:
> > > 
> > > 	return !!seamrr_mask;
> > 
> > The definition of this SEAMRR_MASK MSR defines "ENABLED" and "LOCKED" bits. 
> > Explicitly checking the two bits, instead of !!seamrr_mask roles out other
> > incorrect configurations.  For instance, we should not treat SEAMRR being
> > enabled if we only have "ENABLED" bit set or "LOCKED" bit set.
> 
> You're confusing two different things:
>  * The state of the variable
>  * The actual correct hardware state
> 
> The *VARIABLE* can't be non-zero and also denote that SEAMRR is enabled.
>  Does this *CODE* ever set ENABLED or LOCKED without each other?

OK.  Will just use !!seamrr_mask.  I thought explicitly checking
SEAMRR_ENABLED_BITS would be clearer.

> 
> > > > > > +static void detect_seam_ap(struct cpuinfo_x86 *c)
> > > > > > +{
> > > > > > +	u64 base, mask;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Don't bother to detect this AP if SEAMRR is not
> > > > > > +	 * enabled after earlier detections.
> > > > > > +	 */
> > > > > > +	if (!__seamrr_enabled())
> > > > > > +		return;
> > > > > > +
> > > > > > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
> > > > > > +	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
> > > > > > +
> > > > > 
> > > > > This is the place for a comment about why the values have to be equal.
> > > > 
> > > > I'll add below:
> > > > 
> > > > /* BIOS must configure SEAMRR consistently across all cores */
> > > 
> > > What happens if the BIOS doesn't do this?  What actually breaks?  In
> > > other words, do we *NEED* error checking here?
> > 
> > AFAICT the spec doesn't explicitly mention what will happen if BIOS doesn't
> > configure them consistently among cores.  But for safety I think it's better to
> > detect.
> 
> Safety?  Safety of what?

I'll ask TDX architect people and get back to you.

I'll also ask what will happen if TDX KeyID isn't configured consistently across
packages.  Currently TDX KeyID is also detected on all cpus (existing
detect_tme() also detect MKTME KeyID bits on all cpus).


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-27 14:49       ` Dave Hansen
@ 2022-04-28  0:00         ` Kai Huang
  2022-04-28 14:27           ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28  0:00 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 07:49 -0700, Dave Hansen wrote:
> On 4/26/22 17:43, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:53 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> ...
> > > > +static bool tdx_keyid_sufficient(void)
> > > > +{
> > > > +	if (!cpumask_equal(&cpus_booted_once_mask,
> > > > +					cpu_present_mask))
> > > > +		return false;
> > > 
> > > I'd move this cpumask_equal() to a helper.
> > 
> > Sorry to double confirm, do you want something like:
> > 
> > static bool tdx_detected_on_all_cpus(void)
> > {
> > 	/*
> > 	 * To detect any BIOS misconfiguration among cores, all logical
> > 	 * cpus must have been brought up at least once.  This is true
> > 	 * unless 'maxcpus' kernel command line is used to limit the
> > 	 * number of cpus to be brought up during boot time.  However
> > 	 * 'maxcpus' is basically an invalid operation mode due to the
> > 	 * MCE broadcast problem, and it should not be used on a TDX
> > 	 * capable machine.  Just do paranoid check here and do not
> > 	 * report SEAMRR as enabled in this case.
> > 	 */
> > 	return cpumask_equal(&cpus_booted_once_mask, cpu_present_mask);
> > }
> 
> That's logically the right idea, but I hate the name since the actual
> test has nothing to do with TDX being detected.  The comment is also
> rather verbose and rambling.
> 
> It should be named something like:
> 
> 	all_cpus_booted()
> 
> and with a comment like this:
> 
> /*
>  * To initialize TDX, the kernel needs to run some code on every
>  * present CPU.  Detect cases where present CPUs have not been
>  * booted, like when maxcpus=N is used.
>  */

Thank you.

> 
> > static bool seamrr_enabled(void)
> > {
> > 	if (!tdx_detected_on_all_cpus())
> > 		return false;
> > 
> > 	return __seamrr_enabled();
> > }
> > 
> > static bool tdx_keyid_sufficient()
> > {
> > 	if (!tdx_detected_on_all_cpus())
> > 		return false;
> > 
> > 	...
> > }
> 
> Although, looking at those, it's *still* unclear why you need this.  I
> assume it's because some later TDX SEAMCALL will fail if you get this
> wrong, and you want to be able to provide a better error message.
> 
> *BUT* this code doesn't actually provide halfway reasonable error
> messages.  If someone uses maxcpus=99, then this code will report:
> 
> 	pr_info("SEAMRR not enabled.\n");
> 
> right?  That's bonkers.

Right this isn't good.

I think we can use pr_info_once() when all_cpus_booted() returns false, and get
rid of printing "SEAMRR not enabled" in seamrr_enabled().  How about below?

static bool seamrr_enabled(void)
{
	if (!all_cpus_booted())
		pr_info_once("Not all present CPUs have been booted.  Report
SEAMRR as not enabled");

	return __seamrr_enabled();
}

And we don't print "SEAMRR not enabled".

> 
> > > > +	/*
> > > > +	 * TDX requires at least two KeyIDs: one global KeyID to
> > > > +	 * protect the metadata of the TDX module and one or more
> > > > +	 * KeyIDs to run TD guests.
> > > > +	 */
> > > > +	return tdx_keyid_num >= 2;
> > > > +}
> > > > +
> > > > +static int __tdx_detect(void)
> > > > +{
> > > > +	/* The TDX module is not loaded if SEAMRR is disabled */
> > > > +	if (!seamrr_enabled()) {
> > > > +		pr_info("SEAMRR not enabled.\n");
> > > > +		goto no_tdx_module;
> > > > +	}
> > > 
> > > Why even bother with the SEAMRR stuff?  It sounded like you can "ping"
> > > the module with SEAMCALL.  Why not just use that directly?
> > 
> > SEAMCALL will cause #GP if SEAMRR is not enabled.  We should check whether
> > SEAMRR is enabled before making SEAMCALL.
> 
> So...  You could actually get rid of all this code.  if SEAMCALL #GP's,
> then you say, "Whoops, the firmware didn't load the TDX module
> correctly, sorry."

Yes we can just use the first SEAMCALL (TDH.SYS.INIT) to detect whether TDX
module is loaded.  If SEAMCALL is successful, the module is loaded.

One problem is currently the patch to flush cache for kexec() uses
seamrr_enabled() and tdx_keyid_sufficient() to determine whether we need to
flush the cache.  The reason is, similar to SME, the flush is done in
stop_this_cpu(), but the status of TDX module initialization is protected by
mutex, so we cannot use TDX module status in stop_this_cpu() to determine
whether to flush.

If that patch makes sense, I think we still need to detect SEAMRR?

> 
> Why is all this code here?  What is it for?
> 
> > > > +	/*
> > > > +	 * Also do not report the TDX module as loaded if there's
> > > > +	 * no enough TDX private KeyIDs to run any TD guests.
> > > > +	 */
> > > > +	if (!tdx_keyid_sufficient()) {
> > > > +		pr_info("Number of TDX private KeyIDs too small: %u.\n",
> > > > +				tdx_keyid_num);
> > > > +		goto no_tdx_module;
> > > > +	}
> > > > +
> > > > +	/* Return -ENODEV until the TDX module is detected */
> > > > +no_tdx_module:
> > > > +	tdx_module_status = TDX_MODULE_NONE;
> > > > +	return -ENODEV;
> > > > +}
> 
> Again, if someone uses maxcpus=1234 and we get down here, then it
> reports to the user:
> 	
> 	Number of TDX private KeyIDs too small: ...
> 
> ????  When the root of the problem has nothing to do with KeyIDs.

Thanks for catching.  Similar to seamrr_enabled() above.

> 
> > > > +static int init_tdx_module(void)
> > > > +{
> > > > +	/*
> > > > +	 * Return -EFAULT until all steps of TDX module
> > > > +	 * initialization are done.
> > > > +	 */
> > > > +	return -EFAULT;
> > > > +}
> > > > +
> > > > +static void shutdown_tdx_module(void)
> > > > +{
> > > > +	/* TODO: Shut down the TDX module */
> > > > +	tdx_module_status = TDX_MODULE_SHUTDOWN;
> > > > +}
> > > > +
> > > > +static int __tdx_init(void)
> > > > +{
> > > > +	int ret;
> > > > +
> > > > +	/*
> > > > +	 * Logical-cpu scope initialization requires calling one SEAMCALL
> > > > +	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
> > > > +	 * module also has such requirement.  Further more, configuring
> > > > +	 * the key of the global KeyID requires calling one SEAMCALL for
> > > > +	 * each package.  For simplicity, disable CPU hotplug in the whole
> > > > +	 * initialization process.
> > > > +	 *
> > > > +	 * It's perhaps better to check whether all BIOS-enabled cpus are
> > > > +	 * online before starting initializing, and return early if not.
> > > 
> > > But you did some of this cpumask checking above.  Right?
> > 
> > Above check only guarantees SEAMRR/TDX KeyID has been detected on all presnet
> > cpus.  the 'present' cpumask doesn't equal to all BIOS-enabled CPUs.
> 
> I have no idea what this is saying.  In general, I have no idea what the
> comment is saying.  It makes zero sense.  The locking pattern for stuff
> like this is:
> 
> 	cpus_read_lock();
> 
> 	for_each_online_cpu()
> 		do_something();
> 
> 	cpus_read_unlock();
> 
> because you need to make sure that you don't miss "do_something()" on a
> CPU that comes online during the loop.

I don't want any CPU going offline so  "do_something" will be done on all online
CPUs.

> 
> But, now that I think about it, all of the checks I've seen so far are
> for *booted* CPUs.  While the lock (I assume) would keep new CPUs from
> booting, it doesn't do any good really since the "cpus_booted_once_mask"
> bits are only set and not cleared.  A CPU doesn't un-become booted once.
> 
> Again, we seem to have a long, verbose comment that says very little and
> only confuses me.

How about below:

"During initializing the TDX module, one step requires some SEAMCALL must be
done on all logical cpus enabled by BIOS, otherwise a later step will fail. 
Disable CPU hotplug during the initialization process to prevent any CPU going
offline during initializing the TDX module.  Note it is caller's responsibility
to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
are online."


> 
> ...
> > > Why does this need both a tdx_detect() and a tdx_init()?  Shouldn't the
> > > interface from outside just be "get TDX up and running, please?"
> > 
> > We can have a single tdx_init().  However tdx_init() can be heavy, and having a
> > separate non-heavy tdx_detect() may be useful if caller wants to separate
> > "detecting the TDX module" and "initializing the TDX module", i.e. to do
> > something in the middle.
> 
> <Sigh>  So, this "design" went unmentioned, *and* I can't review if the
> actual callers of this need the functionality or not because they're not
> in this series.

I'll remove tdx_detect().  Currently KVM doesn't do anything between
tdx_detect() and tdx_init(). 

https://lore.kernel.org/lkml/cover.1646422845.git.isaku.yamahata@intel.com/T/#mc7d5bb37107131b65ca7142b418b3e17da36a9ca

> 
> > However tdx_detect() basically only detects P-SEAMLDR.  If we move P-SEAMLDR
> > detection to tdx_init(), or we git rid of P-SEAMLDR completely, then we don't
> > need tdx_detect() anymore.  We can expose seamrr_enabled() and TDX KeyID
> > variables or functions so caller can use them to see whether it should do TDX
> > related staff and then call tdx_init().
> 
> I don't think you've made a strong case for why P-SEAMLDR detection is
> even necessary in this series.

Will remove P-SEAMLDR code and tdx_detect().


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-27 22:15   ` Dave Hansen
@ 2022-04-28  0:15     ` Kai Huang
  2022-04-28 14:06       ` Dave Hansen
  2022-05-18 22:30       ` Sagi Shahar
  0 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-28  0:15 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 15:15 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > TDX provides increased levels of memory confidentiality and integrity.
> > This requires special hardware support for features like memory
> > encryption and storage of memory integrity checksums.  Not all memory
> > satisfies these requirements.
> > 
> > As a result, TDX introduced the concept of a "Convertible Memory Region"
> > (CMR).  During boot, the firmware builds a list of all of the memory
> > ranges which can provide the TDX security guarantees.  The list of these
> > ranges, along with TDX module information, is available to the kernel by
> > querying the TDX module via TDH.SYS.INFO SEAMCALL.
> > 
> > Host kernel can choose whether or not to use all convertible memory
> > regions as TDX memory.  Before TDX module is ready to create any TD
> > guests, all TDX memory regions that host kernel intends to use must be
> > configured to the TDX module, using specific data structures defined by
> > TDX architecture.  Constructing those structures requires information of
> > both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
> > to get this information as preparation to construct those structures.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
> >  arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
> >  2 files changed, 192 insertions(+)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index ef2718423f0f..482e6d858181 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
> >  
> >  static struct p_seamldr_info p_seamldr_info;
> >  
> > +/* Base address of CMR array needs to be 512 bytes aligned. */
> > +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> > +static int tdx_cmr_num;
> > +static struct tdsysinfo_struct tdx_sysinfo;
> 
> I really dislike mixing hardware and software structures.  Please make
> it clear which of these are fully software-defined and which are part of
> the hardware ABI.

Both 'struct tdsysinfo_struct' and 'struct cmr_info' are hardware structures. 
They are defined in tdx.h which has a comment saying the data structures below
this comment is hardware structures:

	+/*
	+ * TDX architectural data structures
	+ */

It is introduced in the P-SEAMLDR patch.

Should I explicitly add comments around the variables saying they are used by
hardware, something like:

	/*
	 * Data structures used by TDH.SYS.INFO SEAMCALL to return CMRs and
	 * TDX module system information.
	 */

?
 
> 
> >  static bool __seamrr_enabled(void)
> >  {
> >  	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > @@ -468,6 +473,127 @@ static int tdx_module_init_cpus(void)
> >  	return seamcall_on_each_cpu(&sc);
> >  }
> >  
> > +static inline bool cmr_valid(struct cmr_info *cmr)
> > +{
> > +	return !!cmr->size;
> > +}
> > +
> > +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
> > +		       const char *name)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < cmr_num; i++) {
> > +		struct cmr_info *cmr = &cmr_array[i];
> > +
> > +		pr_info("%s : [0x%llx, 0x%llx)\n", name,
> > +				cmr->base, cmr->base + cmr->size);
> > +	}
> > +}
> > +
> > +static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
> > +{
> > +	int i, j;
> > +
> > +	/*
> > +	 * Intel TDX module spec, 20.7.3 CMR_INFO:
> > +	 *
> > +	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> > +	 *   array of CMR_INFO entries. The CMRs are sorted from the
> > +	 *   lowest base address to the highest base address, and they
> > +	 *   are non-overlapping.
> > +	 *
> > +	 * This implies that BIOS may generate invalid empty entries
> > +	 * if total CMRs are less than 32.  Skip them manually.
> > +	 */
> > +	for (i = 0; i < cmr_num; i++) {
> > +		struct cmr_info *cmr = &cmr_array[i];
> > +		struct cmr_info *prev_cmr = NULL;
> > +
> > +		/* Skip further invalid CMRs */
> > +		if (!cmr_valid(cmr))
> > +			break;
> > +
> > +		if (i > 0)
> > +			prev_cmr = &cmr_array[i - 1];
> > +
> > +		/*
> > +		 * It is a TDX firmware bug if CMRs are not
> > +		 * in address ascending order.
> > +		 */
> > +		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
> > +					cmr->base)) {
> > +			pr_err("Firmware bug: CMRs not in address ascending order.\n");
> > +			return -EFAULT;
> 
> -EFAULT is a really weird return code to use for this.  I'd use -EINVAL.

OK thanks.

> 
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Also a sane BIOS should never generate invalid CMR(s) between
> > +	 * two valid CMRs.  Sanity check this and simply return error in
> > +	 * this case.
> > +	 *
> > +	 * By reaching here @i is the index of the first invalid CMR (or
> > +	 * cmr_num).  Starting with next entry of @i since it has already
> > +	 * been checked.
> > +	 */
> > +	for (j = i + 1; j < cmr_num; j++)
> > +		if (cmr_valid(&cmr_array[j])) {
> > +			pr_err("Firmware bug: invalid CMR(s) among valid CMRs.\n");
> > +			return -EFAULT;
> > +		}
> 
> Please add brackets for the for().

OK.

> 
> > +	/*
> > +	 * Trim all tail invalid empty CMRs.  BIOS should generate at
> > +	 * least one valid CMR, otherwise it's a TDX firmware bug.
> > +	 */
> > +	tdx_cmr_num = i;
> > +	if (!tdx_cmr_num) {
> > +		pr_err("Firmware bug: No valid CMR.\n");
> > +		return -EFAULT;
> > +	}
> > +
> > +	/* Print kernel sanitized CMRs */
> > +	print_cmrs(tdx_cmr_array, tdx_cmr_num, "Kernel-sanitized-CMR");
> > +
> > +	return 0;
> > +}
> > +
> > +static int tdx_get_sysinfo(void)
> > +{
> > +	struct tdx_module_output out;
> > +	u64 tdsysinfo_sz, cmr_num;
> > +	int ret;
> > +
> > +	BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
> > +
> > +	ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
> > +			__pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * If TDH.SYS.CONFIG succeeds, RDX contains the actual bytes
> > +	 * written to @tdx_sysinfo and R9 contains the actual entries
> > +	 * written to @tdx_cmr_array.  Sanity check them.
> > +	 */
> > +	tdsysinfo_sz = out.rdx;
> > +	cmr_num = out.r9;
> 
> Please vertically align things like this:
> 
> 	tdsysinfo_sz = out.rdx;
> 	cmr_num	     = out.r9;

OK.

> 
> > +	if (WARN_ON_ONCE((tdsysinfo_sz > sizeof(tdx_sysinfo)) || !tdsysinfo_sz ||
> > +				(cmr_num > MAX_CMRS) || !cmr_num))
> > +		return -EFAULT;
> 
> Sanity checking is good, but this makes me wonder how much is too much.
>  I don't see a lot of code for instance checking if sys_write() writes
> more than how much it was supposed to.
> 
> Why are these sanity checks necessary here?  Is the TDX module expected
> to be *THAT* buggy?  The thing that's providing, oh, basically all of
> the security guarantees of this architecture.  It's overflowing the
> buffers you hand it?

I think this check can be removed.  Will remove.

> 
> > +	pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> > +		tdx_sysinfo.vendor_id, tdx_sysinfo.major_version,
> > +		tdx_sysinfo.minor_version, tdx_sysinfo.build_date,
> > +		tdx_sysinfo.build_num);
> > +
> > +	/* Print BIOS provided CMRs */
> > +	print_cmrs(tdx_cmr_array, cmr_num, "BIOS-CMR");
> > +
> > +	return sanitize_cmrs(tdx_cmr_array, cmr_num);
> > +}
> 
> Does sanitize_cmrs() sanitize anything?  It looks to me like it *checks*
> the CMRs.  But, sanitizing is an active operation that writes to the
> data being sanitized.  This looks read-only to me.  check_cmrs() would
> be a better name for a passive check.

Sure will change to check_cmrs().

> 
> >  static int init_tdx_module(void)
> >  {
> >  	int ret;
> > @@ -482,6 +608,11 @@ static int init_tdx_module(void)
> >  	if (ret)
> >  		goto out;
> >  
> > +	/* Get TDX module information and CMRs */
> > +	ret = tdx_get_sysinfo();
> > +	if (ret)
> > +		goto out;
> 
> Couldn't we get rid of that comment if you did something like:
> 
> 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);

Yes will do.

> 
> and preferably make the variables function-local.

'tdx_sysinfo' will be used by KVM too.



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-27 21:59     ` Dave Hansen
@ 2022-04-28  0:37       ` Kai Huang
  2022-04-28  0:50         ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28  0:37 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> On 4/26/22 18:15, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > SEAM VMX root operation is designed to host a CPU-attested, software
> > > > module called the 'TDX module' which implements functions to manage
> > > > crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
> > > 
> > > "crypto protected"?  What the heck is that?
> > 
> > How about "crypto-protected"?  I googled and it seems it is used by someone
> > else.
> 
> Cryptography itself doesn't provide (much) protection in the TDX
> architecture.  TDX guests are isolated from the VMM in ways that
> traditional guests are not, but that has almost nothing to do with
> cryptography.
> 
> Is it cryptography that keeps the host from reading guest private data
> in the clear?  Is it cryptography that keeps the host from reading guest
> ciphertext?  Does cryptography enforce the extra rules of Secure-EPT?

OK will change to "protected VMs" in this entire series.

> 
> > > > 3. Memory hotplug
> > > > 
> > > > The first generation of TDX architecturally doesn't support memory
> > > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > > physical memory hotplug.  Since it physically cannot happen, this series
> > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > > 
> > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> > > > kmem driver.  However the first generation of TDX-capable platforms
> > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > happen either.
> > > 
> > > What prevents this code from today's code being run on tomorrow's
> > > platforms and breaking these assumptions?
> > 
> > I forgot to add below (which is in the documentation patch):
> > 
> > "This can be enhanced when future generation of TDX starts to support ACPI
> > memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> > same platform."
> > 
> > Is this acceptable?
> 
> No, Kai.
> 
> You're basically saying: *this* code doesn't work with feature A, B and
> C.  Then, you're pivoting to say that it doesn't matter because one
> version of Intel's hardware doesn't support A, B, or C.
> 
> I don't care about this *ONE* version of the hardware.  I care about
> *ALL* the hardware that this code will ever support.  *ALL* the hardware
> on which this code will run.
> 
> In 5 years, if someone takes this code and runs it on Intel hardware
> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

I thought we could document this in the documentation saying that this code can
only work on TDX machines that don't have above capabilities (SPR for now).  We
can change the code and the documentation  when we add the support of those
features in the future, and update the documentation.

If 5 years later someone takes this code, he/she should take a look at the
documentation and figure out that he/she should choose a newer kernel if the
machine support those features.

I'll think about design solutions if above doesn't look good for you.

> 
> You can't just ignore the problems because they're not present on one
> version of the hardware.
> 
> > > > Another case is admin can use 'memmap' kernel command line to create
> > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > > hotplug code to prevent this from happening, this series always include
> > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > > > 
> > > > 4. CPU hotplug
> > > > 
> > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > either.  Since this physically cannot happen, this series doesn't add any
> > > > check in ACPI CPU hotplug code path to disable it.
> > > > 
> > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > > > online.  After the initialization, any logical cpu can be brought down
> > > > and brought up to online again later.  Therefore this series doesn't
> > > > change logical CPU hotplug either.
> > > > 
> > > > 5. TDX interaction with kexec()
> > > > 
> > > > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > > > of TDX private memory, including PAMTs, used by TDX module need to be
> > > > flushed before transiting to the new kernel otherwise they may silently
> > > > corrupt the new kernel.  Similar to SME, this series flushes cache in
> > > > stop_this_cpu().
> > > 
> > > What does this have to do with kexec()?  What's a PAMT?
> > 
> > The point is the dirty cachelines of TDX private memory must be flushed
> > otherwise they may slightly corrupt the new kexec()-ed kernel.
> > 
> > Will use "TDX metadata" instead of "PAMT".  The former has already been
> > mentioned above.
> 
> Longer description for the patch itself:
> 
> TDX memory encryption is built on top of MKTME which uses physical
> address aliases to designate encryption keys.  This architecture is not
> cache coherent.  Software is responsible for flushing the CPU caches
> when memory changes keys.  When kexec()'ing, memory can be repurposed
> from TDX use to non-TDX use, changing the effective encryption key.
> 
> Cover-letter-level description:
> 
> Just like SME, TDX hosts require special cache flushing before kexec().

Thanks.

> 
> > > > uninitialized state so it can be initialized again.
> > > > 
> > > > This implies:
> > > > 
> > > >   - If the old kernel fails to initialize TDX, the new kernel cannot
> > > >     use TDX too unless the new kernel fixes the bug which leads to
> > > >     initialization failure in the old kernel and can resume from where
> > > >     the old kernel stops. This requires certain coordination between
> > > >     the two kernels.
> > > 
> > > OK, but what does this *MEAN*?
> > 
> > This means we need to extend the information which the old kernel passes to the
> > new kernel.  But I don't think it's feasible.  I'll refine this kexec() section
> > to make it more concise next version.
> > 
> > > 
> > > >   - If the old kernel has initialized TDX successfully, the new kernel
> > > >     may be able to use TDX if the two kernels have the exactly same
> > > >     configurations on the TDX module. It further requires the new kernel
> > > >     to reserve the TDX metadata pages (allocated by the old kernel) in
> > > >     its page allocator. It also requires coordination between the two
> > > >     kernels.  Furthermore, if kexec() is done when there are active TD
> > > >     guests running, the new kernel cannot use TDX because it's extremely
> > > >     hard for the old kernel to pass all TDX private pages to the new
> > > >     kernel.
> > > > 
> > > > Given that, this series doesn't support TDX after kexec() (except the
> > > > old kernel doesn't attempt to initialize TDX at all).
> > > > 
> > > > And this series doesn't shut down TDX module but leaves it open during
> > > > kexec().  It is because shutting down TDX module requires CPU being in
> > > > VMX operation but there's no guarantee of this during kexec().  Leaving
> > > > the TDX module open is not the best case, but it is OK since the new
> > > > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > > > at all).
> > > 
> > > tl;dr: kexec() doesn't work with this code.
> > > 
> > > Right?
> > > 
> > > That doesn't seem good.
> > 
> > It can work in my understanding.  We just need to flush cache before booting to
> > the new kernel.
> 
> What about all the concerns about TDX module configuration changing?
> 

Leaving the TDX module in fully initialized state or shutdown state (in case of
error during it's initialization) to the new kernel is fine.  If the new kernel
doesn't use TDX at all, then the TDX module won't access memory using it's
global TDX KeyID.  If the new kernel wants to use TDX, it will fail on the very
first SEAMCALL when it tries to initialize the TDX module, and won't use
SEAMCALL to call the TDX module again.  If the new kernel doesn't follow this,
then it is a bug in the new kernel, or the new kernel is malicious, in which
case it can potentially corrupt the data.  But I don't think we need to consider
this as if the new kernel is malicious, then it can corrupt data anyway.

Does this make sense?

Is there any other concerns that I missed? 

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-28  0:37       ` Kai Huang
@ 2022-04-28  0:50         ` Dave Hansen
  2022-04-28  0:58           ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28  0:50 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/27/22 17:37, Kai Huang wrote:
> On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
>> In 5 years, if someone takes this code and runs it on Intel hardware
>> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> 
> I thought we could document this in the documentation saying that this code can
> only work on TDX machines that don't have above capabilities (SPR for now).  We
> can change the code and the documentation  when we add the support of those
> features in the future, and update the documentation.
> 
> If 5 years later someone takes this code, he/she should take a look at the
> documentation and figure out that he/she should choose a newer kernel if the
> machine support those features.
> 
> I'll think about design solutions if above doesn't look good for you.

No, it doesn't look good to me.

You can't just say:

	/*
	 * This code will eat puppies if used on systems with hotplug.
	 */

and merrily await the puppy bloodbath.

If it's not compatible, then you have to *MAKE* it not compatible in a
safe, controlled way.

>> You can't just ignore the problems because they're not present on one
>> version of the hardware.

Please, please read this again ^^

>> What about all the concerns about TDX module configuration changing?
> 
> Leaving the TDX module in fully initialized state or shutdown state (in case of
> error during it's initialization) to the new kernel is fine.  If the new kernel
> doesn't use TDX at all, then the TDX module won't access memory using it's
> global TDX KeyID.  If the new kernel wants to use TDX, it will fail on the very
> first SEAMCALL when it tries to initialize the TDX module, and won't use
> SEAMCALL to call the TDX module again.  If the new kernel doesn't follow this,
> then it is a bug in the new kernel, or the new kernel is malicious, in which
> case it can potentially corrupt the data.  But I don't think we need to consider
> this as if the new kernel is malicious, then it can corrupt data anyway.
> 
> Does this make sense?

No, I'm pretty lost.  But, I'll look at the next version of this with
fresh eyes and hopefully you'll have had time to streamline the text by
then.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-27 22:24   ` Dave Hansen
@ 2022-04-28  0:53     ` Kai Huang
  2022-04-28  1:07       ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28  0:53 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 15:24 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > TDX provides increased levels of memory confidentiality and integrity.
> > This requires special hardware support for features like memory
> > encryption and storage of memory integrity checksums.  Not all memory
> > satisfies these requirements.
> > 
> > As a result, TDX introduced the concept of a "Convertible Memory Region"
> > (CMR).  During boot, the firmware builds a list of all of the memory
> > ranges which can provide the TDX security guarantees.  The list of these
> > ranges, along with TDX module information, is available to the kernel by
> > querying the TDX module.
> > 
> > In order to provide crypto protection to TD guests, the TDX architecture
> 
> There's that "crypto protection" thing again.  I'm not really a fan of
> the changes made to this changelog since I wrote it. :)

Sorry about that.  I'll remove "In order to provide crypto protection to TD
guests".

> 
> > also needs additional metadata to record things like which TD guest
> > "owns" a given page of memory.  This metadata essentially serves as the
> > 'struct page' for the TDX module.  The space for this metadata is not
> > reserved by the hardware upfront and must be allocated by the kernel
> 
> 			    ^ "up front"

Thanks will change to "up front".

Btw, the gmail grammar check gives me a red line if I use "up front", but it
doesn't complain "upfront".

> 
> ...
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 482e6d858181..ec27350d53c1 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/cpu.h>
> >  #include <linux/smp.h>
> >  #include <linux/atomic.h>
> > +#include <linux/slab.h>
> >  #include <asm/msr-index.h>
> >  #include <asm/msr.h>
> >  #include <asm/cpufeature.h>
> > @@ -594,8 +595,29 @@ static int tdx_get_sysinfo(void)
> >  	return sanitize_cmrs(tdx_cmr_array, cmr_num);
> >  }
> >  
> > +static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < tdmr_num; i++) {
> > +		struct tdmr_info *tdmr = tdmr_array[i];
> > +
> > +		/* kfree() works with NULL */
> > +		kfree(tdmr);
> > +		tdmr_array[i] = NULL;
> > +	}
> > +}
> > +
> > +static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> > +{
> > +	/* Return -EFAULT until constructing TDMRs is done */
> > +	return -EFAULT;
> > +}
> > +
> >  static int init_tdx_module(void)
> >  {
> > +	struct tdmr_info **tdmr_array;
> > +	int tdmr_num;
> >  	int ret;
> >  
> >  	/* TDX module global initialization */
> > @@ -613,11 +635,36 @@ static int init_tdx_module(void)
> >  	if (ret)
> >  		goto out;
> >  
> > +	/*
> > +	 * Prepare enough space to hold pointers of TDMRs (TDMR_INFO).
> > +	 * TDX requires TDMR_INFO being 512 aligned.  Each TDMR is
> 
> 					 ^ "512-byte aligned"
> 
> Right?

Yes.  Will update.

> 
> > +	 * allocated individually within construct_tdmrs() to meet
> > +	 * this requirement.
> > +	 */
> > +	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
> > +			GFP_KERNEL);
> 
> Where, exactly is that alignment provided?  A 'struct tdmr_info *' is 8
> bytes so a tdx_sysinfo.max_tdmrs=8 kcalloc() would only guarantee
> 64-byte alignment.

The entries in the array only contain a pointer to TDMR_INFO.  The actual
TDMR_INFO is allocated separately. The array itself is never used by TDX
hardware so it doesn't matter.  We just need to guarantee each TDMR_INFO is
512B-byte aligned.

> 
> Also, I'm surprised that this is an array of virtual address pointers.
> The previous interactions with the TDX module seemed to all take
> physical addresses.  How is it that this hardware structure which has
> hardware alignment constraints is holding virtual addresses?

In later patches when TDMRs are configured to the TDX module, the input will be
converted to physical address, and there will be another array which is used by
the TDX module hardware.  This array is used to by kernel only to construct
TDMRs.

> 
> > +	if (!tdmr_array) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	/* Construct TDMRs to build TDX memory */
> > +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
> > +	if (ret)
> > +		goto out_free_tdmrs;
> > +
> >  	/*
> >  	 * Return -EFAULT until all steps of TDX module
> >  	 * initialization are done.
> >  	 */
> >  	ret = -EFAULT;
> 
> There's the -EFAULT again.  I'd replace these with a better error code.

I couldn't think out a better error code.  -EINVAL looks doesn't suit.  -EAGAIN
also doesn't make sense for now since we always shutdown the TDX module in case
of any error so caller should never retry.  I think we need some error code to
tell "the job isn't done yet".  Perhaps -EBUSY?



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-28  0:50         ` Dave Hansen
@ 2022-04-28  0:58           ` Kai Huang
  2022-04-29  1:40             ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28  0:58 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> On 4/27/22 17:37, Kai Huang wrote:
> > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > 
> > I thought we could document this in the documentation saying that this code can
> > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > can change the code and the documentation  when we add the support of those
> > features in the future, and update the documentation.
> > 
> > If 5 years later someone takes this code, he/she should take a look at the
> > documentation and figure out that he/she should choose a newer kernel if the
> > machine support those features.
> > 
> > I'll think about design solutions if above doesn't look good for you.
> 
> No, it doesn't look good to me.
> 
> You can't just say:
> 
> 	/*
> 	 * This code will eat puppies if used on systems with hotplug.
> 	 */
> 
> and merrily await the puppy bloodbath.
> 
> If it's not compatible, then you have to *MAKE* it not compatible in a
> safe, controlled way.
> 
> > > You can't just ignore the problems because they're not present on one
> > > version of the hardware.
> 
> Please, please read this again ^^

OK.  I'll think about solutions and come back later.

> 
> > > What about all the concerns about TDX module configuration changing?
> > 
> > Leaving the TDX module in fully initialized state or shutdown state (in case of
> > error during it's initialization) to the new kernel is fine.  If the new kernel
> > doesn't use TDX at all, then the TDX module won't access memory using it's
> > global TDX KeyID.  If the new kernel wants to use TDX, it will fail on the very
> > first SEAMCALL when it tries to initialize the TDX module, and won't use
> > SEAMCALL to call the TDX module again.  If the new kernel doesn't follow this,
> > then it is a bug in the new kernel, or the new kernel is malicious, in which
> > case it can potentially corrupt the data.  But I don't think we need to consider
> > this as if the new kernel is malicious, then it can corrupt data anyway.
> > 
> > Does this make sense?
> 
> No, I'm pretty lost.  But, I'll look at the next version of this with
> fresh eyes and hopefully you'll have had time to streamline the text by
> then.

OK thanks.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-26 20:13 ` Dave Hansen
  2022-04-27  1:15   ` Kai Huang
@ 2022-04-28  1:01   ` Dan Williams
  2022-04-28  1:21     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-04-28  1:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
[..]
> > 3. Memory hotplug
> >
> > The first generation of TDX architecturally doesn't support memory
> > hotplug.  And the first generation of TDX-capable platforms don't support
> > physical memory hotplug.  Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> >
> > A special case of memory hotplug is adding NVDIMM as system RAM using

Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...

> > kmem driver.  However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
>
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

The assumption is already broken today with NVDIMM-N. The lack of
DDR-T support on TDX enabled platforms has zero effect on DDR-based
persistent memory solutions. In other words, please describe the
actual software and hardware conflicts at play here, and do not make
the mistake of assuming that "no DDR-T support on TDX platforms" ==
"no NVDIMM support".

> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM.  To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.

I am not sure what you are trying to say here?

> > 4. CPU hotplug
> >
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either.  Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.

What are the actual challenges posed to TDX with respect to CPU hotplug?

> > Also, only TDX module initialization requires all BIOS-enabled cpus are

Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
/proc/cpuinfo for example.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-28  0:53     ` Kai Huang
@ 2022-04-28  1:07       ` Dave Hansen
  2022-04-28  1:35         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28  1:07 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/27/22 17:53, Kai Huang wrote:
> On Wed, 2022-04-27 at 15:24 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> TDX provides increased levels of memory confidentiality and integrity.
>>> This requires special hardware support for features like memory
>>> encryption and storage of memory integrity checksums.  Not all memory
>>> satisfies these requirements.
>>>
>>> As a result, TDX introduced the concept of a "Convertible Memory Region"
>>> (CMR).  During boot, the firmware builds a list of all of the memory
>>> ranges which can provide the TDX security guarantees.  The list of these
>>> ranges, along with TDX module information, is available to the kernel by
>>> querying the TDX module.
>>>
>>> In order to provide crypto protection to TD guests, the TDX architecture
>>
>> There's that "crypto protection" thing again.  I'm not really a fan of
>> the changes made to this changelog since I wrote it. :)
> 
> Sorry about that.  I'll remove "In order to provide crypto protection to TD
> guests".

Seriously, though.  I took the effort to write these changelogs for you.
 They were fine.  I'm not stoked about needing to proofread them again.

>>> also needs additional metadata to record things like which TD guest
>>> "owns" a given page of memory.  This metadata essentially serves as the
>>> 'struct page' for the TDX module.  The space for this metadata is not
>>> reserved by the hardware upfront and must be allocated by the kernel
>>
>> 			    ^ "up front"
> 
> Thanks will change to "up front".
> 
> Btw, the gmail grammar check gives me a red line if I use "up front", but it
> doesn't complain "upfront".

I'm pretty sure it's wrong.  "up front" is an adverb that applies to
"reserved".  "Upfront" is an adjective and not how you used it in that
sentence.

>>> +	 * allocated individually within construct_tdmrs() to meet
>>> +	 * this requirement.
>>> +	 */
>>> +	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
>>> +			GFP_KERNEL);
>>
>> Where, exactly is that alignment provided?  A 'struct tdmr_info *' is 8
>> bytes so a tdx_sysinfo.max_tdmrs=8 kcalloc() would only guarantee
>> 64-byte alignment.
> 
> The entries in the array only contain a pointer to TDMR_INFO.  The actual
> TDMR_INFO is allocated separately. The array itself is never used by TDX
> hardware so it doesn't matter.  We just need to guarantee each TDMR_INFO is
> 512B-byte aligned.

The comment was clear as mud about this.  If you're going to talk about
alignment, then do it near the allocation that guarantees the alignment,
not in some other function near *ANOTHER* allocation.

Also, considering that you're about to go allocate potentially gigabytes
of physically contiguous memory, it seems laughable that you'd go to any
trouble at all to allocate an array of pointers here.  Why not just

	kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tmdr_info), ...);

Or, heck, just vmalloc() the dang thing.  Why even bother with the array
of pointers?


>>> +	if (!tdmr_array) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	/* Construct TDMRs to build TDX memory */
>>> +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
>>> +	if (ret)
>>> +		goto out_free_tdmrs;
>>> +
>>>  	/*
>>>  	 * Return -EFAULT until all steps of TDX module
>>>  	 * initialization are done.
>>>  	 */
>>>  	ret = -EFAULT;
>>
>> There's the -EFAULT again.  I'd replace these with a better error code.
> 
> I couldn't think out a better error code.  -EINVAL looks doesn't suit.  -EAGAIN
> also doesn't make sense for now since we always shutdown the TDX module in case
> of any error so caller should never retry.  I think we need some error code to
> tell "the job isn't done yet".  Perhaps -EBUSY?

Is this going to retry if it sees -EFAULT or -EBUSY?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-28  1:01   ` Dan Williams
@ 2022-04-28  1:21     ` Kai Huang
  2022-04-29  2:58       ` Dan Williams
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28  1:21 UTC (permalink / raw)
  To: Dan Williams, Dave Hansen
  Cc: Linux Kernel Mailing List, KVM list, Sean Christopherson,
	Paolo Bonzini, Brown, Len, Luck, Tony, Rafael J Wysocki,
	Reinette Chatre, Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> [..]
> > > 3. Memory hotplug
> > > 
> > > The first generation of TDX architecturally doesn't support memory
> > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > physical memory hotplug.  Since it physically cannot happen, this series
> > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > 
> > > A special case of memory hotplug is adding NVDIMM as system RAM using
> 
> Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> 
> > > kmem driver.  However the first generation of TDX-capable platforms
> > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > happen either.
> > 
> > What prevents this code from today's code being run on tomorrow's
> > platforms and breaking these assumptions?
> 
> The assumption is already broken today with NVDIMM-N. The lack of
> DDR-T support on TDX enabled platforms has zero effect on DDR-based
> persistent memory solutions. In other words, please describe the
> actual software and hardware conflicts at play here, and do not make
> the mistake of assuming that "no DDR-T support on TDX platforms" ==
> "no NVDIMM support".

Sorry I got this information from planning team or execution team I guess. I was
told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable
machine.  "co-exist" means they cannot be turned on simultaneously on the same
platform.  I am also not aware NVDIMM-N, nor the difference between DDR based
and DDR-T based persistent memory.  Could you give some more background here so
I can take a look?

> 
> > > Another case is admin can use 'memmap' kernel command line to create
> > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > hotplug code to prevent this from happening, this series always include
> > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> 
> I am not sure what you are trying to say here?

We want to always make sure the memory managed by page allocator is TDX memory.
So if the legacy PMEMs are unconditionally configured as TDX memory, then we
don't need to prevent them from being added as system memory via kmem driver.

> 
> > > 4. CPU hotplug
> > > 
> > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > either.  Since this physically cannot happen, this series doesn't add any
> > > check in ACPI CPU hotplug code path to disable it.
> 
> What are the actual challenges posed to TDX with respect to CPU hotplug?

During the TDX module initialization, there is a step to call SEAMCALL on all
logical cpus to initialize per-cpu TDX staff.  TDX doesn't support initializing
the new hot-added CPUs after the initialization.  There are MCHECK/BIOS changes
to enforce this check too I guess but I don't know details about this.

> 
> > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> 
> Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> /proc/cpuinfo for example.

It means the CPUs with "enable" bit set in the MADT table.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-28  1:07       ` Dave Hansen
@ 2022-04-28  1:35         ` Kai Huang
  2022-04-28  3:40           ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28  1:35 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 18:07 -0700, Dave Hansen wrote:
> On 4/27/22 17:53, Kai Huang wrote:
> > On Wed, 2022-04-27 at 15:24 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > TDX provides increased levels of memory confidentiality and integrity.
> > > > This requires special hardware support for features like memory
> > > > encryption and storage of memory integrity checksums.  Not all memory
> > > > satisfies these requirements.
> > > > 
> > > > As a result, TDX introduced the concept of a "Convertible Memory Region"
> > > > (CMR).  During boot, the firmware builds a list of all of the memory
> > > > ranges which can provide the TDX security guarantees.  The list of these
> > > > ranges, along with TDX module information, is available to the kernel by
> > > > querying the TDX module.
> > > > 
> > > > In order to provide crypto protection to TD guests, the TDX architecture
> > > 
> > > There's that "crypto protection" thing again.  I'm not really a fan of
> > > the changes made to this changelog since I wrote it. :)
> > 
> > Sorry about that.  I'll remove "In order to provide crypto protection to TD
> > guests".
> 
> Seriously, though.  I took the effort to write these changelogs for you.
>  They were fine.  I'm not stoked about needing to proofread them again.

Yeah pretty clear to me now. Really thanks for your time.

Won't happen again.  If there's something I feel not right, I'll raise but not
slightly change.

> 
> > > > also needs additional metadata to record things like which TD guest
> > > > "owns" a given page of memory.  This metadata essentially serves as the
> > > > 'struct page' for the TDX module.  The space for this metadata is not
> > > > reserved by the hardware upfront and must be allocated by the kernel
> > > 
> > > 			    ^ "up front"
> > 
> > Thanks will change to "up front".
> > 
> > Btw, the gmail grammar check gives me a red line if I use "up front", but it
> > doesn't complain "upfront".
> 
> I'm pretty sure it's wrong.  "up front" is an adverb that applies to
> "reserved".  "Upfront" is an adjective and not how you used it in that
> sentence.

Thanks for explaining.  Anyway the gmail grammar can have bug.

> 
> > > > +	 * allocated individually within construct_tdmrs() to meet
> > > > +	 * this requirement.
> > > > +	 */
> > > > +	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
> > > > +			GFP_KERNEL);
> > > 
> > > Where, exactly is that alignment provided?  A 'struct tdmr_info *' is 8
> > > bytes so a tdx_sysinfo.max_tdmrs=8 kcalloc() would only guarantee
> > > 64-byte alignment.
> > 
> > The entries in the array only contain a pointer to TDMR_INFO.  The actual
> > TDMR_INFO is allocated separately. The array itself is never used by TDX
> > hardware so it doesn't matter.  We just need to guarantee each TDMR_INFO is
> > 512B-byte aligned.
> 
> The comment was clear as mud about this.  If you're going to talk about
> alignment, then do it near the allocation that guarantees the alignment,
> not in some other function near *ANOTHER* allocation.
> 
> Also, considering that you're about to go allocate potentially gigabytes
> of physically contiguous memory, it seems laughable that you'd go to any
> trouble at all to allocate an array of pointers here.  Why not just
> 
> 	kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tmdr_info), ...);

kmalloc() guarantees the size-alignment if the size is power-of-two.  TDMR_INFO
(512-bytes) itself is  power of two, but the 'max_tdmrs x sizeof(TDMR_INFO)' may
not be power of two.  For instance, when max_tdmrs == 3, the result is not
power-of-two.

Or am I wrong? I am not good at math though.

> 
> Or, heck, just vmalloc() the dang thing.  Why even bother with the array
> of pointers?
> 
> 
> > > > +	if (!tdmr_array) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	/* Construct TDMRs to build TDX memory */
> > > > +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
> > > > +	if (ret)
> > > > +		goto out_free_tdmrs;
> > > > +
> > > >  	/*
> > > >  	 * Return -EFAULT until all steps of TDX module
> > > >  	 * initialization are done.
> > > >  	 */
> > > >  	ret = -EFAULT;
> > > 
> > > There's the -EFAULT again.  I'd replace these with a better error code.
> > 
> > I couldn't think out a better error code.  -EINVAL looks doesn't suit.  -EAGAIN
> > also doesn't make sense for now since we always shutdown the TDX module in case
> > of any error so caller should never retry.  I think we need some error code to
> > tell "the job isn't done yet".  Perhaps -EBUSY?
> 
> Is this going to retry if it sees -EFAULT or -EBUSY?

No.  Currently we always shutdown the module in case of any error.  Caller won't
be able to retry.

In the future, this can be optimized.  We don't shutdown the module in case of
*some* error (i.e. -ENOMEM), but record an internal state when error happened,
so the caller can retry again.  For now, there's no retry.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-28  1:35         ` Kai Huang
@ 2022-04-28  3:40           ` Dave Hansen
  2022-04-28  3:55             ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28  3:40 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/27/22 18:35, Kai Huang wrote:
> On Wed, 2022-04-27 at 18:07 -0700, Dave Hansen wrote:
>> Also, considering that you're about to go allocate potentially gigabytes
>> of physically contiguous memory, it seems laughable that you'd go to any
>> trouble at all to allocate an array of pointers here.  Why not just
>>
>> 	kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tmdr_info), ...);
> 
> kmalloc() guarantees the size-alignment if the size is power-of-two.  TDMR_INFO
> (512-bytes) itself is  power of two, but the 'max_tdmrs x sizeof(TDMR_INFO)' may
> not be power of two.  For instance, when max_tdmrs == 3, the result is not
> power-of-two.
> 
> Or am I wrong? I am not good at math though.

No, you're right, the kcalloc() wouldn't work for odd sizes.

But, the point is still that you don't need an array of pointers.  Use
vmalloc().  Use a plain old alloc_pages_exact().  Why bother wasting
the memory and addiong the complexity of an array of pointers?

>> Or, heck, just vmalloc() the dang thing.  Why even bother with the array
>> of pointers?
>>
>>
>>>>> +	if (!tdmr_array) {
>>>>> +		ret = -ENOMEM;
>>>>> +		goto out;
>>>>> +	}
>>>>> +
>>>>> +	/* Construct TDMRs to build TDX memory */
>>>>> +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
>>>>> +	if (ret)
>>>>> +		goto out_free_tdmrs;
>>>>> +
>>>>>  	/*
>>>>>  	 * Return -EFAULT until all steps of TDX module
>>>>>  	 * initialization are done.
>>>>>  	 */
>>>>>  	ret = -EFAULT;
>>>>
>>>> There's the -EFAULT again.  I'd replace these with a better error code.
>>>
>>> I couldn't think out a better error code.  -EINVAL looks doesn't suit.  -EAGAIN
>>> also doesn't make sense for now since we always shutdown the TDX module in case
>>> of any error so caller should never retry.  I think we need some error code to
>>> tell "the job isn't done yet".  Perhaps -EBUSY?
>>
>> Is this going to retry if it sees -EFAULT or -EBUSY?
> 
> No.  Currently we always shutdown the module in case of any error.  Caller won't
> be able to retry.
> 
> In the future, this can be optimized.  We don't shutdown the module in case of
> *some* error (i.e. -ENOMEM), but record an internal state when error happened,
> so the caller can retry again.  For now, there's no retry.

Just make the error codes -EINVAL, please.  I don't think anything else
makes sense.


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory
  2022-04-28  3:40           ` Dave Hansen
@ 2022-04-28  3:55             ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-28  3:55 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-04-27 at 20:40 -0700, Dave Hansen wrote:
> On 4/27/22 18:35, Kai Huang wrote:
> > On Wed, 2022-04-27 at 18:07 -0700, Dave Hansen wrote:
> > > Also, considering that you're about to go allocate potentially gigabytes
> > > of physically contiguous memory, it seems laughable that you'd go to any
> > > trouble at all to allocate an array of pointers here.  Why not just
> > > 
> > > 	kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tmdr_info), ...);
> > 
> > kmalloc() guarantees the size-alignment if the size is power-of-two.  TDMR_INFO
> > (512-bytes) itself is  power of two, but the 'max_tdmrs x sizeof(TDMR_INFO)' may
> > not be power of two.  For instance, when max_tdmrs == 3, the result is not
> > power-of-two.
> > 
> > Or am I wrong? I am not good at math though.
> 
> No, you're right, the kcalloc() wouldn't work for odd sizes.
> 
> But, the point is still that you don't need an array of pointers.  Use
> vmalloc().  Use a plain old alloc_pages_exact().  Why bother wasting
> the memory and addiong the complexity of an array of pointers?

OK.  This makes sense.

One thing I didn't say clearly is TDMR_INFO is 512-byte aligned, but not could
be larger than 512 bytes, and the maximum number of reserved areas in TDMR_INFO
is enumerated via TDSYSINFO_STRUCT.  We can always roundup TDMR_INFO size to be
512-byte aligned, and calculate enough pages to hold maximum number of
TDMR_INFO.  In this case, we can still guarantee each TDMR_INFO is 512-byte
aligned.

I'll change to use alloc_pages_exact(), since we can get physical address of
TDMR_INFO from it easily.

> 
> > > Or, heck, just vmalloc() the dang thing.  Why even bother with the array
> > > of pointers?
> > > 
> > > 
> > > > > > +	if (!tdmr_array) {
> > > > > > +		ret = -ENOMEM;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > > +	/* Construct TDMRs to build TDX memory */
> > > > > > +	ret = construct_tdmrs(tdmr_array, &tdmr_num);
> > > > > > +	if (ret)
> > > > > > +		goto out_free_tdmrs;
> > > > > > +
> > > > > >  	/*
> > > > > >  	 * Return -EFAULT until all steps of TDX module
> > > > > >  	 * initialization are done.
> > > > > >  	 */
> > > > > >  	ret = -EFAULT;
> > > > > 
> > > > > There's the -EFAULT again.  I'd replace these with a better error code.
> > > > 
> > > > I couldn't think out a better error code.  -EINVAL looks doesn't suit.  -EAGAIN
> > > > also doesn't make sense for now since we always shutdown the TDX module in case
> > > > of any error so caller should never retry.  I think we need some error code to
> > > > tell "the job isn't done yet".  Perhaps -EBUSY?
> > > 
> > > Is this going to retry if it sees -EFAULT or -EBUSY?
> > 
> > No.  Currently we always shutdown the module in case of any error.  Caller won't
> > be able to retry.
> > 
> > In the future, this can be optimized.  We don't shutdown the module in case of
> > *some* error (i.e. -ENOMEM), but record an internal state when error happened,
> > so the caller can retry again.  For now, there's no retry.
> 
> Just make the error codes -EINVAL, please.  I don't think anything else
> makes sense.
> 

OK will do.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-28  0:15     ` Kai Huang
@ 2022-04-28 14:06       ` Dave Hansen
  2022-04-28 23:14         ` Kai Huang
  2022-05-18 22:30       ` Sagi Shahar
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 14:06 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/27/22 17:15, Kai Huang wrote:
> On Wed, 2022-04-27 at 15:15 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> TDX provides increased levels of memory confidentiality and integrity.
>>> This requires special hardware support for features like memory
>>> encryption and storage of memory integrity checksums.  Not all memory
>>> satisfies these requirements.
>>>
>>> As a result, TDX introduced the concept of a "Convertible Memory Region"
>>> (CMR).  During boot, the firmware builds a list of all of the memory
>>> ranges which can provide the TDX security guarantees.  The list of these
>>> ranges, along with TDX module information, is available to the kernel by
>>> querying the TDX module via TDH.SYS.INFO SEAMCALL.
>>>
>>> Host kernel can choose whether or not to use all convertible memory
>>> regions as TDX memory.  Before TDX module is ready to create any TD
>>> guests, all TDX memory regions that host kernel intends to use must be
>>> configured to the TDX module, using specific data structures defined by
>>> TDX architecture.  Constructing those structures requires information of
>>> both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
>>> to get this information as preparation to construct those structures.
>>>
>>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>>> ---
>>>  arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
>>>  arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
>>>  2 files changed, 192 insertions(+)
>>>
>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> index ef2718423f0f..482e6d858181 100644
>>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
>>>  
>>>  static struct p_seamldr_info p_seamldr_info;
>>>  
>>> +/* Base address of CMR array needs to be 512 bytes aligned. */
>>> +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
>>> +static int tdx_cmr_num;
>>> +static struct tdsysinfo_struct tdx_sysinfo;
>>
>> I really dislike mixing hardware and software structures.  Please make
>> it clear which of these are fully software-defined and which are part of
>> the hardware ABI.
> 
> Both 'struct tdsysinfo_struct' and 'struct cmr_info' are hardware structures. 
> They are defined in tdx.h which has a comment saying the data structures below
> this comment is hardware structures:
> 
> 	+/*
> 	+ * TDX architectural data structures
> 	+ */
> 
> It is introduced in the P-SEAMLDR patch.
> 
> Should I explicitly add comments around the variables saying they are used by
> hardware, something like:
> 
> 	/*
> 	 * Data structures used by TDH.SYS.INFO SEAMCALL to return CMRs and
> 	 * TDX module system information.
> 	 */

I think we know they are data structures. :)

But, saying:

	/* Used in TDH.SYS.INFO SEAMCALL ABI: */

*is* actually helpful.  It (probably) tells us where in the spec we can
find the definition and tells how it gets used.  Plus, it tells us this
isn't a software data structure.

>>> +	/* Get TDX module information and CMRs */
>>> +	ret = tdx_get_sysinfo();
>>> +	if (ret)
>>> +		goto out;
>>
>> Couldn't we get rid of that comment if you did something like:
>>
>> 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
> 
> Yes will do.
> 
>> and preferably make the variables function-local.
> 
> 'tdx_sysinfo' will be used by KVM too.

In other words, it's not a part of this series so I can't review whether
this statement is correct or whether there's a better way to hand this
information over to KVM.

This (minor) nugget influencing the design also isn't even commented or
addressed in the changelog.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-28  0:00         ` Kai Huang
@ 2022-04-28 14:27           ` Dave Hansen
  2022-04-28 23:44             ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 14:27 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/27/22 17:00, Kai Huang wrote:
> On Wed, 2022-04-27 at 07:49 -0700, Dave Hansen wrote:
> I think we can use pr_info_once() when all_cpus_booted() returns false, and get
> rid of printing "SEAMRR not enabled" in seamrr_enabled().  How about below?
> 
> static bool seamrr_enabled(void)
> {
> 	if (!all_cpus_booted())
> 		pr_info_once("Not all present CPUs have been booted.  Report
> SEAMRR as not enabled");
> 
> 	return __seamrr_enabled();
> }
> 
> And we don't print "SEAMRR not enabled".

That's better, but even better than that would be removing all that
SEAMRR gunk in the first place.

>>>>> +	/*
>>>>> +	 * TDX requires at least two KeyIDs: one global KeyID to
>>>>> +	 * protect the metadata of the TDX module and one or more
>>>>> +	 * KeyIDs to run TD guests.
>>>>> +	 */
>>>>> +	return tdx_keyid_num >= 2;
>>>>> +}
>>>>> +
>>>>> +static int __tdx_detect(void)
>>>>> +{
>>>>> +	/* The TDX module is not loaded if SEAMRR is disabled */
>>>>> +	if (!seamrr_enabled()) {
>>>>> +		pr_info("SEAMRR not enabled.\n");
>>>>> +		goto no_tdx_module;
>>>>> +	}
>>>>
>>>> Why even bother with the SEAMRR stuff?  It sounded like you can "ping"
>>>> the module with SEAMCALL.  Why not just use that directly?
>>>
>>> SEAMCALL will cause #GP if SEAMRR is not enabled.  We should check whether
>>> SEAMRR is enabled before making SEAMCALL.
>>
>> So...  You could actually get rid of all this code.  if SEAMCALL #GP's,
>> then you say, "Whoops, the firmware didn't load the TDX module
>> correctly, sorry."
> 
> Yes we can just use the first SEAMCALL (TDH.SYS.INIT) to detect whether TDX
> module is loaded.  If SEAMCALL is successful, the module is loaded.
> 
> One problem is currently the patch to flush cache for kexec() uses
> seamrr_enabled() and tdx_keyid_sufficient() to determine whether we need to
> flush the cache.  The reason is, similar to SME, the flush is done in
> stop_this_cpu(), but the status of TDX module initialization is protected by
> mutex, so we cannot use TDX module status in stop_this_cpu() to determine
> whether to flush.
> 
> If that patch makes sense, I think we still need to detect SEAMRR?

Please go look at stop_this_cpu() closely.  What are the AMD folks doing
for SME exactly?  Do they, for instance, do the WBINVD when the kernel
used SME?  No, they just use a pretty low-level check if the processor
supports SME.

Doing the same kind of thing for TDX is fine.  You could check the MTRR
MSR bits that tell you if SEAMRR is supported and then read the MSR
directly.  You could check the CPUID enumeration for MKTME or
CPUID.B.0.EDX (I'm not even sure what this is but the SEAMCALL spec says
it is part of SEAMCALL operation).

Just like the SME test, it doesn't even need to be precise.  It just
needs to be 100% accurate in that it is *ALWAYS* set for any system that
might have dirtied cache aliases.

I'm not sure why you are so fixated on SEAMRR specifically for this.


...
> "During initializing the TDX module, one step requires some SEAMCALL must be
> done on all logical cpus enabled by BIOS, otherwise a later step will fail. 
> Disable CPU hotplug during the initialization process to prevent any CPU going
> offline during initializing the TDX module.  Note it is caller's responsibility
> to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
> are online."

But, what if a CPU went offline just before this lock was taken?  What
if the caller make sure all present CPUs are online, makes the call,
then a CPU is taken offline.  The lock wouldn't do any good.

What purpose does the lock serve?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 11/21] x86/virt/tdx: Choose to use all system RAM as TDX memory
  2022-04-06  4:49 ` [PATCH v3 11/21] x86/virt/tdx: Choose to use " Kai Huang
  2022-04-20 20:55   ` Isaku Yamahata
@ 2022-04-28 15:54   ` Dave Hansen
  2022-04-29  7:32     ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 15:54 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> As one step of initializing the TDX module, the memory regions that the
> TDX module can use must be configured to it via an array of 'TD Memory

"can use must be"?

> Regions' (TDMR).  The kernel is responsible for choosing which memory
> regions to be used as TDX memory and building the array of TDMRs to
> cover those memory regions.
> 
> The first generation of TDX-capable platforms basically guarantees all
> system RAM regions during machine boot are Convertible Memory Regions
> (excluding the memory below 1MB) and can be used by TDX.  The memory
> pages allocated to TD guests can be any pages managed by the page
> allocator.  To avoid having to modify the page allocator to distinguish
> TDX and non-TDX memory allocation, adopt a simple policy to use all
> system RAM regions as TDX memory.  The low 1MB pages are excluded from
> TDX memory since they are not in CMRs in some platforms (those pages are
> reserved at boot time and won't be managed by page allocator anyway).
> 
> This policy could be revised later if future TDX generations break
> the guarantee or when the size of the metadata (~1/256th of the size of
> the TDX usable memory) becomes a concern.  At that time a CMR-aware
> page allocator may be necessary.

Remember that you have basically three or four short sentences to get a
reviewer's attention.  There's a lot of noise in that changelog.  Can
you trim it down or at least make the first bit less jargon-packed and
more readable?

> Also, on the first generation of TDX-capable machine, the system RAM
> ranges discovered during boot time are all memory regions that kernel
> can use during its runtime.  This is because the first generation of TDX
> architecturally doesn't support ACPI memory hotplug 

"Architecturally" usually means: written down and agreed to by hardware
and software alike.  Is this truly written down somewhere?  I don't
recall seeing it in the architecture documents.

I fear this is almost the _opposite_ of architecture: it's basically a
fortunate coincidence.

> (CMRs are generated
> during machine boot and are static during machine's runtime).  Also, the
> first generation of TDX-capable platform doesn't support TDX and ACPI
> memory hotplug at the same time on a single machine.  Another case of
> memory hotplug is user may use NVDIMM as system RAM via kmem driver.
> But the first generation of TDX-capable machine doesn't support TDX and
> NVDIMM simultaneously, therefore in practice it cannot happen.  One
> special case is user may use 'memmap' kernel command line to reserve
> part of system RAM as x86 legacy PMEMs, and user can theoretically add
> them as system RAM via kmem driver.  This can be resolved by always
> treating legacy PMEMs as TDX memory.

Again, there's a ton of noise here.  I'm struggling to get the point.

> Implement a helper to loop over all RAM entries in e820 table to find
> all system RAM ranges, as a preparation to covert all of them to TDX
> memory.  Use 'e820_table', rather than 'e820_table_firmware' to honor
> 'mem' and 'memmap' command lines. 

*How* does this honor them?  For instance, if I do mem=4G, will the TDX
code limit itself to converting 4GB for TDX?

> Following e820__memblock_setup(),
> both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN types are treated as TDX
> memory, and contiguous ranges in the same NUMA node are merged together.

Again, you're just rehashing the code's logic in English.  That's not
what a changelog is for.

> One difference is, as mentioned above, x86 legacy PMEMs (E820_TYPE_PRAM)
> are also always treated as TDX memory.  They are underneath RAM, and
> they could be used as TD guest memory.  Always including them as TDX
> memory also avoids having to modify memory hotplug code to handle adding
> them as system RAM via kmem driver.

I think you can replace virtually this entire changelog with the following:

	Consider a wide variety of e820 entries as RAM.  This ensures
	that any RAM that might *possibly* end up being used for TDX is
	fully converted.  If the selection here was more conservative,
	it might lead to errors adding memory to TDX guests at runtime
	which would be very hard to handle.

> To begin with, sanity check all memory regions found in e820 are fully
> covered by any CMR and can be used as TDX memory.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/virt/vmx/tdx/tdx.c | 228 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 228 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9113bf09f358..7414625b938f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1972,6 +1972,7 @@ config INTEL_TDX_HOST
>  	default n
>  	depends on CPU_SUP_INTEL
>  	depends on X86_64
> +	select NUMA_KEEP_MEMINFO if NUMA
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX

*This* NUMA_KEEP_MEMINFO thing is exactly what you should talk about in
the changelog.  Add a sentence telling us why it is needed, or add a
good comment.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ec27350d53c1..6b0c51aaa7f2 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -14,11 +14,13 @@
>  #include <linux/smp.h>
>  #include <linux/atomic.h>
>  #include <linux/slab.h>
> +#include <linux/math.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/cpufeature.h>
>  #include <asm/cpufeatures.h>
>  #include <asm/virtext.h>
> +#include <asm/e820/api.h>
>  #include <asm/tdx.h>
>  #include "tdx.h"
>  
> @@ -595,6 +597,222 @@ static int tdx_get_sysinfo(void)
>  	return sanitize_cmrs(tdx_cmr_array, cmr_num);
>  }
>  
> +/* Check whether one e820 entry is RAM and could be used as TDX memory */
> +static bool e820_entry_is_ram(struct e820_entry *entry)

It's not "is RAM" it's "might be used for TDX".

> +{
> +	/*
> +	 * Besides E820_TYPE_RAM, E820_TYPE_RESERVED_KERN type entries
> +	 * are also treated as TDX memory as they are also added to
> +	 * memblock.memory in e820__memblock_setup().

"are also treated" is really passive language.  Make it imperative voice:

	Treat E820_TYPE_RESERVED_KERN entries as possible TDX memory in
	addition to E820_TYPE_RAM.

I'm also not sure I buy the argument about E820_TYPE_RESERVED_KERN.  I
don't understand the implications of it being added to memblock.memory.

> +	 * E820_TYPE_SOFT_RESERVED type entries are excluded as they are
> +	 * marked as reserved and are not later freed to page allocator
> +	 * (only part of kernel image, initrd, etc are freed to page
> +	 * allocator).

Again, not "are excluded".  Make it: "Exclude E820_TYPE_SOFT_RESERVED
type entries ..."

Let's also look at the comments for E820_TYPE_SOFT_RESERVED:

>         /*
>          * Special-purpose memory is indicated to the system via the
>          * EFI_MEMORY_SP attribute. Define an e820 translation of this
>          * memory type for the purpose of reserving this range and
>          * marking it with the IORES_DESC_SOFT_RESERVED designation.
>          */
>         E820_TYPE_SOFT_RESERVED = 0xefffffff,

What makes you think this can never be freed back into the page allocator?


> +	 * Also unconditionally treat x86 legacy PMEMs (E820_TYPE_PRAM)
> +	 * as TDX memory since they are RAM underneath, and could be used
> +	 * as TD guest memory.
> +	 */
> +	return (entry->type == E820_TYPE_RAM) ||
> +		(entry->type == E820_TYPE_RESERVED_KERN) ||
> +		(entry->type == E820_TYPE_PRAM);
> +}

I really dislike how you did this.

Imagine it was:

	/* Plain old RAM, obviously needs TDX protection: */
	if (entry->type == E820_TYPE_RAM)
		return true;
	
	/*
	 * Talk specifically about E820_TYPE_RESERVED_KERN ...
	 */
	if (entry->type == E820_TYPE_RESERVED_KERN)
		return true;

See how that actually puts the comment close to the code that is
relevant to the comment?  It's way better than one massive, rambling
comment that isn't obviously connected to one of the code lines that
follows.

> +/*
> + * The low memory below 1MB is not covered by CMRs on some TDX platforms.
> + * In practice, this range cannot be used for guest memory because it is
> + * not managed by the page allocator due to boot-time reservation.  Just
> + * skip the low 1MB so this range won't be treated as TDX memory.

I'm going to steal a tglx term: word salad.  This patch set is full of
word salad.

If, in practice, this memory couldn't be used for TDX guests, we
wouldn't need to do anything here.  So, what's this code doing?

Are you saying that under no circumstances can memory <1MB get stuck in
ZONE_DMA and might end up getting used as TDX memory?

> + * Return true if the e820 entry is completely skipped, in which case
> + * caller should ignore this entry.  Otherwise the actual memory range
> + * after skipping the low 1MB is returned via @start and @end.
> + */
> +static bool e820_entry_skip_lowmem(struct e820_entry *entry, u64 *start,
> +				   u64 *end)
> +{
> +	u64 _start = entry->addr;
> +	u64 _end = entry->addr + entry->size;
> +
> +	if (_start < SZ_1M)
> +		_start = SZ_1M;
> +
> +	*start = _start;
> +	*end = _end;
> +
> +	return _start >= _end;
> +}

I see (barely) how this is excluding the lower 1MB from being used

> +/*
> + * Trim away non-page-aligned memory at the beginning and the end for a
> + * given region.  Return true when there are still pages remaining after
> + * trimming, and the trimmed region is returned via @start and @end.
> + */
> +static bool e820_entry_trim(u64 *start, u64 *end)
> +{
> +	u64 s, e;
> +
> +	s = round_up(*start, PAGE_SIZE);
> +	e = round_down(*end, PAGE_SIZE);
> +
> +	if (s >= e)
> +		return false;
> +
> +	*start = s;
> +	*end = e;
> +
> +	return true;
> +}
> +
> +/*
> + * Get the next memory region (excluding low 1MB) in e820.  @idx points
> + * to the entry to start to walk with.  Multiple memory regions in the
> + * same NUMA node that are contiguous are merged together (following
> + * e820__memblock_setup()).  The merged range is returned via @start and
> + * @end.  After return, @idx points to the next entry of the last RAM
> + * entry that has been walked, or table->nr_entries (indicating all
> + * entries in the e820 table have been walked).
> + */
> +static void e820_next_mem(struct e820_table *table, int *idx, u64 *start,
> +			  u64 *end)
> +{
> +	u64 rs, re;

Please give these real variable names.

> +	int rnid, i;
> +
> +again:
> +	rs = re = 0;
> +	for (i = *idx; i < table->nr_entries; i++) {
> +		struct e820_entry *entry = &table->entries[i];
> +		u64 s, e;
> +		int nid;
> +
> +		if (!e820_entry_is_ram(entry))
> +			continue;
> +
> +		if (e820_entry_skip_lowmem(entry, &s, &e))
> +			continue;
> +
> +		/*
> +		 * Found the first RAM entry.  Record it and keep
> +		 * looping to find other RAM entries that can be
> +		 * merged.
> +		 */
> +		if (!rs) {
> +			rs = s;
> +			re = e;
> +			rnid = phys_to_target_node(rs);
> +			if (WARN_ON_ONCE(rnid == NUMA_NO_NODE))
> +				rnid = 0;
> +			continue;
> +		}
> +
> +		/*
> +		 * Try to merge with previous RAM entry.  E820 entries
> +		 * are not necessarily page aligned.  For instance, the
> +		 * setup_data elements in boot_params are marked as
> +		 * E820_TYPE_RESERVED_KERN, and they may not be page
> +		 * aligned.  In e820__memblock_setup() all adjancent

							     ^ adjacent

Kai, please make sure to fire up your editor's spell checker.  My mail
client has one, so I'll catch all of these.  You might as well do it too.
		
> +		 * memory regions within the same NUMA node are merged to
> +		 * a single one, and the non-page-aligned parts (at the
> +		 * beginning and the end) are trimmed.  Follow the same
> +		 * rule here.
> +		 */
> +		nid = phys_to_target_node(s);
> +		if (WARN_ON_ONCE(nid == NUMA_NO_NODE))
> +			nid = 0;
> +		if ((nid == rnid) && (s == re)) {
> +			/* Merge with previous range and update the end */
> +			re = e;
> +			continue;
> +		}
> +
> +		/*
> +		 * Stop if current entry cannot be merged with previous
> +		 * one (or more) entries.
> +		 */
> +		break;
> +	}
> +
> +	/*
> +	 * @i is either the RAM entry that cannot be merged with previous
> +	 * one (or more) entries, or table->nr_entries.
> +	 */
> +	*idx = i;
> +	/*
> +	 * Trim non-page-aligned parts of [@rs, @re), which is either a
> +	 * valid memory region, or empty.  If there's nothing left after
> +	 * trimming and there are still entries that have not been
> +	 * walked, continue to walk.
> +	 */
> +	if (!e820_entry_trim(&rs, &re) && i < table->nr_entries)
> +		goto again;
> +
> +	*start = rs;
> +	*end = re;
> +}
> +
> +/*
> + * Helper to loop all e820 RAM entries with low 1MB excluded
> + * in a given e820 table.
> + */
> +#define _e820_for_each_mem(_table, _i, _start, _end)				\
> +	for ((_i) = 0, e820_next_mem((_table), &(_i), &(_start), &(_end));	\
> +		(_start) < (_end);						\
> +		e820_next_mem((_table), &(_i), &(_start), &(_end)))
> +
> +/*
> + * Helper to loop all e820 RAM entries with low 1MB excluded
> + * in kernel modified 'e820_table' to honor 'mem' and 'memmap' kernel
> + * command lines.
> + */
> +#define e820_for_each_mem(_i, _start, _end)	\
> +	_e820_for_each_mem(e820_table, _i, _start, _end)

This effectively adds a bunch of private, well-hidden e820 munging.  I'm
a bit surprised that this is all added code and doesn't reuse or
refactor a single line of the existing e820 code.

Also, this needs to throw away memory. But, it's happening much later in
boot, like when the first TD is launched.  Isn't this too late for e820
changes to affect what goes into the page allocator?

> +/* Check whether first range is the subrange of the second */
> +static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
> +{
> +	return (r1_start >= r2_start && r1_end <= r2_end) ? true : false;
> +}

Why is this bothering with the ?: form?

Won't this:

	return (r1_start >= r2_start && r1_end <= r2_end);

do *precisely* the same thing?


> +/* Check whether address range is covered by any CMR or not. */
> +static bool range_covered_by_cmr(struct cmr_info *cmr_array, int cmr_num,
> +				 u64 start, u64 end)
> +{
> +	int i;
> +
> +	for (i = 0; i < cmr_num; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +
> +		if (is_subrange(start, end, cmr->base, cmr->base + cmr->size))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/* Sanity check whether all e820 RAM entries are fully covered by CMRs. */
> +static int e820_check_against_cmrs(void)
> +{
> +	u64 start, end;
> +	int i;
> +
> +	/*
> +	 * Loop over e820_table to find all RAM entries and check
> +	 * whether they are all fully covered by any CMR.
> +	 */
> +	e820_for_each_mem(i, start, end) {
> +		if (!range_covered_by_cmr(tdx_cmr_array, tdx_cmr_num,
> +					start, end)) {
> +			pr_err("[0x%llx, 0x%llx) is not fully convertible memory\n",
> +					start, end);
> +			return -EFAULT;
> +		}
> +	}
> +
> +	return 0;
> +}

This is actively going out and modifying the e820, right?  That isn't
obvious at all.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM
  2022-04-06  4:49 ` [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM Kai Huang
@ 2022-04-28 16:22   ` Dave Hansen
  2022-04-29  7:24     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 16:22 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> The kernel configures TDX usable memory regions to the TDX module via
> an array of "TD Memory Region" (TDMR). 

One bit of language that's repeated in these changelogs that I don't
like is "configure ... to".  I think that's a misuse of the word
configure.  I'd say something more like:

	The kernel configures TDX-usable memory regions by passing an
	array of "TD Memory Regions" (TDMRs) to the TDX module.

Could you please take a look over this series and reword those?

> Each TDMR entry (TDMR_INFO)
> contains the information of the base/size of a memory region, the
> base/size of the associated Physical Address Metadata Table (PAMT) and
> a list of reserved areas in the region.
> 
> Create a number of TDMRs according to the verified e820 RAM entries.
> As the first step only set up the base/size information for each TDMR.
> 
> TDMR must be 1G aligned and the size must be in 1G granularity.  This

 ^ Each

> implies that one TDMR could cover multiple e820 RAM entries.  If a RAM
> entry spans the 1GB boundary and the former part is already covered by
> the previous TDMR, just create a new TDMR for the latter part.
> 
> TDX only supports a limited number of TDMRs (currently 64).  Abort the
> TDMR construction process when the number of TDMRs exceeds this
> limitation.

... and what does this *MEAN*?  Is TDX disabled?  Does it throw away the
RAM?  Does it eat puppies?

>  arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 138 insertions(+)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 6b0c51aaa7f2..82534e70df96 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -54,6 +54,18 @@
>  		((u32)(((_keyid_part) & 0xffffffffull) + 1))
>  #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
>  
> +/* TDMR must be 1gb aligned */
> +#define TDMR_ALIGNMENT		BIT_ULL(30)
> +#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
> +
> +/* Align up and down the address to TDMR boundary */
> +#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> +#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
> +
> +/* TDMR's start and end address */
> +#define TDMR_START(_tdmr)	((_tdmr)->base)
> +#define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)

Make these 'static inline's please.  #defines are only for constants or
things that can't use real functions.

>  /*
>   * TDX module status during initialization
>   */
> @@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
>  	return 0;
>  }
>  
> +/* The starting offset of reserved areas within TDMR_INFO */
> +#define TDMR_RSVD_START		64

				^ extra whitespace

> +static struct tdmr_info *__alloc_tdmr(void)
> +{
> +	int tdmr_sz;
> +
> +	/*
> +	 * TDMR_INFO's actual size depends on maximum number of reserved
> +	 * areas that one TDMR supports.
> +	 */
> +	tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
> +		sizeof(struct tdmr_reserved_area);

You have a structure for this.  I know this because it's the return type
of the function.  You have TDMR_RSVD_START available via the structure
itself.  So, derive that 64 either via:

	sizeof(struct tdmr_info)

or,

	offsetof(struct tdmr_info, reserved_areas);

Which would make things look like this:

	tdmr_base_sz = sizeof(struct tdmr_info);
	tdmr_reserved_area_sz = sizeof(struct tdmr_reserved_area) *
				tdx_sysinfo.max_reserved_per_tdmr;

	tdmr_sz = tdmr_base_sz + tdmr_reserved_area_sz;

Could you explain why on earth you felt the need for the TDMR_RSVD_START
#define?

> +	/*
> +	 * TDX requires TDMR_INFO to be 512 aligned.  Always align up

Again, 512 what?  512 pages?  512 hippos?

> +	 * TDMR_INFO size to 512 so the memory allocated via kzalloc()
> +	 * can meet the alignment requirement.
> +	 */
> +	tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> +
> +	return kzalloc(tdmr_sz, GFP_KERNEL);
> +}
> +
> +/* Create a new TDMR at given index in the TDMR array */
> +static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
> +{
> +	struct tdmr_info *tdmr;
> +
> +	if (WARN_ON_ONCE(tdmr_array[idx]))
> +		return NULL;
> +
> +	tdmr = __alloc_tdmr();
> +	tdmr_array[idx] = tdmr;
> +
> +	return tdmr;
> +}
> +
>  static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
>  {
>  	int i;
> @@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
>  	}
>  }
>  
> +/*
> + * Create TDMRs to cover all RAM entries in e820_table.  The created
> + * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
> + * number of TDMRs.  All entries in @tdmr_array must be initially NULL.
> + */
> +static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> +{
> +	struct tdmr_info *tdmr;
> +	u64 start, end;
> +	int i, tdmr_idx;
> +	int ret = 0;
> +
> +	tdmr_idx = 0;
> +	tdmr = alloc_tdmr(tdmr_array, 0);
> +	if (!tdmr)
> +		return -ENOMEM;
> +	/*
> +	 * Loop over all RAM entries in e820 and create TDMRs to cover
> +	 * them.  To keep it simple, always try to use one TDMR to cover
> +	 * one RAM entry.
> +	 */
> +	e820_for_each_mem(i, start, end) {
> +		start = TDMR_ALIGN_DOWN(start);
> +		end = TDMR_ALIGN_UP(end);
			    ^ vertically align those ='s, please.


> +		/*
> +		 * If the current TDMR's size hasn't been initialized, it
> +		 * is a new allocated TDMR to cover the new RAM entry.
> +		 * Otherwise the current TDMR already covers the previous
> +		 * RAM entry.  In the latter case, check whether the
> +		 * current RAM entry has been fully or partially covered
> +		 * by the current TDMR, since TDMR is 1G aligned.
> +		 */
> +		if (tdmr->size) {
> +			/*
> +			 * Loop to next RAM entry if the current entry
> +			 * is already fully covered by the current TDMR.
> +			 */
> +			if (end <= TDMR_END(tdmr))
> +				continue;

This loop is actually pretty well commented and looks OK.  The
TDMR_END() construct even adds to readability.  *BUT*, the

> +			/*
> +			 * If part of current RAM entry has already been
> +			 * covered by current TDMR, skip the already
> +			 * covered part.
> +			 */
> +			if (start < TDMR_END(tdmr))
> +				start = TDMR_END(tdmr);
> +
> +			/*
> +			 * Create a new TDMR to cover the current RAM
> +			 * entry, or the remaining part of it.
> +			 */
> +			tdmr_idx++;
> +			if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
> +				ret = -E2BIG;
> +				goto err;
> +			}
> +			tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
> +			if (!tdmr) {
> +				ret = -ENOMEM;
> +				goto err;
> +			}

This is a bit verbose for this loop.  Why not just hide the 'max_tdmrs'
inside the alloc_tdmr() function?  That will make this loop smaller and
easier to read.

> +		}
> +
> +		tdmr->base = start;
> +		tdmr->size = end - start;
> +	}
> +
> +	/* @tdmr_idx is always the index of last valid TDMR. */
> +	*tdmr_num = tdmr_idx + 1;
> +
> +	return 0;
> +err:
> +	/*
> +	 * Clean up already allocated TDMRs in case of error.  @tdmr_idx
> +	 * indicates the last TDMR that wasn't created successfully,
> +	 * therefore only needs to free @tdmr_idx TDMRs.
> +	 */
> +	free_tdmrs(tdmr_array, tdmr_idx);
> +	return ret;
> +}
> +
>  static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
>  {
>  	int ret;
> @@ -834,8 +967,13 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
>  	if (ret)
>  		goto err;
>  
> +	ret = create_tdmrs(tdmr_array, tdmr_num);
> +	if (ret)
> +		goto err;
> +
>  	/* Return -EFAULT until constructing TDMRs is done */
>  	ret = -EFAULT;
> +	free_tdmrs(tdmr_array, *tdmr_num);
>  err:
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-06  4:49 ` [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2022-04-28 17:12   ` Dave Hansen
  2022-04-29  7:46     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 17:12 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> In order to provide crypto protection to guests, the TDX module uses
> additional metadata to record things like which guest "owns" a given
> page of memory.  This metadata, referred as Physical Address Metadata
> Table (PAMT), essentially serves as the 'struct page' for the TDX
> module.  PAMTs are not reserved by hardware upfront.  They must be
> allocated by the kernel and then given to the TDX module.
> 
> TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.

s/respectively//

> Each PAMT must be a physically contiguous area from the Convertible

							^ s/the/a/

> Memory Regions (CMR).  However, the PAMTs which track pages in one TDMR
> do not need to reside within that TDMR but can be anywhere in CMRs.
> If one PAMT overlaps with any TDMR, the overlapping part must be
> reported as a reserved area in that particular TDMR.
> 
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).

This is also a good place to note the downsides of using
alloc_contig_pages().

> The current version of TDX supports at most 16 reserved areas per TDMR
> to cover both PAMTs and potential memory holes within the TDMR.  If many
> PAMTs are allocated within a single TDMR, 16 reserved areas may not be
> sufficient to cover all of them.
> 
> Adopt the following policies when allocating PAMTs for a given TDMR:
> 
>   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
>     the total number of reserved areas consumed for PAMTs.
>   - Try to first allocate PAMT from the local node of the TDMR for better
>     NUMA locality.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 166 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7414625b938f..ff68d0829bd7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
>  	depends on CPU_SUP_INTEL
>  	depends on X86_64
>  	select NUMA_KEEP_MEMINFO if NUMA
> +	depends on CONTIG_ALLOC
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 82534e70df96..1b807dcbc101 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -21,6 +21,7 @@
>  #include <asm/cpufeatures.h>
>  #include <asm/virtext.h>
>  #include <asm/e820/api.h>
> +#include <asm/pgtable.h>
>  #include <asm/tdx.h>
>  #include "tdx.h"
>  
> @@ -66,6 +67,16 @@
>  #define TDMR_START(_tdmr)	((_tdmr)->base)
>  #define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
>  
> +/* Page sizes supported by TDX */
> +enum tdx_page_sz {
> +	TDX_PG_4K = 0,
> +	TDX_PG_2M,
> +	TDX_PG_1G,
> +	TDX_PG_MAX,
> +};

Is that =0 required?  I thought the first enum was defined to be 0.

> +#define TDX_HPAGE_SHIFT	9
> +
>  /*
>   * TDX module status during initialization
>   */
> @@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
>  	return ret;
>  }
>  
> +/* Calculate PAMT size given a TDMR and a page size */
> +static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> +					enum tdx_page_sz pgsz)
> +{
> +	unsigned long pamt_sz;
> +
> +	pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
> +		tdx_sysinfo.pamt_entry_size;

That 'pgsz' thing is just hideous.  I'd *much* rather see something like
this:

static int tdx_page_size_shift(enum tdx_page_sz page_sz)
{
	switch (page_sz) {
	case TDX_PG_4K:
		return PAGE_SIZE;
	...
	}
}

That's easy to figure out what's going on.

> +	/* PAMT size must be 4K aligned */
> +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> +	return pamt_sz;
> +}
> +
> +/* Calculate the size of all PAMTs for a TDMR */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
> +{
> +	enum tdx_page_sz pgsz;
> +	unsigned long pamt_sz;
> +
> +	pamt_sz = 0;
> +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
> +		pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
> +
> +	return pamt_sz;
> +}

But, there are 3 separate pointers pointing to 3 separate PAMTs.  Why do
they all have to be contiguously allocated?

> +/*
> + * Locate the NUMA node containing the start of the given TDMR's first
> + * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
> + */

Please add a sentence or two on the implications here of what this means
when it happens.  Also, the joining of e820 regions seems like it might
span NUMA nodes.  What prevents that code from just creating one large
e820 area that leads to one large TDMR and horrible NUMA affinity for
these structures?

> +static int tdmr_get_nid(struct tdmr_info *tdmr)
> +{
> +	u64 start, end;
> +	int i;
> +
> +	/* Find the first RAM entry covered by the TDMR */
> +	e820_for_each_mem(i, start, end)
> +		if (end > TDMR_START(tdmr))
> +			break;

Brackets around the big loop, please.

> +	/*
> +	 * One TDMR must cover at least one (or partial) RAM entry,
> +	 * otherwise it is kernel bug.  WARN_ON() in this case.
> +	 */
> +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> +		return 0;
> +
> +	/*
> +	 * The first RAM entry may be partially covered by the previous
> +	 * TDMR.  In this case, use TDMR's start to find the NUMA node.
> +	 */
> +	if (start < TDMR_START(tdmr))
> +		start = TDMR_START(tdmr);
> +
> +	return phys_to_target_node(start);
> +}
> +
> +static int tdmr_setup_pamt(struct tdmr_info *tdmr)
> +{
> +	unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
> +	unsigned long pamt_sz[TDX_PG_MAX];
> +	unsigned long pamt_npages;
> +	struct page *pamt;
> +	enum tdx_page_sz pgsz;
> +	int nid;

Sooooooooooooooooooo close to reverse Christmas tree, but no cigar.
Please fix it.

> +	/*
> +	 * Allocate one chunk of physically contiguous memory for all
> +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> +	 * in overlapped TDMRs.
> +	 */

Ahh, this explains it.  Considering that tdmr_get_pamt_sz() is really
just two lines of code, I'd probably just the helper and open-code it
here.  Then you only have one place to comment on it.

> +	nid = tdmr_get_nid(tdmr);
> +	pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
> +	pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
> +			&node_online_map);
> +	if (!pamt)
> +		return -ENOMEM;
> +
> +	/* Calculate PAMT base and size for all supported page sizes. */
> +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> +		unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
> +
> +		pamt_base[pgsz] = tdmr_pamt_base;
> +		pamt_sz[pgsz] = sz;
> +
> +		tdmr_pamt_base += sz;
> +	}
> +
> +	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> +	tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
> +	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> +	tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
> +	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> +	tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];

This would all vertically align nicely if you renamed pamt_sz -> pamt_size.

> +	return 0;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> +	unsigned long pamt_pfn, pamt_sz;
> +
> +	pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;

Comment, please:

	/*
	 * The PAMT was allocated in one contiguous unit.  The 4k PAMT
	 * should always point to the beginning of that allocation.
	 */

> +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> +	if (!pamt_sz)
> +		return;
> +
> +	if (WARN_ON(!pamt_pfn))
> +		return;
> +
> +	free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_num; i++)
> +		tdmr_free_pamt(tdmr_array[i]);
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)

	"set_up", please, not "setup".

> +{
> +	int i, ret;
> +
> +	for (i = 0; i < tdmr_num; i++) {
> +		ret = tdmr_setup_pamt(tdmr_array[i]);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> +	return -ENOMEM;
> +}
> +
>  static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
>  {
>  	int ret;
> @@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
>  	if (ret)
>  		goto err;
>  
> +	ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
> +	if (ret)
> +		goto err_free_tdmrs;
> +
>  	/* Return -EFAULT until constructing TDMRs is done */
>  	ret = -EFAULT;
> +	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
> +err_free_tdmrs:
>  	free_tdmrs(tdmr_array, *tdmr_num);
>  err:
>  	return ret;
> @@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
>  	 * initialization are done.
>  	 */
>  	ret = -EFAULT;
> +	/*
> +	 * Free PAMTs allocated in construct_tdmrs() when TDX module
> +	 * initialization fails.
> +	 */
> +	if (ret)
> +		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
>  out_free_tdmrs:
>  	/*
>  	 * TDMRs are only used during initializing TDX module.  Always

In a follow-on patch, I'd like this to dump out (in a pr_debug() or
pr_info()) how much memory is consumed by PAMT allocations.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX host support
  2022-04-06  4:49 ` [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX host support Kai Huang
@ 2022-04-28 17:25   ` Dave Hansen
  0 siblings, 0 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 17:25 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/5/22 21:49, Kai Huang wrote:
> Enabling TDX consumes additional memory (used by TDX as metadata) and
> additional initialization time.  Introduce a kernel command line to
> allow to opt-in TDX host kernel support when user truly wants to use
> TDX.

From the cover letter:

	"This series doesn't initialize TDX at boot time"

Could you please square that circle for me?  How does a feature that
doesn't get initialized a boot time need a boot-time command line opt-in?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-28 14:06       ` Dave Hansen
@ 2022-04-28 23:14         ` Kai Huang
  2022-04-29 17:47           ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28 23:14 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 07:06 -0700, Dave Hansen wrote:
> On 4/27/22 17:15, Kai Huang wrote:
> > On Wed, 2022-04-27 at 15:15 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > TDX provides increased levels of memory confidentiality and integrity.
> > > > This requires special hardware support for features like memory
> > > > encryption and storage of memory integrity checksums.  Not all memory
> > > > satisfies these requirements.
> > > > 
> > > > As a result, TDX introduced the concept of a "Convertible Memory Region"
> > > > (CMR).  During boot, the firmware builds a list of all of the memory
> > > > ranges which can provide the TDX security guarantees.  The list of these
> > > > ranges, along with TDX module information, is available to the kernel by
> > > > querying the TDX module via TDH.SYS.INFO SEAMCALL.
> > > > 
> > > > Host kernel can choose whether or not to use all convertible memory
> > > > regions as TDX memory.  Before TDX module is ready to create any TD
> > > > guests, all TDX memory regions that host kernel intends to use must be
> > > > configured to the TDX module, using specific data structures defined by
> > > > TDX architecture.  Constructing those structures requires information of
> > > > both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
> > > > to get this information as preparation to construct those structures.
> > > > 
> > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > ---
> > > >  arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
> > > >  arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
> > > >  2 files changed, 192 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index ef2718423f0f..482e6d858181 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
> > > >  
> > > >  static struct p_seamldr_info p_seamldr_info;
> > > >  
> > > > +/* Base address of CMR array needs to be 512 bytes aligned. */
> > > > +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> > > > +static int tdx_cmr_num;
> > > > +static struct tdsysinfo_struct tdx_sysinfo;
> > > 
> > > I really dislike mixing hardware and software structures.  Please make
> > > it clear which of these are fully software-defined and which are part of
> > > the hardware ABI.
> > 
> > Both 'struct tdsysinfo_struct' and 'struct cmr_info' are hardware structures. 
> > They are defined in tdx.h which has a comment saying the data structures below
> > this comment is hardware structures:
> > 
> > 	+/*
> > 	+ * TDX architectural data structures
> > 	+ */
> > 
> > It is introduced in the P-SEAMLDR patch.
> > 
> > Should I explicitly add comments around the variables saying they are used by
> > hardware, something like:
> > 
> > 	/*
> > 	 * Data structures used by TDH.SYS.INFO SEAMCALL to return CMRs and
> > 	 * TDX module system information.
> > 	 */
> 
> I think we know they are data structures. :)
> 
> But, saying:
> 
> 	/* Used in TDH.SYS.INFO SEAMCALL ABI: */
> 
> *is* actually helpful.  It (probably) tells us where in the spec we can
> find the definition and tells how it gets used.  Plus, it tells us this
> isn't a software data structure.

Right.  I'll use your above comment.

> 
> > > > +	/* Get TDX module information and CMRs */
> > > > +	ret = tdx_get_sysinfo();
> > > > +	if (ret)
> > > > +		goto out;
> > > 
> > > Couldn't we get rid of that comment if you did something like:
> > > 
> > > 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
> > 
> > Yes will do.
> > 
> > > and preferably make the variables function-local.
> > 
> > 'tdx_sysinfo' will be used by KVM too.
> 
> In other words, it's not a part of this series so I can't review whether
> this statement is correct or whether there's a better way to hand this
> information over to KVM.
> 
> This (minor) nugget influencing the design also isn't even commented or
> addressed in the changelog.

TDSYSINFO_STRUCT is 1024B and CMR array is 512B, so I don't think it should be
in the stack.  I can change to use dynamic allocation at the beginning and free
it at the end of the function.  KVM support patches can change it to static
variable in the file.

Or I can add a sentence saying KVM will need to use 'tdx_tdsysinfo' so use
static variable.  However currently KVM doesn't use CMR so no justification for
CMR array.

But I am thinking about memory hotplug interaction with TDX module
initialization.  That may use CMR info.  Let me send out proposal and close that
first to see whether this series needs to use CMR info out of this function.



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-28 14:27           ` Dave Hansen
@ 2022-04-28 23:44             ` Kai Huang
  2022-04-28 23:53               ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-28 23:44 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 07:27 -0700, Dave Hansen wrote:
> On 4/27/22 17:00, Kai Huang wrote:
> > On Wed, 2022-04-27 at 07:49 -0700, Dave Hansen wrote:
> > I think we can use pr_info_once() when all_cpus_booted() returns false, and get
> > rid of printing "SEAMRR not enabled" in seamrr_enabled().  How about below?
> > 
> > static bool seamrr_enabled(void)
> > {
> > 	if (!all_cpus_booted())
> > 		pr_info_once("Not all present CPUs have been booted.  Report
> > SEAMRR as not enabled");
> > 
> > 	return __seamrr_enabled();
> > }
> > 
> > And we don't print "SEAMRR not enabled".
> 
> That's better, but even better than that would be removing all that
> SEAMRR gunk in the first place.

Agreed.

> > > > > > +	/*
> > > > > > +	 * TDX requires at least two KeyIDs: one global KeyID to
> > > > > > +	 * protect the metadata of the TDX module and one or more
> > > > > > +	 * KeyIDs to run TD guests.
> > > > > > +	 */
> > > > > > +	return tdx_keyid_num >= 2;
> > > > > > +}
> > > > > > +
> > > > > > +static int __tdx_detect(void)
> > > > > > +{
> > > > > > +	/* The TDX module is not loaded if SEAMRR is disabled */
> > > > > > +	if (!seamrr_enabled()) {
> > > > > > +		pr_info("SEAMRR not enabled.\n");
> > > > > > +		goto no_tdx_module;
> > > > > > +	}
> > > > > 
> > > > > Why even bother with the SEAMRR stuff?  It sounded like you can "ping"
> > > > > the module with SEAMCALL.  Why not just use that directly?
> > > > 
> > > > SEAMCALL will cause #GP if SEAMRR is not enabled.  We should check whether
> > > > SEAMRR is enabled before making SEAMCALL.
> > > 
> > > So...  You could actually get rid of all this code.  if SEAMCALL #GP's,
> > > then you say, "Whoops, the firmware didn't load the TDX module
> > > correctly, sorry."
> > 
> > Yes we can just use the first SEAMCALL (TDH.SYS.INIT) to detect whether TDX
> > module is loaded.  If SEAMCALL is successful, the module is loaded.
> > 
> > One problem is currently the patch to flush cache for kexec() uses
> > seamrr_enabled() and tdx_keyid_sufficient() to determine whether we need to
> > flush the cache.  The reason is, similar to SME, the flush is done in
> > stop_this_cpu(), but the status of TDX module initialization is protected by
> > mutex, so we cannot use TDX module status in stop_this_cpu() to determine
> > whether to flush.
> > 
> > If that patch makes sense, I think we still need to detect SEAMRR?
> 
> Please go look at stop_this_cpu() closely.  What are the AMD folks doing
> for SME exactly?  Do they, for instance, do the WBINVD when the kernel
> used SME?  No, they just use a pretty low-level check if the processor
> supports SME.
> 
> Doing the same kind of thing for TDX is fine.  You could check the MTRR
> MSR bits that tell you if SEAMRR is supported and then read the MSR
> directly.  You could check the CPUID enumeration for MKTME or
> CPUID.B.0.EDX (I'm not even sure what this is but the SEAMCALL spec says
> it is part of SEAMCALL operation).

I am not sure about this CPUID either.  

> 
> Just like the SME test, it doesn't even need to be precise.  It just
> needs to be 100% accurate in that it is *ALWAYS* set for any system that
> might have dirtied cache aliases.
> 
> I'm not sure why you are so fixated on SEAMRR specifically for this.

I see.  I think I can simply use MTRR.SEAMRR bit check.  If CPU supports SEAMRR,
then basically it supports MKTME.

Is this look good for you?

	
> 
> 
> ...
> > "During initializing the TDX module, one step requires some SEAMCALL must be
> > done on all logical cpus enabled by BIOS, otherwise a later step will fail. 
> > Disable CPU hotplug during the initialization process to prevent any CPU going
> > offline during initializing the TDX module.  Note it is caller's responsibility
> > to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
> > are online."
> 
> But, what if a CPU went offline just before this lock was taken?  What
> if the caller make sure all present CPUs are online, makes the call,
> then a CPU is taken offline.  The lock wouldn't do any good.
> 
> What purpose does the lock serve?

I thought cpus_read_lock() can prevent any CPU from going offline, no?


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-28 23:44             ` Kai Huang
@ 2022-04-28 23:53               ` Dave Hansen
  2022-04-29  0:11                 ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-28 23:53 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/28/22 16:44, Kai Huang wrote:
>> Just like the SME test, it doesn't even need to be precise.  It just
>> needs to be 100% accurate in that it is *ALWAYS* set for any system that
>> might have dirtied cache aliases.
>>
>> I'm not sure why you are so fixated on SEAMRR specifically for this.
> I see.  I think I can simply use MTRR.SEAMRR bit check.  If CPU supports SEAMRR,
> then basically it supports MKTME.
> 
> Is this look good for you?

Sure, fine, as long as it comes with a coherent description that
explains why the check is good enough.

>>> "During initializing the TDX module, one step requires some SEAMCALL must be
>>> done on all logical cpus enabled by BIOS, otherwise a later step will fail. 
>>> Disable CPU hotplug during the initialization process to prevent any CPU going
>>> offline during initializing the TDX module.  Note it is caller's responsibility
>>> to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
>>> are online."
>> But, what if a CPU went offline just before this lock was taken?  What
>> if the caller make sure all present CPUs are online, makes the call,
>> then a CPU is taken offline.  The lock wouldn't do any good.
>>
>> What purpose does the lock serve?
> I thought cpus_read_lock() can prevent any CPU from going offline, no?

It doesn't prevent squat before the lock is taken, though.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-28 23:53               ` Dave Hansen
@ 2022-04-29  0:11                 ` Kai Huang
  2022-04-29  0:26                   ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-29  0:11 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 16:53 -0700, Dave Hansen wrote:
> On 4/28/22 16:44, Kai Huang wrote:
> > > Just like the SME test, it doesn't even need to be precise.  It just
> > > needs to be 100% accurate in that it is *ALWAYS* set for any system that
> > > might have dirtied cache aliases.
> > > 
> > > I'm not sure why you are so fixated on SEAMRR specifically for this.
> > I see.  I think I can simply use MTRR.SEAMRR bit check.  If CPU supports SEAMRR,
> > then basically it supports MKTME.
> > 
> > Is this look good for you?
> 
> Sure, fine, as long as it comes with a coherent description that
> explains why the check is good enough.
> 
> > > > "During initializing the TDX module, one step requires some SEAMCALL must be
> > > > done on all logical cpus enabled by BIOS, otherwise a later step will fail. 
> > > > Disable CPU hotplug during the initialization process to prevent any CPU going
> > > > offline during initializing the TDX module.  Note it is caller's responsibility
> > > > to guarantee all BIOS-enabled CPUs are in cpu_present_mask and all present CPUs
> > > > are online."
> > > But, what if a CPU went offline just before this lock was taken?  What
> > > if the caller make sure all present CPUs are online, makes the call,
> > > then a CPU is taken offline.  The lock wouldn't do any good.
> > > 
> > > What purpose does the lock serve?
> > I thought cpus_read_lock() can prevent any CPU from going offline, no?
> 
> It doesn't prevent squat before the lock is taken, though.

This is true.  So I think w/o taking the lock is also fine, as the TDX module
initialization is a state machine.  If any cpu goes offline during logical-cpu
level initialization and TDH.SYS.LP.INIT isn't done on that cpu, then later the
TDH.SYS.CONFIG will fail.  Similarly, if any cpu going offline causes
TDH.SYS.KEY.CONFIG is not done for any package, then TDH.SYS.TDMR.INIT will
fail.

A problem (I realized it exists in current implementation too) is shutting down
the TDX module, which requires calling TDH.SYS.LP.SHUTDOWN on all BIOS-enabled
cpus.  Kernel can do this SEAMCALL at most for all present cpus.  However when
any cpu is offline, this SEAMCALL won't be called on it, and it seems we need to
add new CPU hotplug callback to call this SEAMCALL when the cpu is online again.

Any suggestion?  Thanks!


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-29  0:11                 ` Kai Huang
@ 2022-04-29  0:26                   ` Dave Hansen
  2022-04-29  0:59                     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-29  0:26 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/28/22 17:11, Kai Huang wrote:
> This is true.  So I think w/o taking the lock is also fine, as the TDX module
> initialization is a state machine.  If any cpu goes offline during logical-cpu
> level initialization and TDH.SYS.LP.INIT isn't done on that cpu, then later the
> TDH.SYS.CONFIG will fail.  Similarly, if any cpu going offline causes
> TDH.SYS.KEY.CONFIG is not done for any package, then TDH.SYS.TDMR.INIT will
> fail.

Right.  The worst-case scenario is someone is mucking around with CPU
hotplug during TDX initialization is that TDX initialization will fail.

We *can* fix some of this at least and provide coherent error messages
with a pattern like this:

	cpus_read_lock();
	// check that all MADT-enumerated CPUs are online
	tdx_init();
	cpus_read_unlock();

That, of course, *does* prevent CPUs from going offline during
tdx_init().  It also provides a nice place for an error message:

	pr_warn("You offlined a CPU then want to use TDX?  Sod off.\n");

> A problem (I realized it exists in current implementation too) is shutting down
> the TDX module, which requires calling TDH.SYS.LP.SHUTDOWN on all BIOS-enabled
> cpus.  Kernel can do this SEAMCALL at most for all present cpus.  However when
> any cpu is offline, this SEAMCALL won't be called on it, and it seems we need to
> add new CPU hotplug callback to call this SEAMCALL when the cpu is online again.

Hold on a sec.  If you call TDH.SYS.LP.SHUTDOWN on any CPU, then TDX
stops working everywhere, right?  But, if someone offlines one CPU, we
don't want TDX to stop working everywhere.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand
  2022-04-29  0:26                   ` Dave Hansen
@ 2022-04-29  0:59                     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-29  0:59 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 17:26 -0700, Dave Hansen wrote:
> On 4/28/22 17:11, Kai Huang wrote:
> > This is true.  So I think w/o taking the lock is also fine, as the TDX module
> > initialization is a state machine.  If any cpu goes offline during logical-cpu
> > level initialization and TDH.SYS.LP.INIT isn't done on that cpu, then later the
> > TDH.SYS.CONFIG will fail.  Similarly, if any cpu going offline causes
> > TDH.SYS.KEY.CONFIG is not done for any package, then TDH.SYS.TDMR.INIT will
> > fail.
> 
> Right.  The worst-case scenario is someone is mucking around with CPU
> hotplug during TDX initialization is that TDX initialization will fail.
> 
> We *can* fix some of this at least and provide coherent error messages
> with a pattern like this:
> 
> 	cpus_read_lock();
> 	// check that all MADT-enumerated CPUs are online
> 	tdx_init();
> 	cpus_read_unlock();
> 
> That, of course, *does* prevent CPUs from going offline during
> tdx_init().  It also provides a nice place for an error message:
> 
> 	pr_warn("You offlined a CPU then want to use TDX?  Sod off.\n");

Yes this is better.

The problem is how to check MADT-enumerated CPUs are online?

I checked the code, and it seems we can use 'num_processors + disabled_cpus' as
MADT-enumerated CPUs?  In fact, there should be no 'disabled_cpus' for TDX, so I
think:

	if (disabled_cpus || num_processors != num_online_cpus()) {
		pr_err("Initializing the TDX module requires all MADT-
enumerated CPUs being onine.");
		return -EINVAL;
	}

But I may have misunderstanding.

> 
> > A problem (I realized it exists in current implementation too) is shutting down
> > the TDX module, which requires calling TDH.SYS.LP.SHUTDOWN on all BIOS-enabled
> > cpus.  Kernel can do this SEAMCALL at most for all present cpus.  However when
> > any cpu is offline, this SEAMCALL won't be called on it, and it seems we need to
> > add new CPU hotplug callback to call this SEAMCALL when the cpu is online again.
> 
> Hold on a sec.  If you call TDH.SYS.LP.SHUTDOWN on any CPU, then TDX
> stops working everywhere, right?  
> 

Yes.

But tot shut down the TDX module, it's better to call  LP.SHUTDOWN on all 
logical cpus as suggested by spec.

> But, if someone offlines one CPU, we
> don't want TDX to stop working everywhere.

Right.   I am talking about when initializing fails due to any reason (i.e. -
ENOMEM), currently we shutdown the TDX module.  When shutting down the TDX
module, we want to call LP.SHUTDOWN on all logical cpus.  If there's any CPU
being offline when we do the shutdown, then LP.SHUTDOWN won't be called for that
cpu. 

But as you suggested above, if we have an early check whether all MADT-
enumerated CPUs are online and if not we return w/o shutting down the TDX
module, then if we shutdown the module the LP.SHUTDOWN will be called on all
cpus.


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-28  0:58           ` Kai Huang
@ 2022-04-29  1:40             ` Kai Huang
  2022-04-29  3:04               ` Dan Williams
  2022-05-03 23:59               ` Kai Huang
  0 siblings, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-29  1:40 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > On 4/27/22 17:37, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > 
> > > I thought we could document this in the documentation saying that this code can
> > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > can change the code and the documentation  when we add the support of those
> > > features in the future, and update the documentation.
> > > 
> > > If 5 years later someone takes this code, he/she should take a look at the
> > > documentation and figure out that he/she should choose a newer kernel if the
> > > machine support those features.
> > > 
> > > I'll think about design solutions if above doesn't look good for you.
> > 
> > No, it doesn't look good to me.
> > 
> > You can't just say:
> > 
> > 	/*
> > 	 * This code will eat puppies if used on systems with hotplug.
> > 	 */
> > 
> > and merrily await the puppy bloodbath.
> > 
> > If it's not compatible, then you have to *MAKE* it not compatible in a
> > safe, controlled way.
> > 
> > > > You can't just ignore the problems because they're not present on one
> > > > version of the hardware.
> > 
> > Please, please read this again ^^
> 
> OK.  I'll think about solutions and come back later.
> > 

Hi Dave,

I think we have two approaches to handle memory hotplug interaction with the TDX
module initialization.  

The first approach is simple.  We just block memory from being added as system
RAM managed by page allocator when the platform supports TDX [1]. It seems we
can add some arch-specific-check to __add_memory_resource() and reject the new
memory resource if platform supports TDX.  __add_memory_resource() is called by
both __add_memory() and add_memory_driver_managed() so it prevents from adding
NVDIMM as system RAM and normal ACPI memory hotplug [2].

The second approach is relatively more complicated.  Instead of directly
rejecting the new memory resource in __add_memory_resource(), we check whether
the memory resource can be added based on CMR and the TDX module initialization
status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
during kernel boots.  And in __add_memory_resource() we do below check:

	tdx_init_disable();	/*similar to cpu_hotplug_disable() */
	if (tdx_module_initialized())
		// reject memory hotplug
	else if (new_memory_resource NOT in CMRs)
		// reject memory hotplug
	else
		allow memory hotplug
	tdx_init_enable();	/*similar to cpu_hotplug_enable() */

tdx_init_disable() temporarily disables TDX module initialization by trying to
grab the mutex.  If the TDX module initialization is already on going, then it
waits until it completes.

This should work better for future platforms, but would requires non-trivial
more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
shutdown the TDX module in kexec().

But for this series I think the second approach is overkill and we can choose to
use the first simple approach?

Any suggestions?

[1] Platform supports TDX means SEAMRR is enabled, and there are at least 2 TDX
keyIDs.  Or we can just check SEAMRR is enabled, as in practice a SEAMRR is
enabled means the machine is TDX-capable, and for now a TDX-capable machine
doesn't support ACPI memory hotplug.

[2] It prevents adding legacy PMEM as system RAM too but I think it's fine.  If
user wants legacy PMEM then it is unlikely user will add it back and use as
system RAM.  User is unlikely to use legacy PMEM as TD guest memory directly as
TD guests is likely to use a new memfd backend which allows private page not
accessible from usrspace, so in this way we can exclude legacy PMEM from TDMRs.

[3] Please refer to SEAMLDR.SEAMINFO SEAMCALL in latest P-SEAMLDR spec:
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > > 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-28  1:21     ` Kai Huang
@ 2022-04-29  2:58       ` Dan Williams
  2022-04-29  5:43         ` Kai Huang
  2022-04-29 14:39         ` Dave Hansen
  0 siblings, 2 replies; 156+ messages in thread
From: Dan Williams @ 2022-04-29  2:58 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > [..]
> > > > 3. Memory hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support memory
> > > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > > physical memory hotplug.  Since it physically cannot happen, this series
> > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > >
> > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> >
> > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> >
> > > > kmem driver.  However the first generation of TDX-capable platforms
> > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > happen either.
> > >
> > > What prevents this code from today's code being run on tomorrow's
> > > platforms and breaking these assumptions?
> >
> > The assumption is already broken today with NVDIMM-N. The lack of
> > DDR-T support on TDX enabled platforms has zero effect on DDR-based
> > persistent memory solutions. In other words, please describe the
> > actual software and hardware conflicts at play here, and do not make
> > the mistake of assuming that "no DDR-T support on TDX platforms" ==
> > "no NVDIMM support".
>
> Sorry I got this information from planning team or execution team I guess. I was
> told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable
> machine.  "co-exist" means they cannot be turned on simultaneously on the same
> platform.  I am also not aware NVDIMM-N, nor the difference between DDR based
> and DDR-T based persistent memory.  Could you give some more background here so
> I can take a look?

My rough understanding is that TDX makes use of metadata communicated
"on the wire" for DDR, but that infrastructure is not there for DDR-T.
However, there are plenty of DDR based NVDIMMs that use super-caps /
batteries and flash to save contents. I believe the concern for TDX is
that the kernel needs to know not use TDX accepted PMEM as PMEM
because the contents saved by the DIMM's onboard energy source are
unreadable outside of a TD.

Here is one of the links that comes up in a search for NVDIMM-N.

https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019

>
> >
> > > > Another case is admin can use 'memmap' kernel command line to create
> > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > > hotplug code to prevent this from happening, this series always include
> > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> >
> > I am not sure what you are trying to say here?
>
> We want to always make sure the memory managed by page allocator is TDX memory.

That only seems possible if the kernel is given a TDX capable physical
address map at the beginning of time.

> So if the legacy PMEMs are unconditionally configured as TDX memory, then we
> don't need to prevent them from being added as system memory via kmem driver.

I think that is too narrow of a focus.

Does a memory map exist for the physical address ranges that are TDX
capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is
ambiguous beyond the point of utility across the industry's entire
range of confidential computing memory capabilities.

One strawman would be an ACPI table with contents like:

struct acpi_protected_memory {
   struct range range;
   uuid_t platform_mem_crypto_capability;
};

With some way to map those uuids to a set of platform vendor specific
constraints and specifications. Some would be shared across
confidential computing vendors, some might be unique. Otherwise, I do
not see how you enforce the expectation of "all memory in the page
allocator is TDX capable". The other alternative is that *none* of the
memory in the page allocator is TDX capable and a special memory
allocation device is used to map memory for TDs. In either case a map
of all possible TDX memory is needed and the discussion above seems
like an incomplete / "hopeful" proposal about the memory dax_kmem, or
other sources, might online. See the CXL CEDT CFWMS (CXL Fixed Memory
Window Structure) as an example of an ACPI table that sets the
kernel's expectations about how a physical address range might be
used.

https://www.computeexpresslink.org/spec-landing

>
> >
> > > > 4. CPU hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > either.  Since this physically cannot happen, this series doesn't add any
> > > > check in ACPI CPU hotplug code path to disable it.
> >
> > What are the actual challenges posed to TDX with respect to CPU hotplug?
>
> During the TDX module initialization, there is a step to call SEAMCALL on all
> logical cpus to initialize per-cpu TDX staff.  TDX doesn't support initializing
> the new hot-added CPUs after the initialization.  There are MCHECK/BIOS changes
> to enforce this check too I guess but I don't know details about this.

Is there an ACPI table that indicates CPU-x passed the check? Or since
the BIOS is invoked in the CPU-online path, is it trusted to suppress
those events for CPUs outside of the mcheck domain?

> > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> >
> > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> > /proc/cpuinfo for example.
>
> It means the CPUs with "enable" bit set in the MADT table.

That just indicates to the present CPUs and then a hot add event
changes the state of now present CPUs to enabled. Per above is the
BIOS responsible for rejecting those new CPUs, or is the kernel?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29  1:40             ` Kai Huang
@ 2022-04-29  3:04               ` Dan Williams
  2022-04-29  5:35                 ` Kai Huang
  2022-05-03 23:59               ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-04-29  3:04 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > >
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > can change the code and the documentation  when we add the support of those
> > > > features in the future, and update the documentation.
> > > >
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > >
> > > > I'll think about design solutions if above doesn't look good for you.
> > >
> > > No, it doesn't look good to me.
> > >
> > > You can't just say:
> > >
> > >     /*
> > >      * This code will eat puppies if used on systems with hotplug.
> > >      */
> > >
> > > and merrily await the puppy bloodbath.
> > >
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > >
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > >
> > > Please, please read this again ^^
> >
> > OK.  I'll think about solutions and come back later.
> > >
>
> Hi Dave,
>
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.
>
> The first approach is simple.  We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX.  __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

What if the memory being added *is* TDX capable? What if someone
wanted to manage a memory range as soft-reserved and move it back and
forth from the core-mm to device access. That should be perfectly
acceptable as long as the memory is TDX capable.

> The second approach is relatively more complicated.  Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> during kernel boots.  And in __add_memory_resource() we do below check:
>
>         tdx_init_disable();     /*similar to cpu_hotplug_disable() */
>         if (tdx_module_initialized())
>                 // reject memory hotplug
>         else if (new_memory_resource NOT in CMRs)
>                 // reject memory hotplug
>         else
>                 allow memory hotplug
>         tdx_init_enable();      /*similar to cpu_hotplug_enable() */
>
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex.  If the TDX module initialization is already on going, then it
> waits until it completes.
>
> This should work better for future platforms, but would requires non-trivial
> more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> shutdown the TDX module in kexec().
>
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?

This still sounds like it is trying to solve symptoms and not the root
problem. Why must the core-mm never have non-TDX memory when VMs are
fine to operate with either core-mm pages or memory from other sources
like hugetlbfs and device-dax?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29  3:04               ` Dan Williams
@ 2022-04-29  5:35                 ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-29  5:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Thu, 2022-04-28 at 20:04 -0700, Dan Williams wrote:
> On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > > 
> > > > > I thought we could document this in the documentation saying that this code can
> > > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > > can change the code and the documentation  when we add the support of those
> > > > > features in the future, and update the documentation.
> > > > > 
> > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > machine support those features.
> > > > > 
> > > > > I'll think about design solutions if above doesn't look good for you.
> > > > 
> > > > No, it doesn't look good to me.
> > > > 
> > > > You can't just say:
> > > > 
> > > >     /*
> > > >      * This code will eat puppies if used on systems with hotplug.
> > > >      */
> > > > 
> > > > and merrily await the puppy bloodbath.
> > > > 
> > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > safe, controlled way.
> > > > 
> > > > > > You can't just ignore the problems because they're not present on one
> > > > > > version of the hardware.
> > > > 
> > > > Please, please read this again ^^
> > > 
> > > OK.  I'll think about solutions and come back later.
> > > > 
> > 
> > Hi Dave,
> > 
> > I think we have two approaches to handle memory hotplug interaction with the TDX
> > module initialization.
> > 
> > The first approach is simple.  We just block memory from being added as system
> > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > can add some arch-specific-check to __add_memory_resource() and reject the new
> > memory resource if platform supports TDX.  __add_memory_resource() is called by
> > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > NVDIMM as system RAM and normal ACPI memory hotplug [2].
> 
> What if the memory being added *is* TDX capable? What if someone
> wanted to manage a memory range as soft-reserved and move it back and
> forth from the core-mm to device access. That should be perfectly
> acceptable as long as the memory is TDX capable.

Please see below.

> 
> > The second approach is relatively more complicated.  Instead of directly
> > rejecting the new memory resource in __add_memory_resource(), we check whether
> > the memory resource can be added based on CMR and the TDX module initialization
> > status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> > CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> > during kernel boots.  And in __add_memory_resource() we do below check:
> > 
> >         tdx_init_disable();     /*similar to cpu_hotplug_disable() */
> >         if (tdx_module_initialized())
> >                 // reject memory hotplug
> >         else if (new_memory_resource NOT in CMRs)
> >                 // reject memory hotplug
> >         else
> >                 allow memory hotplug
> >         tdx_init_enable();      /*similar to cpu_hotplug_enable() */
> > 
> > tdx_init_disable() temporarily disables TDX module initialization by trying to
> > grab the mutex.  If the TDX module initialization is already on going, then it
> > waits until it completes.
> > 
> > This should work better for future platforms, but would requires non-trivial
> > more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> > CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> > shutdown the TDX module in kexec().
> > 
> > But for this series I think the second approach is overkill and we can choose to
> > use the first simple approach?
> 
> This still sounds like it is trying to solve symptoms and not the root
> problem. Why must the core-mm never have non-TDX memory when VMs are
> fine to operate with either core-mm pages or memory from other sources
> like hugetlbfs and device-dax?

Basically we don't want to modify page allocator API to distinguish TDX and non-
TDX allocation.  For instance, we don't want a new GFP_TDX.

There's another series done by Chao "KVM: mm: fd-based approach for supporting
KVM guest private memory" which essentially allows KVM to ask guest memory
backend to allocate page w/o having to mmap() to userspace.  

https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/

More specifically, memfd will support a new MFD_INACCESSIBLE flag when it is
created so all pages associated with this memfd will be TDX capable memory.  The
backend will need to implement a new memfile_notifier_ops to allow KVM to get
and put the memory page.

struct memfile_pfn_ops {
	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
	void (*put_unlock_pfn)(unsigned long pfn);
};

With that, it is backend's responsibility to implement get_lock_pfn() callback
in which the backend needs to ensure a TDX private page is allocated.

For TD guest, KVM should enforced to only use those fd-based backend.  I am not
sure whether anonymous pages should be supported anymore.

Sean, please correct me if I am wrong?

Currently only shmem is extended to support it.  By ensuring pages in page
allocator are all TDX memory, shmem can be extended easily to support TD guests.
 
If device-dax and hugetlbfs wants to support TD guests then they should
implement those callbacks and ensure only TDX memory is allocated.  For
instance, when future TDX supports NVDIMM (i.e. NVDIMM is included to CMRs),
then device-dax pages can be included as TDX memory when initializing the TDX
module and device-dax can implement it's own to support allocating page for TD
guests.

But TDX architecture can be changed to support memory hotplug in a more graceful
way in the future.  For instance, it can choose to support dynamically adding
any convertible memory as TDX memory *after* TDX module initialization.  But
this is just my brainstorming.

Anyway, for now, since only shmem (or + anonymous pages) can be used to create
TD guests, I think we can just reject any memory hot-add when platform supports
TDX as described in the first simple approach.  Eventually we may need something
like the second approach but TDX architecture can evolve too.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29  2:58       ` Dan Williams
@ 2022-04-29  5:43         ` Kai Huang
  2022-04-29 14:39         ` Dave Hansen
  1 sibling, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-29  5:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Thu, 2022-04-28 at 19:58 -0700, Dan Williams wrote:
> On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> > > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > [..]
> > > > > 3. Memory hotplug
> > > > > 
> > > > > The first generation of TDX architecturally doesn't support memory
> > > > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > > > physical memory hotplug.  Since it physically cannot happen, this series
> > > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > > > 
> > > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> > > 
> > > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> > > 
> > > > > kmem driver.  However the first generation of TDX-capable platforms
> > > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > > happen either.
> > > > 
> > > > What prevents this code from today's code being run on tomorrow's
> > > > platforms and breaking these assumptions?
> > > 
> > > The assumption is already broken today with NVDIMM-N. The lack of
> > > DDR-T support on TDX enabled platforms has zero effect on DDR-based
> > > persistent memory solutions. In other words, please describe the
> > > actual software and hardware conflicts at play here, and do not make
> > > the mistake of assuming that "no DDR-T support on TDX platforms" ==
> > > "no NVDIMM support".
> > 
> > Sorry I got this information from planning team or execution team I guess. I was
> > told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable
> > machine.  "co-exist" means they cannot be turned on simultaneously on the same
> > platform.  I am also not aware NVDIMM-N, nor the difference between DDR based
> > and DDR-T based persistent memory.  Could you give some more background here so
> > I can take a look?
> 
> My rough understanding is that TDX makes use of metadata communicated
> "on the wire" for DDR, but that infrastructure is not there for DDR-T.
> However, there are plenty of DDR based NVDIMMs that use super-caps /
> batteries and flash to save contents. I believe the concern for TDX is
> that the kernel needs to know not use TDX accepted PMEM as PMEM
> because the contents saved by the DIMM's onboard energy source are
> unreadable outside of a TD.
> 
> Here is one of the links that comes up in a search for NVDIMM-N.
> 
> https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019

Thanks for the info.  I need some more time to digest those different types of
DDRs and NVDIMMs.  However I guess they are not quite relevant since TDX has a
concept of "Convertible Memory Region".  Please see below.

> 
> > 
> > > 
> > > > > Another case is admin can use 'memmap' kernel command line to create
> > > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > > > hotplug code to prevent this from happening, this series always include
> > > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > > 
> > > I am not sure what you are trying to say here?
> > 
> > We want to always make sure the memory managed by page allocator is TDX memory.
> 
> That only seems possible if the kernel is given a TDX capable physical
> address map at the beginning of time.

Yes TDX architecture has a concept "Convertible Memory Region" (CMR). The memory
used by TDX must be convertible memory.  BIOS generates an array of CMR entries
during boot and they are verified by MCHECK.  CMRs are static during machine's
runtime.

> 
> > So if the legacy PMEMs are unconditionally configured as TDX memory, then we
> > don't need to prevent them from being added as system memory via kmem driver.
> 
> I think that is too narrow of a focus.
> 
> Does a memory map exist for the physical address ranges that are TDX
> capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is
> ambiguous beyond the point of utility across the industry's entire
> range of confidential computing memory capabilities.
> 
> One strawman would be an ACPI table with contents like:
> 
> struct acpi_protected_memory {
>    struct range range;
>    uuid_t platform_mem_crypto_capability;
> };
> 
> With some way to map those uuids to a set of platform vendor specific
> constraints and specifications. Some would be shared across
> confidential computing vendors, some might be unique. Otherwise, I do
> not see how you enforce the expectation of "all memory in the page
> allocator is TDX capable". 
> 

Please see above.  TDX has CMR.

> The other alternative is that *none* of the
> memory in the page allocator is TDX capable and a special memory
> allocation device is used to map memory for TDs. In either case a map
> of all possible TDX memory is needed and the discussion above seems
> like an incomplete / "hopeful" proposal about the memory dax_kmem, or
> other sources, might online. 

Yes we are also developing a new memfd based approach to support TD guest
memory.  Please see my another reply to you.


> See the CXL CEDT CFWMS (CXL Fixed Memory
> Window Structure) as an example of an ACPI table that sets the
> kernel's expectations about how a physical address range might be
> used.
> 
> https://www.computeexpresslink.org/spec-landing

Thanks for the info. I'll take a look to get some background.

> 
> > 
> > > 
> > > > > 4. CPU hotplug
> > > > > 
> > > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > > either.  Since this physically cannot happen, this series doesn't add any
> > > > > check in ACPI CPU hotplug code path to disable it.
> > > 
> > > What are the actual challenges posed to TDX with respect to CPU hotplug?
> > 
> > During the TDX module initialization, there is a step to call SEAMCALL on all
> > logical cpus to initialize per-cpu TDX staff.  TDX doesn't support initializing
> > the new hot-added CPUs after the initialization.  There are MCHECK/BIOS changes
> > to enforce this check too I guess but I don't know details about this.
> 
> Is there an ACPI table that indicates CPU-x passed the check? Or since
> the BIOS is invoked in the CPU-online path, is it trusted to suppress
> those events for CPUs outside of the mcheck domain?

No the TDX module (and the P-SEAMLDR) internally maintains some data to record
the total number of LPs and packages, and which logical cpu has been
initialized, etc.

I asked Intel guys whether BIOS would suppress an ACPI CPU hotplug event but I
never got a concrete answer.  I'll try again.

> 
> > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > > 
> > > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> > > /proc/cpuinfo for example.
> > 
> > It means the CPUs with "enable" bit set in the MADT table.
> 
> That just indicates to the present CPUs and then a hot add event
> changes the state of now present CPUs to enabled. Per above is the
> BIOS responsible for rejecting those new CPUs, or is the kernel?

I'll ask BIOS guys again to see whether BIOS will suppress ACPI CPU hotplug
event.  But I think we can have a simple patch to reject ACPI CPU hotplug if
platform is TDX-capable?

Or do you think we don't need to explicitly reject ACPI CPU hotplug if we can
confirm with BIOS guys that it will suppress on TDX capable machine?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM
  2022-04-28 16:22   ` Dave Hansen
@ 2022-04-29  7:24     ` Kai Huang
  2022-04-29 13:52       ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-29  7:24 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 09:22 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > The kernel configures TDX usable memory regions to the TDX module via
> > an array of "TD Memory Region" (TDMR). 
> 
> One bit of language that's repeated in these changelogs that I don't
> like is "configure ... to".  I think that's a misuse of the word
> configure.  I'd say something more like:
> 
> 	The kernel configures TDX-usable memory regions by passing an
> 	array of "TD Memory Regions" (TDMRs) to the TDX module.
> 
> Could you please take a look over this series and reword those?

Thanks will do.

> 
> > Each TDMR entry (TDMR_INFO)
> > contains the information of the base/size of a memory region, the
> > base/size of the associated Physical Address Metadata Table (PAMT) and
> > a list of reserved areas in the region.
> > 
> > Create a number of TDMRs according to the verified e820 RAM entries.
> > As the first step only set up the base/size information for each TDMR.
> > 
> > TDMR must be 1G aligned and the size must be in 1G granularity.  This
> 
>  ^ Each

OK.

> 
> > implies that one TDMR could cover multiple e820 RAM entries.  If a RAM
> > entry spans the 1GB boundary and the former part is already covered by
> > the previous TDMR, just create a new TDMR for the latter part.
> > 
> > TDX only supports a limited number of TDMRs (currently 64).  Abort the
> > TDMR construction process when the number of TDMRs exceeds this
> > limitation.
> 
> ... and what does this *MEAN*?  Is TDX disabled?  Does it throw away the
> RAM?  Does it eat puppies?

How about:

	TDX only supports a limited number of TDMRs.  Simply return error when
	the number of TDMRs exceeds the limitation.  TDX is disabled in this
	case.

> 
> >  arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 138 insertions(+)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 6b0c51aaa7f2..82534e70df96 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -54,6 +54,18 @@
> >  		((u32)(((_keyid_part) & 0xffffffffull) + 1))
> >  #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
> >  
> > +/* TDMR must be 1gb aligned */
> > +#define TDMR_ALIGNMENT		BIT_ULL(30)
> > +#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
> > +
> > +/* Align up and down the address to TDMR boundary */
> > +#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> > +#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
> > +
> > +/* TDMR's start and end address */
> > +#define TDMR_START(_tdmr)	((_tdmr)->base)
> > +#define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
> 
> Make these 'static inline's please.  #defines are only for constants or
> things that can't use real functions.

OK.

> 
> >  /*
> >   * TDX module status during initialization
> >   */
> > @@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
> >  	return 0;
> >  }
> >  
> > +/* The starting offset of reserved areas within TDMR_INFO */
> > +#define TDMR_RSVD_START		64
> 
> 				^ extra whitespace

Will remove.

> 
> > +static struct tdmr_info *__alloc_tdmr(void)
> > +{
> > +	int tdmr_sz;
> > +
> > +	/*
> > +	 * TDMR_INFO's actual size depends on maximum number of reserved
> > +	 * areas that one TDMR supports.
> > +	 */
> > +	tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
> > +		sizeof(struct tdmr_reserved_area);
> 
> You have a structure for this.  I know this because it's the return type
> of the function.  You have TDMR_RSVD_START available via the structure
> itself.  So, derive that 64 either via:
> 
> 	sizeof(struct tdmr_info)
> 
> or,
> 
> 	offsetof(struct tdmr_info, reserved_areas);
> 
> Which would make things look like this:
> 
> 	tdmr_base_sz = sizeof(struct tdmr_info);
> 	tdmr_reserved_area_sz = sizeof(struct tdmr_reserved_area) *
> 				tdx_sysinfo.max_reserved_per_tdmr;
> 
> 	tdmr_sz = tdmr_base_sz + tdmr_reserved_area_sz;
> 
> Could you explain why on earth you felt the need for the TDMR_RSVD_START
> #define?

Will use sizeof (struct tdmr_info).  Thanks for the tip.

> 
> > +	/*
> > +	 * TDX requires TDMR_INFO to be 512 aligned.  Always align up
> 
> Again, 512 what?  512 pages?  512 hippos?

Will change to 512-byte aligned.

> 
> > +	 * TDMR_INFO size to 512 so the memory allocated via kzalloc()
> > +	 * can meet the alignment requirement.
> > +	 */
> > +	tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> > +
> > +	return kzalloc(tdmr_sz, GFP_KERNEL);
> > +}
> > +
> > +/* Create a new TDMR at given index in the TDMR array */
> > +static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
> > +{
> > +	struct tdmr_info *tdmr;
> > +
> > +	if (WARN_ON_ONCE(tdmr_array[idx]))
> > +		return NULL;
> > +
> > +	tdmr = __alloc_tdmr();
> > +	tdmr_array[idx] = tdmr;
> > +
> > +	return tdmr;
> > +}
> > +
> >  static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> >  {
> >  	int i;
> > @@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
> >  	}
> >  }
> >  
> > +/*
> > + * Create TDMRs to cover all RAM entries in e820_table.  The created
> > + * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
> > + * number of TDMRs.  All entries in @tdmr_array must be initially NULL.
> > + */
> > +static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> > +{
> > +	struct tdmr_info *tdmr;
> > +	u64 start, end;
> > +	int i, tdmr_idx;
> > +	int ret = 0;
> > +
> > +	tdmr_idx = 0;
> > +	tdmr = alloc_tdmr(tdmr_array, 0);
> > +	if (!tdmr)
> > +		return -ENOMEM;
> > +	/*
> > +	 * Loop over all RAM entries in e820 and create TDMRs to cover
> > +	 * them.  To keep it simple, always try to use one TDMR to cover
> > +	 * one RAM entry.
> > +	 */
> > +	e820_for_each_mem(i, start, end) {
> > +		start = TDMR_ALIGN_DOWN(start);
> > +		end = TDMR_ALIGN_UP(end);
> 			    ^ vertically align those ='s, please.

OK.

> 
> 
> > +		/*
> > +		 * If the current TDMR's size hasn't been initialized, it
> > +		 * is a new allocated TDMR to cover the new RAM entry.
> > +		 * Otherwise the current TDMR already covers the previous
> > +		 * RAM entry.  In the latter case, check whether the
> > +		 * current RAM entry has been fully or partially covered
> > +		 * by the current TDMR, since TDMR is 1G aligned.
> > +		 */
> > +		if (tdmr->size) {
> > +			/*
> > +			 * Loop to next RAM entry if the current entry
> > +			 * is already fully covered by the current TDMR.
> > +			 */
> > +			if (end <= TDMR_END(tdmr))
> > +				continue;
> 
> This loop is actually pretty well commented and looks OK.  The
> TDMR_END() construct even adds to readability.  *BUT*, the
> 
> > +			/*
> > +			 * If part of current RAM entry has already been
> > +			 * covered by current TDMR, skip the already
> > +			 * covered part.
> > +			 */
> > +			if (start < TDMR_END(tdmr))
> > +				start = TDMR_END(tdmr);
> > +
> > +			/*
> > +			 * Create a new TDMR to cover the current RAM
> > +			 * entry, or the remaining part of it.
> > +			 */
> > +			tdmr_idx++;
> > +			if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
> > +				ret = -E2BIG;
> > +				goto err;
> > +			}
> > +			tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
> > +			if (!tdmr) {
> > +				ret = -ENOMEM;
> > +				goto err;
> > +			}
> 
> This is a bit verbose for this loop.  Why not just hide the 'max_tdmrs'
> inside the alloc_tdmr() function?  That will make this loop smaller and
> easier to read.

Based on suggestion, I'll change to use alloc_pages_exact() to allocate those
TDMRs at once, so no need to allocate for each TDMR again here.  I'll remove the
alloc_tdmr() but keep the max_tdmrs check here.
 


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 11/21] x86/virt/tdx: Choose to use all system RAM as TDX memory
  2022-04-28 15:54   ` Dave Hansen
@ 2022-04-29  7:32     ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-04-29  7:32 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 08:54 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > As one step of initializing the TDX module, the memory regions that the
> > TDX module can use must be configured to it via an array of 'TD Memory
> 
> "can use must be"?

"can use" applies to "the TDX module".  "must be" applies to "the memory
regions".  

Sorry for bad english.  Yes it's not good.  I'll use your words suggested in
another patch:

	The kernel configures TDX-usable memory regions by passing an
	array of "TD Memory Regions" (TDMRs) to the TDX module.

> 
> > Regions' (TDMR).  The kernel is responsible for choosing which memory
> > regions to be used as TDX memory and building the array of TDMRs to
> > cover those memory regions.
> > 
> > The first generation of TDX-capable platforms basically guarantees all
> > system RAM regions during machine boot are Convertible Memory Regions
> > (excluding the memory below 1MB) and can be used by TDX.  The memory
> > pages allocated to TD guests can be any pages managed by the page
> > allocator.  To avoid having to modify the page allocator to distinguish
> > TDX and non-TDX memory allocation, adopt a simple policy to use all
> > system RAM regions as TDX memory.  The low 1MB pages are excluded from
> > TDX memory since they are not in CMRs in some platforms (those pages are
> > reserved at boot time and won't be managed by page allocator anyway).
> > 
> > This policy could be revised later if future TDX generations break
> > the guarantee or when the size of the metadata (~1/256th of the size of
> > the TDX usable memory) becomes a concern.  At that time a CMR-aware
> > page allocator may be necessary.
> 
> Remember that you have basically three or four short sentences to get a
> reviewer's attention.  There's a lot of noise in that changelog.  Can
> you trim it down or at least make the first bit less jargon-packed and
> more readable?
> 
> > Also, on the first generation of TDX-capable machine, the system RAM
> > ranges discovered during boot time are all memory regions that kernel
> > can use during its runtime.  This is because the first generation of TDX
> > architecturally doesn't support ACPI memory hotplug 
> 
> "Architecturally" usually means: written down and agreed to by hardware
> and software alike.  Is this truly written down somewhere?  I don't
> recall seeing it in the architecture documents.
> 
> I fear this is almost the _opposite_ of architecture: it's basically a
> fortunate coincidence.
> 
> > (CMRs are generated
> > during machine boot and are static during machine's runtime).  Also, the
> > first generation of TDX-capable platform doesn't support TDX and ACPI
> > memory hotplug at the same time on a single machine.  Another case of
> > memory hotplug is user may use NVDIMM as system RAM via kmem driver.
> > But the first generation of TDX-capable machine doesn't support TDX and
> > NVDIMM simultaneously, therefore in practice it cannot happen.  One
> > special case is user may use 'memmap' kernel command line to reserve
> > part of system RAM as x86 legacy PMEMs, and user can theoretically add
> > them as system RAM via kmem driver.  This can be resolved by always
> > treating legacy PMEMs as TDX memory.
> 
> Again, there's a ton of noise here.  I'm struggling to get the point.
> 
> > Implement a helper to loop over all RAM entries in e820 table to find
> > all system RAM ranges, as a preparation to covert all of them to TDX
> > memory.  Use 'e820_table', rather than 'e820_table_firmware' to honor
> > 'mem' and 'memmap' command lines. 
> 
> *How* does this honor them?  For instance, if I do mem=4G, will the TDX
> code limit itself to converting 4GB for TDX?

Yes.

> 
> > Following e820__memblock_setup(),
> > both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN types are treated as TDX
> > memory, and contiguous ranges in the same NUMA node are merged together.
> 
> Again, you're just rehashing the code's logic in English.  That's not
> what a changelog is for.

Sorry you are right.

I'll address rest of your comments after we settle memory hotplug handling
discussion.

Thanks!



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-28 17:12   ` Dave Hansen
@ 2022-04-29  7:46     ` Kai Huang
  2022-04-29 14:20       ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-04-29  7:46 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > In order to provide crypto protection to guests, the TDX module uses
> > additional metadata to record things like which guest "owns" a given
> > page of memory.  This metadata, referred as Physical Address Metadata
> > Table (PAMT), essentially serves as the 'struct page' for the TDX
> > module.  PAMTs are not reserved by hardware upfront.  They must be
> > allocated by the kernel and then given to the TDX module.
> > 
> > TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> > (TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.
> 
> s/respectively//
> 

Will remove.

> > Each PAMT must be a physically contiguous area from the Convertible
> 
> 							^ s/the/a/

OK.

> 
> > Memory Regions (CMR).  However, the PAMTs which track pages in one TDMR
> > do not need to reside within that TDMR but can be anywhere in CMRs.
> > If one PAMT overlaps with any TDMR, the overlapping part must be
> > reported as a reserved area in that particular TDMR.
> > 
> > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > and it may be potentially large (~1/256th of the size of the given TDMR).
> 
> This is also a good place to note the downsides of using
> alloc_contig_pages().

For instance:

	The allocation may fail when memory usage is under pressure.

?

> 
> > The current version of TDX supports at most 16 reserved areas per TDMR
> > to cover both PAMTs and potential memory holes within the TDMR.  If many
> > PAMTs are allocated within a single TDMR, 16 reserved areas may not be
> > sufficient to cover all of them.
> > 
> > Adopt the following policies when allocating PAMTs for a given TDMR:
> > 
> >   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> >     the total number of reserved areas consumed for PAMTs.
> >   - Try to first allocate PAMT from the local node of the TDMR for better
> >     NUMA locality.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/Kconfig            |   1 +
> >  arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 166 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7414625b938f..ff68d0829bd7 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
> >  	depends on CPU_SUP_INTEL
> >  	depends on X86_64
> >  	select NUMA_KEEP_MEMINFO if NUMA
> > +	depends on CONTIG_ALLOC
> >  	help
> >  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> >  	  host and certain physical attacks.  This option enables necessary TDX
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 82534e70df96..1b807dcbc101 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -21,6 +21,7 @@
> >  #include <asm/cpufeatures.h>
> >  #include <asm/virtext.h>
> >  #include <asm/e820/api.h>
> > +#include <asm/pgtable.h>
> >  #include <asm/tdx.h>
> >  #include "tdx.h"
> >  
> > @@ -66,6 +67,16 @@
> >  #define TDMR_START(_tdmr)	((_tdmr)->base)
> >  #define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
> >  
> > +/* Page sizes supported by TDX */
> > +enum tdx_page_sz {
> > +	TDX_PG_4K = 0,
> > +	TDX_PG_2M,
> > +	TDX_PG_1G,
> > +	TDX_PG_MAX,
> > +};
> 
> Is that =0 required?  I thought the first enum was defined to be 0.

No it's not required.  Will remove.

> 
> > +#define TDX_HPAGE_SHIFT	9
> > +
> >  /*
> >   * TDX module status during initialization
> >   */
> > @@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> >  	return ret;
> >  }
> >  
> > +/* Calculate PAMT size given a TDMR and a page size */
> > +static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
> > +					enum tdx_page_sz pgsz)
> > +{
> > +	unsigned long pamt_sz;
> > +
> > +	pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
> > +		tdx_sysinfo.pamt_entry_size;
> 
> That 'pgsz' thing is just hideous.  I'd *much* rather see something like
> this:
> 
> static int tdx_page_size_shift(enum tdx_page_sz page_sz)
> {
> 	switch (page_sz) {
> 	case TDX_PG_4K:
> 		return PAGE_SIZE;
> 	...
> 	}
> }
> 
> That's easy to figure out what's going on.

OK. Will do.

> 
> > +	/* PAMT size must be 4K aligned */
> > +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> > +
> > +	return pamt_sz;
> > +}
> > +
> > +/* Calculate the size of all PAMTs for a TDMR */
> > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
> > +{
> > +	enum tdx_page_sz pgsz;
> > +	unsigned long pamt_sz;
> > +
> > +	pamt_sz = 0;
> > +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
> > +		pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
> > +
> > +	return pamt_sz;
> > +}
> 
> But, there are 3 separate pointers pointing to 3 separate PAMTs.  Why do
> they all have to be contiguously allocated?

It is also explained in the changelog (the last two paragraphs).

> 
> > +/*
> > + * Locate the NUMA node containing the start of the given TDMR's first
> > + * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
> > + */
> 
> Please add a sentence or two on the implications here of what this means
> when it happens.  Also, the joining of e820 regions seems like it might
> span NUMA nodes.  What prevents that code from just creating one large
> e820 area that leads to one large TDMR and horrible NUMA affinity for
> these structures?

How about adding:

	When TDMR is created, it stops spanning at NUAM boundary.

> 
> > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > +{
> > +	u64 start, end;
> > +	int i;
> > +
> > +	/* Find the first RAM entry covered by the TDMR */
> > +	e820_for_each_mem(i, start, end)
> > +		if (end > TDMR_START(tdmr))
> > +			break;
> 
> Brackets around the big loop, please.

OK.

> 
> > +	/*
> > +	 * One TDMR must cover at least one (or partial) RAM entry,
> > +	 * otherwise it is kernel bug.  WARN_ON() in this case.
> > +	 */
> > +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > +		return 0;
> > +
> > +	/*
> > +	 * The first RAM entry may be partially covered by the previous
> > +	 * TDMR.  In this case, use TDMR's start to find the NUMA node.
> > +	 */
> > +	if (start < TDMR_START(tdmr))
> > +		start = TDMR_START(tdmr);
> > +
> > +	return phys_to_target_node(start);
> > +}
> > +
> > +static int tdmr_setup_pamt(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
> > +	unsigned long pamt_sz[TDX_PG_MAX];
> > +	unsigned long pamt_npages;
> > +	struct page *pamt;
> > +	enum tdx_page_sz pgsz;
> > +	int nid;
> 
> Sooooooooooooooooooo close to reverse Christmas tree, but no cigar.
> Please fix it.

Will fix.  Thanks.

> 
> > +	/*
> > +	 * Allocate one chunk of physically contiguous memory for all
> > +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> > +	 * in overlapped TDMRs.
> > +	 */
> 
> Ahh, this explains it.  Considering that tdmr_get_pamt_sz() is really
> just two lines of code, I'd probably just the helper and open-code it
> here.  Then you only have one place to comment on it.

It has a loop and internally calls __tdmr_get_pamt_sz().  It looks doesn't fit
if we open-code it here.

How about move this comment to tdmr_get_pamt_sz()?

	
> 
> > +	nid = tdmr_get_nid(tdmr);
> > +	pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
> > +	pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
> > +			&node_online_map);
> > +	if (!pamt)
> > +		return -ENOMEM;
> > +
> > +	/* Calculate PAMT base and size for all supported page sizes. */
> > +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> > +	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
> > +		unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
> > +
> > +		pamt_base[pgsz] = tdmr_pamt_base;
> > +		pamt_sz[pgsz] = sz;
> > +
> > +		tdmr_pamt_base += sz;
> > +	}
> > +
> > +	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
> > +	tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
> > +	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
> > +	tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
> > +	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
> > +	tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];
> 
> This would all vertically align nicely if you renamed pamt_sz -> pamt_size.

OK.

> 
> > +	return 0;
> > +}
> > +
> > +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> > +{
> > +	unsigned long pamt_pfn, pamt_sz;
> > +
> > +	pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;
> 
> Comment, please:
> 
> 	/*
> 	 * The PAMT was allocated in one contiguous unit.  The 4k PAMT
> 	 * should always point to the beginning of that allocation.
> 	 */

Thanks will add.

> 
> > +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> > +
> > +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> > +	if (!pamt_sz)
> > +		return;
> > +
> > +	if (WARN_ON(!pamt_pfn))
> > +		return;
> > +
> > +	free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
> > +}
> > +
> > +static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < tdmr_num; i++)
> > +		tdmr_free_pamt(tdmr_array[i]);
> > +}
> > +
> > +/* Allocate and set up PAMTs for all TDMRs */
> > +static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
> 
> 	"set_up", please, not "setup".

OK.

> 
> > +{
> > +	int i, ret;
> > +
> > +	for (i = 0; i < tdmr_num; i++) {
> > +		ret = tdmr_setup_pamt(tdmr_array[i]);
> > +		if (ret)
> > +			goto err;
> > +	}
> > +
> > +	return 0;
> > +err:
> > +	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> > +	return -ENOMEM;
> > +}
> > +
> >  static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> >  {
> >  	int ret;
> > @@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
> >  	if (ret)
> >  		goto err;
> >  
> > +	ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
> > +	if (ret)
> > +		goto err_free_tdmrs;
> > +
> >  	/* Return -EFAULT until constructing TDMRs is done */
> >  	ret = -EFAULT;
> > +	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
> > +err_free_tdmrs:
> >  	free_tdmrs(tdmr_array, *tdmr_num);
> >  err:
> >  	return ret;
> > @@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
> >  	 * initialization are done.
> >  	 */
> >  	ret = -EFAULT;
> > +	/*
> > +	 * Free PAMTs allocated in construct_tdmrs() when TDX module
> > +	 * initialization fails.
> > +	 */
> > +	if (ret)
> > +		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
> >  out_free_tdmrs:
> >  	/*
> >  	 * TDMRs are only used during initializing TDX module.  Always
> 
> In a follow-on patch, I'd like this to dump out (in a pr_debug() or
> pr_info()) how much memory is consumed by PAMT allocations.

OK.



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM
  2022-04-29  7:24     ` Kai Huang
@ 2022-04-29 13:52       ` Dave Hansen
  0 siblings, 0 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 13:52 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/29/22 00:24, Kai Huang wrote:
> On Thu, 2022-04-28 at 09:22 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> implies that one TDMR could cover multiple e820 RAM entries.  If a RAM
>>> entry spans the 1GB boundary and the former part is already covered by
>>> the previous TDMR, just create a new TDMR for the latter part.
>>>
>>> TDX only supports a limited number of TDMRs (currently 64).  Abort the
>>> TDMR construction process when the number of TDMRs exceeds this
>>> limitation.
>>
>> ... and what does this *MEAN*?  Is TDX disabled?  Does it throw away the
>> RAM?  Does it eat puppies?
> 
> How about:
> 
> 	TDX only supports a limited number of TDMRs.  Simply return error when
> 	the number of TDMRs exceeds the limitation.  TDX is disabled in this
> 	case.

Better, but two things there that need to be improved.  This is a cover
letter.  Talking at the function level ("return error") is too
low-level.  It's also slipping into passive mode "is disabled".  Fixing
those, it looks like this:

	TDX only supports a limited number of TDMRs.  Disable TDX if all
	TDMRs are consumed but there is more RAM to cover.


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-29  7:46     ` Kai Huang
@ 2022-04-29 14:20       ` Dave Hansen
  2022-04-29 14:30         ` Sean Christopherson
  2022-05-02  5:59         ` Kai Huang
  0 siblings, 2 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 14:20 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/29/22 00:46, Kai Huang wrote:
> On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
>> This is also a good place to note the downsides of using
>> alloc_contig_pages().
> 
> For instance:
> 
> 	The allocation may fail when memory usage is under pressure.

It's not really memory pressure, though.  The larger the allocation, the
more likely it is to fail.  The more likely it is that the kernel can't
free the memory or that if you need 1GB of contiguous memory that
999.996MB gets freed, but there is one stubborn page left.

alloc_contig_pages() can and will fail.  The only mitigation which is
guaranteed to avoid this is doing the allocation at boot.  But, you're
not doing that to avoid wasting memory on every TDX system that doesn't
use TDX.

A *good* way (although not foolproof) is to launch a TDX VM early in
boot before memory gets fragmented or consumed.  You might even want to
recommend this in the documentation.

>>> +/*
>>> + * Locate the NUMA node containing the start of the given TDMR's first
>>> + * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
>>> + */
>>
>> Please add a sentence or two on the implications here of what this means
>> when it happens.  Also, the joining of e820 regions seems like it might
>> span NUMA nodes.  What prevents that code from just creating one large
>> e820 area that leads to one large TDMR and horrible NUMA affinity for
>> these structures?
> 
> How about adding:
> 
> 	When TDMR is created, it stops spanning at NUAM boundary.

I actually don't know what that means at all.  I was thinking of
something like this.

/*
 * Pick a NUMA node on which to allocate this TDMR's metadata.
 *
 * This is imprecise since TDMRs are 1GB aligned and NUMA nodes might
 * not be.  If the TDMR covers more than one node, just use the _first_
 * one.  This can lead to small areas of off-node metadata for some
 * memory.
 */

>>> +static int tdmr_get_nid(struct tdmr_info *tdmr)
>>> +{
>>> +	u64 start, end;
>>> +	int i;
>>> +
>>> +	/* Find the first RAM entry covered by the TDMR */

There's something else missing in here.  Why not just do:

	return phys_to_target_node(TDMR_START(tdmr));

This would explain it:

	/*
	 * The beginning of the TDMR might not point to RAM.
	 * Find its first RAM address which which its node can
	 * be found.
	 */

>>> +	e820_for_each_mem(i, start, end)
>>> +		if (end > TDMR_START(tdmr))
>>> +			break;
>>
>> Brackets around the big loop, please.
> 
> OK.
> 
>>
>>> +	/*
>>> +	 * One TDMR must cover at least one (or partial) RAM entry,
>>> +	 * otherwise it is kernel bug.  WARN_ON() in this case.
>>> +	 */
>>> +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
>>> +		return 0;

This really means "no RAM found for this TDMR", right?  Can we say that,
please.


>>> +	/*
>>> +	 * Allocate one chunk of physically contiguous memory for all
>>> +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
>>> +	 * in overlapped TDMRs.
>>> +	 */
>>
>> Ahh, this explains it.  Considering that tdmr_get_pamt_sz() is really
>> just two lines of code, I'd probably just the helper and open-code it
>> here.  Then you only have one place to comment on it.
> 
> It has a loop and internally calls __tdmr_get_pamt_sz().  It looks doesn't fit
> if we open-code it here.
> 
> How about move this comment to tdmr_get_pamt_sz()?

I thought about that.  But tdmr_get_pamt_sz() isn't itself doing any
allocation so it doesn't make a whole lot of logical sense.  This is a
place where a helper _can_ be removed.  Remove it, please.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-29 14:20       ` Dave Hansen
@ 2022-04-29 14:30         ` Sean Christopherson
  2022-04-29 17:46           ` Dave Hansen
  2022-05-02  5:59         ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Sean Christopherson @ 2022-04-29 14:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, linux-kernel, kvm, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, Apr 29, 2022, Dave Hansen wrote:
> On 4/29/22 00:46, Kai Huang wrote:
> > On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> >> This is also a good place to note the downsides of using
> >> alloc_contig_pages().
> > 
> > For instance:
> > 
> > 	The allocation may fail when memory usage is under pressure.
> 
> It's not really memory pressure, though.  The larger the allocation, the
> more likely it is to fail.  The more likely it is that the kernel can't
> free the memory or that if you need 1GB of contiguous memory that
> 999.996MB gets freed, but there is one stubborn page left.
> 
> alloc_contig_pages() can and will fail.  The only mitigation which is
> guaranteed to avoid this is doing the allocation at boot.  But, you're
> not doing that to avoid wasting memory on every TDX system that doesn't
> use TDX.
> 
> A *good* way (although not foolproof) is to launch a TDX VM early in
> boot before memory gets fragmented or consumed.  You might even want to
> recommend this in the documentation.

What about providing a kernel param to tell the kernel to do the allocation during
boot?  Or maybe a sysfs knob to reserve/free the memory, a la nr_overcommit_hugepages?

I suspect that most/all deployments that actually want to use TDX would much prefer
to eat the overhead if TDX VMs are never scheduled on the host, as opposed to having
to deal with a host in a TDX pool not actually being able to run TDX VMs.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29  2:58       ` Dan Williams
  2022-04-29  5:43         ` Kai Huang
@ 2022-04-29 14:39         ` Dave Hansen
  2022-04-29 15:18           ` Dan Williams
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 14:39 UTC (permalink / raw)
  To: Dan Williams, Kai Huang
  Cc: Linux Kernel Mailing List, KVM list, Sean Christopherson,
	Paolo Bonzini, Brown, Len, Luck, Tony, Rafael J Wysocki,
	Reinette Chatre, Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, Isaku Yamahata

On 4/28/22 19:58, Dan Williams wrote:
> That only seems possible if the kernel is given a TDX capable physical
> address map at the beginning of time.

TDX actually brings along its own memory map.  The "EAS"[1]. has a lot
of info on it, if you know where to find it.  Here's the relevant chunk:

CMR - Convertible Memory Range -
	A range of physical memory configured by BIOS and verified by
	MCHECK. MCHECK verificatio n is intended to help ensure that a
	CMR may be used to hold TDX memory pages encrypted with a
	private HKID.

So, the BIOS has the platform knowledge to enumerate this range.  It
stashes the information off somewhere that the TDX module can find it.
Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the
TDX module and gets the list of CMRs.

The OS then has to reconcile this CMR "memory map" against the regular
old BIOS-provided memory map, tossing out any memory regions which are
RAM, but not covered by a CMR, or disabling TDX entirely.

Fun, eh?

I'm still grappling with how this series handles the policy of what
memory to throw away when.

1.
https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 14:39         ` Dave Hansen
@ 2022-04-29 15:18           ` Dan Williams
  2022-04-29 17:18             ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-04-29 15:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Fri, Apr 29, 2022 at 7:39 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/28/22 19:58, Dan Williams wrote:
> > That only seems possible if the kernel is given a TDX capable physical
> > address map at the beginning of time.
>
> TDX actually brings along its own memory map.  The "EAS"[1]. has a lot
> of info on it, if you know where to find it.  Here's the relevant chunk:
>
> CMR - Convertible Memory Range -
>         A range of physical memory configured by BIOS and verified by
>         MCHECK. MCHECK verificatio n is intended to help ensure that a
>         CMR may be used to hold TDX memory pages encrypted with a
>         private HKID.
>
> So, the BIOS has the platform knowledge to enumerate this range.  It
> stashes the information off somewhere that the TDX module can find it.
> Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the
> TDX module and gets the list of CMRs.
>
> The OS then has to reconcile this CMR "memory map" against the regular
> old BIOS-provided memory map, tossing out any memory regions which are
> RAM, but not covered by a CMR, or disabling TDX entirely.
>
> Fun, eh?

Yes, I want to challenge the idea that all core-mm memory must be TDX
capable. Instead, this feels more like something that wants a
hugetlbfs / dax-device like capability to ask the kernel to gather /
set-aside the enumerated TDX memory out of all the general purpose
memory it knows about and then VMs use that ABI to get access to
convertible memory. Trying to ensure that all page allocator memory is
TDX capable feels too restrictive with all the different ways pfns can
get into the allocator.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 15:18           ` Dan Williams
@ 2022-04-29 17:18             ` Dave Hansen
  2022-04-29 17:48               ` Dan Williams
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 17:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On 4/29/22 08:18, Dan Williams wrote:
> Yes, I want to challenge the idea that all core-mm memory must be TDX
> capable. Instead, this feels more like something that wants a
> hugetlbfs / dax-device like capability to ask the kernel to gather /
> set-aside the enumerated TDX memory out of all the general purpose
> memory it knows about and then VMs use that ABI to get access to
> convertible memory. Trying to ensure that all page allocator memory is
> TDX capable feels too restrictive with all the different ways pfns can
> get into the allocator.

The KVM users are the problem here.  They use a variety of ABIs to get
memory and then hand it to KVM.  KVM basically just consumes the
physical addresses from the page tables.

Also, there's no _practical_ problem here today.  I can't actually think
of a case where any memory that ends up in the allocator on today's TDX
systems is not TDX capable.

Tomorrow's systems are going to be the problem.  They'll (presumably)
have a mix of CXL devices that will have varying capabilities.  Some
will surely lack the metadata storage for checksums and TD-owner bits.
TDX use will be *safe* on those systems: if you take this code and run
it on one tomorrow's systems, it will notice the TDX-incompatible memory
and will disable TDX.

The only way around this that I can see is to introduce ABI today that
anticipates the needs of the future systems.  We could require that all
the KVM memory be "validated" before handing it to TDX.  Maybe a new
syscall that says: "make sure this mapping works for TDX".  It could be
new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

But, neither of those really help with, say, a device-DAX mapping of
TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
throw up its hands and leave users with the same result: TDX can't be
used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
device-DAX because they don't respect the NUMA policy ABI.

I'm open to ideas here.  If there's a viable ABI we can introduce to
train TDX users today that will work tomorrow too, I'm all for it.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-29 14:30         ` Sean Christopherson
@ 2022-04-29 17:46           ` Dave Hansen
  2022-04-29 18:19             ` Sean Christopherson
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 17:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, linux-kernel, kvm, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/29/22 07:30, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Dave Hansen wrote:
...
>> A *good* way (although not foolproof) is to launch a TDX VM early
>> in boot before memory gets fragmented or consumed.  You might even
>> want to recommend this in the documentation.
> 
> What about providing a kernel param to tell the kernel to do the
> allocation during boot?

I think that's where we'll end up eventually.  But, I also want to defer
that discussion until after we have something merged.

Right now, allocating the PAMTs precisely requires running the TDX
module.  Running the TDX module requires VMXON.  VMXON is only done by
KVM.  KVM isn't necessarily there during boot.  So, it's hard to do
precisely today without a bunch of mucking with VMX.

But, it would be really easy to do something less precise like:

	tdx_reserve_ratio=255

...

u8 *pamt_reserve[MAX_NR_NODES]

	for_each_online_node(n) {
		pamt_pages = (node_spanned_pages(n)/tdx_reserve_ratio) /
			     PAGE_SIZE;
		pamt_reserve[n] = alloc_bootmem([pamt_pages);
	}

Then have the TDX code use pamt_reserve[] instead of allocating more
memory when it is needed later.

That will work just fine as long as you know up front how much metadata
TDX needs.  If the metadata requirements change in an updated TDX
module, the command-line will need to be updated to regain the
guarantee.  But, it can still fall back to the best-effort code that is
 in the series today.

In other words, I think we want what is in the series today no matter
what, and we'll want it forever.  That's why it's the *one* way of doing
things now.  I entirely agree that there will be TDX users that want a
stronger guarantee.

You can arm-wrestle the distro folks who hate adding command-line tweaks
when the time comes. ;)

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-28 23:14         ` Kai Huang
@ 2022-04-29 17:47           ` Dave Hansen
  2022-05-02  5:04             ` Kai Huang
  2022-05-25  4:47             ` Kai Huang
  0 siblings, 2 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 17:47 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/28/22 16:14, Kai Huang wrote:
> On Thu, 2022-04-28 at 07:06 -0700, Dave Hansen wrote:
>> On 4/27/22 17:15, Kai Huang wrote:
>>>> Couldn't we get rid of that comment if you did something like:
>>>>
>>>> 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
>>>
>>> Yes will do.
>>>
>>>> and preferably make the variables function-local.
>>>
>>> 'tdx_sysinfo' will be used by KVM too.
>>
>> In other words, it's not a part of this series so I can't review whether
>> this statement is correct or whether there's a better way to hand this
>> information over to KVM.
>>
>> This (minor) nugget influencing the design also isn't even commented or
>> addressed in the changelog.
> 
> TDSYSINFO_STRUCT is 1024B and CMR array is 512B, so I don't think it should be
> in the stack.  I can change to use dynamic allocation at the beginning and free
> it at the end of the function.  KVM support patches can change it to static
> variable in the file.

2k of stack is big, but it isn't a deal breaker for something that's not
nested anywhere and that's only called once in a pretty controlled
setting and not in interrupt context.  I wouldn't cry about it.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 17:18             ` Dave Hansen
@ 2022-04-29 17:48               ` Dan Williams
  2022-04-29 18:34                 ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-04-29 17:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Fri, Apr 29, 2022 at 10:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 08:18, Dan Williams wrote:
> > Yes, I want to challenge the idea that all core-mm memory must be TDX
> > capable. Instead, this feels more like something that wants a
> > hugetlbfs / dax-device like capability to ask the kernel to gather /
> > set-aside the enumerated TDX memory out of all the general purpose
> > memory it knows about and then VMs use that ABI to get access to
> > convertible memory. Trying to ensure that all page allocator memory is
> > TDX capable feels too restrictive with all the different ways pfns can
> > get into the allocator.
>
> The KVM users are the problem here.  They use a variety of ABIs to get
> memory and then hand it to KVM.  KVM basically just consumes the
> physical addresses from the page tables.
>
> Also, there's no _practical_ problem here today.  I can't actually think
> of a case where any memory that ends up in the allocator on today's TDX
> systems is not TDX capable.
>
> Tomorrow's systems are going to be the problem.  They'll (presumably)
> have a mix of CXL devices that will have varying capabilities.  Some
> will surely lack the metadata storage for checksums and TD-owner bits.
> TDX use will be *safe* on those systems: if you take this code and run
> it on one tomorrow's systems, it will notice the TDX-incompatible memory
> and will disable TDX.
>
> The only way around this that I can see is to introduce ABI today that
> anticipates the needs of the future systems.  We could require that all
> the KVM memory be "validated" before handing it to TDX.  Maybe a new
> syscall that says: "make sure this mapping works for TDX".  It could be
> new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

Yes, node-id seems the only reasonable handle that can be used, and it
does not seem too onerous for a KVM user to have to set a node policy
preferring all the TDX / confidential-computing capable nodes.

> But, neither of those really help with, say, a device-DAX mapping of
> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> throw up its hands and leave users with the same result: TDX can't be
> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> device-DAX because they don't respect the NUMA policy ABI.

They do have "target_node" attributes to associate node specific
metadata, and could certainly express target_node capabilities in its
own ABI. Then it's just a matter of making pfn_to_nid() do the right
thing so KVM kernel side can validate the capabilities of all inbound
pfns.

> I'm open to ideas here.  If there's a viable ABI we can introduce to
> train TDX users today that will work tomorrow too, I'm all for it.

In general, expressing NUMA node perf and node capabilities is
something Linux needs to get better at. HMAT data for example still
exists as sideband information ignored by numactl, but it feels
inevitable that perf and capability details become more of a first
class citizen for applications that have these mem-allocation-policy
constraints in the presence of disparate memory types.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-29 17:46           ` Dave Hansen
@ 2022-04-29 18:19             ` Sean Christopherson
  2022-04-29 18:32               ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Sean Christopherson @ 2022-04-29 18:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, linux-kernel, kvm, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, Apr 29, 2022, Dave Hansen wrote:
> On 4/29/22 07:30, Sean Christopherson wrote:
> > On Fri, Apr 29, 2022, Dave Hansen wrote:
> ...
> >> A *good* way (although not foolproof) is to launch a TDX VM early
> >> in boot before memory gets fragmented or consumed.  You might even
> >> want to recommend this in the documentation.
> > 
> > What about providing a kernel param to tell the kernel to do the
> > allocation during boot?
> 
> I think that's where we'll end up eventually.  But, I also want to defer
> that discussion until after we have something merged.
> 
> Right now, allocating the PAMTs precisely requires running the TDX
> module.  Running the TDX module requires VMXON.  VMXON is only done by
> KVM.  KVM isn't necessarily there during boot.  So, it's hard to do
> precisely today without a bunch of mucking with VMX.

Meh, it's hard only if we ignore the fact that the PAMT entry size isn't going
to change for a given TDX module, and is extremely unlikely to change in general.

Odds are good the kernel can hardcode a sane default and Just Work.  Or provide
the assumed size of a PAMT entry via module param.  If the size ends up being
wrong, log an error, free the reserved memory, and move on with TDX setup with
the correct size.

> You can arm-wrestle the distro folks who hate adding command-line tweaks
> when the time comes. ;)

Sure, you just find me the person that's going to run TDX guests with an
off-the-shelf distro kernel :-D

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-29 18:19             ` Sean Christopherson
@ 2022-04-29 18:32               ` Dave Hansen
  0 siblings, 0 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 18:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, linux-kernel, kvm, pbonzini, len.brown, tony.luck,
	rafael.j.wysocki, reinette.chatre, dan.j.williams, peterz, ak,
	kirill.shutemov, sathyanarayanan.kuppuswamy, isaku.yamahata

On 4/29/22 11:19, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Dave Hansen wrote:
>> On 4/29/22 07:30, Sean Christopherson wrote:
>>> On Fri, Apr 29, 2022, Dave Hansen wrote:
>> ...
>>>> A *good* way (although not foolproof) is to launch a TDX VM early
>>>> in boot before memory gets fragmented or consumed.  You might even
>>>> want to recommend this in the documentation.
>>>
>>> What about providing a kernel param to tell the kernel to do the
>>> allocation during boot?
>>
>> I think that's where we'll end up eventually.  But, I also want to defer
>> that discussion until after we have something merged.
>>
>> Right now, allocating the PAMTs precisely requires running the TDX
>> module.  Running the TDX module requires VMXON.  VMXON is only done by
>> KVM.  KVM isn't necessarily there during boot.  So, it's hard to do
>> precisely today without a bunch of mucking with VMX.
> 
> Meh, it's hard only if we ignore the fact that the PAMT entry size isn't going
> to change for a given TDX module, and is extremely unlikely to change in general.
> 
> Odds are good the kernel can hardcode a sane default and Just Work.  Or provide
> the assumed size of a PAMT entry via module param.  If the size ends up being
> wrong, log an error, free the reserved memory, and move on with TDX setup with
> the correct size.

Sure.  The boot param could be:

	tdx_reserve_whatever=auto

and then it can be overridden if necessary.  I just don't want to have
kernel binaries that are only good as paperweights if Intel decides it
needs another byte of metadata.

>> You can arm-wrestle the distro folks who hate adding command-line tweaks
>> when the time comes. ;)
> 
> Sure, you just find me the person that's going to run TDX guests with an
> off-the-shelf distro kernel :-D

Well, everyone gets their kernel from upstream eventually and everyone
watches upstream.

But, in all seriousness, do you really expect TDX to remain solely in
the non-distro-kernel crowd forever?  I expect that the fancy cloud
providers (with custom kernels) who care the most to deploy TDX fist.
But, things will trickle down to the distro crowd over time.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 17:48               ` Dan Williams
@ 2022-04-29 18:34                 ` Dave Hansen
  2022-04-29 18:47                   ` Dan Williams
  2022-05-02 10:18                   ` Kai Huang
  0 siblings, 2 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 18:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On 4/29/22 10:48, Dan Williams wrote:
>> But, neither of those really help with, say, a device-DAX mapping of
>> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
>> throw up its hands and leave users with the same result: TDX can't be
>> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
>> device-DAX because they don't respect the NUMA policy ABI.
> They do have "target_node" attributes to associate node specific
> metadata, and could certainly express target_node capabilities in its
> own ABI. Then it's just a matter of making pfn_to_nid() do the right
> thing so KVM kernel side can validate the capabilities of all inbound
> pfns.

Let's walk through how this would work with today's kernel on tomorrow's
hardware, without KVM validating PFNs:

1. daxaddr mmap("/dev/dax1234")
2. kvmfd = open("/dev/kvm")
3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
4. guest starts running
5. guest touches 'daxaddr'
6. Page fault handler maps 'daxaddr'
7. KVM finds new 'daxaddr' PTE
8. TDX code tries to add physical address to Secure-EPT
9. TDX "SEAMCALL" fails because page is not convertible
10. Guest dies

All we can do to improve on that is call something that pledges to only
map convertible memory at 'daxaddr'.  We can't *actually* validate the
physical addresses at mmap() time or even
KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
allocated.

Those pledges are hard for anonymous memory though.  To fulfill the
pledge, we not only have to validate that the NUMA policy is compatible
at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
policy that might undermine the pledge.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 18:34                 ` Dave Hansen
@ 2022-04-29 18:47                   ` Dan Williams
  2022-04-29 19:20                     ` Dave Hansen
  2022-05-02 10:18                   ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-04-29 18:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 10:48, Dan Williams wrote:
> >> But, neither of those really help with, say, a device-DAX mapping of
> >> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> >> throw up its hands and leave users with the same result: TDX can't be
> >> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> >> device-DAX because they don't respect the NUMA policy ABI.
> > They do have "target_node" attributes to associate node specific
> > metadata, and could certainly express target_node capabilities in its
> > own ABI. Then it's just a matter of making pfn_to_nid() do the right
> > thing so KVM kernel side can validate the capabilities of all inbound
> > pfns.
>
> Let's walk through how this would work with today's kernel on tomorrow's
> hardware, without KVM validating PFNs:
>
> 1. daxaddr mmap("/dev/dax1234")
> 2. kvmfd = open("/dev/kvm")
> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };

At least for a file backed mapping the capability lookup could be done
here, no need to wait for the fault.

> 4. guest starts running
> 5. guest touches 'daxaddr'
> 6. Page fault handler maps 'daxaddr'
> 7. KVM finds new 'daxaddr' PTE
> 8. TDX code tries to add physical address to Secure-EPT
> 9. TDX "SEAMCALL" fails because page is not convertible
> 10. Guest dies
>
> All we can do to improve on that is call something that pledges to only
> map convertible memory at 'daxaddr'.  We can't *actually* validate the
> physical addresses at mmap() time or even
> KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
> allocated.
>
> Those pledges are hard for anonymous memory though.  To fulfill the
> pledge, we not only have to validate that the NUMA policy is compatible
> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> policy that might undermine the pledge.

I think it's less that the kernel needs to enforce a pledge and more
that an interface is needed to communicate the guest death reason.
I.e. "here is the impossible thing you asked for, next time set this
policy to avoid this problem".

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 18:47                   ` Dan Williams
@ 2022-04-29 19:20                     ` Dave Hansen
  2022-04-29 21:20                       ` Dan Williams
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 19:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On 4/29/22 11:47, Dan Williams wrote:
> On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/29/22 10:48, Dan Williams wrote:
>>>> But, neither of those really help with, say, a device-DAX mapping of
>>>> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
>>>> throw up its hands and leave users with the same result: TDX can't be
>>>> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
>>>> device-DAX because they don't respect the NUMA policy ABI.
>>> They do have "target_node" attributes to associate node specific
>>> metadata, and could certainly express target_node capabilities in its
>>> own ABI. Then it's just a matter of making pfn_to_nid() do the right
>>> thing so KVM kernel side can validate the capabilities of all inbound
>>> pfns.
>>
>> Let's walk through how this would work with today's kernel on tomorrow's
>> hardware, without KVM validating PFNs:
>>
>> 1. daxaddr mmap("/dev/dax1234")
>> 2. kvmfd = open("/dev/kvm")
>> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
> 
> At least for a file backed mapping the capability lookup could be done
> here, no need to wait for the fault.

For DAX mappings, sure.  But, anything that's backed by page cache, you
can't know until the RAM is allocated.

...
>> Those pledges are hard for anonymous memory though.  To fulfill the
>> pledge, we not only have to validate that the NUMA policy is compatible
>> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
>> policy that might undermine the pledge.
> 
> I think it's less that the kernel needs to enforce a pledge and more
> that an interface is needed to communicate the guest death reason.
> I.e. "here is the impossible thing you asked for, next time set this
> policy to avoid this problem".

IF this code is booted on a system where non-TDX-capable memory is
discovered, do we:
1. Disable TDX, printk() some nasty message, then boot as normal
or,
2a. Boot normally with TDX enabled
2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure
    (the "SEAMCALLs" which are the last line of defense and will reject
     the request to add non-TDX-capable memory to a guest).  Or maybe
    an even earlier message.

For #1, if TDX is on, we are quite sure it will work.  But, it will
probably throw up its hands on tomorrow's hardware.  (This patch set).

For #2, TDX might break (guests get killed) at runtime on tomorrow's
hardware, but it also might be just fine.  Users might be able to work
around things by, for instance, figuring out a NUMA policy which
excludes TDX-incapable memory. (I think what Dan is looking for)

Is that a fair summary?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 19:20                     ` Dave Hansen
@ 2022-04-29 21:20                       ` Dan Williams
  2022-04-29 21:27                         ` Dave Hansen
  0 siblings, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-04-29 21:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Fri, Apr 29, 2022 at 12:20 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 11:47, Dan Williams wrote:
> > On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 4/29/22 10:48, Dan Williams wrote:
> >>>> But, neither of those really help with, say, a device-DAX mapping of
> >>>> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> >>>> throw up its hands and leave users with the same result: TDX can't be
> >>>> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> >>>> device-DAX because they don't respect the NUMA policy ABI.
> >>> They do have "target_node" attributes to associate node specific
> >>> metadata, and could certainly express target_node capabilities in its
> >>> own ABI. Then it's just a matter of making pfn_to_nid() do the right
> >>> thing so KVM kernel side can validate the capabilities of all inbound
> >>> pfns.
> >>
> >> Let's walk through how this would work with today's kernel on tomorrow's
> >> hardware, without KVM validating PFNs:
> >>
> >> 1. daxaddr mmap("/dev/dax1234")
> >> 2. kvmfd = open("/dev/kvm")
> >> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
> >
> > At least for a file backed mapping the capability lookup could be done
> > here, no need to wait for the fault.
>
> For DAX mappings, sure.  But, anything that's backed by page cache, you
> can't know until the RAM is allocated.
>
> ...
> >> Those pledges are hard for anonymous memory though.  To fulfill the
> >> pledge, we not only have to validate that the NUMA policy is compatible
> >> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> >> policy that might undermine the pledge.
> >
> > I think it's less that the kernel needs to enforce a pledge and more
> > that an interface is needed to communicate the guest death reason.
> > I.e. "here is the impossible thing you asked for, next time set this
> > policy to avoid this problem".
>
> IF this code is booted on a system where non-TDX-capable memory is
> discovered, do we:
> 1. Disable TDX, printk() some nasty message, then boot as normal
> or,
> 2a. Boot normally with TDX enabled
> 2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure
>     (the "SEAMCALLs" which are the last line of defense and will reject
>      the request to add non-TDX-capable memory to a guest).  Or maybe
>     an even earlier message.
>
> For #1, if TDX is on, we are quite sure it will work.  But, it will
> probably throw up its hands on tomorrow's hardware.  (This patch set).
>
> For #2, TDX might break (guests get killed) at runtime on tomorrow's
> hardware, but it also might be just fine.  Users might be able to work
> around things by, for instance, figuring out a NUMA policy which
> excludes TDX-incapable memory. (I think what Dan is looking for)
>
> Is that a fair summary?

Yes, just the option for TDX and non-TDX to live alongside each
other... although in the past I have argued to do option-1 and enforce
it at the lowest level [1]. Like platform BIOS is responsible to
disable CXL if CXL support for a given CPU security feature is
missing. However, I think end users will want to have their
confidential computing and capacity too. As long as that is not
precluded to be added after the fact, option-1 can be a way forward
until a concrete user for mixed mode shows up.

Is there something already like this today for people that, for
example, attempt to use PCI BAR mappings as memory? Or does KVM simply
allow for garbage-in garbage-out?

In the end the patches shouldn't talk about whether or not PMEM is
supported on a platform or not, that's irrelevant. What matters is
that misconfigurations can happen, should be rare to non-existent on
current platforms, and if it becomes a problem the kernel can grow ABI
to let userspace enumerate the conflicts.

[1]: https://lore.kernel.org/linux-cxl/CAPcyv4jMQbHYQssaDDDQFEbOR1v14VUnejcSwOP9VGUnZSsCKw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 21:20                       ` Dan Williams
@ 2022-04-29 21:27                         ` Dave Hansen
  0 siblings, 0 replies; 156+ messages in thread
From: Dave Hansen @ 2022-04-29 21:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kai Huang, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On 4/29/22 14:20, Dan Williams wrote:
> Is there something already like this today for people that, for
> example, attempt to use PCI BAR mappings as memory? Or does KVM simply
> allow for garbage-in garbage-out?

I'm just guessing, but I _assume_ those garbage PCI BAR mappings are how
KVM does device passthrough.

I know that some KVM users even use mem= to chop down the kernel-owned
'struct page'-backed memory, then have a kind of /dev/mem driver to let
the memory get mapped back into userspace.  KVM is happy to pass through
those mappings.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-29 17:47           ` Dave Hansen
@ 2022-05-02  5:04             ` Kai Huang
  2022-05-25  4:47             ` Kai Huang
  1 sibling, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-02  5:04 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-04-29 at 10:47 -0700, Dave Hansen wrote:
> On 4/28/22 16:14, Kai Huang wrote:
> > On Thu, 2022-04-28 at 07:06 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:15, Kai Huang wrote:
> > > > > Couldn't we get rid of that comment if you did something like:
> > > > > 
> > > > > 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
> > > > 
> > > > Yes will do.
> > > > 
> > > > > and preferably make the variables function-local.
> > > > 
> > > > 'tdx_sysinfo' will be used by KVM too.
> > > 
> > > In other words, it's not a part of this series so I can't review whether
> > > this statement is correct or whether there's a better way to hand this
> > > information over to KVM.
> > > 
> > > This (minor) nugget influencing the design also isn't even commented or
> > > addressed in the changelog.
> > 
> > TDSYSINFO_STRUCT is 1024B and CMR array is 512B, so I don't think it should be
> > in the stack.  I can change to use dynamic allocation at the beginning and free
> > it at the end of the function.  KVM support patches can change it to static
> > variable in the file.
> 
> 2k of stack is big, but it isn't a deal breaker for something that's not
> nested anywhere and that's only called once in a pretty controlled
> setting and not in interrupt context.  I wouldn't cry about it.

OK.  I'll change to use function local variables for both of them.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-04-29 14:20       ` Dave Hansen
  2022-04-29 14:30         ` Sean Christopherson
@ 2022-05-02  5:59         ` Kai Huang
  2022-05-02 14:17           ` Dave Hansen
  1 sibling, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-02  5:59 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> On 4/29/22 00:46, Kai Huang wrote:
> > On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> > > This is also a good place to note the downsides of using
> > > alloc_contig_pages().
> > 
> > For instance:
> > 
> > 	The allocation may fail when memory usage is under pressure.
> 
> It's not really memory pressure, though.  The larger the allocation, the
> more likely it is to fail.  The more likely it is that the kernel can't
> free the memory or that if you need 1GB of contiguous memory that
> 999.996MB gets freed, but there is one stubborn page left.
> 
> alloc_contig_pages() can and will fail.  The only mitigation which is
> guaranteed to avoid this is doing the allocation at boot.  But, you're
> not doing that to avoid wasting memory on every TDX system that doesn't
> use TDX.
> 
> A *good* way (although not foolproof) is to launch a TDX VM early in
> boot before memory gets fragmented or consumed.  You might even want to
> recommend this in the documentation.

"launch a TDX VM early in boot" I suppose you mean having some boot-time service
which launches a TDX VM before we get the login interface.  I'll put this in the
documentation.

How about adding below in the changelog:

"
However using alloc_contig_pages() to allocate large physically contiguous
memory at runtime may fail.  The larger the allocation, the more likely it is to
fail.  Due to the fragmentation, the kernel may need to move pages out of the
to-be-allocated contiguous memory range but it may fail to move even the last
stubborn page.  A good way (although not foolproof) is to launch a TD VM early
in boot to get PAMTs allocated before memory gets fragmented or consumed.
"

> 
> > > > +/*
> > > > + * Locate the NUMA node containing the start of the given TDMR's first
> > > > + * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
> > > > + */
> > > 
> > > Please add a sentence or two on the implications here of what this means
> > > when it happens.  Also, the joining of e820 regions seems like it might
> > > span NUMA nodes.  What prevents that code from just creating one large
> > > e820 area that leads to one large TDMR and horrible NUMA affinity for
> > > these structures?
> > 
> > How about adding:
> > 
> > 	When TDMR is created, it stops spanning at NUAM boundary.
> 
> I actually don't know what that means at all.  I was thinking of
> something like this.
> 
> /*
>  * Pick a NUMA node on which to allocate this TDMR's metadata.
>  *
>  * This is imprecise since TDMRs are 1GB aligned and NUMA nodes might
>  * not be.  If the TDMR covers more than one node, just use the _first_
>  * one.  This can lead to small areas of off-node metadata for some
>  * memory.
>  */

Thanks.

> 
> > > > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > > > +{
> > > > +	u64 start, end;
> > > > +	int i;
> > > > +
> > > > +	/* Find the first RAM entry covered by the TDMR */
> 
> There's something else missing in here.  Why not just do:
> 
> 	return phys_to_target_node(TDMR_START(tdmr));
> 
> This would explain it:
> 
> 	/*
> 	 * The beginning of the TDMR might not point to RAM.
> 	 * Find its first RAM address which which its node can
> 	 * be found.
> 	 */

Will use this.  Thanks.

> 
> > > > +	e820_for_each_mem(i, start, end)
> > > > +		if (end > TDMR_START(tdmr))
> > > > +			break;
> > > 
> > > Brackets around the big loop, please.
> > 
> > OK.
> > 
> > > 
> > > > +	/*
> > > > +	 * One TDMR must cover at least one (or partial) RAM entry,
> > > > +	 * otherwise it is kernel bug.  WARN_ON() in this case.
> > > > +	 */
> > > > +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > > > +		return 0;
> 
> This really means "no RAM found for this TDMR", right?  Can we say that,
> please.

OK will add it.  How about:

	/*
	 * No RAM found for this TDMR.  WARN() in this case, as it
	 * cannot happen otherwise it is a kernel bug.
	 */

> 
> 
> > > > +	/*
> > > > +	 * Allocate one chunk of physically contiguous memory for all
> > > > +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> > > > +	 * in overlapped TDMRs.
> > > > +	 */
> > > 
> > > Ahh, this explains it.  Considering that tdmr_get_pamt_sz() is really
> > > just two lines of code, I'd probably just the helper and open-code it
> > > here.  Then you only have one place to comment on it.
> > 
> > It has a loop and internally calls __tdmr_get_pamt_sz().  It looks doesn't fit
> > if we open-code it here.
> > 
> > How about move this comment to tdmr_get_pamt_sz()?
> 
> I thought about that.  But tdmr_get_pamt_sz() isn't itself doing any
> allocation so it doesn't make a whole lot of logical sense.  This is a
> place where a helper _can_ be removed.  Remove it, please.

OK.  Will remove the helper.  Thanks.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29 18:34                 ` Dave Hansen
  2022-04-29 18:47                   ` Dan Williams
@ 2022-05-02 10:18                   ` Kai Huang
  1 sibling, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-02 10:18 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Linux Kernel Mailing List, KVM list, Sean Christopherson,
	Paolo Bonzini, Brown, Len, Luck, Tony, Rafael J Wysocki,
	Reinette Chatre, Peter Zijlstra, Andi Kleen, Kirill A. Shutemov,
	Kuppuswamy Sathyanarayanan, Isaku Yamahata, chao.p.peng

On Fri, 2022-04-29 at 11:34 -0700, Dave Hansen wrote:
> On 4/29/22 10:48, Dan Williams wrote:
> > > But, neither of those really help with, say, a device-DAX mapping of
> > > TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> > > throw up its hands and leave users with the same result: TDX can't be
> > > used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> > > device-DAX because they don't respect the NUMA policy ABI.
> > They do have "target_node" attributes to associate node specific
> > metadata, and could certainly express target_node capabilities in its
> > own ABI. Then it's just a matter of making pfn_to_nid() do the right
> > thing so KVM kernel side can validate the capabilities of all inbound
> > pfns.
> 
> Let's walk through how this would work with today's kernel on tomorrow's
> hardware, without KVM validating PFNs:
> 
> 1. daxaddr mmap("/dev/dax1234")
> 2. kvmfd = open("/dev/kvm")
> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
> 4. guest starts running
> 5. guest touches 'daxaddr'
> 6. Page fault handler maps 'daxaddr'
> 7. KVM finds new 'daxaddr' PTE
> 8. TDX code tries to add physical address to Secure-EPT
> 9. TDX "SEAMCALL" fails because page is not convertible
> 10. Guest dies
> 
> All we can do to improve on that is call something that pledges to only
> map convertible memory at 'daxaddr'.  We can't *actually* validate the
> physical addresses at mmap() time or even
> KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
> allocated.
> 
> Those pledges are hard for anonymous memory though.  To fulfill the
> pledge, we not only have to validate that the NUMA policy is compatible
> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> policy that might undermine the pledge.

Hi Dave,

There's another series done by Chao "KVM: mm: fd-based approach for supporting
KVM guest private memory" which essentially allows KVM to ask guest memory
backend to allocate page w/o having to mmap() to userspace.  Please see my reply
below:

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#mf9bf10a63eaaf0968c46ab33bdaf06bd2cfdd948

My understanding is for TDX guest KVM will be enforced to use the new mechanism.
So when TDX supports NVDIMM in the future, dax can be extended to support the
new mechanism to support using it as TD guest backend.

Sean, Paolo, Isaku, Chao,

Please correct me if I am wrong?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-05-02  5:59         ` Kai Huang
@ 2022-05-02 14:17           ` Dave Hansen
  2022-05-02 21:55             ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-05-02 14:17 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 5/1/22 22:59, Kai Huang wrote:
> On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> How about adding below in the changelog:
> 
> "
> However using alloc_contig_pages() to allocate large physically contiguous
> memory at runtime may fail.  The larger the allocation, the more likely it is to
> fail.  Due to the fragmentation, the kernel may need to move pages out of the
> to-be-allocated contiguous memory range but it may fail to move even the last
> stubborn page.  A good way (although not foolproof) is to launch a TD VM early
> in boot to get PAMTs allocated before memory gets fragmented or consumed.
> "

Better, although it's getting a bit off topic for this changelog.

Just be short and sweet:

1. the allocation can fail
2. Launch a VM early to (badly) mitigate this
3. the only way to fix it is to add a boot option


>>>>> +	/*
>>>>> +	 * One TDMR must cover at least one (or partial) RAM entry,
>>>>> +	 * otherwise it is kernel bug.  WARN_ON() in this case.
>>>>> +	 */
>>>>> +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
>>>>> +		return 0;
>>
>> This really means "no RAM found for this TDMR", right?  Can we say that,
>> please.
> 
> OK will add it.  How about:
> 
> 	/*
> 	 * No RAM found for this TDMR.  WARN() in this case, as it
> 	 * cannot happen otherwise it is a kernel bug.
> 	 */

The only useful information in that comment is the first sentence.  The
jibberish about WARN() is patently obvious from the next two lines of code.

*WHY* can't this happen?  How might it have actually happened?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-05-02 14:17           ` Dave Hansen
@ 2022-05-02 21:55             ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-02 21:55 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Mon, 2022-05-02 at 07:17 -0700, Dave Hansen wrote:
> On 5/1/22 22:59, Kai Huang wrote:
> > On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> > How about adding below in the changelog:
> > 
> > "
> > However using alloc_contig_pages() to allocate large physically contiguous
> > memory at runtime may fail.  The larger the allocation, the more likely it is to
> > fail.  Due to the fragmentation, the kernel may need to move pages out of the
> > to-be-allocated contiguous memory range but it may fail to move even the last
> > stubborn page.  A good way (although not foolproof) is to launch a TD VM early
> > in boot to get PAMTs allocated before memory gets fragmented or consumed.
> > "
> 
> Better, although it's getting a bit off topic for this changelog.
> 
> Just be short and sweet:
> 
> 1. the allocation can fail
> 2. Launch a VM early to (badly) mitigate this
> 3. the only way to fix it is to add a boot option
> 
OK. Thanks.

> 
> > > > > > +	/*
> > > > > > +	 * One TDMR must cover at least one (or partial) RAM entry,
> > > > > > +	 * otherwise it is kernel bug.  WARN_ON() in this case.
> > > > > > +	 */
> > > > > > +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > > > > > +		return 0;
> > > 
> > > This really means "no RAM found for this TDMR", right?  Can we say that,
> > > please.
> > 
> > OK will add it.  How about:
> > 
> > 	/*
> > 	 * No RAM found for this TDMR.  WARN() in this case, as it
> > 	 * cannot happen otherwise it is a kernel bug.
> > 	 */
> 
> The only useful information in that comment is the first sentence.  The
> jibberish about WARN() is patently obvious from the next two lines of code.
> 
> *WHY* can't this happen?  How might it have actually happened?

When TDMRs are created, we already have made sure one TDMR must cover at least
one or partial RAM entry.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-04-29  1:40             ` Kai Huang
  2022-04-29  3:04               ` Dan Williams
@ 2022-05-03 23:59               ` Kai Huang
  2022-05-04  0:25                 ` Dave Hansen
  2022-05-04 14:31                 ` Dan Williams
  1 sibling, 2 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-03 23:59 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > 
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > can change the code and the documentation  when we add the support of those
> > > > features in the future, and update the documentation.
> > > > 
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > > 
> > > > I'll think about design solutions if above doesn't look good for you.
> > > 
> > > No, it doesn't look good to me.
> > > 
> > > You can't just say:
> > > 
> > > 	/*
> > > 	 * This code will eat puppies if used on systems with hotplug.
> > > 	 */
> > > 
> > > and merrily await the puppy bloodbath.
> > > 
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > > 
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > > 
> > > Please, please read this again ^^
> > 
> > OK.  I'll think about solutions and come back later.
> > > 
> 
> Hi Dave,
> 
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.  
> 
> The first approach is simple.  We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX.  __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

Hi Dave,

Try to close how to handle memory hotplug.  Any comments to below?

For the first approach, I forgot to think about memory hot-remove case.  If we
just reject adding new memory resource when TDX is capable on the platform, then
if the memory is hot-removed, we won't be able to add it back.  My thinking is
we still want to support memory online/offline because it is purely in software
but has nothing to do with TDX.  But if one memory resource can be put to
offline, it seems we don't have any way to prevent it from being removed. 

So if we do above, on the future platforms when memory hotplug can co-exist with
TDX, ACPI hot-add and kmem-hot-add memory will be prevented.  However if some
memory is hot-removed, it won't be able to be added back (even it is included in
CMR, or TDMRs after TDX module is initialized).

Is this behavior acceptable?  Or perhaps I have misunderstanding?

The second approach will behave more nicely, but I don't know whether it is
worth to do it now.

Btw, below logic when adding a new memory resource has a minor problem, please
see below...

> 
> The second approach is relatively more complicated.  Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> during kernel boots.  And in __add_memory_resource() we do below check:
> 
> 	tdx_init_disable();	/*similar to cpu_hotplug_disable() */
> 	if (tdx_module_initialized())
> 		// reject memory hotplug
> 	else if (new_memory_resource NOT in CMRs)
> 		// reject memory hotplug
> 	else
> 		allow memory hotplug
> 	tdx_init_enable();	/*similar to cpu_hotplug_enable() */

...

Should be:

	// prevent racing with TDX module initialization */
	tdx_init_disable();

	if (tdx_module_initialized()) {
		if (new_memory_resource in TDMRs)
			// allow memory hot-add
		else
			// reject memory hot-add
	} else if (new_memory_resource in CMR) {
		// add new memory to TDX memory so it can be
		// included into TDMRs

		// allow memory hot-add
	}
	else
		// reject memory hot-add
	
	tdx_module_enable();

And when platform doesn't TDX, always allow memory hot-add.


> 
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex.  If the TDX module initialization is already on going, then it
> waits until it completes.
> 
> This should work better for future platforms, but would requires non-trivial
> more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> shutdown the TDX module in kexec().
> 
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?
> 
> Any suggestions?
> 
> [1] Platform supports TDX means SEAMRR is enabled, and there are at least 2 TDX
> keyIDs.  Or we can just check SEAMRR is enabled, as in practice a SEAMRR is
> enabled means the machine is TDX-capable, and for now a TDX-capable machine
> doesn't support ACPI memory hotplug.
> 
> [2] It prevents adding legacy PMEM as system RAM too but I think it's fine.  If
> user wants legacy PMEM then it is unlikely user will add it back and use as
> system RAM.  User is unlikely to use legacy PMEM as TD guest memory directly as
> TD guests is likely to use a new memfd backend which allows private page not
> accessible from usrspace, so in this way we can exclude legacy PMEM from TDMRs.
> 
> [3] Please refer to SEAMLDR.SEAMINFO SEAMCALL in latest P-SEAMLDR spec:
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > > > 


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-03 23:59               ` Kai Huang
@ 2022-05-04  0:25                 ` Dave Hansen
  2022-05-04  1:15                   ` Kai Huang
  2022-05-04 14:31                 ` Dan Williams
  1 sibling, 1 reply; 156+ messages in thread
From: Dave Hansen @ 2022-05-04  0:25 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On 5/3/22 16:59, Kai Huang wrote:
> Should be:
> 
> 	// prevent racing with TDX module initialization */
> 	tdx_init_disable();
> 
> 	if (tdx_module_initialized()) {
> 		if (new_memory_resource in TDMRs)
> 			// allow memory hot-add
> 		else
> 			// reject memory hot-add
> 	} else if (new_memory_resource in CMR) {
> 		// add new memory to TDX memory so it can be
> 		// included into TDMRs
> 
> 		// allow memory hot-add
> 	}
> 	else
> 		// reject memory hot-add
> 	
> 	tdx_module_enable();
> 
> And when platform doesn't TDX, always allow memory hot-add.

I don't think it even needs to be *that* complicated.

It could just be winner take all: if TDX is initialized first, don't
allow memory hotplug.  If memory hotplug happens first, don't allow TDX
to be initialized.

That's fine at least for a minimal patch set.

What you have up above is probably where you want to go eventually, but
it means doing things like augmenting the e820 since it's the single
source of truth for creating the TMDRs right now.


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-04  0:25                 ` Dave Hansen
@ 2022-05-04  1:15                   ` Kai Huang
  2022-05-05  9:54                     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-04  1:15 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote:
> On 5/3/22 16:59, Kai Huang wrote:
> > Should be:
> > 
> > 	// prevent racing with TDX module initialization */
> > 	tdx_init_disable();
> > 
> > 	if (tdx_module_initialized()) {
> > 		if (new_memory_resource in TDMRs)
> > 			// allow memory hot-add
> > 		else
> > 			// reject memory hot-add
> > 	} else if (new_memory_resource in CMR) {
> > 		// add new memory to TDX memory so it can be
> > 		// included into TDMRs
> > 
> > 		// allow memory hot-add
> > 	}
> > 	else
> > 		// reject memory hot-add
> > 	
> > 	tdx_module_enable();
> > 
> > And when platform doesn't TDX, always allow memory hot-add.
> 
> I don't think it even needs to be *that* complicated.
> 
> It could just be winner take all: if TDX is initialized first, don't
> allow memory hotplug.  If memory hotplug happens first, don't allow TDX
> to be initialized.
> 
> That's fine at least for a minimal patch set.

OK. This should also work.

We will need tdx_init_disable() which grabs the mutex to prevent TDX module
initialization from running concurrently, and to disable TDX module
initialization once for all.
 

> 
> What you have up above is probably where you want to go eventually, but
> it means doing things like augmenting the e820 since it's the single
> source of truth for creating the TMDRs right now.
> 

Yes.  But in this case, I am thinking about probably we should switch from
consulting e820 to consulting memblock.  The advantage of using e820 is it's
easy to include legacy PMEM as TDX memory, but the disadvantage is (as you can
see in e820_for_each_mem() loop) I am actually merging contiguous different
types of RAM entries in order to be consistent with the behavior of
e820_memblock_setup().  This is not nice.

If memory hot-add and TDX can only be one winner, legacy PMEM actually won't be
used as TDX memory anyway now.  The reason is TDX guest will very likely needing
to use the new fd-based backend (see my reply in other emails), but not just
some random backend.  To me it's totally fine to not support using legacy PMEM
directly as TD guest backend (and if we create a TD with real NVDIMM as backend
using dax, the TD cannot be created anyway).  Given we cannot kmem-hot-add
legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM into
TDMRs.

In this case, we can just create TDMRs based on memblock directly.  One problem
is memblock will be gone after kernel boots, but this can be solved either by
keeping the memblock, or build the TDX memory early when memblock is still
alive.

Btw, eventually, as it's likely we need to support different source of TDX
memory (CLX memory, etc), I think eventually we will need some data structures
to represent TDX memory block and APIs to add those blocks to the whole TDX
memory so those TDX memory ranges from different source can be added before
initializing the TDX module.

	struct tdx_memblock {
		struct list_head list;
		phys_addr_t start;
		phys_addr_t end;
		int nid;
		...
	};

	struct tdx_memory {
		struct list_head tmb_list;
		...
	};

	int tdx_memory_add_memblock(start, end, nid, ...);

And the TDMRs can be created based on 'struct tdx_memory'.

For now, we only need to add memblock to TDX memory.

Any comments?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-03 23:59               ` Kai Huang
  2022-05-04  0:25                 ` Dave Hansen
@ 2022-05-04 14:31                 ` Dan Williams
  2022-05-04 22:50                   ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-05-04 14:31 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Tue, May 3, 2022 at 4:59 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > >
> > > > > I thought we could document this in the documentation saying that this code can
> > > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > > can change the code and the documentation  when we add the support of those
> > > > > features in the future, and update the documentation.
> > > > >
> > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > machine support those features.
> > > > >
> > > > > I'll think about design solutions if above doesn't look good for you.
> > > >
> > > > No, it doesn't look good to me.
> > > >
> > > > You can't just say:
> > > >
> > > >   /*
> > > >    * This code will eat puppies if used on systems with hotplug.
> > > >    */
> > > >
> > > > and merrily await the puppy bloodbath.
> > > >
> > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > safe, controlled way.
> > > >
> > > > > > You can't just ignore the problems because they're not present on one
> > > > > > version of the hardware.
> > > >
> > > > Please, please read this again ^^
> > >
> > > OK.  I'll think about solutions and come back later.
> > > >
> >
> > Hi Dave,
> >
> > I think we have two approaches to handle memory hotplug interaction with the TDX
> > module initialization.
> >
> > The first approach is simple.  We just block memory from being added as system
> > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > can add some arch-specific-check to __add_memory_resource() and reject the new
> > memory resource if platform supports TDX.  __add_memory_resource() is called by
> > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > NVDIMM as system RAM and normal ACPI memory hotplug [2].
>
> Hi Dave,
>
> Try to close how to handle memory hotplug.  Any comments to below?
>
> For the first approach, I forgot to think about memory hot-remove case.  If we
> just reject adding new memory resource when TDX is capable on the platform, then
> if the memory is hot-removed, we won't be able to add it back.  My thinking is
> we still want to support memory online/offline because it is purely in software
> but has nothing to do with TDX.  But if one memory resource can be put to
> offline, it seems we don't have any way to prevent it from being removed.
>
> So if we do above, on the future platforms when memory hotplug can co-exist with
> TDX, ACPI hot-add and kmem-hot-add memory will be prevented.  However if some
> memory is hot-removed, it won't be able to be added back (even it is included in
> CMR, or TDMRs after TDX module is initialized).
>
> Is this behavior acceptable?  Or perhaps I have misunderstanding?

Memory online at boot uses similar kernel paths as memory-online at
run time, so it sounds like your question is confusing physical vs
logical remove. Consider the case of logical offline then re-online
where the proposed TDX sanity check blocks the memory online, but then
a new kernel is kexec'd and that kernel again trusts the memory as TD
convertible again just because it onlines the memory in the boot path.
For physical memory remove it seems up to the platform to block that
if it conflicts with TDX, not for the kernel to add extra assumptions
that logical offline / online is incompatible with TDX.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-04 14:31                 ` Dan Williams
@ 2022-05-04 22:50                   ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-04 22:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Wed, 2022-05-04 at 07:31 -0700, Dan Williams wrote:
> On Tue, May 3, 2022 at 4:59 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > > > 
> > > > > > I thought we could document this in the documentation saying that this code can
> > > > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > > > can change the code and the documentation  when we add the support of those
> > > > > > features in the future, and update the documentation.
> > > > > > 
> > > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > > machine support those features.
> > > > > > 
> > > > > > I'll think about design solutions if above doesn't look good for you.
> > > > > 
> > > > > No, it doesn't look good to me.
> > > > > 
> > > > > You can't just say:
> > > > > 
> > > > >   /*
> > > > >    * This code will eat puppies if used on systems with hotplug.
> > > > >    */
> > > > > 
> > > > > and merrily await the puppy bloodbath.
> > > > > 
> > > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > > safe, controlled way.
> > > > > 
> > > > > > > You can't just ignore the problems because they're not present on one
> > > > > > > version of the hardware.
> > > > > 
> > > > > Please, please read this again ^^
> > > > 
> > > > OK.  I'll think about solutions and come back later.
> > > > > 
> > > 
> > > Hi Dave,
> > > 
> > > I think we have two approaches to handle memory hotplug interaction with the TDX
> > > module initialization.
> > > 
> > > The first approach is simple.  We just block memory from being added as system
> > > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > > can add some arch-specific-check to __add_memory_resource() and reject the new
> > > memory resource if platform supports TDX.  __add_memory_resource() is called by
> > > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > > NVDIMM as system RAM and normal ACPI memory hotplug [2].
> > 
> > Hi Dave,
> > 
> > Try to close how to handle memory hotplug.  Any comments to below?
> > 
> > For the first approach, I forgot to think about memory hot-remove case.  If we
> > just reject adding new memory resource when TDX is capable on the platform, then
> > if the memory is hot-removed, we won't be able to add it back.  My thinking is
> > we still want to support memory online/offline because it is purely in software
> > but has nothing to do with TDX.  But if one memory resource can be put to
> > offline, it seems we don't have any way to prevent it from being removed.
> > 
> > So if we do above, on the future platforms when memory hotplug can co-exist with
> > TDX, ACPI hot-add and kmem-hot-add memory will be prevented.  However if some
> > memory is hot-removed, it won't be able to be added back (even it is included in
> > CMR, or TDMRs after TDX module is initialized).
> > 
> > Is this behavior acceptable?  Or perhaps I have misunderstanding?
> 
> Memory online at boot uses similar kernel paths as memory-online at
> run time, so it sounds like your question is confusing physical vs
> logical remove. Consider the case of logical offline then re-online
> where the proposed TDX sanity check blocks the memory online, but then
> a new kernel is kexec'd and that kernel again trusts the memory as TD
> convertible again just because it onlines the memory in the boot path.
> For physical memory remove it seems up to the platform to block that
> if it conflicts with TDX, not for the kernel to add extra assumptions
> that logical offline / online is incompatible with TDX.

Hi Dan,

No we don't block memory online, but we block memory add.  The code I mentioned
is add_memory_resource(), while memory online code path is
memory_block_online().  Or am I wrong?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-04  1:15                   ` Kai Huang
@ 2022-05-05  9:54                     ` Kai Huang
  2022-05-05 13:51                       ` Dan Williams
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-05  9:54 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-05-04 at 13:15 +1200, Kai Huang wrote:
> On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote:
> > On 5/3/22 16:59, Kai Huang wrote:
> > > Should be:
> > > 
> > > 	// prevent racing with TDX module initialization */
> > > 	tdx_init_disable();
> > > 
> > > 	if (tdx_module_initialized()) {
> > > 		if (new_memory_resource in TDMRs)
> > > 			// allow memory hot-add
> > > 		else
> > > 			// reject memory hot-add
> > > 	} else if (new_memory_resource in CMR) {
> > > 		// add new memory to TDX memory so it can be
> > > 		// included into TDMRs
> > > 
> > > 		// allow memory hot-add
> > > 	}
> > > 	else
> > > 		// reject memory hot-add
> > > 	
> > > 	tdx_module_enable();
> > > 
> > > And when platform doesn't TDX, always allow memory hot-add.
> > 
> > I don't think it even needs to be *that* complicated.
> > 
> > It could just be winner take all: if TDX is initialized first, don't
> > allow memory hotplug.  If memory hotplug happens first, don't allow TDX
> > to be initialized.
> > 
> > That's fine at least for a minimal patch set.
> 
> OK. This should also work.
> 
> We will need tdx_init_disable() which grabs the mutex to prevent TDX module
> initialization from running concurrently, and to disable TDX module
> initialization once for all.
>  
> 
> > 
> > What you have up above is probably where you want to go eventually, but
> > it means doing things like augmenting the e820 since it's the single
> > source of truth for creating the TMDRs right now.
> > 
> 
> Yes.  But in this case, I am thinking about probably we should switch from
> consulting e820 to consulting memblock.  The advantage of using e820 is it's
> easy to include legacy PMEM as TDX memory, but the disadvantage is (as you can
> see in e820_for_each_mem() loop) I am actually merging contiguous different
> types of RAM entries in order to be consistent with the behavior of
> e820_memblock_setup().  This is not nice.
> 
> If memory hot-add and TDX can only be one winner, legacy PMEM actually won't be
> used as TDX memory anyway now.  The reason is TDX guest will very likely needing
> to use the new fd-based backend (see my reply in other emails), but not just
> some random backend.  To me it's totally fine to not support using legacy PMEM
> directly as TD guest backend (and if we create a TD with real NVDIMM as backend
> using dax, the TD cannot be created anyway).  Given we cannot kmem-hot-add
> legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM into
> TDMRs.
> 
> In this case, we can just create TDMRs based on memblock directly.  One problem
> is memblock will be gone after kernel boots, but this can be solved either by
> keeping the memblock, or build the TDX memory early when memblock is still
> alive.
> 
> Btw, eventually, as it's likely we need to support different source of TDX
> memory (CLX memory, etc), I think eventually we will need some data structures
> to represent TDX memory block and APIs to add those blocks to the whole TDX
> memory so those TDX memory ranges from different source can be added before
> initializing the TDX module.
> 
> 	struct tdx_memblock {
> 		struct list_head list;
> 		phys_addr_t start;
> 		phys_addr_t end;
> 		int nid;
> 		...
> 	};
> 
> 	struct tdx_memory {
> 		struct list_head tmb_list;
> 		...
> 	};
> 
> 	int tdx_memory_add_memblock(start, end, nid, ...);
> 
> And the TDMRs can be created based on 'struct tdx_memory'.
> 
> For now, we only need to add memblock to TDX memory.
> 
> Any comments?
> 

Hi Dave,

Sorry to ping (trying to close this).

Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
initialization, I think for now it's totally fine to exclude legacy PMEMs from
TDMRs.  The worst case is when someone tries to use them as TD guest backend
directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
that no one should just use some random backend to run TD.

I think w/o needing to include legacy PMEM, it's better to get all TDX memory
blocks based on memblock, but not e820.  The pages managed by page allocator are
from memblock anyway (w/o those from memory hotplug).

And I also think it makes more sense to introduce 'tdx_memblock' and
'tdx_memory' data structures to gather all TDX memory blocks during boot when
memblock is still alive.  When TDX module is initialized during runtime, TDMRs
can be created based on the 'struct tdx_memory' which contains all TDX memory
blocks we gathered based on memblock during boot.  This is also more flexible to
support other TDX memory from other sources such as CLX memory in the future.

Please let me know if you have any objection?  Thanks!

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-05  9:54                     ` Kai Huang
@ 2022-05-05 13:51                       ` Dan Williams
  2022-05-05 22:14                         ` Kai Huang
  2022-05-07  0:09                         ` Mike Rapoport
  0 siblings, 2 replies; 156+ messages in thread
From: Dan Williams @ 2022-05-05 13:51 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

[ add Mike ]


On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
[..]
>
> Hi Dave,
>
> Sorry to ping (trying to close this).
>
> Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> initialization, I think for now it's totally fine to exclude legacy PMEMs from
> TDMRs.  The worst case is when someone tries to use them as TD guest backend
> directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> that no one should just use some random backend to run TD.

The platform will already do this, right? I don't understand why this
is trying to take proactive action versus documenting the error
conditions and steps someone needs to take to avoid unconvertible
memory. There is already the CONFIG_HMEM_REPORTING that describes
relative performance properties between initiators and targets, it
seems fitting to also add security properties between initiators and
targets so someone can enumerate the numa-mempolicy that avoids
unconvertible memory.

No, special casing in hotplug code paths needed.

>
> I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> blocks based on memblock, but not e820.  The pages managed by page allocator are
> from memblock anyway (w/o those from memory hotplug).
>
> And I also think it makes more sense to introduce 'tdx_memblock' and
> 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> can be created based on the 'struct tdx_memory' which contains all TDX memory
> blocks we gathered based on memblock during boot.  This is also more flexible to
> support other TDX memory from other sources such as CLX memory in the future.
>
> Please let me know if you have any objection?  Thanks!

It's already the case that x86 maintains sideband structures to
preserve memory after exiting the early memblock code. Mike, correct
me if I am wrong, but adding more is less desirable than just keeping
the memblock around?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-05 13:51                       ` Dan Williams
@ 2022-05-05 22:14                         ` Kai Huang
  2022-05-06  0:22                           ` Dan Williams
  2022-05-07  0:09                         ` Mike Rapoport
  1 sibling, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-05 22:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

Thanks for feedback!

On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> [ add Mike ]
> 
> 
> On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> [..]
> > 
> > Hi Dave,
> > 
> > Sorry to ping (trying to close this).
> > 
> > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > that no one should just use some random backend to run TD.
> 
> The platform will already do this, right? 
> 

In the current v3 implementation, we don't have any code to handle memory
hotplug, therefore nothing prevents people from adding legacy PMEMs as system
RAM using kmem driver.  In order to guarantee all pages managed by page
allocator are all TDX memory, the v3 implementation needs to always include
legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
RAM, we can still guarantee all pages in page allocator are TDX memory.

Of course, a side benefit of always including legacy PMEMs is people
theoretically can use them directly as TD guest backend, but this is just a
bonus but not something that we need to guarantee.


> I don't understand why this
> is trying to take proactive action versus documenting the error
> conditions and steps someone needs to take to avoid unconvertible
> memory. There is already the CONFIG_HMEM_REPORTING that describes
> relative performance properties between initiators and targets, it
> seems fitting to also add security properties between initiators and
> targets so someone can enumerate the numa-mempolicy that avoids
> unconvertible memory.

I don't think there's anything related to performance properties here.  The only
goal here is to make sure all pages in page allocator are TDX memory pages.

> 
> No, special casing in hotplug code paths needed.
> 
> > 
> > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > from memblock anyway (w/o those from memory hotplug).
> > 
> > And I also think it makes more sense to introduce 'tdx_memblock' and
> > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > blocks we gathered based on memblock during boot.  This is also more flexible to
> > support other TDX memory from other sources such as CLX memory in the future.
> > 
> > Please let me know if you have any objection?  Thanks!
> 
> It's already the case that x86 maintains sideband structures to
> preserve memory after exiting the early memblock code. 
> 

May I ask what data structures are you referring to?

Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve
memblock info during boot.  It is also used to provide a common data structure
that the "constructing TDMRs" code can work on.  If you look at patch 11-14, the
logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to
construct TDMRs doesn't have hard dependency on e820.  If we construct TDMRs
based on a common 'tdx_memory' like below:

	int construct_tdmrs(struct tdx_memory *tmem, ...);

It would be much easier to support other TDX memory resources in the future.

The thing I am not sure is Dave wants to keep the code minimal (as this series
is already very big in terms of LoC) to make TDX running, and for now in
practice there's only system RAM during boot is TDX capable, so I am not sure we
should introduce those structures now.

> Mike, correct
> me if I am wrong, but adding more is less desirable than just keeping
> the memblock around?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-05 22:14                         ` Kai Huang
@ 2022-05-06  0:22                           ` Dan Williams
  2022-05-06  0:45                             ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-05-06  0:22 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
>
> Thanks for feedback!
>
> On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > [ add Mike ]
> >
> >
> > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > [..]
> > >
> > > Hi Dave,
> > >
> > > Sorry to ping (trying to close this).
> > >
> > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > that no one should just use some random backend to run TD.
> >
> > The platform will already do this, right?
> >
>
> In the current v3 implementation, we don't have any code to handle memory
> hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> RAM using kmem driver.  In order to guarantee all pages managed by page

That's the fundamental question I am asking why is "guarantee all
pages managed by page allocator are TDX memory". That seems overkill
compared to indicating the incompatibility after the fact.

> allocator are all TDX memory, the v3 implementation needs to always include
> legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> RAM, we can still guarantee all pages in page allocator are TDX memory.

Why?

> Of course, a side benefit of always including legacy PMEMs is people
> theoretically can use them directly as TD guest backend, but this is just a
> bonus but not something that we need to guarantee.
>
>
> > I don't understand why this
> > is trying to take proactive action versus documenting the error
> > conditions and steps someone needs to take to avoid unconvertible
> > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > relative performance properties between initiators and targets, it
> > seems fitting to also add security properties between initiators and
> > targets so someone can enumerate the numa-mempolicy that avoids
> > unconvertible memory.
>
> I don't think there's anything related to performance properties here.  The only
> goal here is to make sure all pages in page allocator are TDX memory pages.

Please reconsider or re-clarify that goal.

>
> >
> > No, special casing in hotplug code paths needed.
> >
> > >
> > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > from memblock anyway (w/o those from memory hotplug).
> > >
> > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > support other TDX memory from other sources such as CLX memory in the future.
> > >
> > > Please let me know if you have any objection?  Thanks!
> >
> > It's already the case that x86 maintains sideband structures to
> > preserve memory after exiting the early memblock code.
> >
>
> May I ask what data structures are you referring to?

struct numa_meminfo.

> Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve
> memblock info during boot.  It is also used to provide a common data structure
> that the "constructing TDMRs" code can work on.  If you look at patch 11-14, the
> logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to
> construct TDMRs doesn't have hard dependency on e820.  If we construct TDMRs
> based on a common 'tdx_memory' like below:
>
>         int construct_tdmrs(struct tdx_memory *tmem, ...);
>
> It would be much easier to support other TDX memory resources in the future.

"in the future" is a prompt to ask "Why not wait until that future /
need arrives before adding new infrastructure?"

> The thing I am not sure is Dave wants to keep the code minimal (as this series
> is already very big in terms of LoC) to make TDX running, and for now in
> practice there's only system RAM during boot is TDX capable, so I am not sure we
> should introduce those structures now.
>
> > Mike, correct
> > me if I am wrong, but adding more is less desirable than just keeping
> > the memblock around?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-06  0:22                           ` Dan Williams
@ 2022-05-06  0:45                             ` Kai Huang
  2022-05-06  1:15                               ` Dan Williams
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-06  0:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > Thanks for feedback!
> > 
> > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > [ add Mike ]
> > > 
> > > 
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > [..]
> > > > 
> > > > Hi Dave,
> > > > 
> > > > Sorry to ping (trying to close this).
> > > > 
> > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > that no one should just use some random backend to run TD.
> > > 
> > > The platform will already do this, right?
> > > 
> > 
> > In the current v3 implementation, we don't have any code to handle memory
> > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > RAM using kmem driver.  In order to guarantee all pages managed by page
> 
> That's the fundamental question I am asking why is "guarantee all
> pages managed by page allocator are TDX memory". That seems overkill
> compared to indicating the incompatibility after the fact.

As I explained, the reason is I don't want to modify page allocator to
distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
and GFP_TDX.

KVM depends on host's page fault handler to allocate the page.  In fact KVM only
consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
guaranteeing all pages in page allocator is TDX memory, we can easily use
anonymous pages as TD guest memory.  This also allows us to easily extend the
shmem to support a new fd-based backend which doesn't require having to mmap()
TD guest memory to host userspace:

https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/

Also, besides TD guest memory, there are some per-TD control data structures
(which must be TDX memory too) need to be allocated for each TD.  Normal memory
allocation APIs can be used for such allocation if we guarantee all pages in
page allocator is TDX memory.

> 
> > allocator are all TDX memory, the v3 implementation needs to always include
> > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > RAM, we can still guarantee all pages in page allocator are TDX memory.
> 
> Why?

If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
system RAM using kmem driver, the assumption of "all pages in page allocator are
TDX memory" is broken.  A TD can be killed during runtime.

> 
> > Of course, a side benefit of always including legacy PMEMs is people
> > theoretically can use them directly as TD guest backend, but this is just a
> > bonus but not something that we need to guarantee.
> > 
> > 
> > > I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> > 
> > I don't think there's anything related to performance properties here.  The only
> > goal here is to make sure all pages in page allocator are TDX memory pages.
> 
> Please reconsider or re-clarify that goal.
> 
> > 
> > > 
> > > No, special casing in hotplug code paths needed.
> > > 
> > > > 
> > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > > from memblock anyway (w/o those from memory hotplug).
> > > > 
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > > support other TDX memory from other sources such as CLX memory in the future.
> > > > 
> > > > Please let me know if you have any objection?  Thanks!
> > > 
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code.
> > > 
> > 
> > May I ask what data structures are you referring to?
> 
> struct numa_meminfo.
> 
> > Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve
> > memblock info during boot.  It is also used to provide a common data structure
> > that the "constructing TDMRs" code can work on.  If you look at patch 11-14, the
> > logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to
> > construct TDMRs doesn't have hard dependency on e820.  If we construct TDMRs
> > based on a common 'tdx_memory' like below:
> > 
> >         int construct_tdmrs(struct tdx_memory *tmem, ...);
> > 
> > It would be much easier to support other TDX memory resources in the future.
> 
> "in the future" is a prompt to ask "Why not wait until that future /
> need arrives before adding new infrastructure?"

Fine to me.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-06  0:45                             ` Kai Huang
@ 2022-05-06  1:15                               ` Dan Williams
  2022-05-06  1:46                                 ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-05-06  1:15 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > >
> > > Thanks for feedback!
> > >
> > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > [ add Mike ]
> > > >
> > > >
> > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > [..]
> > > > >
> > > > > Hi Dave,
> > > > >
> > > > > Sorry to ping (trying to close this).
> > > > >
> > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > that no one should just use some random backend to run TD.
> > > >
> > > > The platform will already do this, right?
> > > >
> > >
> > > In the current v3 implementation, we don't have any code to handle memory
> > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > RAM using kmem driver.  In order to guarantee all pages managed by page
> >
> > That's the fundamental question I am asking why is "guarantee all
> > pages managed by page allocator are TDX memory". That seems overkill
> > compared to indicating the incompatibility after the fact.
>
> As I explained, the reason is I don't want to modify page allocator to
> distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> and GFP_TDX.

Right, TDX details do not belong at that level, but it will work
almost all the time if you do nothing to "guarantee" all TDX capable
pages all the time.

> KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> guaranteeing all pages in page allocator is TDX memory, we can easily use
> anonymous pages as TD guest memory.

Again, TDX capable pages will be the overwhelming default, why are you
worried about cluttering the memory hotplug path for nice corner
cases.

Consider the fact that end users can break the kernel by specifying
invalid memmap= command line options. The memory hotplug code does not
take any steps to add safety in those cases because there are already
too many ways it can go wrong. TDX is just one more corner case where
the memmap= user needs to be careful. Otherwise, it is up to the
platform firmware to make sure everything in the base memory map is
TDX capable, and then all you need is documentation about the failure
mode when extending "System RAM" beyond that baseline.

> shmem to support a new fd-based backend which doesn't require having to mmap()
> TD guest memory to host userspace:
>
> https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
>
> Also, besides TD guest memory, there are some per-TD control data structures
> (which must be TDX memory too) need to be allocated for each TD.  Normal memory
> allocation APIs can be used for such allocation if we guarantee all pages in
> page allocator is TDX memory.

You don't need that guarantee, just check it after the fact and fail
if that assertion fails. It should almost always be the case that it
succeeds and if it doesn't then something special is happening with
that system and the end user has effectively opt-ed out of TDX
operation.

> > > allocator are all TDX memory, the v3 implementation needs to always include
> > > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> >
> > Why?
>
> If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> system RAM using kmem driver, the assumption of "all pages in page allocator are
> TDX memory" is broken.  A TD can be killed during runtime.

Yes, that is what the end user asked for. If they don't want that to
happen then the policy decision about using kmem needs to be updated
in userspace, not hard code that policy decision towards TDX inside
the kernel.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-06  1:15                               ` Dan Williams
@ 2022-05-06  1:46                                 ` Kai Huang
  2022-05-06 15:57                                   ` Dan Williams
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-06  1:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > 
> > > > Thanks for feedback!
> > > > 
> > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > [ add Mike ]
> > > > > 
> > > > > 
> > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > > [..]
> > > > > > 
> > > > > > Hi Dave,
> > > > > > 
> > > > > > Sorry to ping (trying to close this).
> > > > > > 
> > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > > that no one should just use some random backend to run TD.
> > > > > 
> > > > > The platform will already do this, right?
> > > > > 
> > > > 
> > > > In the current v3 implementation, we don't have any code to handle memory
> > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > RAM using kmem driver.  In order to guarantee all pages managed by page
> > > 
> > > That's the fundamental question I am asking why is "guarantee all
> > > pages managed by page allocator are TDX memory". That seems overkill
> > > compared to indicating the incompatibility after the fact.
> > 
> > As I explained, the reason is I don't want to modify page allocator to
> > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > and GFP_TDX.
> 
> Right, TDX details do not belong at that level, but it will work
> almost all the time if you do nothing to "guarantee" all TDX capable
> pages all the time.

"almost all the time" do you mean?

> 
> > KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> > consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > anonymous pages as TD guest memory.
> 
> Again, TDX capable pages will be the overwhelming default, why are you
> worried about cluttering the memory hotplug path for nice corner
> cases.

Firstly perhaps I forgot to mention there are two concepts about TDX memory, so
let me clarify first:

1) Convertible Memory Regions (CMRs).  This is reported by BIOS (thus static) to
indicate which memory regions *can* be used as TDX memory.  This basically means
all RAM during boot for now.

2) TD Memory Regions (TDMRs).  Memory pages in CMRs are not automatically TDX
usable memory.  The TDX module needs to be configured which (convertible) memory
regions can be used as TDX memory.  Kernel is responsible for choosing the
ranges, and configure to the TDX module.  If a convertible memory page is not
included into TDMRs, the TDX module will report error when it is assigned to  a
TD.

> 
> Consider the fact that end users can break the kernel by specifying
> invalid memmap= command line options. The memory hotplug code does not
> take any steps to add safety in those cases because there are already
> too many ways it can go wrong. TDX is just one more corner case where
> the memmap= user needs to be careful. Otherwise, it is up to the
> platform firmware to make sure everything in the base memory map is
> TDX capable, and then all you need is documentation about the failure
> mode when extending "System RAM" beyond that baseline.

So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do
anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as
system RAM, a live TD may eventually be killed.

If such case is a corner case that we don't need to guarantee, then even better.
And we have an additional reason that those legacy PMEMs don't need to be in
TDMRs.  As you suggested,  we can add some documentation to point out.

But the point we want to do some code check and prevent memory hotplug is, as
Dave said, we want this piece of code to work on *ANY* TDX capable machines,
including future machines which may, i.e. supports NVDIMM/CLX memory as TDX
memory.  If we don't do any code check in  memory hotplug in this series, then
when this code runs in future platforms, user can plug NVDIMM or CLX memory as
system RAM thus break the assumption "all pages in page allocator are TDX
memory", which eventually leads to live TDs being killed potentially.

Dave said we need to guarantee this code can work on *ANY* TDX machines.  Some
documentation saying it only works one some platforms and you shouldn't do
things on other platforms are not good enough:

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471

> 
> 
> > shmem to support a new fd-based backend which doesn't require having to mmap()
> > TD guest memory to host userspace:
> > 
> > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
> > 
> > Also, besides TD guest memory, there are some per-TD control data structures
> > (which must be TDX memory too) need to be allocated for each TD.  Normal memory
> > allocation APIs can be used for such allocation if we guarantee all pages in
> > page allocator is TDX memory.
> 
> You don't need that guarantee, just check it after the fact and fail
> if that assertion fails. It should almost always be the case that it
> succeeds and if it doesn't then something special is happening with
> that system and the end user has effectively opt-ed out of TDX
> operation.

This doesn't guarantee consistent behaviour.  For instance, for one TD it can be
created, while the second may fail.  We should provide a consistent service.

The thing is anyway we need to configure some memory regions to the TDX module.
To me there's no reason we don't want to guarantee all pages in page allocator
are TDX memory. 

> 
> > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> > > 
> > > Why?
> > 
> > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > TDX memory" is broken.  A TD can be killed during runtime.
> 
> Yes, that is what the end user asked for. If they don't want that to
> happen then the policy decision about using kmem needs to be updated
> in userspace, not hard code that policy decision towards TDX inside
> the kernel.

This is also fine to me.  But please also see above Dave's comment.

Thanks for those valuable feedback!


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-06  1:46                                 ` Kai Huang
@ 2022-05-06 15:57                                   ` Dan Williams
  2022-05-09  2:46                                     ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Dan Williams @ 2022-05-06 15:57 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

On Thu, May 5, 2022 at 6:47 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
> > >
> > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > >
> > > > > Thanks for feedback!
> > > > >
> > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > > [ add Mike ]
> > > > > >
> > > > > >
> > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > > > [..]
> > > > > > >
> > > > > > > Hi Dave,
> > > > > > >
> > > > > > > Sorry to ping (trying to close this).
> > > > > > >
> > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > > > that no one should just use some random backend to run TD.
> > > > > >
> > > > > > The platform will already do this, right?
> > > > > >
> > > > >
> > > > > In the current v3 implementation, we don't have any code to handle memory
> > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > > RAM using kmem driver.  In order to guarantee all pages managed by page
> > > >
> > > > That's the fundamental question I am asking why is "guarantee all
> > > > pages managed by page allocator are TDX memory". That seems overkill
> > > > compared to indicating the incompatibility after the fact.
> > >
> > > As I explained, the reason is I don't want to modify page allocator to
> > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > > and GFP_TDX.
> >
> > Right, TDX details do not belong at that level, but it will work
> > almost all the time if you do nothing to "guarantee" all TDX capable
> > pages all the time.
>
> "almost all the time" do you mean?
>
> >
> > > KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> > > consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> > > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > > anonymous pages as TD guest memory.
> >
> > Again, TDX capable pages will be the overwhelming default, why are you
> > worried about cluttering the memory hotplug path for nice corner
> > cases.
>
> Firstly perhaps I forgot to mention there are two concepts about TDX memory, so
> let me clarify first:
>
> 1) Convertible Memory Regions (CMRs).  This is reported by BIOS (thus static) to
> indicate which memory regions *can* be used as TDX memory.  This basically means
> all RAM during boot for now.
>
> 2) TD Memory Regions (TDMRs).  Memory pages in CMRs are not automatically TDX
> usable memory.  The TDX module needs to be configured which (convertible) memory
> regions can be used as TDX memory.  Kernel is responsible for choosing the
> ranges, and configure to the TDX module.  If a convertible memory page is not
> included into TDMRs, the TDX module will report error when it is assigned to  a
> TD.
>
> >
> > Consider the fact that end users can break the kernel by specifying
> > invalid memmap= command line options. The memory hotplug code does not
> > take any steps to add safety in those cases because there are already
> > too many ways it can go wrong. TDX is just one more corner case where
> > the memmap= user needs to be careful. Otherwise, it is up to the
> > platform firmware to make sure everything in the base memory map is
> > TDX capable, and then all you need is documentation about the failure
> > mode when extending "System RAM" beyond that baseline.
>
> So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do
> anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as
> system RAM, a live TD may eventually be killed.
>
> If such case is a corner case that we don't need to guarantee, then even better.
> And we have an additional reason that those legacy PMEMs don't need to be in
> TDMRs.  As you suggested,  we can add some documentation to point out.
>
> But the point we want to do some code check and prevent memory hotplug is, as
> Dave said, we want this piece of code to work on *ANY* TDX capable machines,
> including future machines which may, i.e. supports NVDIMM/CLX memory as TDX
> memory.  If we don't do any code check in  memory hotplug in this series, then
> when this code runs in future platforms, user can plug NVDIMM or CLX memory as
> system RAM thus break the assumption "all pages in page allocator are TDX
> memory", which eventually leads to live TDs being killed potentially.
>
> Dave said we need to guarantee this code can work on *ANY* TDX machines.  Some
> documentation saying it only works one some platforms and you shouldn't do
> things on other platforms are not good enough:
>
> https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471

Yes, the incompatible cases cannot be ignored, but I disagree that
they actively need to be prevented. One way to achieve that is to
explicitly enumerate TDX capable memory and document how mempolicy can
be used to avoid killing TDs.

> > > shmem to support a new fd-based backend which doesn't require having to mmap()
> > > TD guest memory to host userspace:
> > >
> > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
> > >
> > > Also, besides TD guest memory, there are some per-TD control data structures
> > > (which must be TDX memory too) need to be allocated for each TD.  Normal memory
> > > allocation APIs can be used for such allocation if we guarantee all pages in
> > > page allocator is TDX memory.
> >
> > You don't need that guarantee, just check it after the fact and fail
> > if that assertion fails. It should almost always be the case that it
> > succeeds and if it doesn't then something special is happening with
> > that system and the end user has effectively opt-ed out of TDX
> > operation.
>
> This doesn't guarantee consistent behaviour.  For instance, for one TD it can be
> created, while the second may fail.  We should provide a consistent service.

Yes, there needs to be enumeration and policy knobs to avoid failures,
hard coded "no memory hotplug" hacks do not seem the right enumeration
and policy knobs to me.

> The thing is anyway we need to configure some memory regions to the TDX module.
> To me there's no reason we don't want to guarantee all pages in page allocator
> are TDX memory.
>
> >
> > > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > > > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> > > >
> > > > Why?
> > >
> > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > > TDX memory" is broken.  A TD can be killed during runtime.
> >
> > Yes, that is what the end user asked for. If they don't want that to
> > happen then the policy decision about using kmem needs to be updated
> > in userspace, not hard code that policy decision towards TDX inside
> > the kernel.
>
> This is also fine to me.  But please also see above Dave's comment.

Dave is right, the implementation can not just ignore the conflict. To
me, enumeration plus error reporting allows for flexibility without
hard coding policy in the kernel.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-05 13:51                       ` Dan Williams
  2022-05-05 22:14                         ` Kai Huang
@ 2022-05-07  0:09                         ` Mike Rapoport
  2022-05-08 10:00                           ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Mike Rapoport @ 2022-05-07  0:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kai Huang, Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> [ add Mike ]
> 
> On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> [..]
> >
> > Hi Dave,
> >
> > Sorry to ping (trying to close this).
> >
> > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > that no one should just use some random backend to run TD.
> 
> The platform will already do this, right? I don't understand why this
> is trying to take proactive action versus documenting the error
> conditions and steps someone needs to take to avoid unconvertible
> memory. There is already the CONFIG_HMEM_REPORTING that describes
> relative performance properties between initiators and targets, it
> seems fitting to also add security properties between initiators and
> targets so someone can enumerate the numa-mempolicy that avoids
> unconvertible memory.
> 
> No, special casing in hotplug code paths needed.
> 
> >
> > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > from memblock anyway (w/o those from memory hotplug).
> >
> > And I also think it makes more sense to introduce 'tdx_memblock' and
> > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > blocks we gathered based on memblock during boot.  This is also more flexible to
> > support other TDX memory from other sources such as CLX memory in the future.
> >
> > Please let me know if you have any objection?  Thanks!
> 
> It's already the case that x86 maintains sideband structures to
> preserve memory after exiting the early memblock code. Mike, correct
> me if I am wrong, but adding more is less desirable than just keeping
> the memblock around?

TBH, I didn't read the entire thread yet, but at the first glance, keeping
memblock around is much more preferable that adding yet another { .start,
.end, .flags } data structure. To keep memblock after boot all is needed is
something like

	select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST

I'll take a closer look next week on the entire series, maybe I'm missing
some details.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-07  0:09                         ` Mike Rapoport
@ 2022-05-08 10:00                           ` Kai Huang
  2022-05-09 10:33                             ` Mike Rapoport
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-08 10:00 UTC (permalink / raw)
  To: Mike Rapoport, Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote:
> On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> > [ add Mike ]
> > 
> > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > [..]
> > > 
> > > Hi Dave,
> > > 
> > > Sorry to ping (trying to close this).
> > > 
> > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > that no one should just use some random backend to run TD.
> > 
> > The platform will already do this, right? I don't understand why this
> > is trying to take proactive action versus documenting the error
> > conditions and steps someone needs to take to avoid unconvertible
> > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > relative performance properties between initiators and targets, it
> > seems fitting to also add security properties between initiators and
> > targets so someone can enumerate the numa-mempolicy that avoids
> > unconvertible memory.
> > 
> > No, special casing in hotplug code paths needed.
> > 
> > > 
> > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > from memblock anyway (w/o those from memory hotplug).
> > > 
> > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > support other TDX memory from other sources such as CLX memory in the future.
> > > 
> > > Please let me know if you have any objection?  Thanks!
> > 
> > It's already the case that x86 maintains sideband structures to
> > preserve memory after exiting the early memblock code. Mike, correct
> > me if I am wrong, but adding more is less desirable than just keeping
> > the memblock around?
> 
> TBH, I didn't read the entire thread yet, but at the first glance, keeping
> memblock around is much more preferable that adding yet another { .start,
> .end, .flags } data structure. To keep memblock after boot all is needed is
> something like
> 
> 	select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
> 
> I'll take a closer look next week on the entire series, maybe I'm missing
> some details.
> 

Hi Mike,

Thanks for feedback.

Perhaps I haven't put a lot details of the new TDX data structures, so let me
point out that the new two data structures 'struct tdx_memblock' and 'struct
tdx_memory' that I am proposing are mostly supposed to be used by TDX code only,
which is pretty standalone.  They are not supposed to be some basic
infrastructure that can be widely used by other random kernel components. 

In fact, currently the only operation we need is to allow memblock to register
all memory regions as TDX memory blocks when the memblock is still alive. 
Therefore, in fact, the new data structures can even be completely invisible to
other kernel components.  For instance, TDX code can provide below API w/o
exposing any data structures to other kernel components:

int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);

And we call above API for each memory region in memblock when it is alive.

TDX code internally manages those memory regions via the new data structures
that I mentioned above, so we don't need to keep memblock after boot.  The
advantage of this approach is it is more flexible to support other potential TDX
memory resources (such as CLX memory) in the future.

Otherwise, we can do as you suggested to select ARCH_KEEP_MEMBLOCK when
INTEL_TDX_HOST is on and TDX code internally uses memblock API directly.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-06 15:57                                   ` Dan Williams
@ 2022-05-09  2:46                                     ` Kai Huang
  2022-05-10 10:25                                       ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-09  2:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

On Fri, 2022-05-06 at 08:57 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 6:47 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> > > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > 
> > > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > > > 
> > > > > > Thanks for feedback!
> > > > > > 
> > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > > > [ add Mike ]
> > > > > > > 
> > > > > > > 
> > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > > > > [..]
> > > > > > > > 
> > > > > > > > Hi Dave,
> > > > > > > > 
> > > > > > > > Sorry to ping (trying to close this).
> > > > > > > > 
> > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > > > > that no one should just use some random backend to run TD.
> > > > > > > 
> > > > > > > The platform will already do this, right?
> > > > > > > 
> > > > > > 
> > > > > > In the current v3 implementation, we don't have any code to handle memory
> > > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > > > RAM using kmem driver.  In order to guarantee all pages managed by page
> > > > > 
> > > > > That's the fundamental question I am asking why is "guarantee all
> > > > > pages managed by page allocator are TDX memory". That seems overkill
> > > > > compared to indicating the incompatibility after the fact.
> > > > 
> > > > As I explained, the reason is I don't want to modify page allocator to
> > > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > > > and GFP_TDX.
> > > 
> > > Right, TDX details do not belong at that level, but it will work
> > > almost all the time if you do nothing to "guarantee" all TDX capable
> > > pages all the time.
> > 
> > "almost all the time" do you mean?
> > 
> > > 
> > > > KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> > > > consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> > > > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > > > anonymous pages as TD guest memory.
> > > 
> > > Again, TDX capable pages will be the overwhelming default, why are you
> > > worried about cluttering the memory hotplug path for nice corner
> > > cases.
> > 
> > Firstly perhaps I forgot to mention there are two concepts about TDX memory, so
> > let me clarify first:
> > 
> > 1) Convertible Memory Regions (CMRs).  This is reported by BIOS (thus static) to
> > indicate which memory regions *can* be used as TDX memory.  This basically means
> > all RAM during boot for now.
> > 
> > 2) TD Memory Regions (TDMRs).  Memory pages in CMRs are not automatically TDX
> > usable memory.  The TDX module needs to be configured which (convertible) memory
> > regions can be used as TDX memory.  Kernel is responsible for choosing the
> > ranges, and configure to the TDX module.  If a convertible memory page is not
> > included into TDMRs, the TDX module will report error when it is assigned to  a
> > TD.
> > 
> > > 
> > > Consider the fact that end users can break the kernel by specifying
> > > invalid memmap= command line options. The memory hotplug code does not
> > > take any steps to add safety in those cases because there are already
> > > too many ways it can go wrong. TDX is just one more corner case where
> > > the memmap= user needs to be careful. Otherwise, it is up to the
> > > platform firmware to make sure everything in the base memory map is
> > > TDX capable, and then all you need is documentation about the failure
> > > mode when extending "System RAM" beyond that baseline.
> > 
> > So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do
> > anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as
> > system RAM, a live TD may eventually be killed.
> > 
> > If such case is a corner case that we don't need to guarantee, then even better.
> > And we have an additional reason that those legacy PMEMs don't need to be in
> > TDMRs.  As you suggested,  we can add some documentation to point out.
> > 
> > But the point we want to do some code check and prevent memory hotplug is, as
> > Dave said, we want this piece of code to work on *ANY* TDX capable machines,
> > including future machines which may, i.e. supports NVDIMM/CLX memory as TDX
> > memory.  If we don't do any code check in  memory hotplug in this series, then
> > when this code runs in future platforms, user can plug NVDIMM or CLX memory as
> > system RAM thus break the assumption "all pages in page allocator are TDX
> > memory", which eventually leads to live TDs being killed potentially.
> > 
> > Dave said we need to guarantee this code can work on *ANY* TDX machines.  Some
> > documentation saying it only works one some platforms and you shouldn't do
> > things on other platforms are not good enough:
> > 
> > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
> 
> Yes, the incompatible cases cannot be ignored, but I disagree that
> they actively need to be prevented. One way to achieve that is to
> explicitly enumerate TDX capable memory and document how mempolicy can
> be used to avoid killing TDs.

Hi Dan,

Thanks for feedback.

Could you elaborate what does "explicitly enumerate TDX capable memory" mean? 
How to enumerate exactly?

And for "document how mempolicy can be used to avoid killing TDs", what
mempolicy (and error reporting you mentioned below) are you referring to?

I skipped to reply your below your two replies as I think they are referring to
the same "enumerate" and "mempolicy" that I am asking above.

> 
> > > > shmem to support a new fd-based backend which doesn't require having to mmap()
> > > > TD guest memory to host userspace:
> > > > 
> > > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
> > > > 
> > > > Also, besides TD guest memory, there are some per-TD control data structures
> > > > (which must be TDX memory too) need to be allocated for each TD.  Normal memory
> > > > allocation APIs can be used for such allocation if we guarantee all pages in
> > > > page allocator is TDX memory.
> > > 
> > > You don't need that guarantee, just check it after the fact and fail
> > > if that assertion fails. It should almost always be the case that it
> > > succeeds and if it doesn't then something special is happening with
> > > that system and the end user has effectively opt-ed out of TDX
> > > operation.
> > 
> > This doesn't guarantee consistent behaviour.  For instance, for one TD it can be
> > created, while the second may fail.  We should provide a consistent service.
> 
> Yes, there needs to be enumeration and policy knobs to avoid failures,
> hard coded "no memory hotplug" hacks do not seem the right enumeration
> and policy knobs to me.
> 
> > The thing is anyway we need to configure some memory regions to the TDX module.
> > To me there's no reason we don't want to guarantee all pages in page allocator
> > are TDX memory.
> > 
> > > 
> > > > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > > > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > > > > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> > > > > 
> > > > > Why?
> > > > 
> > > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > > > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > > > TDX memory" is broken.  A TD can be killed during runtime.
> > > 
> > > Yes, that is what the end user asked for. If they don't want that to
> > > happen then the policy decision about using kmem needs to be updated
> > > in userspace, not hard code that policy decision towards TDX inside
> > > the kernel.
> > 
> > This is also fine to me.  But please also see above Dave's comment.
> 
> Dave is right, the implementation can not just ignore the conflict. To
> me, enumeration plus error reporting allows for flexibility without
> hard coding policy in the kernel.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-08 10:00                           ` Kai Huang
@ 2022-05-09 10:33                             ` Mike Rapoport
  2022-05-09 23:27                               ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Mike Rapoport @ 2022-05-09 10:33 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dan Williams, Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Sun, May 08, 2022 at 10:00:39PM +1200, Kai Huang wrote:
> On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote:
> > On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> > > [ add Mike ]
> > > 
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > [..]
> > > > 
> > > > Hi Dave,
> > > > 
> > > > Sorry to ping (trying to close this).
> > > > 
> > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > that no one should just use some random backend to run TD.
> > > 
> > > The platform will already do this, right? I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> > > 
> > > No, special casing in hotplug code paths needed.
> > > 
> > > > 
> > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > > from memblock anyway (w/o those from memory hotplug).
> > > > 
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > > support other TDX memory from other sources such as CLX memory in the future.
> > > > 
> > > > Please let me know if you have any objection?  Thanks!
> > > 
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code. Mike, correct
> > > me if I am wrong, but adding more is less desirable than just keeping
> > > the memblock around?
> > 
> > TBH, I didn't read the entire thread yet, but at the first glance, keeping
> > memblock around is much more preferable that adding yet another { .start,
> > .end, .flags } data structure. To keep memblock after boot all is needed is
> > something like
> > 
> > 	select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
> > 
> > I'll take a closer look next week on the entire series, maybe I'm missing
> > some details.
> > 
> 
> Hi Mike,
> 
> Thanks for feedback.
> 
> Perhaps I haven't put a lot details of the new TDX data structures, so let me
> point out that the new two data structures 'struct tdx_memblock' and 'struct
> tdx_memory' that I am proposing are mostly supposed to be used by TDX code only,
> which is pretty standalone.  They are not supposed to be some basic
> infrastructure that can be widely used by other random kernel components. 

We already have "pretty standalone" numa_meminfo that originally was used
to setup NUMA memory topology, but now it's used by other code as well.
And e820 tables also contain similar data and they are supposedly should be
used only at boot time, but in reality there are too much callbacks into
e820 way after the system is booted.

So any additional memory representation will only add to the overall
complexity and well have even more "eventually consistent" collections of 
{ .start, .end, .flags } structures.
 
> In fact, currently the only operation we need is to allow memblock to register
> all memory regions as TDX memory blocks when the memblock is still alive. 
> Therefore, in fact, the new data structures can even be completely invisible to
> other kernel components.  For instance, TDX code can provide below API w/o
> exposing any data structures to other kernel components:
> 
> int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
> 
> And we call above API for each memory region in memblock when it is alive.
> 
> TDX code internally manages those memory regions via the new data structures
> that I mentioned above, so we don't need to keep memblock after boot.  The
> advantage of this approach is it is more flexible to support other potential TDX
> memory resources (such as CLX memory) in the future.

Please let keep things simple. If other TDX memory resources will need
different handling it can be implemented then. For now, just enable
ARCH_KEEP_MEMBLOCK and use memblock to track TDX memory.
 
> Otherwise, we can do as you suggested to select ARCH_KEEP_MEMBLOCK when
> INTEL_TDX_HOST is on and TDX code internally uses memblock API directly.
> 
> -- 
> Thanks,
> -Kai
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-09 10:33                             ` Mike Rapoport
@ 2022-05-09 23:27                               ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-09 23:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dan Williams, Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata

> > 
> > Hi Mike,
> > 
> > Thanks for feedback.
> > 
> > Perhaps I haven't put a lot details of the new TDX data structures, so let me
> > point out that the new two data structures 'struct tdx_memblock' and 'struct
> > tdx_memory' that I am proposing are mostly supposed to be used by TDX code only,
> > which is pretty standalone.  They are not supposed to be some basic
> > infrastructure that can be widely used by other random kernel components. 
> 
> We already have "pretty standalone" numa_meminfo that originally was used
> to setup NUMA memory topology, but now it's used by other code as well.
> And e820 tables also contain similar data and they are supposedly should be
> used only at boot time, but in reality there are too much callbacks into
> e820 way after the system is booted.
> 
> So any additional memory representation will only add to the overall
> complexity and well have even more "eventually consistent" collections of 
> { .start, .end, .flags } structures.
>  
> > In fact, currently the only operation we need is to allow memblock to register
> > all memory regions as TDX memory blocks when the memblock is still alive. 
> > Therefore, in fact, the new data structures can even be completely invisible to
> > other kernel components.  For instance, TDX code can provide below API w/o
> > exposing any data structures to other kernel components:
> > 
> > int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
> > 
> > And we call above API for each memory region in memblock when it is alive.
> > 
> > TDX code internally manages those memory regions via the new data structures
> > that I mentioned above, so we don't need to keep memblock after boot.  The
> > advantage of this approach is it is more flexible to support other potential TDX
> > memory resources (such as CLX memory) in the future.
> 
> Please let keep things simple. If other TDX memory resources will need
> different handling it can be implemented then. For now, just enable
> ARCH_KEEP_MEMBLOCK and use memblock to track TDX memory.
>  

Looks good to me.  Thanks for the feedback.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 00/21] TDX host kernel support
  2022-05-09  2:46                                     ` Kai Huang
@ 2022-05-10 10:25                                       ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-10 10:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Linux Kernel Mailing List, KVM list,
	Sean Christopherson, Paolo Bonzini, Brown, Len, Luck, Tony,
	Rafael J Wysocki, Reinette Chatre, Peter Zijlstra, Andi Kleen,
	Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Isaku Yamahata,
	Mike Rapoport

> > > 
> > > > 
> > > > Consider the fact that end users can break the kernel by specifying
> > > > invalid memmap= command line options. The memory hotplug code does not
> > > > take any steps to add safety in those cases because there are already
> > > > too many ways it can go wrong. TDX is just one more corner case where
> > > > the memmap= user needs to be careful. Otherwise, it is up to the
> > > > platform firmware to make sure everything in the base memory map is
> > > > TDX capable, and then all you need is documentation about the failure
> > > > mode when extending "System RAM" beyond that baseline.
> > > 
> > > So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do
> > > anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as
> > > system RAM, a live TD may eventually be killed.
> > > 
> > > If such case is a corner case that we don't need to guarantee, then even better.
> > > And we have an additional reason that those legacy PMEMs don't need to be in
> > > TDMRs.  As you suggested,  we can add some documentation to point out.
> > > 
> > > But the point we want to do some code check and prevent memory hotplug is, as
> > > Dave said, we want this piece of code to work on *ANY* TDX capable machines,
> > > including future machines which may, i.e. supports NVDIMM/CLX memory as TDX
> > > memory.  If we don't do any code check in  memory hotplug in this series, then
> > > when this code runs in future platforms, user can plug NVDIMM or CLX memory as
> > > system RAM thus break the assumption "all pages in page allocator are TDX
> > > memory", which eventually leads to live TDs being killed potentially.
> > > 
> > > Dave said we need to guarantee this code can work on *ANY* TDX machines.  Some
> > > documentation saying it only works one some platforms and you shouldn't do
> > > things on other platforms are not good enough:
> > > 
> > > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
> > 
> > Yes, the incompatible cases cannot be ignored, but I disagree that
> > they actively need to be prevented. One way to achieve that is to
> > explicitly enumerate TDX capable memory and document how mempolicy can
> > be used to avoid killing TDs.
> 
> Hi Dan,
> 
> Thanks for feedback.
> 
> Could you elaborate what does "explicitly enumerate TDX capable memory" mean? 
> How to enumerate exactly?
> 
> And for "document how mempolicy can be used to avoid killing TDs", what
> mempolicy (and error reporting you mentioned below) are you referring to?
> 
> I skipped to reply your below your two replies as I think they are referring to
> the same "enumerate" and "mempolicy" that I am asking above.
> 
> 

Hi Dan,

I guess "explicitly enumerate TDX capable memory" means getting the Convertible
Memory Regions (CMR).  And "document how mempolicy can be used to avoid killing
TDs" means we say something like below in the documentation?

	Any non TDX capable memory hot-add will result in non TDX capable pages
	being potentially allocated to a TD, in which case a TD may fail to be
	created or a live TD may be killed at runtime.

And "error reporting" do you mean in memory hot-add code path, we check whether
the new memory resource is TDX capable, if not we print some error similar to
above message in documentation, but still allow the memory hot-add to happen?

Something like below in add_memory_resource()?

	if (platform_has_tdx() && new memory resource NOT in CMRs)
		pr_err("Hot-add non-TDX memory on TDX capable system. TD may
			fail to be created, or a live TD may be killed during
			runtime.\n");

	// allow memory hot-add anyway


I have below concerns of this approach:

1) I think we should provide a consistent service to user, that is, we either to
guarantee that TD won't be failed to be created randomly and a running TD won't
be killed during runtime, or we don't provide any TDX functionality at all.  So
I am not sure only "document how mempolicy can be use to avoid killing TDs" is
good enough.

2) Above code to check whether a new memory resource is in CMRs or not requires
the kernel to get CMRs during kernel boot.  However getting CMRs requires
calling SEAMCALL which requires kernel to support VMXON/VMXOFF.  VMXON/VMXOFF is
currently only handled by KVM.  We'd like to avoid adding VMXON/VMXOFF to core-
kernel now if not mandatory, as eventually we will very likely need to have a
reference-based approach to call VMXON/VMXOFF.  This part is explained in the
cover letter in this series.

Dave suggested for now to keep things simple, we can use "winner take all"
approach:  If TDX is initialized first, don't allow memory hotplug. If memory
hotplug happens first, don't allow TDX to be initialized.

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#mfa6b5dcc536d8a7b78522f46ccd1230f84d52ae0

I think this is perhaps more reasonable as we are at least providing some
consistent service to user.  And in this approach we don't need to handle
VMXON/VMXOFF in core-kernel.

Comments?


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-04-27  0:06     ` Kai Huang
@ 2022-05-18 16:19       ` Sagi Shahar
  2022-05-18 23:51         ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Sagi Shahar @ 2022-05-18 16:19 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, linux-kernel, kvm, Sean Christopherson,
	Paolo Bonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, Yamahata, Isaku

On Tue, Apr 26, 2022 at 5:06 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Tue, 2022-04-26 at 13:59 -0700, Dave Hansen wrote:
> > On 4/5/22 21:49, Kai Huang wrote:
> > > TDX supports shutting down the TDX module at any time during its
> > > lifetime.  After TDX module is shut down, no further SEAMCALL can be
> > > made on any logical cpu.
> >
> > Is this strictly true?
> >
> > I thought SEAMCALLs were used for the P-SEAMLDR too.
>
> Sorry will change to no TDX module SEAMCALL can be made on any logical cpu.
>
> [...]
>
> > >
> > > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > > +struct seamcall_ctx {
> > > +   u64 fn;
> > > +   u64 rcx;
> > > +   u64 rdx;
> > > +   u64 r8;
> > > +   u64 r9;
> > > +   atomic_t err;
> > > +   u64 seamcall_ret;
> > > +   struct tdx_module_output out;
> > > +};
> > > +
> > > +static void seamcall_smp_call_function(void *data)
> > > +{
> > > +   struct seamcall_ctx *sc = data;
> > > +   int ret;
> > > +
> > > +   ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > > +                   &sc->seamcall_ret, &sc->out);

Are the seamcall_ret and out fields in seamcall_ctx going to be used?
Right now it looks like no one is going to read them.
If they are going to be used then this is going to cause a race since
the different CPUs are going to write concurrently to the same address
inside seamcall().
We should either use local memory and write using atomic_set like the
case for the err field or hard code NULL at the call site if they are
not going to be used.

> > > +   if (ret)
> > > +           atomic_set(&sc->err, ret);
> > > +}
> > > +
> > > +/*
> > > + * Call the SEAMCALL on all online cpus concurrently.
> > > + * Return error if SEAMCALL fails on any cpu.
> > > + */
> > > +static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
> > > +{
> > > +   on_each_cpu(seamcall_smp_call_function, sc, true);
> > > +   return atomic_read(&sc->err);
> > > +}
> >
> > Why bother returning something that's not read?
>
> It's not needed.  I'll make it void.
>
> Caller can check seamcall_ctx::err directly if they want to know whether any
> error happened.
>
>
>
> --
> Thanks,
> -Kai
>
>

Sagi

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-28  0:15     ` Kai Huang
  2022-04-28 14:06       ` Dave Hansen
@ 2022-05-18 22:30       ` Sagi Shahar
  2022-05-18 23:56         ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Sagi Shahar @ 2022-05-18 22:30 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, linux-kernel, kvm, Sean Christopherson,
	Paolo Bonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, Yamahata, Isaku

On Wed, Apr 27, 2022 at 5:15 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Wed, 2022-04-27 at 15:15 -0700, Dave Hansen wrote:
> > On 4/5/22 21:49, Kai Huang wrote:
> > > TDX provides increased levels of memory confidentiality and integrity.
> > > This requires special hardware support for features like memory
> > > encryption and storage of memory integrity checksums.  Not all memory
> > > satisfies these requirements.
> > >
> > > As a result, TDX introduced the concept of a "Convertible Memory Region"
> > > (CMR).  During boot, the firmware builds a list of all of the memory
> > > ranges which can provide the TDX security guarantees.  The list of these
> > > ranges, along with TDX module information, is available to the kernel by
> > > querying the TDX module via TDH.SYS.INFO SEAMCALL.
> > >
> > > Host kernel can choose whether or not to use all convertible memory
> > > regions as TDX memory.  Before TDX module is ready to create any TD
> > > guests, all TDX memory regions that host kernel intends to use must be
> > > configured to the TDX module, using specific data structures defined by
> > > TDX architecture.  Constructing those structures requires information of
> > > both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
> > > to get this information as preparation to construct those structures.
> > >
> > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > ---
> > >  arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
> > >  arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
> > >  2 files changed, 192 insertions(+)
> > >
> > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > index ef2718423f0f..482e6d858181 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > @@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
> > >
> > >  static struct p_seamldr_info p_seamldr_info;
> > >
> > > +/* Base address of CMR array needs to be 512 bytes aligned. */
> > > +static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> > > +static int tdx_cmr_num;
> > > +static struct tdsysinfo_struct tdx_sysinfo;
> >
> > I really dislike mixing hardware and software structures.  Please make
> > it clear which of these are fully software-defined and which are part of
> > the hardware ABI.
>
> Both 'struct tdsysinfo_struct' and 'struct cmr_info' are hardware structures.
> They are defined in tdx.h which has a comment saying the data structures below
> this comment is hardware structures:
>
>         +/*
>         + * TDX architectural data structures
>         + */
>
> It is introduced in the P-SEAMLDR patch.
>
> Should I explicitly add comments around the variables saying they are used by
> hardware, something like:
>
>         /*
>          * Data structures used by TDH.SYS.INFO SEAMCALL to return CMRs and
>          * TDX module system information.
>          */
>
> ?
>
> >
> > >  static bool __seamrr_enabled(void)
> > >  {
> > >     return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
> > > @@ -468,6 +473,127 @@ static int tdx_module_init_cpus(void)
> > >     return seamcall_on_each_cpu(&sc);
> > >  }
> > >
> > > +static inline bool cmr_valid(struct cmr_info *cmr)
> > > +{
> > > +   return !!cmr->size;
> > > +}
> > > +
> > > +static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
> > > +                  const char *name)
> > > +{
> > > +   int i;
> > > +
> > > +   for (i = 0; i < cmr_num; i++) {
> > > +           struct cmr_info *cmr = &cmr_array[i];
> > > +
> > > +           pr_info("%s : [0x%llx, 0x%llx)\n", name,
> > > +                           cmr->base, cmr->base + cmr->size);
> > > +   }
> > > +}
> > > +
> > > +static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
> > > +{
> > > +   int i, j;
> > > +
> > > +   /*
> > > +    * Intel TDX module spec, 20.7.3 CMR_INFO:
> > > +    *
> > > +    *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
> > > +    *   array of CMR_INFO entries. The CMRs are sorted from the
> > > +    *   lowest base address to the highest base address, and they
> > > +    *   are non-overlapping.
> > > +    *
> > > +    * This implies that BIOS may generate invalid empty entries
> > > +    * if total CMRs are less than 32.  Skip them manually.
> > > +    */
> > > +   for (i = 0; i < cmr_num; i++) {
> > > +           struct cmr_info *cmr = &cmr_array[i];
> > > +           struct cmr_info *prev_cmr = NULL;
> > > +
> > > +           /* Skip further invalid CMRs */
> > > +           if (!cmr_valid(cmr))
> > > +                   break;
> > > +
> > > +           if (i > 0)
> > > +                   prev_cmr = &cmr_array[i - 1];
> > > +
> > > +           /*
> > > +            * It is a TDX firmware bug if CMRs are not
> > > +            * in address ascending order.
> > > +            */
> > > +           if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
> > > +                                   cmr->base)) {
> > > +                   pr_err("Firmware bug: CMRs not in address ascending order.\n");
> > > +                   return -EFAULT;
> >
> > -EFAULT is a really weird return code to use for this.  I'd use -EINVAL.
>
> OK thanks.
>
> >
> > > +           }
> > > +   }
> > > +
> > > +   /*
> > > +    * Also a sane BIOS should never generate invalid CMR(s) between
> > > +    * two valid CMRs.  Sanity check this and simply return error in
> > > +    * this case.
> > > +    *
> > > +    * By reaching here @i is the index of the first invalid CMR (or
> > > +    * cmr_num).  Starting with next entry of @i since it has already
> > > +    * been checked.
> > > +    */
> > > +   for (j = i + 1; j < cmr_num; j++)
> > > +           if (cmr_valid(&cmr_array[j])) {
> > > +                   pr_err("Firmware bug: invalid CMR(s) among valid CMRs.\n");
> > > +                   return -EFAULT;
> > > +           }
> >
> > Please add brackets for the for().
>
> OK.
>
> >
> > > +   /*
> > > +    * Trim all tail invalid empty CMRs.  BIOS should generate at
> > > +    * least one valid CMR, otherwise it's a TDX firmware bug.
> > > +    */
> > > +   tdx_cmr_num = i;
> > > +   if (!tdx_cmr_num) {
> > > +           pr_err("Firmware bug: No valid CMR.\n");
> > > +           return -EFAULT;
> > > +   }
> > > +
> > > +   /* Print kernel sanitized CMRs */
> > > +   print_cmrs(tdx_cmr_array, tdx_cmr_num, "Kernel-sanitized-CMR");
> > > +
> > > +   return 0;
> > > +}
> > > +
> > > +static int tdx_get_sysinfo(void)
> > > +{
> > > +   struct tdx_module_output out;
> > > +   u64 tdsysinfo_sz, cmr_num;
> > > +   int ret;
> > > +
> > > +   BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
> > > +
> > > +   ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
> > > +                   __pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
> > > +   if (ret)
> > > +           return ret;
> > > +
> > > +   /*
> > > +    * If TDH.SYS.CONFIG succeeds, RDX contains the actual bytes
> > > +    * written to @tdx_sysinfo and R9 contains the actual entries
> > > +    * written to @tdx_cmr_array.  Sanity check them.
> > > +    */
> > > +   tdsysinfo_sz = out.rdx;
> > > +   cmr_num = out.r9;
> >
> > Please vertically align things like this:
> >
> >       tdsysinfo_sz = out.rdx;
> >       cmr_num      = out.r9;
>
> OK.
>
> >
> > > +   if (WARN_ON_ONCE((tdsysinfo_sz > sizeof(tdx_sysinfo)) || !tdsysinfo_sz ||
> > > +                           (cmr_num > MAX_CMRS) || !cmr_num))
> > > +           return -EFAULT;
> >
> > Sanity checking is good, but this makes me wonder how much is too much.
> >  I don't see a lot of code for instance checking if sys_write() writes
> > more than how much it was supposed to.
> >
> > Why are these sanity checks necessary here?  Is the TDX module expected
> > to be *THAT* buggy?  The thing that's providing, oh, basically all of
> > the security guarantees of this architecture.  It's overflowing the
> > buffers you hand it?
>
> I think this check can be removed.  Will remove.
>
> >
> > > +   pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> > > +           tdx_sysinfo.vendor_id, tdx_sysinfo.major_version,
> > > +           tdx_sysinfo.minor_version, tdx_sysinfo.build_date,
> > > +           tdx_sysinfo.build_num);
> > > +
> > > +   /* Print BIOS provided CMRs */
> > > +   print_cmrs(tdx_cmr_array, cmr_num, "BIOS-CMR");
> > > +

sanitize_cmrs already prints the cmrs in case of success. So for valid
cmrs we are going to print them twice.
Would it be better to only print cmrs here in case sanitize_cmrs fails?

> > > +   return sanitize_cmrs(tdx_cmr_array, cmr_num);
> > > +}
> >
> > Does sanitize_cmrs() sanitize anything?  It looks to me like it *checks*
> > the CMRs.  But, sanitizing is an active operation that writes to the
> > data being sanitized.  This looks read-only to me.  check_cmrs() would
> > be a better name for a passive check.
>
> Sure will change to check_cmrs().
>
> >
> > >  static int init_tdx_module(void)
> > >  {
> > >     int ret;
> > > @@ -482,6 +608,11 @@ static int init_tdx_module(void)
> > >     if (ret)
> > >             goto out;
> > >
> > > +   /* Get TDX module information and CMRs */
> > > +   ret = tdx_get_sysinfo();
> > > +   if (ret)
> > > +           goto out;
> >
> > Couldn't we get rid of that comment if you did something like:
> >
> >       ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
>
> Yes will do.
>
> >
> > and preferably make the variables function-local.
>
> 'tdx_sysinfo' will be used by KVM too.
>
>
>
> --
> Thanks,
> -Kai
>
>

Sagi

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
  2022-05-18 16:19       ` Sagi Shahar
@ 2022-05-18 23:51         ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-18 23:51 UTC (permalink / raw)
  To: Sagi Shahar
  Cc: Dave Hansen, linux-kernel, kvm, Sean Christopherson,
	Paolo Bonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, Yamahata, Isaku

On Wed, 2022-05-18 at 09:19 -0700, Sagi Shahar wrote:
> On Tue, Apr 26, 2022 at 5:06 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Tue, 2022-04-26 at 13:59 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > TDX supports shutting down the TDX module at any time during its
> > > > lifetime.  After TDX module is shut down, no further SEAMCALL can be
> > > > made on any logical cpu.
> > > 
> > > Is this strictly true?
> > > 
> > > I thought SEAMCALLs were used for the P-SEAMLDR too.
> > 
> > Sorry will change to no TDX module SEAMCALL can be made on any logical cpu.
> > 
> > [...]
> > 
> > > > 
> > > > +/* Data structure to make SEAMCALL on multiple CPUs concurrently */
> > > > +struct seamcall_ctx {
> > > > +   u64 fn;
> > > > +   u64 rcx;
> > > > +   u64 rdx;
> > > > +   u64 r8;
> > > > +   u64 r9;
> > > > +   atomic_t err;
> > > > +   u64 seamcall_ret;
> > > > +   struct tdx_module_output out;
> > > > +};
> > > > +
> > > > +static void seamcall_smp_call_function(void *data)
> > > > +{
> > > > +   struct seamcall_ctx *sc = data;
> > > > +   int ret;
> > > > +
> > > > +   ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
> > > > +                   &sc->seamcall_ret, &sc->out);
> 
> Are the seamcall_ret and out fields in seamcall_ctx going to be used?
> Right now it looks like no one is going to read them.
> If they are going to be used then this is going to cause a race since
> the different CPUs are going to write concurrently to the same address
> inside seamcall().
> We should either use local memory and write using atomic_set like the
> case for the err field or hard code NULL at the call site if they are
> not going to be used.
> > > > 

Thanks for catching this.  Both 'seamcall_ret' and 'out' are actually not used
in this series, but this needs to be improved for sure.  

I think I can just remove them from the 'seamcall_ctx' for now, since they are
not used at all.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-05-18 22:30       ` Sagi Shahar
@ 2022-05-18 23:56         ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-18 23:56 UTC (permalink / raw)
  To: Sagi Shahar
  Cc: Dave Hansen, linux-kernel, kvm, Sean Christopherson,
	Paolo Bonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, Yamahata, Isaku

On Wed, 2022-05-18 at 15:30 -0700, Sagi Shahar wrote:
> > > 
> > > > +   pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version
> > > > %u, build_date %u, build_num %u",
> > > > +           tdx_sysinfo.vendor_id, tdx_sysinfo.major_version,
> > > > +           tdx_sysinfo.minor_version, tdx_sysinfo.build_date,
> > > > +           tdx_sysinfo.build_num);
> > > > +
> > > > +   /* Print BIOS provided CMRs */
> > > > +   print_cmrs(tdx_cmr_array, cmr_num, "BIOS-CMR");
> > > > +
> 
> sanitize_cmrs already prints the cmrs in case of success. So for valid
> cmrs we are going to print them twice.
> Would it be better to only print cmrs here in case sanitize_cmrs fails?

The "BIOS-CMR" will always have 32 entries.  It includes all the *empty* CMRs
after the valid CMRs, so the two are different.  But yes it seems there's no
need to always print "BIOS-CMR".


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-04-29 17:47           ` Dave Hansen
  2022-05-02  5:04             ` Kai Huang
@ 2022-05-25  4:47             ` Kai Huang
  2022-05-25  4:57               ` Kai Huang
  1 sibling, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-25  4:47 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Fri, 2022-04-29 at 10:47 -0700, Dave Hansen wrote:
> On 4/28/22 16:14, Kai Huang wrote:
> > On Thu, 2022-04-28 at 07:06 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:15, Kai Huang wrote:
> > > > > Couldn't we get rid of that comment if you did something like:
> > > > > 
> > > > > 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
> > > > 
> > > > Yes will do.
> > > > 
> > > > > and preferably make the variables function-local.
> > > > 
> > > > 'tdx_sysinfo' will be used by KVM too.
> > > 
> > > In other words, it's not a part of this series so I can't review whether
> > > this statement is correct or whether there's a better way to hand this
> > > information over to KVM.
> > > 
> > > This (minor) nugget influencing the design also isn't even commented or
> > > addressed in the changelog.
> > 
> > TDSYSINFO_STRUCT is 1024B and CMR array is 512B, so I don't think it should be
> > in the stack.  I can change to use dynamic allocation at the beginning and free
> > it at the end of the function.  KVM support patches can change it to static
> > variable in the file.
> 
> 2k of stack is big, but it isn't a deal breaker for something that's not
> nested anywhere and that's only called once in a pretty controlled
> setting and not in interrupt context.  I wouldn't cry about it.

Hi Dave,

I got below complaining when I use local variable for TDSYSINFO_STRUCT and CMR
array:

arch/x86/virt/vmx/tdx/tdx.c:383:1: warning: the frame size of 3072 bytes is
larger than 1024 bytes [-Wframe-larger-than=]
  383 | }

So I don't think we can use local variable for them.  I'll still use static
variables to avoid dynamic allocation.  In the commit message, I'll explain they
are too big to put into the stack, and KVM will need to use TDSYSINFO_STRUCT
reported by TDX module anyway.

Let me know if you don't agree?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-05-25  4:47             ` Kai Huang
@ 2022-05-25  4:57               ` Kai Huang
  2022-05-25 16:00                 ` Kai Huang
  0 siblings, 1 reply; 156+ messages in thread
From: Kai Huang @ 2022-05-25  4:57 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-05-25 at 16:47 +1200, Kai Huang wrote:
> On Fri, 2022-04-29 at 10:47 -0700, Dave Hansen wrote:
> > On 4/28/22 16:14, Kai Huang wrote:
> > > On Thu, 2022-04-28 at 07:06 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:15, Kai Huang wrote:
> > > > > > Couldn't we get rid of that comment if you did something like:
> > > > > > 
> > > > > > 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
> > > > > 
> > > > > Yes will do.
> > > > > 
> > > > > > and preferably make the variables function-local.
> > > > > 
> > > > > 'tdx_sysinfo' will be used by KVM too.
> > > > 
> > > > In other words, it's not a part of this series so I can't review whether
> > > > this statement is correct or whether there's a better way to hand this
> > > > information over to KVM.
> > > > 
> > > > This (minor) nugget influencing the design also isn't even commented or
> > > > addressed in the changelog.
> > > 
> > > TDSYSINFO_STRUCT is 1024B and CMR array is 512B, so I don't think it should be
> > > in the stack.  I can change to use dynamic allocation at the beginning and free
> > > it at the end of the function.  KVM support patches can change it to static
> > > variable in the file.
> > 
> > 2k of stack is big, but it isn't a deal breaker for something that's not
> > nested anywhere and that's only called once in a pretty controlled
> > setting and not in interrupt context.  I wouldn't cry about it.
> 
> Hi Dave,
> 
> I got below complaining when I use local variable for TDSYSINFO_STRUCT and CMR
> array:
> 
> arch/x86/virt/vmx/tdx/tdx.c:383:1: warning: the frame size of 3072 bytes is
> larger than 1024 bytes [-Wframe-larger-than=]
>   383 | }
> 
> So I don't think we can use local variable for them.  I'll still use static
> variables to avoid dynamic allocation.  In the commit message, I'll explain they
> are too big to put into the stack, and KVM will need to use TDSYSINFO_STRUCT
> reported by TDX module anyway.
> 
> Let me know if you don't agree?

Btw, CMR array alone can be put into the stack.  It will never be used by KVM,
so I'll put CMR array as local variable, but keep tdx_sysinfo as static
variable.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory
  2022-05-25  4:57               ` Kai Huang
@ 2022-05-25 16:00                 ` Kai Huang
  0 siblings, 0 replies; 156+ messages in thread
From: Kai Huang @ 2022-05-25 16:00 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, kvm
  Cc: seanjc, pbonzini, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, peterz, ak, kirill.shutemov,
	sathyanarayanan.kuppuswamy, isaku.yamahata

On Wed, 2022-05-25 at 16:57 +1200, Kai Huang wrote:
> On Wed, 2022-05-25 at 16:47 +1200, Kai Huang wrote:
> > On Fri, 2022-04-29 at 10:47 -0700, Dave Hansen wrote:
> > > On 4/28/22 16:14, Kai Huang wrote:
> > > > On Thu, 2022-04-28 at 07:06 -0700, Dave Hansen wrote:
> > > > > On 4/27/22 17:15, Kai Huang wrote:
> > > > > > > Couldn't we get rid of that comment if you did something like:
> > > > > > > 
> > > > > > > 	ret = tdx_get_sysinfo(&tdx_cmr_array, &tdx_sysinfo);
> > > > > > 
> > > > > > Yes will do.
> > > > > > 
> > > > > > > and preferably make the variables function-local.
> > > > > > 
> > > > > > 'tdx_sysinfo' will be used by KVM too.
> > > > > 
> > > > > In other words, it's not a part of this series so I can't review whether
> > > > > this statement is correct or whether there's a better way to hand this
> > > > > information over to KVM.
> > > > > 
> > > > > This (minor) nugget influencing the design also isn't even commented or
> > > > > addressed in the changelog.
> > > > 
> > > > TDSYSINFO_STRUCT is 1024B and CMR array is 512B, so I don't think it should be
> > > > in the stack.  I can change to use dynamic allocation at the beginning and free
> > > > it at the end of the function.  KVM support patches can change it to static
> > > > variable in the file.
> > > 
> > > 2k of stack is big, but it isn't a deal breaker for something that's not
> > > nested anywhere and that's only called once in a pretty controlled
> > > setting and not in interrupt context.  I wouldn't cry about it.
> > 
> > Hi Dave,
> > 
> > I got below complaining when I use local variable for TDSYSINFO_STRUCT and CMR
> > array:
> > 
> > arch/x86/virt/vmx/tdx/tdx.c:383:1: warning: the frame size of 3072 bytes is
> > larger than 1024 bytes [-Wframe-larger-than=]
> >   383 | }
> > 
> > So I don't think we can use local variable for them.  I'll still use static
> > variables to avoid dynamic allocation.  In the commit message, I'll explain they
> > are too big to put into the stack, and KVM will need to use TDSYSINFO_STRUCT
> > reported by TDX module anyway.
> > 
> > Let me know if you don't agree?
> 
> Btw, CMR array alone can be put into the stack.  It will never be used by KVM,
> so I'll put CMR array as local variable, but keep tdx_sysinfo as static
> variable.
> 

Sorry for multiple emails about this.  If I put CMR array to the stack, I still
sometimes get the build warning.  So will use static variables.

Also, constructing TDMRs internally needs to use tdx_sysinfo (max_tdmrs,
pamt_entry_size, max_rsvd_per_tdmr), so with static variable they don't need to
be passed around as function arguments.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 156+ messages in thread

end of thread, other threads:[~2022-05-25 16:03 UTC | newest]

Thread overview: 156+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-06  4:49 [PATCH v3 00/21] TDX host kernel support Kai Huang
2022-04-06  4:49 ` [PATCH v3 01/21] x86/virt/tdx: Detect SEAM Kai Huang
2022-04-18 22:29   ` Sathyanarayanan Kuppuswamy
2022-04-18 22:50     ` Sean Christopherson
2022-04-19  3:38     ` Kai Huang
2022-04-26 20:21   ` Dave Hansen
2022-04-26 23:12     ` Kai Huang
2022-04-26 23:28       ` Dave Hansen
2022-04-26 23:49         ` Kai Huang
2022-04-27  0:22           ` Sean Christopherson
2022-04-27  0:44             ` Kai Huang
2022-04-27 14:22           ` Dave Hansen
2022-04-27 22:39             ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs Kai Huang
2022-04-19  5:39   ` Sathyanarayanan Kuppuswamy
2022-04-19  9:41     ` Kai Huang
2022-04-19  5:42   ` Sathyanarayanan Kuppuswamy
2022-04-19 10:07     ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function Kai Huang
2022-04-19 14:07   ` Sathyanarayanan Kuppuswamy
2022-04-20  4:16     ` Kai Huang
2022-04-20  7:29       ` Sathyanarayanan Kuppuswamy
2022-04-20 10:39         ` Kai Huang
2022-04-26 20:37   ` Dave Hansen
2022-04-26 23:29     ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and initializing TDX on demand Kai Huang
2022-04-19 14:53   ` Sathyanarayanan Kuppuswamy
2022-04-20  4:37     ` Kai Huang
2022-04-20  5:21       ` Dave Hansen
2022-04-20 14:30       ` Sathyanarayanan Kuppuswamy
2022-04-20 22:35         ` Kai Huang
2022-04-26 20:53   ` Dave Hansen
2022-04-27  0:43     ` Kai Huang
2022-04-27 14:49       ` Dave Hansen
2022-04-28  0:00         ` Kai Huang
2022-04-28 14:27           ` Dave Hansen
2022-04-28 23:44             ` Kai Huang
2022-04-28 23:53               ` Dave Hansen
2022-04-29  0:11                 ` Kai Huang
2022-04-29  0:26                   ` Dave Hansen
2022-04-29  0:59                     ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module Kai Huang
2022-04-26 20:56   ` Dave Hansen
2022-04-27  0:01     ` Kai Huang
2022-04-27 14:24       ` Dave Hansen
2022-04-27 21:30         ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
2022-04-23 15:39   ` Sathyanarayanan Kuppuswamy
2022-04-25 23:41     ` Kai Huang
2022-04-26  1:48       ` Sathyanarayanan Kuppuswamy
2022-04-26  2:12         ` Kai Huang
2022-04-26 20:59   ` Dave Hansen
2022-04-27  0:06     ` Kai Huang
2022-05-18 16:19       ` Sagi Shahar
2022-05-18 23:51         ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization Kai Huang
2022-04-20 22:27   ` Sathyanarayanan Kuppuswamy
2022-04-20 22:37     ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
2022-04-24  1:27   ` Sathyanarayanan Kuppuswamy
2022-04-25 23:55     ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and convertible memory Kai Huang
2022-04-25  2:58   ` Sathyanarayanan Kuppuswamy
2022-04-26  0:05     ` Kai Huang
2022-04-27 22:15   ` Dave Hansen
2022-04-28  0:15     ` Kai Huang
2022-04-28 14:06       ` Dave Hansen
2022-04-28 23:14         ` Kai Huang
2022-04-29 17:47           ` Dave Hansen
2022-05-02  5:04             ` Kai Huang
2022-05-25  4:47             ` Kai Huang
2022-05-25  4:57               ` Kai Huang
2022-05-25 16:00                 ` Kai Huang
2022-05-18 22:30       ` Sagi Shahar
2022-05-18 23:56         ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system RAM as TDX memory Kai Huang
2022-04-20 20:48   ` Isaku Yamahata
2022-04-20 22:38     ` Kai Huang
2022-04-27 22:24   ` Dave Hansen
2022-04-28  0:53     ` Kai Huang
2022-04-28  1:07       ` Dave Hansen
2022-04-28  1:35         ` Kai Huang
2022-04-28  3:40           ` Dave Hansen
2022-04-28  3:55             ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 11/21] x86/virt/tdx: Choose to use " Kai Huang
2022-04-20 20:55   ` Isaku Yamahata
2022-04-20 22:39     ` Kai Huang
2022-04-28 15:54   ` Dave Hansen
2022-04-29  7:32     ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM Kai Huang
2022-04-28 16:22   ` Dave Hansen
2022-04-29  7:24     ` Kai Huang
2022-04-29 13:52       ` Dave Hansen
2022-04-06  4:49 ` [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2022-04-28 17:12   ` Dave Hansen
2022-04-29  7:46     ` Kai Huang
2022-04-29 14:20       ` Dave Hansen
2022-04-29 14:30         ` Sean Christopherson
2022-04-29 17:46           ` Dave Hansen
2022-04-29 18:19             ` Sean Christopherson
2022-04-29 18:32               ` Dave Hansen
2022-05-02  5:59         ` Kai Huang
2022-05-02 14:17           ` Dave Hansen
2022-05-02 21:55             ` Kai Huang
2022-04-06  4:49 ` [PATCH v3 14/21] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
2022-04-06  4:49 ` [PATCH v3 15/21] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
2022-04-06  4:49 ` [PATCH v3 16/21] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
2022-04-06  4:49 ` [PATCH v3 17/21] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2022-04-06  4:49 ` [PATCH v3 18/21] x86/virt/tdx: Initialize all TDMRs Kai Huang
2022-04-06  4:49 ` [PATCH v3 19/21] x86: Flush cache of TDX private memory during kexec() Kai Huang
2022-04-06  4:49 ` [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX host support Kai Huang
2022-04-28 17:25   ` Dave Hansen
2022-04-06  4:49 ` [PATCH v3 21/21] Documentation/x86: Add documentation for " Kai Huang
2022-04-14 10:19 ` [PATCH v3 00/21] TDX host kernel support Kai Huang
2022-04-26 20:13 ` Dave Hansen
2022-04-27  1:15   ` Kai Huang
2022-04-27 21:59     ` Dave Hansen
2022-04-28  0:37       ` Kai Huang
2022-04-28  0:50         ` Dave Hansen
2022-04-28  0:58           ` Kai Huang
2022-04-29  1:40             ` Kai Huang
2022-04-29  3:04               ` Dan Williams
2022-04-29  5:35                 ` Kai Huang
2022-05-03 23:59               ` Kai Huang
2022-05-04  0:25                 ` Dave Hansen
2022-05-04  1:15                   ` Kai Huang
2022-05-05  9:54                     ` Kai Huang
2022-05-05 13:51                       ` Dan Williams
2022-05-05 22:14                         ` Kai Huang
2022-05-06  0:22                           ` Dan Williams
2022-05-06  0:45                             ` Kai Huang
2022-05-06  1:15                               ` Dan Williams
2022-05-06  1:46                                 ` Kai Huang
2022-05-06 15:57                                   ` Dan Williams
2022-05-09  2:46                                     ` Kai Huang
2022-05-10 10:25                                       ` Kai Huang
2022-05-07  0:09                         ` Mike Rapoport
2022-05-08 10:00                           ` Kai Huang
2022-05-09 10:33                             ` Mike Rapoport
2022-05-09 23:27                               ` Kai Huang
2022-05-04 14:31                 ` Dan Williams
2022-05-04 22:50                   ` Kai Huang
2022-04-28  1:01   ` Dan Williams
2022-04-28  1:21     ` Kai Huang
2022-04-29  2:58       ` Dan Williams
2022-04-29  5:43         ` Kai Huang
2022-04-29 14:39         ` Dave Hansen
2022-04-29 15:18           ` Dan Williams
2022-04-29 17:18             ` Dave Hansen
2022-04-29 17:48               ` Dan Williams
2022-04-29 18:34                 ` Dave Hansen
2022-04-29 18:47                   ` Dan Williams
2022-04-29 19:20                     ` Dave Hansen
2022-04-29 21:20                       ` Dan Williams
2022-04-29 21:27                         ` Dave Hansen
2022-05-02 10:18                   ` Kai Huang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).