* [PATCH v8 00/16] TDX host kernel support
@ 2022-12-09  6:52 Kai Huang
  2022-12-09  6:52 ` [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
                   ` (15 more replies)
  0 siblings, 16 replies; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately[2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed[3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionalities, and doesn't aim
to resolve all things perfectly.  For example, memory hotplug is handled
in a simple way (please refer to the "Kernel policy on TDX memory" and
"Memory hotplug" sections below).  Huang, Ying is working on a series to
improve this and will send it out separately.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

Also, the TDX module metadata allocation currently just uses
alloc_contig_pages() to allocate a large chunk at runtime, so it can
fail.  This is imperfect for now but _will_ be improved in the future.

Also, the patch to add the new kernel command line option tdx="force"
isn't included in this initial version, as Dave suggested it isn't
mandatory.  But I _will_ add one once this initial version gets merged.

All other optimizations will be posted as follow-up once this initial
TDX support is upstreamed.

Hi Dave, Peter, Thomas, Dan (and Intel reviewers),

From v7 -> v8, the big change is that we are pushing to remove
TDH.SYS.INIT and TDH.SYS.LP.INIT (per-LP initialization) from the kernel.

I'm assuming that the TDX spec and module will be changed to remove the
TDH.SYS.INIT and TDH.SYS.LP.INIT SEAMCALLs.  As a result, I've removed
those patches from this series.

But, the current TDX module that I'm testing with still requires those
SEAMCALLs.  So, I'm applying them at the end and testing with them in
place.

I would appreciate it if folks could review this presumptive series
anyway.

And I would appreciate Reviewed-by or Acked-by tags if the patches look
good to you.

----- Changelog history: ------

- v7 -> v8:

 - 200+ LOC removed (from 1800+ -> 1600+).
 - Removed patches to do TDH.SYS.INIT and TDH.SYS.LP.INIT
   (Dave/Peter/Thomas).
 - Removed patch to shut down TDX module (Sean).
 - For memory hotplug, changed to reject non-TDX memory in the memory
   notifier instead of in arch_add_memory() (Dan/David).
 - Simplified the "skeleton patch" as a result of removing the
   TDH.SYS.LP.INIT patch.
 - Refined changelog/comments for most of the patches (to tell better
   story, remove silly comments, etc) (Dave).
 - Added new 'struct tdmr_info_list' struct, and changed all TDMR related
   patches to use it (Dave).
 - Effectively merged patch "Reserve TDX module global KeyID" and
   "Configure TDX module with TDMRs and global KeyID", and removed the
   static variable 'tdx_global_keyid', following Dave's suggestion on
   making tdx_sysinfo local variable.
 - For detailed changes please see individual patch changelog history.

 v7: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v6 -> v7:
  - Added memory hotplug support.
  - Changed when the list of "TDX-usable" memory regions is chosen, from
    kernel boot time to TDX module initialization time.
  - Addressed comments received in previous versions. (Andi/Dave).
  - Improved the commit message and the comments of the kexec() support
    patch, and made the patch handle returning PAMTs back to the kernel
    when TDX module initialization fails.  Please also see the "kexec()"
    section below.
  - Changed the documentation patch accordingly.
  - For all others please see individual patch changelog history.

 v6: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v5 -> v6:

  - Removed ACPI CPU/memory hotplug patches. (Intel internal discussion)
  - Removed patch to disable driver-managed memory hotplug (Intel
    internal discussion).
  - Added one patch to introduce enum type for TDX supported page size
    level to replace the hard-coded values in TDX guest code (Dave).
  - Added one patch to make TDX depend on X2APIC being enabled (Dave).
  - Added one patch to build all boot-time present memory regions as TDX
    memory during kernel boot.
  - Added Reviewed-by from others to some patches.
  - For all others please see individual patch changelog history.

 v5: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v4 -> v5:

  This is essentially a resend of v4.  Sorry I forgot to consult
  get_maintainer.pl when sending out v4, so I forgot to add the linux-acpi
  and linux-mm mailing lists and the relevant people for 4 new patches.

  There are also very minor code and commit message updates from v4:

  - Rebased to latest tip/x86/tdx.
  - Fixed a checkpatch issue that I missed in v4.
  - Removed an obsoleted comment that I missed in patch 6.
  - Very minor update to the commit message of patch 12.

  For other changes to individual patches since v3, please refer to the
  changelog history of individual patches (I just used v3 -> v5 since
  there's basically no code change in v4).

 v4: https://lore.kernel.org/lkml/98c84c31d8f062a0b50a69ef4d3188bc259f2af2.1654025431.git.kai.huang@intel.com/T/

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX keyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver managed memory
   hotplug.
 - Removed tdx_detect() and only use a single tdx_init().
 - Removed detecting TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the boot-time command line to disable TDX patch.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

 v3: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- V2 -> v3:

 - Addressed comments from Isaku.
  - Fixed memory leak and unnecessary function argument in the patch to
    configure the key for the global keyid (patch 17).
  - Slightly enhanced the patch to get TDX module and CMR information
    (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add TDX host kernel support material
   to Documentation/x86/tdx.rst together with the TDX guest stuff, instead
   of using a standalone file (patch 21).
 - Very minor improvement in commit messages.

 v2: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues that are related to finding all RAM memory regions
    based on e820.
  - Minor improvement on comments and commit messages.

 v1: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in this new
isolated range as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within the SEAM
mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which use a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to that of the TDX module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter "13. Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function tdx_enable() to allow the caller to initialize
TDX at runtime:

        if (tdx_enable())
                goto no_tdx;

        /* TDX is ready to create TD guests. */

This approach has the following pros:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is only consumed
when TDX is truly needed (i.e. when KVM wants to create TD guests).  For
example, a machine with 1TB of RAM would need roughly 4GB reserved as
metadata.

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids having to handle VMXON in the
core kernel (see the sketch after this list).

3) It is more flexible for supporting "TDX module runtime update" (not in
this series).  After updating to a new module at runtime, the kernel needs
to go through the initialization process again.
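
As a rough illustration only (not code from this series): the expected
caller (KVM) could do something like the sketch below, where
do_vmxon_all_cpus() is a placeholder for KVM's existing VMXON path and
the cpus_read_lock() usage is just one way to keep the set of online
CPUs stable across the call:

  static int enable_tdx_for_kvm(void)
  {
          int ret;

          /*
           * tdx_enable() requires all online CPUs to already be in VMX
           * operation (sketch; names other than tdx_enable() are
           * placeholders).
           */
          cpus_read_lock();
          ret = do_vmxon_all_cpus();      /* placeholder for KVM's VMXON path */
          if (!ret)
                  ret = tdx_enable();
          cpus_read_unlock();

          return ret;
  }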

2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
never support hotpluggable CPU devices and/or deliver ACPI CPU hotplug
events to the kernel.  This series doesn't handle physical (ACPI) CPU
hotplug at all but depends on the BIOS to behave correctly.

Note TDX works with CPU logical online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as usable
for TDX private memory.

The initial support for TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM present in the core-mm at the time of initializing the TDX
module as TDX memory, to guarantee that all pages in the page allocator
are TDX pages.
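
As a rough sketch of what that could look like (the names tdx_memblock,
add_tdx_memblock() and build_tdx_memlist() are illustrative, not
necessarily what the actual patches use), the kernel could snapshot all
memblock memory regions at TDX module initialization time.  This assumes
memblock data is still available at runtime (e.g. via ARCH_KEEP_MEMBLOCK):

  /* Needs <linux/memblock.h>, <linux/list.h> and <linux/slab.h>. */
  struct tdx_memblock {
          struct list_head list;
          unsigned long start_pfn;
          unsigned long end_pfn;
  };

  /* All "TDX-usable" memory regions, used later to construct TDMRs. */
  static LIST_HEAD(tdx_memlist);

  static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn)
  {
          struct tdx_memblock *tmb;

          tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
          if (!tmb)
                  return -ENOMEM;

          INIT_LIST_HEAD(&tmb->list);
          tmb->start_pfn = start_pfn;
          tmb->end_pfn = end_pfn;
          list_add_tail(&tmb->list, &tdx_memlist);

          return 0;
  }

  /* Record all memblock memory regions present when TDX is initialized. */
  static int build_tdx_memlist(void)
  {
          unsigned long start_pfn, end_pfn;
          int i, ret;

          for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
                  ret = add_tdx_memblock(start_pfn, end_pfn);
                  if (ret)
                          return ret;
          }

          return 0;
  }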

4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed for the module's
runtime.  No more "TDX-usable" memory can be added to the TDX module
after that.

To achieve the above goal of guaranteeing that all pages in the page
allocator are TDX pages, this series simply chooses to reject any
non-TDX-usable memory in memory hotplug.
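
A minimal sketch of that rejection, assuming a memory hotplug notifier
and a helper that checks a hot-added range against the list of
TDX-usable memory built at initialization time (is_tdx_memory() and
tdx_memlist are illustrative names, reusing the sketch above):

  /* Needs <linux/memory.h>. */
  static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
  {
          struct tdx_memblock *tmb;

          /* The range is TDX memory if one recorded region fully covers it. */
          list_for_each_entry(tmb, &tdx_memlist, list) {
                  if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
                          return true;
          }

          return false;
  }

  static int tdx_memory_notifier(struct notifier_block *nb,
                                 unsigned long action, void *v)
  {
          struct memory_notify *mn = v;

          if (action != MEM_GOING_ONLINE)
                  return NOTIFY_OK;

          /*
           * Reject onlining any memory that was not passed to the TDX
           * module, so the page allocator only ever contains TDX pages.
           */
          return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
                          NOTIFY_OK : NOTIFY_BAD;
  }

  static struct notifier_block tdx_memory_nb = {
          .notifier_call = tdx_memory_notifier,
  };

  /* Registered during early init:  register_memory_notifier(&tdx_memory_nb); */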

This _will_ be enhanced in the future after the first submission.

A better solution, suggested by Kirill, is similar to the per-node memory
encryption flag in this series [4].  We can allow adding/onlining non-TDX
memory to separate NUMA nodes so that both "TDX-capable" nodes and
"non-TDX-capable" nodes can co-exist.  The new TDX flag can be exposed to
userspace via sysfs so userspace can bind TDX guests to "TDX-capable"
nodes via NUMA ABIs (e.g. numactl).

5. Physical Memory Hotplug

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support hot-removal
of any convertible memory.  This implementation doesn't handle ACPI memory
removal but depends on the BIOS to behave correctly.

6. Kexec()

There are two problems with using kexec() to boot into a new kernel when
the old kernel has enabled TDX: 1) part of the memory is still TDX
private pages (i.e. metadata used by the TDX module, and any TDX guest
memory if kexec() happens while a TDX guest is alive); 2) there might be
dirty cachelines associated with TDX private pages.

Just like SME, TDX hosts require special cache flushing before kexec().
Similar to the SME handling, the kernel uses wbinvd() to flush the cache
in stop_this_cpu() when TDX is enabled.
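
A minimal sketch of that flush, using the platform_tdx_enabled() helper
added earlier in this series (the exact condition in the actual patch
may differ):

  /*
   * In stop_this_cpu(), next to the existing SME cache flush: flush
   * dirty cachelines of TDX private memory (which carry TDX KeyIDs)
   * before the new kernel reuses the memory with KeyID 0.
   */
  if (platform_tdx_enabled())
          native_wbinvd();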

This series doesn't convert all TDX private pages back to normal due to
the following considerations:

1) Neither the TDX module nor the kernel has existing infrastructure to
   track which pages are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
   them (cache flush + using MOVDIR64B to clear the page) in kexec() can
   be time-consuming.
3) The new kernel will almost exclusively use KeyID 0 to access memory.
   KeyID 0 doesn't support integrity checking, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd-party
   kernel ever supports MKTME, it can/should use MOVDIR64B to clear the
   page with the new MKTME KeyID (just like TDX does) before using it.



Kai Huang (16):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  x86/virt/tdx: Add skeleton to initialize TDX on demand
  x86/virt/tdx: Implement functions to make SEAMCALL
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Designate reserved areas for all TDMRs
  x86/virt/tdx: Designate the global KeyID and configure the TDX module
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  169 +++-
 arch/x86/Kconfig                 |   15 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/tdx/tdx.c          |    6 +-
 arch/x86/include/asm/tdx.h       |   23 +
 arch/x86/kernel/process.c        |    8 +-
 arch/x86/kernel/setup.c          |    2 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    3 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1229 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  128 ++++
 arch/x86/virt/vmx/tdx/tdxcall.S  |   19 +-
 14 files changed, 1643 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

-- 
2.38.1



* [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 19:04   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
defined by the TDX module spec and used as TDX module ABI.  Currently,
they are used in try_accept_one() when the TDX guest tries to accept a
page.  However, try_accept_one() currently uses hard-coded magic values.

Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one().  TDX host support will need to use them too.

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Improved the comment of TDX supported page sizes macros (Dave)

v6 -> v7:
 - Removed the helper to convert kernel page level to TDX page level.
 - Changed to use macro to define TDX supported page sizes.

---
 arch/x86/coco/tdx/tdx.c    | 6 +++---
 arch/x86/include/asm/tdx.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index cfd4c95b9f04..7fa7fb54f438 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -722,13 +722,13 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 	 */
 	switch (pg_level) {
 	case PG_LEVEL_4K:
-		page_size = 0;
+		page_size = TDX_PS_4K;
 		break;
 	case PG_LEVEL_2M:
-		page_size = 1;
+		page_size = TDX_PS_2M;
 		break;
 	case PG_LEVEL_1G:
-		page_size = 2;
+		page_size = TDX_PS_1G;
 		break;
 	default:
 		return false;
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..25fd6070dc0b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,11 @@
 
 #ifndef __ASSEMBLY__
 
+/* TDX supported page sizes from the TDX module ABI. */
+#define TDX_PS_4K	0
+#define TDX_PS_2M	1
+#define TDX_PS_1G	2
+
 /*
  * Used to gather the output registers values of the TDCALL and SEAMCALL
  * instructions when requesting services from the TDX module.
-- 
2.38.1



* [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
  2022-12-09  6:52 ` [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 17:09   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

TDX doesn't trust the BIOS.  During machine boot, TDX verifies the TDX
private KeyIDs are consistently and correctly programmed by the BIOS
across all CPU packages before it enables TDX on any CPU core.  A valid
TDX private KeyID range on the BSP indicates TDX has been enabled by the
BIOS; otherwise, the BIOS is buggy.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.

Add a new early_initcall(tdx_init) to detect the TDX private KeyIDs.
Both TDX module initialization and creating TDX guests require using a
TDX private KeyID.  Also add a function to report whether TDX is enabled
by the BIOS (i.e. the TDX KeyID range is valid).  Similar to AMD SME,
kexec() will use it to determine whether a cache flush is needed.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support).  So far only KVM uses TDX.  Make the new config option
depend on KVM_INTEL.

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (address Dave's comments)
 - Improved changelog:
    - "KVM user" -> "The TDX module will be initialized by KVM when ..."
    - Changed "tdx_int" part to "Just say what this patch is doing"
    - Fixed the last sentence of "kexec()" paragraph
  - detect_tdx() -> record_keyid_partitioning()
  - Improved how to calculate tdx_keyid_start.
  - tdx_keyid_num -> nr_tdx_keyids.
  - Improved dmesg printing.
  - Add comment to clear_tdx().

v6 -> v7:
 - No change.

v5 -> v6:
 - Removed SEAMRR detection to make code simpler.
 - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
 - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).

---
 arch/x86/Kconfig               | 12 +++++
 arch/x86/Makefile              |  2 +
 arch/x86/include/asm/tdx.h     |  7 +++
 arch/x86/virt/Makefile         |  2 +
 arch/x86/virt/vmx/Makefile     |  2 +
 arch/x86/virt/vmx/tdx/Makefile |  2 +
 arch/x86/virt/vmx/tdx/tdx.c    | 90 ++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    | 15 ++++++
 8 files changed, 132 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..cced4ef3bfb2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1953,6 +1953,18 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	depends on KVM_INTEL
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 415a5d138de4..38d3e8addc5f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -246,6 +246,8 @@ archheaders:
 
 libs-y  += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI)            += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 25fd6070dc0b..4dfe2e794411 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else	/* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..93ca8b73e1f1
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..292852773ced
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trust Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/printk.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+#include "tdx.h"
+
+static u32 tdx_keyid_start __ro_after_init;
+static u32 nr_tdx_keyids __ro_after_init;
+
+/*
+ * Clearing tdx_keyid_start and nr_tdx_keyids indicates that TDX is
+ * uninitialized.  This is used in TDX initialization error paths to
+ * take it from initialized -> uninitialized.
+ */
+static void __init clear_tdx(void)
+{
+	tdx_keyid_start = nr_tdx_keyids = 0;
+}
+
+static int __init record_keyid_partitioning(void)
+{
+	u32 nr_mktme_keyids;
+	int ret;
+
+	/*
+	 * IA32_MKTME_KEYID_PARTITIONING:
+	 *   Bit [31:0]:	Number of MKTME KeyIDs.
+	 *   Bit [63:32]:	Number of TDX private KeyIDs.
+	 */
+	ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &nr_mktme_keyids,
+			&nr_tdx_keyids);
+	if (ret)
+		return -ENODEV;
+
+	if (!nr_tdx_keyids)
+		return -ENODEV;
+
+	/* TDX KeyIDs start after the last MKTME KeyID. */
+	tdx_keyid_start = nr_mktme_keyids + 1;
+
+	pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
+			tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
+
+	return 0;
+}
+
+static int __init tdx_init(void)
+{
+	int err;
+
+	err = record_keyid_partitioning();
+	if (err)
+		return err;
+
+	/*
+	 * Initializing the TDX module requires one TDX private KeyID.
+	 * If there's only one TDX KeyID then after module initialization
+	 * KVM won't be able to run any TDX guest, which makes the whole
+	 * thing worthless.  Just disable TDX in this case.
+	 */
+	if (nr_tdx_keyids < 2) {
+		pr_info("initialization failed: too few private KeyIDs available (%d).\n",
+				nr_tdx_keyids);
+		goto no_tdx;
+	}
+
+	return 0;
+no_tdx:
+	clear_tdx();
+	return -ENODEV;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+	return !!nr_tdx_keyids;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..d00074abcb20
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability.  The
+ * architectural definitions come first.
+ */
+
+/* MSR to report KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
+#endif
-- 
2.38.1



* [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
  2022-12-09  6:52 ` [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
  2022-12-09  6:52 ` [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 19:04   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

TDX capable platforms are locked to X2APIC mode and cannot fall back to
the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.

Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - Only make INTEL_TDX_HOST depend on X86_X2APIC but removed other code
 - Rewrote the changelog.

v6 -> v7:
 - Changed to use "Link" for the two lore links to get rid of checkpatch
   warning.

---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cced4ef3bfb2..dd333b46fafb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	depends on KVM_INTEL
+	depends on X86_X2APIC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
-- 
2.38.1



* [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (2 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 17:14   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL Kai Huang
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

Before the TDX module can be used to create and run TDX guests, it must
be loaded and properly initialized.  The TDX module is expected to be
loaded by the BIOS, and to be initialized by the kernel.

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  The host
kernel communicates with the TDX module via a new SEAMCALL instruction.
The TDX module implements a set of SEAMCALL leaf functions to allow the
host kernel to initialize it.

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach to defer initializing TDX until there is a real
need (e.g. when requested by KVM).  This approach has the following pros:

1) It avoids consuming the memory that must be allocated by the kernel
and given to the TDX module as metadata (~1/256th of the TDX-usable
memory), and also saves the CPU cycles of initializing the TDX module
(and the metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running.  The update procedure shares quite a few steps with this "on
demand" loading mechanism.  The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism.  A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Loading the TDX module requires VMX to be enabled.  Currently, only
the kernel KVM code mucks with VMX enabling.  If the TDX module were to
be initialized separately from KVM (like at boot), the boot code would
need to be taught how to muck with VMX enabling and KVM would need to be
taught how to cope with that.  Making KVM itself responsible for TDX
initialization lets the rest of the kernel stay blissfully unaware of
VMX.

Add a placeholder tdx_enable() to initialize the TDX module on demand.
Use a state machine protected by a mutex to make sure the initialization
is only done once, as it can be called multiple times (i.e. the KVM
module can be reloaded) and may be called concurrently by other kernel
components in the future.

The TDX module will be initialized in multiple steps defined by the TDX
module, and most of those steps involve a specific SEAMCALL:

 1) Get the TDX module information and TDX-capable memory regions
    (TDH.SYS.INFO).
 2) Build the list of TDX-usable memory regions.
 3) Construct a list of "TD Memory Regions" (TDMRs) to cover all
    TDX-usable memory regions.
 4) Pick up one TDX private KeyID as the global KeyID.
 5) Configure the TDMRs and the global KeyID to the TDX module
    (TDH.SYS.CONFIG).
 6) Configure the global KeyID on all packages (TDH.SYS.KEY.CONFIG).
 7) Initialize all TDMRs (TDH.SYS.TDMR.INIT).

Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Refined changelog (Dave).
 - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
 - Add a "TODO list" comment in init_tdx_module() to list all steps of
   initializing the TDX Module to tell the story (Dave).
 - Made tdx_enable() universally return -EINVAL, and removed nonsense
   comments (Dave).
 - Simplified __tdx_enable() to only handle success or failure.
 - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
 - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
 - Improved comments (Dave).
 - Pointed out 'tdx_module_status' is software thing (Dave).

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to set status to TDX_MODULE_NONE if TDX module is not
   loaded (Chao)
 - Added Chao's Reviewed-by.
 - Improved comments around cpus_read_lock().

- v3->v5 (no feedback on v4):
 - Removed the check that SEAMRR and TDX KeyID have been detected on
   all present cpus.
 - Removed tdx_detect().
 - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
   hotplug lock and return early with error message.
 - Improved dmesg printing for TDX module detection and initialization.

---
 arch/x86/include/asm/tdx.h  |  2 +
 arch/x86/virt/vmx/tdx/tdx.c | 93 +++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4dfe2e794411..4a3ee64c1ca7 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -97,8 +97,10 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
+int tdx_enable(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_enable(void)  { return -EINVAL; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 292852773ced..ace9770e5e08 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -12,13 +12,25 @@
 #include <linux/init.h>
 #include <linux/errno.h>
 #include <linux/printk.h>
+#include <linux/mutex.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
+/* Kernel defined TDX module status during module initialization. */
+enum tdx_module_status_t {
+	TDX_MODULE_UNKNOWN,
+	TDX_MODULE_INITIALIZED,
+	TDX_MODULE_ERROR
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 nr_tdx_keyids __ro_after_init;
 
+static enum tdx_module_status_t tdx_module_status;
+/* Prevent concurrent attempts on TDX detection and initialization */
+static DEFINE_MUTEX(tdx_module_lock);
+
 /*
  * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -88,3 +100,84 @@ bool platform_tdx_enabled(void)
 {
 	return !!nr_tdx_keyids;
 }
+
+static int init_tdx_module(void)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Get TDX module information and TDX-capable memory regions.
+	 *  - Build the list of TDX-usable memory regions.
+	 *  - Construct a list of TDMRs to cover all TDX-usable memory
+	 *    regions.
+	 *  - Pick up one TDX private KeyID as the global KeyID.
+	 *  - Configure the TDMRs and the global KeyID to the TDX module.
+	 *  - Configure the global KeyID on all packages.
+	 *  - Initialize all TDMRs.
+	 *
+	 *  Return error before all steps are done.
+	 */
+	return -EINVAL;
+}
+
+static int __tdx_enable(void)
+{
+	int ret;
+
+	ret = init_tdx_module();
+	if (ret) {
+		pr_err_once("initialization failed (%d)\n", ret);
+		tdx_module_status = TDX_MODULE_ERROR;
+		/*
+		 * Just return one universal error code.
+		 * For now the caller cannot recover anyway.
+		 */
+		return -EINVAL;
+	}
+
+	pr_info_once("TDX module initialized.\n");
+	tdx_module_status = TDX_MODULE_INITIALIZED;
+
+	return 0;
+}
+
+/**
+ * tdx_enable - Enable TDX by initializing the TDX module
+ *
+ * The caller must make sure all online cpus are in VMX operation before
+ * calling this function.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return 0 if TDX is enabled successfully, otherwise error.
+ */
+int tdx_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled()) {
+		pr_err_once("initialization failed: TDX is disabled.\n");
+		return -EINVAL;
+	}
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_enable();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		/* Already initialized, great, tell the caller. */
+		ret = 0;
+		break;
+	default:
+		/* Failed to initialize in the previous attempts */
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
-- 
2.38.1



* [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (3 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 17:29   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.  The TDX
module establishes a new SEAMCALL ABI which allows the host to
initialize the module and to manage VMs.

Add infrastructure to make SEAMCALLs.  The SEAMCALL ABI is very similar
to the TDCALL ABI and leverages much of the TDCALL infrastructure.

The SEAMCALL instruction causes #GP when TDX isn't enabled by the BIOS,
and #UD when the CPU is not in VMX operation.  The current
TDX_MODULE_CALL macro doesn't handle either of them.  There's no way to
check whether the CPU is in VMX operation or not.

Initializing the TDX module is done at runtime on demand, and it depends
on the caller to ensure the CPU is in VMX operation before making a
SEAMCALL.  To avoid an Oops when the caller mistakenly tries to
initialize the TDX module while the CPU is not in VMX operation, extend
the TDX_MODULE_CALL macro to handle #UD (and opportunistically #GP since
they share the same assembly).

Introduce two new TDX error codes for #UD and #GP respectively so the
caller can distinguish them.  Also, opportunistically put the new TDX
error codes and the existing TDX_SEAMCALL_VMFAILINVALID under the
INTEL_TDX_HOST Kconfig option as they are only used when it is on.

Any failure during the module initialization is not recoverable for now.
Print out an error message when a SEAMCALL fails, based on the error
code, to help the user understand what went wrong.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Improved changelog (Dave):
   - Trim down some sentences (Dave).
   - Removed __seamcall() and seamcall() function name and changed
     accordingly (Dave).
   - Improved the sentence explaining why to handle #GP (Dave).
 - Added code to print out error message in seamcall(), following
   the idea that tdx_enable() to return universal error and print out
   error message to make clear what's going wrong (Dave).  Also mention
   this in changelog.

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to handle #UD and #GP (Dave).
 - Moved the seamcall() wrapper function to this patch, and used a
   temporary __always_unused to avoid compile warning (Dave).

- v3 -> v5 (no feedback on v4):
 - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
   SEAMCALL itself fails.
 - Improve the changelog.

---
 arch/x86/include/asm/tdx.h       |  9 ++++++
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c      | 49 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 10 ++++++
 arch/x86/virt/vmx/tdx/tdxcall.S  | 19 ++++++++++--
 6 files changed, 138 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4a3ee64c1ca7..5c5ecfddb15b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,10 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
+#ifdef CONFIG_INTEL_TDX_HOST
+
+#include <asm/trapnr.h>
+
 /*
  * SW-defined error codes.
  *
@@ -18,6 +22,11 @@
 #define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | _UL(0xFFFF0000))
 
+#define TDX_SEAMCALL_GP			(TDX_SW_ERROR | X86_TRAP_GP)
+#define TDX_SEAMCALL_UD			(TDX_SW_ERROR | X86_TRAP_UD)
+
+#endif
+
 #ifndef __ASSEMBLY__
 
 /* TDX supported page sizes from the TDX module ABI. */
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 93ca8b73e1f1..38d534f2c113 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y += tdx.o
+obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f81be6b9c133
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *		  (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function.  Additional
+ * output operands are saved in @out (if it is provided by the caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *			 stored temporarily in R12 (not
+ *			 used by the P-SEAMLDR or the TDX
+ *			 module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ace9770e5e08..b7cedf0589db 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -101,6 +101,55 @@ bool platform_tdx_enabled(void)
 	return !!nr_tdx_keyids;
 }
 
+/*
+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
+ * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
+ * leaf function return code and the additional output respectively if
+ * not NULL.
+ */
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				    u64 *seamcall_ret,
+				    struct tdx_module_output *out)
+{
+	u64 sret;
+
+	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+	/* Save SEAMCALL return code if the caller wants it */
+	if (seamcall_ret)
+		*seamcall_ret = sret;
+
+	/* SEAMCALL was successful */
+	if (!sret)
+		return 0;
+
+	switch (sret) {
+	case TDX_SEAMCALL_GP:
+		/*
+		 * tdx_enable() has already checked that BIOS has
+		 * enabled TDX at the very beginning before going
+		 * forward.  It's likely a firmware bug if the
+		 * SEAMCALL still caused #GP.
+		 */
+		pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n");
+		return -ENODEV;
+	case TDX_SEAMCALL_VMFAILINVALID:
+		pr_err_once("TDX module is not loaded.\n");
+		return -ENODEV;
+	case TDX_SEAMCALL_UD:
+		pr_err_once("CPU is not in VMX operation.\n");
+		return -EINVAL;
+	default:
+		pr_err_once("SEAMCALL failed: leaf %llu, error 0x%llx.\n",
+				fn, sret);
+		if (out)
+			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
+					out->rcx, out->rdx, out->r8,
+					out->r9, out->r10, out->r11);
+		return -EIO;
+	}
+}
+
 static int init_tdx_module(void)
 {
 	/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d00074abcb20..884357a4133c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -2,6 +2,8 @@
 #ifndef _X86_VIRT_TDX_H
 #define _X86_VIRT_TDX_H
 
+#include <linux/types.h>
+
 /*
  * This file contains both macros and data structures defined by the TDX
  * architecture and Linux defined software data structures and functions.
@@ -12,4 +14,12 @@
 /* MSR to report KeyID partitioning between MKTME and TDX */
 #define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
 
+/*
+ * Do not put any hardware-defined TDX structure representations below
+ * this comment!
+ */
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
 #endif
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..757b0c34be10 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>
 
 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
 	/* Leave input param 2 in RDX */
 
 	.if \host
+1:
 	seamcall
 	/*
 	 * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,10 +59,23 @@
 	 * This value will never be used as actual SEAMCALL error code as
 	 * it is from the Reserved status code class.
 	 */
-	jnc .Lno_vmfailinvalid
+	jnc .Lseamcall_out
 	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+	jmp .Lseamcall_out
+2:
+	/*
+	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
+	 * the trap number.  Convert the trap number to the TDX error
+	 * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
+	 *
+	 * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
+	 * only accepts 32-bit immediate at most.
+	 */
+	mov $TDX_SW_ERROR, %r12
+	orq %r12, %rax
 
+	_ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:
 	.else
 	tdcall
 	.endif
-- 
2.38.1



* [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (4 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 17:46   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

Start to implement the "multi-step" initialization of the TDX module
listed in the skeleton infrastructure.  Do the first step: get the TDX
module information and the TDX-capable memory regions.

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.

CMRs tell the kernel which memory is TDX compatible.  The kernel takes
CMRs (plus a little more metadata) and constructs "TD Memory Regions"
(TDMRs).  TDMRs let the kernel grant TDX protections to some or all of
the CMR areas.

The TDX module information tells the kernel TDX module properties such
as metadata entry size, the maximum number of TDMRs, and the maximum
number of reserved areas per TDMR that the module allows, etc.

The list of CMRs, along with the TDX module information, is available to
the kernel by querying the TDX module.

For now, both the TDX module information and the CMRs are only used
during module initialization, so declare them as local variables.
However, they are 1024 bytes and 512 bytes respectively.  Putting them
on the stack exceeds the default "stack frame size" that the kernel
assumes is safe, and the compiler yields a warning about this.  Add a
kernel build flag to extend the safe stack size to 4K for tdx.c to
silence the warning -- the initialization function is only called once
so it's safe to have a 4K stack.

Note that not all members in the 1024-byte TDX module information are
used (even by KVM).

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - Improved changelog to tell this is the first patch to transit out the
   "multi-steps" init_tdx_module().
 - Removed all CMR check/trim code but to depend on later SEAMCALL.
 - Variable 'vertical alignment' in print TDX module information.
 - Added DECLARE_PADDED_STRUCT() for padded structure.
 - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable
   (and rename them accordingly), and added -Wframe-larger-than=4096 flag
   to silence the build warning.

v6 -> v7:
 - Simplified the check of CMRs due to the fact that TDX actually
   verifies CMRs (that are passed by the BIOS) before enabling TDX.
 - Changed the function name from check_cmrs() -> trim_empty_cmrs().
 - Added CMR page aligned check so that later patch can just get the PFN
   using ">> PAGE_SHIFT".

v5 -> v6:
 - Added to also print TDX module's attribute (Isaku).
 - Removed all arguments in tdx_get_sysinfo() to use static variables
   of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used
   directly in other functions in later patches.
 - Added Isaku's Reviewed-by.

- v3 -> v5 (no feedback on v4):
 - Renamed sanitize_cmrs() to check_cmrs().
 - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array
   actual size returned by TDH.SYS.INFO.
 - Changed -EFAULT to -EINVAL in couple places.
 - Added comments around tdx_sysinfo and tdx_cmr_array saying they are
   used by TDH.SYS.INFO ABI.
 - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function
   arguments in tdx_get_sysinfo().
 - Changed to only print BIOS-CMR when check_cmrs() fails.

---
 arch/x86/virt/vmx/tdx/Makefile |  1 +
 arch/x86/virt/vmx/tdx/tdx.c    | 85 ++++++++++++++++++++++++++++++++--
 arch/x86/virt/vmx/tdx/tdx.h    | 76 ++++++++++++++++++++++++++++++
 3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 38d534f2c113..f8a40d15fdfc 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
+CFLAGS_tdx.o += -Wframe-larger-than=4096
 obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b7cedf0589db..6fe505c32599 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/mutex.h>
+#include <asm/pgtable_types.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
 #include "tdx.h"
@@ -107,9 +108,8 @@ bool platform_tdx_enabled(void)
  * leaf function return code and the additional output respectively if
  * not NULL.
  */
-static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-				    u64 *seamcall_ret,
-				    struct tdx_module_output *out)
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		    u64 *seamcall_ret, struct tdx_module_output *out)
 {
 	u64 sret;
 
@@ -150,12 +150,85 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	}
 }
 
+static inline bool is_cmr_empty(struct cmr_info *cmr)
+{
+	return !cmr->size;
+}
+
+static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
+{
+	int i;
+
+	for (i = 0; i < nr_cmrs; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		/*
+		 * The array of CMRs reported via TDH.SYS.INFO can
+		 * contain tail empty CMRs.  Don't print them.
+		 */
+		if (is_cmr_empty(cmr))
+			break;
+
+		pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
+				cmr->base + cmr->size);
+	}
+}
+
+/*
+ * Get the TDX module information (TDSYSINFO_STRUCT) and the array of
+ * CMRs, and save them to @sysinfo and @cmr_array, which come from the
+ * kernel stack.  @sysinfo must have been padded to have enough room
+ * to save the TDSYSINFO_STRUCT.
+ */
+static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
+			   struct cmr_info *cmr_array)
+{
+	struct tdx_module_output out;
+	u64 sysinfo_pa, cmr_array_pa;
+	int ret;
+
+	/*
+	 * Cannot use __pa() directly as @sysinfo and @cmr_array
+	 * come from the kernel stack.
+	 */
+	sysinfo_pa = slow_virt_to_phys(sysinfo);
+	cmr_array_pa = slow_virt_to_phys(cmr_array);
+	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
+			cmr_array_pa, MAX_CMRS, NULL, &out);
+	if (ret)
+		return ret;
+
+	pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+		sysinfo->attributes,	sysinfo->vendor_id,
+		sysinfo->major_version, sysinfo->minor_version,
+		sysinfo->build_date,	sysinfo->build_num);
+
+	/* R9 contains the actual entries written to the CMR array. */
+	print_cmrs(cmr_array, out.r9);
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
+	/*
+	 * @tdsysinfo and @cmr_array are used in TDH.SYS.INFO SEAMCALL ABI.
+	 * They are 1024 bytes and 512 bytes respectively but it's fine to
+	 * keep them in the stack as this function is only called once.
+	 */
+	DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
+			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
+	struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
+	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
+	int ret;
+
+	ret = tdx_get_sysinfo(sysinfo, cmr_array);
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Get TDX module information and TDX-capable memory regions.
 	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of TDMRs to cover all TDX-usable memory
 	 *    regions.
@@ -166,7 +239,9 @@ static int init_tdx_module(void)
 	 *
 	 *  Return error before all steps are done.
 	 */
-	return -EINVAL;
+	ret = -EINVAL;
+out:
+	return ret;
 }
 
 static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 884357a4133c..6d32f62e4182 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -3,6 +3,8 @@
 #define _X86_VIRT_TDX_H
 
 #include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/compiler_attributes.h>
 
 /*
  * This file contains both macros and data structures defined by the TDX
@@ -14,6 +16,80 @@
 /* MSR to report KeyID partitioning between MKTME and TDX */
 #define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_INFO		32
+
+struct cmr_info {
+	u64	base;
+	u64	size;
+} __packed;
+
+#define MAX_CMRS			32
+#define CMR_INFO_ARRAY_ALIGNMENT	512
+
+struct cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define DECLARE_PADDED_STRUCT(type, name, size, alignment)	\
+	struct type##_padded {					\
+		union {						\
+			struct type name;			\
+			u8 padding[size];			\
+		};						\
+	} name##_padded __aligned(alignment)
+
+#define PADDED_STRUCT(name)	(name##_padded.name)
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+/*
+ * The size of this structure itself is flexible.  The actual structure
+ * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be
+ * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT().
+ */
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.
+	 */
+	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
+} __packed;
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (5 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 18:18   ` Dave Hansen
  2023-01-18 11:08   ` Huang, Kai
  2022-12-09  6:52 ` [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

As a step of initializing the TDX module, the kernel needs to tell the
TDX module which memory regions can be used by the TDX module as TDX
guest memory.

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  Once this is done, those "TDX-usable" memory regions
are fixed during the module's lifetime.

The initial support of TDX guests will only allocate TDX guest memory
from the global page allocator.  To keep things simple, just make sure
all pages in the page allocator are TDX memory.

To guarantee that, stash off the memblock memory regions at the time of
initializing the TDX module as TDX's own usable memory regions, and in
the meantime, register a TDX memory notifier to reject onlining any new
memory in memory hotplug.

This approach works as in practice all boot-time present DIMMs are TDX
convertible memory.  However, if any non-TDX-convertible memory has been
hot-added (e.g. CXL memory via the kmem driver) before initializing the TDX
module, the module initialization will fail.

This can also be enhanced in the future, e.g. by allowing non-TDX memory
to be added to a separate NUMA node.  In this case, the "TDX-capable" nodes
and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
needs to guarantee memory pages for TDX guests are always allocated from
the "TDX-capable" nodes.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Trimmed down changelog (Dave).
 - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
   (Ying).
 - Moved memory hotplug handling from add_arch_memory() to
   memory_notifier (Dan/David).
 - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
 - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave).
 - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
 - Improve the comment around first 1MB (Dave).
 - Added a comment around reserve_real_mode() to point out TDX code
   relies on first 1MB being reserved (Ying).
 - Added comment to explain why the new online memory range cannot
   cross multiple TDX memory blocks (Dave).
 - Improved other comments (Dave).

---
 arch/x86/Kconfig            |   1 +
 arch/x86/kernel/setup.c     |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++-
 3 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dd333b46fafb..b36129183035 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 216fee7144ee..3a841a77fda4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
 	 *
 	 * Moreover, on machines with SandyBridge graphics or in setups that use
 	 * crashkernel the entire 1M is reserved anyway.
+	 *
+	 * Note the host kernel TDX also requires the first 1MB being reserved.
 	 */
 	reserve_real_mode();
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6fe505c32599..f010402f443d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,13 @@
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/memblock.h>
+#include <linux/memory.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
+#include <linux/pfn.h>
 #include <asm/pgtable_types.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -25,6 +32,12 @@ enum tdx_module_status_t {
 	TDX_MODULE_ERROR
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 nr_tdx_keyids __ro_after_init;
 
@@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX detection and initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* All TDX-usable memory regions */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
 	return 0;
 }
 
+static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/* Empty list means TDX isn't enabled. */
+	if (list_empty(&tdx_memlist))
+		return true;
+
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		/*
+		 * The new range is TDX memory if it is fully covered by
+		 * any TDX memory block.
+		 *
+		 * Note TDX memory blocks originate from memblock memory
+		 * regions, and adjacent regions are only kept separate when
+		 * they have different NUMA nodes or flags.  Therefore the
+		 * new range cannot cross multiple TDX memory blocks.
+		 */
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
+
+static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
+			       void *v)
+{
+	struct memory_notify *mn = v;
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_OK;
+
+	/*
+	 * Not all memory is compatible with TDX.  Refuse to
+	 * online any incompatible memory.
+	 */
+	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
+		NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+	.notifier_call = tdx_memory_notifier,
+};
+
 static int __init tdx_init(void)
 {
 	int err;
@@ -89,6 +149,13 @@ static int __init tdx_init(void)
 		goto no_tdx;
 	}
 
+	err = register_memory_notifier(&tdx_memory_nb);
+	if (err) {
+		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
+				err);
+		goto no_tdx;
+	}
+
 	return 0;
 no_tdx:
 	clear_tdx();
@@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 	return 0;
 }
 
+/*
+ * Add a memory region as a TDX memory block.  The caller must make sure
+ * all memory regions are added in address ascending order and don't
+ * overlap.
+ */
+static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
+			    unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+
+	list_add_tail(&tmb->list, tmb_list);
+	return 0;
+}
+
+static void free_tdx_memlist(struct list_head *tmb_list)
+{
+	while (!list_empty(tmb_list)) {
+		struct tdx_memblock *tmb = list_first_entry(tmb_list,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Ensure that all memblock memory regions are convertible to TDX
+ * memory.  Once this has been established, stash the memblock
+ * ranges off in a secondary structure because memblock is modified
+ * in memory hotplug while TDX memory regions are fixed.
+ */
+static int build_tdx_memlist(struct list_head *tmb_list)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		/*
+		 * The first 1MB is not reported as TDX convertible memory.
+		 * Although the first 1MB is always reserved and won't end up
+		 * in the page allocator, it is still in memblock's memory
+		 * regions.  Skip it manually to exclude it from TDX memory.
+		 */
+		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/*
+		 * Add the memory regions as TDX memory.  The regions in
+		 * memblock are already guaranteed to be in address
+		 * ascending order and don't overlap.
+		 */
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memlist(tmb_list);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -226,10 +364,25 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * The initial support of TDX guests only allocates memory from
+	 * the global page allocator.  To keep things simple, just make
+	 * sure all pages in the page allocator are TDX memory.
+	 *
+	 * Build the list of "TDX-usable" memory regions which cover all
+	 * pages in the page allocator to guarantee that.  Do it while
+	 * holding mem_hotplug_lock read-lock as the memory hotplug code
+	 * path reads the @tdx_memlist to reject any new memory.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memlist(&tdx_memlist);
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of TDMRs to cover all TDX-usable memory
 	 *    regions.
 	 *  - Pick up one TDX private KeyID as the global KeyID.
@@ -241,6 +394,11 @@ static int init_tdx_module(void)
 	 */
 	ret = -EINVAL;
 out:
+	/*
+	 * @tdx_memlist is written here and read at memory hotplug time.
+	 * Lock out memory hotplug code while building it.
+	 */
+	put_online_mems();
 	return ret;
 }
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (6 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 19:24   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 09/16] x86/virt/tdx: Fill out " Kai Huang
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

After the kernel selects all TDX-usable memory regions, the kernel needs
to pass those regions to the TDX module via data structure "TD Memory
Region" (TDMR).

Add a placeholder to construct a list of TDMRs (in multiple steps) to
cover all TDX-usable memory regions.

=== Long Version ===

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory.  This metadata essentially
serves as the 'struct page' for the TDX module.  The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be represented.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.
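
To make the 1/256th figure concrete (a back-of-the-envelope sketch; the
actual PAMT entry size is reported by the TDX module in TDSYSINFO_STRUCT,
so the 16 bytes here is only an assumption):

  4K PAMT:  256GB / 4KB * 16B = 1GB
  2M PAMT:  256GB / 2MB * 16B = 2MB
  1G PAMT:  256GB / 1GB * 16B = 4KB

The 4K-page PAMT dominates, so the total stays roughly 1/256th of the TDMR.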

As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing a list of TDMRs to the TDX module.

Constructing the list of TDMRs consists of the following steps:

1) Fill out TDMRs to cover all memory regions that the TDX module will
   use for TD memory.
2) Allocate and set up PAMT for each TDMR.
3) Designate reserved areas for each TDMR.

Add a placeholder to construct TDMRs to do the above steps.  Always free
the space for TDMRs at the end of module initialization (whether it
succeeds or not), as TDMRs are only used during the initialization.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Improved changelog to tell this is one step of "TODO list" in
   init_tdx_module().
 - Other changelog improvements suggested by Dave (e.g. changed "Create
   TDMRs" to "Fill out TDMRs" to align with the code).
 - Added a "TODO list" comment to lay out the steps to construct TDMRs,
   following the same idea of "TODO list" in tdx_module_init().
 - Introduced 'struct tdmr_info_list' (Dave)
 - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to
   simplify getting TDMR by given index, and reduce passing arguments
   around functions.
 - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally
   uses tdmr_size_single() (Dave).
 - tdmr_num -> nr_tdmrs (Dave).

v6 -> v7:
 - Improved commit message to explain 'int' overflow cannot happen
   in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave.

v5 -> v6:
 - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is
   used instead of memblock.
 - Added Isaku's Reviewed-by.

- v3 -> v5 (no feedback on v4):
 - Moved calculating TDMR size to this patch.
 - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs
   once, instead of allocating each TDMR individually.
 - Removed "crypto protection" in the changelog.
 - -EFAULT -> -EINVAL in a couple of places.

---
 arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  23 ++++++++
 2 files changed, 125 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f010402f443d..d36ac72ef299 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -20,6 +20,7 @@
 #include <linux/minmax.h>
 #include <linux/sizes.h>
 #include <linux/pfn.h>
+#include <linux/align.h>
 #include <asm/pgtable_types.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -347,6 +348,86 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 	return ret;
 }
 
+struct tdmr_info_list {
+	struct tdmr_info *first_tdmr;
+	int tdmr_sz;
+	int max_tdmrs;
+	int nr_tdmrs;	/* Actual number of TDMRs */
+};
+
+/* Calculate the actual TDMR size */
+static int tdmr_size_single(u16 max_reserved_per_tdmr)
+{
+	int tdmr_sz;
+
+	/*
+	 * The actual size of TDMR depends on the maximum
+	 * number of reserved areas.
+	 */
+	tdmr_sz = sizeof(struct tdmr_info);
+	tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;
+
+	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+}
+
+static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
+			   struct tdsysinfo_struct *sysinfo)
+{
+	size_t tdmr_sz, tdmr_array_sz;
+	void *tdmr_array;
+
+	tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr);
+	tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs;
+
+	/*
+	 * To keep things simple, allocate all TDMRs together.
+	 * The buffer needs to be physically contiguous to make
+	 * sure each TDMR is physically contiguous.
+	 */
+	tdmr_array = alloc_pages_exact(tdmr_array_sz,
+			GFP_KERNEL | __GFP_ZERO);
+	if (!tdmr_array)
+		return -ENOMEM;
+
+	tdmr_list->first_tdmr = tdmr_array;
+	/*
+	 * Keep the size of TDMR to find the target TDMR
+	 * at a given index in the TDMR list.
+	 */
+	tdmr_list->tdmr_sz = tdmr_sz;
+	tdmr_list->max_tdmrs = sysinfo->max_tdmrs;
+	tdmr_list->nr_tdmrs = 0;
+
+	return 0;
+}
+
+static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
+{
+	free_pages_exact(tdmr_list->first_tdmr,
+			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
+}
+
+/*
+ * Construct a list of TDMRs on the preallocated space in @tdmr_list
+ * to cover all TDX memory regions in @tmb_list based on the TDX module
+ * information in @sysinfo.
+ */
+static int construct_tdmrs(struct list_head *tmb_list,
+			   struct tdmr_info_list *tdmr_list,
+			   struct tdsysinfo_struct *sysinfo)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Fill out TDMRs to cover all TDX memory regions.
+	 *  - Allocate and set up PAMTs for each TDMR.
+	 *  - Designate reserved areas for each TDMR.
+	 *
+	 * Return -EINVAL until constructing TDMRs is done
+	 */
+	return -EINVAL;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -358,6 +439,7 @@ static int init_tdx_module(void)
 			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
 	struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
 	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
+	struct tdmr_info_list tdmr_list;
 	int ret;
 
 	ret = tdx_get_sysinfo(sysinfo, cmr_array);
@@ -380,11 +462,19 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Allocate enough space for constructing TDMRs */
+	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
+	if (ret)
+		goto out_free_tdx_mem;
+
+	/* Cover all TDX-usable memory regions in TDMRs */
+	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
+	if (ret)
+		goto out_free_tdmrs;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Construct a list of TDMRs to cover all TDX-usable memory
-	 *    regions.
 	 *  - Pick up one TDX private KeyID as the global KeyID.
 	 *  - Configure the TDMRs and the global KeyID to the TDX module.
 	 *  - Configure the global KeyID on all packages.
@@ -393,6 +483,16 @@ static int init_tdx_module(void)
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_free_tdmrs:
+	/*
+	 * Free the space for the TDMRs no matter the initialization is
+	 * successful or not.  They are not needed anymore after the
+	 * module initialization.
+	 */
+	free_tdmr_list(&tdmr_list);
+out_free_tdx_mem:
+	if (ret)
+		free_tdx_memlist(&tdx_memlist);
 out:
 	/*
 	 * @tdx_memlist is written here and read at memory hotplug time.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 6d32f62e4182..d0c762f1a94c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -90,6 +90,29 @@ struct tdsysinfo_struct {
 	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
 } __packed;
 
+struct tdmr_reserved_area {
+	u64 offset;
+	u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT	512
+
+struct tdmr_info {
+	u64 base;
+	u64 size;
+	u64 pamt_1g_base;
+	u64 pamt_1g_size;
+	u64 pamt_2m_base;
+	u64 pamt_2m_size;
+	u64 pamt_4k_base;
+	u64 pamt_4k_size;
+	/*
+	 * Actual number of reserved areas depends on
+	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+	 */
+	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 09/16] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (7 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 19:36   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

Start to transition through the "multi-steps" of constructing a list of
"TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions.

The kernel configures TDX-usable memory regions by passing a list of
TDMRs "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
the information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.

Do the first step to fill out a number of TDMRs to cover all TDX memory
regions.  To keep it simple, always try to use one TDMR for each memory
region.  As the first step, only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and the size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans a 1GB boundary and the first part is already
covered by the previous TDMR, just use a new TDMR for the remaining
part.
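
A hypothetical example (addresses made up purely for illustration): with
memory regions [1MB, 2.5GB) and [2.5GB, 6GB), the first region gets TDMR
[0GB, 3GB).  The second region's aligned range would be [2GB, 6GB), but
[2GB, 3GB) is already covered by the first TDMR, so a second TDMR
[3GB, 6GB) covers the rest.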

TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there are more memory regions to cover.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - Add a sentence to changelog stating this is the first patch to transition
   through the "multi-steps" of constructing TDMRs.
 - Added a comment to explain "why one TDMR for each memory region" is OK
   for now.
 - Trimmed down/removed unnecessary comments.
 - Removed tdmr_start() but use tdmr->base directly
 - create_tdmrs() -> fill_out_tdmrs()
 - Other changes due to introducing 'struct tdmr_info_list'.

v6 -> v7:
 - No change.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.

- v3 -> v5 (no feedback on v4):
 - Removed allocating TDMR individually.
 - Improved changelog by using Dave's words.
 - Made TDMR_START() and TDMR_END() as static inline function.

---
 arch/x86/virt/vmx/tdx/tdx.c | 95 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d36ac72ef299..5b1de0200c6b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -407,6 +407,90 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
 			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
 }
 
+/* Get the TDMR from the list at the given index. */
+static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
+				    int idx)
+{
+	return (struct tdmr_info *)((unsigned long)tdmr_list->first_tdmr +
+			tdmr_list->tdmr_sz * idx);
+}
+
+#define TDMR_ALIGNMENT		BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
+#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
+
+static inline u64 tdmr_end(struct tdmr_info *tdmr)
+{
+	return tdmr->base + tdmr->size;
+}
+
+/*
+ * Take the memory referenced in @tmb_list and populate the
+ * preallocated @tdmr_list, following all the special alignment
+ * and size rules for TDMR.
+ */
+static int fill_out_tdmrs(struct list_head *tmb_list,
+			  struct tdmr_info_list *tdmr_list)
+{
+	struct tdx_memblock *tmb;
+	int tdmr_idx = 0;
+
+	/*
+	 * Loop over TDX memory regions and fill out TDMRs to cover them.
+	 * To keep it simple, always try to use one TDMR to cover one
+	 * memory region.
+	 *
+	 * In practice TDX1.0 supports 64 TDMRs, which is large enough to
+	 * cover all memory regions in reality, unless the admin uses
+	 * 'memmap' to create many discrete memory regions.  If that ever
+	 * becomes a real problem, the code can be enhanced to merge
+	 * adjacent TDMRs to reduce their final number.
+	 */
+	list_for_each_entry(tmb, tmb_list, list) {
+		struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+		u64 start, end;
+
+		start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
+		end   = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
+
+		/*
+		 * A valid size indicates the current TDMR has already
+		 * been filled out to cover the previous memory region(s).
+		 */
+		if (tdmr->size) {
+			/*
+			 * Loop to the next if the current memory region
+			 * has already been fully covered.
+			 */
+			if (end <= tdmr_end(tdmr))
+				continue;
+
+			/* Otherwise, skip the already covered part. */
+			if (start < tdmr_end(tdmr))
+				start = tdmr_end(tdmr);
+
+			/*
+			 * Create a new TDMR to cover the current memory
+			 * region, or the remaining part of it.
+			 */
+			tdmr_idx++;
+			if (tdmr_idx >= tdmr_list->max_tdmrs)
+				return -E2BIG;
+
+			tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+		}
+
+		tdmr->base = start;
+		tdmr->size = end - start;
+	}
+
+	/* @tdmr_idx is always the index of last valid TDMR. */
+	tdmr_list->nr_tdmrs = tdmr_idx + 1;
+
+	return 0;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -416,16 +500,23 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
 			   struct tdsysinfo_struct *sysinfo)
 {
+	int ret;
+
+	ret = fill_out_tdmrs(tmb_list, tdmr_list);
+	if (ret)
+		goto err;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Fill out TDMRs to cover all TDX memory regions.
 	 *  - Allocate and set up PAMTs for each TDMR.
 	 *  - Designate reserved areas for each TDMR.
 	 *
 	 * Return -EINVAL until constructing TDMRs is done
 	 */
-	return -EINVAL;
+	ret = -EINVAL;
+err:
+	return ret;
 }
 
 static int init_tdx_module(void)
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (8 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 09/16] x86/virt/tdx: Fill out " Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 21:53   ` Dave Hansen
  2023-01-07  0:47   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated early, but the only proper fix is to add a
boot option to allocate or reserve PAMTs during kernel boot.

It is imperfect but will be improved on later.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.  This helps answer the eternal "where did
all my memory go?" questions.
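
As a rough, hypothetical sense of scale (assuming a 16-byte PAMT entry
size, which in reality comes from TDSYSINFO_STRUCT): a host with 1TB of
TDX-usable memory would see roughly 4GB (~1M 4K pages) reported in this
message, dominated by the 4K-page PAMTs.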

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - Changelog:
  - Added a sentence to state PAMT allocation will be improved.
  - Others suggested by Dave.
 - Moved 'nid' of 'struct tdx_memblock' to this patch.
 - Improved comments around tdmr_get_nid().
 - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Changes due to using macros instead of 'enum' for TDX supported page
   sizes.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
 - Improved comment around tdmr_get_nid() (Dave).
 - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
   into PAMTs for 4K/2M/1G (Dave).
 - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).   

- v3 -> v5 (no feedback on v4):
 - Used memblock to get the NUMA node for given TDMR.
 - Removed tdmr_get_pamt_sz() helper but use open-code instead.
 - Changed to use 'switch .. case..' for each TDX supported page size in
   tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
 - Added printing out memory used for PAMT allocation when TDX module is
   initialized successfully.
 - Explained downside of alloc_contig_pages() in changelog.
 - Addressed other minor comments.

---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++-
 2 files changed, 211 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b36129183035..b86a333b860f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1960,6 +1960,7 @@ config INTEL_TDX_HOST
 	depends on KVM_INTEL
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
+	depends on CONTIG_ALLOC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5b1de0200c6b..cf970a783f1f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -37,6 +37,7 @@ struct tdx_memblock {
 	struct list_head list;
 	unsigned long start_pfn;
 	unsigned long end_pfn;
+	int nid;
 };
 
 static u32 tdx_keyid_start __ro_after_init;
@@ -283,7 +284,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
  * overlap.
  */
 static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
-			    unsigned long end_pfn)
+			    unsigned long end_pfn, int nid)
 {
 	struct tdx_memblock *tmb;
 
@@ -294,6 +295,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
 	INIT_LIST_HEAD(&tmb->list);
 	tmb->start_pfn = start_pfn;
 	tmb->end_pfn = end_pfn;
+	tmb->nid = nid;
 
 	list_add_tail(&tmb->list, tmb_list);
 	return 0;
@@ -319,9 +321,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
 static int build_tdx_memlist(struct list_head *tmb_list)
 {
 	unsigned long start_pfn, end_pfn;
-	int i, ret;
+	int i, nid, ret;
 
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
 		/*
 		 * The first 1MB is not reported as TDX convertible memory.
 		 * Although the first 1MB is always reserved and won't end up
@@ -337,7 +339,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 		 * memblock has already guaranteed they are in address
 		 * ascending order and don't overlap.
 		 */
-		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
 		if (ret)
 			goto err;
 	}
@@ -491,6 +493,200 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
 	return 0;
 }
 
+/*
+ * Calculate PAMT size given a TDMR and a page size.  The returned
+ * PAMT size is always aligned up to 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
+				      u16 pamt_entry_size)
+{
+	unsigned long pamt_sz, nr_pamt_entries;
+
+	switch (pgsz) {
+	case TDX_PS_4K:
+		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+		break;
+	case TDX_PS_2M:
+		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
+		break;
+	case TDX_PS_1G:
+		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	pamt_sz = nr_pamt_entries * pamt_entry_size;
+	/* TDX requires PAMT size must be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/*
+ * Locate a NUMA node which should hold the allocation of the @tdmr
+ * PAMT.  This node will have some memory covered by the TDMR.  The
+ * relative amount of memory covered is not considered.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list)
+{
+	struct tdx_memblock *tmb;
+
+	/*
+	 * A TDMR must cover at least part of one TMB.  That TMB will end
+	 * after the TDMR begins.  But, that TMB may have started before
+	 * the TDMR.  Find the next 'tmb' that _ends_ after this TDMR
+	 * begins.  Ignore 'tmb' start addresses.  They are irrelevant.
+	 */
+	list_for_each_entry(tmb, tmb_list, list) {
+		if (tmb->end_pfn > PHYS_PFN(tdmr->base))
+			return tmb->nid;
+	}
+
+	/*
+	 * Fall back to allocating the TDMR's metadata from node 0 when
+	 * no TDX memory block can be found.  This should never happen
+	 * since TDMRs originate from TDX memory blocks.
+	 */
+	pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n",
+			tdmr->base, tdmr_end(tdmr));
+	return 0;
+}
+
+/*
+ * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
+ * within @tdmr, and set up PAMTs for @tdmr.
+ */
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
+			    struct list_head *tmb_list,
+			    u16 pamt_entry_size)
+{
+	unsigned long pamt_base[TDX_PS_1G + 1];
+	unsigned long pamt_size[TDX_PS_1G + 1];
+	unsigned long tdmr_pamt_base;
+	unsigned long tdmr_pamt_size;
+	struct page *pamt;
+	int pgsz, nid;
+
+	nid = tdmr_get_nid(tdmr, tmb_list);
+
+	/*
+	 * Calculate the PAMT size for each TDX supported page size
+	 * and the total PAMT size.
+	 */
+	tdmr_pamt_size = 0;
+	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
+		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
+				pamt_entry_size);
+		tdmr_pamt_size += pamt_size[pgsz];
+	}
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
+			nid, &node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/*
+	 * Break the contiguous allocation back up into the
+	 * individual PAMTs for each page size.
+	 */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) {
+		pamt_base[pgsz] = tdmr_pamt_base;
+		tdmr_pamt_base += pamt_size[pgsz];
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
+	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
+	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+
+	return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
+			  unsigned long *pamt_npages)
+{
+	unsigned long pamt_base, pamt_sz;
+
+	/*
+	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
+	 * should always point to the beginning of that allocation.
+	 */
+	pamt_base = tdmr->pamt_4k_base;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	*pamt_pfn = PHYS_PFN(pamt_base);
+	*pamt_npages = pamt_sz >> PAGE_SHIFT;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_pfn, pamt_npages;
+
+	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_npages)
+		return;
+
+	if (WARN_ON_ONCE(!pamt_pfn))
+		return;
+
+	free_contig_range(pamt_pfn, pamt_npages);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++)
+		tdmr_free_pamt(tdmr_entry(tdmr_list, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
+				 struct list_head *tmb_list,
+				 u16 pamt_entry_size)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
+		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
+				pamt_entry_size);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_list);
+	return ret;
+}
+
+static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
+{
+	unsigned long pamt_npages = 0;
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
+		unsigned long pfn, npages;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages);
+		pamt_npages += npages;
+	}
+
+	return pamt_npages;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -506,15 +702,19 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	if (ret)
 		goto err;
 
+	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
+			sysinfo->pamt_entry_size);
+	if (ret)
+		goto err;
 	/*
 	 * TODO:
 	 *
-	 *  - Allocate and set up PAMTs for each TDMR.
 	 *  - Designate reserved areas for each TDMR.
 	 *
 	 * Return -EINVAL until constructing TDMRs is done
 	 */
 	ret = -EINVAL;
+	tdmrs_free_pamt_all(tdmr_list);
 err:
 	return ret;
 }
@@ -574,6 +774,11 @@ static int init_tdx_module(void)
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+	if (ret)
+		tdmrs_free_pamt_all(&tdmr_list);
+	else
+		pr_info("%lu pages allocated for PAMT.\n",
+				tdmrs_count_pamt_pages(&tdmr_list));
 out_free_tdmrs:
 	/*
 	 * Free the space for the TDMRs no matter the initialization is
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (9 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 22:07   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module Kai Huang
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

As the last step of constructing TDMRs, populate reserved areas for all
TDMRs.  For each TDMR, put all memory holes within this TDMR into its
reserved areas.  And for all PAMTs which overlap with this TDMR, put
the overlapping parts into reserved areas too.
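
A hypothetical example (numbers made up for illustration): a TDMR
covering [0GB, 4GB) with a memory hole at [2GB, 2.5GB) and a PAMT
allocated at [3GB, 3GB+16MB) ends up with two reserved areas, one for
the hole and one for the overlapping part of the PAMT, sorted in
address ascending order as TDX requires.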

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - "set_up" -> "populate" in function name change (Dave).
 - Improved comment suggested by Dave.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - No change.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - Split tdmr_set_up_rsvd_areas() into two functions to handle memory
   hole and PAMT respectively.
 - Added Isaku's Reviewed-by.

---
 arch/x86/virt/vmx/tdx/tdx.c | 213 ++++++++++++++++++++++++++++++++++--
 1 file changed, 205 insertions(+), 8 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cf970a783f1f..620b35e2a61b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -21,6 +21,7 @@
 #include <linux/sizes.h>
 #include <linux/pfn.h>
 #include <linux/align.h>
+#include <linux/sort.h>
 #include <asm/pgtable_types.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -687,6 +688,202 @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
 	return pamt_npages;
 }
 
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
+			      u64 size, u16 max_reserved_per_tdmr)
+{
+	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+	int idx = *p_idx;
+
+	/* Reserved area must be 4K aligned in offset and size */
+	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+		return -EINVAL;
+
+	if (idx >= max_reserved_per_tdmr)
+		return -E2BIG;
+
+	rsvd_areas[idx].offset = addr - tdmr->base;
+	rsvd_areas[idx].size = size;
+
+	*p_idx = idx + 1;
+
+	return 0;
+}
+
+/*
+ * Go through @tmb_list to find holes between memory areas.  If any of
+ * those holes fall within @tdmr, set up a TDMR reserved area to cover
+ * the hole.
+ */
+static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
+				    struct tdmr_info *tdmr,
+				    int *rsvd_idx,
+				    u16 max_reserved_per_tdmr)
+{
+	struct tdx_memblock *tmb;
+	u64 prev_end;
+	int ret;
+
+	/*
+	 * Start looking for reserved blocks at the
+	 * beginning of the TDMR.
+	 */
+	prev_end = tdmr->base;
+	list_for_each_entry(tmb, tmb_list, list) {
+		u64 start, end;
+
+		start = PFN_PHYS(tmb->start_pfn);
+		end   = PFN_PHYS(tmb->end_pfn);
+
+		/* Break if this region is after the TDMR */
+		if (start >= tdmr_end(tdmr))
+			break;
+
+		/* Exclude regions before this TDMR */
+		if (end < tdmr->base)
+			continue;
+
+		/*
+		 * Skip over memory areas that
+		 * have already been dealt with.
+		 */
+		if (start <= prev_end) {
+			prev_end = end;
+			continue;
+		}
+
+		/* Add the hole before this region */
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+				start - prev_end,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+
+		prev_end = end;
+	}
+
+	/* Add the hole after the last region if it exists. */
+	if (prev_end < tdmr_end(tdmr)) {
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+				tdmr_end(tdmr) - prev_end,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Go through @tdmr_list to find all PAMTs.  If any of those PAMTs
+ * overlaps with @tdmr, set up a TDMR reserved area to cover the
+ * overlapping part.
+ */
+static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list,
+				    struct tdmr_info *tdmr,
+				    int *rsvd_idx,
+				    u16 max_reserved_per_tdmr)
+{
+	int i, ret;
+
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
+		struct tdmr_info *tmp = tdmr_entry(tdmr_list, i);
+		unsigned long pamt_start_pfn, pamt_npages;
+		u64 pamt_start, pamt_end;
+
+		tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages);
+		/* Each TDMR must already have PAMT allocated */
+		WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn);
+
+		pamt_start = PFN_PHYS(pamt_start_pfn);
+		pamt_end   = PFN_PHYS(pamt_start_pfn + pamt_npages);
+
+		/* Skip PAMTs outside of the given TDMR */
+		if ((pamt_end <= tdmr->base) ||
+				(pamt_start >= tdmr_end(tdmr)))
+			continue;
+
+		/* Only mark the part within the TDMR as reserved */
+		if (pamt_start < tdmr->base)
+			pamt_start = tdmr->base;
+		if (pamt_end > tdmr_end(tdmr))
+			pamt_end = tdmr_end(tdmr);
+
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start,
+				pamt_end - pamt_start,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+	if (r1->offset + r1->size <= r2->offset)
+		return -1;
+	if (r1->offset >= r2->offset + r2->size)
+		return 1;
+
+	/* Reserved areas cannot overlap.  The caller must guarantee that. */
+	WARN_ON_ONCE(1);
+	return -1;
+}
+
+/*
+ * Populate reserved areas for the given @tdmr, including memory holes
+ * (via @tmb_list) and PAMTs (via @tdmr_list).
+ */
+static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
+				    struct list_head *tmb_list,
+				    struct tdmr_info_list *tdmr_list,
+				    u16 max_reserved_per_tdmr)
+{
+	int ret, rsvd_idx = 0;
+
+	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
+			max_reserved_per_tdmr);
+	if (ret)
+		return ret;
+
+	ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx,
+			max_reserved_per_tdmr);
+	if (ret)
+		return ret;
+
+	/* TDX requires reserved areas listed in address ascending order */
+	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+			rsvd_area_cmp_func, NULL);
+
+	return 0;
+}
+
+/*
+ * Populate reserved areas for all TDMRs in @tdmr_list, including memory
+ * holes (via @tmb_list) and PAMTs.
+ */
+static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
+					 struct list_head *tmb_list,
+					 u16 max_reserved_per_tdmr)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
+		int ret;
+
+		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
+				tmb_list, tdmr_list, max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -706,14 +903,14 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			sysinfo->pamt_entry_size);
 	if (ret)
 		goto err;
-	/*
-	 * TODO:
-	 *
-	 *  - Designate reserved areas for each TDMR.
-	 *
-	 * Return -EINVAL until constructing TDMRs is done
-	 */
-	ret = -EINVAL;
+
+	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
+			sysinfo->max_reserved_per_tdmr);
+	if (ret)
+		goto err_free_pamts;
+
+	return 0;
+err_free_pamts:
 	tdmrs_free_pamt_all(tdmr_list);
 err:
 	return ret;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (10 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 22:21   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

After a list of "TD Memory Regions" (TDMRs) has been constructed to
cover all TDX-usable memory regions, the next step is to pick up a TDX
private KeyID as the "global KeyID" (which protects, e.g., the TDX
module's metadata), and configure it to the TDX module along with the
TDMRs.

To keep things simple, just use the first TDX KeyID as the global KeyID.
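
The TDMR physical addresses are handed to TDH.SYS.CONFIG as an array
which, per this patch, must be aligned to TDMR_INFO_PA_ARRAY_ALIGNMENT
(512 bytes).  kzalloc() does not guarantee such alignment for arbitrary
sizes, so the patch over-allocates and aligns the returned pointer.  A
minimal, generic sketch of that idiom (names here are illustrative only;
the actual code is in the diff below):

	void *raw = kzalloc(len + align - 1, GFP_KERNEL);
	u64 *array = PTR_ALIGN(raw, align);
	/* fill and use 'array' */
	kfree(raw);	/* free the original pointer, not the aligned one */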

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Merged "Reserve TDX module global KeyID" patch to this patch, and
   removed 'tdx_global_keyid' but use 'tdx_keyid_start' directly.
 - Changed changelog accordingly.
 - Changed how to allocate aligned array (Dave).

---
 arch/x86/virt/vmx/tdx/tdx.c | 41 +++++++++++++++++++++++++++++++++++--
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 620b35e2a61b..ab961443fed5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -916,6 +916,36 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	return ret;
 }
 
+static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
+{
+	u64 *tdmr_pa_array, *p;
+	size_t array_sz;
+	int i, ret;
+
+	/*
+	 * TDMRs are passed to the TDX module via an array of physical
+	 * addresses of each TDMR.  The array itself has alignment
+	 * requirement.
+	 */
+	array_sz = tdmr_list->nr_tdmrs * sizeof(u64) +
+		TDMR_INFO_PA_ARRAY_ALIGNMENT - 1;
+	p = kzalloc(array_sz, GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	tdmr_pa_array = PTR_ALIGN(p, TDMR_INFO_PA_ARRAY_ALIGNMENT);
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++)
+		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
+
+	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_list->nr_tdmrs,
+				global_keyid, 0, NULL, NULL);
+
+	/* Free the array as it is not required anymore. */
+	kfree(p);
+
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -960,17 +990,24 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_tdmrs;
 
+	/*
+	 * Use the first private KeyID as the global KeyID, and pass
+	 * it along with the TDMRs to the TDX module.
+	 */
+	ret = config_tdx_module(&tdmr_list, tdx_keyid_start);
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Pick up one TDX private KeyID as the global KeyID.
-	 *  - Configure the TDMRs and the global KeyID to the TDX module.
 	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
 	 *
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_free_pamts:
 	if (ret)
 		tdmrs_free_pamt_all(&tdmr_list);
 	else
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d0c762f1a94c..4d2edd477480 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -20,6 +20,7 @@
  * TDX module SEAMCALL leaf functions
  */
 #define TDH_SYS_INFO		32
+#define TDH_SYS_CONFIG		45
 
 struct cmr_info {
 	u64	base;
@@ -96,6 +97,7 @@ struct tdmr_reserved_area {
 } __packed;
 
 #define TDMR_INFO_ALIGNMENT	512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
 
 struct tdmr_info {
 	u64 base;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (11 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-06 22:49   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

After the list of TDMRs and the global KeyID are configured to the TDX
module, the kernel needs to configure the key of the global KeyID on all
packages using TDH.SYS.KEY.CONFIG.

TDH.SYS.KEY.CONFIG needs to be done on one (any) cpu for each package.
Also, it cannot run concurrently on different cpus, so just use
smp_call_function_single() to do it one by one.

Note, to keep things simple, neither the function that configures the
global KeyID on all packages nor tdx_enable() checks whether there's at
least one online cpu for each package.  Also, neither of them explicitly
prevents any cpu from going offline.  It is the caller's responsibility
to guarantee this.
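
For illustration only (not part of this patch): one way a caller could
satisfy both requirements, assuming it has already verified that every
package has at least one online cpu, is to hold the CPU hotplug read
lock around the call:

	/* Hypothetical caller-side sketch; the real KVM code may differ. */
	cpus_read_lock();
	ret = tdx_enable();
	cpus_read_unlock();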

Intel hardware doesn't guarantee cache coherency across different
KeyIDs.  The kernel needs to flush PAMT's dirty cachelines (associated
with KeyID 0) before the TDX module uses the global KeyID to access the
PAMT.  Otherwise, those dirty cachelines can silently corrupt the TDX
module's metadata.  Note this breaks TDX from a functionality point of
view but TDX's security remains intact.

Following the TDX module specification, flush cache before configuring
the global KeyID on all packages.  Given the PAMT size can be large
(~1/256th of system RAM), just use WBINVD on all CPUs to flush.

Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may have already
used the global KeyID to write to the PAMTs.  Therefore, use WBINVD to
flush cache before freeing the PAMTs back to the kernel.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - Changelog changes:
  - Point out this is the step of "multi-steps" of init_tdx_module().
  - Removed MOVDIR64B part.
  - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT.
 - Changed to loop over online cpus and use smp_call_function_single()
   directly as the patch to shut down TDX module has been removed.
 - Removed MOVDIR64B part in comment.

v6 -> v7:
 - Improved changelog and comment to explain why MOVDIR64B isn't used
   when returning PAMTs back to the kernel.

---
 arch/x86/virt/vmx/tdx/tdx.c | 97 +++++++++++++++++++++++++++++++++++--
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ab961443fed5..4c779e8412f1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -946,6 +946,66 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
 	return ret;
 }
 
+static void do_global_key_config(void *data)
+{
+	int ret;
+
+	/*
+	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a
+	 * recoverable error).  Assume this is exceedingly rare and
+	 * just return error if encountered instead of retrying.
+	 */
+	ret = seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
+
+	*(int *)data = ret;
+}
+
+/*
+ * Configure the global KeyID on all packages by doing TDH.SYS.KEY.CONFIG
+ * on one online cpu for each package.  If any package doesn't have any
+ * online cpu, that package is simply skipped (see the Note below).
+ *
+ * Note:
+ *
+ * This function neither checks whether there's at least one online cpu
+ * for each package, nor explicitly prevents any cpu from going offline.
+ * If any package doesn't have any online cpu then the SEAMCALL won't be
+ * done on that package and the later step of TDX module initialization
+ * will fail.  The caller needs to guarantee this.
+ */
+static int config_global_keyid(void)
+{
+	cpumask_var_t packages;
+	int cpu, ret = 0;
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		int err;
+
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+					packages))
+			continue;
+
+		/*
+		 * TDH.SYS.KEY.CONFIG cannot run concurrently on
+		 * different cpus, so just do it one by one.
+		 */
+		ret = smp_call_function_single(cpu, do_global_key_config, &err,
+				true);
+		if (ret)
+			break;
+		if (err) {
+			ret = err;
+			break;
+		}
+	}
+
+	free_cpumask_var(packages);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -998,19 +1058,46 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * Hardware doesn't guarantee cache coherency across different
+	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
+	 * (associated with KeyID 0) before the TDX module can use the
+	 * global KeyID to access the PAMT.  Given PAMTs are potentially
+	 * large (~1/256th of system RAM), just use WBINVD on all cpus
+	 * to flush the cache.
+	 *
+	 * Follow the TDX spec to flush cache before configuring the
+	 * global KeyID on all packages.
+	 */
+	wbinvd_on_all_cpus();
+
+	/* Config the key of global KeyID on all packages */
+	ret = config_global_keyid();
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
 	 *
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
 out_free_pamts:
-	if (ret)
+	if (ret) {
+		/*
+		 * Part of PAMT may already have been initialized by the
+		 * TDX module.  Flush cache before returning PAMT back
+		 * to the kernel.
+		 *
+		 * No need to worry about integrity checks here.  KeyID
+		 * 0 has integrity checking disabled.
+		 */
+		wbinvd_on_all_cpus();
+
 		tdmrs_free_pamt_all(&tdmr_list);
-	else
+	} else
 		pr_info("%lu pages allocated for PAMT.\n",
 				tdmrs_count_pamt_pages(&tdmr_list));
 out_free_tdmrs:
@@ -1057,7 +1144,9 @@ static int __tdx_enable(void)
  * tdx_enable - Enable TDX by initializing the TDX module
  *
  * The caller must make sure all online cpus are in VMX operation before
- * calling this function.
+ * calling this function.  Also, the caller must make sure there is at
+ * least one online cpu for each package, and to prevent any cpu from
+ * going offline during this function.
  *
  * This function can be called in parallel by multiple callers.
  *
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 4d2edd477480..f5c12a2543d4 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -19,6 +19,7 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
 #define TDH_SYS_CONFIG		45
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (12 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-07  0:17   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
  2022-12-09  6:52 ` [PATCH v8 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

After the global KeyID has been configured on all packages, initialize
all TDMRs to make all TDX-usable memory regions that were passed to the
TDX module actually usable.

This is the last step of initializing the TDX module.

Initializing different TDMRs can be parallelized.  For now to keep it
simple, just initialize all TDMRs one by one.  It can be enhanced in the
future.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8: (Dave)
 - Changelog:
   - explicitly call out this is the last step of TDX module initialization.
   - Trimmed down changelog by removing SEAMCALL name and details.
 - Removed/trimmed down unnecessary comments.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Removed need_resched() check. -- Andi.

---
 arch/x86/virt/vmx/tdx/tdx.c | 61 ++++++++++++++++++++++++++++++++-----
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4c779e8412f1..8b7314f19df2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1006,6 +1006,55 @@ static int config_global_keyid(void)
 	return ret;
 }
 
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+	u64 next;
+
+	/*
+	 * Initializing a TDMR can be time consuming.  To avoid long
+	 * SEAMCALLs, the TDX module may only initialize a part of the
+	 * TDMR in each call.
+	 */
+	do {
+		struct tdx_module_output out;
+		int ret;
+
+		/* All 0's are unused parameters, they mean nothing. */
+		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
+				&out);
+		if (ret)
+			return ret;
+		/*
+		 * RDX contains 'next-to-initialize' address if
+		 * TDH.SYS.TDMR.INIT succeeded.
+		 */
+		next = out.rdx;
+		cond_resched();
+		/* Keep making SEAMCALLs until the TDMR is done */
+	} while (next < tdmr->base + tdmr->size);
+
+	return 0;
+}
+
+static int init_tdmrs(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	/*
+	 * This operation is costly.  It can be parallelized,
+	 * but keep it simple for now.
+	 */
+	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
+		int ret;
+
+		ret = init_tdmr(tdmr_entry(tdmr_list, i));
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -1076,14 +1125,10 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
-	/*
-	 * TODO:
-	 *
-	 *  - Initialize all TDMRs.
-	 *
-	 *  Return error before all steps are done.
-	 */
-	ret = -EINVAL;
+	/* Initialize TDMRs to complete the TDX module initialization */
+	ret = init_tdmrs(&tdmr_list);
+	if (ret)
+		goto out_free_pamts;
 out_free_pamts:
 	if (ret) {
 		/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f5c12a2543d4..163c4876dee4 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -21,6 +21,7 @@
  */
 #define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
+#define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_CONFIG		45
 
 struct cmr_info {
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (13 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  2023-01-07  0:35   ` Dave Hansen
  2022-12-09  6:52 ` [PATCH v8 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang
  15 siblings, 1 reply; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages (i.e. metadata used by the TDX module, and any
TDX guest memory if kexec() happens when there's any TDX guest alive).
2) There might be dirty cachelines associated with TDX private pages.

Because the hardware doesn't guarantee cache coherency among different
KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
before booting to the new kernel.  Also, reading TDX private page using
any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
Therefore ideally, the kernel should convert all TDX private pages back
to normal before booting to the new kernel.

However, this implementation doesn't convert TDX private pages back to
normal in kexec() because of the following considerations:

1) Neither the kernel nor the TDX module has existing infrastructure to
   track which pages are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
   them (cache flush + using MOVDIR64B to clear the page) in kexec() can
   be time consuming.
3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
   0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
   kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
   page with the new MKTME KeyID (just like TDX does) before using it.

Therefore, this implementation just flushes cache to make sure there are
no stale dirty cachelines associated with any TDX private KeyIDs before
booting to the new kernel, otherwise they may silently corrupt the new
kernel.

Following SME support, use wbinvd() to flush cache in stop_this_cpu().
Theoretically, cache flush is only needed when the TDX module has been
initialized.  However, the TDX module is initialized on demand at
runtime, and it takes a mutex to read the module status.  Just check
whether TDX is enabled by the BIOS instead to decide whether to flush.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Changelog:
   - Removed "leave TDX module open" part due to shut down patch has been
     removed.

v6 -> v7:
 - Improved changelog to explain why TDX private pages are not converted
   back to normal.

---
 arch/x86/kernel/process.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c21b7347a26d..0cc84977dc62 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * Similar to SME, if the TDX module is ever initialized, the
+	 * cachelines associated with any TDX private KeyID must be flushed
+	 * before transitioning to the new kernel.  The TDX module is initialized
+	 * on demand, and it takes a mutex to read its status.  Just check
+	 * whether TDX is enabled by the BIOS instead to decide whether to flush.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v8 16/16] Documentation/x86: Add documentation for TDX host support
  2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
                   ` (14 preceding siblings ...)
  2022-12-09  6:52 ` [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
@ 2022-12-09  6:52 ` Kai Huang
  15 siblings, 0 replies; 84+ messages in thread
From: Kai Huang @ 2022-12-09  6:52 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo, kai.huang

Add documentation for TDX host kernel support.  There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals.  Also reuse it for TDX host kernel support.

Introduce a new level menu "TDX Guest Support" and move existing
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 169 +++++++++++++++++++++++++++++++++++---
 1 file changed, 158 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index dc8d9fd2c3f7..207b91610b36 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,153 @@ encrypting the guest memory. In TDX, a special module running in a special
 mode sits between the host and the guest and manages the guest/host
 separation.
 
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized.  The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot.  Below dmesg shows when TDX is enabled by BIOS::
+
+  [..] tdx: BIOS enabled: private KeyID range: [16, 64).
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect the TDX module.  The kernel detects it
+by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly 1/256th of system RAM as
+'metadata' for the TDX memory.  It also takes additional CPU time to
+initialize that metadata along with the TDX module itself.  Both are
+non-trivial.  The kernel initializes the TDX module at runtime on
+demand.  The caller should call tdx_enable() to initialize the TDX module::
+
+        ret = tdx_enable();
+        if (ret)
+                goto no_tdx;
+        // TDX is ready to use
+
+One step of initializing the TDX module requires at least one online cpu
+for each package.  The caller needs to guarantee this otherwise the
+initialization will fail.
+
+Making SEAMCALL requires the CPU already being in VMX operation (VMXON
+has been done).  For now tdx_enable() doesn't handle VMXON internally,
+but depends on the caller to guarantee that.  So far only KVM calls
+tdx_enable() and KVM already handles VMXON.
+
+Users can consult dmesg to see the presence of the TDX module, and whether
+it has been initialized.
+
+If the TDX module is not loaded, dmesg shows below::
+
+  [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+  [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+  [..] tdx: 65667 pages allocated for PAMT.
+  [..] tdx: TDX module initialized.
+
+If the TDX module fails to initialize, dmesg shows something like
+below::
+
+  [..] tdx: initialization failed ...
+
+TDX Interaction to Other Kernel Components
+------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Region" (CMR) to tell the
+kernel which memory is TDX compatible.  The kernel needs to build a list
+of memory regions (out of CMRs) as "TDX-usable" memory and pass those
+regions to the TDX module.  Once this is done, those "TDX-usable" memory
+regions are fixed during module's lifetime.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory.  Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and in the meantime, refuses to online any non-TDX memory
+via memory hotplug.
+
+This can be enhanced in the future, i.e. by allowing adding non-TDX
+memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Physical Memory Hotplug
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Note TDX assumes convertible memory is always physically present during
+machine's runtime.  A non-buggy BIOS should never support hot-removal of
+any convertible memory.  This implementation doesn't handle ACPI memory
+removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
+physical CPU.  Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note TDX works with CPU logical online/offline, thus the kernel still
+allows offlining a logical CPU and onlining it again.
+
+Kexec()
+~~~~~~~
+
+There are two problems in terms of using kexec() to boot to a new kernel
+when the old kernel has enabled TDX: 1) Part of the memory pages are
+still TDX private pages (i.e. metadata used by the TDX module, and any
+TDX guest memory if kexec() is executed when there are live TDX guests).
+2) There might be dirty cachelines associated with TDX private pages.
+
+Because the hardware doesn't guarantee cache coherency among different
+KeyIDs, the old kernel needs to flush cache (of TDX private pages)
+before booting to the new kernel.  Also, the kernel doesn't convert all
+TDX private pages back to normal because of the following considerations:
+
+1) Neither the kernel nor the TDX module has existing infrastructure to
+   track which pages are TDX private pages.
+2) The number of TDX private pages can be large, and converting all of
+   them (cache flush + using MOVDIR64B to clear the page) can be time
+   consuming.
+3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
+   0 doesn't support integrity-check, so it's OK.
+4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
+   kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
+   page with the new MKTME KeyID (just like TDX does) before using it.
+
+TDX Guest Support
+=================
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
 implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +167,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +177,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +188,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +199,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +220,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +240,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
 value with a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -107,7 +254,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +274,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 An access to private mappings can also cause a #VE.  Since all kernel
 memory is also private memory, the kernel might theoretically need to
@@ -145,7 +292,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +314,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
 which is not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
 mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +336,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However, some kernel users like device
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2022-12-09  6:52 ` [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
@ 2023-01-06 17:09   ` Dave Hansen
  2023-01-08 22:25     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 17:09 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _X86_VIRT_TDX_H
> +#define _X86_VIRT_TDX_H
> +
> +/*
> + * This file contains both macros and data structures defined by the TDX
> + * architecture and Linux defined software data structures and functions.
> + * The two should not be mixed together for better readability.  The
> + * architectural definitions come first.
> + */
> +
> +/* MSR to report KeyID partitioning between MKTME and TDX */
> +#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087

The *VAST* majority of MSR definitions are in msr-index.h.

Why is this one different from the norm?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand
  2022-12-09  6:52 ` [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
@ 2023-01-06 17:14   ` Dave Hansen
  2023-01-08 22:26     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 17:14 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> 
> The TDX module will be initialized in multi-steps defined by the TDX
> module and most of those steps involve a specific SEAMCALL:
> 
>  1) Get the TDX module information and TDX-capable memory regions
>     (TDH.SYS.INFO).
>  2) Build the list of TDX-usable memory regions.
>  3) Construct a list of "TD Memory Regions" (TDMRs) to cover all
>     TDX-usable memory regions.
>  4) Pick up one TDX private KeyID as the global KeyID.
>  5) Configure the TDMRs and the global KeyID to the TDX module
>     (TDH.SYS.CONFIG).
>  6) Configure the global KeyID on all packages (TDH.SYS.KEY.CONFIG).
>  7) Initialize all TDMRs (TDH.SYS.TDMR.INIT).

I don't think you really need this *AND* the "TODO" comments in
init_tdx_module().  Just say:

	Add a placeholder tdx_enable() to initialize the TDX module on
	demand.  The TODO list will be pared down as functionality is
	added.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL
  2022-12-09  6:52 ` [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL Kai Huang
@ 2023-01-06 17:29   ` Dave Hansen
  2023-01-09 10:30     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 17:29 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

The subject here is a bit too specific.  This patch isn't just
"implementing functions".  There are more than just functions here.  The
best subject is probably:

	Add SEAMCALL infrastructure

But that's rather generic by necessity because this patch does several
_different_ logical things:

 * Wrap TDX_MODULE_CALL so it can be used for SEAMCALLs with host=1
 * Add handling to TDX_MODULE_CALL to allow it to handle specifically
   host-side error conditions
 * Add high-level seamcall() function with actual error handling

It's probably also worth noting that the code that allows "host=1" to be
passed to TDX_MODULE_CALL is dead code in mainline right now.  It
probably shouldn't have been merged this way, but oh well.

I don't know that you really _need_ to split this up, but I'm just
pointing out that mashing three different logical things together makes
it hard to write a coherent Subject.  But, I've seen worse.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2022-12-09  6:52 ` [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
@ 2023-01-06 17:46   ` Dave Hansen
  2023-01-09 10:25     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 17:46 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> For now, both the TDX module information and CMRs are only used during
> the module initialization, so declare them as local.  However, they are
> 1024 bytes and 512 bytes respectively.  Putting them to the stack
> exceeds the default "stack frame size" that the kernel assumes as safe,
> and the compiler yields a warning about this.  Add a kernel build flag
> to extend the safe stack size to 4K for tdx.c to silence the warning --
> the initialization function is only called once so it's safe to have a
> 4K stack.

Gah.  This has gone off in a really odd direction.

The fact that this is called once really has nothing to do with how
tolerant of a large stack we should be.  If a function is called once
from a deep call stack, it can't consume a lot of stack space.  If it's
called a billion times from a shallow stack depth, it can use all the
stack it wants.

All I really wanted here was this:

static int init_tdx_module(void)
{
-	struct cmr_info cmr_array[MAX_CMRS] ...;
+	static struct cmr_info cmr_array[MAX_CMRS] ...;

Just make the function variable static instead of having it be a global.
 That's *IT*.

> Note not all members in the 1024 bytes TDX module information are used
> (even by the KVM).

I'm not sure what this has to do with anything.

> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> index 38d534f2c113..f8a40d15fdfc 100644
> --- a/arch/x86/virt/vmx/tdx/Makefile
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> +CFLAGS_tdx.o += -Wframe-larger-than=4096
>  obj-y += tdx.o seamcall.o
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b7cedf0589db..6fe505c32599 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,7 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/mutex.h>
> +#include <asm/pgtable_types.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
>  #include "tdx.h"
> @@ -107,9 +108,8 @@ bool platform_tdx_enabled(void)
>   * leaf function return code and the additional output respectively if
>   * not NULL.
>   */
> -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> -				    u64 *seamcall_ret,
> -				    struct tdx_module_output *out)
> +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +		    u64 *seamcall_ret, struct tdx_module_output *out)
>  {
>  	u64 sret;
>  
> @@ -150,12 +150,85 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	}
>  }
>  
> +static inline bool is_cmr_empty(struct cmr_info *cmr)
> +{
> +	return !cmr->size;
> +}
> +
> +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_cmrs; i++) {
> +		struct cmr_info *cmr = &cmr_array[i];
> +
> +		/*
> +		 * The array of CMRs reported via TDH.SYS.INFO can
> +		 * contain tail empty CMRs.  Don't print them.
> +		 */
> +		if (is_cmr_empty(cmr))
> +			break;
> +
> +		pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
> +				cmr->base + cmr->size);
> +	}
> +}
> +
> +/*
> + * Get the TDX module information (TDSYSINFO_STRUCT) and the array of
> + * CMRs, and save them to @sysinfo and @cmr_array, which come from the
> + * kernel stack.  @sysinfo must have been padded to have enough room
> + * to save the TDSYSINFO_STRUCT.
> + */
> +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> +			   struct cmr_info *cmr_array)
> +{
> +	struct tdx_module_output out;
> +	u64 sysinfo_pa, cmr_array_pa;
> +	int ret;
> +
> +	/*
> +	 * Cannot use __pa() directly as @sysinfo and @cmr_array
> +	 * come from the kernel stack.
> +	 */
> +	sysinfo_pa = slow_virt_to_phys(sysinfo);
> +	cmr_array_pa = slow_virt_to_phys(cmr_array);

Note: they won't be on the kernel stack if they're 'static'.

> +	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> +			cmr_array_pa, MAX_CMRS, NULL, &out);
> +	if (ret)
> +		return ret;
> +
> +	pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> +		sysinfo->attributes,	sysinfo->vendor_id,
> +		sysinfo->major_version, sysinfo->minor_version,
> +		sysinfo->build_date,	sysinfo->build_num);
> +
> +	/* R9 contains the actual entries written to the CMR array. */
> +	print_cmrs(cmr_array, out.r9);
> +
> +	return 0;
> +}
> +
>  static int init_tdx_module(void)
>  {
> +	/*
> +	 * @tdsysinfo and @cmr_array are used in TDH.SYS.INFO SEAMCALL ABI.
> +	 * They are 1024 bytes and 512 bytes respectively but it's fine to
> +	 * keep them in the stack as this function is only called once.
> +	 */

Again, more silliness about being called once.

> +	DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> +			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
> +	struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);

One more thing about being on the stack: These aren't implicitly zeroed.
 They might have stack gunk from other calls in them.  I _think_ that's
OK because of, for instance, the 'out.r9' that limits how many CMRs get
read.  But, not being zeroed is a potential source of bugs and it's also
something that reviewers (and you) need to think about to make sure it
doesn't have side-effects.
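
FWIW, explicitly zeroing both buffers would remove that class of worry
entirely.  Just a sketch (the posted patch relies on the SEAMCALL
outputs instead; the variable names are the ones declared above):

	memset(&tdsysinfo_padded, 0, sizeof(tdsysinfo_padded));
	memset(cmr_array, 0, sizeof(cmr_array));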

> +	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
> +	int ret;
> +
> +	ret = tdx_get_sysinfo(sysinfo, cmr_array);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Get TDX module information and TDX-capable memory regions.
>  	 *  - Build the list of TDX-usable memory regions.
>  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
>  	 *    regions.
> @@ -166,7 +239,9 @@ static int init_tdx_module(void)
>  	 *
>  	 *  Return error before all steps are done.
>  	 */
> -	return -EINVAL;
> +	ret = -EINVAL;
> +out:
> +	return ret;
>  }

I'm going to be lazy and not look into the future.  But, you don't need
the "out:" label here, yet.  It doesn'serve any purpose like this, so
why introduce it here?

>  static int __tdx_enable(void)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 884357a4133c..6d32f62e4182 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -3,6 +3,8 @@
>  #define _X86_VIRT_TDX_H
>  
>  #include <linux/types.h>
> +#include <linux/stddef.h>
> +#include <linux/compiler_attributes.h>
>  
>  /*
>   * This file contains both macros and data structures defined by the TDX
> @@ -14,6 +16,80 @@
>  /* MSR to report KeyID partitioning between MKTME and TDX */
>  #define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
>  
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_INFO		32
> +
> +struct cmr_info {
> +	u64	base;
> +	u64	size;
> +} __packed;
> +
> +#define MAX_CMRS			32
> +#define CMR_INFO_ARRAY_ALIGNMENT	512
> +
> +struct cpuid_config {
> +	u32	leaf;
> +	u32	sub_leaf;
> +	u32	eax;
> +	u32	ebx;
> +	u32	ecx;
> +	u32	edx;
> +} __packed;
> +
> +#define DECLARE_PADDED_STRUCT(type, name, size, alignment)	\
> +	struct type##_padded {					\
> +		union {						\
> +			struct type name;			\
> +			u8 padding[size];			\
> +		};						\
> +	} name##_padded __aligned(alignment)
> +
> +#define PADDED_STRUCT(name)	(name##_padded.name)

These don't turn out looking _that_ nice in practice, but I do vastly
prefer them to hard-coded padding.
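
For anyone reading along, that declaration expands to roughly the
following (my manual expansion of the macro quoted above, not code from
the patch):

	struct tdsysinfo_struct_padded {
		union {
			struct tdsysinfo_struct tdsysinfo;
			u8 padding[TDSYSINFO_STRUCT_SIZE];
		};
	} tdsysinfo_padded __aligned(TDSYSINFO_STRUCT_ALIGNMENT);

PADDED_STRUCT(tdsysinfo) then refers to tdsysinfo_padded.tdsysinfo.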

<snip>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2022-12-09  6:52 ` [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
@ 2023-01-06 18:18   ` Dave Hansen
  2023-01-09 11:48     ` Huang, Kai
  2023-01-18 11:08   ` Huang, Kai
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 18:18 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
> 
> TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> kernel which memory is TDX compatible.  The kernel needs to build a list
> of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> the TDX module.  Once this is done, those "TDX-usable" memory regions
> are fixed during module's lifetime.
> 
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, just make sure
> all pages in the page allocator are TDX memory.

It's hard to tell what "The initial support of TDX guests" means.  I
*think* you mean "this series".  But, we try not to say "this blah" too
much, so just say this:

	To keep things simple, assume that all TDX-protected memory will
	come from the page allocator.  Make sure all pages in the page
	allocator *are* TDX-usable memory.

> To guarantee that, stash off the memblock memory regions at the time of
> initializing the TDX module as TDX's own usable memory regions, and in
> the meantime, register a TDX memory notifier to reject to online any new
> memory in memory hotplug.

First, this is a run-on sentence.  Second, it isn't really clear what
memblocks have to do with this or why you need to stash them off.
Please explain.

> This approach works as in practice all boot-time present DIMMs are TDX
> convertible memory.  However, if any non-TDX-convertible memory has been
> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> module, the module initialization will fail.

I really don't know what this is trying to say.

*How* and *why* does this module initialization failure occur?  How do
you implement it and why is it necessary?

> This can also be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.

Why does it need to be enhanced?  What's the problem?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 216fee7144ee..3a841a77fda4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
>  	 *
>  	 * Moreover, on machines with SandyBridge graphics or in setups that use
>  	 * crashkernel the entire 1M is reserved anyway.
> +	 *
> +	 * Note the host kernel TDX also requires the first 1MB being reserved.
>  	 */
>  	reserve_real_mode();
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 6fe505c32599..f010402f443d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,13 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/memory.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> +#include <linux/pfn.h>
>  #include <asm/pgtable_types.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
> @@ -25,6 +32,12 @@ enum tdx_module_status_t {
>  	TDX_MODULE_ERROR
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +};
> +
>  static u32 tdx_keyid_start __ro_after_init;
>  static u32 nr_tdx_keyids __ro_after_init;
>  
> @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
>  /* Prevent concurrent attempts on TDX detection and initialization */
>  static DEFINE_MUTEX(tdx_module_lock);
>  
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
>  	return 0;
>  }
>  
> +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled. */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered by
> +		 * any TDX memory block.
> +		 *
> +		 * Note TDX memory blocks are originated from memblock
> +		 * memory regions, which can only be contiguous when two
> +		 * regions have different NUMA nodes or flags.  Therefore
> +		 * the new range cannot cross multiple TDX memory blocks.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}

I don't really like that comment.  It should first state its behavior
and assumptions, like:

	This check assumes that the start_pfn<->end_pfn range does not
	cross multiple tdx_memlist entries.

Only then should it describe why that is OK:

	A single memory hotplug event across multiple memblocks (from
	which tdx_memlist entries are derived) is impossible.  ... then
	actually explain



> +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> +			       void *v)
> +{
> +	struct memory_notify *mn = v;
> +
> +	if (action != MEM_GOING_ONLINE)
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * Not all memory is compatible with TDX.  Reject
> +	 * to online any incompatible memory.
> +	 */

This comment isn't quite right either.  There might actually be totally
TDX *compatible* memory here.  It just wasn't configured for use with TDX.

Shouldn't this be something more like:

	/*
	 * The TDX memory configuration is static and can not be
	 * changed.  Reject onlining any memory which is outside
	 * of the static configuration whether it supports TDX or not.
	 */

> +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> +		NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> +	.notifier_call = tdx_memory_notifier,
> +};
> +
>  static int __init tdx_init(void)
>  {
>  	int err;
> @@ -89,6 +149,13 @@ static int __init tdx_init(void)
>  		goto no_tdx;
>  	}
>  
> +	err = register_memory_notifier(&tdx_memory_nb);
> +	if (err) {
> +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> +				err);
> +		goto no_tdx;
> +	}
> +
>  	return 0;
>  no_tdx:
>  	clear_tdx();
> @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>  	return 0;
>  }
>  
> +/*
> + * Add a memory region as a TDX memory block.  The caller must make sure
> + * all memory regions are added in address ascending order and don't
> + * overlap.
> + */
> +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> +			    unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +
> +	list_add_tail(&tmb->list, tmb_list);
> +	return 0;
> +}
> +
> +static void free_tdx_memlist(struct list_head *tmb_list)
> +{
> +	while (!list_empty(tmb_list)) {
> +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}

'tdx_memlist' is written only once at boot and then is read-only, right?

It might be nice to mention that so that the lack of locking doesn't
look problematic.

> +/*
> + * Ensure that all memblock memory regions are convertible to TDX
> + * memory.  Once this has been established, stash the memblock
> + * ranges off in a secondary structure because memblock is modified
> + * in memory hotplug while TDX memory regions are fixed.
> + */

Ahh, that's why we need to "shadow" the memblocks.  Can you add a
sentence on this to the changelog, please?

> +static int build_tdx_memlist(struct list_head *tmb_list)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +		/*
> +		 * The first 1MB is not reported as TDX convertible memory.
> +		 * Although the first 1MB is always reserved and won't end up
> +		 * to the page allocator, it is still in memblock's memory
> +		 * regions.  Skip them manually to exclude them as TDX memory.
> +		 */
> +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock has already guaranteed they are in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memlist(tmb_list);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -226,10 +364,25 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * The initial support of TDX guests only allocates memory from
> +	 * the global page allocator.  To keep things simple, just make
> +	 * sure all pages in the page allocator are TDX memory.

I didn't like this in the changelog either.  Try to make this "timeless"
rather than refer to what the support is today.  I gave you example text
above.

> +	 * Build the list of "TDX-usable" memory regions which cover all
> +	 * pages in the page allocator to guarantee that.  Do it while
> +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> +	 * path reads the @tdx_memlist to reject any new memory.
> +	 */
> +	get_online_mems();

Oh, it actually uses the memory hotplug locking for list protection.
That's at least a bit subtle.  Please document that somewhere in the
functions that actually manipulate the list.

I think it's also worth saying something here about the high-level
effects of what's going on:

	Take a snapshot of the memory configuration (memblocks).  This
	snapshot will be used to enable TDX support for *this* memory
	configuration only.  Use a memory hotplug notifier to ensure
	that no other RAM can be added outside of this configuration.

That's it, right?

> +	ret = build_tdx_memlist(&tdx_memlist);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Build the list of TDX-usable memory regions.
>  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
>  	 *    regions.
>  	 *  - Pick up one TDX private KeyID as the global KeyID.
> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>  	 */
>  	ret = -EINVAL;
>  out:
> +	/*
> +	 * @tdx_memlist is written here and read at memory hotplug time.
> +	 * Lock out memory hotplug code while building it.
> +	 */
> +	put_online_mems();
>  	return ret;
>  }

You would also be wise to have the folks who do a lot of memory hotplug
work today look at this sooner rather than later.  I _think_ what you
have here is OK, but I'm really rusty on the code itself.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros
  2022-12-09  6:52 ` [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
@ 2023-01-06 19:04   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 19:04 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
> defined by the TDX module spec and used as TDX module ABI.  Currently,
> they are used in try_accept_one() when the TDX guest tries to accept a
> page.  However currently try_accept_one() uses hard-coded magic values.
> 
> Define TDX supported page sizes as macros and get rid of the hard-coded
> values in try_accept_one().  TDX host support will need to use them too.
> 
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  2022-12-09  6:52 ` [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
@ 2023-01-06 19:04   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 19:04 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> TDX capable platforms are locked to X2APIC mode and cannot fall back to
> the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
> requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2022-12-09  6:52 ` [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
@ 2023-01-06 19:24   ` Dave Hansen
  2023-01-10  0:40     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 19:24 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

> +struct tdmr_info_list {
> +	struct tdmr_info *first_tdmr;

This is named badly.  This is really a pointer to an array.  While it
_does_ of course point to the first member of the array, the naming
should make it clear that there are multiple tdmr_infos here.

> +	int tdmr_sz;
> +	int max_tdmrs;
> +	int nr_tdmrs;	/* Actual number of TDMRs */
> +};

This 'tdmr_info_list' is declared in an unfortunate place.  I thought
the tdmr_size_single() function below was related to it.

Also, tdmr_sz and max_tdmrs can both be derived from 'sysinfo'.  Do they
really need to be stored here?  If so, I think I'd probably do something
like this with the structure:

struct tdmr_info_list {
	struct tdmr_info *tdmrs;
	int nr_consumed_tdmrs; // How many @tdmrs are in use

	/* Metadata for freeing this structure: */
	int tdmr_sz;   // Size of one 'tdmr_info' (has a flex array)
	int max_tdmrs; // How many @tdmrs are allocated
};

Modulo whatever folks are doing for comments these days.
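
The reason tdmr_sz needs to be stored at all is the flex array at the
end of each 'tdmr_info': entries are tdmr_sz bytes apart rather than
sizeof(struct tdmr_info).  Indexing then looks something like this
(a sketch using the field names above; the helper in the series may be
shaped slightly differently):

	static struct tdmr_info *tdmr_entry(struct tdmr_info_list *list, int i)
	{
		return (void *)list->tdmrs + i * list->tdmr_sz;
	}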

> +/* Calculate the actual TDMR size */
> +static int tdmr_size_single(u16 max_reserved_per_tdmr)
> +{
> +	int tdmr_sz;
> +
> +	/*
> +	 * The actual size of TDMR depends on the maximum
> +	 * number of reserved areas.
> +	 */
> +	tdmr_sz = sizeof(struct tdmr_info);
> +	tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;
> +
> +	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> +}
> +
> +static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
> +			   struct tdsysinfo_struct *sysinfo)
> +{
> +	size_t tdmr_sz, tdmr_array_sz;
> +	void *tdmr_array;
> +
> +	tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr);
> +	tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs;
> +
> +	/*
> +	 * To keep things simple, allocate all TDMRs together.
> +	 * The buffer needs to be physically contiguous to make
> +	 * sure each TDMR is physically contiguous.
> +	 */
> +	tdmr_array = alloc_pages_exact(tdmr_array_sz,
> +			GFP_KERNEL | __GFP_ZERO);
> +	if (!tdmr_array)
> +		return -ENOMEM;
> +
> +	tdmr_list->first_tdmr = tdmr_array;
> +	/*

	^ probably missing whitespace before the comment

> +	 * Keep the size of TDMR to find the target TDMR
> +	 * at a given index in the TDMR list.
> +	 */
> +	tdmr_list->tdmr_sz = tdmr_sz;
> +	tdmr_list->max_tdmrs = sysinfo->max_tdmrs;
> +	tdmr_list->nr_tdmrs = 0;
> +
> +	return 0;
> +}
> +
> +static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
> +{
> +	free_pages_exact(tdmr_list->first_tdmr,
> +			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
> +}
> +
> +/*
> + * Construct a list of TDMRs on the preallocated space in @tdmr_list
> + * to cover all TDX memory regions in @tmb_list based on the TDX module
> + * information in @sysinfo.
> + */
> +static int construct_tdmrs(struct list_head *tmb_list,
> +			   struct tdmr_info_list *tdmr_list,
> +			   struct tdsysinfo_struct *sysinfo)
> +{
> +	/*
> +	 * TODO:
> +	 *
> +	 *  - Fill out TDMRs to cover all TDX memory regions.
> +	 *  - Allocate and set up PAMTs for each TDMR.
> +	 *  - Designate reserved areas for each TDMR.
> +	 *
> +	 * Return -EINVAL until constructing TDMRs is done
> +	 */
> +	return -EINVAL;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -358,6 +439,7 @@ static int init_tdx_module(void)
>  			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
>  	struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
>  	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
> +	struct tdmr_info_list tdmr_list;
>  	int ret;
>  
>  	ret = tdx_get_sysinfo(sysinfo, cmr_array);
> @@ -380,11 +462,19 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/* Allocate enough space for constructing TDMRs */
> +	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> +	if (ret)
> +		goto out_free_tdx_mem;
> +
> +	/* Cover all TDX-usable memory regions in TDMRs */
> +	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> +	if (ret)
> +		goto out_free_tdmrs;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Construct a list of TDMRs to cover all TDX-usable memory
> -	 *    regions.
>  	 *  - Pick up one TDX private KeyID as the global KeyID.
>  	 *  - Configure the TDMRs and the global KeyID to the TDX module.
>  	 *  - Configure the global KeyID on all packages.
> @@ -393,6 +483,16 @@ static int init_tdx_module(void)
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +out_free_tdmrs:
> +	/*
> +	 * Free the space for the TDMRs no matter the initialization is
> +	 * successful or not.  They are not needed anymore after the
> +	 * module initialization.
> +	 */
> +	free_tdmr_list(&tdmr_list);
> +out_free_tdx_mem:
> +	if (ret)
> +		free_tdx_memlist(&tdx_memlist);
>  out:
>  	/*
>  	 * @tdx_memlist is written here and read at memory hotplug time.
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 6d32f62e4182..d0c762f1a94c 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -90,6 +90,29 @@ struct tdsysinfo_struct {
>  	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
>  } __packed;
>  
> +struct tdmr_reserved_area {
> +	u64 offset;
> +	u64 size;
> +} __packed;
> +
> +#define TDMR_INFO_ALIGNMENT	512
> +
> +struct tdmr_info {
> +	u64 base;
> +	u64 size;
> +	u64 pamt_1g_base;
> +	u64 pamt_1g_size;
> +	u64 pamt_2m_base;
> +	u64 pamt_2m_size;
> +	u64 pamt_4k_base;
> +	u64 pamt_4k_size;
> +	/*
> +	 * Actual number of reserved areas depends on
> +	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
> +	 */
> +	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
> +} __packed __aligned(TDMR_INFO_ALIGNMENT);
> +
>  /*
>   * Do not put any hardware-defined TDX structure representations below
>   * this comment!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 09/16] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  2022-12-09  6:52 ` [PATCH v8 09/16] x86/virt/tdx: Fill out " Kai Huang
@ 2023-01-06 19:36   ` Dave Hansen
  2023-01-10  0:45     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 19:36 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> Start working through the "multi-steps" needed to construct a list of "TD
> Memory Regions" (TDMRs) to cover all TDX-usable memory regions.
> 
> The kernel configures TDX-usable memory regions by passing a list of
> "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
> the information of the base/size of a memory region, the base/size of the
> associated Physical Address Metadata Table (PAMT) and a list of reserved
> areas in the region.
> 
> Do the first step to fill out a number of TDMRs to cover all TDX memory
> regions.  To keep it simple, always try to use one TDMR for each memory
> region.  As the first step only set up the base/size for each TDMR.
> 
> Each TDMR must be 1G aligned and the size must be in 1G granularity.
> This implies that one TDMR could cover multiple memory regions.  If a
> memory region spans the 1GB boundary and the former part is already
> covered by the previous TDMR, just use a new TDMR for the remaining
> part.
> 
> TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
> are consumed but there is more memory region to cover.

This could probably use some discussion of why it is not being
future-proofed.  Maybe:

	There are fancier things that could be done like trying to merge
	adjacent TDMRs.  This would allow more pathological memory
	layouts to be supported.  But, current systems are not even
	close to exhausting the existing TDMR resources in practice.
	For now, keep it simple.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index d36ac72ef299..5b1de0200c6b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -407,6 +407,90 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
>  			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
>  }
>  
> +/* Get the TDMR from the list at the given index. */
> +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> +				    int idx)
> +{
> +	return (struct tdmr_info *)((unsigned long)tdmr_list->first_tdmr +
> +			tdmr_list->tdmr_sz * idx);
> +}

I think that's more complicated and has more casting than necessary.
This looks nicer:

	int tdmr_info_offset = tdmr_list->tdmr_sz * idx;

	return (void *)tdmr_list->first_tdmr + tdmr_info_offset;

Also, it might even be worth keeping ->first_tdmr as a void*.  It isn't
a real C array and keeping it as void* would keep anyone from doing:

	tdmr_foo = tdmr_list->first_tdmr[foo];
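
Putting those two together, the helper could end up looking something like
this (just a sketch, assuming ->first_tdmr is changed to a void *):

	static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
					    int idx)
	{
		int tdmr_info_offset = tdmr_list->tdmr_sz * idx;

		/* void * arithmetic; no cast needed on the base pointer */
		return tdmr_list->first_tdmr + tdmr_info_offset;
	}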

> +#define TDMR_ALIGNMENT		BIT_ULL(30)
> +#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
> +#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> +#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
> +
> +static inline u64 tdmr_end(struct tdmr_info *tdmr)
> +{
> +	return tdmr->base + tdmr->size;
> +}
> +
> +/*
> + * Take the memory referenced in @tmb_list and populate the
> + * preallocated @tdmr_list, following all the special alignment
> + * and size rules for TDMR.
> + */
> +static int fill_out_tdmrs(struct list_head *tmb_list,
> +			  struct tdmr_info_list *tdmr_list)
> +{
> +	struct tdx_memblock *tmb;
> +	int tdmr_idx = 0;
> +
> +	/*
> +	 * Loop over TDX memory regions and fill out TDMRs to cover them.
> +	 * To keep it simple, always try to use one TDMR to cover one
> +	 * memory region.
> +	 *
> +	 * In practice TDX1.0 supports 64 TDMRs, which is big enough to
> +	 * cover all memory regions in reality if the admin doesn't use
> +	 * 'memmap' to create a bunch of discrete memory regions.  When
> +	 * there's a real problem, enhancement can be done to merge TDMRs
> +	 * to reduce the final number of TDMRs.
> +	 */
> +	list_for_each_entry(tmb, tmb_list, list) {
> +		struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
> +		u64 start, end;
> +
> +		start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
> +		end   = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
> +
> +		/*
> +		 * A valid size indicates the current TDMR has already
> +		 * been filled out to cover the previous memory region(s).
> +		 */
> +		if (tdmr->size) {
> +			/*
> +			 * Loop to the next if the current memory region
> +			 * has already been fully covered.
> +			 */
> +			if (end <= tdmr_end(tdmr))
> +				continue;
> +
> +			/* Otherwise, skip the already covered part. */
> +			if (start < tdmr_end(tdmr))
> +				start = tdmr_end(tdmr);
> +
> +			/*
> +			 * Create a new TDMR to cover the current memory
> +			 * region, or the remaining part of it.
> +			 */
> +			tdmr_idx++;
> +			if (tdmr_idx >= tdmr_list->max_tdmrs)
> +				return -E2BIG;
> +
> +			tdmr = tdmr_entry(tdmr_list, tdmr_idx);
> +		}
> +
> +		tdmr->base = start;
> +		tdmr->size = end - start;
> +	}
> +
> +	/* @tdmr_idx is always the index of last valid TDMR. */
> +	tdmr_list->nr_tdmrs = tdmr_idx + 1;
> +
> +	return 0;
> +}
> +
>  /*
>   * Construct a list of TDMRs on the preallocated space in @tdmr_list
>   * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -416,16 +500,23 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  			   struct tdmr_info_list *tdmr_list,
>  			   struct tdsysinfo_struct *sysinfo)
>  {
> +	int ret;
> +
> +	ret = fill_out_tdmrs(tmb_list, tdmr_list);
> +	if (ret)
> +		goto err;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Fill out TDMRs to cover all TDX memory regions.
>  	 *  - Allocate and set up PAMTs for each TDMR.
>  	 *  - Designate reserved areas for each TDMR.
>  	 *
>  	 * Return -EINVAL until constructing TDMRs is done
>  	 */
> -	return -EINVAL;
> +	ret = -EINVAL;
> +err:
> +	return ret;
>  }
>  
>  static int init_tdx_module(void)

Otherwise this actually looks fine.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-12-09  6:52 ` [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2023-01-06 21:53   ` Dave Hansen
  2023-01-10  0:49     ` Huang, Kai
  2023-01-07  0:47   ` Dave Hansen
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 21:53 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

Looks good so far.

> +/*
> + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
> + * within @tdmr, and set up PAMTs for @tdmr.
> + */
> +static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> +			    struct list_head *tmb_list,
> +			    u16 pamt_entry_size)
> +{
> +	unsigned long pamt_base[TDX_PS_1G + 1];
> +	unsigned long pamt_size[TDX_PS_1G + 1];

Nit: please define a TDX_PS_NR rather than open-coding this.
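
Something along these lines, in other words (a sketch only):

	#define TDX_PS_NR	(TDX_PS_1G + 1)

	unsigned long pamt_base[TDX_PS_NR];
	unsigned long pamt_size[TDX_PS_NR];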

> +	unsigned long tdmr_pamt_base;
> +	unsigned long tdmr_pamt_size;
> +	struct page *pamt;
> +	int pgsz, nid;
> +
> +	nid = tdmr_get_nid(tdmr, tmb_list);
> +
> +	/*
> +	 * Calculate the PAMT size for each TDX supported page size
> +	 * and the total PAMT size.
> +	 */
> +	tdmr_pamt_size = 0;
> +	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
> +		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
> +				pamt_entry_size);

This alignment is wonky.  Should be way over here:

> +						   pamt_entry_size);

> +		tdmr_pamt_size += pamt_size[pgsz];
> +	}
> +
> +	/*
> +	 * Allocate one chunk of physically contiguous memory for all
> +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> +	 * in overlapped TDMRs.
> +	 */
> +	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> +			nid, &node_online_map);
> +	if (!pamt)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Break the contiguous allocation back up into the
> +	 * individual PAMTs for each page size.
> +	 */
> +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> +	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) {
> +		pamt_base[pgsz] = tdmr_pamt_base;
> +		tdmr_pamt_base += pamt_size[pgsz];
> +	}
> +
> +	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> +	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> +	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> +	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
> +	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
> +	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
> +
> +	return 0;
> +}
> +
> +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
> +			  unsigned long *pamt_npages)
> +{
> +	unsigned long pamt_base, pamt_sz;
> +
> +	/*
> +	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
> +	 * should always point to the beginning of that allocation.
> +	 */
> +	pamt_base = tdmr->pamt_4k_base;
> +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> +	*pamt_pfn = PHYS_PFN(pamt_base);
> +	*pamt_npages = pamt_sz >> PAGE_SHIFT;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> +	unsigned long pamt_pfn, pamt_npages;
> +
> +	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
> +
> +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> +	if (!pamt_npages)
> +		return;
> +
> +	if (WARN_ON_ONCE(!pamt_pfn))
> +		return;
> +
> +	free_contig_range(pamt_pfn, pamt_npages);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_list->nr_tdmrs; i++)
> +		tdmr_free_pamt(tdmr_entry(tdmr_list, i));
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> +				 struct list_head *tmb_list,
> +				 u16 pamt_entry_size)
> +{
> +	int i, ret = 0;
> +
> +	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
> +		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
> +				pamt_entry_size);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	tdmrs_free_pamt_all(tdmr_list);
> +	return ret;
> +}
> +
> +static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
> +{
> +	unsigned long pamt_npages = 0;
> +	int i;
> +
> +	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
> +		unsigned long pfn, npages;
> +
> +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages);
> +		pamt_npages += npages;
> +	}
> +
> +	return pamt_npages;
> +}
> +
>  /*
>   * Construct a list of TDMRs on the preallocated space in @tdmr_list
>   * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -506,15 +702,19 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  	if (ret)
>  		goto err;
>  
> +	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
> +			sysinfo->pamt_entry_size);
> +	if (ret)
> +		goto err;
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Allocate and set up PAMTs for each TDMR.
>  	 *  - Designate reserved areas for each TDMR.
>  	 *
>  	 * Return -EINVAL until constructing TDMRs is done
>  	 */
>  	ret = -EINVAL;
> +	tdmrs_free_pamt_all(tdmr_list);
>  err:
>  	return ret;
>  }
> @@ -574,6 +774,11 @@ static int init_tdx_module(void)
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +	if (ret)
> +		tdmrs_free_pamt_all(&tdmr_list);
> +	else
> +		pr_info("%lu pages allocated for PAMT.\n",
> +				tdmrs_count_pamt_pages(&tdmr_list));
>  out_free_tdmrs:
>  	/*
>  	 * Free the space for the TDMRs no matter the initialization is

Other than the two nits:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2022-12-09  6:52 ` [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
@ 2023-01-06 22:07   ` Dave Hansen
  2023-01-10  1:19     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 22:07 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
> +			      u64 size, u16 max_reserved_per_tdmr)
> +{
> +	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
> +	int idx = *p_idx;
> +
> +	/* Reserved area must be 4K aligned in offset and size */
> +	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
> +		return -EINVAL;
> +
> +	if (idx >= max_reserved_per_tdmr)
> +		return -E2BIG;
> +
> +	rsvd_areas[idx].offset = addr - tdmr->base;
> +	rsvd_areas[idx].size = size;
> +
> +	*p_idx = idx + 1;
> +
> +	return 0;
> +}

It's probably worth at least a comment here to say:

	/*
	 * Consume one reserved area per call.  Make no effort to
	 * optimize or reduce the number of reserved areas which are
	 * consumed by contiguous reserved areas, for instance.
	 */

I think the -E2BIG is also wrong.  It should be ENOSPC.  I'd also add a
pr_warn() there.  Especially with how lazy this whole thing is, I can
start to see how the reserved areas might be exhausted.  Let's be kind
to our future selves and make the error (and the fix) easier to find.
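
Roughly something like this, in other words (only a sketch; the warning
message text is illustrative):

	static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
				      u64 size, u16 max_reserved_per_tdmr)
	{
		struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
		int idx = *p_idx;

		/* Reserved area must be 4K aligned in offset and size */
		if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
			return -EINVAL;

		/*
		 * Consume one reserved area per call.  Make no effort to
		 * optimize, e.g. by merging adjacent reserved areas.
		 */
		if (idx >= max_reserved_per_tdmr) {
			pr_warn("TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
					tdmr->base, tdmr->base + tdmr->size);
			return -ENOSPC;
		}

		rsvd_areas[idx].offset = addr - tdmr->base;
		rsvd_areas[idx].size = size;

		*p_idx = idx + 1;

		return 0;
	}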

It's probably also worth noting *somewhere* that there's a balance to be
had between TDMRs and reserved areas.  A system that is running out of
reserved areas in a TDMR could split a TDMR to get more reserved areas.
A system that has run out of TDMRs could relatively easily coalesce two
adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
if there was a gap between them.

I'm *really* close to acking this patch once those are fixed up.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module
  2022-12-09  6:52 ` [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module Kai Huang
@ 2023-01-06 22:21   ` Dave Hansen
  2023-01-10 10:48     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 22:21 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> After a list of "TD Memory Regions" (TDMRs) has been constructed to
> cover all TDX-usable memory regions, the next step is to pick up a TDX
> private KeyID as the "global KeyID" (which protects, i.e. TDX module's
> metadata), and configure it to the TDX module along with the TDMRs.

For whatever reason, whenever I see "i.e." in a changelog, it's usually
going off the rails.  This is no exception.  Let's also get rid of the
passive voice:

	The next step after constructing a list of "TD Memory Regions"
	(TDMRs) to cover all TDX-usable memory regions is to designate a
	TDX private KeyID as the "global KeyID".  This KeyID is used by
	the TDX module for mapping things like the PAMT and other TDX
	metadata.  This KeyID is passed to the TDX module at the same
	time as the TDMRs.

> To keep things simple, just use the first TDX KeyID as the global KeyID.




> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 41 +++++++++++++++++++++++++++++++++++--
>  arch/x86/virt/vmx/tdx/tdx.h |  2 ++
>  2 files changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 620b35e2a61b..ab961443fed5 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -916,6 +916,36 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  	return ret;
>  }
>  
> +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
> +{
> +	u64 *tdmr_pa_array, *p;
> +	size_t array_sz;
> +	int i, ret;
> +
> +	/*
> +	 * TDMRs are passed to the TDX module via an array of physical
> +	 * addresses of each TDMR.  The array itself has alignment
> +	 * requirement.
> +	 */
> +	array_sz = tdmr_list->nr_tdmrs * sizeof(u64) +
> +		TDMR_INFO_PA_ARRAY_ALIGNMENT - 1;

One other way of doing this which might be a wee bit less messy:

	array_sz = roundup_pow_of_two(array_sz);
	if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
		array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;

Since that keeps 'array_sz' at a power-of-two, then kzalloc() will give
you all the alignment you need, except if the array is too small, in
which case you can just bloat it to the alignment requirement.

This would get rid of the PTR_ALIGN() below too.

Your choice.  What you have works too.
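
For completeness, the whole helper with that approach would be roughly (a
sketch, relying on kmalloc()'s natural alignment for power-of-two sizes):

	static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
	{
		u64 *tdmr_pa_array;
		size_t array_sz;
		int i, ret;

		/*
		 * TDMRs are passed to the TDX module via an array of physical
		 * addresses of each TDMR.  Keep the array size a power of two
		 * (and at least the required alignment) so that kzalloc()
		 * returns a suitably aligned buffer and no PTR_ALIGN() fixup
		 * is needed.
		 */
		array_sz = tdmr_list->nr_tdmrs * sizeof(u64);
		array_sz = roundup_pow_of_two(array_sz);
		if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
			array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;

		tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
		if (!tdmr_pa_array)
			return -ENOMEM;

		for (i = 0; i < tdmr_list->nr_tdmrs; i++)
			tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));

		ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
				tdmr_list->nr_tdmrs, global_keyid, 0, NULL, NULL);

		/* Free the array as it is not required anymore. */
		kfree(tdmr_pa_array);

		return ret;
	}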

> +	p = kzalloc(array_sz, GFP_KERNEL);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	tdmr_pa_array = PTR_ALIGN(p, TDMR_INFO_PA_ARRAY_ALIGNMENT);
> +	for (i = 0; i < tdmr_list->nr_tdmrs; i++)
> +		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
> +
> +	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_list->nr_tdmrs,
> +				global_keyid, 0, NULL, NULL);
> +
> +	/* Free the array as it is not required anymore. */
> +	kfree(p);
> +
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -960,17 +990,24 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_free_tdmrs;
>  
> +	/*
> +	 * Use the first private KeyID as the global KeyID, and pass
> +	 * it along with the TDMRs to the TDX module.
> +	 */
> +	ret = config_tdx_module(&tdmr_list, tdx_keyid_start);
> +	if (ret)
> +		goto out_free_pamts;

This is "consuming" tdx_keyid_start.  Does it need to get incremented
since the first guest can't use this KeyID now?

>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Pick up one TDX private KeyID as the global KeyID.
> -	 *  - Configure the TDMRs and the global KeyID to the TDX module.
>  	 *  - Configure the global KeyID on all packages.
>  	 *  - Initialize all TDMRs.
>  	 *
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +out_free_pamts:
>  	if (ret)
>  		tdmrs_free_pamt_all(&tdmr_list);
>  	else
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index d0c762f1a94c..4d2edd477480 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -20,6 +20,7 @@
>   * TDX module SEAMCALL leaf functions
>   */
>  #define TDH_SYS_INFO		32
> +#define TDH_SYS_CONFIG		45
>  
>  struct cmr_info {
>  	u64	base;
> @@ -96,6 +97,7 @@ struct tdmr_reserved_area {
>  } __packed;
>  
>  #define TDMR_INFO_ALIGNMENT	512
> +#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
>  
>  struct tdmr_info {
>  	u64 base;


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages
  2022-12-09  6:52 ` [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2023-01-06 22:49   ` Dave Hansen
  2023-01-10 10:15     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-06 22:49 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> After the list of TDMRs and the global KeyID are configured to the TDX
> module, the kernel needs to configure the key of the global KeyID on all
> packages using TDH.SYS.KEY.CONFIG.
> 
> TDH.SYS.KEY.CONFIG needs to be done on one (any) cpu for each package.
> Also, it cannot run concurrently on different cpus, so just use
> smp_call_function_single() to do it one by one.
> 
> Note to keep things simple, neither the function to configure the global
> KeyID on all packages nor the tdx_enable() checks whether there's at
> least one online cpu for each package.  Also, neither of them explicitly
> prevents any cpu from going offline.  It is caller's responsibility to
> guarantee this.

OK, but does someone *actually* do this?

> Intel hardware doesn't guarantee cache coherency across different
> KeyIDs.  The kernel needs to flush PAMT's dirty cachelines (associated
> with KeyID 0) before the TDX module uses the global KeyID to access the
> PAMT.  Otherwise, those dirty cachelines can silently corrupt the TDX
> module's metadata.  Note this breaks TDX from functionality point of
> view but TDX's security remains intact.

	Intel hardware doesn't guarantee cache coherency across
	different KeyIDs.  The PAMTs are transitioning from being used
	by the kernel mapping (KeyId 0) to the TDX module's "global
	KeyID" mapping.

	This means that the kernel must flush any dirty KeyID-0 PAMT
	cachelines before the TDX module uses the global KeyID to access
	the PAMT.  Otherwise, if those dirty cachelines were written
	back, they would corrupt the TDX module's metadata.  Aside: This
	corruption would be detected by the memory integrity hardware on
	the next read of the memory with the global KeyID.  The result
	would likely be fatal to the system but would not impact TDX
	security.

> Following the TDX module specification, flush cache before configuring
> the global KeyID on all packages.  Given the PAMT size can be large
> (~1/256th of system RAM), just use WBINVD on all CPUs to flush.
> 
> Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> used the global KeyID to write any PAMT.  Therefore, need to use WBINVD
> to flush cache before freeing the PAMTs back to the kernel.

						s/need to// ^


> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ab961443fed5..4c779e8412f1 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -946,6 +946,66 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
>  	return ret;
>  }
>  
> +static void do_global_key_config(void *data)
> +{
> +	int ret;
> +
> +	/*
> +	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a
> +	 * recoverable error).  Assume this is exceedingly rare and
> +	 * just return error if encountered instead of retrying.
> +	 */
> +	ret = seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
> +
> +	*(int *)data = ret;
> +}
> +
> +/*
> + * Configure the global KeyID on all packages by doing TDH.SYS.KEY.CONFIG
> + * on one online cpu for each package.  If any package doesn't have any
> + * online

This looks like it stopped mid-sentence.

> + * Note:
> + *
> + * This function neither checks whether there's at least one online cpu
> + * for each package, nor explicitly prevents any cpu from going offline.
> + * If any package doesn't have any online cpu then the SEAMCALL won't be
> + * done on that package and the later step of TDX module initialization
> + * will fail.  The caller needs to guarantee this.
> + */

*Does* the caller guarantee it?

You're basically saying, "this code needs $FOO to work", but you're not
saying who *provides* $FOO.

> +static int config_global_keyid(void)
> +{
> +	cpumask_var_t packages;
> +	int cpu, ret = 0;
> +
> +	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	for_each_online_cpu(cpu) {
> +		int err;
> +
> +		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
> +					packages))
> +			continue;
> +
> +		/*
> +		 * TDH.SYS.KEY.CONFIG cannot run concurrently on
> +		 * different cpus, so just do it one by one.
> +		 */
> +		ret = smp_call_function_single(cpu, do_global_key_config, &err,
> +				true);
> +		if (ret)
> +			break;
> +		if (err) {
> +			ret = err;
> +			break;
> +		}
> +	}
> +
> +	free_cpumask_var(packages);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -998,19 +1058,46 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_free_pamts;
>  
> +	/*
> +	 * Hardware doesn't guarantee cache coherency across different
> +	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
> +	 * (associated with KeyID 0) before the TDX module can use the
> +	 * global KeyID to access the PAMT.  Given PAMTs are potentially
> +	 * large (~1/256th of system RAM), just use WBINVD on all cpus
> +	 * to flush the cache.
> +	 *
> +	 * Follow the TDX spec to flush cache before configuring the
> +	 * global KeyID on all packages.
> +	 */

I don't think this second paragraph adds very much clarity.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs
  2022-12-09  6:52 ` [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2023-01-07  0:17   ` Dave Hansen
  2023-01-10 10:23     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-07  0:17 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> After the global KeyID has been configured on all packages, initialize
> all TDMRs to make all TDX-usable memory regions that are passed to the
> TDX module become usable.
> 
> This is the last step of initializing the TDX module.
> 
> Initializing different TDMRs can be parallelized.  For now to keep it
> simple, just initialize all TDMRs one by one.  It can be enhanced in the
> future.

The changelog probably also needs a note about this being a long process
and also at least touching on *why* it takes so long.

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 4c779e8412f1..8b7314f19df2 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1006,6 +1006,55 @@ static int config_global_keyid(void)
>  	return ret;
>  }
>  
> +static int init_tdmr(struct tdmr_info *tdmr)
> +{
> +	u64 next;
> +
> +	/*
> +	 * Initializing a TDMR can be time consuming.  To avoid long
> +	 * SEAMCALLs, the TDX module may only initialize a part of the
> +	 * TDMR in each call.
> +	 */
> +	do {
> +		struct tdx_module_output out;
> +		int ret;
> +
> +		/* All 0's are unused parameters, they mean nothing. */
> +		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> +				&out);
> +		if (ret)
> +			return ret;
> +		/*
> +		 * RDX contains 'next-to-initialize' address if
> +		 * TDH.SYS.TDMR.INT succeeded.

This reads strangely.  "Success" to me really is different from "partial
success".  Sure, partial success is also not an error, *but* this can be
explained better.  How about:

		 * RDX contains 'next-to-initialize' address if
		 * TDH.SYS.TDMR.INIT did not fully complete and should
		 * be retried.


> +		 */
> +		next = out.rdx;
> +		cond_resched();
> +		/* Keep making SEAMCALLs until the TDMR is done */
> +	} while (next < tdmr->base + tdmr->size);
> +
> +	return 0;
> +}
> +
> +static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> +{
> +	int i;
> +
> +	/*
> +	 * This operation is costly.  It can be parallelized,
> +	 * but keep it simple for now.
> +	 */
> +	for (i = 0; i < tdmr_list->nr_tdmrs; i++) {
> +		int ret;
> +
> +		ret = init_tdmr(tdmr_entry(tdmr_list, i));
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -1076,14 +1125,10 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_free_pamts;
>  
> -	/*
> -	 * TODO:
> -	 *
> -	 *  - Initialize all TDMRs.
> -	 *
> -	 *  Return error before all steps are done.
> -	 */
> -	ret = -EINVAL;
> +	/* Initialize TDMRs to complete the TDX module initialization */
> +	ret = init_tdmrs(&tdmr_list);
> +	if (ret)
> +		goto out_free_pamts;
>  out_free_pamts:
>  	if (ret) {
>  		/*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index f5c12a2543d4..163c4876dee4 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -21,6 +21,7 @@
>   */
>  #define TDH_SYS_KEY_CONFIG	31
>  #define TDH_SYS_INFO		32
> +#define TDH_SYS_TDMR_INIT	36
>  #define TDH_SYS_CONFIG		45
>  
>  struct cmr_info {


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2022-12-09  6:52 ` [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
@ 2023-01-07  0:35   ` Dave Hansen
  2023-01-10 11:29     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-07  0:35 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> There are two problems in terms of using kexec() to boot to a new kernel
> when the old kernel has enabled TDX: 1) Part of the memory pages are
> still TDX private pages (i.e. metadata used by the TDX module, and any
> TDX guest memory if kexec() happens when there's any TDX guest alive).
> 2) There might be dirty cachelines associated with TDX private pages.
> 
> Because the hardware doesn't guarantee cache coherency among different
> KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
> before booting to the new kernel.  Also, reading TDX private page using
> any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
> Therefore ideally, the kernel should convert all TDX private pages back
> to normal before booting to the new kernel.

This is just talking about way too many things that just don't apply.

Let's focus on the *ACTUAL* problem that's being addressed instead of
the 15 problems that aren't actual practical problems.

> However, this implementation doesn't convert TDX private pages back to
> normal in kexec() because of below considerations:
> 
> 1) Neither the kernel nor the TDX module has existing infrastructure to
>    track which pages are TDX private pages.
> 2) The number of TDX private pages can be large, and converting all of
>    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
>    be time consuming.
> 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
>    0 doesn't support integrity-check, so it's OK.
> 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
>    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
>    page with the new MKTME KeyID (just like TDX does) before using it.

Yeah, why are we getting all worked up about MKTME when there is no
support?

The only thing that matters here is dirty cacheline writeback.  There
are two things the kernel needs to do to mitigate that:

 1. Stop accessing TDX private memory mappings
  1a. Stop making TDX module calls (uses global private KeyID)
  1b. Stop TDX guests from running (uses per-guest KeyID)
 2. Flush any cachelines from previous private KeyID writes

There are a couple of ways we can do #2.  We do *NOT* need to convert
*ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
in any way as far as I can tell.

I think you're also saying that since all CPUs go through this path and
there is no TDX activity between the WBINVD and the native_halt() that
 1a and 1b basically happen for "free" without needing to do them
explicitly.

> Therefore, this implementation just flushes cache to make sure there are
> no stale dirty cachelines associated with any TDX private KeyIDs before
> booting to the new kernel, otherwise they may silently corrupt the new
> kernel.

That's fine.  So, this patch kinda happens to land in the right spot
even after thrashing about for a while.

> Following SME support, use wbinvd() to flush cache in stop_this_cpu().
> Theoretically, cache flush is only needed when the TDX module has been
> initialized.  However initializing the TDX module is done on demand at
> runtime, and it takes a mutex to read the module status.  Just check
> whether TDX is enabled by BIOS instead to flush cache.

Yeah, close enough.

> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index c21b7347a26d..0cc84977dc62 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
>  	 *
>  	 * Test the CPUID bit directly because the machine might've cleared
>  	 * X86_FEATURE_SME due to cmdline options.
> +	 *
> +	 * Similar to SME, if the TDX module is ever initialized, the
> +	 * cachelines associated with any TDX private KeyID must be flushed
> +	 * before transiting to the new kernel.  The TDX module is initialized
> +	 * on demand, and it takes the mutex to read its status.  Just check
> +	 * whether TDX is enabled by BIOS instead to flush cache.
>  	 */

There's too much detail here.  Let's up-level it a bit.  We don't need
to be talking about TDX locking here.

	/*
	 * The TDX module or guests might have left dirty cachelines
	 * behind.  Flush them to avoid corruption from later writeback.
	 * Note that this flushes on all systems where TDX is possible,
	 * but does not actually check that TDX was in use.
	 */

> -	if (cpuid_eax(0x8000001f) & BIT(0))
> +	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
>  		native_wbinvd();
>  	for (;;) {
>  		/*


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2022-12-09  6:52 ` [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
  2023-01-06 21:53   ` Dave Hansen
@ 2023-01-07  0:47   ` Dave Hansen
  2023-01-10  0:47     ` Huang, Kai
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-07  0:47 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, peterz, tglx, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, kirill.shutemov, ying.huang, reinette.chatre,
	len.brown, tony.luck, ak, isaku.yamahata, chao.gao,
	sathyanarayanan.kuppuswamy, bagasdotme, sagis, imammedo

On 12/8/22 22:52, Kai Huang wrote:
> @@ -574,6 +774,11 @@ static int init_tdx_module(void)
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +	if (ret)
> +		tdmrs_free_pamt_all(&tdmr_list);
> +	else
> +		pr_info("%lu pages allocated for PAMT.\n",
> +				tdmrs_count_pamt_pages(&tdmr_list));

Could you please convert this to megabytes or kilobytes?  dmesg is for
humans and humans don't generally know how large their systems or DIMMs
are in pages without looking or grabbing a calculator.
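
For example, something like this (just a sketch, assuming 4K pages):

	pr_info("%lu KBs allocated for PAMT.\n",
			tdmrs_count_pamt_pages(&tdmr_list) * PAGE_SIZE / 1024);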

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2023-01-06 17:09   ` Dave Hansen
@ 2023-01-08 22:25     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-08 22:25 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 09:09 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _X86_VIRT_TDX_H
> > +#define _X86_VIRT_TDX_H
> > +
> > +/*
> > + * This file contains both macros and data structures defined by the TDX
> > + * architecture and Linux defined software data structures and functions.
> > + * The two should not be mixed together for better readability.  The
> > + * architectural definitions come first.
> > + */
> > +
> > +/* MSR to report KeyID partitioning between MKTME and TDX */
> > +#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
> 
> The *VAST* majority of MSR definitions are in msr-index.h.
> 
> Why is this one different from the norm?

My recollection was that an MSR definition should only go into msr-index.h
when the MSR is shared by multiple source files, but this has changed since
commit 97fa21f65c3e ("x86/resctrl: Move MSR defines into msr-index.h").

I'll move it to msr-index.h

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand
  2023-01-06 17:14   ` Dave Hansen
@ 2023-01-08 22:26     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-08 22:26 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 09:14 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > 
> > The TDX module will be initialized in multi-steps defined by the TDX
> > module and most of those steps involve a specific SEAMCALL:
> > 
> >  1) Get the TDX module information and TDX-capable memory regions
> >     (TDH.SYS.INFO).
> >  2) Build the list of TDX-usable memory regions.
> >  3) Construct a list of "TD Memory Regions" (TDMRs) to cover all
> >     TDX-usable memory regions.
> >  4) Pick up one TDX private KeyID as the global KeyID.
> >  5) Configure the TDMRs and the global KeyID to the TDX module
> >     (TDH.SYS.CONFIG).
> >  6) Configure the global KeyID on all packages (TDH.SYS.KEY.CONFIG).
> >  7) Initialize all TDMRs (TDH.SYS.TDMR.INIT).
> 
> I don't think you really need this *AND* the "TODO" comments in
> init_tdx_module().  Just say:
> 
> 	Add a placeholder tdx_enable() to initialize the TDX module on
> 	demand.  The TODO list will be pared down as functionality is
> 	added.

Yes agreed.  Will do.  Thanks.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-01-06 17:46   ` Dave Hansen
@ 2023-01-09 10:25     ` Huang, Kai
  2023-01-09 19:52       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-09 10:25 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 09:46 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > For now, both the TDX module information and CMRs are only used during
> > the module initialization, so declare them as local.  However, they are
> > 1024 bytes and 512 bytes respectively.  Putting them to the stack
> > exceeds the default "stack frame size" that the kernel assumes as safe,
> > and the compiler yields a warning about this.  Add a kernel build flag
> > to extend the safe stack size to 4K for tdx.c to silence the warning --
> > the initialization function is only called once so it's safe to have a
> > 4K stack.
> 
> Gah.  This has gone off in a really odd direction.
> 
> The fact that this is called once really has nothing to do with how
> tolerant of a large stack we should be.  If a function is called once
> from a deep call stack, it can't consume a lot of stack space.  If it's
> called a billion times from a shallow stack depth, it can use all the
> stack it wants.

Agreed.

> 
> All I really wanted here was this:
> 
> static int init_tdx_module(void)
> {
> -	struct cmr_info cmr_array[MAX_CMRS] ...;+	static struct cmr_info
> cmr_array[MAX_CMRS] ...;
> 
> Just make the function variable static instead of having it be a global.
>  That's *IT*.

Yes will do.

Btw, I think putting large, physically contiguous, hardware-used data
structures on the stack isn't a good idea anyway.  The reason is explained in
the link below:

https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#m48506450cd22f84ab718d3b5bf8ddbff8fcc3362

The kernel stack can be vmalloc()-ed, so a data structure on the stack isn't
guaranteed to be physically contiguous if it crosses a page boundary.

This particular TDX case works because the size and the alignment make sure
neither cmr_array[] nor tdsysinfo can cross a page boundary.

> 
> > Note not all members in the 1024 bytes TDX module information are used
> > (even by the KVM).
> 
> I'm not sure what this has to do with anything.

You mentioned in v7 that:

>> This is also a great place to mention that the tdsysinfo_struct contains
>> a *lot* of gunk which will not be used for a bit or that may never get
>> used.

https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#m168e619aac945fa418ccb1d6652113003243d895

Perhaps I misunderstood something but I was trying to address this.

Should I remove this sentence?

> 
> > diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> > index 38d534f2c113..f8a40d15fdfc 100644
> > --- a/arch/x86/virt/vmx/tdx/Makefile
> > +++ b/arch/x86/virt/vmx/tdx/Makefile
> > @@ -1,2 +1,3 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> > +CFLAGS_tdx.o += -Wframe-larger-than=4096
> >  obj-y += tdx.o seamcall.o
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index b7cedf0589db..6fe505c32599 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/errno.h>
> >  #include <linux/printk.h>
> >  #include <linux/mutex.h>
> > +#include <asm/pgtable_types.h>
> >  #include <asm/msr.h>
> >  #include <asm/tdx.h>
> >  #include "tdx.h"
> > @@ -107,9 +108,8 @@ bool platform_tdx_enabled(void)
> >   * leaf function return code and the additional output respectively if
> >   * not NULL.
> >   */
> > -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > -				    u64 *seamcall_ret,
> > -				    struct tdx_module_output *out)
> > +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > +		    u64 *seamcall_ret, struct tdx_module_output *out)
> >  {
> >  	u64 sret;
> >  
> > @@ -150,12 +150,85 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> >  	}
> >  }
> >  
> > +static inline bool is_cmr_empty(struct cmr_info *cmr)
> > +{
> > +	return !cmr->size;
> > +}
> > +
> > +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < nr_cmrs; i++) {
> > +		struct cmr_info *cmr = &cmr_array[i];
> > +
> > +		/*
> > +		 * The array of CMRs reported via TDH.SYS.INFO can
> > +		 * contain tail empty CMRs.  Don't print them.
> > +		 */
> > +		if (is_cmr_empty(cmr))
> > +			break;
> > +
> > +		pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
> > +				cmr->base + cmr->size);
> > +	}
> > +}
> > +
> > +/*
> > + * Get the TDX module information (TDSYSINFO_STRUCT) and the array of
> > + * CMRs, and save them to @sysinfo and @cmr_array, which come from the
> > + * kernel stack.  @sysinfo must have been padded to have enough room
> > + * to save the TDSYSINFO_STRUCT.
> > + */
> > +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> > +			   struct cmr_info *cmr_array)
> > +{
> > +	struct tdx_module_output out;
> > +	u64 sysinfo_pa, cmr_array_pa;
> > +	int ret;
> > +
> > +	/*
> > +	 * Cannot use __pa() directly as @sysinfo and @cmr_array
> > +	 * come from the kernel stack.
> > +	 */
> > +	sysinfo_pa = slow_virt_to_phys(sysinfo);
> > +	cmr_array_pa = slow_virt_to_phys(cmr_array);
> 
> Note: they won't be on the kernel stack if they're 'static'.
> 
> > +	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> > +			cmr_array_pa, MAX_CMRS, NULL, &out);
> > +	if (ret)
> > +		return ret;
> > +
> > +	pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> > +		sysinfo->attributes,	sysinfo->vendor_id,
> > +		sysinfo->major_version, sysinfo->minor_version,
> > +		sysinfo->build_date,	sysinfo->build_num);
> > +
> > +	/* R9 contains the actual entries written to the CMR array. */
> > +	print_cmrs(cmr_array, out.r9);
> > +
> > +	return 0;
> > +}
> > +
> >  static int init_tdx_module(void)
> >  {
> > +	/*
> > +	 * @tdsysinfo and @cmr_array are used in TDH.SYS.INFO SEAMCALL ABI.
> > +	 * They are 1024 bytes and 512 bytes respectively but it's fine to
> > +	 * keep them in the stack as this function is only called once.
> > +	 */
> 
> Again, more silliness about being called once.
> 
> > +	DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> > +			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
> > +	struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> 
> One more thing about being on the stack: These aren't implicitly zeroed.
>  They might have stack gunk from other calls in them.  I _think_ that's
> OK because of, for instance, the 'out.r9' that limits how many CMRs get
> read.  But, not being zeroed is a potential source of bugs and it's also
> something that reviewers (and you) need to think about to make sure it
> doesn't have side-effects.

Agreed.

As mentioned above, will change to use 'static' but keep the variables in the
function.
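
I.e. roughly something like this (a sketch):

	/* In init_tdx_module(), make the large SEAMCALL buffers static locals: */
	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
	static struct cmr_info cmr_array[MAX_CMRS]
			__aligned(CMR_INFO_ARRAY_ALIGNMENT);
	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);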

> 
> > +	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
> > +	int ret;
> > +
> > +	ret = tdx_get_sysinfo(sysinfo, cmr_array);
> > +	if (ret)
> > +		goto out;
> > +
> >  	/*
> >  	 * TODO:
> >  	 *
> > -	 *  - Get TDX module information and TDX-capable memory regions.
> >  	 *  - Build the list of TDX-usable memory regions.
> >  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
> >  	 *    regions.
> > @@ -166,7 +239,9 @@ static int init_tdx_module(void)
> >  	 *
> >  	 *  Return error before all steps are done.
> >  	 */
> > -	return -EINVAL;
> > +	ret = -EINVAL;
> > +out:
> > +	return ret;
> >  }
> 
> I'm going to be lazy and not look into the future.  But, you don't need
> the "out:" label here, yet.  It doesn't serve any purpose like this, so
> why introduce it here?

The 'out' label is here because of the code below:

	ret = tdx_get_sysinfo(...);
	if (ret)
		goto out;

If I don't have the 'out' label here in this patch, do you mean something
like the below?

	ret = tdx_get_sysinfo(...);
	if (ret)
		return ret;

	/*
	 * TODO:
	 * ...
	 * Return error before all steps are done.
	 */
	return -EINVAL;

> 
> >  static int __tdx_enable(void)
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index 884357a4133c..6d32f62e4182 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -3,6 +3,8 @@
> >  #define _X86_VIRT_TDX_H
> >  
> >  #include <linux/types.h>
> > +#include <linux/stddef.h>
> > +#include <linux/compiler_attributes.h>
> >  
> >  /*
> >   * This file contains both macros and data structures defined by the TDX
> > @@ -14,6 +16,80 @@
> >  /* MSR to report KeyID partitioning between MKTME and TDX */
> >  #define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
> >  
> > +/*
> > + * TDX module SEAMCALL leaf functions
> > + */
> > +#define TDH_SYS_INFO		32
> > +
> > +struct cmr_info {
> > +	u64	base;
> > +	u64	size;
> > +} __packed;
> > +
> > +#define MAX_CMRS			32
> > +#define CMR_INFO_ARRAY_ALIGNMENT	512
> > +
> > +struct cpuid_config {
> > +	u32	leaf;
> > +	u32	sub_leaf;
> > +	u32	eax;
> > +	u32	ebx;
> > +	u32	ecx;
> > +	u32	edx;
> > +} __packed;
> > +
> > +#define DECLARE_PADDED_STRUCT(type, name, size, alignment)	\
> > +	struct type##_padded {					\
> > +		union {						\
> > +			struct type name;			\
> > +			u8 padding[size];			\
> > +		};						\
> > +	} name##_padded __aligned(alignment)
> > +
> > +#define PADDED_STRUCT(name)	(name##_padded.name)
> 
> These don't turn out looking _that_ nice in practice, but I do vastly
> prefer them to hard-coded padding.
> 
> <snip>

Agreed.  Thanks for your original code.
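
(For reference, for the tdsysinfo case the macro comes out roughly like the
below, with PADDED_STRUCT(tdsysinfo) picking out the inner struct:)

	struct tdsysinfo_struct_padded {
		union {
			struct tdsysinfo_struct tdsysinfo;
			u8 padding[TDSYSINFO_STRUCT_SIZE];
		};
	} tdsysinfo_padded __aligned(TDSYSINFO_STRUCT_ALIGNMENT);

	/* PADDED_STRUCT(tdsysinfo) expands to tdsysinfo_padded.tdsysinfo */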


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL
  2023-01-06 17:29   ` Dave Hansen
@ 2023-01-09 10:30     ` Huang, Kai
  2023-01-09 19:54       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-09 10:30 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 09:29 -0800, Dave Hansen wrote:
> The subject here is a bit too specific.  This patch isn't just
> "implementing functions".  There are more than just functions here.  The
> best subject is probably:
> 
> 	Add SEAMCALL infrastructure

Yes this is better.  Thanks.

> 
> But that's rather generic by necessity because this patch does several
> _different_ logical things:
> 
>  * Wrap TDX_MODULE_CALL so it can be used for SEAMCALLs with host=1
>  * Add handling to TDX_MODULE_CALL to allow it to handle specifically
>    host-side error conditions
>  * Add high-level seamcall() function with actual error handling
> 
> It's probably also worth noting that the code that allows "host=1" to be
> passed to TDX_MODULE_CALL is dead code in mainline right now.  It
> probably shouldn't have been merged this way, but oh well.
> 
> I don't know that you really _need_ to split this up, but I'm just
> pointing out that mashing three different logical things together makes
> it hard to write a coherent Subject.  But, I've seen worse.

Agreed.

To me it seems "Add SEAMCALL infrastructure" is good enough, but I can split
it up if you want me to.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-06 18:18   ` Dave Hansen
@ 2023-01-09 11:48     ` Huang, Kai
  2023-01-09 16:51       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-09 11:48 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 10:18 -0800, Dave Hansen wrote:
> > On 12/8/22 22:52, Kai Huang wrote:
> > > > As a step of initializing the TDX module, the kernel needs to tell the
> > > > TDX module which memory regions can be used by the TDX module as TDX
> > > > guest memory.
> > > > 
> > > > TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> > > > kernel which memory is TDX compatible.  The kernel needs to build a list
> > > > of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> > > > the TDX module.  Once this is done, those "TDX-usable" memory regions
> > > > are fixed during module's lifetime.
> > > > 
> > > > The initial support of TDX guests will only allocate TDX guest memory
> > > > from the global page allocator.  To keep things simple, just make sure
> > > > all pages in the page allocator are TDX memory.
> > 
> > It's hard to tell what "The initial support of TDX guests" means.  I
> > *think* you mean "this series".  But, we try not to say "this blah" too
> > much, so just say this:
> > 
> > 	To keep things simple, assume that all TDX-protected memory will
> > 	come from the page allocator.  Make sure all pages in the page
> > 	allocator *are* TDX-usable memory.
> > 

Yes will do. Thanks.

> > > > To guarantee that, stash off the memblock memory regions at the time of
> > > > initializing the TDX module as TDX's own usable memory regions, and in
> > > > the meantime, register a TDX memory notifier to reject to online any new
> > > > memory in memory hotplug.
> > 
> > First, this is a run-on sentence.  Second, it isn't really clear what
> > memblocks have to do with this or why you need to stash them off.
> > Please explain.

Will break up the run-on sentence.  And as you mentioned below, will add a
sentence to explain why we need to stash off the memblocks.

> > 
> > > > This approach works as in practice all boot-time present DIMMs are TDX
> > > > convertible memory.  However, if any non-TDX-convertible memory has been
> > > > hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> > > > module, the module initialization will fail.
> > 
> > I really don't know what this is trying to say.

My intention was to explain and call out that this design (using all memory
regions in the memblock at the time of module initialization) works in
practice, as long as non-CMR memory hasn't been added via memory hotplug.

Not sure if it is necessary, but I was thinking it may help reviewers judge
whether such a design is acceptable.

> > 
> > *How* and *why* does this module initialization failure occur?  
> > 

If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
will fail.

> > How do
> > you implement it and why is it necessary?

As mentioned above, we depend on SEAMCALL to fail.

I am not sure whether it is necessary, but I thought it could help people to
review.

> > 
> > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > the "TDX-capable" nodes.
> > 
> > Why does it need to be enhanced?  What's the problem?

The problem is that after TDX module initialization, no more memory can be
hot-added to the page allocator.

Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
utilize all memory.

But probably it is not necessary to call out in the changelog?

> > 
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index dd333b46fafb..b36129183035 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> > > >  	depends on X86_64
> > > >  	depends on KVM_INTEL
> > > >  	depends on X86_X2APIC
> > > > +	select ARCH_KEEP_MEMBLOCK
> > > >  	help
> > > >  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > >  	  host and certain physical attacks.  This option enables necessary TDX
> > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > > index 216fee7144ee..3a841a77fda4 100644
> > > > --- a/arch/x86/kernel/setup.c
> > > > +++ b/arch/x86/kernel/setup.c
> > > > @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
> > > >  	 *
> > > >  	 * Moreover, on machines with SandyBridge graphics or in setups that use
> > > >  	 * crashkernel the entire 1M is reserved anyway.
> > > > +	 *
> > > > +	 * Note the host kernel TDX also requires the first 1MB being reserved.
> > > >  	 */
> > > >  	reserve_real_mode();
> > > >  
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index 6fe505c32599..f010402f443d 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -13,6 +13,13 @@
> > > >  #include <linux/errno.h>
> > > >  #include <linux/printk.h>
> > > >  #include <linux/mutex.h>
> > > > +#include <linux/list.h>
> > > > +#include <linux/slab.h>
> > > > +#include <linux/memblock.h>
> > > > +#include <linux/memory.h>
> > > > +#include <linux/minmax.h>
> > > > +#include <linux/sizes.h>
> > > > +#include <linux/pfn.h>
> > > >  #include <asm/pgtable_types.h>
> > > >  #include <asm/msr.h>
> > > >  #include <asm/tdx.h>
> > > > @@ -25,6 +32,12 @@ enum tdx_module_status_t {
> > > >  	TDX_MODULE_ERROR
> > > >  };
> > > >  
> > > > +struct tdx_memblock {
> > > > +	struct list_head list;
> > > > +	unsigned long start_pfn;
> > > > +	unsigned long end_pfn;
> > > > +};
> > > > +
> > > >  static u32 tdx_keyid_start __ro_after_init;
> > > >  static u32 nr_tdx_keyids __ro_after_init;
> > > >  
> > > > @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
> > > >  /* Prevent concurrent attempts on TDX detection and initialization */
> > > >  static DEFINE_MUTEX(tdx_module_lock);
> > > >  
> > > > +/* All TDX-usable memory regions */
> > > > +static LIST_HEAD(tdx_memlist);
> > > > +
> > > >  /*
> > > >   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
> > > >   * This is used in TDX initialization error paths to take it from
> > > > @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> > > > +{
> > > > +	struct tdx_memblock *tmb;
> > > > +
> > > > +	/* Empty list means TDX isn't enabled. */
> > > > +	if (list_empty(&tdx_memlist))
> > > > +		return true;
> > > > +
> > > > +	list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > +		/*
> > > > +		 * The new range is TDX memory if it is fully covered by
> > > > +		 * any TDX memory block.
> > > > +		 *
> > > > +		 * Note TDX memory blocks are originated from memblock
> > > > +		 * memory regions, which can only be contiguous when two
> > > > +		 * regions have different NUMA nodes or flags.  Therefore
> > > > +		 * the new range cannot cross multiple TDX memory blocks.
> > > > +		 */
> > > > +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > +			return true;
> > > > +	}
> > > > +	return false;
> > > > +}
> > 
> > I don't really like that comment.  It should first state its behavior
> > and assumptions, like:
> > 
> > 	This check assumes that the start_pfn<->end_pfn range does not
> > 	cross multiple tdx_memlist entries.
> > 
> > Only then should it describe why that is OK:
> > 
> > 	A single memory hotplug event across multiple memblocks (from
> > 	which tdx_memlist entries are derived) is impossible.  ... then
> > 	actually explain
> > 

How about below?

	/*
	 * This check assumes that the start_pfn<->end_pfn range does not cross
	 * multiple tdx_memlist entries. A single memory hotplug event across
	 * multiple memblocks (from which tdx_memlist entries are derived) is
	 * impossible. That means start_pfn<->end_pfn range cannot exceed a
	 * tdx_memlist entry, and the new range is TDX memory if it is fully
	 * covered by any tdx_memlist entry.
	 */

> > 
> > 
> > > > +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> > > > +			       void *v)
> > > > +{
> > > > +	struct memory_notify *mn = v;
> > > > +
> > > > +	if (action != MEM_GOING_ONLINE)
> > > > +		return NOTIFY_OK;
> > > > +
> > > > +	/*
> > > > +	 * Not all memory is compatible with TDX.  Reject
> > > > +	 * to online any incompatible memory.
> > > > +	 */
> > 
> > This comment isn't quite right either.  There might actually be totally
> > TDX *compatible* memory here.  It just wasn't configured for use with TDX.
> > 
> > Shouldn't this be something more like:
> > 
> > 	/*
> > 	 * The TDX memory configuration is static and can not be
> > 	 * changed.  Reject onlining any memory which is outside
> > 	 * of the static configuration whether it supports TDX or not.
> > 	 */

Yes it's better. Thanks.

My intention was that "incompatible memory" in the original comment actually
means "out-of-static-configuration" memory, but indeed it can easily be confused
with non-CMR memory.

> > 
> > > > +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> > > > +		NOTIFY_OK : NOTIFY_BAD;
> > > > +}
> > > > +
> > > > +static struct notifier_block tdx_memory_nb = {
> > > > +	.notifier_call = tdx_memory_notifier,
> > > > +};
> > > > +
> > > >  static int __init tdx_init(void)
> > > >  {
> > > >  	int err;
> > > > @@ -89,6 +149,13 @@ static int __init tdx_init(void)
> > > >  		goto no_tdx;
> > > >  	}
> > > >  
> > > > +	err = register_memory_notifier(&tdx_memory_nb);
> > > > +	if (err) {
> > > > +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> > > > +				err);
> > > > +		goto no_tdx;
> > > > +	}
> > > > +
> > > >  	return 0;
> > > >  no_tdx:
> > > >  	clear_tdx();
> > > > @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/*
> > > > + * Add a memory region as a TDX memory block.  The caller must make sure
> > > > + * all memory regions are added in address ascending order and don't
> > > > + * overlap.
> > > > + */
> > > > +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> > > > +			    unsigned long end_pfn)
> > > > +{
> > > > +	struct tdx_memblock *tmb;
> > > > +
> > > > +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > > > +	if (!tmb)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	INIT_LIST_HEAD(&tmb->list);
> > > > +	tmb->start_pfn = start_pfn;
> > > > +	tmb->end_pfn = end_pfn;
> > > > +
> > > > +	list_add_tail(&tmb->list, tmb_list);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void free_tdx_memlist(struct list_head *tmb_list)
> > > > +{
> > > > +	while (!list_empty(tmb_list)) {
> > > > +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> > > > +				struct tdx_memblock, list);
> > > > +
> > > > +		list_del(&tmb->list);
> > > > +		kfree(tmb);
> > > > +	}
> > > > +}
> > 
> > 'tdx_memlist' is written only once at boot and then is read-only, right?
> > 
> > It might be nice to mention that so that the lack of locking doesn't
> > look problematic.

No. 'tdx_memlist' is only written in init_tdx_module(), and it is only read in
the memory hotplug path. The read/write of 'tdx_memlist' is protected by memory
hotplug locking, as you also mentioned below.

> > 
> > > > +/*
> > > > + * Ensure that all memblock memory regions are convertible to TDX
> > > > + * memory.  Once this has been established, stash the memblock
> > > > + * ranges off in a secondary structure because memblock is modified
> > > > + * in memory hotplug while TDX memory regions are fixed.
> > > > + */
> > 
> > Ahh, that's why we need to "shadow" the memblocks.  Can you add a
> > sentence on this to the changelog, please?

Yes will do.

> > 
> > > > +static int build_tdx_memlist(struct list_head *tmb_list)
> > > > +{
> > > > +	unsigned long start_pfn, end_pfn;
> > > > +	int i, ret;
> > > > +
> > > > +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> > > > +		/*
> > > > +		 * The first 1MB is not reported as TDX convertible memory.
> > > > +		 * Although the first 1MB is always reserved and won't end up
> > > > +		 * to the page allocator, it is still in memblock's memory
> > > > +		 * regions.  Skip them manually to exclude them as TDX memory.
> > > > +		 */
> > > > +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> > > > +		if (start_pfn >= end_pfn)
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * Add the memory regions as TDX memory.  The regions in
> > > > +		 * memblock has already guaranteed they are in address
> > > > +		 * ascending order and don't overlap.
> > > > +		 */
> > > > +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> > > > +		if (ret)
> > > > +			goto err;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +err:
> > > > +	free_tdx_memlist(tmb_list);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static int init_tdx_module(void)
> > > >  {
> > > >  	/*
> > > > @@ -226,10 +364,25 @@ static int init_tdx_module(void)
> > > >  	if (ret)
> > > >  		goto out;
> > > >  
> > > > +	/*
> > > > +	 * The initial support of TDX guests only allocates memory from
> > > > +	 * the global page allocator.  To keep things simple, just make
> > > > +	 * sure all pages in the page allocator are TDX memory.
> > 
> > I didn't like this in the changelog either.  Try to make this "timeless"
> > rather than refer to what the support is today.  I gave you example text
> > above.

Yes will improve.

> > 
> > > > +	 * Build the list of "TDX-usable" memory regions which cover all
> > > > +	 * pages in the page allocator to guarantee that.  Do it while
> > > > +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> > > > +	 * path reads the @tdx_memlist to reject any new memory.
> > > > +	 */
> > > > +	get_online_mems();
> > 
> > Oh, it actually uses the memory hotplug locking for list protection.
> > That's at least a bit subtle.  Please document that somewhere in the
> > functions that actually manipulate the list.

add_tdx_memblock() and free_tdx_memlist() eventually call list_add_tail() and
list_del() to manipulate the list, but they actually take 'struct list_head
*tmb_list' as an argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.

Do you mean document the locking around the implementations of add_tdx_memblock()
and free_tdx_memlist()?

Or should we just mention it around the 'tdx_memlist' variable?

/* All TDX-usable memory regions. Protected by memory hotplug locking. */
static LIST_HEAD(tdx_memlist);
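
Or, to also cover the helpers themselves, something like below (just a sketch,
wording TBD):

/*
 * Add a memory region as a TDX memory block.  The caller must make sure
 * all memory regions are added in address ascending order and don't
 * overlap.  The caller must also hold the memory hotplug lock (e.g. via
 * get_online_mems()), as the resulting list is read by the TDX memory
 * hotplug notifier.
 */
static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
			    unsigned long end_pfn)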

> > 
> > I think it's also worth saying something here about the high-level
> > effects of what's going on:
> > 
> > 	Take a snapshot of the memory configuration (memblocks).  This
> > 	snapshot will be used to enable TDX support for *this* memory
> > 	configuration only.  Use a memory hotplug notifier to ensure
> > 	that no other RAM can be added outside of this configuration.
> > 
> > That's it, right?

Yes. I'll include the above in the comment around get_online_mems().

But should I move the "Use a memory hotplug notifier ..." part to:

	err = register_memory_notifier(&tdx_memory_nb);

because this is where we actually use the memory hotplug notifier?

> > 
> > > > +	ret = build_tdx_memlist(&tdx_memlist);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > >  	/*
> > > >  	 * TODO:
> > > >  	 *
> > > > -	 *  - Build the list of TDX-usable memory regions.
> > > >  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
> > > >  	 *    regions.
> > > >  	 *  - Pick up one TDX private KeyID as the global KeyID.
> > > > @@ -241,6 +394,11 @@ static int init_tdx_module(void)
> > > >  	 */
> > > >  	ret = -EINVAL;
> > > >  out:
> > > > +	/*
> > > > +	 * @tdx_memlist is written here and read at memory hotplug time.
> > > > +	 * Lock out memory hotplug code while building it.
> > > > +	 */
> > > > +	put_online_mems();
> > > >  	return ret;
> > > >  }
> > 
> > You would also be wise to have the folks who do a lot of memory hotplug
> > work today look at this sooner rather than later.  I _think_ what you
> > have here is OK, but I'm really rusty on the code itself.
> > 

Thanks for the advice. Will do.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-09 11:48     ` Huang, Kai
@ 2023-01-09 16:51       ` Dave Hansen
  2023-01-10 12:09         ` Huang, Kai
  2023-01-12 11:33         ` Huang, Kai
  0 siblings, 2 replies; 84+ messages in thread
From: Dave Hansen @ 2023-01-09 16:51 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 03:48, Huang, Kai wrote:
...
>>>>> This approach works as in practice all boot-time present DIMMs are TDX
>>>>> convertible memory.  However, if any non-TDX-convertible memory has been
>>>>> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
>>>>> module, the module initialization will fail.
>>>
>>> I really don't know what this is trying to say.
> 
> My intention is to explain and call out that such design (use all memory regions
> in memblock at the time of module initialization) works in practice, as long as
> non-CMR memory hasn't been added via memory hotplug.
> 
> Not sure if it is necessary, but I was thinking it may help reviewer to judge
> whether such design is acceptable.

This is yet another case where you've mechanically described the "what",
but left out the implications or the underlying basis "why".

I'd take a more methodical approach to describe what is going on here.
List the steps that must occur, or at least *one* example of those steps
and how they interact with the code in this patch.  Then, explain the
fallout.

I also don't think it's quite right to call out "CXL memory via kmem
driver".  If the CXL memory was "System RAM", it should get covered by a
CMR and TDMR.  The kmem driver can still go wild with it.

>>> *How* and *why* does this module initialization failure occur?
> 
> If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
> will fail.

I'm frankly lost now.  Please go back and try to explain this better.
Let me know if you want to iterate on this faster than resending this
series five more times.  I've got some ideas.

>>>>> This can also be enhanced in the future, i.e. by allowing adding non-TDX
>>>>> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>>>>> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>>>>> needs to guarantee memory pages for TDX guests are always allocated from
>>>>> the "TDX-capable" nodes.
>>>
>>> Why does it need to be enhanced?  What's the problem?
> 
> The problem is after TDX module initialization, no more memory can be hot-added
> to the page allocator.
> 
> Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> utilize all memory.
> 
> But probably it is not necessarily to call out in the changelog?

Let's say that we add this TDX-compatible-node ABI in the future.  What
will old code do that doesn't know about this ABI?

...
>>>>> +       list_for_each_entry(tmb, &tdx_memlist, list) {
>>>>> +               /*
>>>>> +                * The new range is TDX memory if it is fully covered by
>>>>> +                * any TDX memory block.
>>>>> +                *
>>>>> +                * Note TDX memory blocks are originated from memblock
>>>>> +                * memory regions, which can only be contiguous when two
>>>>> +                * regions have different NUMA nodes or flags.  Therefore
>>>>> +                * the new range cannot cross multiple TDX memory blocks.
>>>>> +                */
>>>>> +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
>>>>> +                       return true;
>>>>> +       }
>>>>> +       return false;
>>>>> +}
>>>
>>> I don't really like that comment.  It should first state its behavior
>>> and assumptions, like:
>>>
>>>     This check assumes that the start_pfn<->end_pfn range does not
>>>     cross multiple tdx_memlist entries.
>>>
>>> Only then should it describe why that is OK:
>>>
>>>     A single memory hotplug event across multiple memblocks (from
>>>     which tdx_memlist entries are derived) is impossible.  ... then
>>>     actually explain
>>>
> 
> How about below?
> 
>         /*
>          * This check assumes that the start_pfn<->end_pfn range does not cross
>          * multiple tdx_memlist entries. A single memory hotplug event across
>          * multiple memblocks (from which tdx_memlist entries are derived) is
>          * impossible. That means start_pfn<->end_pfn range cannot exceed a
>          * tdx_memlist entry, and the new range is TDX memory if it is fully
>          * covered by any tdx_memlist entry.
>          */

I was hoping you would actually explain why it is impossible.

Is there something fundamental that keeps a memory area that spans two
nodes from being removed and then a new area added that is comprised of
a single node?

Boot time:

	| memblock  |  memblock |
	<--Node=0--> <--Node=1-->

Funky hotplug... nothing to see here, then:

	<--------Node=2-------->

I would believe that there is no current bare-metal TDX system that has
an implementation like this.  But, the comments above speak like it's
fundamentally impossible.  That should be clarified.

In other words, that comment talks about memblock attributes as being
the core underlying reason that that simplified check is OK.  Is that
it, or is it really the reduced hotplug feature set on TDX systems?


...
>>>>> +        * Build the list of "TDX-usable" memory regions which cover all
>>>>> +        * pages in the page allocator to guarantee that.  Do it while
>>>>> +        * holding mem_hotplug_lock read-lock as the memory hotplug code
>>>>> +        * path reads the @tdx_memlist to reject any new memory.
>>>>> +        */
>>>>> +       get_online_mems();
>>>
>>> Oh, it actually uses the memory hotplug locking for list protection.
>>> That's at least a bit subtle.  Please document that somewhere in the
>>> functions that actually manipulate the list.
> 
> add_tdx_memblock() and free_tdx_memlist() eventually calls list_add_tail() and
> list_del() to manipulate the list, but they actually takes 'struct list_head
> *tmb_list' as argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.
> 
> Do you mean document the locking around the implementation of add_tdx_memblock()
> and free_tdx_memlist()?
> 
> Or should we just mention it around the 'tdx_memlist' variable?
> 
> /* All TDX-usable memory regions. Protected by memory hotplug locking. */
> static LIST_HEAD(tdx_memlist);

I don't think I'd hate it being in all three spots.  Also "protected by
memory hotplug locking" is pretty generic.  Please be more specific.

>>> I think it's also worth saying something here about the high-level
>>> effects of what's going on:
>>>
>>>     Take a snapshot of the memory configuration (memblocks).  This
>>>     snapshot will be used to enable TDX support for *this* memory
>>>     configuration only.  Use a memory hotplug notifier to ensure
>>>     that no other RAM can be added outside of this configuration.
>>>
>>> That's it, right?
> 
> Yes. I'll somehow include above into the comment around get_online_mems().
> 
> But should I move "Use a memory hotplug notifier ..." part to:
> 
>         err = register_memory_notifier(&tdx_memory_nb);
> 
> because this is where we actually use the memory hotplug notifier?

I actually want that snippet in the changelog.

>>>>> +       ret = build_tdx_memlist(&tdx_memlist);
>>>>> +       if (ret)
>>>>> +               goto out;
>>>>> +
>>>>>         /*
>>>>>          * TODO:
>>>>>          *
>>>>> -        *  - Build the list of TDX-usable memory regions.
>>>>>          *  - Construct a list of TDMRs to cover all TDX-usable memory
>>>>>          *    regions.
>>>>>          *  - Pick up one TDX private KeyID as the global KeyID.
>>>>> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>>>>>          */
>>>>>         ret = -EINVAL;
>>>>>  out:
>>>>> +       /*
>>>>> +        * @tdx_memlist is written here and read at memory hotplug time.
>>>>> +        * Lock out memory hotplug code while building it.
>>>>> +        */
>>>>> +       put_online_mems();
>>>>>         return ret;
>>>>>  }
>>>
>>> You would also be wise to have the folks who do a lot of memory hotplug
>>> work today look at this sooner rather than later.  I _think_ what you
>>> have here is OK, but I'm really rusty on the code itself.
>>>
> 
> Thanks for advice. Will do.
> 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-01-09 10:25     ` Huang, Kai
@ 2023-01-09 19:52       ` Dave Hansen
  2023-01-09 22:07         ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-09 19:52 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 02:25, Huang, Kai wrote:
> On Fri, 2023-01-06 at 09:46 -0800, Dave Hansen wrote:
...
>>> Note not all members in the 1024 bytes TDX module information are used
>>> (even by the KVM).
>>
>> I'm not sure what this has to do with anything.
> 
> You mentioned in v7 that:
>>>> This is also a great place to mention that the tdsysinfo_struct
>>>> contains a *lot* of gunk which will not be used for a bit or that
>>>> may never get used.
> 
> https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#m168e619aac945fa418ccb1d6652113003243d895
> 
> Perhaps I misunderstood something but I was trying to address this.
> 
> Should I remove this sentence?

If someone goes looking at this patch, they see tdsysinfo_struct with
something like two dozen defined fields.  But, very few of them get used
in this patch.  Why?  Just saying that they are unused is a bit silly.

	The 'tdsysinfo_struct' is fairly large (1k) and contains a lot
	of info about the TD.  Fully define the entire structure, but
	only use the fields necessary to build the PAMT and TDMRs and
	pr_info() some basics about the module.

	The rest of the fields will get used... (by kvm?  never??)

...
>>> +	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
>>> +	int ret;
>>> +
>>> +	ret = tdx_get_sysinfo(sysinfo, cmr_array);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>>  	/*
>>>  	 * TODO:
>>>  	 *
>>> -	 *  - Get TDX module information and TDX-capable memory regions.
>>>  	 *  - Build the list of TDX-usable memory regions.
>>>  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
>>>  	 *    regions.
>>> @@ -166,7 +239,9 @@ static int init_tdx_module(void)
>>>  	 *
>>>  	 *  Return error before all steps are done.
>>>  	 */
>>> -	return -EINVAL;
>>> +	ret = -EINVAL;
>>> +out:
>>> +	return ret;
>>>  }
>>
>> I'm going to be lazy and not look into the future.  But, you don't need
> > the "out:" label here, yet.  It doesn't serve any purpose like this, so
>> why introduce it here?
> 
> The 'out' label is here because of below code:
> 
> 	ret = tdx_get_sysinfo(...);
> 	if (ret)
> 		goto out;
> 
> If I don't have 'out' label here in this patch, do you mean something below?
> 
> 	ret = tdx_get_sysinfo(...);
> 	if (ret)
> 		return ret;
> 
> 	/*
> 	 * TODO:
> 	 * ...
> 	 * Return error before all steps are done.
> 	 */
> 	return -EINVAL;

Yes, if you remove the 'out:' label like you've shown in your reply,
it's actually _less_ code.  The labels are really only necessary when
you have common work to "undo" something before returning from the
function.  Here, you can just return.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL
  2023-01-09 10:30     ` Huang, Kai
@ 2023-01-09 19:54       ` Dave Hansen
  2023-01-09 22:10         ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-09 19:54 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 02:30, Huang, Kai wrote:
>> I don't know that you really _need_ to split this up, but I'm just
>> pointing out that mashing three different logical things together makes
>> it hard to write a coherent Subject.  But, I've seen worse.
> Agreed.
> 
> To me seems "Add SEAMCALL infrastructure" is good enough, but I can split up if
> you want me to.

Everything else being equal, I'd rather have them split up.  But, I'm
frankly not looking forward to the additional work on my part to review
and rework three more patches and changelogs.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-01-09 19:52       ` Dave Hansen
@ 2023-01-09 22:07         ` Huang, Kai
  2023-01-09 22:11           ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-09 22:07 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-01-09 at 11:52 -0800, Hansen, Dave wrote:
> On 1/9/23 02:25, Huang, Kai wrote:
> > On Fri, 2023-01-06 at 09:46 -0800, Dave Hansen wrote:
> ...
> > > > Note not all members in the 1024 bytes TDX module information are used
> > > > (even by the KVM).
> > > 
> > > I'm not sure what this has to do with anything.
> > 
> > You mentioned in v7 that:
> > > > > This is also a great place to mention that the tdsysinfo_struct
> > > > > contains a *lot* of gunk which will not be used for a bit or that
> > > > > may never get used.
> > 
> > https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#m168e619aac945fa418ccb1d6652113003243d895
> > 
> > Perhaps I misunderstood something but I was trying to address this.
> > 
> > Should I remove this sentence?
> 
> If someone goes looking at this patch, the see tdsysinfo_struct with
> something like two dozen defined fields.  But, very few of them get used
> in this patch.  Why?  Just saying that they are unused is a bit silly.
> 
> 	The 'tdsysinfo_struct' is fairly large (1k) and contains a lot
> 	of info about the TD.  Fully define the entire structure, but
			  ^
		should be: "about the TDX module"?
			
> 	only use the fields necessary to build the PAMT and TDMRs and
> 	pr_info() some basics about the module.

Above looks great!  Thanks.

> 
> 	The rest of the fields will get used... (by kvm?  never??)

The current KVM TDX support series uses the majority of the remaining fields:

https://lore.kernel.org/lkml/99e5fcf2a7127347816982355fd4141ee1038a54.1667110240.git.isaku.yamahata@intel.com/

Only one field isn't used, but I don't want to assume it won't be used forever,
so I think "The rest of the fields will get used by KVM." is good enough.

> 
> ...
> > > > +	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
> > > > +	int ret;
> > > > +
> > > > +	ret = tdx_get_sysinfo(sysinfo, cmr_array);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > >  	/*
> > > >  	 * TODO:
> > > >  	 *
> > > > -	 *  - Get TDX module information and TDX-capable memory regions.
> > > >  	 *  - Build the list of TDX-usable memory regions.
> > > >  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
> > > >  	 *    regions.
> > > > @@ -166,7 +239,9 @@ static int init_tdx_module(void)
> > > >  	 *
> > > >  	 *  Return error before all steps are done.
> > > >  	 */
> > > > -	return -EINVAL;
> > > > +	ret = -EINVAL;
> > > > +out:
> > > > +	return ret;
> > > >  }
> > > 
> > > I'm going to be lazy and not look into the future.  But, you don't need
> > > the "out:" label here, yet.  It doesn't serve any purpose like this, so
> > > why introduce it here?
> > 
> > The 'out' label is here because of below code:
> > 
> > 	ret = tdx_get_sysinfo(...);
> > 	if (ret)
> > 		goto out;
> > 
> > If I don't have 'out' label here in this patch, do you mean something below?
> > 
> > 	ret = tdx_get_sysinfo(...);
> > 	if (ret)
> > 		return ret;
> > 
> > 	/*
> > 	 * TODO:
> > 	 * ...
> > 	 * Return error before all steps are done.
> > 	 */
> > 	return -EINVAL;
> 
> Yes, if you remove the 'out:' label like you've shown in your reply,
> it's actually _less_ code.  The labels are really only necessary when
> you have common work to "undo" something before returning from the
> function.  Here, you can just return.
> 

Thanks, will do.

I think this applies to construct_tdmrs() too (patches 09 - 11).  I'll check that
part too based on your idea above.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL
  2023-01-09 19:54       ` Dave Hansen
@ 2023-01-09 22:10         ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-09 22:10 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-01-09 at 11:54 -0800, Dave Hansen wrote:
> On 1/9/23 02:30, Huang, Kai wrote:
> > > I don't know that you really _need_ to split this up, but I'm just
> > > pointing out that mashing three different logical things together makes
> > > it hard to write a coherent Subject.  But, I've seen worse.
> > Agreed.
> > 
> > To me seems "Add SEAMCALL infrastructure" is good enough, but I can split up if
> > you want me to.
> 
> Everything else being equal, I'd rather have them split up.  But, I'm
> frankly not looking forward to the additional work on my part to review
> and rework three more patches and changelogs.

Yes I agree splitting up is better, but for the sake of not adding new review
work for you I'll just change the patch title.  Thanks!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-01-09 22:07         ` Huang, Kai
@ 2023-01-09 22:11           ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2023-01-09 22:11 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 14:07, Huang, Kai wrote:
>>
>>       The 'tdsysinfo_struct' is fairly large (1k) and contains a lot
>>       of info about the TD.  Fully define the entire structure, but
>                           ^
>                 should be: "about the TDX module"?
> 

Yes, of course.  Thinko on my part.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-01-06 19:24   ` Dave Hansen
@ 2023-01-10  0:40     ` Huang, Kai
  2023-01-10  0:47       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10  0:40 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 11:24 -0800, Dave Hansen wrote:
> > +struct tdmr_info_list {
> > +	struct tdmr_info *first_tdmr;
> 
> This is named badly.  This is really a pointer to an array.  While it
> _does_ of course point to the first member of the array, the naming
> should make it clear that there are multiple tdmr_infos here.

Will change to 'tdmrs' as in your code.

> 
> > +	int tdmr_sz;
> > +	int max_tdmrs;
> > +	int nr_tdmrs;	/* Actual number of TDMRs */
> > +};
> 
> This 'tdmr_info_list' is declared in an unfortunate place.  I thought
> the tdmr_size_single() function below was related to it.

I think I can move it to "tdx.h", which is claimed to have both TDX-arch data
structures and Linux-defined structures anyway.

I think I can also move the 'enum tdx_module_status_t' and 'struct tdx_memblock'
declarations to "tdx.h" too so that all declarations are in "tdx.h".

> 
> Also, tdmr_sz and max_tdmrs can both be derived from 'sysinfo'.  Do they
> really need to be stored here?  

It's not mandatory to keep them here.  I did it mainly because I want to avoid
passing 'sysinfo' as an argument to almost all functions related to constructing
TDMRs.

For instance, 'tdmr_sz' is used to calculate the position of each individual
TDMR at a given index.  Instead of passing an additional 'sysinfo' (or
sysinfo->max_reserved_per_tdmr):

	struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, int idx,
				     struct tdsysinfo_struct *sysinfo) { ... }

I prefer:

	struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, int idx)
	{...}

tdmr_entry() is basically called in all 3 steps (fill out TDMRs, allocate PAMTs,
and designate reserved areas).  Having 'sysinfo' in it would require almost all
functions related to constructing TDMRs to take 'sysinfo' as an argument, which
only makes the code more complicated and hurts readability IMHO.

> If so, I think I'd probably do something
> like this with the structure:
> 
> struct tdmr_info_list {
> 	struct tdmr_info *tdmrs;
> 	int nr_consumed_tdmrs; // How many @tdmrs are in use
> 
> 	/* Metadata for freeing this structure: */
> 	int tdmr_sz;   // Size of one 'tdmr_info' (has a flex array)
> 	int max_tdmrs; // How many @tdmrs are allocated
> };
> 
> Modulo whatever folks are doing for comments these days.

Looks nice to me.  Will use.  A slight thing is 'tdmr_sz' is also used to get
the TDMR at a given index, but not just freeing the structure.
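
E.g., with the renamed 'tdmrs' member, tdmr_entry() would look roughly like
below (untested sketch):

	static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, int idx)
	{
		/* Each entry is 'tdmr_sz' bytes, so index in byte granularity. */
		return (void *)tdmr_list->tdmrs + tdmr_list->tdmr_sz * idx;
	}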

Btw, is C++ style comment "//" OK in kernel code?
> 
> > +/* Calculate the actual TDMR size */
> > +static int tdmr_size_single(u16 max_reserved_per_tdmr)
> > +{
> > +	int tdmr_sz;
> > +
> > +	/*
> > +	 * The actual size of TDMR depends on the maximum
> > +	 * number of reserved areas.
> > +	 */
> > +	tdmr_sz = sizeof(struct tdmr_info);
> > +	tdmr_sz += sizeof(struct tdmr_reserved_area) *
> > max_reserved_per_tdmr;
> > +
> > +	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
> > +}
> > +
> > +static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
> > +			   struct tdsysinfo_struct *sysinfo)
> > +{
> > +	size_t tdmr_sz, tdmr_array_sz;
> > +	void *tdmr_array;
> > +
> > +	tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr);
> > +	tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs;
> > +
> > +	/*
> > +	 * To keep things simple, allocate all TDMRs together.
> > +	 * The buffer needs to be physically contiguous to make
> > +	 * sure each TDMR is physically contiguous.
> > +	 */
> > +	tdmr_array = alloc_pages_exact(tdmr_array_sz,
> > +			GFP_KERNEL | __GFP_ZERO);
> > +	if (!tdmr_array)
> > +		return -ENOMEM;
> > +
> > +	tdmr_list->first_tdmr = tdmr_array;
> > +	/*
> 
> 	^ probably missing whitespace before the comment
> 

Will add, assuming you mean a new empty line.  Thanks for the tip.


> > +	 * Keep the size of TDMR to find the target TDMR
> > +	 * at a given index in the TDMR list.
> > +	 */
> > +	tdmr_list->tdmr_sz = tdmr_sz;
> > +	tdmr_list->max_tdmrs = sysinfo->max_tdmrs;
> > +	tdmr_list->nr_tdmrs = 0;
> > +
> > +	return 0;
> > +}
> > +

[snip]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 09/16] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  2023-01-06 19:36   ` Dave Hansen
@ 2023-01-10  0:45     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-10  0:45 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 11:36 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > Start to transit out the "multi-steps" to construct a list of "TD Memory
> > Regions" (TDMRs) to cover all TDX-usable memory regions.
> > 
> > The kernel configures TDX-usable memory regions by passing a list of
> > TDMRs "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
> > the information of the base/size of a memory region, the base/size of the
> > associated Physical Address Metadata Table (PAMT) and a list of reserved
> > areas in the region.
> > 
> > Do the first step to fill out a number of TDMRs to cover all TDX memory
> > regions.  To keep it simple, always try to use one TDMR for each memory
> > region.  As the first step only set up the base/size for each TDMR.
> > 
> > Each TDMR must be 1G aligned and the size must be in 1G granularity.
> > This implies that one TDMR could cover multiple memory regions.  If a
> > memory region spans the 1GB boundary and the former part is already
> > covered by the previous TDMR, just use a new TDMR for the remaining
> > part.
> > 
> > TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
> > are consumed but there is more memory region to cover.
> 
> This could probably use some discussion of why it is not being
> future-proofed.  Maybe:
> 
> 	There are fancier things that could be done like trying to merge
> 	adjacent TDMRs.  This would allow more pathological memory
> 	layouts to be supported.  But, current systems are not even
> 	close to exhausting the existing TDMR resources in practice.
> 	For now, keep it simple.

Looks great.  Thanks.

> 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index d36ac72ef299..5b1de0200c6b 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -407,6 +407,90 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
> >  			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
> >  }
> >  
> > +/* Get the TDMR from the list at the given index. */
> > +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> > +				    int idx)
> > +{
> > +	return (struct tdmr_info *)((unsigned long)tdmr_list->first_tdmr +
> > +			tdmr_list->tdmr_sz * idx);
> > +}
> 
> I think that's more complicated and has more casting than necessary.
> This looks nicer:
> 
> 	int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
> 
> 	return (void *)tdmr_list->first_tdmr + tdmr_info_offset;
> 
> Also, it might even be worth keeping ->first_tdmr as a void*.  It isn't
> a real C array and keeping it as void* would keep anyone from doing:
> 
> 	tdmr_foo = tdmr_list->first_tdmr[foo];

Yes good point.  Will do.
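
E.g. (sketch; with the rename to 'tdmrs' discussed in the patch 08 thread):

	struct tdmr_info_list {
		void *tdmrs;	/* Flexible array of 'tdmr_info', opaque to callers */
		...
	};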

[snip]


> 
> Otherwise this actually looks fine.

Thanks.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-01-07  0:47   ` Dave Hansen
@ 2023-01-10  0:47     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-10  0:47 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 16:47 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > @@ -574,6 +774,11 @@ static int init_tdx_module(void)
> >  	 *  Return error before all steps are done.
> >  	 */
> >  	ret = -EINVAL;
> > +	if (ret)
> > +		tdmrs_free_pamt_all(&tdmr_list);
> > +	else
> > +		pr_info("%lu pages allocated for PAMT.\n",
> > +				tdmrs_count_pamt_pages(&tdmr_list));
> 
> Could you please convert this to megabytes or kilobytes?  dmesg is for
> humans and humans don't generally know how large their systems or DIMMs
> are in pages without looking or grabbing a calculator.

Will convert to print out kilobytes.
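
Something like below (untested), keeping tdmrs_count_pamt_pages() as-is and
doing the pages-to-KBs conversion at the pr_info() site:

	pr_info("%lu KBs allocated for PAMT.\n",
		tdmrs_count_pamt_pages(&tdmr_list) << (PAGE_SHIFT - 10));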

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-01-10  0:40     ` Huang, Kai
@ 2023-01-10  0:47       ` Dave Hansen
  2023-01-10  2:23         ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10  0:47 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 16:40, Huang, Kai wrote:
> On Fri, 2023-01-06 at 11:24 -0800, Dave Hansen wrote:
...
>> Also, tdmr_sz and max_tdmrs can both be derived from 'sysinfo'.  Do they
>> really need to be stored here?
> 
> It's not mandatory to keep them here.  I did it mainly because I want to avoid
> passing 'sysinfo' as argument for almost all functions related to constructing
> TDMRs.

I don't think it hurts readability that much.  On the contrary, it makes
it more clear what data is needed for initialization.

>> If so, I think I'd probably do something
>> like this with the structure:
>>
>> struct tdmr_info_list {
>>       struct tdmr_info *tdmrs;
>>       int nr_consumed_tdmrs; // How many @tdmrs are in use
>>
>>       /* Metadata for freeing this structure: */
>>       int tdmr_sz;   // Size of one 'tdmr_info' (has a flex array)
>>       int max_tdmrs; // How many @tdmrs are allocated
>> };
>>
>> Modulo whatever folks are doing for comments these days.
> 
> Looks nice to me.  Will use.  A slight thing is 'tdmr_sz' is also used to get
> the TDMR at a given index, but not just freeing the structure.
> 
> Btw, is C++ style comment "//" OK in kernel code?

It's OK with me, but I don't think there's much consensus on it.
Probably best to stick with normal arch/x86 style for now.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-01-06 21:53   ` Dave Hansen
@ 2023-01-10  0:49     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-10  0:49 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 13:53 -0800, Dave Hansen wrote:
> Looks good so far.
> 
> > +/*
> > + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
> > + * within @tdmr, and set up PAMTs for @tdmr.
> > + */
> > +static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > +			    struct list_head *tmb_list,
> > +			    u16 pamt_entry_size)
> > +{
> > +	unsigned long pamt_base[TDX_PS_1G + 1];
> > +	unsigned long pamt_size[TDX_PS_1G + 1];
> 
> Nit: please define a TDX_PS_NR rather than open-coding this.

Will do.
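
E.g. (assuming the existing TDX_PS_* macros):

	#define TDX_PS_NR	(TDX_PS_1G + 1)

so the two arrays above become:

	unsigned long pamt_base[TDX_PS_NR];
	unsigned long pamt_size[TDX_PS_NR];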

> 
> > +	unsigned long tdmr_pamt_base;
> > +	unsigned long tdmr_pamt_size;
> > +	struct page *pamt;
> > +	int pgsz, nid;
> > +
> > +	nid = tdmr_get_nid(tdmr, tmb_list);
> > +
> > +	/*
> > +	 * Calculate the PAMT size for each TDX supported page size
> > +	 * and the total PAMT size.
> > +	 */
> > +	tdmr_pamt_size = 0;
> > +	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
> > +		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
> > +				pamt_entry_size);
> 
> This alignment is wonky.  Should be way over here:
> 
> > +						   pamt_entry_size);

Will do.

> 
> > +		tdmr_pamt_size += pamt_size[pgsz];
> > +	}
> > +
> > 

[snip]

> 
> Other than the two nits:
> 
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> 

Thanks.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-06 22:07   ` Dave Hansen
@ 2023-01-10  1:19     ` Huang, Kai
  2023-01-10  1:22       ` Dave Hansen
  2023-01-10 11:01       ` Huang, Kai
  0 siblings, 2 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-10  1:19 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 14:07 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
> > +			      u64 size, u16 max_reserved_per_tdmr)
> > +{
> > +	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
> > +	int idx = *p_idx;
> > +
> > +	/* Reserved area must be 4K aligned in offset and size */
> > +	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
> > +		return -EINVAL;
> > +
> > +	if (idx >= max_reserved_per_tdmr)
> > +		return -E2BIG;
> > +
> > +	rsvd_areas[idx].offset = addr - tdmr->base;
> > +	rsvd_areas[idx].size = size;
> > +
> > +	*p_idx = idx + 1;
> > +
> > +	return 0;
> > +}
> 
> It's probably worth at least a comment here to say:
> 
> 	/*
> 	 * Consume one reserved area per call.  Make no effort to
> 	 * optimize or reduce the number of reserved areas which are
> 	 * consumed by contiguous reserved areas, for instance.
> 	 */

I'll add this comment before the code to set up rsvd_areas[idx].

> 
> I think the -E2BIG is also wrong.  It should be ENOSPC.  I'd also add a
> pr_warn() there.  Especially with how lazy this whole thing is, I can
> start to see how the reserved areas might be exhausted.  Let's be kind
> to our future selves and make the error (and the fix) easier to find.

Yes agreed.  Will change to -ENOSPC and add pr_warn().
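
Something like below (untested sketch; assuming 'tdmr->size' is available to
print the full range, exact message wording TBD):

	if (idx >= max_reserved_per_tdmr) {
		pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
				tdmr->base, tdmr->base + tdmr->size);
		return -ENOSPC;
	}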

> It's probably also worth noting *somewhere* that there's a balance to be
> had between TDMRs and reserved areas.  A system that is running out of
> reserved areas in a TDMR could split a TDMR to get more reserved areas.
> A system that has run out of TDMRs could relatively easily coalesce two
> adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
> if there was a gap between them.

We can add the above to the changelog of this patch, or to patch 09 ("x86/virt/tdx:
Fill out TDMRs to cover all TDX memory regions").  The latter is perhaps better
since that patch is the first place where the balance between TDMRs and reserved
areas comes into play.

What is your suggestion?

> 
> I'm *really* close to acking this patch once those are fixed up.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-10  1:19     ` Huang, Kai
@ 2023-01-10  1:22       ` Dave Hansen
  2023-01-10 11:01         ` Huang, Kai
  2023-01-10 11:01       ` Huang, Kai
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10  1:22 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 17:19, Huang, Kai wrote:
>> It's probably also worth noting *somewhere* that there's a balance to be
>> had between TDMRs and reserved areas.  A system that is running out of
>> reserved areas in a TDMR could split a TDMR to get more reserved areas.
>> A system that has run out of TDMRs could relatively easily coalesce two
>> adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
>> if there was a gap between them.
> We can add above to the changelog of this patch, or the patch 09 ("x86/virt/tdx:
> Fill out TDMRs to cover all TDX memory regions").  The latter perhaps is better
> since that patch is the first place where the balance of TDMRs and reserved
> areas is related.
> 
> What is your suggestion?

Just put it close to the code that actually hits the problem so the
potential solution is close at hand to whoever hits the problem.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-01-10  0:47       ` Dave Hansen
@ 2023-01-10  2:23         ` Huang, Kai
  2023-01-10 19:12           ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10  2:23 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-01-09 at 16:47 -0800, Dave Hansen wrote:
> On 1/9/23 16:40, Huang, Kai wrote:
> > On Fri, 2023-01-06 at 11:24 -0800, Dave Hansen wrote:
> ...
> > > Also, tdmr_sz and max_tdmrs can both be derived from 'sysinfo'.  Do they
> > > really need to be stored here?
> > 
> > It's not mandatory to keep them here.  I did it mainly because I want to avoid
> > passing 'sysinfo' as argument for almost all functions related to constructing
> > TDMRs.
> 
> I don't think it hurts readability that much.  On the contrary, it makes
> it more clear what data is needed for initialization.

Sorry, one thing I forgot to mention: if we keep 'tdmr_sz' in 'struct
tdmr_info_list', it only needs to be calculated once, when allocating the
buffer.  Otherwise, we need to calculate it based on
sysinfo->max_reserved_per_tdmr each time we want to get a TDMR at a given index.

To me, putting the relevant fields (tdmrs, tdmr_sz, max_tdmrs, nr_consumed_tdmrs)
together makes how the TDMR list is organized more clear.  But please let me
know if you prefer removing 'tdmr_sz' and 'max_tdmrs'.

Btw, if we remove 'tdmr_sz' and 'max_tdmrs', even nr_consumed_tdmrs is not
absolutely necessary here.  It can be a local variable of init_tdx_module() (as
shown in v7), and the 'struct tdmr_info_list' will only have the 'tdmrs' member
(as you commented in v7):

https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#mb9826e2bcf8bf6399c13cc5f95a948fe4b3a46d9

Please let me know what's your preference?

> 
> > > If so, I think I'd probably do something
> > > like this with the structure:
> > > 
> > > struct tdmr_info_list {
> > >       struct tdmr_info *tdmrs;
> > >       int nr_consumed_tdmrs; // How many @tdmrs are in use
> > > 
> > >       /* Metadata for freeing this structure: */
> > >       int tdmr_sz;   // Size of one 'tdmr_info' (has a flex array)
> > >       int max_tdmrs; // How many @tdmrs are allocated
> > > };
> > > 
> > > Modulo whatever folks are doing for comments these days.
> > 
> > Looks nice to me.  Will use.  A slight thing is 'tdmr_sz' is also used to get
> > the TDMR at a given index, but not just freeing the structure.
> > 
> > Btw, is C++ style comment "//" OK in kernel code?
> 
> It's OK with me, but I don't think there's much consensus on it.
> Probably best to stick with normal arch/x86 style for now.
> 
> 

Will use normal arch/x86 style for now.  Thanks for the info.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages
  2023-01-06 22:49   ` Dave Hansen
@ 2023-01-10 10:15     ` Huang, Kai
  2023-01-10 16:53       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 10:15 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 14:49 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > After the list of TDMRs and the global KeyID are configured to the TDX
> > module, the kernel needs to configure the key of the global KeyID on all
> > packages using TDH.SYS.KEY.CONFIG.
> > 
> > TDH.SYS.KEY.CONFIG needs to be done on one (any) cpu for each package.
> > Also, it cannot run concurrently on different cpus, so just use
> > smp_call_function_single() to do it one by one.
> > 
> > Note to keep things simple, neither the function to configure the global
> > KeyID on all packages nor the tdx_enable() checks whether there's at
> > least one online cpu for each package.  Also, neither of them explicitly
> > prevents any cpu from going offline.  It is caller's responsibility to
> > guarantee this.
> 
> OK, but does someone *actually* do this?

Please see my reply below, around the code.

> 
> > Intel hardware doesn't guarantee cache coherency across different
> > KeyIDs.  The kernel needs to flush PAMT's dirty cachelines (associated
> > with KeyID 0) before the TDX module uses the global KeyID to access the
> > PAMT.  Otherwise, those dirty cachelines can silently corrupt the TDX
> > module's metadata.  Note this breaks TDX from functionality point of
> > view but TDX's security remains intact.
> 
> 	Intel hardware doesn't guarantee cache coherency across
> 	different KeyIDs.  The PAMTs are transitioning from being used
> 	by the kernel mapping (KeyId 0) to the TDX module's "global
> 	KeyID" mapping.
> 
> 	This means that the kernel must flush any dirty KeyID-0 PAMT
> 	cachelines before the TDX module uses the global KeyID to access
> 	the PAMT.  Otherwise, if those dirty cachelines were written
> 	back, they would corrupt the TDX module's metadata.  Aside: This
> 	corruption would be detected by the memory integrity hardware on
> 	the next read of the memory with the global KeyID.  The result
> 	would likely be fatal to the system but would not impact TDX
> 	security.

Thanks!

> 
> > Following the TDX module specification, flush cache before configuring
> > the global KeyID on all packages.  Given the PAMT size can be large
> > (~1/256th of system RAM), just use WBINVD on all CPUs to flush.
> > 
> > Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have
> > used the global KeyID to write any PAMT.  Therefore, need to use WBINVD
> > to flush cache before freeing the PAMTs back to the kernel.
> 
> 						s/need to// ^

Will do.  Thanks.

> 
> 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index ab961443fed5..4c779e8412f1 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -946,6 +946,66 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
> >  	return ret;
> >  }
> >  
> > +static void do_global_key_config(void *data)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a
> > +	 * recoverable error).  Assume this is exceedingly rare and
> > +	 * just return error if encountered instead of retrying.
> > +	 */
> > +	ret = seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
> > +
> > +	*(int *)data = ret;
> > +}
> > +
> > +/*
> > + * Configure the global KeyID on all packages by doing TDH.SYS.KEY.CONFIG
> > + * on one online cpu for each package.  If any package doesn't have any
> > + * online
> 
> This looks like it stopped mid-sentence.

Oops I forgot to delete the broken sentence.

> 
> > + * Note:
> > + *
> > + * This function neither checks whether there's at least one online cpu
> > + * for each package, nor explicitly prevents any cpu from going offline.
> > + * If any package doesn't have any online cpu then the SEAMCALL won't be
> > + * done on that package and the later step of TDX module initialization
> > + * will fail.  The caller needs to guarantee this.
> > + */
> 
> *Does* the caller guarantee it?
> 
> You're basically saying, "this code needs $FOO to work", but you're not
> saying who *provides* $FOO.

In short, KVM can do something to help guarantee this, but won't 100%
guarantee it.

Specifically, KVM won't actively try to bring up a CPU if any package has no
online CPU at all (see the first lore link below).  But KVM can _check_
whether this condition has been met before calling tdx_init() and report an
error if not.  In the meantime, if the condition is met, it can refuse to
offline the last CPU of each package (or any CPU) during module
initialization.

And KVM needs similar handling anyway.  The reason is that not only does
configuring the global KeyID have this requirement; creating/destroying a TD
(which involves programming/reclaiming one TDX KeyID) also requires at least
one online CPU for each package.

There were discussions on the KVM side about how to handle this.  IIUC the
solution is that KVM will:
1) fail to create a TD if any package has no online CPU;
2) refuse to offline the last CPU of each package when there's any _active_
TDX guest running.

https://lore.kernel.org/lkml/20221102231911.3107438-1-seanjc@google.com/T/#m1ff338686cfcb7ba691cd969acc17b32ff194073
https://lore.kernel.org/lkml/de6b69781a6ba1fe65535f48db2677eef3ec6a83.1667110240.git.isaku.yamahata@intel.com/

Thus TDX module initialization in KVM can be handled in a similar way.

Btw, in v7 (which requires per-LP init on all CPUs), tdx_init() does an early
check on whether all boot-time present CPUs are online and simply returns an
error if the condition is not met.  The difference here is that we don't do
any check but depend on the SEAMCALL to fail.  To me there's no fundamental
difference.
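
For reference, a sketch of the per-package loop the changelog describes
(illustrative only, e.g. using the package ID as a bit index into a cpumask
for the "one call per package" bookkeeping; not necessarily the exact patch
code):

static int config_global_keyid(void)
{
	cpumask_var_t packages;
	int cpu, ret = 0;

	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
		return -ENOMEM;

	for_each_online_cpu(cpu) {
		int err;

		/* Skip packages which have already been configured. */
		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
				packages))
			continue;

		/*
		 * TDH.SYS.KEY.CONFIG cannot run concurrently on
		 * different CPUs, so configure one package at a time.
		 */
		ret = smp_call_function_single(cpu, do_global_key_config,
				&err, true);
		if (ret)
			break;
		if (err) {
			ret = err;
			break;
		}
	}

	free_cpumask_var(packages);
	return ret;
}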

[snip]

> 
> >  static int init_tdx_module(void)
> >  {
> >  	/*
> > @@ -998,19 +1058,46 @@ static int init_tdx_module(void)
> >  	if (ret)
> >  		goto out_free_pamts;
> >  
> > +	/*
> > +	 * Hardware doesn't guarantee cache coherency across different
> > +	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
> > +	 * (associated with KeyID 0) before the TDX module can use the
> > +	 * global KeyID to access the PAMT.  Given PAMTs are potentially
> > +	 * large (~1/256th of system RAM), just use WBINVD on all cpus
> > +	 * to flush the cache.
> > +	 *
> > +	 * Follow the TDX spec to flush cache before configuring the
> > +	 * global KeyID on all packages.
> > +	 */
> 
> I don't think this second paragraph adds very much clarity.
> 
> 

OK will remove.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs
  2023-01-07  0:17   ` Dave Hansen
@ 2023-01-10 10:23     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 10:23 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 16:17 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > After the global KeyID has been configured on all packages, initialize
> > all TDMRs to make all TDX-usable memory regions that are passed to the
> > TDX module become usable.
> > 
> > This is the last step of initializing the TDX module.
> > 
> > Initializing different TDMRs can be parallelized.  For now to keep it
> > simple, just initialize all TDMRs one by one.  It can be enhanced in the
> > future.
> 
> The changelog probably also needs a note about this being a long process
> and also at least touching on *why* it takes so long.

Will add.  How about:

	Initializing TDMRs can be time consuming on large memory systems as it 
	involves initializing all metadata entries for all pages that can be 
	used by TDX guests.

And put it before "Initializing different TDMRs can be parallelized ..."?

> 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 4c779e8412f1..8b7314f19df2 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1006,6 +1006,55 @@ static int config_global_keyid(void)
> >  	return ret;
> >  }
> >  
> > +static int init_tdmr(struct tdmr_info *tdmr)
> > +{
> > +	u64 next;
> > +
> > +	/*
> > +	 * Initializing a TDMR can be time consuming.  To avoid long
> > +	 * SEAMCALLs, the TDX module may only initialize a part of the
> > +	 * TDMR in each call.
> > +	 */
> > +	do {
> > +		struct tdx_module_output out;
> > +		int ret;
> > +
> > +		/* All 0's are unused parameters, they mean nothing. */
> > +		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> > +				&out);
> > +		if (ret)
> > +			return ret;
> > +		/*
> > +		 * RDX contains 'next-to-initialize' address if
> > +		 * TDH.SYS.TDMR.INT succeeded.
> 
> This reads strangely.  "Success" to me really is different from "partial
> success".  Sure, partial success is also not an error, *but* this can be
> explained better.  How about:
> 
> 		 * RDX contains 'next-to-initialize' address if
> 		 * TDH.SYS.TDMR.INT did not fully complete and should
> 		 * be retried.
> 

Will do.
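
I.e. in context, the tail of the loop would read something like below
(sketch only):

		/*
		 * RDX contains 'next-to-initialize' address if
		 * TDH.SYS.TDMR.INIT did not fully complete and should
		 * be retried.
		 */
		next = out.rdx;
		/* Allow scheduling while looping over a large TDMR. */
		cond_resched();
		/* Keep making SEAMCALLs until the whole TDMR is done. */
	} while (next < tdmr->base + tdmr->size);

	return 0;
}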

[snip]


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module
  2023-01-06 22:21   ` Dave Hansen
@ 2023-01-10 10:48     ` Huang, Kai
  2023-01-10 16:25       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 10:48 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 14:21 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > After a list of "TD Memory Regions" (TDMRs) has been constructed to
> > cover all TDX-usable memory regions, the next step is to pick up a TDX
> > private KeyID as the "global KeyID" (which protects, i.e. TDX module's
> > metadata), and configure it to the TDX module along with the TDMRs.
> 
> For whatever reason, whenever I see "i.e." in a changelog, it's usually
> going off the rails.  This is no exception.  Let's also get rid of the
> passive voice:
> 
> 	The next step After constructing a list of "TD Memory Regions"
> 	(TDMRs) to cover all TDX-usable memory regions is to designate a
> 	TDX private KeyID as the "global KeyID".  This KeyID is used by
> 	the TDX module for mapping things like the PAMT and other TDX
> 	metadata.  This KeyID is passed to the TDX module at the same
> 	time as the TDMRs.

Thanks.  Will use.

> 
> > To keep things simple, just use the first TDX KeyID as the global KeyID.
> 
> 
> 
> 
> > ---
> >  arch/x86/virt/vmx/tdx/tdx.c | 41 +++++++++++++++++++++++++++++++++++--
> >  arch/x86/virt/vmx/tdx/tdx.h |  2 ++
> >  2 files changed, 41 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 620b35e2a61b..ab961443fed5 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -916,6 +916,36 @@ static int construct_tdmrs(struct list_head *tmb_list,
> >  	return ret;
> >  }
> >  
> > +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
> > +{
> > +	u64 *tdmr_pa_array, *p;
> > +	size_t array_sz;
> > +	int i, ret;
> > +
> > +	/*
> > +	 * TDMRs are passed to the TDX module via an array of physical
> > +	 * addresses of each TDMR.  The array itself has alignment
> > +	 * requirement.
> > +	 */
> > +	array_sz = tdmr_list->nr_tdmrs * sizeof(u64) +
> > +		TDMR_INFO_PA_ARRAY_ALIGNMENT - 1;
> 
> One other way of doing this which might be a wee bit less messy:
> 
> 	array_sz = roundup_pow_of_two(array_sz);
> 	if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
> 		array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;
> 
> Since that keeps 'array_sz' at a power-of-two, then kzalloc() will give
> you all the alignment you need, except if the array is too small, in
> which case you can just bloat it to the alignment requirement.
> 
> This would get rid of the PTR_ALIGN() below too.
> 
> Your choice.  What you have works too.

Your code can also get rid of the additional 'p' local variable.  As you said it
is simpler.  I'll use your code.  Thanks!
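
I.e. the function would then look something like below (sketch only; I'll
double check that a power-of-two sized kzalloc() indeed gives the required
alignment):

static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
{
	u64 *tdmr_pa_array;
	size_t array_sz;
	int i, ret;

	/*
	 * TDMRs are passed to the TDX module via an array of physical
	 * addresses of each TDMR.  The array itself also has an
	 * alignment requirement.  A kzalloc() of a power-of-two size is
	 * naturally aligned to that size, so round the array size up to
	 * a power of two and bloat it to the alignment requirement if
	 * it is smaller than that.
	 */
	array_sz = roundup_pow_of_two(tdmr_list->nr_tdmrs * sizeof(u64));
	if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
		array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;

	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
	if (!tdmr_pa_array)
		return -ENOMEM;

	for (i = 0; i < tdmr_list->nr_tdmrs; i++)
		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));

	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_list->nr_tdmrs,
			global_keyid, 0, NULL, NULL);

	/* Free the array as it is not required anymore. */
	kfree(tdmr_pa_array);

	return ret;
}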

> 
> > +	p = kzalloc(array_sz, GFP_KERNEL);
> > +	if (!p)
> > +		return -ENOMEM;
> > +
> > +	tdmr_pa_array = PTR_ALIGN(p, TDMR_INFO_PA_ARRAY_ALIGNMENT);
> > +	for (i = 0; i < tdmr_list->nr_tdmrs; i++)
> > +		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
> > +
> > +	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_list->nr_tdmrs,
> > +				global_keyid, 0, NULL, NULL);
> > +
> > +	/* Free the array as it is not required anymore. */
> > +	kfree(p);
> > +
> > +	return ret;
> > +}
> > +
> >  static int init_tdx_module(void)
> >  {
> >  	/*
> > @@ -960,17 +990,24 @@ static int init_tdx_module(void)
> >  	if (ret)
> >  		goto out_free_tdmrs;
> >  
> > +	/*
> > +	 * Use the first private KeyID as the global KeyID, and pass
> > +	 * it along with the TDMRs to the TDX module.
> > +	 */
> > +	ret = config_tdx_module(&tdmr_list, tdx_keyid_start);
> > +	if (ret)
> > +		goto out_free_pamts;
> 
> This is "consuming" tdx_keyid_start.  Does it need to get incremented
> since the first guest can't use this KeyID now?


It depends on how we treat 'tdx_keyid_start'.  If it means the first _usable_
KeyID for KVM, then we should increment it; but if it is only used to describe
the hardware-enabled TDX KeyID range, then we don't need to increment it.

Currently it is marked as __ro_after_init, so my intention is the latter (also
in the spirit of keeping this series minimal).

Eventually we will need functions to allocate/free TDX KeyIDs for KVM anyway,
but there we can just treat 'tdx_keyid_start + 1' as the first usable KeyID.

[snip]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-10  1:19     ` Huang, Kai
  2023-01-10  1:22       ` Dave Hansen
@ 2023-01-10 11:01       ` Huang, Kai
  2023-01-10 15:17         ` Dave Hansen
  1 sibling, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 11:01 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 01:19 +0000, Huang, Kai wrote:
> > I think the -E2BIG is also wrong.  It should be ENOSPC.  I'd also add a
> > pr_warn() there.  Especially with how lazy this whole thing is, I can
> > start to see how the reserved areas might be exhausted.  Let's be kind
> > to our future selves and make the error (and the fix) easier to find.
> 
> Yes agreed.  Will change to -ENOSPC and add pr_warn().

Btw, in patch ("x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions"),
when there are too many TDMRs, I suppose I should also return -ENOSPC instead of
-E2BIG?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-10  1:22       ` Dave Hansen
@ 2023-01-10 11:01         ` Huang, Kai
  2023-01-10 15:19           ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 11:01 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-01-09 at 17:22 -0800, Dave Hansen wrote:
> On 1/9/23 17:19, Huang, Kai wrote:
> > > It's probably also worth noting *somewhere* that there's a balance to be
> > > had between TDMRs and reserved areas.  A system that is running out of
> > > reserved areas in a TDMR could split a TDMR to get more reserved areas.
> > > A system that has run out of TDMRs could relatively easily coalesce two
> > > adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
> > > if there was a gap between them.
> > We can add above to the changelog of this patch, or the patch 09 ("x86/virt/tdx:
> > Fill out TDMRs to cover all TDX memory regions").  The latter perhaps is better
> > since that patch is the first place where the balance of TDMRs and reserved
> > areas is related.
> > 
> > What is your suggestion?
> 
> Just put it close to the code that actually hits the problem so the
> potential solution is close at hand to whoever hits the problem.
> 

Sorry to double check: the code which hits the problem is the 'if (idx >=
max_reserved_per_tdmr)' check in tdmr_add_rsvd_area(), so I think I can add
the comment right before this check?
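
E.g. something like below (just a sketch; 'idx', 'tdmr' and
'max_reserved_per_tdmr' are from the existing patch, and the exact pr_warn()
wording is to be decided):

	if (idx >= max_reserved_per_tdmr) {
		/*
		 * There's a balance between TDMRs and reserved areas: a
		 * TDMR which runs out of reserved areas could be split
		 * to get more of them, and two adjacent TDMRs could be
		 * coalesced (before PAMTs are allocated) using a
		 * reserved area to cover any gap between them, if the
		 * system runs out of TDMRs instead.
		 */
		pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
				tdmr->base, tdmr->base + tdmr->size);
		return -ENOSPC;
	}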


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2023-01-07  0:35   ` Dave Hansen
@ 2023-01-10 11:29     ` Huang, Kai
  2023-01-10 15:27       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 11:29 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > There are two problems in terms of using kexec() to boot to a new kernel
> > when the old kernel has enabled TDX: 1) Part of the memory pages are
> > still TDX private pages (i.e. metadata used by the TDX module, and any
> > TDX guest memory if kexec() happens when there's any TDX guest alive).
> > 2) There might be dirty cachelines associated with TDX private pages.
> > 
> > Because the hardware doesn't guarantee cache coherency among different
> > KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
> > before booting to the new kernel.  Also, reading TDX private page using
> > any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
> > Therefore ideally, the kernel should convert all TDX private pages back
> > to normal before booting to the new kernel.
> 
> This is just talking about way too many things that just don't apply.
> 
> Let's focus on the *ACTUAL* problem that's being addressed instead of
> the 15 problems that aren't actual practical problems.

Will get rid of those.

> 
> > However, this implementation doesn't convert TDX private pages back to
> > normal in kexec() because of below considerations:
> > 
> > 1) Neither the kernel nor the TDX module has existing infrastructure to
> >    track which pages are TDX private pages.
> > 2) The number of TDX private pages can be large, and converting all of
> >    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
> >    be time consuming.
> > 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
> >    0 doesn't support integrity-check, so it's OK.
> > 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
> >    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
> >    page with the new MKTME KeyID (just like TDX does) before using it.
> 
> Yeah, why are we getting all worked up about MKTME when there is not
> support?

I am not sure whether we need to consider the 3rd party kernel case?

> 
> The only thing that matters here is dirty cacheline writeback.  There
> are two things the kernel needs to do to mitigate that:
> 
>  1. Stop accessing TDX private memory mappings
>   1a. Stop making TDX module calls (uses global private KeyID)
>   1b. Stop TDX guests from running (uses per-guest KeyID)
>  2. Flush any cachelines from previous private KeyID writes
> 
> There are a couple of ways we can do #2.  We do *NOT* need to convert
> *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
> in any way as far as I can tell.

May I ask why?  When I was writing this patch I was not sure whether kexec()
should give the new kernel a clean slate.  The SGX driver doesn't EREMOVE all
EPC pages during kexec() but depends on the new kernel to do that, and I don't
know what the general guideline for supporting kexec() is.

> 
> I think you're also saying that since all CPUs go through this path and
> there is no TDX activity between the WBINVD and the native_halt() that
> 1a and 1b basically happen for "free" without needing to do theme
> explicitly.

Yes.  Should we mention this part in the changelog?

The AMD SME kexec() support patch bba4ed011a52d ("x86/mm, kexec: Allow kexec to
be used with SME") doesn't seem to mention anything similar (SME and TDX may be
different, though).

> 
> > Therefore, this implementation just flushes cache to make sure there are
> > no stale dirty cachelines associated with any TDX private KeyIDs before
> > booting to the new kernel, otherwise they may silently corrupt the new
> > kernel.
> 
> That's fine.  So, this patch kinda happens to land in the right spot
> even after thrashing about about a while.
> 
> > Following SME support, use wbinvd() to flush cache in stop_this_cpu().
> > Theoretically, cache flush is only needed when the TDX module has been
> > initialized.  However initializing the TDX module is done on demand at
> > runtime, and it takes a mutex to read the module status.  Just check
> > whether TDX is enabled by BIOS instead to flush cache.
> 
> Yeah, close enough.
> 
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index c21b7347a26d..0cc84977dc62 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
> >  	 *
> >  	 * Test the CPUID bit directly because the machine might've cleared
> >  	 * X86_FEATURE_SME due to cmdline options.
> > +	 *
> > +	 * Similar to SME, if the TDX module is ever initialized, the
> > +	 * cachelines associated with any TDX private KeyID must be flushed
> > +	 * before transiting to the new kernel.  The TDX module is initialized
> > +	 * on demand, and it takes the mutex to read its status.  Just check
> > +	 * whether TDX is enabled by BIOS instead to flush cache.
> >  	 */
> 
> There's too much detail here.  Let's up-level it a bit.  We don't need
> to be talking about TDX locking here.

Sure will do.  Thanks!
> 
> 	/*
> 	 * The TDX module or guests might have left dirty cachelines
> 	 * behind.  Flush them to avoid corruption from later writeback.
> 	 * Note that this flushes on all systems where TDX is possible,
> 	 * but does not actually check that TDX was in use.
> 	 */
> 
> > -	if (cpuid_eax(0x8000001f) & BIT(0))
> > +	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
> >  		native_wbinvd();
> >  	for (;;) {
> >  		/*
> 
> 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-09 16:51       ` Dave Hansen
@ 2023-01-10 12:09         ` Huang, Kai
  2023-01-10 16:18           ` Dave Hansen
  2023-01-12 11:33         ` Huang, Kai
  1 sibling, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 12:09 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> On 1/9/23 03:48, Huang, Kai wrote:
> ...
> > > > > > This approach works as in practice all boot-time present DIMMs are TDX
> > > > > > convertible memory.  However, if any non-TDX-convertible memory has been
> > > > > > hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> > > > > > module, the module initialization will fail.
> > > > 
> > > > I really don't know what this is trying to say.
> > 
> > My intention is to explain and call out that this design (using all memory
> > regions in memblock at the time of module initialization) works in practice,
> > as long as non-CMR memory hasn't been added via memory hotplug.
> > 
> > Not sure if it is necessary, but I was thinking it may help reviewers judge
> > whether this design is acceptable.
> 
> This is yet another case where you've mechanically described the "what",
> but left out the implications or the underlying basis "why".
> 
> I'd take a more methodical approach to describe what is going on here.
> List the steps that must occur, or at least *one* example of those steps
> and how they intereact with the code in this patch.  Then, explain the
> fallout.
> 
> I also don't think it's quite right to call out "CXL memory via kmem
> driver".  If the CXL memory was "System RAM", it should get covered by a
> CMR and TDMR.  The kmem driver can still go wild with it.
> 
> > > > *How* and *why* does this module initialization failure occur?
> > 
> > If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
> > will fail.
> 
> I'm frankly lost now.  Please go back and try to explain this better.
> Let me know if you want to iterate on this faster than resending this
> series five more times.  I've got some ideas.

Let me try to do my work first.  Thanks.

> 
> > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > the "TDX-capable" nodes.
> > > > 
> > > > Why does it need to be enhanced?  What's the problem?
> > 
> > The problem is after TDX module initialization, no more memory can be hot-added
> > to the page allocator.
> > 
> > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > utilize all memory.
> > 
> > But probably it is not necessarily to call out in the changelog?
> 
> Let's say that we add this TDX-compatible-node ABI in the future.  What
> will old code do that doesn't know about this ABI?

Right.  An old app will break without knowing the new ABI.  One resolution, I
think, is to not introduce a new userspace ABI, but to hide "TDX-capable" and
"non-TDX-capable" nodes in the kernel, and let the kernel enforce that TDX
guest memory is always allocated from the "TDX-capable" nodes.

Anyway, perhaps we can just delete this part from the changelog?

> 
> ...
> > > > > > +       list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > > > +               /*
> > > > > > +                * The new range is TDX memory if it is fully covered by
> > > > > > +                * any TDX memory block.
> > > > > > +                *
> > > > > > +                * Note TDX memory blocks are originated from memblock
> > > > > > +                * memory regions, which can only be contiguous when two
> > > > > > +                * regions have different NUMA nodes or flags.  Therefore
> > > > > > +                * the new range cannot cross multiple TDX memory blocks.
> > > > > > +                */
> > > > > > +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > > > +                       return true;
> > > > > > +       }
> > > > > > +       return false;
> > > > > > +}
> > > > 
> > > > I don't really like that comment.  It should first state its behavior
> > > > and assumptions, like:
> > > > 
> > > >     This check assumes that the start_pfn<->end_pfn range does not
> > > >     cross multiple tdx_memlist entries.
> > > > 
> > > > Only then should it describe why that is OK:
> > > > 
> > > >     A single memory hotplug even across mutliple memblocks (from
> > > >     which tdx_memlist entries are derived) is impossible.  ... then
> > > >     actually explain
> > > > 
> > 
> > How about below?
> > 
> >         /*
> >          * This check assumes that the start_pfn<->end_pfn range does not cross
> >          * multiple tdx_memlist entries. A single memory hotplug event across
> >          * multiple memblocks (from which tdx_memlist entries are derived) is
> >          * impossible. That means start_pfn<->end_pfn range cannot exceed a
> >          * tdx_memlist entry, and the new range is TDX memory if it is fully
> >          * covered by any tdx_memlist entry.
> >          */
> 
> I was hoping you would actually explain why it is impossible.
> 
> Is there something fundamental that keeps a memory area that spans two
> nodes from being removed and then a new area added that is comprised of
> a single node?
> Boot time:
> 
> 	| memblock  |  memblock |
> 	<--Node=0--> <--Node=1-->
> 
> Funky hotplug... nothing to see here, then:
> 
> 	<--------Node=2-------->

I must have missed something, but how can this happen?

My recollection is that this cannot happen because the BIOS always allocates
address ranges for all NUMA nodes during machine boot.  Those address ranges
don't necessarily need to be fully populated with DIMMs, but they don't change
during the machine's runtime.

> 
> I would believe that there is no current bare-metal TDX system that has
> an implementation like this.  But, the comments above speak like it's
> fundamentally impossible.  That should be clarified.
> 
> In other words, that comment talks about memblock attributes as being
> the core underlying reason that that simplified check is OK.  Is that
> it, or is it really the reduced hotplug feature set on TDX systems?

Let me do more homework and get back to you.

> 
> 
> ...
> > > > > > +        * Build the list of "TDX-usable" memory regions which cover all
> > > > > > +        * pages in the page allocator to guarantee that.  Do it while
> > > > > > +        * holding mem_hotplug_lock read-lock as the memory hotplug code
> > > > > > +        * path reads the @tdx_memlist to reject any new memory.
> > > > > > +        */
> > > > > > +       get_online_mems();
> > > > 
> > > > Oh, it actually uses the memory hotplug locking for list protection.
> > > > That's at least a bit subtle.  Please document that somewhere in the
> > > > functions that actually manipulate the list.
> > 
> > add_tdx_memblock() and free_tdx_memlist() eventually call list_add_tail() and
> > list_del() to manipulate the list, but they actually take 'struct list_head
> > *tmb_list' as an argument.  'tdx_memlist' is passed to build_tdx_memlist() as input.
> > 
> > Do you mean document the locking around the implementation of add_tdx_memblock()
> > and free_tdx_memlist()?
> > 
> > Or should we just mention it around the 'tdx_memlist' variable?
> > 
> > /* All TDX-usable memory regions. Protected by memory hotplug locking. */
> > static LIST_HEAD(tdx_memlist);
> 
> I don't think I'd hate it being in all three spots.  Also "protected by
> memory hotplug locking" is pretty generic.  Please be more specific.

OK will do.

> 
> > > > I think it's also worth saying something here about the high-level
> > > > effects of what's going on:
> > > > 
> > > >     Take a snapshot of the memory configuration (memblocks).  This
> > > >     snapshot will be used to enable TDX support for *this* memory
> > > >     configuration only.  Use a memory hotplug notifier to ensure
> > > >     that no other RAM can be added outside of this configuration.
> > > > 
> > > > That's it, right?
> > 
> > Yes. I'll somehow include the above in the comment around get_online_mems().
> > 
> > But should I move the "Use a memory hotplug notifier ..." part to:
> > 
> >         err = register_memory_notifier(&tdx_memory_nb);
> > 
> > because this is where we actually use the memory hotplug notifier?
> 
> I actually want that snippet in the changelog.

Will do.
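
(For context, the memory hotplug notifier being discussed is essentially the
below -- a sketch based on the current patch, with is_tdx_memory() as quoted
earlier in this thread:)

static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
			       void *v)
{
	struct memory_notify *mn = v;

	if (action != MEM_GOING_ONLINE)
		return NOTIFY_OK;

	/*
	 * Not all memory satisfies TDX's requirements.  Reject any new
	 * memory region which is not covered by the "TDX-usable" memory
	 * list built when the TDX module was initialized.
	 */
	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
			NOTIFY_OK : NOTIFY_BAD;
}

static struct notifier_block tdx_memory_nb = {
	.notifier_call = tdx_memory_notifier,
};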

[snip]


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-10 11:01       ` Huang, Kai
@ 2023-01-10 15:17         ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 15:17 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 03:01, Huang, Kai wrote:
> On Tue, 2023-01-10 at 01:19 +0000, Huang, Kai wrote:
>>> I think the -E2BIG is also wrong.  It should be ENOSPC.  I'd also add a
>>> pr_warn() there.  Especially with how lazy this whole thing is, I can
>>> start to see how the reserved areas might be exhausted.  Let's be kind
>>> to our future selves and make the error (and the fix) easier to find.
>> Yes agreed.  Will change to -ENOSPC and add pr_warn().
> Btw, in patch ("x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions"),
> when there are too many TDMRs, I suppose I should also return -ENOSPC instead of
> -E2BIG?

Yes.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-10 11:01         ` Huang, Kai
@ 2023-01-10 15:19           ` Dave Hansen
  2023-01-11 10:57             ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 15:19 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 03:01, Huang, Kai wrote:
> On Mon, 2023-01-09 at 17:22 -0800, Dave Hansen wrote:
>> On 1/9/23 17:19, Huang, Kai wrote:
>>>> It's probably also worth noting *somewhere* that there's a balance to be
>>>> had between TDMRs and reserved areas.  A system that is running out of
>>>> reserved areas in a TDMR could split a TDMR to get more reserved areas.
>>>> A system that has run out of TDMRs could relatively easily coalesce two
>>>> adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
>>>> if there was a gap between them.
>>> We can add above to the changelog of this patch, or the patch 09 ("x86/virt/tdx:
>>> Fill out TDMRs to cover all TDX memory regions").  The latter perhaps is better
>>> since that patch is the first place where the balance of TDMRs and reserved
>>> areas is related.
>>>
>>> What is your suggestion?
>> Just put it close to the code that actually hits the problem so the
>> potential solution is close at hand to whoever hits the problem.
>>
> Sorry to double check: the code which hits the problem is the 'if (idx >=
> max_reserved_per_tdmr)' check in tdmr_add_rsvd_area(), so I think I can add
> right before this check?

Please just hack together how you think it should look and either reply
with an updated patch, or paste the relevant code snippet in your reply.
 That'll keep me from having to go chase this code back down.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2023-01-10 11:29     ` Huang, Kai
@ 2023-01-10 15:27       ` Dave Hansen
  2023-01-11  0:13         ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 15:27 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 03:29, Huang, Kai wrote:
> On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
>> On 12/8/22 22:52, Kai Huang wrote:
...
>>> However, this implementation doesn't convert TDX private pages back to
>>> normal in kexec() because of below considerations:
>>>
>>> 1) Neither the kernel nor the TDX module has existing infrastructure to
>>>    track which pages are TDX private pages.
>>> 2) The number of TDX private pages can be large, and converting all of
>>>    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
>>>    be time consuming.
>>> 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
>>>    0 doesn't support integrity-check, so it's OK.
>>> 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
>>>    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
>>>    page with the new MKTME KeyID (just like TDX does) before using it.
>>
>> Yeah, why are we getting all worked up about MKTME when there is not
>> support?
> 
> I am not sure whether we need to consider the 3rd party kernel case?

No, we don't.

>> The only thing that matters here is dirty cacheline writeback.  There
>> are two things the kernel needs to do to mitigate that:
>>
>>  1. Stop accessing TDX private memory mappings
>>   1a. Stop making TDX module calls (uses global private KeyID)
>>   1b. Stop TDX guests from running (uses per-guest KeyID)
>>  2. Flush any cachelines from previous private KeyID writes
>>
>> There are a couple of ways we can do #2.  We do *NOT* need to convert
>> *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
>> in any way as far as I can tell.
> 
> May I ask why?  When I was writing this patch I was not sure whether kexec()
> should give the new kernel a clean slate.  The SGX driver doesn't EREMOVE all
> EPC pages during kexec() but depends on the new kernel to do that, and I don't
> know what the general guideline for supporting kexec() is.

Think about it this way: kexec() is modifying persistent (across kexec)
state to get the system ready for the new kernel.  The caches are
persistent state.  Devices have persistent state.  Memory state persists
across kexec().  The memory integrity metadata persists.

What persistent state does a conversion to KeyID-0 affect?  It resets
the integrity metadata and the memory contents.

Kexec leaves memory contents in place and doesn't zero them, so memory
contents don't matter.  The integrity metadata also doesn't matter
because the memory will be used as KeyID-0 and that KeyID doesn't read
the integrity metadata.

What practical impact does a conversion back to KeyID-0 serve?  What
persistent state does it affect that matters?

>> I think you're also saying that since all CPUs go through this path and
>> there is no TDX activity between the WBINVD and the native_halt() that
>> 1a and 1b basically happen for "free" without needing to do theme
>> explicitly.
> 
> Yes.  Should we mention this part in changelog?

That would be nice.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-10 12:09         ` Huang, Kai
@ 2023-01-10 16:18           ` Dave Hansen
  2023-01-11 10:00             ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 16:18 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 04:09, Huang, Kai wrote:
> On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> On 1/9/23 03:48, Huang, Kai wrote:
>>>>>>> This can also be enhanced in the future, i.e. by allowing adding non-TDX
>>>>>>> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>>>>>>> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>>>>>>> needs to guarantee memory pages for TDX guests are always allocated from
>>>>>>> the "TDX-capable" nodes.
>>>>>
>>>>> Why does it need to be enhanced?  What's the problem?
>>>
>>> The problem is after TDX module initialization, no more memory can be hot-added
>>> to the page allocator.
>>>
>>> Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>>> actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>>> bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>>> utilize all memory.
>>>
>>> But probably it is not necessarily to call out in the changelog?
>>
>> Let's say that we add this TDX-compatible-node ABI in the future.  What
>> will old code do that doesn't know about this ABI?
> 
> Right.  An old app will break without knowing the new ABI.  One resolution, I
> think, is to not introduce a new userspace ABI, but to hide "TDX-capable" and
> "non-TDX-capable" nodes in the kernel, and let the kernel enforce that TDX
> guest memory is always allocated from the "TDX-capable" nodes.

That doesn't actually hide all of the behavior from users.  Let's say
they do:

	numactl --membind=6 qemu-kvm ...

In other words, take all of this guest's memory and put it on node 6.
There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
they make it a TDX guest:

	numactl --membind=6 qemu-kvm -tdx ...

What happens?  Does the kernel silently ignore the --membind=6?  Or does
it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
memory on node 6.

In other words, I don't think the kernel can just enforce this
internally and hide it from userspace.

>> Is there something fundamental that keeps a memory area that spans two
>> nodes from being removed and then a new area added that is comprised of
>> a single node?
>> Boot time:
>>
>> 	| memblock  |  memblock |
>> 	<--Node=0--> <--Node=1-->
>>
>> Funky hotplug... nothing to see here, then:
>>
>> 	<--------Node=2-------->
> 
> I must have missed something, but how can this happen?
> 
> My recollection is that this cannot happen because the BIOS always allocates
> address ranges for all NUMA nodes during machine boot.  Those address ranges
> don't necessarily need to be fully populated with DIMMs, but they don't change
> during the machine's runtime.

Is your memory correct?  Is there evidence, or requirements in any
specification to support your memory?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module
  2023-01-10 10:48     ` Huang, Kai
@ 2023-01-10 16:25       ` Dave Hansen
  2023-01-10 23:33         ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 16:25 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 02:48, Huang, Kai wrote:
>>>
>>> +   /*
>>> +    * Use the first private KeyID as the global KeyID, and pass
>>> +    * it along with the TDMRs to the TDX module.
>>> +    */
>>> +   ret = config_tdx_module(&tdmr_list, tdx_keyid_start);
>>> +   if (ret)
>>> +           goto out_free_pamts;
>> This is "consuming" tdx_keyid_start.  Does it need to get incremented
>> since the first guest can't use this KeyID now?
> 
> It depends on how we treat 'tdx_keyid_start'.  If it means the first _usable_
> KeyID for KVM, then we should increment it; but if it is only used to describe
> the hardware-enabled TDX KeyID range, then we don't need to increment it.
> 
> Currently it is marked as __ro_after_init, so my intention is the latter (also
> in the spirit of keeping this series minimal).
> 
> Eventually we will need functions to allocate/free TDX KeyIDs for KVM anyway,
> but there we can just treat 'tdx_keyid_start + 1' as the first usable KeyID.

So, basically, you're going to depend on the KVM code (which isn't in
this series) to magically know exactly what this series did?  Then,
you're expecting that this code will never change in a way that breaks
this random KVM code?

That's frankly awful.

Make the variable read/write.  Call it tdx_guest_keyid_start, and
increment it when you make a keyid unavailable for guest use.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages
  2023-01-10 10:15     ` Huang, Kai
@ 2023-01-10 16:53       ` Dave Hansen
  2023-01-11  0:06         ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 16:53 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 02:15, Huang, Kai wrote:
> On Fri, 2023-01-06 at 14:49 -0800, Dave Hansen wrote:
>> On 12/8/22 22:52, Kai Huang wrote:
...
>>> + * Note:
>>> + *
>>> + * This function neither checks whether there's at least one online cpu
>>> + * for each package, nor explicitly prevents any cpu from going offline.
>>> + * If any package doesn't have any online cpu then the SEAMCALL won't be
>>> + * done on that package and the later step of TDX module initialization
>>> + * will fail.  The caller needs to guarantee this.
>>> + */
>>
>> *Does* the caller guarantee it?
>>
>> You're basically saying, "this code needs $FOO to work", but you're not
>> saying who *provides* $FOO.
> 
> In short, KVM can do something to help guarantee this, but won't 100%
> guarantee it.
> 
> Specifically, KVM won't actively try to bring up a CPU if any package has no
> online CPU at all (see the first lore link below).  But KVM can _check_
> whether this condition has been met before calling tdx_init() and report an
> error if not.  In the meantime, if the condition is met, it can refuse to
> offline the last CPU of each package (or any CPU) during module
> initialization.
> 
> And KVM needs similar handling anyway.  The reason is that not only does
> configuring the global KeyID have this requirement; creating/destroying a TD
> (which involves programming/reclaiming one TDX KeyID) also requires at least
> one online CPU for each package.
> 
> There were discussions on the KVM side about how to handle this.  IIUC the
> solution is that KVM will:
> 1) fail to create a TD if any package has no online CPU;
> 2) refuse to offline the last CPU of each package when there's any _active_
> TDX guest running.
> 
> https://lore.kernel.org/lkml/20221102231911.3107438-1-seanjc@google.com/T/#m1ff338686cfcb7ba691cd969acc17b32ff194073
> https://lore.kernel.org/lkml/de6b69781a6ba1fe65535f48db2677eef3ec6a83.1667110240.git.isaku.yamahata@intel.com/
> 
> Thus TDX module initialization in KVM can be handled in a similar way.
> 
> Btw, in v7 (which requires per-LP init on all CPUs), tdx_init() does an early
> check on whether all boot-time present CPUs are online and simply returns an
> error if the condition is not met.  The difference here is that we don't do
> any check but depend on the SEAMCALL to fail.  To me there's no fundamental
> difference.

So, I'm going to call shenanigans here.

You say:

	The caller needs to guarantee this.

Then, you go and tell us how the *ONE* caller of this function doesn't
actually guarantee this.  Plus, you *KNOW* this.

Those are shenanigans.

Let's do something like this instead of asking for something impossible
and pretending that the callers are going to provide some fantasy solution.

/*
 * Attempt to configure the global KeyID on all physical packages.
 *
 * This requires running code on at least one CPU in each package.  If a
 * package has no online CPUs, that code will not run and TDX module
 * initialization (TDH.whatever) will fail.
 *
 * This code takes no affirmative steps to online CPUs.  Callers (aka.
 * KVM) can ensure success by ensuring sufficient CPUs are online for
 * this to succeed.
 */

Now, since this _is_ all imperfect, what will our users see if this
house of cards falls down?  Will they get a nice error message like:

     TDX: failed to configure module, no online CPUs in package 12

Or, will they see:

     TDX: Hurr, durr, I'm confused and you should be too

?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-01-10  2:23         ` Huang, Kai
@ 2023-01-10 19:12           ` Dave Hansen
  2023-01-11  9:23             ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-10 19:12 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/9/23 18:23, Huang, Kai wrote:
> On Mon, 2023-01-09 at 16:47 -0800, Dave Hansen wrote:
>> On 1/9/23 16:40, Huang, Kai wrote:
>>> On Fri, 2023-01-06 at 11:24 -0800, Dave Hansen wrote:
>> ...
>>>> Also, tdmr_sz and max_tdmrs can both be derived from 'sysinfo'.  Do they
>>>> really need to be stored here?
>>>
>>> It's not mandatory to keep them here.  I did it mainly because I want to avoid
>>> passing 'sysinfo' as argument for almost all functions related to constructing
>>> TDMRs.
>>
>> I don't think it hurts readability that much.  On the contrary, it makes
>> it more clear what data is needed for initialization.
> 
> Sorry, one thing I forgot to mention is that if we keep 'tdmr_sz' in 'struct
> tdmr_info_list', it only needs to be calculated once when allocating the
> buffer.  Otherwise, we need to calculate it based on
> sysinfo->max_reserved_per_tdmr each time we want to get a TDMR at a given index.

What's the problem with recalculating it?  It is calculated like this:

	tdmr_sz = ALIGN(constant1 + constant2 * variable);

So, what's the problem?  You're concerned about too many multiplications?

> To me putting relevant fields (tdmrs, tdmr_sz, max_tdmrs, nr_consumed_tdmrs)
> together makes how the TDMR list is organized more clear.  But please let me
> know if you prefer removing 'tdmr_sz' and 'max_tdmrs'.
> 
> Btw, if we remove 'tdmr_sz' and 'max_tdmrs', even nr_consumed_tdmrs is not
> absolutely necessary here.  It can be a local variable of init_tdx_module() (as
> shown in v7), and the 'struct tdmr_info_list' will only have the 'tdmrs' member
> (as you commented in v7):
> 
> https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#mb9826e2bcf8bf6399c13cc5f95a948fe4b3a46d9
> 
> Please let me know what's your preference?

I dunno.  My gut says that passing sysinfo around and just deriving the
size values from that with helpers is the best way.  'struct
tdmr_info_list' isn't a horrible idea in and of itself, but I think it's
a confusing structure because it's not clear how the pieces fit together
when half of it is *required* and the other half is just for some kind
of perceived convenience.
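
Roughly this kind of helper (just a sketch; the struct and macro names are
assumed from your series):

static int tdmr_size_single(u16 max_reserved_per_tdmr)
{
	int tdmr_sz;

	/*
	 * The TDMR_INFO for a single TDMR is a fixed-size header plus a
	 * variable number of reserved-area entries, i.e. the
	 * ALIGN(constant1 + constant2 * variable) above.
	 */
	tdmr_sz = sizeof(struct tdmr_info);
	tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;

	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
}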


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module
  2023-01-10 16:25       ` Dave Hansen
@ 2023-01-10 23:33         ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-10 23:33 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 08:25 -0800, Dave Hansen wrote:
> On 1/10/23 02:48, Huang, Kai wrote:
> > > > 
> > > > +   /*
> > > > +    * Use the first private KeyID as the global KeyID, and pass
> > > > +    * it along with the TDMRs to the TDX module.
> > > > +    */
> > > > +   ret = config_tdx_module(&tdmr_list, tdx_keyid_start);
> > > > +   if (ret)
> > > > +           goto out_free_pamts;
> > > This is "consuming" tdx_keyid_start.  Does it need to get incremented
> > > since the first guest can't use this KeyID now?
> > 
> > It depends on how we treat 'tdx_keyid_start'.  If it means the first _usable_
> > KeyID for KVM, then we should increment it; but if it is only used to describe
> > the hardware-enabled TDX KeyID range, then we don't need to increment it.
> > 
> > Currently it is marked as __ro_after_init, so my intention is the latter (also
> > in the spirit of keeping this series minimal).
> > 
> > Eventually we will need functions to allocate/free TDX KeyIDs for KVM anyway,
> > but there we can just treat 'tdx_keyid_start + 1' as the first usable KeyID.
> 
> So, basically, you're going to depend on the KVM code (which isn't in
> this series) to magically know exactly what this series did?  Then,
> you're expecting that this code will never change in a way that breaks
> this random KVM code?
> 
> That's frankly awful.

Sorry, I should have said this in my previous reply:  The two functions will be
implemented here together with the 'tdx_keyid_start' and 'nr_tdx_keyids'
variables, so they are not KVM code, although they will only be used by KVM for
now:

https://lore.kernel.org/lkml/Y19NzlQcwhV%2F2wl3@debian.me/T/#m0735de9e60138da8fa69828b755f1387e031d08d

Another benefit of putting them here (not in KVM) is that other kernel
components might need to allocate TDX KeyIDs in the future too.

Btw this series itself is not enough for KVM to support TDX.  There will be some
minor x86 patches based on this series for that (exposing tdsysinfo_struct to
KVM, and this KeyID allocation, etc). 

(Or should I just include them in this series?)

> 
> Make the variable read/write.  Call it tdx_guest_keyid_start, and
> increment it when you make a keyid unavailable for guest use.
> 

Yes, I can do that if you prefer.

One minor thing is that 'tdx_keyid_start' is introduced in the second patch of
this series ("x86/virt/tdx: Detect TDX during kernel boot").  IMHO it would be
a little weird to call it 'tdx_guest_keyid_start' in that patch.
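
I.e. something like below (purely an illustration; the variable and helper
names here are hypothetical, not from the current series):

/* First KeyID usable for TDX guests; bumped as KeyIDs are consumed. */
static u32 tdx_guest_keyid_start;
/* Number of KeyIDs remaining for TDX guests. */
static u32 tdx_nr_guest_keyids;

static u32 tdx_reserve_global_keyid(void)
{
	u32 keyid = tdx_guest_keyid_start;

	/* The global KeyID is no longer available for TDX guests. */
	tdx_guest_keyid_start++;
	tdx_nr_guest_keyids--;

	return keyid;
}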

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages
  2023-01-10 16:53       ` Dave Hansen
@ 2023-01-11  0:06         ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-11  0:06 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 08:53 -0800, Dave Hansen wrote:
> On 1/10/23 02:15, Huang, Kai wrote:
> > On Fri, 2023-01-06 at 14:49 -0800, Dave Hansen wrote:
> > > On 12/8/22 22:52, Kai Huang wrote:
> ...
> > > > + * Note:
> > > > + *
> > > > + * This function neither checks whether there's at least one online cpu
> > > > + * for each package, nor explicitly prevents any cpu from going offline.
> > > > + * If any package doesn't have any online cpu then the SEAMCALL won't be
> > > > + * done on that package and the later step of TDX module initialization
> > > > + * will fail.  The caller needs to guarantee this.
> > > > + */
> > > 
> > > *Does* the caller guarantee it?
> > > 
> > > You're basically saying, "this code needs $FOO to work", but you're not
> > > saying who *provides* $FOO.
> > 
> > In short, KVM can do something to help guarantee this, but won't 100%
> > guarantee it.
> > 
> > Specifically, KVM won't actively try to bring up a CPU if any package has no
> > online CPU at all (see the first lore link below).  But KVM can _check_
> > whether this condition has been met before calling tdx_init() and report an
> > error if not.  In the meantime, if the condition is met, it can refuse to
> > offline the last CPU of each package (or any CPU) during module
> > initialization.
> > 
> > And KVM needs similar handling anyway.  The reason is that not only does
> > configuring the global KeyID have this requirement; creating/destroying a TD
> > (which involves programming/reclaiming one TDX KeyID) also requires at least
> > one online CPU for each package.
> > 
> > There were discussions on the KVM side about how to handle this.  IIUC the
> > solution is that KVM will:
> > 1) fail to create a TD if any package has no online CPU;
> > 2) refuse to offline the last CPU of each package when there's any _active_
> > TDX guest running.
> > 
> > https://lore.kernel.org/lkml/20221102231911.3107438-1-seanjc@google.com/T/#m1ff338686cfcb7ba691cd969acc17b32ff194073
> > https://lore.kernel.org/lkml/de6b69781a6ba1fe65535f48db2677eef3ec6a83.1667110240.git.isaku.yamahata@intel.com/
> > 
> > Thus TDX module initialization in KVM can be handled in a similar way.
> > 
> > Btw, in v7 (which requires per-LP init on all CPUs), tdx_init() does an early
> > check on whether all boot-time present CPUs are online and simply returns an
> > error if the condition is not met.  The difference here is that we don't do
> > any check but depend on the SEAMCALL to fail.  To me there's no fundamental
> > difference.
> 
> So, I'm going to call shenanigans here.
> 
> You say:
> 
> 	The caller needs to guarantee this.
> 
> Then, you go and tell us how the *ONE* caller of this function doesn't
> actually guarantee this.  Plus, you *KNOW* this.
> 
> Those are shenanigans.

Agreed.

> 
> Let's do something like this instead of asking for something impossible
> and pretending that the callers are going to provide some fantasy solution.
> 
> /*
>  * Attempt to configure the global KeyID on all physical packages.
>  *
>  * This requires running code on at least one CPU in each package.  If a
>  * package has no online CPUs, that code will not run and TDX module
>  * initialization (TDH.whatever) will fail.
>  *
>  * This code takes no affirmative steps to online CPUs.  Callers (aka.
>  * KVM) can ensure success by ensuring sufficient CPUs are online for
>  * this to succeed.
>  */

Thanks.  Will update changelog accordingly.

> 
> Now, since this _is_ all imperfect, what will our users see if this
> house of cards falls down?  Will they get a nice error message like:
> 
>      TDX: failed to configure module, no online CPUs in package 12
> 
> Or, will they see:
> 
>      TDX: Hurr, durr, I'm confused and you should be too
> 
> ?

I am expecting the former.  I will work with Isaku to make sure of it.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2023-01-10 15:27       ` Dave Hansen
@ 2023-01-11  0:13         ` Huang, Kai
  2023-01-11  0:30           ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-11  0:13 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 07:27 -0800, Dave Hansen wrote:
> On 1/10/23 03:29, Huang, Kai wrote:
> > On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
> > > On 12/8/22 22:52, Kai Huang wrote:
> ...
> > > > However, this implementation doesn't convert TDX private pages back to
> > > > normal in kexec() because of below considerations:
> > > > 
> > > > 1) Neither the kernel nor the TDX module has existing infrastructure to
> > > >    track which pages are TDX private pages.
> > > > 2) The number of TDX private pages can be large, and converting all of
> > > >    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
> > > >    be time consuming.
> > > > 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
> > > >    0 doesn't support integrity-check, so it's OK.
> > > > 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
> > > >    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
> > > >    page with the new MKTME KeyID (just like TDX does) before using it.
> > > 
> > > Yeah, why are we getting all worked up about MKTME when there is not
> > > support?
> > 
> > I am not sure whether we need to consider the 3rd party kernel case?
> 
> No, we don't.

Good to know.

> 
> > > The only thing that matters here is dirty cacheline writeback.  There
> > > are two things the kernel needs to do to mitigate that:
> > > 
> > >  1. Stop accessing TDX private memory mappings
> > >   1a. Stop making TDX module calls (uses global private KeyID)
> > >   1b. Stop TDX guests from running (uses per-guest KeyID)
> > >  2. Flush any cachelines from previous private KeyID writes
> > > 
> > > There are a couple of ways we can do #2.  We do *NOT* need to convert
> > > *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
> > > in any way as far as I can tell.
> > 
> > May I ask why?  When I was writing this patch I was not sure whether kexec()
> > should give the new kernel a clean slate.  The SGX driver doesn't EREMOVE all
> > EPC pages during kexec() either, but depends on the new kernel to do that; I
> > don't know what the general guidance for supporting kexec() is.
> 
> Think about it this way: kexec() is modifying persistent (across kexec)
> state to get the system ready for the new kernel.  The caches are
> persistent state.  Devices have persistent state.  Memory state persists
> across kexec().  The memory integrity metadata persists.
> 
> What persistent state does a conversion to KeyID-0 affect?  It resets
> the integrity metadata and the memory contents.
> 
> Kexec leaves memory contents in place and doesn't zero them, so memory
> contents don't matter.  The integrity metadata also doesn't matter
> because the memory will be used as KeyID-0 and that KeyID doesn't read
> the integrity metadata.

Right.  So I guess we just need to call out the new kernel will use memory as
KeyID-0?

> 
> What practical impact does a conversion back to KeyID-0 serve?  What
> persistent state does it affect that matters?

If we can be sure the new kernel will use KeyID-0, then we don't need to
convert.  In 3) and 4) of my changelog, I actually was trying to convey
this.
  
> 
> > > I think you're also saying that since all CPUs go through this path and
> > > there is no TDX activity between the WBINVD and the native_halt() that
> > > 1a and 1b basically happen for "free" without needing to do them
> > > explicitly.
> > 
> > Yes.  Should we mention this part in changelog?
> 
> That would be nice.
> 

Will do.
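
(For context, the flush this patch does on each CPU before native_halt() boils
down to something like the sketch below.  This is only an illustrative sketch,
not the patch itself; platform_tdx_enabled() is assumed to be the helper this
series adds to report whether the BIOS has enabled TDX.)

/*
 * Illustrative sketch (not the actual patch): flush cachelines that may
 * contain dirty data written with a TDX private KeyID before this CPU
 * halts for kexec.  platform_tdx_enabled() is assumed to report whether
 * the BIOS has enabled TDX.
 */
static void tdx_flush_caches_for_kexec(void)
{
        if (platform_tdx_enabled())
                native_wbinvd();
}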

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2023-01-11  0:13         ` Huang, Kai
@ 2023-01-11  0:30           ` Dave Hansen
  2023-01-11  1:58             ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-11  0:30 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/10/23 16:13, Huang, Kai wrote:
> On Tue, 2023-01-10 at 07:27 -0800, Dave Hansen wrote:
...
>> Think about it this way: kexec() is modifying persistent (across kexec)
>> state to get the system ready for the new kernel.  The caches are
>> persistent state.  Devices have persistent state.  Memory state persists
>> across kexec().  The memory integrity metadata persists.
>>
>> What persistent state does a conversion to KeyID-0 affect?  It resets
>> the integrity metadata and the memory contents.
>>
>> Kexec leaves memory contents in place and doesn't zero them, so memory
>> contents don't matter.  The integrity metadata also doesn't matter
>> because the memory will be used as KeyID-0 and that KeyID doesn't read
>> the integrity metadata.
> 
> Right.  So I guess we just need to call out the new kernel will use memory as
> KeyID-0?

Not even that.

Say the new kernel wanted to use the memory as KeyID-3.  What would it
do?  It would *ASSUME* that the memory *WASN'T* KeyID-3.  It would
convert it to KeyID-3.  That conversion would work from *any* KeyID.

So:

	KeyID-0: OK, because it has no integrity enforcement
	KeyID-1: OK, new kernel will convert the page
	KeyID-2: OK, new kernel will convert the page
	...
	KeyID-$MAX: OK, new kernel will convert the page

So, "OK" everywhere.  Nothing to do... anywhere.

Either I'm totally missing how this works, or you're desperately trying
to make this more complicated than it is.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2023-01-11  0:30           ` Dave Hansen
@ 2023-01-11  1:58             ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-11  1:58 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, peterz, imammedo, Gao, Chao, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 16:30 -0800, Hansen, Dave wrote:
> On 1/10/23 16:13, Huang, Kai wrote:
> > On Tue, 2023-01-10 at 07:27 -0800, Dave Hansen wrote:
> ...
> > > Think about it this way: kexec() is modifying persistent (across kexec)
> > > state to get the system ready for the new kernel.  The caches are
> > > persistent state.  Devices have persistent state.  Memory state persists
> > > across kexec().  The memory integrity metadata persists.
> > > 
> > > What persistent state does a conversion to KeyID-0 affect?  It resets
> > > the integrity metadata and the memory contents.
> > > 
> > > Kexec leaves memory contents in place and doesn't zero them, so memory
> > > contents don't matter.  The integrity metadata also doesn't matter
> > > because the memory will be used as KeyID-0 and that KeyID doesn't read
> > > the integrity metadata.
> > 
> > Right.  So I guess we just need to call out the new kernel will use memory as
> > KeyID-0?
> 
> Not even that.
> 
> Say the new kernel wanted to use the memory as KeyID-3.  What would it
> do?  It would *ASSUME* that the memory *WASN'T* KeyID-3.  It would
> convert it to KeyID-3.  That conversion would work from *any* KeyID.
> 
> So:
> 
> 	KeyID-0: OK, because it has no integrity enforcement
> 	KeyID-1: OK, new kernel will convert the page
> 	KeyID-2: OK, new kernel will convert the page
> 	...
> 	KeyID-$MAX: OK, new kernel will convert the page
> 
> So, "OK" everywhere.  Nothing to do... anywhere.
> 
> Either I'm totally missing how this works, or you're desperately trying
> to make this more complicated than it is.
> 

You are right.  The page conversion must do MOVDIR64B first even when converting
the page from KeyID 0.  I was wrongly thinking that when converting from KeyID 0
we don't need to do MOVDIR64B.  My bad.

Sorry for the noise.  Thanks for your time.  I'll remove all that stuff.
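
(For the record, if a conversion were ever needed, clearing a page with
MOVDIR64B would look roughly like the sketch below.  Illustrative only; it uses
the kernel's movdir64b() helper from <asm/special_insns.h>, whose exact
signature may differ across kernel versions, and PAGE_SIZE from <asm/page.h>.)

/*
 * Illustrative sketch only: clear one 4K page, 64 bytes at a time, with
 * MOVDIR64B so the memory integrity metadata is re-initialized for the
 * KeyID the page is currently mapped with.
 */
static void clear_page_movdir64b(void *page_kaddr)
{
        static const char zeros[64] __aligned(64);
        unsigned long offset;

        for (offset = 0; offset < PAGE_SIZE; offset += 64)
                movdir64b(page_kaddr + offset, zeros);
}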

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-01-10 19:12           ` Dave Hansen
@ 2023-01-11  9:23             ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-11  9:23 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, peterz, imammedo, Gao, Chao, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 11:12 -0800, Hansen, Dave wrote:
> On 1/9/23 18:23, Huang, Kai wrote:
> > On Mon, 2023-01-09 at 16:47 -0800, Dave Hansen wrote:
> > > On 1/9/23 16:40, Huang, Kai wrote:
> > > > On Fri, 2023-01-06 at 11:24 -0800, Dave Hansen wrote:
> > > ...
> > > > > Also, tdmr_sz and max_tdmrs can both be derived from 'sysinfo'.  Do they
> > > > > really need to be stored here?
> > > > 
> > > > It's not mandatory to keep them here.  I did it mainly because I want to avoid
> > > > passing 'sysinfo' as argument for almost all functions related to constructing
> > > > TDMRs.
> > > 
> > > I don't think it hurts readability that much.  On the contrary, it makes
> > > it more clear what data is needed for initialization.
> > 
> > Sorry, one thing I forgot to mention is that if we keep 'tdmr_sz' in 'struct
> > tdmr_info_list', it only needs to be calculated once when allocating the
> > buffer.  Otherwise, we need to calculate it based on
> > sysinfo->max_reserved_per_tdmr each time we want to get a TDMR at a given index.
> 
> What's the problem with recalculating it?  It is calculated like this:
> 
> 	tdmr_sz = ALIGN(constant1 + constant2 * variable);
> 
> So, what's the problem?  You're concerned about too many multiplications?

No problem.  I don't have concerns about the multiplications, but since they can
be avoided, I thought it would perhaps be better to do so.

So I am fine with either way, no problem.
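
(Just for reference, if we go the other way, deriving the size from 'sysinfo'
with a small helper would be roughly like the sketch below.  Only a sketch, not
final code; TDMR_INFO_ALIGNMENT is assumed to be the per-entry alignment the
TDX module requires for each TDMR_INFO.)

/*
 * Sketch only: derive the size of one TDMR entry from the TDX module's
 * sysinfo instead of caching it in 'struct tdmr_info_list'.
 */
static int tdmr_size_single(u16 max_reserved_per_tdmr)
{
        int tdmr_sz;

        /* Fixed header plus the variable array of reserved areas */
        tdmr_sz = sizeof(struct tdmr_info);
        tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;

        return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
}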

> 
> > To me putting relevant fields (tdmrs, tdmr_sz, max_tdmrs, nr_consumed_tdmrs)
> > together makes how the TDMR list is organized more clear.  But please let me
> > know if you prefer removing 'tdmr_sz' and 'max_tdmrs'.
> > 
> > Btw, if we remove 'tdmr_sz' and 'max_tdmrs', even nr_consumed_tdmrs is not
> > absolutely necessary here.  It can be a local variable of init_tdx_module() (as
> > shown in v7), and the 'struct tdmr_info_list' will only have the 'tdmrs' member
> > (as you commented in v7):
> > 
> > https://lore.kernel.org/linux-mm/cc195eb6499cf021b4ce2e937200571915bfe66f.camel@intel.com/T/#mb9826e2bcf8bf6399c13cc5f95a948fe4b3a46d9
> > 
> > Please let me know what's your preference?
> 
> I dunno.  My gut says that passing sysinfo around and just deriving the
> sizes values from that with helpers is the best way.  'struct
> tdmr_info_list' isn't a horrible idea in and of itself, but I think it's
> a confusing structure because it's not clear how the pieces fit together
> when half of it is *required* and the other half is just for some kind
> of perceived convenience.
> 

Sure.  No more argument about this.

However, for the sake of not adding more review burden to you, how about keeping
the 'struct tdmr_info_list' as is this time?  Of course I am willing to remove
the 'tdmr_sz' and 'max_tdmrs' from 'struct tdmr_info_list' and keep only 'tdmrs'
and 'nr_consumed_tdmrs', if you are willing to look at what the new code would
look like.

Please let me know?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-10 16:18           ` Dave Hansen
@ 2023-01-11 10:00             ` Huang, Kai
  2023-01-12  0:56               ` Huang, Ying
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-11 10:00 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, peterz, imammedo, Gao, Chao, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> On 1/10/23 04:09, Huang, Kai wrote:
> > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > the "TDX-capable" nodes.
> > > > > > 
> > > > > > Why does it need to be enhanced?  What's the problem?
> > > > 
> > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > to the page allocator.
> > > > 
> > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > utilize all memory.
> > > > 
> > > > But probably it is not necessarily to call out in the changelog?
> > > 
> > > Let's say that we add this TDX-compatible-node ABI in the future.  What
> > > will old code do that doesn't know about this ABI?
> > 
> > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
> > TDX guest memory from those "TDX-capable" nodes.
> 
> That doesn't actually hide all of the behavior from users.  Let's say
> they do:
> 
> 	numactl --membind=6 qemu-kvm ...
> 
> In other words, take all of this guest's memory and put it on node 6.
> There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
> they make it a TDX guest:
> 
> 	numactl --membind=6 qemu-kvm -tdx ...
> 
> What happens?  Does the kernel silently ignore the --membind=6?  Or does
> it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> memory on node 6.
> 
> In other words, I don't think the kernel can just enforce this
> internally and hide it from userspace.

IIUC the kernel, for instance KVM, which knows the 'task_struct' is for a TDX
guest, can manually AND the "TDX-capable" node mask into the task's mempolicy, so
that the memory will always be allocated from those "TDX-capable" nodes.  KVM can
refuse to create the TDX guest if it finds the task's mempolicy doesn't have any
"TDX-capable" node, and print out a clear message to userspace.

But I am new to the core-mm, so I might have some misunderstanding.
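
(To illustrate the idea, a minimal sketch of such an in-kernel restriction is
below.  It is purely hypothetical: 'tdx_capable_nodes' and
tdx_restrict_guest_nodes() are made-up names for illustration, not existing
kernel or KVM APIs.)

#include <linux/nodemask.h>
#include <linux/printk.h>

/* Hypothetical: assumed to be filled during TDX module initialization. */
static nodemask_t tdx_capable_nodes;

/*
 * Hypothetical sketch only: intersect the allowed nodes of a task's
 * mempolicy with the "TDX-capable" nodes, and refuse to proceed if the
 * intersection is empty.
 */
static int tdx_restrict_guest_nodes(nodemask_t *policy_nodes)
{
        nodemask_t allowed;

        nodes_and(allowed, *policy_nodes, tdx_capable_nodes);

        if (nodes_empty(allowed)) {
                pr_err("TDX: mempolicy contains no TDX-capable node\n");
                return -EINVAL;         /* refuse to create the TDX guest */
        }

        *policy_nodes = allowed;
        return 0;
}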

> 
> > > Is there something fundamental that keeps a memory area that spans two
> > > nodes from being removed and then a new area added that is comprised of
> > > a single node?
> > > Boot time:
> > > 
> > > 	| memblock  |  memblock |
> > > 	<--Node=0--> <--Node=1-->
> > > 
> > > Funky hotplug... nothing to see here, then:
> > > 
> > > 	<--------Node=2-------->
> > 
> > I must have missed something, but how can this happen?
> > 
> > My memory is that this cannot happen because the BIOS always allocates
> > address ranges for all NUMA nodes during machine boot.  Those address ranges
> > don't necessarily need to have DIMMs fully populated, but they don't change
> > during the machine's runtime.
> 
> Is your memory correct?  Is there evidence, or requirements in any
> specification to support your memory?
> 

I tried to find whether there's any spec mentioning this, but so far I haven't
found any.  I'll ask around to see whether this case can happen.

In the meantime, I also spent some time looking into the memory hotplug code
more deeply.  Below is my thinking:

For a TDX system, AFAICT a non-buggy BIOS won't support physically hot-removing
CMR memory (and thus no hot-add of CMR memory either).  So we are either talking
about hot-adding non-TDX-usable memory (which is not configured to the TDX
module), or kernel soft offline -> (optional remove -> add ->) online of any
TDX-usable memory.

For the former we don't need to care about whether the new range can cross
multiple tdx_memlist entries.  For the latter, the offline granularity is
'struct memory_block', which is a fixed size after boot IIUC.

And we can only offline a memory_block when: 1) it has no memory hole, and
2) all its pages are in a single zone.  IIUC this means it's not possible to
offline two adjacent contiguous tdx_memlist entries and then online them
together as a single one.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-10 15:19           ` Dave Hansen
@ 2023-01-11 10:57             ` Huang, Kai
  2023-01-11 16:16               ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-11 10:57 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, peterz, imammedo, Gao, Chao, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-01-10 at 07:19 -0800, Dave Hansen wrote:
> On 1/10/23 03:01, Huang, Kai wrote:
> > On Mon, 2023-01-09 at 17:22 -0800, Dave Hansen wrote:
> > > On 1/9/23 17:19, Huang, Kai wrote:
> > > > > It's probably also worth noting *somewhere* that there's a balance to be
> > > > > had between TDMRs and reserved areas.  A system that is running out of
> > > > > reserved areas in a TDMR could split a TDMR to get more reserved areas.
> > > > > A system that has run out of TDMRs could relatively easily coalesce two
> > > > > adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
> > > > > if there was a gap between them.
> > > > We can add above to the changelog of this patch, or the patch 09 ("x86/virt/tdx:
> > > > Fill out TDMRs to cover all TDX memory regions").  The latter perhaps is better
> > > > since that patch is the first place where the balance of TDMRs and reserved
> > > > areas is related.
> > > > 
> > > > What is your suggestion?
> > > Just put it close to the code that actually hits the problem so the
> > > potential solution is close at hand to whoever hits the problem.
> > > 
> > Sorry to double check: the code which hits the problem is the 'if (idx >=
> > max_reserved_per_tdmr)' check in tdmr_add_rsvd_area(), so I think I can add
> > right before this check?
> 
> Please just hack together how you think it should look and either reply
> with an updated patch, or paste the relevant code snippet in your reply.
>  That'll keep me from having to go chase this code back down.
> 

Thanks for the tip.  How about below?

static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
                              u64 size, u16 max_reserved_per_tdmr)
{
        struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
        int idx = *p_idx;

        /* Reserved area must be 4K aligned in offset and size */
        if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
                return -EINVAL;

        /*
         * The TDX module supports only limited number of TDMRs and
         * limited number of reserved areas for each TDMR.  There's a
         * balance to be had between TDMRs and reserved areas.  A system
         * that is running out of reserved areas in a TDMR could split a
         * TDMR to get more reserved areas.  A system that has run out
         * of TDMRs could relatively easily coalesce two adjacent TDMRs
         * (before the PAMTs are allocated) and use a reserved area if
         * there was a gap between them.
         */
        if (idx >= max_reserved_per_tdmr) {
                pr_warn("too many reserved areas for TDMR [0x%llx, 0x%llx)\n",
                                tdmr->base, tdmr_end(tdmr));
                return -ENOSPC;
        }

        /*
         * Consume one reserved area per call.  Make no effort to
         * optimize or reduce the number of reserved areas which are
         * consumed by contiguous reserved areas, for instance.
         */
        rsvd_areas[idx].offset = addr - tdmr->base;
        rsvd_areas[idx].size = size;

        *p_idx = idx + 1;

        return 0;
}


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-11 10:57             ` Huang, Kai
@ 2023-01-11 16:16               ` Dave Hansen
  2023-01-11 22:10                 ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2023-01-11 16:16 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, peterz, imammedo, Gao, Chao, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 1/11/23 02:57, Huang, Kai wrote:
> On Tue, 2023-01-10 at 07:19 -0800, Dave Hansen wrote:
>> On 1/10/23 03:01, Huang, Kai wrote:
>>> On Mon, 2023-01-09 at 17:22 -0800, Dave Hansen wrote:
>>>> On 1/9/23 17:19, Huang, Kai wrote:
>>>>>> It's probably also worth noting *somewhere* that there's a balance to be
>>>>>> had between TDMRs and reserved areas.  A system that is running out of
>>>>>> reserved areas in a TDMR could split a TDMR to get more reserved areas.
>>>>>> A system that has run out of TDMRs could relatively easily coalesce two
>>>>>> adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
>>>>>> if there was a gap between them.
>>>>> We can add above to the changelog of this patch, or the patch 09 ("x86/virt/tdx:
>>>>> Fill out TDMRs to cover all TDX memory regions").  The latter perhaps is better
>>>>> since that patch is the first place where the balance of TDMRs and reserved
>>>>> areas is related.
>>>>>
>>>>> What is your suggestion?
>>>> Just put it close to the code that actually hits the problem so the
>>>> potential solution is close at hand to whoever hits the problem.
>>>>
>>> Sorry to double check: the code which hits the problem is the 'if (idx >=
>>> max_reserved_per_tdmr)' check in tdmr_add_rsvd_area(), so I think I can add
>>> right before this check?
>>
>> Please just hack together how you think it should look and either reply
>> with an updated patch, or paste the relevant code snippet in your reply.
>>  That'll keep me from having to go chase this code back down.
>>
> 
> Thanks for the tip.  How about below?
> 
> static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
>                               u64 size, u16 max_reserved_per_tdmr)
> {
>         struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
>         int idx = *p_idx;
> 
>         /* Reserved area must be 4K aligned in offset and size */
>         if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
>                 return -EINVAL;
> 
>         /*
>          * The TDX module supports only limited number of TDMRs and
>          * limited number of reserved areas for each TDMR.  There's a
>          * balance to be had between TDMRs and reserved areas.  A system
>          * that is running out of reserved areas in a TDMR could split a
>          * TDMR to get more reserved areas.  A system that has run out
>          * of TDMRs could relatively easily coalesce two adjacent TDMRs
>          * (before the PAMTs are allocated) and use a reserved area if
>          * there was a gap between them.
>          */
>         if (idx >= max_reserved_per_tdmr) {
>                 pr_warn("too many reserved areas for TDMR [0x%llx, 0x%llx)\n",
>                                 tdmr->base, tdmr_end(tdmr));
>                 return -ENOSPC;
>         }

This isn't really converging on a solution.  At this point, I just see
my verbatim text being copied and pasted into these functions without
really anything additional.

This comment, for instance, just blathers about what could be done but
doesn't actually explain what it is doing here.

But, again, this isn't converging.  It's just thrashing and not getting
any better.  I guess I'll just fix it up best I can when I apply it.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-01-11 16:16               ` Dave Hansen
@ 2023-01-11 22:10                 ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-11 22:10 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Wed, 2023-01-11 at 08:16 -0800, Dave Hansen wrote:
> On 1/11/23 02:57, Huang, Kai wrote:
> > On Tue, 2023-01-10 at 07:19 -0800, Dave Hansen wrote:
> > > On 1/10/23 03:01, Huang, Kai wrote:
> > > > On Mon, 2023-01-09 at 17:22 -0800, Dave Hansen wrote:
> > > > > On 1/9/23 17:19, Huang, Kai wrote:
> > > > > > > It's probably also worth noting *somewhere* that there's a balance to be
> > > > > > > had between TDMRs and reserved areas.  A system that is running out of
> > > > > > > reserved areas in a TDMR could split a TDMR to get more reserved areas.
> > > > > > > A system that has run out of TDMRs could relatively easily coalesce two
> > > > > > > adjacent TDMRs (before the PAMTs are allocated) and use a reserved area
> > > > > > > if there was a gap between them.
> > > > > > We can add above to the changelog of this patch, or the patch 09 ("x86/virt/tdx:
> > > > > > Fill out TDMRs to cover all TDX memory regions").  The latter perhaps is better
> > > > > > since that patch is the first place where the balance of TDMRs and reserved
> > > > > > areas is related.
> > > > > > 
> > > > > > What is your suggestion?
> > > > > Just put it close to the code that actually hits the problem so the
> > > > > potential solution is close at hand to whoever hits the problem.
> > > > > 
> > > > Sorry to double check: the code which hits the problem is the 'if (idx >=
> > > > max_reserved_per_tdmr)' check in tdmr_add_rsvd_area(), so I think I can add
> > > > right before this check?
> > > 
> > > Please just hack together how you think it should look and either reply
> > > with an updated patch, or paste the relevant code snippet in your reply.
> > >  That'll keep me from having to go chase this code back down.
> > > 
> > 
> > Thanks for the tip.  How about below?
> > 
> > static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
> >                               u64 size, u16 max_reserved_per_tdmr)
> > {
> >         struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
> >         int idx = *p_idx;
> > 
> >         /* Reserved area must be 4K aligned in offset and size */
> >         if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
> >                 return -EINVAL;
> > 
> >         /*
> >          * The TDX module supports only limited number of TDMRs and
> >          * limited number of reserved areas for each TDMR.  There's a
> >          * balance to be had between TDMRs and reserved areas.  A system
> >          * that is running out of reserved areas in a TDMR could split a
> >          * TDMR to get more reserved areas.  A system that has run out
> >          * of TDMRs could relatively easily coalesce two adjacent TDMRs
> >          * (before the PAMTs are allocated) and use a reserved area if
> >          * there was a gap between them.
> >          */
> >         if (idx >= max_reserved_per_tdmr) {
> >                 pr_warn("too many reserved areas for TDMR [0x%llx, 0x%llx)\n",
> >                                 tdmr->base, tdmr_end(tdmr));
> >                 return -ENOSPC;
> >         }
> 
> This isn't really converging on a solution.  At this point, I just see
> my verbatim text being copied and pasted into these functions without
> really anything additional.
> 
> This comment, for instance, just blathers about what could be done but
> doesn't actually explain what it is doing here.
> 
> But, again, this isn't converging.  It's just thrashing and not getting
> any better.  I guess I'll just fix it up best I can when I apply it.

Appreciate your help!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-11 10:00             ` Huang, Kai
@ 2023-01-12  0:56               ` Huang, Ying
  2023-01-12  1:18                 ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Ying @ 2023-01-12  0:56 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Hansen, Dave, linux-kernel, Luck, Tony, bagasdotme, ak,
	Wysocki, Rafael J, kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, peterz, imammedo, Gao, Chao, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Williams, Dan J

"Huang, Kai" <kai.huang@intel.com> writes:

> On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
>> On 1/10/23 04:09, Huang, Kai wrote:
>> > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> > > On 1/9/23 03:48, Huang, Kai wrote:
>> > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
>> > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>> > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>> > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
>> > > > > > > > the "TDX-capable" nodes.
>> > > > > >
>> > > > > > Why does it need to be enhanced?  What's the problem?
>> > > >
>> > > > The problem is after TDX module initialization, no more memory can be hot-added
>> > > > to the page allocator.
>> > > >
>> > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>> > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>> > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>> > > > utilize all memory.
>> > > >
>> > > > But probably it is not necessarily to call out in the changelog?
>> > >
>> > > Let's say that we add this TDX-compatible-node ABI in the future.  What
>> > > will old code do that doesn't know about this ABI?
>> >
>> > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
>> > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
>> > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
>> > TDX guest memory from those "TDX-capable" nodes.
>>
>> That doesn't actually hide all of the behavior from users.  Let's say
>> they do:
>>
>>       numactl --membind=6 qemu-kvm ...
>>
>> In other words, take all of this guest's memory and put it on node 6.
>> There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
>> they make it a TDX guest:
>>
>>       numactl --membind=6 qemu-kvm -tdx ...
>>
>> What happens?  Does the kernel silently ignore the --membind=6?  Or does
>> it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
>> memory on node 6.
>>
>> In other words, I don't think the kernel can just enforce this
>> internally and hide it from userspace.
>
> IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> the memory will always be allocated from those "TDX-capable" nodes.  KVM can
> refuse to create the TDX guest if it found task's mempolicy doesn't have any
> "TDX-capable" node, and print out a clear message to the userspace.
>
> But I am new to the core-mm, so I might have some misunderstanding.

KVM here means the in-kernel KVM module?  If so, KVM can only output some
message in dmesg, which isn't very good for users to digest.  It's
better for the user-space QEMU to detect whether the current configuration
is usable and respond to users via GUI, syslog, etc.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-12  0:56               ` Huang, Ying
@ 2023-01-12  1:18                 ` Huang, Kai
  2023-01-12  1:59                   ` Huang, Ying
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-12  1:18 UTC (permalink / raw)
  To: Huang, Ying
  Cc: kvm, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-kernel, linux-mm, peterz, imammedo, Gao, Chao, Brown, Len,
	Shahar, Sagi, sathyanarayanan.kuppuswamy, Williams, Dan J

On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
> "Huang, Kai" <kai.huang@intel.com> writes:
> 
> > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> > > On 1/10/23 04:09, Huang, Kai wrote:
> > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > > > the "TDX-capable" nodes.
> > > > > > > > 
> > > > > > > > Why does it need to be enhanced?  What's the problem?
> > > > > > 
> > > > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > > > to the page allocator.
> > > > > > 
> > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > > > utilize all memory.
> > > > > > 
> > > > > > But probably it is not necessarily to call out in the changelog?
> > > > > 
> > > > > Let's say that we add this TDX-compatible-node ABI in the future.  What
> > > > > will old code do that doesn't know about this ABI?
> > > > 
> > > > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> > > > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
> > > > TDX guest memory from those "TDX-capable" nodes.
> > > 
> > > That doesn't actually hide all of the behavior from users.  Let's say
> > > they do:
> > > 
> > >       numactl --membind=6 qemu-kvm ...
> > > 
> > > In other words, take all of this guest's memory and put it on node 6.
> > > There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
> > > they make it a TDX guest:
> > > 
> > >       numactl --membind=6 qemu-kvm -tdx ...
> > > 
> > > What happens?  Does the kernel silently ignore the --membind=6?  Or does
> > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> > > memory on node 6.
> > > 
> > > In other words, I don't think the kernel can just enforce this
> > > internally and hide it from userspace.
> > 
> > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> > the memory will always be allocated from those "TDX-capable" nodes.  KVM can
> > refuse to create the TDX guest if it found task's mempolicy doesn't have any
> > "TDX-capable" node, and print out a clear message to the userspace.
> > 
> > But I am new to the core-mm, so I might have some misunderstanding.
> 
> KVM here means in-kernel KVM module?  If so, KVM can only output some
> message in dmesg.  Which isn't very good for users to digest.  It's
> better for the user space QEMU to detect whether current configuration
> is usable and respond to users, via GUI, or syslog, etc.

I am not against this. For instance, maybe we can add some dedicated error code
and let KVM return it to Qemu, but I don't want to speak for KVM guys.  We can
discuss this more when we have patches actually sent out to the community.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-12  1:18                 ` Huang, Kai
@ 2023-01-12  1:59                   ` Huang, Ying
  2023-01-12  2:22                     ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: Huang, Ying @ 2023-01-12  1:59 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-kernel, linux-mm, peterz, imammedo, Gao, Chao, Brown, Len,
	Shahar, Sagi, sathyanarayanan.kuppuswamy, Williams, Dan J

"Huang, Kai" <kai.huang@intel.com> writes:

> On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
>> "Huang, Kai" <kai.huang@intel.com> writes:
>>
>> > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
>> > > On 1/10/23 04:09, Huang, Kai wrote:
>> > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> > > > > On 1/9/23 03:48, Huang, Kai wrote:
>> > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
>> > > > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>> > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>> > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
>> > > > > > > > > > the "TDX-capable" nodes.
>> > > > > > > >
>> > > > > > > > Why does it need to be enhanced?  What's the problem?
>> > > > > >
>> > > > > > The problem is after TDX module initialization, no more memory can be hot-added
>> > > > > > to the page allocator.
>> > > > > >
>> > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>> > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>> > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>> > > > > > utilize all memory.
>> > > > > >
>> > > > > > But probably it is not necessarily to call out in the changelog?
>> > > > >
>> > > > > Let's say that we add this TDX-compatible-node ABI in the future.  What
>> > > > > will old code do that doesn't know about this ABI?
>> > > >
>> > > > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
>> > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
>> > > > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
>> > > > TDX guest memory from those "TDX-capable" nodes.
>> > >
>> > > That doesn't actually hide all of the behavior from users.  Let's say
>> > > they do:
>> > >
>> > >       numactl --membind=6 qemu-kvm ...
>> > >
>> > > In other words, take all of this guest's memory and put it on node 6.
>> > > There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
>> > > they make it a TDX guest:
>> > >
>> > >       numactl --membind=6 qemu-kvm -tdx ...
>> > >
>> > > What happens?  Does the kernel silently ignore the --membind=6?  Or does
>> > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
>> > > memory on node 6.
>> > >
>> > > In other words, I don't think the kernel can just enforce this
>> > > internally and hide it from userspace.
>> >
>> > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
>> > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
>> > the memory will always be allocated from those "TDX-capable" nodes.  KVM can
>> > refuse to create the TDX guest if it found task's mempolicy doesn't have any
>> > "TDX-capable" node, and print out a clear message to the userspace.
>> >
>> > But I am new to the core-mm, so I might have some misunderstanding.
>>
>> KVM here means in-kernel KVM module?  If so, KVM can only output some
>> message in dmesg.  Which isn't very good for users to digest.  It's
>> better for the user space QEMU to detect whether current configuration
>> is usable and respond to users, via GUI, or syslog, etc.
>
> I am not against this. For instance, maybe we can add some dedicated error code
> and let KVM return it to Qemu, but I don't want to speak for KVM guys.  We can
> discuss this more when we have patches actually sent out to the community.

Error code is a kind of ABI too. :-)

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-12  1:59                   ` Huang, Ying
@ 2023-01-12  2:22                     ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-12  2:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: kvm, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, linux-kernel,
	Yamahata, Isaku, peterz, imammedo, Gao, Chao, Brown, Len, Shahar,
	Sagi, sathyanarayanan.kuppuswamy, Williams, Dan J

On Thu, 2023-01-12 at 09:59 +0800, Huang, Ying wrote:
> "Huang, Kai" <kai.huang@intel.com> writes:
> 
> > On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
> > > "Huang, Kai" <kai.huang@intel.com> writes:
> > > 
> > > > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> > > > > On 1/10/23 04:09, Huang, Kai wrote:
> > > > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > > > > > the "TDX-capable" nodes.
> > > > > > > > > > 
> > > > > > > > > > Why does it need to be enhanced?  What's the problem?
> > > > > > > > 
> > > > > > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > > > > > to the page allocator.
> > > > > > > > 
> > > > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > > > > > utilize all memory.
> > > > > > > > 
> > > > > > > > But probably it is not necessarily to call out in the changelog?
> > > > > > > 
> > > > > > > Let's say that we add this TDX-compatible-node ABI in the future.  What
> > > > > > > will old code do that doesn't know about this ABI?
> > > > > > 
> > > > > > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> > > > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> > > > > > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
> > > > > > TDX guest memory from those "TDX-capable" nodes.
> > > > > 
> > > > > That doesn't actually hide all of the behavior from users.  Let's say
> > > > > they do:
> > > > > 
> > > > >       numactl --membind=6 qemu-kvm ...
> > > > > 
> > > > > In other words, take all of this guest's memory and put it on node 6.
> > > > > There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
> > > > > they make it a TDX guest:
> > > > > 
> > > > >       numactl --membind=6 qemu-kvm -tdx ...
> > > > > 
> > > > > What happens?  Does the kernel silently ignore the --membind=6?  Or does
> > > > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> > > > > memory on node 6.
> > > > > 
> > > > > In other words, I don't think the kernel can just enforce this
> > > > > internally and hide it from userspace.
> > > > 
> > > > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> > > > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> > > > the memory will always be allocated from those "TDX-capable" nodes.  KVM can
> > > > refuse to create the TDX guest if it found task's mempolicy doesn't have any
> > > > "TDX-capable" node, and print out a clear message to the userspace.
> > > > 
> > > > But I am new to the core-mm, so I might have some misunderstanding.
> > > 
> > > KVM here means in-kernel KVM module?  If so, KVM can only output some
> > > message in dmesg.  Which isn't very good for users to digest.  It's
> > > better for the user space QEMU to detect whether current configuration
> > > is usable and respond to users, via GUI, or syslog, etc.
> > 
> > I am not against this. For instance, maybe we can add some dedicated error code
> > and let KVM return it to Qemu, but I don't want to speak for KVM guys.  We can
> > discuss this more when we have patches actually sent out to the community.
> 
> Error code is a kind of ABI too. :-)
> 

Right.  I can bring this up in the KVM TDX support series (when the time is right)
to see whether the KVM guys want such an error code in the initial support, or just
want to extend it in the future.  The worst case is that an old Qemu may not
recognize the new error code (while the error is still available in dmesg) but can
still do the right thing (stop running, etc).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-09 16:51       ` Dave Hansen
  2023-01-10 12:09         ` Huang, Kai
@ 2023-01-12 11:33         ` Huang, Kai
  1 sibling, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-12 11:33 UTC (permalink / raw)
  To: kvm, Hansen, Dave, linux-kernel
  Cc: Luck, Tony, bagasdotme, ak, Wysocki, Rafael J, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, Yamahata, Isaku,
	tglx, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len, peterz,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > > +       list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > > > +               /*
> > > > > > +                * The new range is TDX memory if it is fully
> > > > > > covered by
> > > > > > +                * any TDX memory block.
> > > > > > +                *
> > > > > > +                * Note TDX memory blocks are originated from
> > > > > > memblock
> > > > > > +                * memory regions, which can only be contiguous when
> > > > > > two
> > > > > > +                * regions have different NUMA nodes or flags. 
> > > > > > Therefore
> > > > > > +                * the new range cannot cross multiple TDX memory
> > > > > > blocks.
> > > > > > +                */
> > > > > > +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb-
> > > > > > >end_pfn)
> > > > > > +                       return true;
> > > > > > +       }
> > > > > > +       return false;
> > > > > > +}
> > > > 
> > > > I don't really like that comment.  It should first state its behavior
> > > > and assumptions, like:
> > > > 
> > > >      This check assumes that the start_pfn<->end_pfn range does not
> > > >      cross multiple tdx_memlist entries.
> > > > 
> > > > Only then should it describe why that is OK:
> > > > 
> > > >      A single memory hotplug even across mutliple memblocks (from
> > > >      which tdx_memlist entries are derived) is impossible.  ... then
> > > >      actually explain
> > > > 
> > 
> > How about below?
> > 
> >          /*
> >           * This check assumes that the start_pfn<->end_pfn range does not
> >           * cross multiple tdx_memlist entries. A single memory hotplug
> >           * event across multiple memblocks (from which tdx_memlist entries
> >           * are derived) is impossible. That means start_pfn<->end_pfn range
> >           * cannot exceed a tdx_memlist entry, and the new range is TDX
> >           * memory if it is fully covered by any tdx_memlist entry.
> >           */
> 
> I was hoping you would actually explain why it is impossible.
> 
> Is there something fundamental that keeps a memory area that spans two
> nodes from being removed and then a new area added that is comprised of
> a single node?
> 
> Boot time:
> 
> 	| memblock  |  memblock |
> 	<--Node=0--> <--Node=1-->
> 
> Funky hotplug... nothing to see here, then:
> 
> 	<--------Node=2-------->
> 
> I would believe that there is no current bare-metal TDX system that has
> an implementation like this.  But, the comments above speak like it's
> fundamentally impossible.  That should be clarified.
> 
> In other words, that comment talks about memblock attributes as being
> the core underlying reason that that simplified check is OK.  Is that
> it, or is it really the reduced hotplug feature set on TDX systems?

Hi Dave,

I think I have been forgetting that we have switched to rejecting non-TDX memory
at memory online time, but not at memory hot-add time.

Memory offline/online is done at the granularity of 'struct memory_block', not
memblock.  In fact, the hotpluggable memory region (one memblock) must be a
multiple of memory_block, and a "to-be-online" memory_block must be a full range
of memory (no memory hole).

So if I am not missing something, IIUC that means if the start_pfn<->end_pfn
range is TDX memory, it must be fully within some @tdx_memlist entry, and cannot
cross multiple small entries.  And the memory hotplug case in your above diagram
actually shouldn't matter.

If the above stands, how about the below?

        /*
         * This check assumes that the start_pfn<->end_pfn range does not
         * cross multiple @tdx_memlist entries.  A single memory online
         * event across multiple @tdx_memlist entries (which are derived
         * from memblocks at the time of module initialization) is not
         * possible.
         *
         * This is because memory offline/online is done at the granularity
         * of 'struct memory_block', and the hotpluggable memory region
         * (one memblock) must be a multiple of memory_block.  Also, the
         * "to-be-online" memory_block must be a full range of memory
         * (no memory hole, i.e. it cannot contain multiple small
         * memblocks separated by holes).
         *
         * This means if the start_pfn<->end_pfn range is TDX memory, it
         * must be fully within one @tdx_memlist entry, and cannot cross
         * multiple small entries.
         */
        list_for_each_entry(tmb, &tdx_memlist, list) {
                if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
                        return true;
        }


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2022-12-09  6:52 ` [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
  2023-01-06 18:18   ` Dave Hansen
@ 2023-01-18 11:08   ` Huang, Kai
  2023-01-18 13:57     ` David Hildenbrand
  1 sibling, 1 reply; 84+ messages in thread
From: Huang, Kai @ 2023-01-18 11:08 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, david, Wysocki,
	Rafael J, kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, akpm, osalvador, peterz, Shahar, Sagi, imammedo, Gao,
	Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	Williams, Dan J

+Dave, Oscar and Andrew.

Hi Memory hotplug maintainers,

Sorry for the wide CC, but could you help to review this Intel TDX (Trusted Domain
Extensions) patch, since it is related to memory hotplug (it doesn't modify any
common memory hotplug code directly, though)?  Dave suggested it's better to get
the memory hotplug folks to help review sooner rather than later.

This whole series already has linux-mm@kvack.org in the CC list.  Thanks for
your time.

On Fri, 2022-12-09 at 19:52 +1300, Kai Huang wrote:
> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
> 
> TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> kernel which memory is TDX compatible.  The kernel needs to build a list
> of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> the TDX module.  Once this is done, those "TDX-usable" memory regions
> are fixed during module's lifetime.
> 
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, just make sure
> all pages in the page allocator are TDX memory.
> 
> To guarantee that, stash off the memblock memory regions at the time of
> initializing the TDX module as TDX's own usable memory regions, and in
> the meantime, register a TDX memory notifier to reject to online any new
> memory in memory hotplug.
> 
> This approach works as in practice all boot-time present DIMMs are TDX
> convertible memory.  However, if any non-TDX-convertible memory has been
> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> module, the module initialization will fail.
> 
> This can also be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v7 -> v8:
>  - Trimed down changelog (Dave).
>  - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
>    (Ying).
>  - Moved memory hotplug handling from add_arch_memory() to
>    memory_notifier (Dan/David).
>  - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
>  - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave).
>  - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
>  - Improve the comment around first 1MB (Dave).
>  - Added a comment around reserve_real_mode() to point out TDX code
>    relies on first 1MB being reserved (Ying).
>  - Added comment to explain why the new online memory range cannot
>    cross multiple TDX memory blocks (Dave).
>  - Improved other comments (Dave).
> 
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/kernel/setup.c     |   2 +
>  arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 162 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 216fee7144ee..3a841a77fda4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
>  	 *
>  	 * Moreover, on machines with SandyBridge graphics or in setups that use
>  	 * crashkernel the entire 1M is reserved anyway.
> +	 *
> +	 * Note that TDX on the host also requires the first 1MB to be reserved.
>  	 */
>  	reserve_real_mode();
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 6fe505c32599..f010402f443d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,13 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/memory.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> +#include <linux/pfn.h>
>  #include <asm/pgtable_types.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
> @@ -25,6 +32,12 @@ enum tdx_module_status_t {
>  	TDX_MODULE_ERROR
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +};
> +
>  static u32 tdx_keyid_start __ro_after_init;
>  static u32 nr_tdx_keyids __ro_after_init;
>  
> @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
>  /* Prevent concurrent attempts on TDX detection and initialization */
>  static DEFINE_MUTEX(tdx_module_lock);
>  
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
>  	return 0;
>  }
>  
> +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled. */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered by
> +		 * any TDX memory block.
> +		 *
> +		 * Note TDX memory blocks originate from memblock memory
> +		 * regions, and two adjacent regions are only kept separate
> +		 * when they have different NUMA nodes or flags.  Therefore
> +		 * the new range cannot cross multiple TDX memory blocks.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> +			       void *v)
> +{
> +	struct memory_notify *mn = v;
> +
> +	if (action != MEM_GOING_ONLINE)
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * Not all memory is compatible with TDX.  Reject
> +	 * onlining any incompatible memory.
> +	 */
> +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> +		NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> +	.notifier_call = tdx_memory_notifier,
> +};
> +
>  static int __init tdx_init(void)
>  {
>  	int err;
> @@ -89,6 +149,13 @@ static int __init tdx_init(void)
>  		goto no_tdx;
>  	}
>  
> +	err = register_memory_notifier(&tdx_memory_nb);
> +	if (err) {
> +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> +				err);
> +		goto no_tdx;
> +	}
> +
>  	return 0;
>  no_tdx:
>  	clear_tdx();
> @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>  	return 0;
>  }
>  
> +/*
> + * Add a memory region as a TDX memory block.  The caller must make sure
> + * all memory regions are added in address ascending order and don't
> + * overlap.
> + */
> +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> +			    unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +
> +	list_add_tail(&tmb->list, tmb_list);
> +	return 0;
> +}
> +
> +static void free_tdx_memlist(struct list_head *tmb_list)
> +{
> +	while (!list_empty(tmb_list)) {
> +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}
> +
> +/*
> + * Ensure that all memblock memory regions are convertible to TDX
> + * memory.  Once this has been established, stash the memblock
> + * ranges off in a secondary structure because memblock is modified
> + * in memory hotplug while TDX memory regions are fixed.
> + */
> +static int build_tdx_memlist(struct list_head *tmb_list)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +		/*
> +		 * The first 1MB is not reported as TDX convertible memory.
> +		 * Although the first 1MB is always reserved and won't end up
> +		 * in the page allocator, it is still in memblock's memory
> +		 * regions.  Skip it manually to exclude it as TDX memory.
> +		 */
> +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  memblock has
> +		 * already guaranteed the regions are in address ascending
> +		 * order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memlist(tmb_list);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -226,10 +364,25 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * The initial support of TDX guests only allocates memory from
> +	 * the global page allocator.  To keep things simple, just make
> +	 * sure all pages in the page allocator are TDX memory.
> +	 *
> +	 * Build the list of "TDX-usable" memory regions which cover all
> +	 * pages in the page allocator to guarantee that.  Do it while
> +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> +	 * path reads the @tdx_memlist to reject any new memory.
> +	 */
> +	get_online_mems();
> +
> +	ret = build_tdx_memlist(&tdx_memlist);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Build the list of TDX-usable memory regions.
>  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
>  	 *    regions.
>  	 *  - Pick up one TDX private KeyID as the global KeyID.
> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>  	 */
>  	ret = -EINVAL;
>  out:
> +	/*
> +	 * @tdx_memlist is written here and read at memory hotplug time.
> +	 * Lock out memory hotplug code while building it.
> +	 */
> +	put_online_mems();
>  	return ret;
>  }
>  
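
Just to illustrate the "TDX-capable node" idea mentioned in the changelog
above (this is not part of the patch, and the node number below is purely
hypothetical): once "TDX-capable" and "non-TDX-capable" nodes co-exist, a
userspace VMM could constrain its guest-memory allocations to a TDX-capable
node with a standard NUMA memory policy.  A minimal sketch, assuming node 0
is the TDX-capable node (build with -lnuma):

  /* Hypothetical sketch: bind guest memory to an assumed TDX-capable node 0 */
  #include <numaif.h>	/* mbind(), MPOL_BIND (libnuma headers) */
  #include <sys/mman.h>
  #include <stdio.h>

  int main(void)
  {
  	size_t len = 1UL << 30;			/* 1GiB of guest memory */
  	unsigned long nodemask = 1UL << 0;	/* node 0 assumed TDX-capable */
  	void *mem;

  	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
  		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (mem == MAP_FAILED) {
  		perror("mmap");
  		return 1;
  	}

  	/* Restrict future faults in this range to the chosen node */
  	if (mbind(mem, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
  		perror("mbind");
  		return 1;
  	}

  	return 0;
  }

The same effect could be had with "numactl --membind=<node>"; how the
kernel/userspace contract for this would actually look is still an open
question, as the changelog says.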


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-18 11:08   ` Huang, Kai
@ 2023-01-18 13:57     ` David Hildenbrand
  2023-01-18 19:38       ` Huang, Kai
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2023-01-18 13:57 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, akpm, osalvador, peterz, Shahar, Sagi, imammedo, Gao,
	Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	Williams, Dan J

On 18.01.23 12:08, Huang, Kai wrote:
> +Dave, Oscar and Andrew.
> 
> Hi Memory hotplug maintainers,
> 
> Sorry to CC, but could you help to review this Intel TDX (Trusted Domain
> Extensions) patch, since it is related to memory hotplug (not modifying any
> common memory hotplug directly, though)?  Dave suggested it's better to get
> memory hotplug guys to help to review sooner than later.
> 
> This whole series already has linux-mm@kvack.org in the CC list.  Thanks for
> your time.

Hi,

I remember discussing that part (notifier) already and it looked good to 
me. No objection from my side.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-01-18 13:57     ` David Hildenbrand
@ 2023-01-18 19:38       ` Huang, Kai
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Kai @ 2023-01-18 19:38 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, akpm, kirill.shutemov,
	Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, Wysocki, Rafael J, osalvador, Shahar, Sagi, peterz,
	imammedo, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J

On Wed, 2023-01-18 at 14:57 +0100, David Hildenbrand wrote:
> On 18.01.23 12:08, Huang, Kai wrote:
> > +Dave, Oscar and Andrew.
> > 
> > Hi Memory hotplug maintainers,
> > 
> > Sorry to CC, but could you help to review this Intel TDX (Trusted Domain
> > Extensions) patch, since it is related to memory hotplug (not modifying any
> > common memory hotplug directly, though)?  Dave suggested it's better to get
> > memory hotplug guys to help to review sooner than later.
> > 
> > This whole series already has linux-mm@kvack.org in the CC list.  Thanks for
> > your time.
> 
> Hi,
> 
> I remember discussing that part (notifier) already and it looked good to 
> me. No objection from my side.
> 

Yes.  Thanks!

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2023-01-18 19:38 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-09  6:52 [PATCH v8 00/16] TDX host kernel support Kai Huang
2022-12-09  6:52 ` [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
2023-01-06 19:04   ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2023-01-06 17:09   ` Dave Hansen
2023-01-08 22:25     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
2023-01-06 19:04   ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
2023-01-06 17:14   ` Dave Hansen
2023-01-08 22:26     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL Kai Huang
2023-01-06 17:29   ` Dave Hansen
2023-01-09 10:30     ` Huang, Kai
2023-01-09 19:54       ` Dave Hansen
2023-01-09 22:10         ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2023-01-06 17:46   ` Dave Hansen
2023-01-09 10:25     ` Huang, Kai
2023-01-09 19:52       ` Dave Hansen
2023-01-09 22:07         ` Huang, Kai
2023-01-09 22:11           ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
2023-01-06 18:18   ` Dave Hansen
2023-01-09 11:48     ` Huang, Kai
2023-01-09 16:51       ` Dave Hansen
2023-01-10 12:09         ` Huang, Kai
2023-01-10 16:18           ` Dave Hansen
2023-01-11 10:00             ` Huang, Kai
2023-01-12  0:56               ` Huang, Ying
2023-01-12  1:18                 ` Huang, Kai
2023-01-12  1:59                   ` Huang, Ying
2023-01-12  2:22                     ` Huang, Kai
2023-01-12 11:33         ` Huang, Kai
2023-01-18 11:08   ` Huang, Kai
2023-01-18 13:57     ` David Hildenbrand
2023-01-18 19:38       ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
2023-01-06 19:24   ` Dave Hansen
2023-01-10  0:40     ` Huang, Kai
2023-01-10  0:47       ` Dave Hansen
2023-01-10  2:23         ` Huang, Kai
2023-01-10 19:12           ` Dave Hansen
2023-01-11  9:23             ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 09/16] x86/virt/tdx: Fill out " Kai Huang
2023-01-06 19:36   ` Dave Hansen
2023-01-10  0:45     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2023-01-06 21:53   ` Dave Hansen
2023-01-10  0:49     ` Huang, Kai
2023-01-07  0:47   ` Dave Hansen
2023-01-10  0:47     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
2023-01-06 22:07   ` Dave Hansen
2023-01-10  1:19     ` Huang, Kai
2023-01-10  1:22       ` Dave Hansen
2023-01-10 11:01         ` Huang, Kai
2023-01-10 15:19           ` Dave Hansen
2023-01-11 10:57             ` Huang, Kai
2023-01-11 16:16               ` Dave Hansen
2023-01-11 22:10                 ` Huang, Kai
2023-01-10 11:01       ` Huang, Kai
2023-01-10 15:17         ` Dave Hansen
2022-12-09  6:52 ` [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module Kai Huang
2023-01-06 22:21   ` Dave Hansen
2023-01-10 10:48     ` Huang, Kai
2023-01-10 16:25       ` Dave Hansen
2023-01-10 23:33         ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2023-01-06 22:49   ` Dave Hansen
2023-01-10 10:15     ` Huang, Kai
2023-01-10 16:53       ` Dave Hansen
2023-01-11  0:06         ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
2023-01-07  0:17   ` Dave Hansen
2023-01-10 10:23     ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
2023-01-07  0:35   ` Dave Hansen
2023-01-10 11:29     ` Huang, Kai
2023-01-10 15:27       ` Dave Hansen
2023-01-11  0:13         ` Huang, Kai
2023-01-11  0:30           ` Dave Hansen
2023-01-11  1:58             ` Huang, Kai
2022-12-09  6:52 ` [PATCH v8 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).