* [PATCH v10 00/16] TDX host kernel support
@ 2023-03-06 14:13 Kai Huang
  2023-03-06 14:13 ` [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
                   ` (17 more replies)
  0 siblings, 18 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately[2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed[3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionality or to resolve
everything perfectly.  For example, memory hotplug is handled in a
simple way (please refer to the "Kernel policy on TDX memory" and
"Memory hotplug" sections below).

(For memory hotplug, sorry for broadcasting widely but I cc'ed the
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

TDX module metadata allocation just uses alloc_contig_pages() to
allocate a large chunk at runtime, so it can fail.  It is imperfect now
but _will_ be improved in the future.

Also, the patch to add the new kernel command line option tdx="force"
isn't included in this initial version, as Dave suggested it isn't
mandatory.  But I _will_ add one once this initial version gets merged.

All other optimizations will be posted as follow-up once this initial
TDX support is upstreamed.

Hi Dave, Peter, Thomas, Dan (and Intel reviewers),

The environment to test the new LP.INIT SEAMCALL behaviour isn't ready
yet, thus I haven't tested the new behaviour.  Instead, I tested with
all cpus online when initializing the TDX module.  The CPU hotplug path
isn't really tested, although I did a basic test: I offlined some cpus
after module initialization, onlined them again, and LP.INIT was
skipped successfully for them.

However I believe there should be no issue when the new module is ready.
I will test and report back when the new module is ready.

I would appreciate it if folks could review this presumptive series anyway.
   
And I would appreciate reviewed-by or acked-by tags if the patches look
good to you.

----- Changelog history: ------

- v9 -> v10:

 - Changed the per-cpu initialization handling
   - Gave up "ensuring all online cpus are TDX-runnable when TDX module
     is initialized", but just provide two basic functions, tdx_enable()
     and tdx_cpu_enable(), to let the user of TDX make sure that
     tdx_cpu_enable() has been done successfully when the user wants to
     use a particular cpu for TDX.
   - Thus, moved per-cpu initialization out of tdx_enable().  Now
     tdx_enable() just assumes VMXON and tdx_cpu_enable() have been done
     on all online cpus before calling it.
   - Merged the tdx_enable() skeleton patch and per-cpu initialization
     patch together to tell a better story.
   - Moved "SEAMCALL infrastructure" patch before the tdx_enable() patch.

 v9: https://lore.kernel.org/lkml/cover.1676286526.git.kai.huang@intel.com/

- v8 -> v9:

 - Added patches to handle TDH.SYS.INIT and TDH.SYS.LP.INIT back.
 - For other changes, please refer to the changelog history in
   individual patches.

 v8: https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com/

- v7 -> v8:

 - 200+ LOC removed (from 1800+ -> 1600+).
 - Removed patches to do TDH.SYS.INIT and TDH.SYS.LP.INIT
   (Dave/Peter/Thomas).
 - Removed patch to shut down TDX module (Sean).
 - For memory hotplug, moved rejecting non-TDX memory from
   arch_add_memory() to the memory notifier (Dan/David).
 - Simplified the "skeleton patch" as a result of removing the
   TDH.SYS.LP.INIT patch.
 - Refined changelog/comments for most of the patches (to tell a better
   story, remove silly comments, etc) (Dave).
 - Added new 'struct tdmr_info_list' struct, and changed all TDMR related
   patches to use it (Dave).
 - Effectively merged patch "Reserve TDX module global KeyID" and
   "Configure TDX module with TDMRs and global KeyID", and removed the
   static variable 'tdx_global_keyid', following Dave's suggestion on
   making tdx_sysinfo local variable.
 - For detailed changes please see individual patch changelog history.

 v7: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v6 -> v7:
  - Added memory hotplug support.
  - Changed how to choose the list of "TDX-usable" memory regions from at
    kernel boot time to TDX module initialization time.
  - Addressed comments received in previous versions. (Andi/Dave).
  - Improved the commit message and the comments of kexec() support patch,
    and the patch handles returning PAMTs back to the kernel when TDX
    module initialization fails. Please also see "kexec()" section below.
  - Changed the documentation patch accordingly.
  - For all others please see individual patch changelog history.

 v6: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v5 -> v6:

  - Removed ACPI CPU/memory hotplug patches. (Intel internal discussion)
  - Removed patch to disable driver-managed memory hotplug (Intel
    internal discussion).
  - Added one patch to introduce enum type for TDX supported page size
    level to replace the hard-coded values in TDX guest code (Dave).
  - Added one patch to make TDX depends on X2APIC being enabled (Dave).
  - Added one patch to build all boot-time present memory regions as TDX
    memory during kernel boot.
  - Added Reviewed-by from others to some patches.
  - For all others please see individual patch changelog history.

 v5: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v4 -> v5:

  This is essentially a resend of v4.  Sorry I forgot to consult
  get_maintainer.pl when sending out v4, so I forgot to add linux-acpi
  and linux-mm mailing list and the relevant people for 4 new patches.

  There are also very minor code and commit message updates from v4:

  - Rebased to latest tip/x86/tdx.
  - Fixed a checkpatch issue that I missed in v4.
  - Removed an obsoleted comment that I missed in patch 6.
  - Very minor update to the commit message of patch 12.

  For other changes to individual patches since v3, please refer to the
  changelog history of individual patches (I just used v3 -> v5 since
  there's basically no code change to v4).

 v4: https://lore.kernel.org/lkml/98c84c31d8f062a0b50a69ef4d3188bc259f2af2.1654025431.git.kai.huang@intel.com/T/

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX keyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver managed memory
   hotplug.
 - Removed tdx_detect() but only use single tdx_init().
 - Removed detecting TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the boot-time command line to disable TDX patch.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

 v3: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v2 -> v3:

 - Addressed comments from Isaku.
  - Fixed memory leak and unnecessary function argument in the patch to
    configure the key for the global keyid (patch 17).
  - Slightly enhanced the patch to get TDX module and CMR
    information (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed documentation patch to add TDX host kernel support materials
   to Documentation/x86/tdx.rst together with TDX guest stuff, instead
   of a standalone file (patch 21)
 - Very minor improvement in commit messages.

 v2: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues that are related to finding all RAM memory regions
    based on e820.
  - Minor improvement on comments and commit messages.

 v1: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated region as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection.  The firmware running on
the secure processor plays a role similar to that of the TDX module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter "13.Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function tdx_enable() to allow the caller to initialize
TDX at runtime:

        if (tdx_enable())
                goto no_tdx;
        // TDX is ready to create TD guests.

This approach has the following advantages:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is consumed only
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel.

3) It is more flexible to support "TDX module runtime update" (not in
this series).  After updating to the new module at runtime, the kernel
needs to go through the initialization process again.
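
To put point 1) in numbers, here is a rough sketch of the metadata cost.
It only applies the ~1/256 ratio quoted above.  The helper name is
hypothetical and not part of this series, and the real metadata size
depends on the TDX module and on how the memory is laid out.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of this series: estimate the TDX
 * metadata size from the ~1/256 ratio quoted in the cover letter.
 */
static uint64_t tdx_metadata_estimate(uint64_t ram_bytes)
{
	return ram_bytes / 256;
}
```

Under this estimate, a machine with 1 TiB of RAM would give up roughly
4 GiB once TDX is enabled, which is why deferring the cost until a TD
guest is actually wanted is attractive.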

2. CPU hotplug

The TDX module requires that the per-cpu initialization SEAMCALL
(TDH.SYS.LP.INIT) be done on a cpu before any other SEAMCALLs can be
made on that cpu, including those involved in module initialization.

The kernel provides tdx_cpu_enable() to let the user of TDX do this when
the user wants to use a new cpu for TDX tasks.
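
The ordering this implies for the caller can be sketched in userspace
as below.  This is only a model: the mock_* functions, the fixed cpu
count and the state variables are all made up for illustration and do
not match the signatures of the real tdx_cpu_enable()/tdx_enable().
The mock also chooses to fail when a cpu was skipped, whereas the real
tdx_enable() simply assumes the caller did the right thing.

```c
#include <errno.h>
#include <stdbool.h>

#define MOCK_NR_CPUS 4	/* hypothetical cpu count for this sketch */

/* Illustrative state only; the real kernel tracks this differently. */
static bool cpu_lp_init_done[MOCK_NR_CPUS];
static bool module_initialized;

/* Mock of the per-cpu step (TDH.SYS.LP.INIT must come first on a cpu). */
static int mock_tdx_cpu_enable(int cpu)
{
	if (cpu < 0 || cpu >= MOCK_NR_CPUS)
		return -EINVAL;
	cpu_lp_init_done[cpu] = true;
	return 0;
}

/* Mock of module initialization: refuse if any cpu was skipped. */
static int mock_tdx_enable(void)
{
	for (int cpu = 0; cpu < MOCK_NR_CPUS; cpu++)
		if (!cpu_lp_init_done[cpu])
			return -EINVAL;
	module_initialized = true;
	return 0;
}

/* Hypothetical caller (think KVM) once VMXON is done on all cpus. */
static int caller_enable_tdx(void)
{
	for (int cpu = 0; cpu < MOCK_NR_CPUS; cpu++) {
		int ret = mock_tdx_cpu_enable(cpu);

		if (ret)
			return ret;
	}
	return mock_tdx_enable();
}
```

The point of the model is the order of operations: per-cpu enabling on
every cpu that will issue SEAMCALLs, then module initialization.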

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
never support hotpluggable CPU devices and/or deliver ACPI CPU hotplug
events to the kernel.  This series doesn't handle physical (ACPI) CPU
hotplug at all but depends on the BIOS to behave correctly.

Note TDX works with logical CPU online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as usable
for TDX private memory.

The initial support of TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory to guarantee all pages in the page allocator are TDX pages.

4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed for the module's
runtime.  No more "TDX-usable" memory can be added to the TDX module
after that.

To guarantee that all pages in the page allocator are TDX pages, as
described above, this series simply chooses to reject any non-TDX-usable
memory in memory hotplug.
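
As an illustration of that policy, the check can be thought of as
"is the hot-added range fully covered by a range that was handed to
the TDX module?".  The sketch below is a userspace model under that
assumption.  The struct and function names are hypothetical and do
not come from the patches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one "TDX-usable" range, [start_pfn, end_pfn). */
struct tdx_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Return true only if [start_pfn, end_pfn) lies entirely inside one of
 * the recorded ranges.  A partially covered range is rejected.
 */
static bool range_is_tdx_usable(const struct tdx_range *ranges, size_t nr,
				unsigned long start_pfn,
				unsigned long end_pfn)
{
	for (size_t i = 0; i < nr; i++) {
		if (start_pfn >= ranges[i].start_pfn &&
		    end_pfn <= ranges[i].end_pfn)
			return true;
	}
	return false;
}
```

A memory hotplug hook built on such a check would refuse to online any
range for which it returns false, which keeps non-TDX pages out of the
page allocator.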

This _will_ be enhanced in the future after first submission.

A better solution, suggested by Kirill, is similar to the per-node memory
encryption flag in this series [4].  We can allow adding/onlining non-TDX
memory to separate NUMA nodes so that both "TDX-capable" nodes and
"non-TDX-capable" nodes can co-exist.  The new TDX flag can be exposed to
userspace via sysfs so userspace can bind TDX guests to "TDX-capable"
nodes via NUMA ABIs.

5. Physical Memory Hotplug

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal but depends on the BIOS to behave correctly.

6. Kexec()

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages; 2) There might be dirty cachelines associated
with TDX private pages.

The first problem doesn't matter.  KeyID 0 doesn't have an integrity
check.  Even if the new kernel wants to use a non-zero KeyID, it needs
to convert the memory to that KeyID, and such conversion would work from
any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

This series just uses wbinvd() to flush cache in stop_this_cpu()
following AMD's SME.



Kai Huang (16):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  x86/virt/tdx: Add SEAMCALL infrastructure
  x86/virt/tdx: Add skeleton to enable TDX on demand
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Designate reserved areas for all TDMRs
  x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  186 ++++-
 arch/x86/Kconfig                 |   15 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/tdx/tdx.c          |    6 +-
 arch/x86/include/asm/msr-index.h |    3 +
 arch/x86/include/asm/tdx.h       |   21 +
 arch/x86/kernel/process.c        |    7 +-
 arch/x86/kernel/setup.c          |    2 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1324 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  150 ++++
 arch/x86/virt/vmx/tdx/tdxcall.S  |   19 +-
 15 files changed, 1776 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 1e70c680375aa33cca97bff0bca68c0f82f5023c
-- 
2.39.2



* [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-16 12:37   ` David Hildenbrand
  2023-03-06 14:13 ` [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
defined by the TDX module spec and used as TDX module ABI.  Currently,
they are used in try_accept_one() when the TDX guest tries to accept a
page, but try_accept_one() uses hard-coded magic values.

Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one().  TDX host support will need to use them too.
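
For readers unfamiliar with the encoding, the sketch below shows how
the three ABI values relate to the sizes in bytes.  The macro values
are the ones this patch defines; the helper itself is hypothetical and
only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TDX_PS_4K	0
#define TDX_PS_2M	1
#define TDX_PS_1G	2

/* Hypothetical helper: bytes covered by one page of the given TDX size. */
static uint64_t tdx_ps_to_bytes(unsigned int ps)
{
	assert(ps <= TDX_PS_1G);
	/* Each level is 512 (2^9) times larger than the previous one. */
	return UINT64_C(4096) << (9 * ps);
}
```

So TDX_PS_2M maps to 4096 << 9 bytes and TDX_PS_1G to 4096 << 18 bytes.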

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v9 -> v10:
 - No change.

v8 -> v9:
 - Added Dave's Reviewed-by

v7 -> v8:
 - Improved the comment of TDX supported page sizes macros (Dave)

v6 -> v7:
 - Removed the helper to convert kernel page level to TDX page level.
 - Changed to use macro to define TDX supported page sizes.

---
 arch/x86/coco/tdx/tdx.c    | 6 +++---
 arch/x86/include/asm/tdx.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index b593009b30ab..e27c3cd97fcb 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -777,13 +777,13 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 	 */
 	switch (pg_level) {
 	case PG_LEVEL_4K:
-		page_size = 0;
+		page_size = TDX_PS_4K;
 		break;
 	case PG_LEVEL_2M:
-		page_size = 1;
+		page_size = TDX_PS_2M;
 		break;
 	case PG_LEVEL_1G:
-		page_size = 2;
+		page_size = TDX_PS_1G;
 		break;
 	default:
 		return false;
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..25fd6070dc0b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,11 @@
 
 #ifndef __ASSEMBLY__
 
+/* TDX supported page sizes from the TDX module ABI. */
+#define TDX_PS_4K	0
+#define TDX_PS_2M	1
+#define TDX_PS_1G	2
+
 /*
  * Used to gather the output registers values of the TDCALL and SEAMCALL
  * instructions when requesting services from the TDX module.
-- 
2.39.2



* [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
  2023-03-06 14:13 ` [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-16 12:48   ` David Hildenbrand
  2023-03-06 14:13 ` [PATCH v10 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

TDX doesn't trust the BIOS.  During machine boot, TDX verifies the TDX
private KeyIDs are consistently and correctly programmed by the BIOS
across all CPU packages before it enables TDX on any CPU core.  A valid
TDX private KeyID range on BSP indicates TDX has been enabled by the
BIOS, otherwise the BIOS is buggy.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.

Add a new early_initcall(tdx_init) to detect TDX by detecting TDX
private KeyIDs.  Also add a function to report whether TDX is enabled by
the BIOS.  Similar to AMD SME, kexec() will use it to determine whether
cache flush is needed.

The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
own protection.  Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.
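
The KeyID split described above can be modelled in userspace as
follows.  This is a sketch that mirrors the arithmetic in this patch
(TDX KeyIDs start after the last MKTME KeyID, and the first TDX KeyID
becomes the global KeyID).  The struct and function names are
hypothetical, and the raw value is passed in instead of being read
from the MSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the result of the split. */
struct keyid_layout {
	uint32_t global_keyid;
	uint32_t guest_keyid_start;
	uint32_t nr_guest_keyids;
};

/*
 * Split a raw partitioning value:
 *   bits 31:0  = number of MKTME KeyIDs
 *   bits 63:32 = number of TDX private KeyIDs
 * Returns false when no KeyID would be left for TDX guests.
 */
static bool split_tdx_keyids(uint64_t raw, struct keyid_layout *out)
{
	uint32_t nr_mktme = (uint32_t)raw;
	uint32_t nr_tdx = (uint32_t)(raw >> 32);

	/* One KeyID goes to the module; at least one must remain. */
	if (nr_tdx < 2)
		return false;

	out->global_keyid = nr_mktme + 1;	/* first TDX KeyID */
	out->guest_keyid_start = out->global_keyid + 1;
	out->nr_guest_keyids = nr_tdx - 1;
	return true;
}
```

For example, with 31 MKTME KeyIDs and 32 TDX KeyIDs, KeyID 32 becomes
the global KeyID and KeyIDs 33 through 63 are left for guests.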

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (as distinct from TDX guest kernel
support).  So far only KVM uses TDX.  Make the new config option depend
on KVM_INTEL.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v9 -> v10:
 - No change.

v8 -> v9:
 - Moved MSR macro from local tdx.h to <asm/msr-index.h> (Dave).
 - Moved reserving the TDX global KeyID from later patch to here.
 - Changed 'tdx_keyid_start' and 'nr_tdx_keyids' to
   'tdx_guest_keyid_start' and 'tdx_nr_guest_keyids' to represent KeyIDs
   can be used by guest. (Dave)
 - Slight changelog update according to above changes.

v7 -> v8: (address Dave's comments)
 - Improved changelog:
    - "KVM user" -> "The TDX module will be initialized by KVM when ..."
    - Changed "tdx_int" part to "Just say what this patch is doing"
    - Fixed the last sentence of "kexec()" paragraph
  - detect_tdx() -> record_keyid_partitioning()
  - Improved how to calculate tdx_keyid_start.
  - tdx_keyid_num -> nr_tdx_keyids.
  - Improved dmesg printing.
  - Add comment to clear_tdx().

v6 -> v7:
 - No change.

v5 -> v6:
 - Removed SEAMRR detection to make code simpler.
 - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
 - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).

---
 arch/x86/Kconfig                 |  12 ++++
 arch/x86/Makefile                |   2 +
 arch/x86/include/asm/msr-index.h |   3 +
 arch/x86/include/asm/tdx.h       |   7 +++
 arch/x86/virt/Makefile           |   2 +
 arch/x86/virt/vmx/Makefile       |   2 +
 arch/x86/virt/vmx/tdx/Makefile   |   2 +
 arch/x86/virt/vmx/tdx/tdx.c      | 105 +++++++++++++++++++++++++++++++
 8 files changed, 135 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..fc010973a6ff 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1952,6 +1952,18 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	depends on KVM_INTEL
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9cf07322875a..972b5a64ce38 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -252,6 +252,8 @@ archheaders:
 
 libs-y  += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI)            += arch/x86/pci/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 37ff47552bcb..952374ddb167 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -512,6 +512,9 @@
 #define MSR_RELOAD_PMC0			0x000014c1
 #define MSR_RELOAD_FIXED_CTR0		0x00001309
 
+/* KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
 /*
  * AMD64 MSRs. Not complete. See the architecture manual for a more
  * complete list.
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 25fd6070dc0b..4dfe2e794411 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else	/* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..93ca8b73e1f1
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..a600b5d0879d
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2023 Intel Corporation.
+ *
+ * Intel Trust Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+
+static u32 tdx_global_keyid __ro_after_init;
+static u32 tdx_guest_keyid_start __ro_after_init;
+static u32 tdx_nr_guest_keyids __ro_after_init;
+
+/*
+ * Use tdx_global_keyid to indicate that TDX is uninitialized.
+ * This is used in TDX initialization error paths to take it from
+ * initialized -> uninitialized.
+ */
+static void __init clear_tdx(void)
+{
+	tdx_global_keyid = 0;
+}
+
+static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
+					    u32 *nr_tdx_keyids)
+{
+	u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
+	int ret;
+
+	/*
+	 * IA32_MKTME_KEYID_PARTITIONING:
+	 *   Bit [31:0]:	Number of MKTME KeyIDs.
+	 *   Bit [63:32]:	Number of TDX private KeyIDs.
+	 */
+	ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
+			&_nr_tdx_keyids);
+	if (ret)
+		return -ENODEV;
+
+	if (!_nr_tdx_keyids)
+		return -ENODEV;
+
+	/* TDX KeyIDs start after the last MKTME KeyID. */
+	_tdx_keyid_start = _nr_mktme_keyids + 1;
+
+	*tdx_keyid_start = _tdx_keyid_start;
+	*nr_tdx_keyids = _nr_tdx_keyids;
+
+	return 0;
+}
+
+static int __init tdx_init(void)
+{
+	u32 tdx_keyid_start, nr_tdx_keyids;
+	int err;
+
+	err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids);
+	if (err)
+		return err;
+
+	pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
+			tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
+
+	/*
+	 * The TDX module itself requires one 'TDX global KeyID' to
+	 * protect its metadata.  Just use the first one.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+	tdx_keyid_start++;
+	nr_tdx_keyids--;
+
+	/*
+	 * If there's no more TDX KeyID left, KVM won't be able to run
+	 * any TDX guest.  Disable TDX in this case as initializing the
+	 * TDX module alone is meaningless.
+	 */
+	if (!nr_tdx_keyids) {
+		pr_info("initialization failed: too few private KeyIDs available.\n");
+		goto no_tdx;
+	}
+
+	tdx_guest_keyid_start = tdx_keyid_start;
+	tdx_nr_guest_keyids = nr_tdx_keyids;
+
+	return 0;
+no_tdx:
+	clear_tdx();
+	return -ENODEV;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+	return !!tdx_global_keyid;
+}
-- 
2.39.2



* [PATCH v10 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
  2023-03-06 14:13 ` [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
  2023-03-06 14:13 ` [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-16 12:57   ` David Hildenbrand
  2023-03-06 14:13 ` [PATCH v10 04/16] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

TDX capable platforms are locked to x2APIC mode and cannot fall back to
the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.

Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v9 -> v10:
 - No change.

v8 -> v9:
 - Added Dave's Reviewed-by.

v7 -> v8: (Dave)
 - Only make INTEL_TDX_HOST depend on X86_X2APIC but removed other code
 - Rewrote the changelog.

v6 -> v7:
 - Changed to use "Link" for the two lore links to get rid of checkpatch
   warning.

---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fc010973a6ff..6dd5d5586099 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1957,6 +1957,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	depends on KVM_INTEL
+	depends on X86_X2APIC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
-- 
2.39.2



* [PATCH v10 04/16] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (2 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.  The TDX
module establishes a new SEAMCALL ABI which allows the host to
initialize the module and to manage VMs.

Add infrastructure to make SEAMCALLs.  The SEAMCALL ABI is very similar
to the TDCALL ABI and leverages much TDCALL infrastructure.

The SEAMCALL instruction causes #GP when TDX isn't enabled by the BIOS,
and #UD when the CPU is not in VMX operation.  Currently, only KVM code
mucks with VMX enabling, and KVM is the only user of TDX.  This
implementation chooses to make KVM itself responsible for enabling VMX
before using TDX and lets the rest of the kernel stay blissfully unaware
of VMX.

The current TDX_MODULE_CALL macro handles neither #GP nor #UD.  The
kernel would oops if SEAMCALL were mistakenly made without enabling VMX
first.  Architecturally, there is no CPU flag to check whether the CPU
is in VMX operation.  Also, if a BIOS were buggy, it could still report
valid TDX private KeyIDs when TDX actually couldn't be enabled.

Extend the TDX_MODULE_CALL macro to handle #UD and #GP to return error
codes.  Introduce two new TDX error codes for them respectively so the
caller can distinguish.

Also add a wrapper function around SEAMCALL that converts the SEAMCALL
error code to a kernel error code and prints the SEAMCALL error code to
help the user understand what went wrong.
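
For illustration only, here is a minimal userspace sketch of how the
software-defined error codes compose and how a wrapper could map them
to kernel-style error codes.  It assumes TDX_ERROR is bit 63 (that
definition is not part of this patch), expands GENMASK_ULL(47, 40) by
hand, and uses the x86 trap numbers 6 (#UD) and 13 (#GP).  The helpers
trap_to_seamcall_error() and seamcall_ret_to_errno() are hypothetical
names; they only mirror the assembly fixup path and the switch
statement in seamcall() and are not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: TDX_ERROR is bit 63 (defined outside this patch). */
#define TDX_ERROR			(1ULL << 63)
/* GENMASK_ULL(47, 40) expanded by hand: bits 40..47 set. */
#define TDX_SW_ERROR			(TDX_ERROR | (0xffULL << 40))
#define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | 0xFFFF0000ULL)

/* x86 exception vectors for #UD and #GP. */
#define X86_TRAP_UD	6
#define X86_TRAP_GP	13

#define TDX_SEAMCALL_GP	(TDX_SW_ERROR | X86_TRAP_GP)
#define TDX_SEAMCALL_UD	(TDX_SW_ERROR | X86_TRAP_UD)

/* Models the exception fixup: trap number in, SW-defined error code out. */
static uint64_t trap_to_seamcall_error(unsigned int trapnr)
{
	assert(trapnr < 32);	/* x86 exception vectors are 0..31 */
	return TDX_SW_ERROR | trapnr;
}

/* Hypothetical helper mirroring the error mapping done in seamcall(). */
static int seamcall_ret_to_errno(uint64_t sret)
{
	if (!sret)
		return 0;

	switch (sret) {
	case TDX_SEAMCALL_GP:			/* TDX not enabled by BIOS */
	case TDX_SEAMCALL_VMFAILINVALID:	/* TDX module not loaded */
		return -ENODEV;
	case TDX_SEAMCALL_UD:			/* CPU not in VMX operation */
		return -EINVAL;
	default:				/* any other SEAMCALL error */
		return -EIO;
	}
}
```

Keeping the trap number in the low bits means the caller can tell #GP
from #UD with a plain comparison, without a second output register.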

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v9 -> v10:
 - Made the TDX_SEAMCALL_{GP|UD} error codes unconditional instead of
   defining them only when INTEL_TDX_HOST is enabled. (Dave)
 - Slightly improved changelog to explain why assembly code is added to
   handle #UD and #GP.

v8 -> v9:
 - Changed patch title (Dave).
 - Enhanced seamcall() to include the cpu id to the error message when
   SEAMCALL fails.

v7 -> v8:
 - Improved changelog (Dave):
   - Trim down some sentences (Dave).
   - Removed __seamcall() and seamcall() function name and changed
     accordingly (Dave).
   - Improved the sentence explaining why to handle #GP (Dave).
 - Added code to print out error message in seamcall(), following
   the idea that tdx_enable() to return universal error and print out
   error message to make clear what's going wrong (Dave).  Also mention
   this in changelog.

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to handle #UD and #GP (Dave).
 - Moved the seamcall() wrapper function to this patch, and used a
   temporary __always_unused to avoid compile warning (Dave).

v3 -> v5 (no feedback on v4):
 - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
   SEAMCALL itself fails.
 - Improve the changelog.

---
 arch/x86/include/asm/tdx.h       |  5 +++
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 +++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c      | 56 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 10 ++++++
 arch/x86/virt/vmx/tdx/tdxcall.S  | 19 +++++++++--
 6 files changed, 141 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4dfe2e794411..b489b5b9de5d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,8 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
+#include <asm/trapnr.h>
+
 /*
  * SW-defined error codes.
  *
@@ -18,6 +20,9 @@
 #define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | _UL(0xFFFF0000))
 
+#define TDX_SEAMCALL_GP			(TDX_SW_ERROR | X86_TRAP_GP)
+#define TDX_SEAMCALL_UD			(TDX_SW_ERROR | X86_TRAP_UD)
+
 #ifndef __ASSEMBLY__
 
 /* TDX supported page sizes from the TDX module ABI. */
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 93ca8b73e1f1..38d534f2c113 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y += tdx.o
+obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f81be6b9c133
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *		  (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function.  Additional
+ * output operands are saved in @out (if it is provided by the caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *			 stored temporarily in R12 (not
+ *			 used by the P-SEAMLDR or the TDX
+ *			 module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a600b5d0879d..b65b838f3b5d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -12,9 +12,11 @@
 #include <linux/init.h>
 #include <linux/errno.h>
 #include <linux/printk.h>
+#include <linux/smp.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
+#include "tdx.h"
 
 static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
@@ -103,3 +105,57 @@ bool platform_tdx_enabled(void)
 {
 	return !!tdx_global_keyid;
 }
+
+/*
+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
+ * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
+ * leaf function return code and the additional output respectively if
+ * not NULL.
+ */
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				    u64 *seamcall_ret,
+				    struct tdx_module_output *out)
+{
+	int cpu, ret = 0;
+	u64 sret;
+
+	/* Need a stable CPU id for printing error message */
+	cpu = get_cpu();
+
+	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+	/* Save SEAMCALL return code if the caller wants it */
+	if (seamcall_ret)
+		*seamcall_ret = sret;
+
+	/* SEAMCALL was successful */
+	if (!sret)
+		goto out;
+
+	switch (sret) {
+	case TDX_SEAMCALL_GP:
+		pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n");
+		ret = -ENODEV;
+		break;
+	case TDX_SEAMCALL_VMFAILINVALID:
+		pr_err_once("TDX module is not loaded.\n");
+		ret = -ENODEV;
+		break;
+	case TDX_SEAMCALL_UD:
+		pr_err_once("SEAMCALL failed: CPU %d is not in VMX operation.\n",
+				cpu);
+		ret = -EINVAL;
+		break;
+	default:
+		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
+				cpu, fn, sret);
+		if (out)
+			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
+					out->rcx, out->rdx, out->r8,
+					out->r9, out->r10, out->r11);
+		ret = -EIO;
+	}
+out:
+	put_cpu();
+	return ret;
+}
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..48ad1a1ba737
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/types.h>
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+#endif
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..757b0c34be10 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>
 
 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
 	/* Leave input param 2 in RDX */
 
 	.if \host
+1:
 	seamcall
 	/*
 	 * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,10 +59,23 @@
 	 * This value will never be used as actual SEAMCALL error code as
 	 * it is from the Reserved status code class.
 	 */
-	jnc .Lno_vmfailinvalid
+	jnc .Lseamcall_out
 	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+	jmp .Lseamcall_out
+2:
+	/*
+	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
+	 * the trap number.  Convert the trap number to the TDX error
+	 * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
+	 *
+	 * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
+	 * only accepts 32-bit immediate at most.
+	 */
+	mov $TDX_SW_ERROR, %r12
+	orq %r12, %rax
 
+	_ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:
 	.else
 	tdcall
 	.endif
-- 
2.39.2



* [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (3 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 04/16] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-08 22:27   ` Isaku Yamahata
  2023-03-16  0:31   ` Isaku Yamahata
  2023-03-06 14:13 ` [PATCH v10 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

To enable TDX the kernel needs to initialize TDX from two perspectives:
1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
on one logical cpu before the kernel wants to make any other SEAMCALLs
on that cpu (including those involved during module initialization and
running TDX guests).

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach that defers initializing TDX until there is a real
need (e.g. when requested by KVM).  This approach has the following
advantages:

1) It avoids consuming the memory that must be allocated by kernel and
given to the TDX module as metadata (~1/256th of the TDX-usable memory),
and also saves the CPU cycles of initializing the TDX module (and the
metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running.  The update procedure shares quite a few steps with this "on
demand" initialization mechanism.  The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism.  A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
code mucks with VMX enabling.  If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that.  Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.
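
As a rough illustration of the metadata cost mentioned in 1) above,
the sketch below applies the ~1/256 ratio to a given amount of
TDX-usable memory.  The helper name tdx_metadata_estimate_bytes() is
hypothetical and the ratio is only the approximation quoted in this
changelog; the real size depends on values reported by the TDX module.

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimate only: ~1/256th of the TDX-usable memory. */
static uint64_t tdx_metadata_estimate_bytes(uint64_t usable_gib)
{
	/* Guard against overflow when converting GiB to bytes. */
	assert(usable_gib <= (UINT64_MAX >> 30));

	return (usable_gib << 30) / 256;
}
```

With this approximation, 512 GiB of TDX-usable memory would need about
2 GiB of metadata, which is the memory saved when TDX is never used.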

Similar to module initialization, also make the per-cpu initialization
"on demand" as it also depends on VMX to be enabled.

Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
module and enable TDX on local cpu respectively.  For now tdx_enable()
is a placeholder.  The TODO list will be pared down as functionality is
added.

In tdx_enable() use a state machine protected by a mutex to make sure
the initialization will only be done once, as tdx_enable() can be called
multiple times (e.g. the KVM module can be reloaded) and may be called
concurrently by other kernel components in the future.
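
The state machine described above can be sketched in userspace as
follows.  This is only a model under stated assumptions: a pthread
mutex stands in for the kernel mutex, do_init_once() is a hypothetical
stand-in for the real multi-step initialization, and all names are
made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

enum module_status { MODULE_UNKNOWN, MODULE_INITIALIZED, MODULE_ERROR };

static enum module_status module_status;	/* zero-init: MODULE_UNKNOWN */
static pthread_mutex_t module_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_init_attempts;

/* Hypothetical stand-in for the real initialization; always succeeds. */
static int do_init_once(void)
{
	nr_init_attempts++;
	return 0;
}

static int module_enable(void)
{
	int ret;

	pthread_mutex_lock(&module_lock);

	switch (module_status) {
	case MODULE_UNKNOWN:
		/* First caller: try once and record the outcome for good. */
		ret = do_init_once() ? -EINVAL : 0;
		module_status = ret ? MODULE_ERROR : MODULE_INITIALIZED;
		break;
	case MODULE_INITIALIZED:
		ret = 0;
		break;
	default:
		/* A previous attempt failed; never retry. */
		ret = -EINVAL;
		break;
	}

	/* Once any caller has been through, the state is final. */
	assert(module_status != MODULE_UNKNOWN);

	pthread_mutex_unlock(&module_lock);
	return ret;
}
```

Because the outcome is recorded under the lock, later callers (for
example after a module reload) get the cached result instead of
triggering a second initialization.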

The per-cpu initialization on each cpu can only be done once during the
module's lifetime.  Use a per-cpu variable to track its status to make
sure it is only done once in tdx_cpu_enable().

Also, a SEAMCALL to do TDX module global initialization must be done
once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
it inside tdx_cpu_enable() too (if it hasn't been done).
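
A single-threaded userspace model of this ordering is sketched below.
It is an illustration, not the kernel code: a plain array indexed by
cpu number stands in for the per-cpu variable, no locking is modelled,
and fake_sys_init() and fake_sys_lp_init() are hypothetical stand-ins
for the two SEAMCALLs.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_NR_CPUS	4
#define INIT_DONE	(1U << 0)
#define INIT_FAILED	(1U << 1)

static unsigned int global_init_status;
static unsigned int lp_init_status[MODEL_NR_CPUS];	/* models per-cpu data */
static int nr_global_calls, nr_lp_calls;

/* Hypothetical stand-ins for the global and per-cpu init SEAMCALLs. */
static int fake_sys_init(void)
{
	nr_global_calls++;
	return 0;
}

static int fake_sys_lp_init(void)
{
	nr_lp_calls++;
	return 0;
}

static int model_try_init_global(void)
{
	int ret;

	/* Already attempted: report the recorded outcome. */
	if (global_init_status & INIT_DONE)
		return (global_init_status & INIT_FAILED) ? -EINVAL : 0;

	ret = fake_sys_init();
	global_init_status = INIT_DONE | (ret ? INIT_FAILED : 0);
	return ret;
}

static int model_cpu_enable(int cpu)
{
	int ret;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (lp_init_status[cpu] & INIT_DONE)
		return (lp_init_status[cpu] & INIT_FAILED) ? -EINVAL : 0;

	/* Global init must come first; skip per-cpu init if it failed. */
	ret = model_try_init_global();
	if (!ret)
		ret = fake_sys_lp_init();

	lp_init_status[cpu] = INIT_DONE | (ret ? INIT_FAILED : 0);
	return ret;
}
```

Recording "done" together with "failed" keeps a failed attempt from
being retried, which matches the once-per-lifetime constraint.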

tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
per-cpu initialization must be done before those SEAMCALLs are invoked
on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
per-cpu initialization regardless of whether the TDX module has been
initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
but assume the caller has disabled CPU hotplug and done VMXON and
tdx_cpu_enable() on all online cpus before calling tdx_enable().

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v9 -> v10:
 - Merged the patch to handle per-cpu initialization to this patch to
   tell the story better.
 - Changed how to handle the per-cpu initialization to only provide a
   tdx_cpu_enable() function to let the user of TDX do it when the
   user wants to run TDX code on a certain cpu.
 - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
   call lockdep_assert_cpus_held() to assume the caller has done that.
 - Improved comments around tdx_enable() and tdx_cpu_enable().
 - Improved changelog to tell the story better accordingly.

v8 -> v9:
 - Removed detailed TODO list in the changelog (Dave).
 - Added back steps to do module global initialization and per-cpu
   initialization in the TODO list comment.
 - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h

v7 -> v8:
 - Refined changelog (Dave).
 - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
 - Add a "TODO list" comment in init_tdx_module() to list all steps of
   initializing the TDX Module to tell the story (Dave).
 - Made tdx_enable() universally return -EINVAL, and removed nonsense
   comments (Dave).
 - Simplified __tdx_enable() to only handle success or failure.
 - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
 - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
 - Improved comments (Dave).
 - Pointed out 'tdx_module_status' is a software thing (Dave).

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to set status to TDX_MODULE_NONE if TDX module is not
   loaded (Chao)
 - Added Chao's Reviewed-by.
 - Improved comments around cpus_read_lock().

v3 -> v5 (no feedback on v4):
 - Removed the check that SEAMRR and TDX KeyID have been detected on
   all present cpus.
 - Removed tdx_detect().
 - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
   hotplug lock and return early with error message.
 - Improved dmesg printing for TDX module detection and initialization.

---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 182 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  25 +++++
 3 files changed, 211 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b489b5b9de5d..112a5b9bd5cd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,8 +102,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
+int tdx_cpu_enable(void);
+int tdx_enable(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_cpu_enable(void) { return -EINVAL; }
+static inline int tdx_enable(void)  { return -EINVAL; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b65b838f3b5d..29127cb70f51 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,10 @@
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/smp.h>
+#include <linux/cpu.h>
+#include <linux/spinlock.h>
+#include <linux/percpu-defs.h>
+#include <linux/mutex.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -22,6 +26,18 @@ static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+static unsigned int tdx_global_init_status;
+static DEFINE_SPINLOCK(tdx_global_init_lock);
+#define TDX_GLOBAL_INIT_DONE	_BITUL(0)
+#define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
+
+static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
+#define TDX_LP_INIT_DONE	_BITUL(0)
+#define TDX_LP_INIT_FAILED	_BITUL(1)
+
+static enum tdx_module_status_t tdx_module_status;
+static DEFINE_MUTEX(tdx_module_lock);
+
 /*
  * Use tdx_global_keyid to indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -159,3 +175,169 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	put_cpu();
 	return ret;
 }
+
+static int try_init_module_global(void)
+{
+	int ret;
+
+	/*
+	 * The TDX module global initialization only needs to be done
+	 * once on any cpu.
+	 */
+	spin_lock(&tdx_global_init_lock);
+
+	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
+		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
+			-EINVAL : 0;
+		goto out;
+	}
+
+	/* All '0's are just unused parameters. */
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+
+	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
+	if (ret)
+		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
+out:
+	spin_unlock(&tdx_global_init_lock);
+
+	return ret;
+}
+
+/**
+ * tdx_cpu_enable - Enable TDX on local cpu
+ *
+ * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
+ * global initialization SEAMCALL if not done) on local cpu to make this
+ * cpu be ready to run any other SEAMCALLs.
+ *
+ * Note this function must be called when preemption is not possible
+ * (i.e. via SMP call or in per-cpu thread).  It is not IRQ safe either
+ * (i.e. cannot be called in per-cpu thread and via SMP call from remote
+ * cpu simultaneously).
+ *
+ * Return 0 on success, otherwise errors.
+ */
+int tdx_cpu_enable(void)
+{
+	unsigned int lp_status;
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -EINVAL;
+
+	lp_status = __this_cpu_read(tdx_lp_init_status);
+
+	/* Already done */
+	if (lp_status & TDX_LP_INIT_DONE)
+		return lp_status & TDX_LP_INIT_FAILED ? -EINVAL : 0;
+
+	/*
+	 * The TDX module global initialization is the very first step
+	 * to enable TDX.  Need to do it first (if hasn't been done)
+	 * before doing the per-cpu initialization.
+	 */
+	ret = try_init_module_global();
+
+	/*
+	 * If the module global initialization failed, there's no point
+	 * to do the per-cpu initialization.  Just mark it as done but
+	 * failed.
+	 */
+	if (ret)
+		goto update_status;
+
+	/* All '0's are just unused parameters */
+	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+
+update_status:
+	lp_status = TDX_LP_INIT_DONE;
+	if (ret)
+		lp_status |= TDX_LP_INIT_FAILED;
+
+	this_cpu_write(tdx_lp_init_status, lp_status);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_enable);
+
+static int init_tdx_module(void)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Get TDX module information and TDX-capable memory regions.
+	 *  - Build the list of TDX-usable memory regions.
+	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
+	 *    all TDX-usable memory regions.
+	 *  - Configure the TDMRs and the global KeyID to the TDX module.
+	 *  - Configure the global KeyID on all packages.
+	 *  - Initialize all TDMRs.
+	 *
+	 *  Return error before all steps are done.
+	 */
+	return -EINVAL;
+}
+
+static int __tdx_enable(void)
+{
+	int ret;
+
+	ret = init_tdx_module();
+	if (ret) {
+		pr_err("TDX module initialization failed (%d)\n", ret);
+		tdx_module_status = TDX_MODULE_ERROR;
+		/*
+		 * Just return one universal error code.
+		 * For now the caller cannot recover anyway.
+		 */
+		return -EINVAL;
+	}
+
+	pr_info("TDX module initialized.\n");
+	tdx_module_status = TDX_MODULE_INITIALIZED;
+
+	return 0;
+}
+
+/**
+ * tdx_enable - Enable TDX module to make it ready to run TDX guests
+ *
+ * This function assumes the caller has: 1) held read lock of CPU hotplug
+ * lock to prevent any new cpu from becoming online; 2) done both VMXON
+ * and tdx_cpu_enable() on all online cpus.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return 0 if TDX is enabled successfully, otherwise error.
+ */
+int tdx_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -EINVAL;
+
+	lockdep_assert_cpus_held();
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_enable();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		/* Already initialized, great, tell the caller. */
+		ret = 0;
+		break;
+	default:
+		/* Failed to initialize in the previous attempts */
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 48ad1a1ba737..4d6220e86ccf 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -4,6 +4,31 @@
 
 #include <linux/types.h>
 
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability.  The
+ * architectural definitions come first.
+ */
+
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
+
+/*
+ * Do not put any hardware-defined TDX structure representations below
+ * this comment!
+ */
+
+/* Kernel defined TDX module status during module initialization. */
+enum tdx_module_status_t {
+	TDX_MODULE_UNKNOWN,
+	TDX_MODULE_INITIALIZED,
+	TDX_MODULE_ERROR
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.39.2



* [PATCH v10 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (4 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

Start to transit out the "multi-steps" to initialize the TDX module.

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.

CMRs tell the kernel which memory is TDX compatible.  The kernel takes
CMRs (plus a little more metadata) and constructs "TD Memory Regions"
(TDMRs).  TDMRs let the kernel grant TDX protections to some or all of
the CMR areas.

The TDX module also reports necessary information to let the kernel
build TDMRs and run TDX guests in structure 'tdsysinfo_struct'.  The
list of CMRs, along with the TDX module information, is available to
the kernel by querying the TDX module.

In preparation for constructing TDMRs, get the TDX module information
and the list of CMRs.  Print out the CMRs to help the user see which
memory regions are TDX convertible.
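
For readers who want to see the printing logic in isolation, here is a
small userspace sketch.  It assumes, as the patch comments state, that
the reported array may end with empty (size 0) entries, and it stops at
the first one.  The sample values in sample_cmrs[] are made up for the
example, and this print_cmrs() returns a count only to make the sketch
easy to check.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct cmr_info {
	uint64_t base;
	uint64_t size;
};

#define MAX_CMRS	32

/* Made-up sample: two valid CMRs, the remaining entries are empty. */
static const struct cmr_info sample_cmrs[MAX_CMRS] = {
	{ .base = 0x100000ULL,    .size = 0x7ff00000ULL },
	{ .base = 0x100000000ULL, .size = 0x80000000ULL },
};

/* Print each valid CMR as a half-open range; return how many were printed. */
static int print_cmrs(const struct cmr_info *cmrs, int nr_cmrs)
{
	int i;

	assert(nr_cmrs >= 0 && nr_cmrs <= MAX_CMRS);

	for (i = 0; i < nr_cmrs; i++) {
		/* Tail entries may be empty; stop at the first one. */
		if (!cmrs[i].size)
			break;

		printf("CMR: [0x%" PRIx64 ", 0x%" PRIx64 ")\n",
		       cmrs[i].base, cmrs[i].base + cmrs[i].size);
	}

	return i;
}
```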

The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot
of info about the TDX module.  Fully define the entire structure, but
only use the fields necessary to build the TDMRs and pr_info() some
basics about the module.  The rest of the fields will get used by KVM.

For now both 'tdsysinfo_struct' and CMRs are only used during the module
initialization.  But because they are both relatively big, declare them
inside the module initialization function but as static variables.
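
The padding and alignment trick used for the static declaration can be
tried out in userspace with the sketch below.  It copies the shape of
DECLARE_PADDED_STRUCT() and PADDED_STRUCT() from this patch, spells
__aligned() as the GCC/Clang attribute, and uses a hypothetical
two-field struct demo_info instead of the real 'tdsysinfo_struct'.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copy of the patch's macros (GCC/Clang attribute syntax). */
#define DECLARE_PADDED_STRUCT(type, name, size, alignment)	\
	struct type##_padded {					\
		union {						\
			struct type name;			\
			uint8_t padding[size];			\
		};						\
	} name##_padded __attribute__((aligned(alignment)))

#define PADDED_STRUCT(name)	(name##_padded.name)

#define DEMO_SIZE	1024
#define DEMO_ALIGNMENT	1024

/* Hypothetical, much smaller stand-in for 'tdsysinfo_struct'. */
struct demo_info {
	uint32_t attributes;
	uint32_t vendor_id;
};

/* Static storage: padded to DEMO_SIZE bytes, aligned to DEMO_ALIGNMENT. */
static DECLARE_PADDED_STRUCT(demo_info, demo, DEMO_SIZE, DEMO_ALIGNMENT);

static struct demo_info *demo_info_get(void)
{
	struct demo_info *info = &PADDED_STRUCT(demo);

	/* The padded wrapper guarantees the alignment of the inner struct. */
	assert((uintptr_t)info % DEMO_ALIGNMENT == 0);
	return info;
}
```

The union makes the wrapper as large as the padding array, so the
callee can write the full fixed-size structure even though the C type
only names the fields the kernel actually uses.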

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - Added back "start to transit out..." as now per-cpu init has been
   moved out from tdx_enable().

v8 -> v9:
 - Removed the "start to transit out ..." part of the changelog since
   this patch is no longer the first step.
 - Changed to declare 'tdsysinfo' and 'cmr_array' as local static, and
   changed changelog accordingly (Dave).
 - Improved changelog to explain why to declare 'tdsysinfo_struct' in
   full but only use a few of its members (Dave).

v7 -> v8: (Dave)
 - Improved changelog to tell this is the first patch to transit out the
   "multi-steps" init_tdx_module().
 - Removed all CMR check/trim code but to depend on later SEAMCALL.
 - Variable 'vertical alignment' in print TDX module information.
 - Added DECLARE_PADDED_STRUCT() for padded structure.
 - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable
   (and rename them accordingly), and added -Wframe-larger-than=4096 flag
   to silence the build warning.

v6 -> v7:
 - Simplified the check of CMRs due to the fact that TDX actually
   verifies CMRs (that are passed by the BIOS) before enabling TDX.
 - Changed the function name from check_cmrs() -> trim_empty_cmrs().
 - Added CMR page aligned check so that later patch can just get the PFN
   using ">> PAGE_SHIFT".

v5 -> v6:
 - Added to also print TDX module's attribute (Isaku).
 - Removed all arguments in tdx_get_sysinfo() to use static variables
   of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used
   directly in other functions in later patches.
 - Added Isaku's Reviewed-by.

v3 -> v5 (no feedback on v4):
 - Renamed sanitize_cmrs() to check_cmrs().
 - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array
   actual size returned by TDH.SYS.INFO.
 - Changed -EFAULT to -EINVAL in a couple of places.
 - Added comments around tdx_sysinfo and tdx_cmr_array saying they are
   used by TDH.SYS.INFO ABI.
 - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function
   arguments in tdx_get_sysinfo().
 - Changed to only print BIOS-CMR when check_cmrs() fails.

---
 arch/x86/virt/vmx/tdx/tdx.c | 67 +++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 72 +++++++++++++++++++++++++++++++++++++
 2 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 29127cb70f51..981e11492d0e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
+#include <asm/page.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -261,12 +262,76 @@ int tdx_cpu_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_cpu_enable);
 
+static inline bool is_cmr_empty(struct cmr_info *cmr)
+{
+	return !cmr->size;
+}
+
+static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
+{
+	int i;
+
+	for (i = 0; i < nr_cmrs; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		/*
+		 * The array of CMRs reported via TDH.SYS.INFO can
+		 * contain tail empty CMRs.  Don't print them.
+		 */
+		if (is_cmr_empty(cmr))
+			break;
+
+		pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
+				cmr->base + cmr->size);
+	}
+}
+
+/*
+ * Get the TDX module information (TDSYSINFO_STRUCT) and the array of
+ * CMRs, and save them to @sysinfo and @cmr_array.  @sysinfo must have
+ * been padded to have enough room to save the TDSYSINFO_STRUCT.
+ */
+static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
+			   struct cmr_info *cmr_array)
+{
+	struct tdx_module_output out;
+	u64 sysinfo_pa, cmr_array_pa;
+	int ret;
+
+	sysinfo_pa = __pa(sysinfo);
+	cmr_array_pa = __pa(cmr_array);
+	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
+			cmr_array_pa, MAX_CMRS, NULL, &out);
+	if (ret)
+		return ret;
+
+	pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u\n",
+		sysinfo->attributes,	sysinfo->vendor_id,
+		sysinfo->major_version, sysinfo->minor_version,
+		sysinfo->build_date,	sysinfo->build_num);
+
+	/* R9 contains the actual entries written to the CMR array. */
+	print_cmrs(cmr_array, out.r9);
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
+	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
+			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
+	static struct cmr_info cmr_array[MAX_CMRS]
+			__aligned(CMR_INFO_ARRAY_ALIGNMENT);
+	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
+	int ret;
+
+	ret = tdx_get_sysinfo(sysinfo, cmr_array);
+	if (ret)
+		return ret;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Get TDX module information and TDX-capable memory regions.
 	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
 	 *    all TDX-usable memory regions.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 4d6220e86ccf..2f2d8737a364 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -3,6 +3,8 @@
 #define _X86_VIRT_TDX_H
 
 #include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/compiler_attributes.h>
 
 /*
  * This file contains both macros and data structures defined by the TDX
@@ -16,6 +18,76 @@
  */
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
+#define TDH_SYS_INFO		32
+
+struct cmr_info {
+	u64	base;
+	u64	size;
+} __packed;
+
+#define MAX_CMRS			32
+#define CMR_INFO_ARRAY_ALIGNMENT	512
+
+struct cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define DECLARE_PADDED_STRUCT(type, name, size, alignment)	\
+	struct type##_padded {					\
+		union {						\
+			struct type name;			\
+			u8 padding[size];			\
+		};						\
+	} name##_padded __aligned(alignment)
+
+#define PADDED_STRUCT(name)	(name##_padded.name)
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+/*
+ * The size of this structure itself is flexible.  The actual structure
+ * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be
+ * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT().
+ */
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.
+	 */
+	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
+} __packed;
 
 /*
  * Do not put any hardware-defined TDX structure representations below
-- 
2.39.2



* [PATCH v10 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (5 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-09  1:38   ` Isaku Yamahata
  2023-03-06 14:13 ` [PATCH v10 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

As a step of initializing the TDX module, the kernel needs to tell the
TDX module which memory regions can be used by the TDX module as TDX
guest memory.

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  Once this is done, those "TDX-usable" memory regions
are fixed for the module's lifetime.

To keep things simple, assume that all TDX-protected memory will come
from the page allocator.  Make sure all pages in the page allocator
*are* TDX-usable memory.

As TDX-usable memory is a fixed configuration, take a snapshot of the
memory configuration from memblocks at the time of module initialization
(memblocks are modified on memory hotplug).  This snapshot is used to
enable TDX support for *this* memory configuration only.  Use a memory
hotplug notifier to ensure that no other RAM can be added outside of
this configuration.
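
As a rough userspace illustration of the snapshot-and-reject idea above
(not the kernel code in this patch), the sketch below models the snapshot
as a fixed array of PFN ranges and checks whether a range going online
falls entirely within one entry.  The names pfn_range, snapshot_add() and
range_is_tdx_memory(), and the 4K page size, are assumptions made for
this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model: one snapshotted range, [start_pfn, end_pfn). */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

#define MODEL_PAGE_SHIFT	12	/* assumes 4K pages */
#define MODEL_PFN_1M		((1UL << 20) >> MODEL_PAGE_SHIFT)

/*
 * Append one boot-time range to the snapshot, clamping away the first
 * 1MB.  Returns the new number of entries in the snapshot.
 */
size_t snapshot_add(struct pfn_range *snap, size_t nr, size_t max,
		    unsigned long start_pfn, unsigned long end_pfn)
{
	if (start_pfn < MODEL_PFN_1M)
		start_pfn = MODEL_PFN_1M;
	if (start_pfn >= end_pfn)
		return nr;	/* nothing left above 1MB */

	assert(nr < max);
	snap[nr].start_pfn = start_pfn;
	snap[nr].end_pfn = end_pfn;
	return nr + 1;
}

/* A range may go online only if a single snapshot entry fully contains it. */
bool range_is_tdx_memory(const struct pfn_range *snap, size_t nr,
			 unsigned long start_pfn, unsigned long end_pfn)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (start_pfn >= snap[i].start_pfn && end_pfn <= snap[i].end_pfn)
			return true;
	}
	return false;
}
```

In this model a range that straddles the end of a snapshot entry, or lies
wholly outside every entry, is rejected even if the underlying memory
would be TDX convertible, which mirrors the "static configuration" policy
described above.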

This approach works only if all memblock memory regions at the time of
module initialization are TDX convertible memory; otherwise module
initialization will fail in a later SEAMCALL when those regions are
passed to the module.  This is the case when all boot-time "system RAM"
is TDX convertible memory and no non-TDX-convertible memory is hot-added
to the core-mm before module initialization.

For instance, on the first generation of TDX machines, neither CXL
memory nor NVDIMM is TDX convertible memory.  Using the kmem driver to
hot-add any CXL memory or NVDIMM to the core-mm before module
initialization will result in failure to initialize the module.  The
SEAMCALL error code will be available in dmesg to help the user
understand the failure.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---

v9 -> v10:
 - Moved empty @tdx_memlist check out of is_tdx_memory() to make the
   logic better.
 - Added Ying's Reviewed-by.

v8 -> v9:
 - Replace "The initial support ..." with timeless sentence in both
   changelog and comments(Dave).
 - Fix run-on sentence in changelog, and sentence to explain why to
   stash off memblock (Dave).
 - Tried to improve why to choose this approach and how it works in
   changelog based on Dave's suggestion.
 - Many other comment enhancements (Dave).

v7 -> v8:
 - Trimmed down changelog (Dave).
 - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
   (Ying).
 - Moved memory hotplug handling from add_arch_memory() to
   memory_notifier (Dan/David).
 - Moved 'nid' from 'struct tdx_memblock' to a later patch (Dave).
 - {build|free}_tdx_memory() -> {build|free}_tdx_memlist() (Dave).
 - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
 - Improve the comment around first 1MB (Dave).
 - Added a comment around reserve_real_mode() to point out TDX code
   relies on first 1MB being reserved (Ying).
 - Added comment to explain why the new online memory range cannot
   cross multiple TDX memory blocks (Dave).
 - Improved other comments (Dave).

---
 arch/x86/Kconfig            |   1 +
 arch/x86/kernel/setup.c     |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 165 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   6 ++
 4 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6dd5d5586099..f23bc540778a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 88188549647c..a8a119a9b48c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1165,6 +1165,8 @@ void __init setup_arch(char **cmdline_p)
 	 *
 	 * Moreover, on machines with SandyBridge graphics or in setups that use
 	 * crashkernel the entire 1M is reserved anyway.
+	 *
+	 * Note TDX host support also requires the first 1MB to be reserved.
 	 */
 	x86_platform.realmode_reserve();
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 981e11492d0e..9149144cd7e7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -17,6 +17,13 @@
 #include <linux/spinlock.h>
 #include <linux/percpu-defs.h>
 #include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/memblock.h>
+#include <linux/memory.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
+#include <linux/pfn.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/page.h>
@@ -39,6 +46,9 @@ static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
 static enum tdx_module_status_t tdx_module_status;
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * Use tdx_global_keyid to indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -77,6 +87,54 @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 	return 0;
 }
 
+static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/*
+	 * This check assumes that the start_pfn<->end_pfn range does not
+	 * cross multiple @tdx_memlist entries.  A single memory online
+	 * event across multiple memblocks (from which @tdx_memlist
+	 * entries are derived at the time of module initialization) is
+	 * not possible.  This is because memory offline/online is done
+	 * on granularity of 'struct memory_block', and the hotpluggable
+	 * memory region (one memblock) must be multiple of memory_block.
+	 */
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
+
+static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
+			       void *v)
+{
+	struct memory_notify *mn = v;
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_OK;
+
+	/*
+	 * Empty list means TDX isn't enabled.  Allow any memory
+	 * to go online.
+	 */
+	if (list_empty(&tdx_memlist))
+		return NOTIFY_OK;
+
+	/*
+	 * The TDX memory configuration is static and can not be
+	 * changed.  Reject onlining any memory which is outside of
+	 * the static configuration whether it supports TDX or not.
+	 */
+	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
+		NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+	.notifier_call = tdx_memory_notifier,
+};
+
 static int __init tdx_init(void)
 {
 	u32 tdx_keyid_start, nr_tdx_keyids;
@@ -107,6 +165,13 @@ static int __init tdx_init(void)
 		goto no_tdx;
 	}
 
+	err = register_memory_notifier(&tdx_memory_nb);
+	if (err) {
+		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
+				err);
+		goto no_tdx;
+	}
+
 	tdx_guest_keyid_start = tdx_keyid_start;
 	tdx_nr_guest_keyids = nr_tdx_keyids;
 
@@ -316,6 +381,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 	return 0;
 }
 
+/*
+ * Add a memory region as a TDX memory block.  The caller must make sure
+ * all memory regions are added in address ascending order and don't
+ * overlap.
+ */
+static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
+			    unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+
+	/* @tmb_list is protected by mem_hotplug_lock */
+	list_add_tail(&tmb->list, tmb_list);
+	return 0;
+}
+
+static void free_tdx_memlist(struct list_head *tmb_list)
+{
+	/* @tmb_list is protected by mem_hotplug_lock */
+	while (!list_empty(tmb_list)) {
+		struct tdx_memblock *tmb = list_first_entry(tmb_list,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Ensure that all memblock memory regions are convertible to TDX
+ * memory.  Once this has been established, stash the memblock
+ * ranges off in a secondary structure because memblock is modified
+ * in memory hotplug while TDX memory regions are fixed.
+ */
+static int build_tdx_memlist(struct list_head *tmb_list)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		/*
+		 * The first 1MB is not reported as TDX convertible memory.
+		 * Although the first 1MB is always reserved and won't end up
+		 * in the page allocator, it is still in memblock's memory
+		 * regions.  Skip it manually to exclude it from TDX memory.
+		 */
+		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/*
+		 * Add the memory regions as TDX memory.  memblock already
+		 * guarantees that its regions are in ascending address
+		 * order and don't overlap.
+		 */
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memlist(tmb_list);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
@@ -329,10 +467,25 @@ static int init_tdx_module(void)
 	if (ret)
 		return ret;
 
+	/*
+	 * To keep things simple, assume that all TDX-protected memory
+	 * will come from the page allocator.  Make sure all pages in the
+	 * page allocator are TDX-usable memory.
+	 *
+	 * Build the list of "TDX-usable" memory regions which cover all
+	 * pages in the page allocator to guarantee that.  Do it while
+	 * holding mem_hotplug_lock read-lock as the memory hotplug code
+	 * path reads the @tdx_memlist to reject any new memory.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memlist(&tdx_memlist);
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
 	 *    all TDX-usable memory regions.
 	 *  - Configure the TDMRs and the global KeyID to the TDX module.
@@ -341,7 +494,15 @@ static int init_tdx_module(void)
 	 *
 	 *  Return error before all steps are done.
 	 */
-	return -EINVAL;
+	ret = -EINVAL;
+out:
+	/*
+	 * @tdx_memlist is written here and read at memory hotplug time.
+	 * Lock out memory hotplug code while building it.
+	 */
+	put_online_mems();
+
+	return ret;
 }
 
 static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 2f2d8737a364..6518024fcb68 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -101,6 +101,12 @@ enum tdx_module_status_t {
 	TDX_MODULE_ERROR
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.39.2



* [PATCH v10 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (6 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 09/16] x86/virt/tdx: Fill out " Kai Huang
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

After the kernel selects all TDX-usable memory regions, the kernel needs
to pass those regions to the TDX module via the "TD Memory Region"
(TDMR) data structure.

Add a placeholder to construct a list of TDMRs (in multiple steps) to
cover all TDX-usable memory regions.

=== Long Version ===

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory.  This metadata essentially
serves as the 'struct page' for the TDX module.  The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and cannot be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be represented.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.
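
The "roughly 1/256th" figure in the summary above can be checked with a
little arithmetic.  The sketch below is a userspace illustration only: it
assumes a 16-byte PAMT entry per tracked page (the real value is whatever
the TDX module reports in 'pamt_entry_size'), ignores any rounding of the
tables, and uses the hypothetical helper names pamt_size() and
pamt_size_total().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate size of one PAMT: one fixed-size entry per page of the
 * given size in the TDMR.  @entry_size stands in for the value the TDX
 * module reports (pamt_entry_size); callers passing 16 are assuming it.
 */
uint64_t pamt_size(uint64_t tdmr_size, uint64_t page_size, uint64_t entry_size)
{
	assert(page_size != 0 && tdmr_size % page_size == 0);
	return tdmr_size / page_size * entry_size;
}

/* Sum of the three per-page-size PAMTs (4K, 2M, 1G) for one TDMR. */
uint64_t pamt_size_total(uint64_t tdmr_size, uint64_t entry_size)
{
	return pamt_size(tdmr_size, 1ULL << 12, entry_size) +
	       pamt_size(tdmr_size, 1ULL << 21, entry_size) +
	       pamt_size(tdmr_size, 1ULL << 30, entry_size);
}
```

Under that assumption the 4K table dominates: 16 bytes per 4096-byte page
is exactly 1/256, so a 256G TDMR needs a 1G table for 4K pages, while the
2M and 1G tables add only about 2M and 4K respectively.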

As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing a list of TDMRs to the TDX module.

Constructing the list of TDMRs consists of the following steps:

1) Fill out TDMRs to cover all memory regions that the TDX module will
   use for TD memory.
2) Allocate and set up PAMT for each TDMR.
3) Designate reserved areas for each TDMR.

Add a placeholder to construct TDMRs to do the above steps.  To keep
things simple, just allocate enough space to hold the maximum number of
TDMRs up front.  Always free the space for the TDMRs after module
initialization (whether it succeeds or not), as TDMRs are only used
during module initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - Changed the TDMR list from static variable back to local variable as
   now TDX module isn't disabled when tdx_cpu_enable() fails.

v8 -> v9:
 - Changes around 'struct tdmr_info_list' (Dave):
   - Moved the declaration from tdx.c to tdx.h.
   - Renamed 'first_tdmr' to 'tdmrs'.
   - 'nr_tdmrs' -> 'nr_consumed_tdmrs'.
   - Changed 'tdmrs' to 'void *'.
   - Improved comments for all structure members.
 - Added a missing empty line in alloc_tdmr_list() (Dave).

v7 -> v8:
 - Improved changelog to tell this is one step of "TODO list" in
   init_tdx_module().
 - Other changelog improvement suggested by Dave (with "Create TDMRs" to
   "Fill out TDMRs" to align with the code).
 - Added a "TODO list" comment to lay out the steps to construct TDMRs,
   following the same idea of "TODO list" in tdx_module_init().
 - Introduced 'struct tdmr_info_list' (Dave)
 - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to
   simplify getting TDMR by given index, and reduce passing arguments
   around functions.
 - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally
   uses tdmr_size_single() (Dave).
 - tdmr_num -> nr_tdmrs (Dave).

v6 -> v7:
 - Improved commit message to explain 'int' overflow cannot happen
   in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave.

v5 -> v6:
 - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is
   used instead of memblock.
 - Added Isaku's Reviewed-by.

v3 -> v5 (no feedback on v4):
 - Moved calculating TDMR size to this patch.
 - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs
   once, instead of allocating each TDMR individually.
 - Removed "crypto protection" in the changelog.
 - -EFAULT -> -EINVAL in a couple of places.

---
 arch/x86/virt/vmx/tdx/tdx.c | 98 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 32 ++++++++++++
 2 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9149144cd7e7..2b87cedc7fce 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -24,6 +24,7 @@
 #include <linux/minmax.h>
 #include <linux/sizes.h>
 #include <linux/pfn.h>
+#include <linux/align.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/page.h>
@@ -454,6 +455,80 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 	return ret;
 }
 
+/* Calculate the actual TDMR size */
+static int tdmr_size_single(u16 max_reserved_per_tdmr)
+{
+	int tdmr_sz;
+
+	/*
+	 * The actual size of TDMR depends on the maximum
+	 * number of reserved areas.
+	 */
+	tdmr_sz = sizeof(struct tdmr_info);
+	tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;
+
+	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+}
+
+static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
+			   struct tdsysinfo_struct *sysinfo)
+{
+	size_t tdmr_sz, tdmr_array_sz;
+	void *tdmr_array;
+
+	tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr);
+	tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs;
+
+	/*
+	 * To keep things simple, allocate all TDMRs together.
+	 * The buffer needs to be physically contiguous to make
+	 * sure each TDMR is physically contiguous.
+	 */
+	tdmr_array = alloc_pages_exact(tdmr_array_sz,
+			GFP_KERNEL | __GFP_ZERO);
+	if (!tdmr_array)
+		return -ENOMEM;
+
+	tdmr_list->tdmrs = tdmr_array;
+
+	/*
+	 * Keep the size of TDMR to find the target TDMR
+	 * at a given index in the TDMR list.
+	 */
+	tdmr_list->tdmr_sz = tdmr_sz;
+	tdmr_list->max_tdmrs = sysinfo->max_tdmrs;
+	tdmr_list->nr_consumed_tdmrs = 0;
+
+	return 0;
+}
+
+static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
+{
+	free_pages_exact(tdmr_list->tdmrs,
+			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
+}
+
+/*
+ * Construct a list of TDMRs on the preallocated space in @tdmr_list
+ * to cover all TDX memory regions in @tmb_list based on the TDX module
+ * information in @sysinfo.
+ */
+static int construct_tdmrs(struct list_head *tmb_list,
+			   struct tdmr_info_list *tdmr_list,
+			   struct tdsysinfo_struct *sysinfo)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Fill out TDMRs to cover all TDX memory regions.
+	 *  - Allocate and set up PAMTs for each TDMR.
+	 *  - Designate reserved areas for each TDMR.
+	 *
+	 * Return -EINVAL until constructing TDMRs is done
+	 */
+	return -EINVAL;
+}
+
 static int init_tdx_module(void)
 {
 	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
@@ -461,6 +536,7 @@ static int init_tdx_module(void)
 	static struct cmr_info cmr_array[MAX_CMRS]
 			__aligned(CMR_INFO_ARRAY_ALIGNMENT);
 	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
+	struct tdmr_info_list tdmr_list;
 	int ret;
 
 	ret = tdx_get_sysinfo(sysinfo, cmr_array);
@@ -483,11 +559,19 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Allocate enough space for constructing TDMRs */
+	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
+	if (ret)
+		goto out_free_tdx_mem;
+
+	/* Cover all TDX-usable memory regions in TDMRs */
+	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
+	if (ret)
+		goto out_free_tdmrs;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
-	 *    all TDX-usable memory regions.
 	 *  - Configure the TDMRs and the global KeyID to the TDX module.
 	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
@@ -495,6 +579,16 @@ static int init_tdx_module(void)
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_free_tdmrs:
+	/*
+	 * Free the space for the TDMRs whether or not the initialization
+	 * is successful.  They are not needed anymore after the
+	 * module initialization.
+	 */
+	free_tdmr_list(&tdmr_list);
+out_free_tdx_mem:
+	if (ret)
+		free_tdx_memlist(&tdx_memlist);
 out:
 	/*
 	 * @tdx_memlist is written here and read at memory hotplug time.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 6518024fcb68..3ad1e06be0f1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -89,6 +89,29 @@ struct tdsysinfo_struct {
 	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
 } __packed;
 
+struct tdmr_reserved_area {
+	u64 offset;
+	u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT	512
+
+struct tdmr_info {
+	u64 base;
+	u64 size;
+	u64 pamt_1g_base;
+	u64 pamt_1g_size;
+	u64 pamt_2m_base;
+	u64 pamt_2m_size;
+	u64 pamt_4k_base;
+	u64 pamt_4k_size;
+	/*
+	 * Actual number of reserved areas depends on
+	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+	 */
+	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
@@ -107,6 +130,15 @@ struct tdx_memblock {
 	unsigned long end_pfn;
 };
 
+struct tdmr_info_list {
+	void *tdmrs;	/* Flexible array to hold 'tdmr_info's */
+	int nr_consumed_tdmrs;	/* How many 'tdmr_info's are in use */
+
+	/* Metadata for finding target 'tdmr_info' and freeing @tdmrs */
+	int tdmr_sz;	/* Size of one 'tdmr_info' */
+	int max_tdmrs;	/* How many 'tdmr_info's are allocated */
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.39.2



* [PATCH v10 09/16] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (7 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

Start carrying out the multiple steps needed to construct a list of "TD
Memory Regions" (TDMRs) to cover all TDX-usable memory regions.

The kernel configures TDX-usable memory regions by passing a list of
"TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains the
base/size of a memory region, the base/size of the associated Physical
Address Metadata Table (PAMT) and a list of reserved areas in the
region.

Do the first step to fill out a number of TDMRs to cover all TDX memory
regions.  To keep it simple, always try to use one TDMR for each memory
region.  As the first step, only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and the size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans the 1GB boundary and the former part is already
covered by the previous TDMR, just use a new TDMR for the remaining
part.
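
To make the 1G alignment rule above concrete, here is a userspace model
of the covering logic with a worked example in the note that follows.  It
is a sketch, not the kernel function: the types mem_range and tdmr, the
GiB macro and the name fill_out_tdmrs_model() are assumptions for this
example, and it returns -1 where the kernel code returns an errno.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GiB		(1ULL << 30)

/* Hypothetical userspace stand-ins for the kernel structures. */
struct mem_range { uint64_t start, end; };	/* [start, end), bytes */
struct tdmr      { uint64_t base, size; };

/*
 * Cover @regions (sorted by address, non-overlapping) with 1G-aligned
 * TDMRs.  Returns the number of TDMRs used, or -1 if @max_tdmrs is
 * not enough.
 */
int fill_out_tdmrs_model(const struct mem_range *regions, int nr_regions,
			 struct tdmr *tdmrs, int max_tdmrs)
{
	int idx = 0, i;

	assert(nr_regions > 0 && max_tdmrs > 0);
	memset(tdmrs, 0, sizeof(*tdmrs) * max_tdmrs);

	for (i = 0; i < nr_regions; i++) {
		uint64_t start = regions[i].start & ~(GiB - 1);		/* align down */
		uint64_t end = (regions[i].end + GiB - 1) & ~(GiB - 1);	/* align up */
		uint64_t cur_end = tdmrs[idx].base + tdmrs[idx].size;

		/* A non-zero size means the current TDMR is already in use. */
		if (tdmrs[idx].size) {
			if (end <= cur_end)
				continue;		/* already fully covered */
			if (start < cur_end)
				start = cur_end;	/* skip the covered part */
			if (++idx >= max_tdmrs)
				return -1;		/* TDMRs exhausted */
		}
		tdmrs[idx].base = start;
		tdmrs[idx].size = end - start;
	}
	return idx + 1;
}
```

For example, regions [1M, 1.25G), [1.5G, 1.75G), [1.875G, 3G) and
[4G, 5G) end up in three TDMRs in this model: [0, 2G) covers the first
two regions and the start of the third, [2G, 3G) covers the rest of the
third, and [4G, 5G) covers the last.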

TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there are more memory regions to cover.

There are fancier things that could be done like trying to merge
adjacent TDMRs.  This would allow more pathological memory layouts to be
supported.  But, current systems are not even close to exhausting the
existing TDMR resources in practice.  For now, keep it simple.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v9 -> v10:
 - No change.

v8 -> v9:

 - Added the last paragraph in the changelog (Dave).
 - Removed unnecessary type cast in tdmr_entry() (Dave).

---
 arch/x86/virt/vmx/tdx/tdx.c | 94 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2b87cedc7fce..e2487d872bbd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -508,6 +508,93 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
 			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
 }
 
+/* Get the TDMR from the list at the given index. */
+static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
+				    int idx)
+{
+	int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
+
+	return (void *)tdmr_list->tdmrs + tdmr_info_offset;
+}
+
+#define TDMR_ALIGNMENT		BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
+#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
+
+static inline u64 tdmr_end(struct tdmr_info *tdmr)
+{
+	return tdmr->base + tdmr->size;
+}
+
+/*
+ * Take the memory referenced in @tmb_list and populate the
+ * preallocated @tdmr_list, following all the special alignment
+ * and size rules for TDMR.
+ */
+static int fill_out_tdmrs(struct list_head *tmb_list,
+			  struct tdmr_info_list *tdmr_list)
+{
+	struct tdx_memblock *tmb;
+	int tdmr_idx = 0;
+
+	/*
+	 * Loop over TDX memory regions and fill out TDMRs to cover them.
+	 * To keep it simple, always try to use one TDMR to cover one
+	 * memory region.
+	 *
+	 * In practice TDX1.0 supports 64 TDMRs, which is big enough to
+	 * cover all memory regions in reality if the admin doesn't use
+	 * 'memmap' to create a bunch of discrete memory regions.  When
+	 * there's a real problem, enhancement can be done to merge TDMRs
+	 * to reduce the final number of TDMRs.
+	 */
+	list_for_each_entry(tmb, tmb_list, list) {
+		struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+		u64 start, end;
+
+		start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
+		end   = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
+
+		/*
+		 * A valid size indicates the current TDMR has already
+		 * been filled out to cover the previous memory region(s).
+		 */
+		if (tdmr->size) {
+			/*
+			 * Loop to the next if the current memory region
+			 * has already been fully covered.
+			 */
+			if (end <= tdmr_end(tdmr))
+				continue;
+
+			/* Otherwise, skip the already covered part. */
+			if (start < tdmr_end(tdmr))
+				start = tdmr_end(tdmr);
+
+			/*
+			 * Create a new TDMR to cover the current memory
+			 * region, or the remaining part of it.
+			 */
+			tdmr_idx++;
+			if (tdmr_idx >= tdmr_list->max_tdmrs) {
+				pr_warn("initialization failed: TDMRs exhausted.\n");
+				return -ENOSPC;
+			}
+
+			tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+		}
+
+		tdmr->base = start;
+		tdmr->size = end - start;
+	}
+
+	/* @tdmr_idx is always the index of last valid TDMR. */
+	tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
+
+	return 0;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -517,10 +604,15 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
 			   struct tdsysinfo_struct *sysinfo)
 {
+	int ret;
+
+	ret = fill_out_tdmrs(tmb_list, tdmr_list);
+	if (ret)
+		return ret;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Fill out TDMRs to cover all TDX memory regions.
 	 *  - Allocate and set up PAMTs for each TDMR.
 	 *  - Designate reserved areas for each TDMR.
 	 *
-- 
2.39.2



* [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (8 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 09/16] x86/virt/tdx: Fill out " Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-21  7:44   ` Dong, Eddie
  2023-03-06 14:13 ` [PATCH v10 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred to as the
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since a PAMT must be a physically contiguous
area and may be large (~1/256th of the size of the given TDMR).  The
downside is that alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot so that the
PAMTs are allocated early, but the only real fix is to add a boot option
to allocate or reserve PAMTs during kernel boot.

It is imperfect but will be improved on later.
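
To make the "~1/256th" estimate concrete, the sketch below mirrors the
arithmetic of tdmr_get_pamt_sz() in plain userspace C.  It is
illustrative only: the helper names are hypothetical, and the 16-byte
entry size used in the note after it is an assumption chosen to match
the 1/256 ratio.  The kernel reads the real value from the TDX module
(sysinfo->pamt_entry_size).

```c
#include <stdint.h>

#define SZ_4K	(1ULL << 12)
#define SZ_2M	(1ULL << 21)
#define SZ_1G	(1ULL << 30)

/* Round x up to the next multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* PAMT bytes needed to track tdmr_size bytes at one page size. */
static uint64_t pamt_bytes(uint64_t tdmr_size, uint64_t page_size,
			   uint64_t entry_size)
{
	uint64_t nr_entries = tdmr_size / page_size;

	/* Each PAMT is rounded up to a 4K boundary. */
	return align_up(nr_entries * entry_size, SZ_4K);
}

/* Sum of the 4K, 2M and 1G PAMTs of one TDMR. */
static uint64_t pamt_total_bytes(uint64_t tdmr_size, uint64_t entry_size)
{
	return pamt_bytes(tdmr_size, SZ_4K, entry_size) +
	       pamt_bytes(tdmr_size, SZ_2M, entry_size) +
	       pamt_bytes(tdmr_size, SZ_1G, entry_size);
}
```

Under that assumed entry size, a 1 GiB TDMR needs 4 MiB + 8 KiB + 4 KiB
of PAMT, so the 4K PAMT accounts for nearly all of the allocation.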

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.
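
The first policy can be sketched as follows.  This is a userspace
illustration, not the kernel code: 'struct pamt_layout' and pamt_carve()
are hypothetical names, and the chunk base stands in for whatever
physical address the contiguous allocation returns.

```c
#include <stdint.h>

enum { PS_4K, PS_2M, PS_1G, PS_NR };

struct pamt_layout {
	uint64_t base[PS_NR];
	uint64_t size[PS_NR];
};

/*
 * Split one contiguous chunk starting at chunk_base into three
 * back-to-back PAMTs, in 4K, 2M, 1G order.  Returns the end address
 * of the chunk.
 */
static uint64_t pamt_carve(struct pamt_layout *l, uint64_t chunk_base,
			   const uint64_t size[PS_NR])
{
	uint64_t next = chunk_base;
	int ps;

	for (ps = PS_4K; ps < PS_NR; ps++) {
		l->base[ps] = next;
		l->size[ps] = size[ps];
		next += size[ps];
	}

	return next;
}
```

Because the three PAMTs are adjacent, the whole chunk overlaps a TDMR as
a single range and therefore costs at most one reserved area per TDMR it
touches, rather than up to three.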

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.  This helps answer the eternal "where did
all my memory go?" questions.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v9 -> v10:
 - Removed code change in disable_tdx_module() as it doesn't exist
   anymore.

v8 -> v9:
 - Added TDX_PS_NR macro instead of open-coding (Dave).
 - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
 - Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
 - Added Dave's Reviewed-by.

v7 -> v8: (Dave)
 - Changelog:
  - Added a sentence to state PAMT allocation will be improved.
  - Others suggested by Dave.
 - Moved 'nid' of 'struct tdx_memblock' to this patch.
 - Improved comments around tdmr_get_nid().
 - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Changes due to using macros instead of 'enum' for TDX supported page
   sizes.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
 - Improved comment around tdmr_get_nid() (Dave).
 - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
   into PAMTs for 4K/2M/1G (Dave).
 - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).

v3 -> v5 (no feedback on v4):
 - Used memblock to get the NUMA node for given TDMR.
 - Removed tdmr_get_pamt_sz() helper but use open-code instead.
 - Changed to use 'switch .. case..' for each TDX supported page size in
   tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
 - Added printing out memory used for PAMT allocation when TDX module is
   initialized successfully.
 - Explained downside of alloc_contig_pages() in changelog.
 - Addressed other minor comments.

---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 216 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   1 +
 3 files changed, 213 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f23bc540778a..2a4d4097c5e6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
 	depends on KVM_INTEL
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
+	depends on CONTIG_ALLOC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e2487d872bbd..8f66cab1902e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -388,7 +388,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
  * overlap.
  */
 static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
-			    unsigned long end_pfn)
+			    unsigned long end_pfn, int nid)
 {
 	struct tdx_memblock *tmb;
 
@@ -399,6 +399,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
 	INIT_LIST_HEAD(&tmb->list);
 	tmb->start_pfn = start_pfn;
 	tmb->end_pfn = end_pfn;
+	tmb->nid = nid;
 
 	/* @tmb_list is protected by mem_hotplug_lock */
 	list_add_tail(&tmb->list, tmb_list);
@@ -426,9 +427,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
 static int build_tdx_memlist(struct list_head *tmb_list)
 {
 	unsigned long start_pfn, end_pfn;
-	int i, ret;
+	int i, nid, ret;
 
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
 		/*
 		 * The first 1MB is not reported as TDX convertible memory.
 		 * Although the first 1MB is always reserved and won't end up
@@ -444,7 +445,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 		 * memblock has already guaranteed they are in address
 		 * ascending order and don't overlap.
 		 */
-		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
 		if (ret)
 			goto err;
 	}
@@ -595,6 +596,202 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
 	return 0;
 }
 
+/*
+ * Calculate PAMT size given a TDMR and a page size.  The returned
+ * PAMT size is always aligned up to 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
+				      u16 pamt_entry_size)
+{
+	unsigned long pamt_sz, nr_pamt_entries;
+
+	switch (pgsz) {
+	case TDX_PS_4K:
+		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+		break;
+	case TDX_PS_2M:
+		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
+		break;
+	case TDX_PS_1G:
+		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	pamt_sz = nr_pamt_entries * pamt_entry_size;
+	/* TDX requires PAMT size must be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/*
+ * Locate a NUMA node which should hold the allocation of the @tdmr
+ * PAMT.  This node will have some memory covered by the TDMR.  The
+ * relative amount of memory covered is not considered.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list)
+{
+	struct tdx_memblock *tmb;
+
+	/*
+	 * A TDMR must cover at least part of one TMB.  That TMB will end
+	 * after the TDMR begins.  But, that TMB may have started before
+	 * the TDMR.  Find the next 'tmb' that _ends_ after this TDMR
+	 * begins.  Ignore 'tmb' start addresses.  They are irrelevant.
+	 */
+	list_for_each_entry(tmb, tmb_list, list) {
+		if (tmb->end_pfn > PHYS_PFN(tdmr->base))
+			return tmb->nid;
+	}
+
+	/*
+	 * Fall back to allocating the TDMR's metadata from node 0 when
+	 * no TDX memory block can be found.  This should never happen
+	 * since TDMRs originate from TDX memory blocks.
+	 */
+	pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n",
+			tdmr->base, tdmr_end(tdmr));
+	return 0;
+}
+
+#define TDX_PS_NR	(TDX_PS_1G + 1)
+
+/*
+ * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
+ * within @tdmr, and set up PAMTs for @tdmr.
+ */
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
+			    struct list_head *tmb_list,
+			    u16 pamt_entry_size)
+{
+	unsigned long pamt_base[TDX_PS_NR];
+	unsigned long pamt_size[TDX_PS_NR];
+	unsigned long tdmr_pamt_base;
+	unsigned long tdmr_pamt_size;
+	struct page *pamt;
+	int pgsz, nid;
+
+	nid = tdmr_get_nid(tdmr, tmb_list);
+
+	/*
+	 * Calculate the PAMT size for each TDX supported page size
+	 * and the total PAMT size.
+	 */
+	tdmr_pamt_size = 0;
+	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
+		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
+					pamt_entry_size);
+		tdmr_pamt_size += pamt_size[pgsz];
+	}
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
+			nid, &node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/*
+	 * Break the contiguous allocation back up into the
+	 * individual PAMTs for each page size.
+	 */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) {
+		pamt_base[pgsz] = tdmr_pamt_base;
+		tdmr_pamt_base += pamt_size[pgsz];
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
+	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
+	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+
+	return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
+			  unsigned long *pamt_npages)
+{
+	unsigned long pamt_base, pamt_sz;
+
+	/*
+	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
+	 * should always point to the beginning of that allocation.
+	 */
+	pamt_base = tdmr->pamt_4k_base;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	*pamt_pfn = PHYS_PFN(pamt_base);
+	*pamt_npages = pamt_sz >> PAGE_SHIFT;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_pfn, pamt_npages;
+
+	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_npages)
+		return;
+
+	if (WARN_ON_ONCE(!pamt_pfn))
+		return;
+
+	free_contig_range(pamt_pfn, pamt_npages);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
+		tdmr_free_pamt(tdmr_entry(tdmr_list, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
+				 struct list_head *tmb_list,
+				 u16 pamt_entry_size)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
+				pamt_entry_size);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_list);
+	return ret;
+}
+
+static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
+{
+	unsigned long pamt_npages = 0;
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long pfn, npages;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages);
+		pamt_npages += npages;
+	}
+
+	return pamt_npages;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -610,10 +807,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	if (ret)
 		return ret;
 
+	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
+			sysinfo->pamt_entry_size);
+	if (ret)
+		return ret;
 	/*
 	 * TODO:
 	 *
-	 *  - Allocate and set up PAMTs for each TDMR.
 	 *  - Designate reserved areas for each TDMR.
 	 *
 	 * Return -EINVAL until constructing TDMRs is done
@@ -670,7 +870,13 @@ static int init_tdx_module(void)
 	 *
 	 *  Return error before all steps are done.
 	 */
+
 	ret = -EINVAL;
+	if (ret)
+		tdmrs_free_pamt_all(&tdmr_list);
+	else
+		pr_info("%lu KBs allocated for PAMT.\n",
+				tdmrs_count_pamt_pages(&tdmr_list) * 4);
 out_free_tdmrs:
 	/*
 	 * Free the space for the TDMRs no matter the initialization is
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 3ad1e06be0f1..65fe34f21025 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -128,6 +128,7 @@ struct tdx_memblock {
 	struct list_head list;
 	unsigned long start_pfn;
 	unsigned long end_pfn;
+	int nid;
 };
 
 struct tdmr_info_list {
-- 
2.39.2


* [PATCH v10 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (9 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 12/16] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

As the last step of constructing TDMRs, populate reserved areas for all
TDMRs.  For each TDMR, add all memory holes within the TDMR to its
reserved areas.  For every PAMT that overlaps with the TDMR, add the
overlapping part to the reserved areas as well.
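
As a rough illustration of the hole-finding step, the userspace sketch
below walks a sorted list of memory regions and records the gaps that
fall inside one TDMR.  It is not the patch's code: 'struct range' and
find_holes() are hypothetical, and the boundary comparisons are written
independently of tdmr_populate_rsvd_holes().

```c
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
};

/*
 * Collect the parts of @tdmr that no memory region covers.  @regions
 * must be sorted by address and must not overlap.  Returns the number
 * of holes written to @holes, or -1 if more than @max_holes are needed.
 */
static int find_holes(struct range tdmr, const struct range *regions,
		      size_t nr_regions, struct range *holes, int max_holes)
{
	uint64_t prev_end = tdmr.start;
	int n = 0;
	size_t i;

	for (i = 0; i < nr_regions; i++) {
		uint64_t start = regions[i].start;
		uint64_t end = regions[i].end;

		if (start >= tdmr.end)
			break;		/* region lies after the TDMR */
		if (end <= tdmr.start)
			continue;	/* region lies before the TDMR */

		if (start > prev_end) {
			if (n == max_holes)
				return -1;
			holes[n].start = prev_end;
			holes[n].end = start;
			n++;
		}
		prev_end = end;
	}

	/* Gap between the last region and the end of the TDMR. */
	if (prev_end < tdmr.end) {
		if (n == max_holes)
			return -1;
		holes[n].start = prev_end;
		holes[n].end = tdmr.end;
		n++;
	}

	return n;
}
```

The -1 return corresponds to the "reserved areas exhausted" failure: the
number of reserved areas per TDMR is bounded, and holes and PAMT
overlaps draw from the same budget.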

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - No change.

v8 -> v9:
 - Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do
   optimization to save reserved areas. (Dave).

v7 -> v8: (Dave)
 - "set_up" -> "populate" in function name change (Dave).
 - Improved comment suggested by Dave.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - No change.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - Split tdmr_set_up_rsvd_areas() into two functions to handle memory
   hole and PAMT respectively.
 - Added Isaku's Reviewed-by.

---
 arch/x86/virt/vmx/tdx/tdx.c | 220 ++++++++++++++++++++++++++++++++++--
 1 file changed, 212 insertions(+), 8 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8f66cab1902e..99d2e8d939d3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -25,6 +25,7 @@
 #include <linux/sizes.h>
 #include <linux/pfn.h>
 #include <linux/align.h>
+#include <linux/sort.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/page.h>
@@ -792,6 +793,210 @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list)
 	return pamt_npages;
 }
 
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
+			      u64 size, u16 max_reserved_per_tdmr)
+{
+	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+	int idx = *p_idx;
+
+	/* Reserved area must be 4K aligned in offset and size */
+	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+		return -EINVAL;
+
+	if (idx >= max_reserved_per_tdmr) {
+		pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
+				tdmr->base, tdmr_end(tdmr));
+		return -ENOSPC;
+	}
+
+	/*
+	 * Consume one reserved area per call.  Make no effort to
+	 * optimize or reduce the number of reserved areas which are
+	 * consumed by contiguous reserved areas, for instance.
+	 */
+	rsvd_areas[idx].offset = addr - tdmr->base;
+	rsvd_areas[idx].size = size;
+
+	*p_idx = idx + 1;
+
+	return 0;
+}
+
+/*
+ * Go through @tmb_list to find holes between memory areas.  If any of
+ * those holes fall within @tdmr, set up a TDMR reserved area to cover
+ * the hole.
+ */
+static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
+				    struct tdmr_info *tdmr,
+				    int *rsvd_idx,
+				    u16 max_reserved_per_tdmr)
+{
+	struct tdx_memblock *tmb;
+	u64 prev_end;
+	int ret;
+
+	/*
+	 * Start looking for reserved blocks at the
+	 * beginning of the TDMR.
+	 */
+	prev_end = tdmr->base;
+	list_for_each_entry(tmb, tmb_list, list) {
+		u64 start, end;
+
+		start = PFN_PHYS(tmb->start_pfn);
+		end   = PFN_PHYS(tmb->end_pfn);
+
+		/* Break if this region is after the TDMR */
+		if (start >= tdmr_end(tdmr))
+			break;
+
+		/* Exclude regions before this TDMR */
+		if (end < tdmr->base)
+			continue;
+
+		/*
+		 * Skip over memory areas that
+		 * have already been dealt with.
+		 */
+		if (start <= prev_end) {
+			prev_end = end;
+			continue;
+		}
+
+		/* Add the hole before this region */
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+				start - prev_end,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+
+		prev_end = end;
+	}
+
+	/* Add the hole after the last region if it exists. */
+	if (prev_end < tdmr_end(tdmr)) {
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+				tdmr_end(tdmr) - prev_end,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Go through @tdmr_list to find all PAMTs.  If any of those PAMTs
+ * overlaps with @tdmr, set up a TDMR reserved area to cover the
+ * overlapping part.
+ */
+static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list,
+				    struct tdmr_info *tdmr,
+				    int *rsvd_idx,
+				    u16 max_reserved_per_tdmr)
+{
+	int i, ret;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		struct tdmr_info *tmp = tdmr_entry(tdmr_list, i);
+		unsigned long pamt_start_pfn, pamt_npages;
+		u64 pamt_start, pamt_end;
+
+		tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages);
+		/* Each TDMR must already have PAMT allocated */
+		WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn);
+
+		pamt_start = PFN_PHYS(pamt_start_pfn);
+		pamt_end   = PFN_PHYS(pamt_start_pfn + pamt_npages);
+
+		/* Skip PAMTs outside of the given TDMR */
+		if ((pamt_end <= tdmr->base) ||
+				(pamt_start >= tdmr_end(tdmr)))
+			continue;
+
+		/* Only mark the part within the TDMR as reserved */
+		if (pamt_start < tdmr->base)
+			pamt_start = tdmr->base;
+		if (pamt_end > tdmr_end(tdmr))
+			pamt_end = tdmr_end(tdmr);
+
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start,
+				pamt_end - pamt_start,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+	if (r1->offset + r1->size <= r2->offset)
+		return -1;
+	if (r1->offset >= r2->offset + r2->size)
+		return 1;
+
+	/* Reserved areas cannot overlap.  The caller must guarantee. */
+	WARN_ON_ONCE(1);
+	return -1;
+}
+
+/*
+ * Populate reserved areas for the given @tdmr, including memory holes
+ * (via @tmb_list) and PAMTs (via @tdmr_list).
+ */
+static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
+				    struct list_head *tmb_list,
+				    struct tdmr_info_list *tdmr_list,
+				    u16 max_reserved_per_tdmr)
+{
+	int ret, rsvd_idx = 0;
+
+	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
+			max_reserved_per_tdmr);
+	if (ret)
+		return ret;
+
+	ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx,
+			max_reserved_per_tdmr);
+	if (ret)
+		return ret;
+
+	/* TDX requires reserved areas listed in address ascending order */
+	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+			rsvd_area_cmp_func, NULL);
+
+	return 0;
+}
+
+/*
+ * Populate reserved areas for all TDMRs in @tdmr_list, including memory
+ * holes (via @tmb_list) and PAMTs.
+ */
+static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
+					 struct list_head *tmb_list,
+					 u16 max_reserved_per_tdmr)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		int ret;
+
+		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
+				tmb_list, tdmr_list, max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -811,14 +1016,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			sysinfo->pamt_entry_size);
 	if (ret)
 		return ret;
-	/*
-	 * TODO:
-	 *
-	 *  - Designate reserved areas for each TDMR.
-	 *
-	 * Return -EINVAL until constructing TDMRs is done
-	 */
-	return -EINVAL;
+
+	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
+			sysinfo->max_reserved_per_tdmr);
+	if (ret)
+		tdmrs_free_pamt_all(tdmr_list);
+
+	return ret;
 }
 
 static int init_tdx_module(void)
-- 
2.39.2


* [PATCH v10 12/16] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (10 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

The TDX module uses a private KeyID as the "global KeyID" for mapping
things like the PAMT and other TDX metadata.  This KeyID has already
been reserved when detecting TDX during early kernel boot.

After the list of "TD Memory Regions" (TDMRs) has been constructed to
cover all TDX-usable memory regions, the next step is to pass them to
the TDX module together with the global KeyID.
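
The sizing rule that config_tdx_module() in the diff below applies to
the physical address array can be sketched in userspace C.  The helper
names here are hypothetical, and the 512-byte constant simply mirrors
TDMR_INFO_PA_ARRAY_ALIGNMENT from the patch.  The rounding presumably
relies on kmalloc() returning naturally aligned memory for power-of-two
sizes, which is how the array ends up suitably aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define PA_ARRAY_ALIGNMENT	512

/* Smallest power of two that is >= x (for x >= 1). */
static size_t roundup_pow2(size_t x)
{
	size_t p = 1;

	while (p < x)
		p <<= 1;

	return p;
}

/* Bytes to allocate for an array of nr_tdmrs 64-bit physical addresses. */
static size_t pa_array_size(size_t nr_tdmrs)
{
	size_t sz = roundup_pow2(nr_tdmrs * sizeof(uint64_t));

	/* Never allocate less than the required alignment. */
	return sz < PA_ARRAY_ALIGNMENT ? PA_ARRAY_ALIGNMENT : sz;
}
```

Up to 64 TDMRs fit in the 512-byte minimum; a 65th entry pushes the
allocation to 1024 bytes.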

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.

v8 -> v9:
 - Removed 'i.e.' in changelog and removed the passive voice (Dave).
 - Improved changelog (Dave).
 - Changed the way to allocate aligned PA array (Dave).
 - Moved reserving the TDX global KeyID to the second patch, and also
   changed 'tdx_keyid_start' and 'nr_tdx_keyid' to guest's KeyIDs in
   that patch (Dave).

v7 -> v8:
 - Merged "Reserve TDX module global KeyID" patch to this patch, and
   removed 'tdx_global_keyid' but use 'tdx_keyid_start' directly.
 - Changed changelog accordingly.
 - Changed how to allocate aligned array (Dave).

---
 arch/x86/virt/vmx/tdx/tdx.c | 41 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 99d2e8d939d3..28562cf88414 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -26,6 +26,7 @@
 #include <linux/pfn.h>
 #include <linux/align.h>
 #include <linux/sort.h>
+#include <linux/log2.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/page.h>
@@ -1025,6 +1026,39 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	return ret;
 }
 
+static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
+{
+	u64 *tdmr_pa_array;
+	size_t array_sz;
+	int i, ret;
+
+	/*
+	 * TDMRs are passed to the TDX module via an array of physical
+	 * addresses of each TDMR.  The array itself also has certain
+	 * alignment requirement.
+	 */
+	array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64);
+	array_sz = roundup_pow_of_two(array_sz);
+	if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
+		array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;
+
+	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
+	if (!tdmr_pa_array)
+		return -ENOMEM;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
+		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
+
+	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
+				tdmr_list->nr_consumed_tdmrs,
+				global_keyid, 0, NULL, NULL);
+
+	/* Free the array as it is not required anymore. */
+	kfree(tdmr_pa_array);
+
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
@@ -1065,10 +1099,14 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_tdmrs;
 
+	/* Pass the TDMRs and the global KeyID to the TDX module */
+	ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Configure the TDMRs and the global KeyID to the TDX module.
 	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
 	 *
@@ -1076,6 +1114,7 @@ static int init_tdx_module(void)
 	 */
 
 	ret = -EINVAL;
+out_free_pamts:
 	if (ret)
 		tdmrs_free_pamt_all(&tdmr_list);
 	else
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 65fe34f21025..6cab15184af5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -19,6 +19,7 @@
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_INFO		32
+#define TDH_SYS_CONFIG		45
 
 struct cmr_info {
 	u64	base;
@@ -95,6 +96,7 @@ struct tdmr_reserved_area {
 } __packed;
 
 #define TDMR_INFO_ALIGNMENT	512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
 
 struct tdmr_info {
 	u64 base;
-- 
2.39.2


* [PATCH v10 13/16] x86/virt/tdx: Configure global KeyID on all packages
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (11 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 12/16] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:13 ` [PATCH v10 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

After the list of TDMRs and the global KeyID are configured to the TDX
module, the kernel needs to configure the key of the global KeyID on all
packages using TDH.SYS.KEY.CONFIG.

Loop over all online cpus, keep track of which packages have already
been configured, and skip the remaining cpus of those packages.  Use
smp_call_on_cpu() to run TDH.SYS.KEY.CONFIG on the first online cpu
found in each package.
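
The iteration pattern can be sketched in userspace C as follows.  This
is only an illustration: the cpu-to-package table, the 64-package limit
of the bitmask, and the recording callback are all assumptions made for
the example, whereas the kernel code uses a cpumask and
topology_physical_package_id().

```c
#include <stdint.h>

typedef int (*pkg_fn)(int cpu);

/*
 * Call fn once per package, on the first cpu found in that package.
 * cpu_pkg[cpu] gives the package id of each cpu (assumed to be < 64).
 * Stops at the first non-zero return value and passes it back.
 */
static int for_one_cpu_per_package(const int *cpu_pkg, int nr_cpus,
				   pkg_fn fn)
{
	uint64_t seen = 0;	/* bit n set: package n already handled */
	int cpu, ret = -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		uint64_t bit = 1ULL << cpu_pkg[cpu];

		if (seen & bit)
			continue;
		seen |= bit;

		/* Calls are made one at a time, never concurrently. */
		ret = fn(cpu);
		if (ret)
			break;
	}

	return ret;
}

/* Example callback: record which cpus were chosen. */
static int called[64];
static int nr_called;

static int record_cpu(int cpu)
{
	called[nr_called++] = cpu;
	return 0;
}
```

With a cpu-to-package table of {0, 0, 1, 1, 0, 2}, the callback runs on
cpus 0, 2 and 5.  A package with no entry in the table is never visited,
which is why every package needs at least one online cpu.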

To keep things simple, this implementation takes no affirmative steps to
online cpus to make sure there's at least one cpu for each package.  The
callers (aka. KVM) can ensure success by ensuring that.

Intel hardware doesn't guarantee cache coherency across different
KeyIDs.  The PAMTs are transitioning from being used by the kernel
mapping (KeyID 0) to the TDX module's "global KeyID" mapping.

This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
before the TDX module uses the global KeyID to access the PAMT.
Otherwise, if those dirty cachelines were written back, they would
corrupt the TDX module's metadata.  Aside: This corruption would be
detected by the memory integrity hardware on the next read of the memory
with the global KeyID.  The result would likely be fatal to the system
but would not impact TDX security.

Following the TDX module specification, flush cache before configuring
the global KeyID on all packages.  Given the PAMT size can be large
(~1/256th of system RAM), just use WBINVD on all CPUs to flush.

Note that if any TDH.SYS.KEY.CONFIG fails, the TDX module may already
have used the global KeyID to write to the PAMTs.  Therefore, use WBINVD
to flush the cache before freeing the PAMTs back to the kernel.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - Changed to use 'smp_call_on_cpu()' directly to do key configuration.

v8 -> v9:
 - Improved changelog (Dave).
 - Improved comments to explain the function to configure global KeyID
   "takes no affirmative action to online any cpu". (Dave).
 - Improved other comments suggested by Dave.

v7 -> v8: (Dave)
 - Changelog changes:
  - Point out this is the step of "multi-steps" of init_tdx_module().
  - Removed MOVDIR64B part.
  - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT.
 - Changed to loop over online cpus and use smp_call_function_single()
   directly as the patch to shut down TDX module has been removed.
 - Removed MOVDIR64B part in comment.

v6 -> v7:
 - Improved changelog and comment to explain why MOVDIR64B isn't used
   when returning PAMTs back to the kernel.


---
 arch/x86/virt/vmx/tdx/tdx.c | 80 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 28562cf88414..0a3b3374c5cb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1059,6 +1059,55 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
 	return ret;
 }
 
+static int do_global_key_config(void *data)
+{
+	/*
+	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a
+	 * recoverable error).  Assume this is exceedingly rare and
+	 * just return error if encountered instead of retrying.
+	 *
+	 * All '0's are just unused parameters.
+	 */
+	return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
+}
+
+/*
+ * Attempt to configure the global KeyID on all physical packages.
+ *
+ * This requires running code on at least one CPU in each package.  If a
+ * package has no online CPUs, that code will not run and TDX module
+ * initialization (TDMR initialization) will fail.
+ *
+ * This code takes no affirmative steps to online CPUs.  Callers (aka.
+ * KVM) can ensure success by ensuring sufficient CPUs are online for
+ * this to succeed.
+ */
+static int config_global_keyid(void)
+{
+	cpumask_var_t packages;
+	int cpu, ret = -EINVAL;
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+					packages))
+			continue;
+
+		/*
+		 * TDH.SYS.KEY.CONFIG cannot run concurrently on
+		 * different cpus, so just do it one by one.
+		 */
+		ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true);
+		if (ret)
+			break;
+	}
+
+	free_cpumask_var(packages);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
@@ -1104,10 +1153,24 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * Hardware doesn't guarantee cache coherency across different
+	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
+	 * (associated with KeyID 0) before the TDX module can use the
+	 * global KeyID to access the PAMT.  Given PAMTs are potentially
+	 * large (~1/256th of system RAM), just use WBINVD on all cpus
+	 * to flush the cache.
+	 */
+	wbinvd_on_all_cpus();
+
+	/* Config the key of global KeyID on all packages */
+	ret = config_global_keyid();
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
 	 *
 	 *  Return error before all steps are done.
@@ -1115,8 +1178,18 @@ static int init_tdx_module(void)
 
 	ret = -EINVAL;
 out_free_pamts:
-	if (ret)
+	if (ret) {
+		/*
+		 * Part of PAMT may already have been initialized by the
+		 * TDX module.  Flush cache before returning PAMT back
+		 * to the kernel.
+		 *
+		 * No need to worry about integrity checks here.  KeyID
+		 * 0 has integrity checking disabled.
+		 */
+		wbinvd_on_all_cpus();
 		tdmrs_free_pamt_all(&tdmr_list);
+	}
 	else
 		pr_info("%lu KBs allocated for PAMT.\n",
 				tdmrs_count_pamt_pages(&tdmr_list) * 4);
@@ -1168,6 +1241,9 @@ static int __tdx_enable(void)
  * lock to prevent any new cpu from becoming online; 2) done both VMXON
  * and tdx_cpu_enable() on all online cpus.
  *
+ * This function requires there's at least one online cpu for each CPU
+ * package to succeed.
+ *
  * This function can be called in parallel by multiple callers.
  *
  * Return 0 if TDX is enabled successfully, otherwise error.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 6cab15184af5..880e90dedb3f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -20,6 +20,7 @@
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_INFO		32
 #define TDH_SYS_CONFIG		45
+#define TDH_SYS_KEY_CONFIG	31
 
 struct cmr_info {
 	u64	base;
-- 
2.39.2



* [PATCH v10 14/16] x86/virt/tdx: Initialize all TDMRs
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (12 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2023-03-06 14:13 ` Kai Huang
  2023-03-06 14:14 ` [PATCH v10 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:13 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

After the global KeyID has been configured on all packages, initialize
all TDMRs to make all TDX-usable memory regions that are passed to the
TDX module become usable.

This is the last step of initializing the TDX module.

Initializing TDMRs can be time consuming on large memory systems as it
involves initializing all metadata entries for all pages that can be
used by TDX guests.  Initializing different TDMRs can be parallelized.
For now to keep it simple, just initialize all TDMRs one by one.  It can
be enhanced in the future.
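
The resumable shape of the per-TDMR loop can be illustrated with a small
userspace sketch.  This is only a model under stated assumptions: the
names fake_tdmr, fake_tdmr_init_step(), fake_init_tdmr() and FAKE_CHUNK
are hypothetical and not part of this series, and the 1GB step size is
made up; the real TDX module chooses how much work each call does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a TDMR: the range [base, base + size). */
struct fake_tdmr {
	uint64_t base;
	uint64_t size;
	uint64_t done;		/* simulated module-side progress, in bytes */
	unsigned int calls;	/* number of simulated calls made so far */
};

/* Assumed amount of work per call; the real granularity is up to the module. */
#define FAKE_CHUNK	(1ULL << 30)

/*
 * Simulated "initialize some more of this region" call.  Returns 0 on
 * success and stores the next-to-initialize address in *next.
 */
static int fake_tdmr_init_step(struct fake_tdmr *t, uint64_t *next)
{
	uint64_t left = t->size - t->done;

	t->done += left < FAKE_CHUNK ? left : FAKE_CHUNK;
	t->calls++;
	*next = t->base + t->done;
	return 0;
}

/* Same loop shape as init_tdmr(): repeat until 'next' reaches the end. */
static int fake_init_tdmr(struct fake_tdmr *t)
{
	uint64_t next;

	assert(t->size > 0);
	do {
		int ret = fake_tdmr_init_step(t, &next);

		if (ret)
			return ret;
		/* The kernel version calls cond_resched() between steps. */
	} while (next < t->base + t->size);

	return 0;
}
```

With a 2.5GB region and the assumed 1GB step, the model makes three
calls before 'next' reaches the end of the region.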

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.

v8 -> v9:
 - Improved changelog to explain why initializing TDMRs can take a long
   time (Dave).
 - Improved comments around 'next-to-initialize' address (Dave).

v7 -> v8: (Dave)
 - Changelog:
   - explicitly call out this is the last step of TDX module initialization.
   - Trimmed down changelog by removing SEAMCALL name and details.
 - Removed/trimmed down unnecessary comments.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Removed need_resched() check. -- Andi.

---
 arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++-----
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 0a3b3374c5cb..ee94a7327d93 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1108,6 +1108,56 @@ static int config_global_keyid(void)
 	return ret;
 }
 
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+	u64 next;
+
+	/*
+	 * Initializing a TDMR can be time consuming.  To avoid long
+	 * SEAMCALLs, the TDX module may only initialize a part of the
+	 * TDMR in each call.
+	 */
+	do {
+		struct tdx_module_output out;
+		int ret;
+
+		/* All 0's are unused parameters, they mean nothing. */
+		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
+				&out);
+		if (ret)
+			return ret;
+		/*
+		 * RDX contains 'next-to-initialize' address if
+		 * TDH.SYS.TDMR.INIT did not fully complete and
+		 * should be retried.
+		 */
+		next = out.rdx;
+		cond_resched();
+		/* Keep making SEAMCALLs until the TDMR is done */
+	} while (next < tdmr->base + tdmr->size);
+
+	return 0;
+}
+
+static int init_tdmrs(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	/*
+	 * This operation is costly.  It can be parallelized,
+	 * but keep it simple for now.
+	 */
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		int ret;
+
+		ret = init_tdmr(tdmr_entry(tdmr_list, i));
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
 	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
@@ -1168,15 +1218,9 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
-	/*
-	 * TODO:
-	 *
-	 *  - Initialize all TDMRs.
-	 *
-	 *  Return error before all steps are done.
-	 */
+	/* Initialize TDMRs to complete the TDX module initialization */
+	ret = init_tdmrs(&tdmr_list);
 
-	ret = -EINVAL;
 out_free_pamts:
 	if (ret) {
 		/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 880e90dedb3f..48f830087e7e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -21,6 +21,7 @@
 #define TDH_SYS_INFO		32
 #define TDH_SYS_CONFIG		45
 #define TDH_SYS_KEY_CONFIG	31
+#define TDH_SYS_TDMR_INIT	36
 
 struct cmr_info {
 	u64	base;
-- 
2.39.2



* [PATCH v10 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (13 preceding siblings ...)
  2023-03-06 14:13 ` [PATCH v10 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2023-03-06 14:14 ` Kai Huang
  2023-03-06 14:14 ` [PATCH v10 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:14 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages; 2) There might be dirty cachelines associated
with TDX private pages.

The first problem doesn't matter.  KeyID 0 doesn't have integrity check.
Even if the new kernel wants to use any non-zero KeyID, it needs to convert
the memory to that KeyID and such conversion would work from any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
   a. Stop making TDX module SEAMCALLs (TDX global KeyID);
   b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
support.  And in this way 1) happens for free as there's no TDX activity
between wbinvd() and the native_halt().

Theoretically, cache flush is only needed when the TDX module has been
initialized.  However initializing the TDX module is done on demand at
runtime, and it takes a mutex to read the module status.  Just check
whether TDX is enabled by the BIOS instead to flush cache.
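
The resulting flush policy can be written down as a pure predicate.
This is a sketch only: should_flush_cache() and SME_CPUID_BIT are
hypothetical names used for illustration, and the two inputs stand in
for cpuid_eax(0x8000001f) and platform_tdx_enabled().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of CPUID 0x8000001f EAX is the SME bit tested in stop_this_cpu(). */
#define SME_CPUID_BIT	(1u << 0)

/*
 * Hypothetical helper mirroring the condition in stop_this_cpu(): flush
 * when SME is possible or when the BIOS has enabled TDX, without asking
 * whether the TDX module was actually initialized.
 */
static bool should_flush_cache(uint32_t cpuid_8000001f_eax,
			       bool bios_tdx_enabled)
{
	return (cpuid_8000001f_eax & SME_CPUID_BIT) || bios_tdx_enabled;
}
```

The predicate is deliberately conservative: it may flush on a machine
where TDX was never used, which costs one extra WBINVD on the shutdown
path but needs no lock.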

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v9 -> v10:
 - No change.

v8 -> v9:
 - Various changelog enhancement and fix (Dave).
 - Improved comment (Dave).

v7 -> v8:
 - Changelog:
   - Removed "leave TDX module open" part due to shut down patch has been
     removed.

v6 -> v7:
 - Improved changelog to explain why don't convert TDX private pages back
   to normal.

---
 arch/x86/kernel/process.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 40d156a31676..5876dda412c7 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -765,8 +765,13 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * The TDX module or guests might have left dirty cachelines
+	 * behind.  Flush them to avoid corruption from later writeback.
+	 * Note that this flushes on all systems where TDX is possible,
+	 * but does not actually check that TDX was in use.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*
-- 
2.39.2



* [PATCH v10 16/16] Documentation/x86: Add documentation for TDX host support
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (14 preceding siblings ...)
  2023-03-06 14:14 ` [PATCH v10 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
@ 2023-03-06 14:14 ` Kai Huang
  2023-03-08  1:11 ` [PATCH v10 00/16] TDX host kernel support Isaku Yamahata
  2023-03-16 12:35 ` David Hildenbrand
  17 siblings, 0 replies; 48+ messages in thread
From: Kai Huang @ 2023-03-06 14:14 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, david, bagasdotme, sagis,
	imammedo, kai.huang

Add documentation for TDX host kernel support.  There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals.  Also reuse it for TDX host kernel support.

Introduce a new level menu "TDX Guest Support" and move existing
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 186 +++++++++++++++++++++++++++++++++++---
 1 file changed, 175 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index dc8d9fd2c3f7..a6f66a28bef4 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,170 @@ encrypting the guest memory. In TDX, a special module running in a special
 mode sits between the host and the guest and manages the guest/host
 separation.
 
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized.  The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot.  The dmesg line below shows that TDX is enabled by the BIOS::
+
+  [..] tdx: BIOS enabled: private KeyID range: [16, 64).
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect the TDX module.  The kernel detects it
+by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly 1/256th of system RAM, which
+is used as 'metadata' for the TDX memory.  It also takes additional CPU
+time to initialize that metadata along with the TDX module itself.
+Neither cost is trivial, so the kernel initializes the TDX module at
+runtime on demand.
+
+Besides initializing the TDX module, a per-cpu initialization SEAMCALL
+must be done on one cpu before any other SEAMCALLs can be made on that
+cpu.
+
+The kernel provides two functions, tdx_enable() and tdx_cpu_enable() to
+allow the user of TDX to enable the TDX module and enable TDX on local
+cpu.
+
+Making a SEAMCALL requires the CPU to already be in VMX operation (VMXON
+has been done).  For now neither tdx_enable() nor tdx_cpu_enable()
+handles VMXON internally; both depend on the caller to guarantee that.
+
+To enable TDX, the user of TDX should: 1) hold the CPU hotplug lock for
+read; 2) do VMXON and tdx_cpu_enable() on all online cpus successfully;
+3) call tdx_enable().  For example::
+
+        cpus_read_lock();
+        on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
+        ret = tdx_enable();
+        cpus_read_unlock();
+        if (ret)
+                goto no_tdx;
+        // TDX is ready to use
+
+The user of TDX must also guarantee that tdx_cpu_enable() has been
+successfully done on a cpu before running any other SEAMCALL on it.
+A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU
+hotplug online callback, and refuse to online if tdx_cpu_enable() fails.
+
+User can consult dmesg to see the presence of the TDX module, and whether
+it has been initialized.
+
+If the TDX module is not loaded, dmesg shows below::
+
+  [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+  [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+  [..] tdx: 262668 KBs allocated for PAMT.
+  [..] tdx: TDX module initialized.
+
+If the TDX module failed to initialize, dmesg also shows it failed to
+initialize::
+
+  [..] tdx: TDX module initialization failed ...
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Region" (CMR) to tell the
+kernel which memory is TDX compatible.  The kernel needs to build a list
+of memory regions (out of CMRs) as "TDX-usable" memory and pass those
+regions to the TDX module.  Once this is done, those "TDX-usable" memory
+regions are fixed during module's lifetime.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory.  Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and in the meantime, refuses to online any non-TDX-memory
+in the memory hotplug.
+
+This can be enhanced in the future, e.g. by allowing adding non-TDX
+memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Physical Memory Hotplug
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Note TDX assumes convertible memory is always physically present during
+machine's runtime.  A non-buggy BIOS should never support hot-removal of
+any convertible memory.  This implementation doesn't handle ACPI memory
+removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+The TDX module requires that the per-cpu initialization SEAMCALL
+(TDH.SYS.LP.INIT) be done on a cpu before any other SEAMCALLs can be made
+on that cpu, including those involved during the module initialization.
+
+The kernel provides tdx_cpu_enable() to let the user of TDX do it when
+the user wants to use a new cpu for a TDX task.
+
+TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
+physical CPU.  Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note TDX works with logical CPU online/offline, so the kernel still
+allows a logical CPU to be offlined and onlined again.
+
+Kexec()
+~~~~~~~
+
+There are two problems in terms of using kexec() to boot to a new kernel
+when the old kernel has enabled TDX: 1) Part of the memory pages are
+still TDX private pages; 2) There might be dirty cachelines associated
+with TDX private pages.
+
+The first problem doesn't matter.  KeyID 0 doesn't have integrity check.
+Even if the new kernel wants to use any non-zero KeyID, it needs to convert
+the memory to that KeyID and such conversion would work from any KeyID.
+
+However the old kernel needs to guarantee there's no dirty cacheline
+left behind before booting to the new kernel to avoid silent corruption
+from later cacheline writeback (Intel hardware doesn't guarantee cache
+coherency across different KeyIDs).
+
+Similar to AMD SME, the kernel just uses wbinvd() to flush cache before
+booting to the new kernel.
+
+TDX Guest Support
+=================
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
 implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +184,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +194,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +205,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +216,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +237,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +257,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
 value with a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -107,7 +271,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +291,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 An access to private mappings can also cause a #VE.  Since all kernel
 memory is also private memory, the kernel might theoretically need to
@@ -145,7 +309,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +331,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
 which is not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
 mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +353,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However, some kernel users like device
-- 
2.39.2



* Re: [PATCH v10 00/16] TDX host kernel support
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (15 preceding siblings ...)
  2023-03-06 14:14 ` [PATCH v10 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang
@ 2023-03-08  1:11 ` Isaku Yamahata
  2023-03-16 12:35 ` David Hildenbrand
  17 siblings, 0 replies; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-08  1:11 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, dave.hansen, peterz, tglx, seanjc,
	pbonzini, dan.j.williams, rafael.j.wysocki, kirill.shutemov,
	ying.huang, reinette.chatre, len.brown, tony.luck, ak,
	isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, david,
	bagasdotme, sagis, imammedo, isaku.yamahata

On Tue, Mar 07, 2023 at 03:13:45AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  TDX specs are available in [1].
> 
> This series is the initial support to enable TDX with minimal code to
> allow KVM to create and run TDX guests.  KVM support for TDX is being
> developed separately[2].  A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed[3].  The KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.
> 
> This series doesn't aim to support all functionalities, and doesn't aim
> to resolve all things perfectly.  For example, memory hotplug is handled
> in simple way (please refer to "Kernel policy on TDX memory" and "Memory
> hotplug" sections below).
> 
> (For memory hotplug, sorry for broadcasting widely but I cc'ed the
> linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> help to provide comments.)
> 
> And TDX module metadata allocation just uses alloc_contig_pages() to
> allocate large chunk at runtime, thus it can fail.  It is imperfect now
> but _will_ be improved in the future.
> 
> Also, the patch to add the new kernel command line tdx="force" isn't included
> in this initial version, as Dave suggested it isn't mandatory.  But I
> _will_ add one once this initial version gets merged.
> 
> All other optimizations will be posted as follow-up once this initial
> TDX support is upstreamed.
> 
> Hi Dave, Peter, Thomas, Dan (and Intel reviewers),
> 
> The environment to test the new LP.INIT SEAMCALL behaviour hasn't been
> done yet, thus I haven't tested the new behaviour.  Instead, I tested
> with all cpus are online when initializing the TDX module.  CPU hotplug
> path isn't really tested although I did some basic test that I can
> offline some cpus after module initialization, online them again and the
> LP.INIT was skipped successfully for them.
> 
> However I believe there should be no issue when the new module is ready.
> I will test and report back when the new module is ready.
> 
> I would appreciate if folks could review this presumptive series anyway.
>    
> And I would appreciate reviewed-by or acked-by tags if the patches look
> good to you.
> 
> ----- Changelog history: ------
> 
> - v9 -> v10:
> 
>  - Changed the per-cpu initialization handling
>    - Gave up "ensuring all online cpus are TDX-runnable when TDX module
>      is initialized", but just provide two basic functions, tdx_enable()
>      and tdx_cpu_enable(), to let the user of TDX to make sure the
>      tdx_cpu_enable() has been done successfully when the user wants to
>      use particular cpu for TDX.
>    - Thus, moved per-cpu initialization out of tdx_enable().  Now
>      tdx_enable() just assumes VMXON and tdx_cpu_enable() has been done
>      on all online cpus before calling it.
>    - Merged the tdx_enable() skeleton patch and per-cpu initialization
>      patch together to tell better story.
>    - Moved "SEAMCALL infrastructure" patch before the tdx_enable() patch.
> 
>  v9: https://lore.kernel.org/lkml/cover.1676286526.git.kai.huang@intel.com/
> 
> - v8 -> v9:
> 
>  - Added patches to handle TDH.SYS.INIT and TDH.SYS.LP.INIT back.
>  - For other changes please refer to the changelog history in individual patches.

I've rebased my TDX KVM patches onto this patch series and updated the
initialization.  I tested with all LPs online with the existing TDX module,
and did cpu online/offline while a TD was running.

Tested-by: Isaku Yamahata <isaku.yamahata@intel.com>
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-06 14:13 ` [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
@ 2023-03-08 22:27   ` Isaku Yamahata
  2023-03-12 23:08     ` Huang, Kai
  2023-03-16  0:31   ` Isaku Yamahata
  1 sibling, 1 reply; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-08 22:27 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, dave.hansen, peterz, tglx, seanjc,
	pbonzini, dan.j.williams, rafael.j.wysocki, kirill.shutemov,
	ying.huang, reinette.chatre, len.brown, tony.luck, ak,
	isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, david,
	bagasdotme, sagis, imammedo, isaku.yamahata

On Tue, Mar 07, 2023 at 03:13:50AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> To enable TDX the kernel needs to initialize TDX from two perspectives:
> 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> on one logical cpu before the kernel wants to make any other SEAMCALLs
> on that cpu (including those involved during module initialization and
> running TDX guests).
> 
> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> "on demand" approach to initialize TDX until there is a real need (e.g
> when requested by KVM).  This approach has below pros:
> 
> 1) It avoids consuming the memory that must be allocated by kernel and
> given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> and also saves the CPU cycles of initializing the TDX module (and the
> metadata) when TDX is not used at all.
> 
> 2) The TDX module design allows it to be updated while the system is
> running.  The update procedure shares quite a few steps with this "on
> demand" initialization mechanism.  The hope is that much of "on demand"
> mechanism can be shared with a future "update" mechanism.  A boot-time
> TDX module implementation would not be able to share much code with the
> update mechanism.
> 
> 3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
> code mucks with VMX enabling.  If the TDX module were to be initialized
> separately from KVM (like at boot), the boot code would need to be
> taught how to muck with VMX enabling and KVM would need to be taught how
> to cope with that.  Making KVM itself responsible for TDX initialization
> lets the rest of the kernel stay blissfully unaware of VMX.
> 
> Similar to module initialization, also make the per-cpu initialization
> "on demand" as it also depends on VMX to be enabled.
> 
> Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> module and enable TDX on local cpu respectively.  For now tdx_enable()
> is a placeholder.  The TODO list will be pared down as functionality is
> added.
> 
> In tdx_enable() use a state machine protected by mutex to make sure the
> initialization will only be done once, as tdx_enable() can be called
> multiple times (i.e. KVM module can be reloaded) and may be called
> concurrently by other kernel components in the future.
> 
> The per-cpu initialization on each cpu can only be done once during the
> module's life time.  Use a per-cpu variable to track its status to make
> sure it is only done once in tdx_cpu_enable().
> 
> Also, a SEAMCALL to do TDX module global initialization must be done
> once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
> it inside tdx_cpu_enable() too (if hasn't been done).
> 
> tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
> per-cpu initialization must be done before those SEAMCALLs are invoked
> on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
> per-cpu initialization regardless of whether the TDX module has been
> initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
> but assume the caller has disabled CPU hotplug and done VMXON and
> tdx_cpu_enable() on all online cpus before calling tdx_enable().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v9 -> v10:
>  - Merged the patch to handle per-cpu initialization to this patch to
>    tell the story better.
>  - Changed how to handle the per-cpu initialization to only provide a
>    tdx_cpu_enable() function to let the user of TDX to do it when the
>    user wants to run TDX code on a certain cpu.
>  - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
>    call lockdep_assert_cpus_held() to assume the caller has done that.
>  - Improved comments around tdx_enable() and tdx_cpu_enable().
>  - Improved changelog to tell the story better accordingly.
> 
> v8 -> v9:
>  - Removed detailed TODO list in the changelog (Dave).
>  - Added back steps to do module global initialization and per-cpu
>    initialization in the TODO list comment.
>  - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> 
> v7 -> v8:
>  - Refined changelog (Dave).
>  - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
>  - Add a "TODO list" comment in init_tdx_module() to list all steps of
>    initializing the TDX Module to tell the story (Dave).
>  - Made tdx_enable() universally return -EINVAL, and removed nonsense
>    comments (Dave).
>  - Simplified __tdx_enable() to only handle success or failure.
>  - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
>  - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
>  - Improved comments (Dave).
>  - Pointed out 'tdx_module_status' is software thing (Dave).
> 
> v6 -> v7:
>  - No change.
> 
> v5 -> v6:
>  - Added code to set status to TDX_MODULE_NONE if TDX module is not
>    loaded (Chao)
>  - Added Chao's Reviewed-by.
>  - Improved comments around cpus_read_lock().
> 
> - v3->v5 (no feedback on v4):
>  - Removed the check that SEAMRR and TDX KeyID have been detected on
>    all present cpus.
>  - Removed tdx_detect().
>  - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
>    hotplug lock and return early with error message.
>  - Improved dmesg printing for TDX module detection and initialization.
> 
> ---
>  arch/x86/include/asm/tdx.h  |   4 +
>  arch/x86/virt/vmx/tdx/tdx.c | 182 ++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h |  25 +++++
>  3 files changed, 211 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b489b5b9de5d..112a5b9bd5cd 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -102,8 +102,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -EINVAL; }
> +static inline int tdx_enable(void)  { return -EINVAL; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b65b838f3b5d..29127cb70f51 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
> @@ -22,6 +26,18 @@ static u32 tdx_global_keyid __ro_after_init;
>  static u32 tdx_guest_keyid_start __ro_after_init;
>  static u32 tdx_nr_guest_keyids __ro_after_init;
>  
> +static unsigned int tdx_global_init_status;
> +static DEFINE_SPINLOCK(tdx_global_init_lock);
> +#define TDX_GLOBAL_INIT_DONE	_BITUL(0)
> +#define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
> +
> +static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
> +#define TDX_LP_INIT_DONE	_BITUL(0)
> +#define TDX_LP_INIT_FAILED	_BITUL(1)
> +
> +static enum tdx_module_status_t tdx_module_status;
> +static DEFINE_MUTEX(tdx_module_lock);
> +
>  /*
>   * Use tdx_global_keyid to indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -159,3 +175,169 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	put_cpu();
>  	return ret;
>  }
> +
> +static int try_init_module_global(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	spin_lock(&tdx_global_init_lock);
> +
> +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> +			-EINVAL : 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */
> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +
> +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> +	if (ret)
> +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;

If entropy is lacking (RDRAND failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
In such a case, we should allow the caller to retry, or make this function
retry, instead of making the failure sticky.
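
One way the retry suggested above might look is sketched below, as plain
userspace C.  Everything here is an assumption for illustration:
fake_sys_init() stands in for the TDH_SYS_INIT SEAMCALL, the SKETCH_* values
are placeholders rather than real TDX status codes, the retry bound is
arbitrary, and locking is omitted for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Placeholder status codes; the real values are defined by the TDX module. */
#define SKETCH_SUCCESS		0ULL
#define SKETCH_SYS_BUSY		1ULL
#define SKETCH_MAX_RETRIES	10	/* arbitrary bound for this sketch */

#define INIT_DONE	(1U << 0)
#define INIT_FAILED	(1U << 1)

static unsigned int global_init_status;
static int busy_left = 3;	/* how many times the fake call reports busy */

/* Stand-in for seamcall(TDH_SYS_INIT, ...): busy a few times, then success. */
static uint64_t fake_sys_init(void)
{
	return busy_left-- > 0 ? SKETCH_SYS_BUSY : SKETCH_SUCCESS;
}

int try_init_global_with_retry(void)
{
	uint64_t err;
	int retries = SKETCH_MAX_RETRIES;

	/* Invariant: the FAILED bit is only ever set together with DONE. */
	assert(!(global_init_status & INIT_FAILED) ||
	       (global_init_status & INIT_DONE));

	if (global_init_status & INIT_DONE)
		return global_init_status & INIT_FAILED ? -EINVAL : 0;

	/* Retry only the transient "busy" result, a bounded number of times. */
	do {
		err = fake_sys_init();
	} while (err == SKETCH_SYS_BUSY && retries-- > 0);

	/* Still busy: report it, but record nothing so a later call can retry. */
	if (err == SKETCH_SYS_BUSY)
		return -EBUSY;

	/* Any other result is final and recorded stickily, as in the patch. */
	global_init_status = INIT_DONE;
	if (err != SKETCH_SUCCESS)
		global_init_status |= INIT_FAILED;

	return err == SKETCH_SUCCESS ? 0 : -EINVAL;
}
```

The design point is only that a transient result is kept out of the sticky
status word; whether to retry inside the function or in the caller is a
separate choice.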

Other than that,
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>

Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-03-06 14:13 ` [PATCH v10 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
@ 2023-03-09  1:38   ` Isaku Yamahata
  0 siblings, 0 replies; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-09  1:38 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, dave.hansen, peterz, tglx, seanjc,
	pbonzini, dan.j.williams, rafael.j.wysocki, kirill.shutemov,
	ying.huang, reinette.chatre, len.brown, tony.luck, ak,
	isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, david,
	bagasdotme, sagis, imammedo, isaku.yamahata

On Tue, Mar 07, 2023 at 03:13:52AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
> 
> TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> kernel which memory is TDX compatible.  The kernel needs to build a list
> of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> the TDX module.  Once this is done, those "TDX-usable" memory regions
> are fixed during module's lifetime.
> 
> To keep things simple, assume that all TDX-protected memory will come
> from the page allocator.  Make sure all pages in the page allocator
> *are* TDX-usable memory.
> 
> As TDX-usable memory is a fixed configuration, take a snapshot of the
> memory configuration from memblocks at the time of module initialization
> (memblocks are modified on memory hotplug).  This snapshot is used to
> enable TDX support for *this* memory configuration only.  Use a memory
> hotplug notifier to ensure that no other RAM can be added outside of
> this configuration.
> 
> This approach requires all memblock memory regions at the time of module
> initialization to be TDX convertible memory to work, otherwise module
> initialization will fail in a later SEAMCALL when passing those regions
> to the module.  This approach works when all boot-time "system RAM" is
> TDX convertible memory, and no non-TDX-convertible memory is hot-added
> to the core-mm before module initialization.
> 
> For instance, on the first generation of TDX machines, both CXL memory
> and NVDIMM are not TDX convertible memory.  Using kmem driver to hot-add
> any CXL memory or NVDIMM to the core-mm before module initialization
> will result in failure to initialize the module.  The SEAMCALL error
> code will be available in the dmesg to help user to understand the
> failure.
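
The snapshot-and-reject policy described above can be sketched in a few lines
of userspace C.  This is only an illustration under stated assumptions:
struct pfn_range, range_in_snapshot() and may_online() are hypothetical names,
and a plain array stands in for the linked list used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical array-based stand-in for the @tdx_memlist linked list. */
struct pfn_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
};

/* True only if [start_pfn, end_pfn) lies entirely inside one snapshot entry. */
bool range_in_snapshot(const struct pfn_range *snap, size_t n,
		       unsigned long start_pfn, unsigned long end_pfn)
{
	size_t i;

	assert(start_pfn < end_pfn);

	for (i = 0; i < n; i++) {
		if (start_pfn >= snap[i].start_pfn &&
		    end_pfn <= snap[i].end_pfn)
			return true;
	}
	return false;
}

/* Mirrors the notifier policy: an empty snapshot means "not enabled". */
bool may_online(const struct pfn_range *snap, size_t n,
		unsigned long start_pfn, unsigned long nr_pages)
{
	if (n == 0)
		return true;	/* no snapshot taken yet: allow anything */

	return range_in_snapshot(snap, n, start_pfn, start_pfn + nr_pages);
}
```

A range that straddles the end of a snapshot entry is rejected in this sketch,
matching the assumption that a single online event never spans multiple
entries.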
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> ---
> 
> v9 -> v10:
>  - Moved empty @tdx_memlist check out of is_tdx_memory() to make the
>    logic better.
>  - Added Ying's Reviewed-by.
> 
> v8 -> v9:
>  - Replace "The initial support ..." with timeless sentence in both
>    changelog and comments(Dave).
>  - Fix run-on sentence in changelog, and the sentence explaining why to
>    stash off memblock (Dave).
>  - Tried to better explain why this approach was chosen and how it works
>    in the changelog, based on Dave's suggestion.
>  - Many other comments enhancement (Dave).
> 
> v7 -> v8:
>  - Trimmed down changelog (Dave).
>  - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
>    (Ying).
>  - Moved memory hotplug handling from add_arch_memory() to
>    memory_notifier (Dan/David).
>  - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
>  - {build|free}_tdx_memory() -> {build|free}_tdx_memlist() (Dave).
>  - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
>  - Improve the comment around first 1MB (Dave).
>  - Added a comment around reserve_real_mode() to point out TDX code
>    relies on first 1MB being reserved (Ying).
>  - Added comment to explain why the new online memory range cannot
>    cross multiple TDX memory blocks (Dave).
>  - Improved other comments (Dave).
> 
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/kernel/setup.c     |   2 +
>  arch/x86/virt/vmx/tdx/tdx.c | 165 +++++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h |   6 ++
>  4 files changed, 172 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6dd5d5586099..f23bc540778a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 88188549647c..a8a119a9b48c 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1165,6 +1165,8 @@ void __init setup_arch(char **cmdline_p)
>  	 *
>  	 * Moreover, on machines with SandyBridge graphics or in setups that use
>  	 * crashkernel the entire 1M is reserved anyway.
> +	 *
> +	 * Note the host kernel TDX also requires the first 1MB being reserved.
>  	 */
>  	x86_platform.realmode_reserve();
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 981e11492d0e..9149144cd7e7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -17,6 +17,13 @@
>  #include <linux/spinlock.h>
>  #include <linux/percpu-defs.h>
>  #include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/memory.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> +#include <linux/pfn.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/page.h>
> @@ -39,6 +46,9 @@ static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
>  static enum tdx_module_status_t tdx_module_status;
>  static DEFINE_MUTEX(tdx_module_lock);
>  
> +/* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * Use tdx_global_keyid to indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -77,6 +87,54 @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
>  	return 0;
>  }
>  
> +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/*
> +	 * This check assumes that the start_pfn<->end_pfn range does not
> +	 * cross multiple @tdx_memlist entries.  A single memory online
> +	 * event across multiple memblocks (from which @tdx_memlist
> +	 * entries are derived at the time of module initialization) is
> +	 * not possible.  This is because memory offline/online is done
> +	 * on granularity of 'struct memory_block', and the hotpluggable
> +	 * memory region (one memblock) must be multiple of memory_block.
> +	 */
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> +			       void *v)
> +{
> +	struct memory_notify *mn = v;
> +
> +	if (action != MEM_GOING_ONLINE)
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * Empty list means TDX isn't enabled.  Allow any memory
> +	 * to go online.
> +	 */
> +	if (list_empty(&tdx_memlist))
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * The TDX memory configuration is static and can not be
> +	 * changed.  Reject onlining any memory which is outside of
> +	 * the static configuration whether it supports TDX or not.
> +	 */
> +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> +		NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> +	.notifier_call = tdx_memory_notifier,
> +};
> +
>  static int __init tdx_init(void)
>  {
>  	u32 tdx_keyid_start, nr_tdx_keyids;
> @@ -107,6 +165,13 @@ static int __init tdx_init(void)
>  		goto no_tdx;
>  	}
>  
> +	err = register_memory_notifier(&tdx_memory_nb);
> +	if (err) {
> +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> +				err);
> +		goto no_tdx;
> +	}
> +
>  	tdx_guest_keyid_start = tdx_keyid_start;
>  	tdx_nr_guest_keyids = nr_tdx_keyids;
>  
> @@ -316,6 +381,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>  	return 0;
>  }
>  
> +/*
> + * Add a memory region as a TDX memory block.  The caller must make sure
> + * all memory regions are added in address ascending order and don't
> + * overlap.
> + */
> +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> +			    unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +
> +	/* @tmb_list is protected by mem_hotplug_lock */
> +	list_add_tail(&tmb->list, tmb_list);
> +	return 0;
> +}
> +
> +static void free_tdx_memlist(struct list_head *tmb_list)
> +{
> +	/* @tmb_list is protected by mem_hotplug_lock */
> +	while (!list_empty(tmb_list)) {
> +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}
> +
> +/*
> + * Ensure that all memblock memory regions are convertible to TDX
> + * memory.  Once this has been established, stash the memblock
> + * ranges off in a secondary structure because memblock is modified
> + * in memory hotplug while TDX memory regions are fixed.
> + */
> +static int build_tdx_memlist(struct list_head *tmb_list)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +		/*
> +		 * The first 1MB is not reported as TDX convertible memory.
> +		 * Although the first 1MB is always reserved and won't end up
> +		 * to the page allocator, it is still in memblock's memory
> +		 * regions.  Skip them manually to exclude them as TDX memory.
> +		 */
> +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock has already guaranteed they are in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memlist(tmb_list);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
> @@ -329,10 +467,25 @@ static int init_tdx_module(void)
>  	if (ret)
>  		return ret;
>  
> +	/*
> +	 * To keep things simple, assume that all TDX-protected memory
> +	 * will come from the page allocator.  Make sure all pages in the
> +	 * page allocator are TDX-usable memory.
> +	 *
> +	 * Build the list of "TDX-usable" memory regions which cover all
> +	 * pages in the page allocator to guarantee that.  Do it while
> +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> +	 * path reads the @tdx_memlist to reject any new memory.
> +	 */
> +	get_online_mems();
> +
> +	ret = build_tdx_memlist(&tdx_memlist);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Build the list of TDX-usable memory regions.
>  	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
>  	 *    all TDX-usable memory regions.
>  	 *  - Configure the TDMRs and the global KeyID to the TDX module.
> @@ -341,7 +494,15 @@ static int init_tdx_module(void)
>  	 *
>  	 *  Return error before all steps are done.
>  	 */
> -	return -EINVAL;
> +	ret = -EINVAL;
> +out:
> +	/*
> +	 * @tdx_memlist is written here and read at memory hotplug time.
> +	 * Lock out memory hotplug code while building it.
> +	 */
> +	put_online_mems();
> +
> +	return ret;
>  }
>  
>  static int __tdx_enable(void)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 2f2d8737a364..6518024fcb68 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -101,6 +101,12 @@ enum tdx_module_status_t {
>  	TDX_MODULE_ERROR
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +};
> +
>  struct tdx_module_output;
>  u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	       struct tdx_module_output *out);
> -- 
> 2.39.2
> 

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-08 22:27   ` Isaku Yamahata
@ 2023-03-12 23:08     ` Huang, Kai
  2023-03-13 23:49       ` Isaku Yamahata
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-03-12 23:08 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, bagasdotme, Hansen, Dave, Luck, Tony, david, ak, Wysocki,
	Rafael J, linux-kernel, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, tglx,
	kirill.shutemov, Yamahata, Isaku, peterz, Shahar, Sagi, imammedo,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	Williams, Dan J

On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > +
> > +static int try_init_module_global(void)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	spin_lock(&tdx_global_init_lock);
> > +
> > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > +			-EINVAL : 0;
> > +		goto out;
> > +	}
> > +
> > +	/* All '0's are just unused parameters. */
> > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +
> > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > +	if (ret)
> > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> 
> If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> In such case, we should allow the caller to retry or make this function retry
> instead of marking error stickily.

The spec says:

TDX_SYS_BUSY	The operation was invoked when another TDX module
		operation was in progress. The operation may be retried.

So I don't see how a lack of entropy is related to this error.  Perhaps you
were mixing it up with KEY.CONFIG?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-12 23:08     ` Huang, Kai
@ 2023-03-13 23:49       ` Isaku Yamahata
  2023-03-14  1:50         ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-13 23:49 UTC (permalink / raw)
  To: Huang, Kai
  Cc: isaku.yamahata, kvm, bagasdotme, Hansen, Dave, Luck, Tony, david,
	ak, Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, tglx,
	kirill.shutemov, Yamahata, Isaku, peterz, Shahar, Sagi, imammedo,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	Williams, Dan J

On Sun, Mar 12, 2023 at 11:08:44PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > +
> > > +static int try_init_module_global(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * The TDX module global initialization only needs to be done
> > > +	 * once on any cpu.
> > > +	 */
> > > +	spin_lock(&tdx_global_init_lock);
> > > +
> > > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > +			-EINVAL : 0;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/* All '0's are just unused parameters. */
> > > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > +
> > > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > +	if (ret)
> > > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > 
> > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > In such case, we should allow the caller to retry or make this function retry
> > instead of marking error stickily.
> 
> The spec says:
> 
> TDX_SYS_BUSY	The operation was invoked when another TDX module
> 		operation was in progress. The operation may be retried.
> 
> So I don't see how entropy is lacking is related to this error.  Perhaps you
> were mixing up with KEY.CONFIG?

TDH.SYS.INIT() initializes the global canary value.  The TDX module is
compiled with clang's strong stack protector enabled, so the canary value
needs to be initialized.  By default, the canary value is stored at
%fsbase:<STACK_CANARY_OFFSET 0x28>

Although this is normally a job for libc or the language runtime, the TDX
module has to do it itself because it is standalone.

From tdh_sys_init.c
_STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
{
    ia32_rflags_t rflags = {.raw = 0};
    uint64_t canary;
    if (!ia32_rdrand(&rflags, &canary))
    {
        return TDX_SYS_BUSY;
    }
...
    last_page_ptr->stack_canary.canary = canary;


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-13 23:49       ` Isaku Yamahata
@ 2023-03-14  1:50         ` Huang, Kai
  2023-03-14  4:02           ` Isaku Yamahata
  2023-03-14 15:48           ` Dave Hansen
  0 siblings, 2 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-14  1:50 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, Hansen, Dave, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	Luck, Tony, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
	Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
> > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > +
> > > > +static int try_init_module_global(void)
> > > > +{
> > > > +	int ret;
> > > > +
> > > > +	/*
> > > > +	 * The TDX module global initialization only needs to be done
> > > > +	 * once on any cpu.
> > > > +	 */
> > > > +	spin_lock(&tdx_global_init_lock);
> > > > +
> > > > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > +			-EINVAL : 0;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	/* All '0's are just unused parameters. */
> > > > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > +
> > > > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > +	if (ret)
> > > > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > 
> > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > In such case, we should allow the caller to retry or make this function retry
> > > instead of marking error stickily.
> > 
> > The spec says:
> > 
> > TDX_SYS_BUSY	The operation was invoked when another TDX module
> > 		operation was in progress. The operation may be retried.
> > 
> > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > were mixing up with KEY.CONFIG?
> 
> TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> strong stack protector enabled by clang and canary value needs to be
> initialized.  By default, the canary value is stored at
> %fsbase:<STACK_CANARY_OFFSET 0x28>
> 
> Although this is a job for libc or language runtime, TDX modules has to do it
> itself because it's stand alone.
> 
> From tdh_sys_init.c
> _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> {
>     ia32_rflags_t rflags = {.raw = 0};
>     uint64_t canary;
>     if (!ia32_rdrand(&rflags, &canary))
>     {
>         return TDX_SYS_BUSY;
>     }
> ...
>     last_page_ptr->stack_canary.canary = canary;
> 
> 

Then it is a hidden behaviour of the TDX module that is not reflected in the
spec.  I am not sure whether we should handle it because:

1) This is an extremely rare case.  The kernel would basically be under attack
if such an error happened.  In the current series we don't handle such a case
in KEY.CONFIG either but just leave a comment (see patch 13).

2) I am not sure whether this will be changed in the future.

So I think we should keep it as is.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-14  1:50         ` Huang, Kai
@ 2023-03-14  4:02           ` Isaku Yamahata
  2023-03-14  5:45             ` Dave Hansen
  2023-03-14 15:48           ` Dave Hansen
  1 sibling, 1 reply; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-14  4:02 UTC (permalink / raw)
  To: Huang, Kai
  Cc: isaku.yamahata, kvm, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	Luck, Tony, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
	Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, Mar 14, 2023 at 01:50:40AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> > On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > > +
> > > > > +static int try_init_module_global(void)
> > > > > +{
> > > > > +	int ret;
> > > > > +
> > > > > +	/*
> > > > > +	 * The TDX module global initialization only needs to be done
> > > > > +	 * once on any cpu.
> > > > > +	 */
> > > > > +	spin_lock(&tdx_global_init_lock);
> > > > > +
> > > > > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > > +			-EINVAL : 0;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	/* All '0's are just unused parameters. */
> > > > > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > > +
> > > > > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > > +	if (ret)
> > > > > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > > 
> > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > > In such case, we should allow the caller to retry or make this function retry
> > > > instead of marking error stickily.
> > > 
> > > The spec says:
> > > 
> > > TDX_SYS_BUSY	The operation was invoked when another TDX module
> > > 		operation was in progress. The operation may be retried.
> > > 
> > > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > > were mixing up with KEY.CONFIG?
> > 
> > TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> > strong stack protector enabled by clang and canary value needs to be
> > initialized.  By default, the canary value is stored at
> > %fsbase:<STACK_CANARY_OFFSET 0x28>
> > 
> > Although this is a job for libc or language runtime, TDX modules has to do it
> > itself because it's stand alone.
> > 
> > From tdh_sys_init.c
> > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> > {
> >     ia32_rflags_t rflags = {.raw = 0};
> >     uint64_t canary;
> >     if (!ia32_rdrand(&rflags, &canary))
> >     {
> >         return TDX_SYS_BUSY;
> >     }
> > ...
> >     last_page_ptr->stack_canary.canary = canary;
> > 
> > 
> 
> Then it is a hidden behaviour of the TDX module that is not reflected in the
> spec.  I am not sure whether we should handle because: 
> 
> 1) This is an extremely rare case.  Kernel would be basically under attack if
> such error happened.  In the current series we don't handle such case in
> KEY.CONFIG either but just leave a comment (see patch 13).
> 
> 2) Not sure whether this will be changed in the future.
> 
> So I think we should keep as is.

The TDX 1.5 spec introduced the TDX_RND_NO_ENTROPY status code.  For TDX 1.0,
let's postpone handling this until the TDX 1.5 work.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-14  4:02           ` Isaku Yamahata
@ 2023-03-14  5:45             ` Dave Hansen
  2023-03-14 17:16               ` Isaku Yamahata
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Hansen @ 2023-03-14  5:45 UTC (permalink / raw)
  To: Isaku Yamahata, Huang, Kai
  Cc: kvm, david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel,
	Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	Luck, Tony, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
	Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 3/13/23 21:02, Isaku Yamahata wrote:
>> Then it is a hidden behaviour of the TDX module that is not reflected in the
>> spec.  I am not sure whether we should handle because: 
>>
>> 1) This is an extremely rare case.  Kernel would be basically under attack if
>> such error happened.  In the current series we don't handle such case in
>> KEY.CONFIG either but just leave a comment (see patch 13).
>>
>> 2) Not sure whether this will be changed in the future.
>>
>> So I think we should keep as is.
> TDX 1.5 spec introduced TDX_RND_NO_ENTROPY status code.  For TDX 1.0, let's
> postpone it to TDX 1.5 activity.

What the heck does this mean?

I don't remember seeing any code here that checks for "TDX 1.0" or "TDX
1.5".  That means that this code needs to work with _any_ TDX version.

Are features being added to new versions that break code written for old
versions?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-14  1:50         ` Huang, Kai
  2023-03-14  4:02           ` Isaku Yamahata
@ 2023-03-14 15:48           ` Dave Hansen
  2023-03-15 11:10             ` Huang, Kai
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Hansen @ 2023-03-14 15:48 UTC (permalink / raw)
  To: Huang, Kai, isaku.yamahata
  Cc: kvm, david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel,
	Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	Luck, Tony, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
	Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 3/13/23 18:50, Huang, Kai wrote:
> On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
>> On Sun, Mar 12, 2023 at 11:08:44PM +0000,
>> "Huang, Kai" <kai.huang@intel.com> wrote:
>>
>>> On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
>>>>> +
>>>>> +static int try_init_module_global(void)
>>>>> +{
>>>>> +       int ret;
>>>>> +
>>>>> +       /*
>>>>> +        * The TDX module global initialization only needs to be done
>>>>> +        * once on any cpu.
>>>>> +        */
>>>>> +       spin_lock(&tdx_global_init_lock);
>>>>> +
>>>>> +       if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
>>>>> +               ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
>>>>> +                       -EINVAL : 0;
>>>>> +               goto out;
>>>>> +       }
>>>>> +
>>>>> +       /* All '0's are just unused parameters. */
>>>>> +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
>>>>> +
>>>>> +       tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
>>>>> +       if (ret)
>>>>> +               tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
>>>>
>>>> If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
>>>> In such case, we should allow the caller to retry or make this function retry
>>>> instead of marking error stickily.
>>>
>>> The spec says:
>>>
>>> TDX_SYS_BUSY        The operation was invoked when another TDX module
>>>             operation was in progress. The operation may be retried.
>>>
>>> So I don't see how entropy is lacking is related to this error.  Perhaps you
>>> were mixing up with KEY.CONFIG?
>>
>> TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
>> strong stack protector enabled by clang and canary value needs to be
>> initialized.  By default, the canary value is stored at
>> %fsbase:<STACK_CANARY_OFFSET 0x28>
>>
>> Although this is a job for libc or language runtime, TDX modules has to do it
>> itself because it's stand alone.
>>
>> From tdh_sys_init.c
>> _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
>> {
>>     ia32_rflags_t rflags = {.raw = 0};
>>     uint64_t canary;
>>     if (!ia32_rdrand(&rflags, &canary))
>>     {
>>         return TDX_SYS_BUSY;
>>     }
>> ...
>>     last_page_ptr->stack_canary.canary = canary;
>>
>>
> 
> Then it is a hidden behaviour of the TDX module that is not reflected in the
> spec.

This is true.  Could you please go ask the TDX module folks to fix this up?

> I am not sure whether we should handle because:
> 
> 1) This is an extremely rare case.  Kernel would be basically under attack if
> such error happened.  In the current series we don't handle such case in
> KEY.CONFIG either but just leave a comment (see patch 13).

Rare, yes.  Under attack?  I'm not sure where you get that from.  Look
at the SDM:

> Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand
> of random numbers by software processes/threads to exceed the rate at which the random number generator
> hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND
> instruction indicates the occurrence of this rare situation by clearing the CF flag.

That doesn't talk about attacks.

> 2) Not sure whether this will be changed in the future.
> 
> So I think we should keep as is.

TDX_SYS_BUSY really is missing some nuance.  You *REALLY* want to retry
RDRAND failures.  But, if you have VMM locking and don't expect two
users calling into the TDX module then TDX_SYS_BUSY from a busy *module*
is a bad (and probably fatal) signal.

I suspect we should just throw a few retries in the seamcall()
infrastructure to retry in the case of TDX_SYS_BUSY.  It'll take care of
RDRAND failures.  If a retry loop fails to resolve it, then we should
probably dump a warning and return an error.

Just do this once, in common code.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-14  5:45             ` Dave Hansen
@ 2023-03-14 17:16               ` Isaku Yamahata
  2023-03-14 17:38                 ` Dave Hansen
  0 siblings, 1 reply; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-14 17:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Isaku Yamahata, Huang, Kai, kvm, david, bagasdotme, ak, Wysocki,
	Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	Luck, Tony, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
	Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Mon, Mar 13, 2023 at 10:45:45PM -0700,
Dave Hansen <dave.hansen@intel.com> wrote:

> On 3/13/23 21:02, Isaku Yamahata wrote:
> >> Then it is a hidden behaviour of the TDX module that is not reflected in the
> >> spec.  I am not sure whether we should handle because: 
> >>
> >> 1) This is an extremely rare case.  Kernel would be basically under attack if
> >> such error happened.  In the current series we don't handle such case in
> >> KEY.CONFIG either but just leave a comment (see patch 13).
> >>
> >> 2) Not sure whether this will be changed in the future.
> >>
> >> So I think we should keep as is.
> > TDX 1.5 spec introduced TDX_RND_NO_ENTROPY status code.  For TDX 1.0, let's
> > postpone it to TDX 1.5 activity.
> 
> What the heck does this mean?
> 
> I don't remember seeing any code here that checks for "TDX 1.0" or "TDX
> 1.5".  That means that this code needs to work with _any_ TDX version.
> 
> Are features being added to new versions that break code written for old
> versions?

No new feature, but a new error code: TDX_RND_NO_ENTROPY, meaning lack of
entropy.  In TDX 1.0, some APIs return TDX_SYS_BUSY for either contention
(lock failure) or lack of entropy, and the caller can't distinguish the two.
In TDX 1.5, they return TDX_RND_NO_ENTROPY instead of TDX_SYS_BUSY in the case
of rdrand/rdseed failure.

Because both TDX_SYS_BUSY and TDX_RND_NO_ENTROPY are recoverable errors
(bit 63 error=1, bit 62 non_recoverable=0), the caller can check the error bit
and the non_recoverable bit to decide whether to retry.
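
The bit check described above could look roughly like the following
user-space sketch.  The macro and function names are hypothetical and are
not taken from the series; only the bit positions (63 for error, 62 for
non_recoverable) come from this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in the mail; the names are illustrative. */
#define DEMO_TDX_ERROR_BIT            (1ULL << 63)
#define DEMO_TDX_NON_RECOVERABLE_BIT  (1ULL << 62)

/*
 * Return true when the status is an error (bit 63 set) that is still
 * recoverable (bit 62 clear), i.e. one the caller may retry.
 */
bool demo_tdx_status_retryable(uint64_t status)
{
	return (status & DEMO_TDX_ERROR_BIT) &&
	       !(status & DEMO_TDX_NON_RECOVERABLE_BIT);
}
```

A success status (no error bit) and an error with bit 62 set both yield
false, so only the "error but recoverable" combination leads to a retry.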
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-14 17:16               ` Isaku Yamahata
@ 2023-03-14 17:38                 ` Dave Hansen
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2023-03-14 17:38 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Huang, Kai, kvm, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	Luck, Tony, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown,
	Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 3/14/23 10:16, Isaku Yamahata wrote:
>>> TDX 1.5 spec introduced TDX_RND_NO_ENTROPY status code.  For TDX 1.0, let's
>>> postpone it to TDX 1.5 activity.
>> What the heck does this mean?
>>
>> I don't remember seeing any code here that checks for "TDX 1.0" or "TDX
>> 1.5".  That means that this code needs to work with _any_ TDX version.
>>
>> Are features being added to new versions that break code written for old
>> versions?
> No new feature, but new error code. TDX_RND_NO_ENTROPY, lack of entropy.
> For TDX 1.0, some APIs return TDX_SYS_BUSY. It can be contention(lock failure)
> or the lack of entropy.  The caller can't distinguish them.
> For TDX 1.5, they return TDX_RND_NO_ENTROPY instead of TDX_SYS_BUSY in the case
> of rdrand/rdseed failure.
> 
> Because both TDX_SYS_BUSY and TDX_RND_NO_ENTROPY are recoverable error
> (bit 63 error=1, bit 62 non_recoverable=0), the caller can check error bit and
> non_recoverable bit for retry.

Oh, that's actually really nice.  It separates out the "RDRAND is empty"
issue from the "the VMM should have had a lock here" issue.

For now, let's consider TDX_SYS_BUSY to basically indicate a non-fatal
kernel bug: the kernel called TDX in a way that it shouldn't have.
We'll treat it in the kernel as non-recoverable.  We'll return an error,
WARN_ON(), and keep on running.

A follow-on patch can add generic TDX_RND_NO_ENTROPY retry support to
the seamcall infrastructure.
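
As a rough illustration of that policy, here is a user-space sketch, not
kernel code.  Everything prefixed with demo_/DEMO_ is hypothetical: the
status values are placeholders rather than the real encodings, the retry
count and the errno values are arbitrary choices, and fprintf() stands in
for WARN_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder status values; the real ones are defined by the TDX spec. */
#define DEMO_TDX_SUCCESS         0x0ULL
#define DEMO_TDX_SYS_BUSY        0x8000000000000001ULL
#define DEMO_TDX_RND_NO_ENTROPY  0x8000000000000002ULL

#define DEMO_MAX_RETRIES 3	/* arbitrary bound for the sketch */

/* Stand-in for the low-level call; returns a raw status code. */
typedef uint64_t (*demo_seamcall_fn)(void *ctx);

int demo_seamcall_retry(demo_seamcall_fn fn, void *ctx)
{
	uint64_t status;
	int attempt;

	assert(fn != NULL);

	/* One initial attempt plus up to DEMO_MAX_RETRIES retries. */
	for (attempt = 0; attempt <= DEMO_MAX_RETRIES; attempt++) {
		status = fn(ctx);
		if (status == DEMO_TDX_SUCCESS)
			return 0;
		if (status == DEMO_TDX_SYS_BUSY) {
			/* Treated as a caller bug: warn, fail, do not retry. */
			fprintf(stderr, "unexpected TDX_SYS_BUSY\n");
			return -EBUSY;
		}
		if (status != DEMO_TDX_RND_NO_ENTROPY)
			return -EIO;
		/* Out of entropy: loop and try again. */
	}
	return -EAGAIN;
}

/* Test stub: reports "no entropy" *ctx times, then succeeds. */
uint64_t demo_flaky_seamcall(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return DEMO_TDX_RND_NO_ENTROPY;
	}
	return DEMO_TDX_SUCCESS;
}

/* Test stub: always reports a busy module. */
uint64_t demo_busy_seamcall(void *ctx)
{
	(void)ctx;
	return DEMO_TDX_SYS_BUSY;
}
```

Keeping the loop in one wrapper matches the "do this once, in common code"
suggestion earlier in the thread.  TDH.SYS.KEY.CONFIG would still need its
own handling because it reports entropy failure with a different code.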

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-14 15:48           ` Dave Hansen
@ 2023-03-15 11:10             ` Huang, Kai
  2023-03-16 22:07               ` Huang, Kai
  2023-03-23 13:49               ` Dave Hansen
  0 siblings, 2 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-15 11:10 UTC (permalink / raw)
  To: Hansen, Dave, isaku.yamahata
  Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Tue, 2023-03-14 at 08:48 -0700, Dave Hansen wrote:
> On 3/13/23 18:50, Huang, Kai wrote:
> > On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> > > On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> > > "Huang, Kai" <kai.huang@intel.com> wrote:
> > > 
> > > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > > > +
> > > > > > +static int try_init_module_global(void)
> > > > > > +{
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * The TDX module global initialization only needs to be done
> > > > > > +        * once on any cpu.
> > > > > > +        */
> > > > > > +       spin_lock(&tdx_global_init_lock);
> > > > > > +
> > > > > > +       if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > > > +               ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > > > +                       -EINVAL : 0;
> > > > > > +               goto out;
> > > > > > +       }
> > > > > > +
> > > > > > +       /* All '0's are just unused parameters. */
> > > > > > +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > > > +
> > > > > > +       tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > > > +       if (ret)
> > > > > > +               tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > > > 
> > > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > > > In such case, we should allow the caller to retry or make this function retry
> > > > > instead of marking error stickily.
> > > > 
> > > > The spec says:
> > > > 
> > > > TDX_SYS_BUSY        The operation was invoked when another TDX module
> > > >             operation was in progress. The operation may be retried.
> > > > 
> > > > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > > > were mixing up with KEY.CONFIG?
> > > 
> > > TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> > > strong stack protector enabled by clang and canary value needs to be
> > > initialized.  By default, the canary value is stored at
> > > %fsbase:<STACK_CANARY_OFFSET 0x28>
> > > 
> > > Although this is a job for libc or language runtime, TDX modules has to do it
> > > itself because it's stand alone.
> > > 
> > > From tdh_sys_init.c
> > > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> > > {
> > >     ia32_rflags_t rflags = {.raw = 0};
> > >     uint64_t canary;
> > >     if (!ia32_rdrand(&rflags, &canary))
> > >     {
> > >         return TDX_SYS_BUSY;
> > >     }
> > > ...
> > >     last_page_ptr->stack_canary.canary = canary;
> > > 
> > > 
> > 
> > Then it is a hidden behaviour of the TDX module that is not reflected in the
> > spec.
> 
> This is true.  Could you please go ask the TDX module folks to fix this up?

Sure will do.

To make sure: do you mean we should ask the TDX module guys to add the new
TDX_RND_NO_ENTROPY error code to TDX module 1.0?

"another TDX module operation was in progress" and "running out of entropy" are
different things and should not be mixed together IMHO.

> 
> > I am not sure whether we should handle because:
> > 
> > 1) This is an extremely rare case.  Kernel would be basically under attack if
> > such error happened.  In the current series we don't handle such case in
> > KEY.CONFIG either but just leave a comment (see patch 13).
> 
> Rare, yes.  Under attack?  I'm not sure where you get that from.  Look
> at the SDM:
> 
> > Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand
> > of random numbers by software processes/threads to exceed the rate at which the random number generator
> > hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND
> > instruction indicates the occurrence of this rare situation by clearing the CF flag.
> 
> That doesn't talk about attacks.

Thanks for citing the documentation.  I checked the kernel code before and it
seems there's currently no code that calls RDRAND very frequently.  But yes, we
should not say "under attack".  I have a vague memory that someone said so
(maybe me?).

> 
> > 2) Not sure whether this will be changed in the future.
> > 
> > So I think we should keep as is.
> 
> TDX_SYS_BUSY really is missing some nuance.  You *REALLY* want to retry
> RDRAND failures.  
> 

OK.  Agreed.  Then I think TDH.SYS.KEY.CONFIG should also be retried when it
runs out of entropy.

> But, if you have VMM locking and don't expect two
> users calling into the TDX module then TDX_SYS_BUSY from a busy *module*
> is a bad (and probably fatal) signal.

Yes, we have a lock to protect TDH.SYS.INIT from being called in parallel.
Without this entropy issue, TDX_SYS_BUSY should never happen.

> 
> I suspect we should just throw a few retries in the seamcall()
> infrastructure to retry in the case of TDX_SYS_BUSY.  It'll take care of
> RDRAND failures.  If a retry loop fails to resolve it, then we should
> probably dump a warning and return an error.
> 
> Just do this once, in common code.

I can do that.  Just to make sure: do you want to retry on TDX_SYS_BUSY, or on
TDX_RND_NO_ENTROPY (if we ask the TDX module guys to change the module to
return this value)?

Also, even if we retry on either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in the
common seamcall() code, that doesn't cover TDH.SYS.KEY.CONFIG, because sadly
this SEAMCALL returns a different error code:

TDX_KEY_GENERATION_FAILED	Failed to generate a random key. This is 
				typically caused by an entropy error of the
				CPU's random number generator, and may
				be impacted by RDSEED, RDRAND or PCONFIG
				executing on other LPs. The operation should be
				retried.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-06 14:13 ` [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
  2023-03-08 22:27   ` Isaku Yamahata
@ 2023-03-16  0:31   ` Isaku Yamahata
  2023-03-16  2:45     ` Isaku Yamahata
  1 sibling, 1 reply; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-16  0:31 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, dave.hansen, peterz, tglx, seanjc,
	pbonzini, dan.j.williams, rafael.j.wysocki, kirill.shutemov,
	ying.huang, reinette.chatre, len.brown, tony.luck, ak,
	isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, david,
	bagasdotme, sagis, imammedo, isaku.yamahata

On Tue, Mar 07, 2023 at 03:13:50AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> +static int try_init_module_global(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	spin_lock(&tdx_global_init_lock);


If I use tdx_cpu_enable() via KVM's hardware_enable_all(), this function is
called in IPI callback context and lockdep complains.  Here is my patch to
address it:

From 0c4022ffe8cd68dfb455c418eb65538e4e100115 Mon Sep 17 00:00:00 2001
Message-Id: <0c4022ffe8cd68dfb455c418eb65538e4e100115.1678926123.git.isaku.yamahata@intel.com>
In-Reply-To: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
References: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <isaku.yamahata@intel.com>
Date: Wed, 15 Mar 2023 14:26:37 -0700
Subject: [PATCH] x86/virt/vmx/tdx: Use raw spin lock instead of spin lock

tdx_cpu_enable() can be called from an IPI handler.  Lockdep complains about
the spin lock as follows.  Use a raw spin lock instead.

=============================
[ BUG: Invalid wait context ]
6.3.0-rc1-tdx-kvm-upstream+ #389 Not tainted
-----------------------------
swapper/154/0 is trying to lock:
ffffffffa7875e58 (tdx_global_init_lock){....}-{3:3}, at: tdx_cpu_enable+0x67/0x180
other info that might help us debug this:
context-{2:2}
no locks held by swapper/154/0.
stack backtrace:
Call Trace:
 <IRQ>
 dump_stack_lvl+0x64/0xb0
 dump_stack+0x10/0x20
 __lock_acquire+0x912/0xc30
 lock_acquire.part.0+0x99/0x220
 lock_acquire+0x60/0x170
 _raw_spin_lock_irqsave+0x43/0x70
 tdx_cpu_enable+0x67/0x180
 vt_hardware_enable+0x3b/0x60
 kvm_arch_hardware_enable+0xe7/0x2e0
 hardware_enable_nolock+0x33/0x80
 __flush_smp_call_function_queue+0xc4/0x590
 generic_smp_call_function_single_interrupt+0x1a/0xb0
 __sysvec_call_function+0x48/0x200
 sysvec_call_function+0xad/0xd0
 </IRQ>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2ee37a5dedcf..e1c8ffad7406 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -41,7 +41,7 @@ static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
 static unsigned int tdx_global_init_status;
-static DEFINE_SPINLOCK(tdx_global_init_lock);
+static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
 #define TDX_GLOBAL_INIT_DONE	_BITUL(0)
 #define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
 
@@ -349,6 +349,7 @@ static void tdx_trace_seamcalls(u64 level)
 
 static int try_init_module_global(void)
 {
+	unsigned long flags;
 	u64 tsx_ctrl;
 	int ret;
 
@@ -356,7 +357,7 @@ static int try_init_module_global(void)
 	 * The TDX module global initialization only needs to be done
 	 * once on any cpu.
 	 */
-	spin_lock(&tdx_global_init_lock);
+	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
 
 	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
 		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
@@ -373,7 +374,7 @@ static int try_init_module_global(void)
 	if (ret)
 		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
 out:
-	spin_unlock(&tdx_global_init_lock);
+	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
 
 	if (ret) {
 		if (trace_boot_seamcalls)
-- 
2.25.1


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-16  0:31   ` Isaku Yamahata
@ 2023-03-16  2:45     ` Isaku Yamahata
  2023-03-16  2:52       ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Isaku Yamahata @ 2023-03-16  2:45 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, dave.hansen, peterz, tglx, seanjc,
	pbonzini, dan.j.williams, rafael.j.wysocki, kirill.shutemov,
	ying.huang, reinette.chatre, len.brown, tony.luck, ak,
	isaku.yamahata, chao.gao, sathyanarayanan.kuppuswamy, david,
	bagasdotme, sagis, imammedo, isaku.yamahata

On Wed, Mar 15, 2023 at 05:31:02PM -0700,
Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> On Tue, Mar 07, 2023 at 03:13:50AM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > +static int try_init_module_global(void)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	spin_lock(&tdx_global_init_lock);
> 
> 
> If I use tdx_cpu_enable() via kvm hardware_enable_all(), this function is called
> in the context IPI callback and the lockdep complains.  Here is my patch to
> address it
> 
> From 0c4022ffe8cd68dfb455c418eb65538e4e100115 Mon Sep 17 00:00:00 2001
> Message-Id: <0c4022ffe8cd68dfb455c418eb65538e4e100115.1678926123.git.isaku.yamahata@intel.com>
> In-Reply-To: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
> References: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> Date: Wed, 15 Mar 2023 14:26:37 -0700
> Subject: [PATCH] x86/virt/vmx/tdx: Use raw spin lock instead of spin lock
> 
> tdx_cpu_enable() can be called by IPI handler.  The lockdep complains about
> spin lock as follows.  Use raw spin lock.
> 
> =============================
> [ BUG: Invalid wait context ]
> 6.3.0-rc1-tdx-kvm-upstream+ #389 Not tainted
> -----------------------------
> swapper/154/0 is trying to lock:
> ffffffffa7875e58 (tdx_global_init_lock){....}-{3:3}, at: tdx_cpu_enable+0x67/0x180
> other info that might help us debug this:
> context-{2:2}
> no locks held by swapper/154/0.
> stack backtrace:
> Call Trace:
>  <IRQ>
>  dump_stack_lvl+0x64/0xb0
>  dump_stack+0x10/0x20
>  __lock_acquire+0x912/0xc30
>  lock_acquire.part.0+0x99/0x220
>  lock_acquire+0x60/0x170
>  _raw_spin_lock_irqsave+0x43/0x70
>  tdx_cpu_enable+0x67/0x180
>  vt_hardware_enable+0x3b/0x60
>  kvm_arch_hardware_enable+0xe7/0x2e0
>  hardware_enable_nolock+0x33/0x80
>  __flush_smp_call_function_queue+0xc4/0x590
>  generic_smp_call_function_single_interrupt+0x1a/0xb0
>  __sysvec_call_function+0x48/0x200
>  sysvec_call_function+0xad/0xd0
>  </IRQ>
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2ee37a5dedcf..e1c8ffad7406 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -41,7 +41,7 @@ static u32 tdx_guest_keyid_start __ro_after_init;
>  static u32 tdx_nr_guest_keyids __ro_after_init;
>  
>  static unsigned int tdx_global_init_status;
> -static DEFINE_SPINLOCK(tdx_global_init_lock);
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
>  #define TDX_GLOBAL_INIT_DONE	_BITUL(0)
>  #define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
>  
> @@ -349,6 +349,7 @@ static void tdx_trace_seamcalls(u64 level)
>  
>  static int try_init_module_global(void)
>  {
> +	unsigned long flags;
>  	u64 tsx_ctrl;
>  	int ret;
>  
> @@ -356,7 +357,7 @@ static int try_init_module_global(void)
>  	 * The TDX module global initialization only needs to be done
>  	 * once on any cpu.
>  	 */
> -	spin_lock(&tdx_global_init_lock);
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);

As hardware_enable_all() uses cpus_read_lock(), irqsave isn't needed.
This line should be raw_spin_lock().
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-16  2:45     ` Isaku Yamahata
@ 2023-03-16  2:52       ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-16  2:52 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, bagasdotme, Hansen, Dave, Luck, Tony, david, ak, Wysocki,
	Rafael J, linux-kernel, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, linux-mm, tglx,
	kirill.shutemov, Yamahata, Isaku, peterz, Shahar, Sagi, imammedo,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	Williams, Dan J


> >  
> > @@ -356,7 +357,7 @@ static int try_init_module_global(void)
> >  	 * The TDX module global initialization only needs to be done
> >  	 * once on any cpu.
> >  	 */
> > -	spin_lock(&tdx_global_init_lock);
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> 
> As hardware_enable_all() uses cpus_read_lock(), irqsave isn't needed.
> this line should be raw_spin_lock().
> 

OK.  I missed that in a PREEMPT_RT kernel the spinlock is converted to a
sleeping lock.  So I'll change to raw_spin_lock() as discussed.  Thanks.
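
For readers following along, the once-only pattern of
try_init_module_global() can be modelled in user space as below.  This is
only a sketch under stated assumptions: a C11 atomic_flag busy-wait loop
stands in for the raw spinlock (like a raw spinlock, it never sleeps), and
all demo_/DEMO_ names, including the stub init function, are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define DEMO_INIT_DONE   (1u << 0)
#define DEMO_INIT_FAILED (1u << 1)

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static unsigned int demo_init_status;
int demo_init_calls;	/* how many times the real init actually ran */

/* Hypothetical stand-in for the one-time global initialization. */
int demo_do_global_init(void)
{
	demo_init_calls++;
	return 0;
}

int demo_try_init_global(void)
{
	int ret;

	/* Busy-wait until the flag is free; this lock never sleeps. */
	while (atomic_flag_test_and_set_explicit(&demo_lock,
						 memory_order_acquire))
		;

	/* Already attempted: report the remembered (sticky) result. */
	if (demo_init_status & DEMO_INIT_DONE) {
		ret = (demo_init_status & DEMO_INIT_FAILED) ? -EINVAL : 0;
		goto out;
	}

	ret = demo_do_global_init();

	demo_init_status = DEMO_INIT_DONE;
	if (ret)
		demo_init_status |= DEMO_INIT_FAILED;
out:
	atomic_flag_clear_explicit(&demo_lock, memory_order_release);
	return ret;
}
```

The second and later callers only read the recorded status, so the
initialization itself runs at most once.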


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 00/16] TDX host kernel support
  2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
                   ` (16 preceding siblings ...)
  2023-03-08  1:11 ` [PATCH v10 00/16] TDX host kernel support Isaku Yamahata
@ 2023-03-16 12:35 ` David Hildenbrand
  2023-03-16 22:06   ` Huang, Kai
  17 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2023-03-16 12:35 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo

On 06.03.23 15:13, Kai Huang wrote:
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  TDX specs are available in [1].

I'm afraid there is no [1], probably got lost while resending :)

> 
> This series is the initial support to enable TDX with minimal code to
> allow KVM to create and run TDX guests.  KVM support for TDX is being
> developed separately[2].  A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed[3].  The KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.

Same with [2].

> 
> This series doesn't aim to support all functionalities, and doesn't aim
> to resolve all things perfectly.  For example, memory hotplug is handled
> in simple way (please refer to "Kernel policy on TDX memory" and "Memory
> hotplug" sections below).
> 
> (For memory hotplug, sorry for broadcasting widely but I cc'ed the
> linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> help to provide comments.)
> 
> And TDX module metadata allocation just uses alloc_contig_pages() to
> allocate large chunk at runtime, thus it can fail.  It is imperfect now
> but _will_ be improved in the future.

Good enough for now I guess. Reserving it via memblock might be better, 
though.

> 
> Also, the patch to add the new kernel command line tdx="force" isn't included
> in this initial version, as Dave suggested it isn't mandatory.  But I
> _will_ add one once this initial version gets merged.

What would be the main purpose of that option?

> 
> All other optimizations will be posted as follow-up once this initial
> TDX support is upstreamed.
> 


[...]

> == Background ==
> 
> TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
> and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
> A CPU-attested software module called 'the TDX module' runs in the new
> isolated region as a trusted hypervisor to create/run protected VMs.
> 
> TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
> provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
> as TDX private KeyIDs, which are only accessible within the SEAM mode.
> 
> TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
> secure processor to provide crypto-protection.  The firmware running on the
> secure processor plays a role similar to that of the TDX module.
> 
> The host kernel communicates with SEAM software via a new SEAMCALL
> instruction.  This is conceptually similar to a guest->host hypercall,
> except it is made from the host to SEAM software instead.
> 
> Before being able to manage TD guests, the TDX module must be loaded
> and properly initialized.  This series assumes the TDX module is loaded
> by BIOS before the kernel boots.
> 
> How to initialize the TDX module is described at TDX module 1.0
> specification, chapter "13.Intel TDX Module Lifecycle: Enumeration,
> Initialization and Shutdown".
> 
> == Design Considerations ==
> 
> 1. Initialize the TDX module at runtime
> 
> There are basically two ways the TDX module could be initialized: either
> in early boot, or at runtime before the first TDX guest is run.  This
> series implements the runtime initialization.
> 
> This series adds a function tdx_enable() to allow the caller to initialize
> TDX at runtime:
> 
>          if (tdx_enable())
>                  goto no_tdx;
> 	// TDX is ready to create TD guests.
> 
> This approach has below pros:
> 
> 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> metadata.  Enabling TDX on demand allows only to consume this memory when
> TDX is truly needed (i.e. when KVM wants to create TD guests).

Let's be clear: nobody is going to run encrypted VMs "out of the blue".

You can expect a certain hypervisor setup to be required, for example, 
enabling it on the cmdline and then allocating that metadata from 
memblock during boot.

IIRC s390x handles it similarly with protected VMs and required metadata.
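
For a sense of scale, the "~1/256th of system RAM" figure quoted above can
be turned into a quick estimate.  This is only back-of-the-envelope
arithmetic with a hypothetical helper name; the actual metadata size
depends on the TDX module and is not computed this way in the series.

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimate only: takes the quoted "~1/256th" ratio at face value. */
uint64_t demo_tdx_metadata_estimate(uint64_t ram_bytes)
{
	return ram_bytes / 256;
}
```

Under that ratio, a machine with 1 TiB of RAM would set aside about 4 GiB,
and one with 256 GiB about 1 GiB.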

> 
> 2) SEAMCALL requires CPU being already in VMX operation (VMXON has been
> done).  So far, KVM is the only user of TDX, and it already handles VMXON.
> Letting KVM to initialize TDX avoids handling VMXON in the core kernel.
> 
> 3) It is more flexible to support "TDX module runtime update" (not in
> this series).  After updating to the new module at runtime, kernel needs
> to go through the initialization process again.
> 
> 2. CPU hotplug
> 
> TDX module requires the per-cpu initialization SEAMCALL (TDH.SYS.LP.INIT)
> must be done on one cpu before any other SEAMCALLs can be made on that
> cpu, including those involved during the module initialization.
> 
> The kernel provides tdx_cpu_enable() to let the user of TDX to do it when
> the user wants to use a new cpu for TDX task.
> 
> TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
> never support hotpluggable CPU devices and/or deliver ACPI CPU hotplug
> event to the kernel.  This series doesn't handle physical (ACPI) CPU
> hotplug at all but depends on the BIOS to behave correctly.
> 
> Note TDX works with CPU logical online/offline, thus this series still
> allows to do logical CPU online/offline.
> 
> 3. Kernel policy on TDX memory
> 
> The TDX module reports a list of "Convertible Memory Region" (CMR) to
> indicate which memory regions are TDX-capable.  The TDX architecture
> allows the VMM to designate specific convertible memory regions as usable
> for TDX private memory.
> 
> The initial support of TDX guests will only allocate TDX private memory
> from the global page allocator.  This series chooses to designate _all_
> system RAM in the core-mm at the time of initializing TDX module as TDX
> memory to guarantee all pages in the page allocator are TDX pages.
> 
> 4. Memory Hotplug
> 
> After the kernel passes all "TDX-usable" memory regions to the TDX
> module, the set of "TDX-usable" memory regions are fixed during module's
> runtime.  No more "TDX-usable" memory can be added to the TDX module
> after that.
> 
> To achieve above "to guarantee all pages in the page allocator are TDX
> pages", this series simply chooses to reject any non-TDX-usable memory in
> memory hotplug.
> 
> This _will_ be enhanced in the future after first submission.

What's the primary reason to enhance that? Are there reasonable use
cases? Why would we expect to have other (!TDX capable) memory in the
system?

> 
> A better solution, suggested by Kirill, is similar to the per-node memory
> encryption flag in this series [4].  We can allow adding/onlining non-TDX
> memory to separate NUMA nodes so that both "TDX-capable" nodes and
> "non-TDX-capable" nodes can co-exist.  The new TDX flag can be exposed to
> userspace via /sysfs so userspace can bind TDX guests to "TDX-capable"
> nodes via NUMA ABIs.
> 
> 5. Physical Memory Hotplug
> 
> Note TDX assumes convertible memory is always physically present during
> machine's runtime.  A non-buggy BIOS should never support hot-removal of
> any convertible memory.  This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros
  2023-03-06 14:13 ` [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
@ 2023-03-16 12:37   ` David Hildenbrand
  2023-03-16 22:41     ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2023-03-16 12:37 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo

On 06.03.23 15:13, Kai Huang wrote:
> TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
> defined by the TDX module spec and used as TDX module ABI.  Currently,
> they are used in try_accept_one() when the TDX guest tries to accept a
> page.  However currently try_accept_one() uses hard-coded magic values.
> 
> Define TDX supported page sizes as macros and get rid of the hard-coded
> values in try_accept_one().  TDX host support will need to use them too.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2023-03-06 14:13 ` [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
@ 2023-03-16 12:48   ` David Hildenbrand
  2023-03-16 22:37     ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2023-03-16 12:48 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo

On 06.03.23 15:13, Kai Huang wrote:
> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  A CPU-attested software module
> called 'the TDX module' runs inside a new isolated memory range as a
> trusted hypervisor to manage and run protected VMs.
> 
> Pre-TDX Intel hardware has support for a memory encryption architecture
> called MKTME.  The memory encryption hardware underpinning MKTME is also
> used for Intel TDX.  TDX ends up "stealing" some of the physical address
> space from the MKTME architecture for crypto-protection to VMs.  The
> BIOS is responsible for partitioning the "KeyID" space between legacy
> MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
> KeyIDs' or 'TDX KeyIDs' for short.
> 
> TDX doesn't trust the BIOS.  During machine boot, TDX verifies the TDX
> private KeyIDs are consistently and correctly programmed by the BIOS
> across all CPU packages before it enables TDX on any CPU core.  A valid
> TDX private KeyID range on BSP indicates TDX has been enabled by the
> BIOS, otherwise the BIOS is buggy.

So we don't trust the BIOS, but trust the BIOS that it won't hot-remove 
physical memory or hotplug physical CPUS (if I understood the cover 
letter correctly)? :)

> 
> The TDX module is expected to be loaded by the BIOS when it enables TDX,
> but the kernel needs to properly initialize it before it can be used to
> create and run any TDX guests.  The TDX module will be initialized by
> the KVM subsystem when KVM wants to use TDX.
> 
> Add a new early_initcall(tdx_init) to detect the TDX by detecting TDX
> private KeyIDs.  Also add a function to report whether TDX is enabled by
> the BIOS.  Similar to AMD SME, kexec() will use it to determine whether
> cache flush is needed.
> 
> The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
> to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
> own protection.  Just use the first TDX KeyID as the global KeyID and
> leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
> disable TDX as initializing the TDX module alone is useless.

Does that really happen in practice that we care about that at all? 
Seems weird and rather like a broken firmware or sth like that ...

> 
> To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
> TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
> to opt-in TDX host kernel support (to distinguish with TDX guest kernel
> support).  So far only KVM uses TDX.  Make the new config option depend
> on KVM_INTEL.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>


[...]

> ---
>   arch/x86/Kconfig                 |  12 ++++
>   arch/x86/Makefile                |   2 +
>   arch/x86/include/asm/msr-index.h |   3 +
>   arch/x86/include/asm/tdx.h       |   7 +++
>   arch/x86/virt/Makefile           |   2 +
>   arch/x86/virt/vmx/Makefile       |   2 +
>   arch/x86/virt/vmx/tdx/Makefile   |   2 +
>   arch/x86/virt/vmx/tdx/tdx.c      | 105 +++++++++++++++++++++++++++++++
>   8 files changed, 135 insertions(+)
>   create mode 100644 arch/x86/virt/Makefile
>   create mode 100644 arch/x86/virt/vmx/Makefile
>   create mode 100644 arch/x86/virt/vmx/tdx/Makefile
>   create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3604074a878b..fc010973a6ff 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1952,6 +1952,18 @@ config X86_SGX
>   
>   	  If unsure, say N.
>   
> +config INTEL_TDX_HOST
> +	bool "Intel Trust Domain Extensions (TDX) host support"
> +	depends on CPU_SUP_INTEL
> +	depends on X86_64
> +	depends on KVM_INTEL
> +	help
> +	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> +	  host and certain physical attacks.  This option enables necessary TDX
> +	  support in host kernel to run protected VMs.

s/in host/in the host/ ?

Also, is "protected VMs" the right term to use here? "Encrypted VMs", 
"Confidential VMs" ... ?

> +
> +	  If unsure, say N.
> +
>   config EFI
>   	bool "EFI runtime service support"
>   	depends on ACPI
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 9cf07322875a..972b5a64ce38 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -252,6 +252,8 @@ archheaders:
>   
>   libs-y  += arch/x86/lib/
>   
> +core-y += arch/x86/virt/
> +
>   # drivers-y are linked after core-y
>   drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
>   drivers-$(CONFIG_PCI)            += arch/x86/pci/
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 37ff47552bcb..952374ddb167 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -512,6 +512,9 @@
>   #define MSR_RELOAD_PMC0			0x000014c1
>   #define MSR_RELOAD_FIXED_CTR0		0x00001309
>   
> +/* KeyID partitioning between MKTME and TDX */
> +#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
> +
>   /*
>    * AMD64 MSRs. Not complete. See the architecture manual for a more
>    * complete list.
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 25fd6070dc0b..4dfe2e794411 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>   	return -ENODEV;
>   }
>   #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +bool platform_tdx_enabled(void);
> +#else	/* !CONFIG_INTEL_TDX_HOST */
> +static inline bool platform_tdx_enabled(void) { return false; }
> +#endif	/* CONFIG_INTEL_TDX_HOST */
> +
>   #endif /* !__ASSEMBLY__ */
>   #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
> new file mode 100644
> index 000000000000..1e36502cd738
> --- /dev/null
> +++ b/arch/x86/virt/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y	+= vmx/
> diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
> new file mode 100644
> index 000000000000..feebda21d793
> --- /dev/null
> +++ b/arch/x86/virt/vmx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
> diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
> new file mode 100644
> index 000000000000..93ca8b73e1f1
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-y += tdx.o
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> new file mode 100644
> index 000000000000..a600b5d0879d
> --- /dev/null
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright(c) 2023 Intel Corporation.
> + *
> + * Intel Trusted Domain Extensions (TDX) support
> + */
> +
> +#define pr_fmt(fmt)	"tdx: " fmt
> +
> +#include <linux/types.h>
> +#include <linux/cache.h>
> +#include <linux/init.h>
> +#include <linux/errno.h>
> +#include <linux/printk.h>
> +#include <asm/msr-index.h>
> +#include <asm/msr.h>
> +#include <asm/tdx.h>
> +
> +static u32 tdx_global_keyid __ro_after_init;
> +static u32 tdx_guest_keyid_start __ro_after_init;
> +static u32 tdx_nr_guest_keyids __ro_after_init;
> +
> +/*
> + * Use tdx_global_keyid to indicate that TDX is uninitialized.
> + * This is used in TDX initialization error paths to take it from
> + * initialized -> uninitialized.
> + */
> +static void __init clear_tdx(void)
> +{
> +	tdx_global_keyid = 0;
> +}

Why not set "tdx_global_keyid" last, such that you don't have to clear 
when anything goes wrong before that? Seems more straightforward.
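
Roughly what I have in mind, as an untested userspace sketch rather than a
replacement for the patch.  The function names and the "everything decoded
already" calling convention are made up for illustration; only the three
statics mirror the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the three statics in the patch; 0 means "TDX not usable". */
static uint32_t tdx_global_keyid;
static uint32_t tdx_guest_keyid_start;
static uint32_t tdx_nr_guest_keyids;

/*
 * Hypothetical init helper: takes the already-decoded TDX KeyID range and
 * publishes the globals only after every check has passed, so the error
 * paths never need a clear_tdx()-style rollback.
 */
static int tdx_init_sketch(uint32_t tdx_keyid_start, uint32_t nr_tdx_keyids)
{
	/* No TDX KeyIDs at all: TDX is not enabled by the BIOS. */
	if (!nr_tdx_keyids)
		return -ENODEV;

	/* One KeyID goes to the module itself; guests need at least one more. */
	if (nr_tdx_keyids < 2)
		return -ENODEV;

	/* Any further setup that can fail would go here, before publishing. */

	tdx_guest_keyid_start = tdx_keyid_start + 1;
	tdx_nr_guest_keyids = nr_tdx_keyids - 1;
	/* Set last: a non-zero global KeyID now implies full success. */
	tdx_global_keyid = tdx_keyid_start;
	return 0;
}

/* Stand-in for platform_tdx_enabled(): true once the global KeyID is set. */
static int platform_tdx_enabled_sketch(void)
{
	return tdx_global_keyid != 0;
}
```

With that ordering, a failed call leaves tdx_global_keyid at 0 and the
"enabled" query keeps returning false without any explicit cleanup.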

> +
> +static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> +					    u32 *nr_tdx_keyids)
> +{
> +	u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
> +	int ret;
> +
> +	/*
> +	 * IA32_MKTME_KEYID_PARTITIONING:
> +	 *   Bit [31:0]:	Number of MKTME KeyIDs.
> +	 *   Bit [63:32]:	Number of TDX private KeyIDs.
> +	 */
> +	ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
> +			&_nr_tdx_keyids);
> +	if (ret)
> +		return -ENODEV;
> +
> +	if (!_nr_tdx_keyids)
> +		return -ENODEV;
> +
> +	/* TDX KeyIDs start after the last MKTME KeyID. */
> +	_tdx_keyid_start = _nr_mktme_keyids + 1;
> +
> +	*tdx_keyid_start = _tdx_keyid_start;
> +	*nr_tdx_keyids = _nr_tdx_keyids;
> +
> +	return 0;
> +}
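
To double check my reading of the MSR layout described in the comment above,
here is a small standalone sketch that decodes a raw 64-bit value the same
way.  The struct and function names are hypothetical, and it takes the value
as a plain integer instead of going through rdmsr_safe():

```c
#include <assert.h>
#include <stdint.h>

struct keyid_partitioning {
	uint32_t nr_mktme_keyids;	/* bits [31:0] */
	uint32_t nr_tdx_keyids;		/* bits [63:32] */
	uint32_t tdx_keyid_start;	/* derived, see below */
};

/*
 * Decode a raw MSR value.  Returns 0 on success, or -1 if the value
 * reports no TDX private KeyIDs (the patch returns -ENODEV there).
 */
static int decode_keyid_partitioning(uint64_t msr,
				     struct keyid_partitioning *p)
{
	assert(p);

	p->nr_mktme_keyids = (uint32_t)(msr & 0xffffffffu);
	p->nr_tdx_keyids = (uint32_t)(msr >> 32);
	if (!p->nr_tdx_keyids)
		return -1;

	/* TDX KeyIDs start after the last MKTME KeyID. */
	p->tdx_keyid_start = p->nr_mktme_keyids + 1;
	return 0;
}
```

For example, a value with 31 in the low half and 32 in the high half decodes
to 31 MKTME KeyIDs and 32 TDX KeyIDs starting at KeyID 32.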

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  2023-03-06 14:13 ` [PATCH v10 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
@ 2023-03-16 12:57   ` David Hildenbrand
  0 siblings, 0 replies; 48+ messages in thread
From: David Hildenbrand @ 2023-03-16 12:57 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, dave.hansen, peterz, tglx, seanjc, pbonzini,
	dan.j.williams, rafael.j.wysocki, kirill.shutemov, ying.huang,
	reinette.chatre, len.brown, tony.luck, ak, isaku.yamahata,
	chao.gao, sathyanarayanan.kuppuswamy, bagasdotme, sagis,
	imammedo

On 06.03.23 15:13, Kai Huang wrote:
> TDX capable platforms are locked to X2APIC mode and cannot fall back to
> the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
> requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.
> 
> Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 00/16] TDX host kernel support
  2023-03-16 12:35 ` David Hildenbrand
@ 2023-03-16 22:06   ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-16 22:06 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Thu, 2023-03-16 at 13:35 +0100, David Hildenbrand wrote:
> On 06.03.23 15:13, Kai Huang wrote:
> > Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks.  TDX specs are available in [1].
> 
> I'm afraid there is no [1], probably got lost while resending :)
> 
> > 
> > This series is the initial support to enable TDX with minimal code to
> > allow KVM to create and run TDX guests.  KVM support for TDX is being
> > developed separately[2].  A new "userspace inaccessible memfd" approach
> > to support TDX private memory is also being developed[3].  The KVM will
> > only support the new "userspace inaccessible memfd" as TDX guest memory.
> 
> Same with [2].

Hi David,

Thanks for your feedback!

Oh sorry, yes indeed they were stripped unintentionally when I was updating the
cover letter.  I have added them here for your reference:

[1]: TDX specs
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

[2]: KVM TDX support
https://lore.kernel.org/lkml/cover.1678643051.git.isaku.yamahata@intel.com/

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/

> 
> > 
> > This series doesn't aim to support all functionalities, and doesn't aim
> > to resolve all things perfectly.  For example, memory hotplug is handled
> > in simple way (please refer to "Kernel policy on TDX memory" and "Memory
> > hotplug" sections below).
> > 
> > (For memory hotplug, sorry for broadcasting widely but I cc'ed the
> > linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> > help to provide comments.)
> > 
> > And TDX module metadata allocation just uses alloc_contig_pages() to
> > allocate large chunk at runtime, thus it can fail.  It is imperfect now
> > but _will_ be improved in the future.
> 
> Good enough for now I guess. Reserving it via memblock might be better, 
> though.
> 
> > 
> > Also, the patch to add the new kernel comline tdx="force" isn't included
> > in this initial version, as Dave suggested it isn't mandatory.  But I
> > _will_ add one once this initial version gets merged.
> 
> What would be the main purpose of that option?

Initializing the TDX module consumes a non-trivial amount of memory, which is
given to the TDX module as metadata to track page status, etc.  Currently, KVM
maintainers want to initialize TDX at KVM module load time.  This basically
means TDX will get enabled by default even if people don't want to use it.  So
Peter wanted to add a kernel boot parameter to disable TDX for all.

> 
> > 
> > All other optimizations will be posted as follow-up once this initial
> > TDX support is upstreamed.
> > 
> 
> 
> [...]
> 
> > == Background ==
> > 
> > TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
> > and a new isolated range pointed by the SEAM Ranger Register (SEAMRR).
> > A CPU-attested software module called 'the TDX module' runs in the new
> > isolated region as a trusted hypervisor to create/run protected VMs.
> > 
> > TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
> > provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
> > as TDX private KeyIDs, which are only accessible within the SEAM mode.
> > 
> > TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
> > secure processor to provide crypto-protection.  The firmware runs on the
> > secure processor acts a similar role as the TDX module.
> > 
> > The host kernel communicates with SEAM software via a new SEAMCALL
> > instruction.  This is conceptually similar to a guest->host hypercall,
> > except it is made from the host to SEAM software instead.
> > 
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized.  This series assumes the TDX module is loaded
> > by BIOS before the kernel boots.
> > 
> > How to initialize the TDX module is described at TDX module 1.0
> > specification, chapter "13.Intel TDX Module Lifecycle: Enumeration,
> > Initialization and Shutdown".
> > 
> > == Design Considerations ==
> > 
> > 1. Initialize the TDX module at runtime
> > 
> > There are basically two ways the TDX module could be initialized: either
> > in early boot, or at runtime before the first TDX guest is run.  This
> > series implements the runtime initialization.
> > 
> > This series adds a function tdx_enable() to allow the caller to initialize
> > TDX at runtime:
> > 
> >          if (tdx_enable())
> >                  goto no_tdx;
> > 	// TDX is ready to create TD guests.
> > 
> > This approach has below pros:
> > 
> > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> > metadata.  Enabling TDX on demand allows only to consume this memory when
> > TDX is truly needed (i.e. when KVM wants to create TD guests).
> 
> Let's be clear: nobody is going to run encrypted VMs "out of the blue".
> 
> You can expect a certain hypervisor setup to be required, for example, 
> enabling it on the cmdline and then allocating that metadata from 
> memblock during boot.

Yes KVM will also have a parameter to specifically enable TDX.

> 
> IIRC s390x handles it similarly with protected VMs and required metadata.
> 
> > 
> > 2) SEAMCALL requires CPU being already in VMX operation (VMXON has been
> > done).  So far, KVM is the only user of TDX, and it already handles VMXON.
> > Letting KVM to initialize TDX avoids handling VMXON in the core kernel.
> > 
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series).  After updating to the new module at runtime, kernel needs
> > to go through the initialization process again.
> > 
> > 2. CPU hotplug
> > 
> > TDX module requires the per-cpu initialization SEAMCALL (TDH.SYS.LP.INIT)
> > must be done on one cpu before any other SEAMCALLs can be made on that
> > cpu, including those involved during the module initialization.
> > 
> > The kernel provides tdx_cpu_enable() to let the user of TDX to do it when
> > the user wants to use a new cpu for TDX task.
> > 
> > TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
> > never support hotpluggable CPU device and/or deliver ACPI CPU hotplug
> > event to the kernel.  This series doesn't handle physical (ACPI) CPU
> > hotplug at all but depends on the BIOS to behave correctly.
> > 
> > Note TDX works with CPU logical online/offline, thus this series still
> > allows to do logical CPU online/offline.
> > 
> > 3. Kernel policy on TDX memory
> > 
> > The TDX module reports a list of "Convertible Memory Region" (CMR) to
> > indicate which memory regions are TDX-capable.  The TDX architecture
> > allows the VMM to designate specific convertible memory regions as usable
> > for TDX private memory.
> > 
> > The initial support of TDX guests will only allocate TDX private memory
> > from the global page allocator.  This series chooses to designate _all_
> > system RAM in the core-mm at the time of initializing TDX module as TDX
> > memory to guarantee all pages in the page allocator are TDX pages.
> > 
> > 4. Memory Hotplug
> > 
> > After the kernel passes all "TDX-usable" memory regions to the TDX
> > module, the set of "TDX-usable" memory regions are fixed during module's
> > runtime.  No more "TDX-usable" memory can be added to the TDX module
> > after that.
> > 
> > To achieve above "to guarantee all pages in the page allocator are TDX
> > pages", this series simply choose to reject any non-TDX-usable memory in
> > memory hotplug.
> > 
> > This _will_ be enhanced in the future after first submission.
> 
> What's the primary reason to enhance that? Are there reasonable use 
> cases? Why would we expect to have other (!TDX capable) memory in the 
> system?

Basically Kirill preferred this.  Please see the paragraph below from my
original cover letter.

But there has been no consensus, especially with the community, on whether we
should do it.  I probably should not have used the word _will_ here (I also
kinda forgot to keep this section up to date).

I think I'll either remove this and the paragraph below entirely, or adjust the
wording to say it is perhaps an enhancement we can do in the future.

> 
> > 
> > A better solution, suggested by Kirill, is similar to the per-node memory
> > encryption flag in this series [4].  We can allow adding/onlining non-TDX
> > memory to separate NUMA nodes so that both "TDX-capable" nodes and
> > "non-TDX-capable" nodes can co-exist.  The new TDX flag can be exposed to
> > userspace via /sysfs so userspace can bind TDX guests to "TDX-capable"
> > nodes via NUMA ABIs.

Also [4] was stripped:

[4]: per-node memory encryption flag
https://lore.kernel.org/linux-mm/20221007155323.ue4cdthkilfy4lbd@box.shutemov.name/t/

> > 
> > 5. Physical Memory Hotplug
> > 
> > Note TDX assumes convertible memory is always physically present during
> > machine's runtime.  A non-buggy BIOS should never support hot-removal of
> > any convertible memory.  This implementation doesn't handle ACPI memory
> > removal but depends on the BIOS to behave correctly.
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-15 11:10             ` Huang, Kai
@ 2023-03-16 22:07               ` Huang, Kai
  2023-03-23 13:49               ` Dave Hansen
  1 sibling, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-16 22:07 UTC (permalink / raw)
  To: Hansen, Dave, isaku.yamahata
  Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, linux-mm, kirill.shutemov, Yamahata, Isaku,
	Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Wed, 2023-03-15 at 11:10 +0000, Huang, Kai wrote:
> On Tue, 2023-03-14 at 08:48 -0700, Dave Hansen wrote:
> > On 3/13/23 18:50, Huang, Kai wrote:
> > > On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> > > > On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> > > > "Huang, Kai" <kai.huang@intel.com> wrote:
> > > > 
> > > > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > > > > +
> > > > > > > +static int try_init_module_global(void)
> > > > > > > +{
> > > > > > > +       int ret;
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * The TDX module global initialization only needs to be done
> > > > > > > +        * once on any cpu.
> > > > > > > +        */
> > > > > > > +       spin_lock(&tdx_global_init_lock);
> > > > > > > +
> > > > > > > +       if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > > > > +               ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > > > > +                       -EINVAL : 0;
> > > > > > > +               goto out;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       /* All '0's are just unused parameters. */
> > > > > > > +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > > > > +
> > > > > > > +       tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > > > > +       if (ret)
> > > > > > > +               tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > > > > 
> > > > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > > > > In such case, we should allow the caller to retry or make this function retry
> > > > > > instead of marking error stickily.
> > > > > 
> > > > > The spec says:
> > > > > 
> > > > > TDX_SYS_BUSY        The operation was invoked when another TDX module
> > > > >             operation was in progress. The operation may be retried.
> > > > > 
> > > > > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > > > > were mixing up with KEY.CONFIG?
> > > > 
> > > > TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> > > > strong stack protector enabled by clang and canary value needs to be
> > > > initialized.  By default, the canary value is stored at
> > > > %fsbase:<STACK_CANARY_OFFSET 0x28>
> > > > 
> > > > Although this is a job for libc or language runtime, TDX modules has to do it
> > > > itself because it's stand alone.
> > > > 
> > > > From tdh_sys_init.c
> > > > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> > > > {
> > > >     ia32_rflags_t rflags = {.raw = 0};
> > > >     uint64_t canary;
> > > >     if (!ia32_rdrand(&rflags, &canary))
> > > >     {
> > > >         return TDX_SYS_BUSY;
> > > >     }
> > > > ...
> > > >     last_page_ptr->stack_canary.canary = canary;
> > > > 
> > > > 
> > > 
> > > Then it is a hidden behaviour of the TDX module that is not reflected in the
> > > spec.
> > 
> > This is true.  Could you please go ask the TDX module folks to fix this up?
> 
> Sure will do.
> 
> To make sure, you mean we should ask TDX module guys to add the new
> TDX_RND_NO_ENTROPY error code to TDX module 1.0?
> 
> "another TDX module operation was in progress" and "running out of entropy" are
> different thing and should not be mixed together IMHO.
> 
> > 
> > > I am not sure whether we should handle because:
> > > 
> > > 1) This is an extremely rare case.  Kernel would be basically under attack if
> > > such error happened.  In the current series we don't handle such case in
> > > KEY.CONFIG either but just leave a comment (see patch 13).
> > 
> > Rare, yes.  Under attack?  I'm not sure where you get that from.  Look
> > at the SDM:
> > 
> > > Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand
> > > of random numbers by software processes/threads to exceed the rate at which the random number generator
> > > hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND
> > > instruction indicates the occurrence of this rare situation by clearing the CF flag.
> > 
> > That doesn't talk about attacks.
> 
> Thanks for citing the documentation.  I checked the kernel code before and it
> seems currently there's no code to call RDRAND very frequently.  But yes we
> should not say "under attack".  I have some old memory that someone said so
> (maybe me?).
> 
> > 
> > > 2) Not sure whether this will be changed in the future.
> > > 
> > > So I think we should keep as is.
> > 
> > TDX_SYS_BUSY really is missing some nuance.  You *REALLY* want to retry
> > RDRAND failures.  
> > 
> 
> OK.  Agreed.  Then I think the TDH.SYS.KEY.CONFIG should retry when running out
> of entropy too.
> 
> > But, if you have VMM locking and don't expect two
> > users calling into the TDX module then TDX_SYS_BUSY from a busy *module*
> > is a bad (and probably fatal) signal.
> 
> Yes we have a lock to protect TDH.SYS.INIT from being called in parallel.  W/o
> this entropy thing TDX_SYS_BUSY should never happen.
> 
> > 
> > I suspect we should just throw a few retries in the seamcall()
> > infrastructure to retry in the case of TDX_SYS_BUSY.  It'll take care of
> > RDRAND failures.  If a retry loop fails to resolve it, then we should
> > probably dump a warning and return an error.
> > 
> > Just do this once, in common code.
> 
> I can do.  Just want to make sure do you want to retry TDX_SYS_BUSY, or retry
> TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this
> value)?
> 
> Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common 
> seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this
> SEAMCALL returns a different error code:
> 
> TDX_KEY_GENERATION_FAILED	Failed to generate a random key. This is 
> 				typically caused by an entropy error of the
> 				CPU's random number generator, and may
> 				be impacted by RDSEED, RDRAND or PCONFIG
> 				executing on other LPs. The operation should be
> 				retried.
> 

Hi Dave,

Sorry to ping.  Could you help to check whether my understanding is aligned with
what you suggested?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2023-03-16 12:48   ` David Hildenbrand
@ 2023-03-16 22:37     ` Huang, Kai
  2023-03-23 17:02       ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-03-16 22:37 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Thu, 2023-03-16 at 13:48 +0100, David Hildenbrand wrote:
> On 06.03.23 15:13, Kai Huang wrote:
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks.  A CPU-attested software module
> > called 'the TDX module' runs inside a new isolated memory range as a
> > trusted hypervisor to manage and run protected VMs.
> > 
> > Pre-TDX Intel hardware has support for a memory encryption architecture
> > called MKTME.  The memory encryption hardware underpinning MKTME is also
> > used for Intel TDX.  TDX ends up "stealing" some of the physical address
> > space from the MKTME architecture for crypto-protection to VMs.  The
> > BIOS is responsible for partitioning the "KeyID" space between legacy
> > MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
> > KeyIDs' or 'TDX KeyIDs' for short.
> > 
> > TDX doesn't trust the BIOS.  During machine boot, TDX verifies the TDX
> > private KeyIDs are consistently and correctly programmed by the BIOS
> > across all CPU packages before it enables TDX on any CPU core.  A valid
> > TDX private KeyID range on BSP indicates TDX has been enabled by the
> > BIOS, otherwise the BIOS is buggy.
> 
> So we don't trust the BIOS, but trust the BIOS that it won't hot-remove 
> physical memory or hotplug physical CPUS (if I understood the cover 
> letter correctly)? :)

The "trust" in this context is about security, not functionality.  The BIOS
still needs to do the right thing for things to work correctly in terms of
functionality.

For physical memory hotplug or CPU hotplug, we don't have a patch to
_explicitly_ distinguish them (from logical memory hotplug and logical cpu
online/offline), but they are in effect handled as well:  For memory hotplug,
any hot-added memory is rejected from going online (because it cannot be in
TDX's convertible memory ranges).  For CPU hotplug, we have a function to do
per-cpu initialization (tdx_cpu_enable() in patch 5), and it will return an
error for a hot-added physical cpu.
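
As a rough illustration of the memory side (this is not the code in the
series; the struct and function names below are made up), rejecting hot-added
memory boils down to a containment test against the ranges that were handed to
the TDX module at initialization time:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_range {
	uint64_t start_pfn;	/* inclusive */
	uint64_t end_pfn;	/* exclusive */
};

/*
 * Return true only if [start_pfn, end_pfn) lies entirely inside one of
 * the ranges that were given to the TDX module.  Anything else would be
 * refused when it tries to go online.
 */
static bool range_is_tdx_usable(const struct mem_range *tdx, size_t nr,
				uint64_t start_pfn, uint64_t end_pfn)
{
	size_t i;

	assert(start_pfn < end_pfn);
	for (i = 0; i < nr; i++) {
		if (start_pfn >= tdx[i].start_pfn && end_pfn <= tdx[i].end_pfn)
			return true;
	}
	return false;
}
```

Physically hot-added memory can never pass such a test, because the set of
TDX-usable ranges is fixed once the module is initialized.  Note the sketch
also rejects a range that straddles two adjacent entries, which is a
simplification.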

> 
> > 
> > The TDX module is expected to be loaded by the BIOS when it enables TDX,
> > but the kernel needs to properly initialize it before it can be used to
> > create and run any TDX guests.  The TDX module will be initialized by
> > the KVM subsystem when KVM wants to use TDX.
> > 
> > Add a new early_initcall(tdx_init) to detect the TDX by detecting TDX
> > private KeyIDs.  Also add a function to report whether TDX is enabled by
> > the BIOS.  Similar to AMD SME, kexec() will use it to determine whether
> > cache flush is needed.
> > 
> > The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
> > to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
> > own protection.  Just use the first TDX KeyID as the global KeyID and
> > leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
> > disable TDX as initializing the TDX module alone is useless.
> 
> Does that really happen in practice that we care about that at all? 
> Seems weird and rather like a broken firmware or sth like that ...

No, it doesn't happen in practice, because the BIOS is sane enough.

But since the public spec doesn't explicitly guarantee that this cannot happen
when TDX is enabled, I just added this sanity check.

> 
> > 
> > To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
> > TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
> > to opt-in TDX host kernel support (to distinguish with TDX guest kernel
> > support).  So far only KVM uses TDX.  Make the new config option depend
> > on KVM_INTEL.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> 
> [...]
> 
> > ---
> >   arch/x86/Kconfig                 |  12 ++++
> >   arch/x86/Makefile                |   2 +
> >   arch/x86/include/asm/msr-index.h |   3 +
> >   arch/x86/include/asm/tdx.h       |   7 +++
> >   arch/x86/virt/Makefile           |   2 +
> >   arch/x86/virt/vmx/Makefile       |   2 +
> >   arch/x86/virt/vmx/tdx/Makefile   |   2 +
> >   arch/x86/virt/vmx/tdx/tdx.c      | 105 +++++++++++++++++++++++++++++++
> >   8 files changed, 135 insertions(+)
> >   create mode 100644 arch/x86/virt/Makefile
> >   create mode 100644 arch/x86/virt/vmx/Makefile
> >   create mode 100644 arch/x86/virt/vmx/tdx/Makefile
> >   create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3604074a878b..fc010973a6ff 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1952,6 +1952,18 @@ config X86_SGX
> >   
> >   	  If unsure, say N.
> >   
> > +config INTEL_TDX_HOST
> > +	bool "Intel Trust Domain Extensions (TDX) host support"
> > +	depends on CPU_SUP_INTEL
> > +	depends on X86_64
> > +	depends on KVM_INTEL
> > +	help
> > +	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > +	  host and certain physical attacks.  This option enables necessary TDX
> > +	  support in host kernel to run protected VMs.
> 
> s/in host/in the host/ ?

Sure.

> 
> Also, is "protected VMs" the right term to use here? "Encrypted VMs", 
> "Confidential VMs" ... ?

"Encrypted VM" is perhaps not a good choice, because there is more to it than
encryption.  I am also OK with "Confidential VMs", but "protected VMs" is also
used in the KVM series (not upstreamed yet), and by s390, judging from the git
log.

So both "protected VM" and "confidential VM" work for me.

Not sure anyone else wants to comment?

> 
[...]

> > +static u32 tdx_global_keyid __ro_after_init;
> > +static u32 tdx_guest_keyid_start __ro_after_init;
> > +static u32 tdx_nr_guest_keyids __ro_after_init;
> > +
> > +/*
> > + * Use tdx_global_keyid to indicate that TDX is uninitialized.
> > + * This is used in TDX initialization error paths to take it from
> > + * initialized -> uninitialized.
> > + */
> > +static void __init clear_tdx(void)
> > +{
> > +	tdx_global_keyid = 0;
> > +}
> 
> Why not set "tdx_global_keyid" last, such that you don't have to clear 
> when anything goes wrong before that? Seems more straight forward.

My thinking was that by reserving the global keyid and taking it out first, I
can easily check the remaining keyids for TDX guests:


+	if (!nr_tdx_keyids) {
+		pr_info("initialization failed: too few private KeyIDs available.\n");
+		goto no_tdx;
+	}

Otherwise I would need to do:

	if (nr_tdx_keyids < 2) {
		...
	}

Also, the later patch to handle memory hotplug adds an additional step,
register_memory_notifier(), which can also fail, so I just introduced
clear_tdx() here.

But it is no big deal, and yes, we can set the global keyid last and remove
clear_tdx().

I'll do what you suggested.
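
For illustration only, the "set the global keyid last" flow could look roughly
like the userspace sketch below.  This is not the actual patch: struct
tdx_keyids and partition_tdx_keyids() are hypothetical names, the three fields
stand in for the three static variables quoted above, and -ENODEV is just an
assumed error code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical container for the three values the patch keeps as statics. */
struct tdx_keyids {
	uint32_t global_keyid;		/* left at 0 when TDX is not usable */
	uint32_t guest_keyid_start;
	uint32_t nr_guest_keyids;
};

/*
 * Split the TDX private KeyID range [start, start + nr) into one global
 * KeyID and the guest KeyIDs.  Nothing is written to @out until the check
 * has passed, so no clear_tdx()-style rollback is needed on failure.
 */
static int partition_tdx_keyids(uint32_t start, uint32_t nr,
				struct tdx_keyids *out)
{
	/* One KeyID for the TDX module plus at least one for guests. */
	if (nr < 2)
		return -ENODEV;

	out->guest_keyid_start = start + 1;
	out->nr_guest_keyids = nr - 1;
	/* Written last, after every step that can fail. */
	out->global_keyid = start;

	return 0;
}
```

With a start of 64 and 4 private KeyIDs this would give global keyid 64 and
guest keyids 65..67; with only 1 private KeyID it fails and leaves the output
untouched.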

Thanks.

> 
> > +
> > +static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
> > +					    u32 *nr_tdx_keyids)
> > +{
> > +	u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
> > +	int ret;
> > +
> > +	/*
> > +	 * IA32_MKTME_KEYID_PARTIONING:
> > +	 *   Bit [31:0]:	Number of MKTME KeyIDs.
> > +	 *   Bit [63:32]:	Number of TDX private KeyIDs.
> > +	 */
> > +	ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
> > +			&_nr_tdx_keyids);
> > +	if (ret)
> > +		return -ENODEV;
> > +
> > +	if (!_nr_tdx_keyids)
> > +		return -ENODEV;
> > +
> > +	/* TDX KeyIDs start after the last MKTME KeyID. */
> > +	_tdx_keyid_start = _nr_mktme_keyids + 1;
> > +
> > +	*tdx_keyid_start = _tdx_keyid_start;
> > +	*nr_tdx_keyids = _nr_tdx_keyids;
> > +
> > +	return 0;
> > +}
> 
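
As a side note, the MSR layout described in the comment above can be sketched
in plain userspace C as below.  It only decodes an already-read 64-bit value;
struct keyid_partitioning and decode_keyid_partitioning() are hypothetical
names used for illustration, not part of the patch.

```c
#include <stdint.h>

/* Hypothetical result type for the decoded partitioning. */
struct keyid_partitioning {
	uint32_t nr_mktme_keyids;
	uint32_t tdx_keyid_start;
	uint32_t nr_tdx_keyids;
};

/*
 * Decode a raw IA32_MKTME_KEYID_PARTITIONING value using the layout from
 * the patch comment: bits 31:0 hold the number of MKTME KeyIDs, bits 63:32
 * the number of TDX private KeyIDs.
 */
static struct keyid_partitioning decode_keyid_partitioning(uint64_t msr)
{
	struct keyid_partitioning p;

	p.nr_mktme_keyids = (uint32_t)msr;		/* bits 31:0 */
	p.nr_tdx_keyids = (uint32_t)(msr >> 32);	/* bits 63:32 */
	/* TDX KeyIDs start after the last MKTME KeyID. */
	p.tdx_keyid_start = p.nr_mktme_keyids + 1;

	return p;
}
```

For example, a value with 31 in the low half and 32 in the high half decodes
to 31 MKTME KeyIDs and 32 TDX KeyIDs starting at KeyID 32.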



* Re: [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros
  2023-03-16 12:37   ` David Hildenbrand
@ 2023-03-16 22:41     ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-16 22:41 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Thu, 2023-03-16 at 13:37 +0100, David Hildenbrand wrote:
> On 06.03.23 15:13, Kai Huang wrote:
> > TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
> > defined by the TDX module spec and used as TDX module ABI.  Currently,
> > they are used in try_accept_one() when the TDX guest tries to accept a
> > page.  However currently try_accept_one() uses hard-coded magic values.
> > 
> > Define TDX supported page sizes as macros and get rid of the hard-coded
> > values in try_accept_one().  TDX host support will need to use them too.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> 

Thanks!


* RE: [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-03-06 14:13 ` [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2023-03-21  7:44   ` Dong, Eddie
  2023-03-21  8:05     ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Dong, Eddie @ 2023-03-21  7:44 UTC (permalink / raw)
  To: Huang, Kai, linux-kernel, kvm
  Cc: linux-mm, Hansen, Dave, peterz, tglx, Christopherson,,
	Sean, pbonzini, Williams, Dan J, Wysocki, Rafael J,
	kirill.shutemov, Huang, Ying, Chatre, Reinette, Brown, Len, Luck,
	Tony, ak, Yamahata, Isaku, Gao, Chao, sathyanarayanan.kuppuswamy,
	david, bagasdotme, Shahar, Sagi, imammedo, Huang, Kai

> 
> +/*
> + * Calculate PAMT size given a TDMR and a page size.  The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
> +				      u16 pamt_entry_size)
> +{
> +	unsigned long pamt_sz, nr_pamt_entries;
> +
> +	switch (pgsz) {
> +	case TDX_PS_4K:
> +		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> +		break;
> +	case TDX_PS_2M:
> +		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> +		break;
> +	case TDX_PS_1G:
> +		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return 0;
> +	}
> +
> +	pamt_sz = nr_pamt_entries * pamt_entry_size;
> +	/* TDX requires PAMT size must be 4K aligned */
> +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);

Should we use ALIGN_UP to be safe?

> +
> +	return pamt_sz;
> +}
> +


* Re: [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-03-21  7:44   ` Dong, Eddie
@ 2023-03-21  8:05     ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-21  8:05 UTC (permalink / raw)
  To: kvm, Dong, Eddie, linux-kernel
  Cc: Hansen, Dave, Luck, Tony, david, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	bagasdotme, linux-mm, peterz, Shahar, Sagi, imammedo, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J

On Tue, 2023-03-21 at 07:44 +0000, Dong, Eddie wrote:
> > 
> > +/*
> > + * Calculate PAMT size given a TDMR and a page size.  The returned
> > + * PAMT size is always aligned up to 4K page boundary.
> > + */
> > +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
> > +				      u16 pamt_entry_size)
> > +{
> > +	unsigned long pamt_sz, nr_pamt_entries;
> > +
> > +	switch (pgsz) {
> > +	case TDX_PS_4K:
> > +		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> > +		break;
> > +	case TDX_PS_2M:
> > +		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> > +		break;
> > +	case TDX_PS_1G:
> > +		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		return 0;
> > +	}
> > +
> > +	pamt_sz = nr_pamt_entries * pamt_entry_size;
> > +	/* TDX requires PAMT size must be 4K aligned */
> > +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> 
> Should we ALIGN_UP for safe ?

Hi Eddie,

ALIGN() already does align up.
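
To make that concrete, here is a small userspace sketch of the usual round-up
idiom.  SKETCH_ALIGN and SKETCH_PAGE_SIZE are stand-ins I made up so it builds
outside the kernel (the kernel provides its own ALIGN() and PAGE_SIZE), and
the 16-byte entry size in the example numbers is only an assumption.

```c
/* Userspace stand-ins; the kernel defines ALIGN() and PAGE_SIZE itself. */
#define SKETCH_PAGE_SIZE	4096UL
/* Round x up to the next multiple of a; a must be a power of two. */
#define SKETCH_ALIGN(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* PAMT size for nr_entries entries, rounded up to a 4K boundary. */
static unsigned long pamt_size(unsigned long nr_entries,
			       unsigned short entry_size)
{
	return SKETCH_ALIGN(nr_entries * entry_size, SKETCH_PAGE_SIZE);
}
```

With 16-byte entries, 3 entries (48 bytes) round up to 4096, 256 entries
(exactly 4096 bytes) stay at 4096, and 257 entries (4112 bytes) round up to
8192, so the result is never smaller than the unaligned size.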

> 
> > +
> > +	return pamt_sz;
> > +}
> > +



* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-15 11:10             ` Huang, Kai
  2023-03-16 22:07               ` Huang, Kai
@ 2023-03-23 13:49               ` Dave Hansen
  2023-03-23 22:09                 ` Huang, Kai
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Hansen @ 2023-03-23 13:49 UTC (permalink / raw)
  To: Huang, Kai, isaku.yamahata
  Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, Yamahata, Isaku, kirill.shutemov, linux-mm,
	peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 3/15/23 04:10, Huang, Kai wrote:
> I can do.  Just want to make sure do you want to retry TDX_SYS_BUSY, or retry
> TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this
> value)?

I'll put it this way:

	Linux is going to treat TDX_SYS_BUSY like a Linux bug and assume
	Linux is doing something wrong.  It'll mostly mean that
	users will see something nasty and may even cause Linux to give
	up on TDX.  In other words, the TDX module shouldn't use
	TDX_SYS_BUSY for things that aren't Linux's fault.

> Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common
> seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this
> SEAMCALL returns a different error code:
> 
> TDX_KEY_GENERATION_FAILED       Failed to generate a random key. This is
>                                 typically caused by an entropy error of the
>                                 CPU's random number generator, and may
>                                 be impacted by RDSEED, RDRAND or PCONFIG
>                                 executing on other LPs. The operation should be
>                                 retried.

Sounds like we should just replace TDX_KEY_GENERATION_FAILED with
TDX_RND_NO_ENTROPY in cases where key generation fails because of a lack
of entropy.


* Re: [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2023-03-16 22:37     ` Huang, Kai
@ 2023-03-23 17:02       ` David Hildenbrand
  2023-03-23 22:15         ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2023-03-23 17:02 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 16.03.23 23:37, Huang, Kai wrote:
> On Thu, 2023-03-16 at 13:48 +0100, David Hildenbrand wrote:
>> On 06.03.23 15:13, Kai Huang wrote:
>>> Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>>> host and certain physical attacks.  A CPU-attested software module
>>> called 'the TDX module' runs inside a new isolated memory range as a
>>> trusted hypervisor to manage and run protected VMs.
>>>
>>> Pre-TDX Intel hardware has support for a memory encryption architecture
>>> called MKTME.  The memory encryption hardware underpinning MKTME is also
>>> used for Intel TDX.  TDX ends up "stealing" some of the physical address
>>> space from the MKTME architecture for crypto-protection to VMs.  The
>>> BIOS is responsible for partitioning the "KeyID" space between legacy
>>> MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
>>> KeyIDs' or 'TDX KeyIDs' for short.
>>>
>>> TDX doesn't trust the BIOS.  During machine boot, TDX verifies the TDX
>>> private KeyIDs are consistently and correctly programmed by the BIOS
>>> across all CPU packages before it enables TDX on any CPU core.  A valid
>>> TDX private KeyID range on BSP indicates TDX has been enabled by the
>>> BIOS, otherwise the BIOS is buggy.
>>

Sorry for the late reply!

>> So we don't trust the BIOS, but trust the BIOS that it won't hot-remove
>> physical memory or hotplug physical CPUS (if I understood the cover
>> letter correctly)? :)
> 
> The "trust" in this context means security, but not functionality.  BIOS needs
> to do the right thing in order to make things work correctly in terms of
> functionality.
> 
> For physical memory hotplug or CPU hotplug, we don't have patch to _explicitly_
> distinguish them (from logical memory hotplug and logical cpu online/offline),
> but actually they are kinda also handled:  For memory hotplug, and hot-added
> memory is rejected to go online (because they cannot be in TDX's convertible
> memory ranges).  For CPU hotplug, we have a function to do per-cpu
> initialization (tdx_cpu_enable() in patch 5), and it will return error for hot-
> added physical cpu.

Makes sense, thanks!

> 
>>
>>>
>>> The TDX module is expected to be loaded by the BIOS when it enables TDX,
>>> but the kernel needs to properly initialize it before it can be used to
>>> create and run any TDX guests.  The TDX module will be initialized by
>>> the KVM subsystem when KVM wants to use TDX.
>>>
>>> Add a new early_initcall(tdx_init) to detect the TDX by detecting TDX
>>> private KeyIDs.  Also add a function to report whether TDX is enabled by
>>> the BIOS.  Similar to AMD SME, kexec() will use it to determine whether
>>> cache flush is needed.
>>>
>>> The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
>>> to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
>>> own protection.  Just use the first TDX KeyID as the global KeyID and
>>> leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
>>> disable TDX as initializing the TDX module alone is useless.
>>
>> Does that really happen in practice that we care about that at all?
>> Seems weird and rather like a broken firmware or sth like that ...
> 
> No it doesn't happen in practice, because the BIOS is sane enough.
> 
> But since the public spec doesn't explicitly say it is guaranteed this doesn't
> happen when TDX is enabled, I just added this sanity check.

Okay!

> 
>>
>>>
>>> To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
>>> TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
>>> to opt-in TDX host kernel support (to distinguish with TDX guest kernel
>>> support).  So far only KVM uses TDX.  Make the new config option depend
>>> on KVM_INTEL.
>>>
>>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>
>>
>> [...]
>>
>>> ---
>>>    arch/x86/Kconfig                 |  12 ++++
>>>    arch/x86/Makefile                |   2 +
>>>    arch/x86/include/asm/msr-index.h |   3 +
>>>    arch/x86/include/asm/tdx.h       |   7 +++
>>>    arch/x86/virt/Makefile           |   2 +
>>>    arch/x86/virt/vmx/Makefile       |   2 +
>>>    arch/x86/virt/vmx/tdx/Makefile   |   2 +
>>>    arch/x86/virt/vmx/tdx/tdx.c      | 105 +++++++++++++++++++++++++++++++
>>>    8 files changed, 135 insertions(+)
>>>    create mode 100644 arch/x86/virt/Makefile
>>>    create mode 100644 arch/x86/virt/vmx/Makefile
>>>    create mode 100644 arch/x86/virt/vmx/tdx/Makefile
>>>    create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 3604074a878b..fc010973a6ff 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -1952,6 +1952,18 @@ config X86_SGX
>>>    
>>>    	  If unsure, say N.
>>>    
>>> +config INTEL_TDX_HOST
>>> +	bool "Intel Trust Domain Extensions (TDX) host support"
>>> +	depends on CPU_SUP_INTEL
>>> +	depends on X86_64
>>> +	depends on KVM_INTEL
>>> +	help
>>> +	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>>> +	  host and certain physical attacks.  This option enables necessary TDX
>>> +	  support in host kernel to run protected VMs.
>>
>> s/in host/in the host/ ?
> 
> Sure.
> 
>>
>> Also, is "protected VMs" the right term to use here? "Encrypted VMs",
>> "Confidential VMs" ... ?
> 
> "Encrypted VM" perhaps is not a good choice, because there are more things than
> encryption.  I am also OK with "Confidential VMs", but "protected VMs" is also
> used in the KVM series (not upstreamed yet), and also used by s390 by looking at
> the git log.
> 
> So both "protected VM" and "confidential VM" work for me.
> 
> Not sure anyone else wants to comment?

I'm fine as long as it's used consistently. "Protected VM" would have 
been the one out of the 3 alternatives that I have heard least frequently.

> 
>>
> [...]
> 
>>> +static u32 tdx_global_keyid __ro_after_init;
>>> +static u32 tdx_guest_keyid_start __ro_after_init;
>>> +static u32 tdx_nr_guest_keyids __ro_after_init;
>>> +
>>> +/*
>>> + * Use tdx_global_keyid to indicate that TDX is uninitialized.
>>> + * This is used in TDX initialization error paths to take it from
>>> + * initialized -> uninitialized.
>>> + */
>>> +static void __init clear_tdx(void)
>>> +{
>>> +	tdx_global_keyid = 0;
>>> +}
>>
>> Why not set "tdx_global_keyid" last, such that you don't have to clear
>> when anything goes wrong before that? Seems more straight forward.
> 
> My thinking was by reserving the global keyid and taking it out first, I can
> check the remaining keyids for TDX guests easily:
> 
> 
> +	if (!nr_tdx_keyids) {
> +		pr_info("initialization failed: too few private KeyIDs
> available.\n");
> +		goto no_tdx;
> +	}
> 
> Otherwise need to do:
> 
> 	if (nr_tdx_keyids < 2) {
> 		...
> 	}
> 
> Also, in the later patch to handle memory hotplug we will add an additional step
> to register_memory_notifier() which can also fail, so I just introduced
> clear_tdx() here.
> 
> But nothing is big deal, and yes we can set the global keyid at last and remove
> clear_tdx().

Good, that simplifies things, thanks!

-- 
Thanks,

David / dhildenb



* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-23 13:49               ` Dave Hansen
@ 2023-03-23 22:09                 ` Huang, Kai
  2023-03-23 22:12                   ` Dave Hansen
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-03-23 22:09 UTC (permalink / raw)
  To: Hansen, Dave, isaku.yamahata
  Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, linux-mm, kirill.shutemov, Yamahata, Isaku,
	Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Thu, 2023-03-23 at 06:49 -0700, Hansen, Dave wrote:
> On 3/15/23 04:10, Huang, Kai wrote:
> > I can do.  Just want to make sure do you want to retry TDX_SYS_BUSY, or retry
> > TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this
> > value)?
> 
> I'll put it this way:
> 
> 	Linux is going to treat TDX_SYS_BUSY like a Linux bug and assume
> 	Linux is doing something wrong.  It'll mostly mean that
> 	users will see something nasty and may even cause Linux to give
> 	up on TDX.  In other words, the TDX module shouldn't use
> 	TDX_SYS_BUSY for things that aren't Linux's fault.
> 
> > Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common
> > seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this
> > SEAMCALL returns a different error code:
> > 
> > TDX_KEY_GENERATION_FAILED       Failed to generate a random key. This is
> >                                 typically caused by an entropy error of the
> >                                 CPU's random number generator, and may
> >                                 be impacted by RDSEED, RDRAND or PCONFIG
> >                                 executing on other LPs. The operation should be
> >                                 retried.
> 
> Sounds like we should just replace TDX_KEY_GENERATION_FAILED with
> TDX_RND_NO_ENTROPY in cases where key generation fails because of a lack
> of entropy.

Thanks for the feedback.

I'll do the following; please let me know if I have misunderstood anything.

1) In TDH.SYS.INIT, ask the TDX module team to return TDX_RND_NO_ENTROPY instead
of TDX_SYS_BUSY when running out of entropy.

2) In TDH.SYS.KEY.CONFIG, ask the TDX module team to return TDX_RND_NO_ENTROPY
instead of TDX_KEY_GENERATION_FAILED when running out of entropy.  Whether
TDX_KEY_GENERATION_FAILED should still be kept is up to the TDX module team
(because it looks like running concurrent PCONFIGs is also related).

3) Ask the TDX module team to consistently return TDX_RND_NO_ENTROPY on lack of
entropy in _ALL_ SEAMCALLs, and to keep this behaviour for future TDX modules
too.

4) In the common seamcall(), retry on TDX_RND_NO_ENTROPY.

In terms of how many times to retry, I will use a fixed value for now, similar
to the kernel code below:

#define RDRAND_RETRY_LOOPS      10

/* Unconditional execution of RDRAND and RDSEED */

static inline bool __must_check rdrand_long(unsigned long *v)
{
        bool ok;
        unsigned int retry = RDRAND_RETRY_LOOPS;
        do {
                asm volatile("rdrand %[out]"
                             CC_SET(c)
                             : CC_OUT(c) (ok), [out] "=r" (*v));
                if (ok)
                        return true;
        } while (--retry);
        return false;
}
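
Applied to step 4), the retry could look roughly like the userspace sketch
below.  Everything in it is an assumption for illustration: seamcall_fn and
seamcall_retry() are hypothetical names, SKETCH_TDX_RND_NO_ENTROPY is a
placeholder value rather than the real status encoding, and fake_seamcall() is
only a test double standing in for the TDX module.

```c
#include <stdint.h>

#define RDRAND_RETRY_LOOPS	10

/* Placeholder value; the real TDX_RND_NO_ENTROPY encoding is not shown here. */
#define SKETCH_TDX_RND_NO_ENTROPY	0x1ULL

/* Hypothetical low-level call type standing in for the SEAMCALL wrapper. */
typedef uint64_t (*seamcall_fn)(void *ctx);

/*
 * Call fn() up to RDRAND_RETRY_LOOPS times, retrying only while it reports
 * a lack of entropy.  Any other status, success or error, is returned at
 * once; the last status is returned when the retries are exhausted.
 */
static uint64_t seamcall_retry(seamcall_fn fn, void *ctx)
{
	unsigned int retry = RDRAND_RETRY_LOOPS;
	uint64_t ret;

	do {
		ret = fn(ctx);
		if (ret != SKETCH_TDX_RND_NO_ENTROPY)
			return ret;
	} while (--retry);

	return ret;
}

/* Test double: reports "no entropy" a given number of times, then succeeds. */
struct fake_module {
	unsigned int failures_left;
	unsigned int calls;
};

static uint64_t fake_seamcall(void *ctx)
{
	struct fake_module *m = ctx;

	m->calls++;
	if (m->failures_left) {
		m->failures_left--;
		return SKETCH_TDX_RND_NO_ENTROPY;
	}
	return 0;	/* success */
}
```

A fake module that fails three times succeeds on the fourth call, and one
that never recovers is given up on after RDRAND_RETRY_LOOPS calls, with the
no-entropy status passed back to the caller.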


* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-23 22:09                 ` Huang, Kai
@ 2023-03-23 22:12                   ` Dave Hansen
  2023-03-23 22:42                     ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Hansen @ 2023-03-23 22:12 UTC (permalink / raw)
  To: Huang, Kai, isaku.yamahata
  Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, linux-mm, kirill.shutemov, Yamahata, Isaku,
	Shahar, Sagi, peterz, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On 3/23/23 15:09, Huang, Kai wrote:
> 1) In TDH.SYS.INIT, ask TDX module team to return TDX_RND_NO_ENTROPY instead of
> TDX_SYS_BUSY when running out of entropy.
> 
> 2) In TDH.SYS.KEY.CONFIG, ask TDX module to return TDX_RND_NO_ENTROPY instead of
> TDX_KEY_GENERATION_FAILED when running out of entropy.  Whether
> TDX_KEY_GENERATION_FAILED should be still kept is  up to TDX module team
> (because it looks running concurrent PCONFIGs is also related).
> 
> 3) Ask TDX module to always return TDX_RND_NO_ENTROPY in _ALL_ SEAMCALLs and
> keep this behaviour for future TDX modules too.

Yes, that's all fine.

> 4) In the common seamcall(), retry on TDX_RND_NO_ENTROPY.
> 
> In terms of how many times to retry, I will use a fixed value for now, similar
> to the kernel code below:
> 
> #define RDRAND_RETRY_LOOPS      10

Heck, you could even just use RDRAND_RETRY_LOOPS directly.  It's
hard(er) to bikeshed your choice of a random number that you didn't even
pick.


* Re: [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot
  2023-03-23 17:02       ` David Hildenbrand
@ 2023-03-23 22:15         ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-23 22:15 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, Yamahata, Isaku,
	linux-mm, Shahar, Sagi, imammedo, peterz, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Thu, 2023-03-23 at 18:02 +0100, David Hildenbrand wrote:
> On 16.03.23 23:37, Huang, Kai wrote:
> > On Thu, 2023-03-16 at 13:48 +0100, David Hildenbrand wrote:
> > > On 06.03.23 15:13, Kai Huang wrote:
> > > > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > > host and certain physical attacks.  A CPU-attested software module
> > > > called 'the TDX module' runs inside a new isolated memory range as a
> > > > trusted hypervisor to manage and run protected VMs.
> > > > 
> > > > Pre-TDX Intel hardware has support for a memory encryption architecture
> > > > called MKTME.  The memory encryption hardware underpinning MKTME is also
> > > > used for Intel TDX.  TDX ends up "stealing" some of the physical address
> > > > space from the MKTME architecture for crypto-protection to VMs.  The
> > > > BIOS is responsible for partitioning the "KeyID" space between legacy
> > > > MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
> > > > KeyIDs' or 'TDX KeyIDs' for short.
> > > > 
> > > > TDX doesn't trust the BIOS.  During machine boot, TDX verifies the TDX
> > > > private KeyIDs are consistently and correctly programmed by the BIOS
> > > > across all CPU packages before it enables TDX on any CPU core.  A valid
> > > > TDX private KeyID range on BSP indicates TDX has been enabled by the
> > > > BIOS, otherwise the BIOS is buggy.
> > > 
> 
> Sorry for the late reply!

Not late for me :)  Thanks!

[...]


> > > >    
> > > > +config INTEL_TDX_HOST
> > > > +	bool "Intel Trust Domain Extensions (TDX) host support"
> > > > +	depends on CPU_SUP_INTEL
> > > > +	depends on X86_64
> > > > +	depends on KVM_INTEL
> > > > +	help
> > > > +	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > > +	  host and certain physical attacks.  This option enables necessary TDX
> > > > +	  support in host kernel to run protected VMs.
> > > 
> > > s/in host/in the host/ ?
> > 
> > Sure.
> > 
> > > 
> > > Also, is "protected VMs" the right term to use here? "Encrypted VMs",
> > > "Confidential VMs" ... ?
> > 
> > "Encrypted VM" perhaps is not a good choice, because there are more things than
> > encryption.  I am also OK with "Confidential VMs", but "protected VMs" is also
> > used in the KVM series (not upstreamed yet), and also used by s390 by looking at
> > the git log.
> > 
> > So both "protected VM" and "confidential VM" work for me.
> > 
> > Not sure anyone else wants to comment?
> 
> I'm fine as long as it's used consistently. "Protected VM" would have 
> been the one out of the 3 alternatives that I have heard least frequently.
> > 

Yes I'll make sure it is used consistently.  Thanks!

I am also glad to change to "Confidential VMs" if anyone else believes it is
better.


* Re: [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-03-23 22:12                   ` Dave Hansen
@ 2023-03-23 22:42                     ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-03-23 22:42 UTC (permalink / raw)
  To: Hansen, Dave, isaku.yamahata
  Cc: kvm, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, tglx, linux-mm, kirill.shutemov, Yamahata, Isaku,
	peterz, Shahar, Sagi, imammedo, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Thu, 2023-03-23 at 15:12 -0700, Hansen, Dave wrote:
> On 3/23/23 15:09, Huang, Kai wrote:
> > 1) In TDH.SYS.INIT, ask TDX module team to return TDX_RND_NO_ENTROPY instead of
> > TDX_SYS_BUSY when running out of entropy.
> > 
> > 2) In TDH.SYS.KEY.CONFIG, ask TDX module to return TDX_RND_NO_ENTROPY instead of
> > TDX_KEY_GENERATION_FAILED when running out of entropy.  Whether
> > TDX_KEY_GENERATION_FAILED should be still kept is  up to TDX module team
> > (because it looks running concurrent PCONFIGs is also related).
> > 
> > 3) Ask TDX module to always return TDX_RND_NO_ENTROPY in _ALL_ SEAMCALLs and
> > keep this behaviour for future TDX modules too.
> 
> Yes, that's all fine.
> 
> > 4) In the common seamcall(), retry on TDX_RND_NO_ENTROPY.
> > 
> > In terms of how many times to retry, I will use a fixed value for now, similar
> > to the kernel code below:
> > 
> > #define RDRAND_RETRY_LOOPS      10
> 
> Heck, you could even just use RDRAND_RETRY_LOOPS directly.  It's
> hard(er) to bikeshed your choice of a random number that you didn't even
> pick.

Yes I'll just include the header and use it.  Thanks.


Thread overview: 48+ messages
2023-03-06 14:13 [PATCH v10 00/16] TDX host kernel support Kai Huang
2023-03-06 14:13 ` [PATCH v10 01/16] x86/tdx: Define TDX supported page sizes as macros Kai Huang
2023-03-16 12:37   ` David Hildenbrand
2023-03-16 22:41     ` Huang, Kai
2023-03-06 14:13 ` [PATCH v10 02/16] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2023-03-16 12:48   ` David Hildenbrand
2023-03-16 22:37     ` Huang, Kai
2023-03-23 17:02       ` David Hildenbrand
2023-03-23 22:15         ` Huang, Kai
2023-03-06 14:13 ` [PATCH v10 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
2023-03-16 12:57   ` David Hildenbrand
2023-03-06 14:13 ` [PATCH v10 04/16] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
2023-03-06 14:13 ` [PATCH v10 05/16] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
2023-03-08 22:27   ` Isaku Yamahata
2023-03-12 23:08     ` Huang, Kai
2023-03-13 23:49       ` Isaku Yamahata
2023-03-14  1:50         ` Huang, Kai
2023-03-14  4:02           ` Isaku Yamahata
2023-03-14  5:45             ` Dave Hansen
2023-03-14 17:16               ` Isaku Yamahata
2023-03-14 17:38                 ` Dave Hansen
2023-03-14 15:48           ` Dave Hansen
2023-03-15 11:10             ` Huang, Kai
2023-03-16 22:07               ` Huang, Kai
2023-03-23 13:49               ` Dave Hansen
2023-03-23 22:09                 ` Huang, Kai
2023-03-23 22:12                   ` Dave Hansen
2023-03-23 22:42                     ` Huang, Kai
2023-03-16  0:31   ` Isaku Yamahata
2023-03-16  2:45     ` Isaku Yamahata
2023-03-16  2:52       ` Huang, Kai
2023-03-06 14:13 ` [PATCH v10 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2023-03-06 14:13 ` [PATCH v10 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
2023-03-09  1:38   ` Isaku Yamahata
2023-03-06 14:13 ` [PATCH v10 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
2023-03-06 14:13 ` [PATCH v10 09/16] x86/virt/tdx: Fill out " Kai Huang
2023-03-06 14:13 ` [PATCH v10 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2023-03-21  7:44   ` Dong, Eddie
2023-03-21  8:05     ` Huang, Kai
2023-03-06 14:13 ` [PATCH v10 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
2023-03-06 14:13 ` [PATCH v10 12/16] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
2023-03-06 14:13 ` [PATCH v10 13/16] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2023-03-06 14:13 ` [PATCH v10 14/16] x86/virt/tdx: Initialize all TDMRs Kai Huang
2023-03-06 14:14 ` [PATCH v10 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Kai Huang
2023-03-06 14:14 ` [PATCH v10 16/16] Documentation/x86: Add documentation for TDX host support Kai Huang
2023-03-08  1:11 ` [PATCH v10 00/16] TDX host kernel support Isaku Yamahata
2023-03-16 12:35 ` David Hildenbrand
2023-03-16 22:06   ` Huang, Kai
