From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
	len.brown@intel.com, tony.luck@intel.com,
	rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
	dan.j.williams@intel.com, peterz@infradead.org,
	ak@linux.intel.com, kirill.shutemov@linux.intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v4 22/22] Documentation/x86: Add documentation for TDX host support
Date: Wed,  1 Jun 2022 07:39:45 +1200
Message-ID: <decf2e069132da6e3c0d0561bad53e94b5da264e.1654025431.git.kai.huang@intel.com>
In-Reply-To: <cover.1654025430.git.kai.huang@intel.com>

Add documentation for TDX host kernel support.  There is already one
file, Documentation/x86/tdx.rst, containing documentation for TDX guest
internals.  Reuse it for TDX host kernel support as well.

Introduce a new section "TDX Guest Support", move the existing material
under it, and add a new section for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 190 +++++++++++++++++++++++++++++++++++---
 1 file changed, 179 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index b8fa4329e1a5..6c6b09ca6ba4 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,174 @@ encrypting the guest memory. In TDX, a special module running in a special
 mode sits between the host and the guest and manages the guest/host
 separation.
 
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range specified by the SEAM Range Register (SEAMRR).  A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range and provides the functionality to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs.  TDX reserves part of the MKTME
+KeyID space as TDX private KeyIDs, which are only accessible within the
+SEAM mode.  The BIOS is responsible for partitioning legacy MKTME KeyIDs
+and TDX private KeyIDs.
+
+To enable TDX, the BIOS configures SEAMRR and TDX private KeyIDs
+consistently across all CPU packages.  TDX doesn't trust the BIOS:
+MCHECK verifies that all configurations from the BIOS are correct and
+then enables SEAMRR.
+
+After TDX is enabled in BIOS, the TDX module needs to be loaded into the
+SEAMRR range and properly initialized, before it can be used to create
+and run protected VMs.
+
+The TDX architecture doesn't require the BIOS to load the TDX module,
+but the current kernel assumes it has been loaded by the BIOS (either
+directly or via some UEFI shell tool) before the kernel boots.  The
+kernel then detects TDX and initializes the TDX module.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX and the TDX private KeyIDs during kernel boot.
+If TDX has been enabled by the BIOS, dmesg shows something like below:
+
+|  [..] tdx: SEAMRR enabled.
+|  [..] tdx: TDX private KeyID range: [16, 64).
+|  [..] tdx: TDX enabled by BIOS.
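+
+As a rough illustration, the detection boils down to reading the KeyID
+partitioning that the BIOS has programmed.  The sketch below is
+illustrative only; the MSR name and bit layout are assumptions rather
+than the exact kernel code:
+
+        /*
+         * Illustrative sketch only: read the BIOS-programmed MKTME/TDX
+         * KeyID partitioning.  The MSR index and field layout here are
+         * assumptions for illustration.
+         */
+        static int detect_tdx_keyids(u32 *keyid_start, u32 *nr_keyids)
+        {
+                u64 val;
+
+                if (rdmsrl_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &val))
+                        return -ENODEV;
+
+                /* Assumed layout: low 32 bits = # MKTME KeyIDs, high 32 bits = # TDX KeyIDs */
+                *nr_keyids   = val >> 32;
+                *keyid_start = (u32)val + 1;
+
+                return *nr_keyids ? 0 : -ENODEV;
+        }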
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect whether the TDX module has been
+loaded.  The kernel detects the TDX module by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
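+
+Conceptually, a SEAMCALL takes a leaf function number plus a handful of
+register arguments and returns the TDX module's completion status.  The
+sketch below is illustrative only; the wrapper name, leaf name and
+argument registers are assumptions, not the actual kernel interface:
+
+        /*
+         * Illustrative sketch of a SEAMCALL wrapper (the real one is
+         * implemented in assembly).  Names and leaf numbers here are
+         * assumptions for illustration.
+         */
+        struct tdx_module_output {
+                u64 rcx, rdx, r8, r9, r10, r11;
+        };
+
+        /* Returns the TDX module's completion status (0 on success). */
+        u64 __seamcall(u64 leaf, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                       struct tdx_module_output *out);
+
+        static int tdx_module_global_init(void)
+        {
+                u64 err;
+
+                /* TDH.SYS.INIT: the module's global initialization leaf */
+                err = __seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL);
+
+                return err ? -EIO : 0;
+        }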
+
+Initializing the TDX module consumes roughly ~1/256th of system RAM as
+'metadata' for the TDX memory.  It also takes additional CPU time to
+initialize that metadata along with the TDX module itself.  Neither is
+trivial.  The current kernel therefore doesn't always initialize the TDX
+module during kernel boot, but provides a function tdx_init() to allow
+the caller to initialize TDX when it truly wants to use TDX:
+
+        ret = tdx_init();
+        if (ret)
+                goto no_tdx;
+        // TDX is ready to use
+
+Initializing the TDX module requires that all logical CPUs are online
+and in VMX operation (a requirement for making SEAMCALLs) during
+tdx_init().  Currently, KVM is the only user of TDX, and it already
+guarantees that all online CPUs are in VMX operation whenever any VM
+exists.  The current kernel doesn't handle entering VMX operation in
+tdx_init() but leaves this to the caller.
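+
+A caller is therefore expected to do something along the lines of the
+sketch below.  The VMX helpers shown are placeholders for whatever
+VMXON/VMXOFF handling the caller already has (e.g. KVM's hardware
+enabling path); only cpus_read_lock()/cpus_read_unlock() and tdx_init()
+are real interfaces here:
+
+        static int enable_tdx(void)
+        {
+                int ret;
+
+                /* Keep CPUs from going offline while initializing TDX. */
+                cpus_read_lock();
+
+                /* Placeholder: put all online CPUs into VMX operation. */
+                ret = enable_vmx_on_all_cpus();
+                if (ret)
+                        goto out;
+
+                ret = tdx_init();
+
+                /* Placeholder: leave VMX operation again. */
+                disable_vmx_on_all_cpus();
+        out:
+                cpus_read_unlock();
+                return ret;
+        }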
+
+The user can consult dmesg to see whether the TDX module is present and
+whether it has been initialized.
+
+If the TDX module is not loaded, dmesg shows the following:
+
+|  [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like the following:
+
+|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+|  [..] tdx: 65667 pages allocated for PAMT.
+|  [..] tdx: TDX module initialized.
+
+If the TDX module fails to initialize, dmesg shows the following:
+
+|  [..] tdx: Failed to initialize TDX module.  Shut it down.
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't work with ACPI CPU hotplug.  To guarantee security, MCHECK
+verifies all logical CPUs of all packages during platform boot.  A
+hot-added CPU is not verified and thus cannot support TDX.  A non-buggy
+BIOS should never deliver an ACPI CPU hot-add event to the kernel.  Such
+an event is reported as a BIOS bug and the hot-added CPU is rejected.
+
+TDX requires that all boot-time verified logical CPUs remain present
+until machine reset.  If the kernel receives an ACPI CPU hot-removal
+event, it assumes it cannot continue to work normally and simply BUG()s.
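+
+The handling of both events can be summarized by the sketch below.  The
+CC_ATTR_* attribute name is an assumption for illustration;
+cc_platform_has() is the existing confidential-computing attribute query:
+
+        /* Illustrative sketch of the ACPI CPU hotplug policy on TDX. */
+        static int tdx_acpi_cpu_hotplug_check(bool hot_add)
+        {
+                if (!cc_platform_has(CC_ATTR_ACPI_CPU_HOTPLUG_DISABLED))
+                        return 0;       /* TDX not enabled by BIOS */
+
+                if (hot_add) {
+                        /* Hot-added CPUs are unverified: report and reject. */
+                        pr_err_once("BIOS bug: ACPI CPU hot-add on TDX platform\n");
+                        return -EINVAL;
+                }
+
+                /* Hot-removal: the kernel cannot continue to work normally. */
+                BUG();
+        }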
+
+Note that TDX works with logical CPU online/offline, so the kernel still
+allows a logical CPU to be taken offline and brought online again.
+
+Memory Hotplug
+~~~~~~~~~~~~~~
+
+The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
+indicate which memory regions are TDX-capable.  Those regions are
+generated by the BIOS and verified by MCHECK so that they are truly
+present during platform boot and meet the security guarantees.
+
+This means TDX doesn't work with ACPI memory hot-add.  A non-buggy BIOS
+should never deliver an ACPI memory hot-add event to the kernel.  Such an
+event is reported as a BIOS bug and the hot-added memory is rejected.
+
+TDX also doesn't work with ACPI memory hot-removal.  If the kernel
+receives an ACPI memory hot-removal event, it assumes it cannot continue
+to work normally and simply BUG()s.
+
+Also, the kernel needs to choose which TDX-capable regions to use as TDX
+memory and pass those regions to the TDX module when it gets initialized.
+Once passed to the TDX module, the TDX-usable memory regions are fixed
+for the module's lifetime.
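+
+As a rough sketch, gathering the TDX memory at initialization time could
+look like below.  for_each_mem_pfn_range() is the real memblock iterator;
+add_tdx_memblock() is a placeholder for recording a region to be passed
+to the TDX module later:
+
+        static int build_tdx_memory(void)
+        {
+                unsigned long start_pfn, end_pfn;
+                int i, nid, ret;
+
+                /* Walk every memory region currently known to memblock. */
+                for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+                        /* Placeholder: remember [start_pfn, end_pfn) as TDX memory. */
+                        ret = add_tdx_memblock(start_pfn, end_pfn, nid);
+                        if (ret)
+                                return ret;
+                }
+
+                return 0;
+        }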
+
+To avoid having to modify the page allocator to distinguish between TDX
+and non-TDX memory allocations, the current kernel guarantees that all
+pages managed by the page allocator are TDX memory.  Any memory hot-added
+to the page allocator would break this guarantee and thus must be
+prevented.
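+
+Conceptually, the check in the memory hot-add path looks like the sketch
+below.  The CC_ATTR_* attribute name and the hook placement are
+assumptions for illustration:
+
+        /* Illustrative sketch: reject memory hot-add when TDX is enabled. */
+        static int tdx_memory_hotplug_check(u64 start, u64 size)
+        {
+                if (!cc_platform_has(CC_ATTR_MEM_HOTPLUG_DISABLED))
+                        return 0;
+
+                pr_err("Unable to add memory [0x%llx, 0x%llx) on TDX enabled platform.\n",
+                       start, start + size);
+                return -EINVAL;
+        }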
+
+There are basically two memory hot-add cases that need to be prevented:
+ACPI memory hot-add and driver-managed memory hot-add.  The kernel also
+rejects driver-managed memory hot-add when TDX is enabled by the BIOS.
+For instance, dmesg shows the error below when the kmem driver is used to
+add legacy PMEM as system RAM:
+
+|  [..] tdx: Unable to add memory [0x580000000, 0x600000000) on TDX enabled platform.
+|  [..] kmem dax0.0: mapping0: 0x580000000-0x5ffffffff memory add failed
+
+However, adding new memory to ZONE_DEVICE should not be prevented, as
+those pages are not managed by the page allocator.  Therefore, the
+memremap_pages() variants are still allowed although they internally
+also use memory hotplug functions.
+
+Kexec()
+~~~~~~~
+
+TDX (and MKTME) doesn't guarantee cache coherency among different KeyIDs.
+If the TDX module is ever initialized, the kernel needs to flush dirty
+cachelines associated with any TDX private KeyID; otherwise they may
+silently corrupt the new kernel.
+
+Similar to SME support, the kernel uses wbinvd() to flush the cache in
+stop_this_cpu().
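+
+A sketch of that flush is shown below.  stop_this_cpu() and
+native_wbinvd() are real kernel symbols; the platform_tdx_enabled()
+check is a placeholder for however the kernel records that TDX (and thus
+possibly dirty private-KeyID cachelines) is in use:
+
+        void stop_this_cpu(void *dummy)
+        {
+                /* ... existing CPU shutdown work ... */
+
+                /*
+                 * Flush dirty cachelines for all KeyIDs, including TDX
+                 * private KeyIDs, so they cannot silently corrupt the
+                 * memory of the kexec()'ed kernel.
+                 */
+                if (boot_cpu_has(X86_FEATURE_SME) || platform_tdx_enabled())
+                        native_wbinvd();
+
+                for (;;)
+                        cpu_relax();
+        }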
+
+The current TDX module architecture doesn't play nicely with kexec().
+The TDX module can only be initialized once during its lifetime, and
+there is no SEAMCALL to reset the module to give a clean slate to the
+new kernel.  Ideally, if the module has ever been initialized, it should
+be shut down before kexec().  The new kernel won't be able to use TDX
+anyway (as it needs to go through the TDX module initialization process,
+which will fail immediately at the first step).
+
+However, there's no guarantee that the CPUs are in VMX operation during
+kexec(), so it's impractical to shut down the module.  The current kernel
+just leaves the module in an open state.
+
+TDX Guest Support
+=================
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
 implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +188,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +198,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +209,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +220,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +241,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +261,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
 value with a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -107,7 +275,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +295,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 An access to private mappings can also cause a #VE.  Since all kernel
 memory is also private memory, the kernel might theoretically need to
@@ -145,7 +313,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +335,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
 which is not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
 mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +357,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However, some kernel users like device
-- 
2.35.3

