linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
	len.brown@intel.com, tony.luck@intel.com,
	rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
	dan.j.williams@intel.com, peterz@infradead.org,
	ak@linux.intel.com, kirill.shutemov@linux.intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support
Date: Wed, 22 Jun 2022 23:17:50 +1200	[thread overview]
Message-ID: <0712bc0b05a0c6c42437fba68f82d9268ab3113e.1655894131.git.kai.huang@intel.com> (raw)
In-Reply-To: <cover.1655894131.git.kai.huang@intel.com>

Add documentation for TDX host kernel support.  There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals.  Also reuse it for TDX host kernel support.

Introduce a new level menu "TDX Guest Support" and move existing
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 190 +++++++++++++++++++++++++++++++++++---
 1 file changed, 179 insertions(+), 11 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index b8fa4329e1a5..6c6b09ca6ba4 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,174 @@ encrypting the guest memory. In TDX, a special module running in a special
 mode sits between the host and the guest and manages the guest/host
 separation.
 
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed by the SEAM Ranger Register (SEAMRR).  A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+To enable TDX, BIOS configures SEAMRR and TDX private KeyIDs consistently
+across all CPU packages.  TDX doesn't trust BIOS.  The MCHECK verifies
+all configurations from BIOS are correct and enables SEAMRR.
+
+After TDX is enabled in BIOS, the TDX module needs to be loaded into the
+SEAMRR range and properly initialized, before it can be used to create
+and run protected VMs.
+
+The TDX architecture doesn't require BIOS to load the TDX module, but
+current kernel assumes it is loaded by BIOS (i.e. either directly or by
+some UEFI shell tool) before booting to the kernel.  Current kernel
+detects TDX and initializes the TDX module.
+
+TDX boot-time detection
+-----------------------
+
+Kernel detects TDX and the TDX private KeyIDs during kernel boot.  User
+can see below dmesg if TDX is enabled by BIOS:
+
+|  [..] tdx: SEAMRR enabled.
+|  [..] tdx: TDX private KeyID range: [16, 64).
+|  [..] tdx: TDX enabled by BIOS.
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect whether the TDX module.  The kernel
+detects the TDX module by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly ~1/256th system RAM size to
+use it as 'metadata' for the TDX memory.  It also takes additional CPU
+time to initialize those metadata along with the TDX module itself.  Both
+are not trivial.  Current kernel doesn't choose to always initialize the
+TDX module during kernel boot, but provides a function tdx_init() to
+allow the caller to initialize TDX when it truly wants to use TDX:
+
+        ret = tdx_init();
+        if (ret)
+                goto no_tdx;
+        // TDX is ready to use
+
+Initializing the TDX module requires all logical CPUs being online and
+are in VMX operation (requirement of making SEAMCALL) during tdx_init().
+Currently, KVM is the only user of TDX.  KVM always guarantees all online
+CPUs are in VMX operation when there's any VM.  Current kernel doesn't
+handle entering VMX operation in tdx_init() but leaves this to the
+caller.
+
+User can consult dmesg to see the presence of the TDX module, and whether
+it has been initialized.
+
+If the TDX module is not loaded, dmesg shows below:
+
+|  [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below:
+
+|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+|  [..] tdx: 65667 pages allocated for PAMT.
+|  [..] tdx: TDX module initialized.
+
+If the TDX module failed to initialize, dmesg shows below:
+
+|  [..] tdx: Failed to initialize TDX module.  Shut it down.
+
+TDX Interaction to Other Kernel Components
+------------------------------------------
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't work with ACPI CPU hotplug.  To guarantee the security MCHECK
+verifies all logical CPUs for all packages during platform boot.  Any
+hot-added CPU is not verified thus cannot support TDX.  A non-buggy BIOS
+should never deliver ACPI CPU hot-add event to the kernel.  Such event is
+reported as BIOS bug and the hot-added CPU is rejected.
+
+TDX requires all boot-time verified logical CPUs being present until
+machine reset.  If kernel receives ACPI CPU hot-removal event, assume the
+kernel cannot continue to work normally so just BUG().
+
+Note TDX works with CPU logical online/offline, thus the kernel still
+allows to offline logical CPU and online it again.
+
+Memory Hotplug
+~~~~~~~~~~~~~~
+
+The TDX module reports a list of "Convertible Memory Region" (CMR) to
+indicate which memory regions are TDX-capable.  Those regions are
+generated by BIOS and verified by the MCHECK so that they are truly
+present during platform boot and can meet security guarantee.
+
+This means TDX doesn't work with ACPI memory hot-add.  A non-buggy BIOS
+should never deliver ACPI memory hot-add event to the kernel.  Such event
+is reported as BIOS bug and the hot-added memory is rejected.
+
+TDX also doesn't work with ACPI memory hot-removal.  If kernel receives
+ACPI memory hot-removal event, assume the kernel cannot continue to work
+normally so just BUG().
+
+Also, the kernel needs to choose which TDX-capable regions to use as TDX
+memory and pass those regions to the TDX module when it gets initialized.
+Once they are passed to the TDX module, the TDX-usable memory regions are
+fixed during module's lifetime.
+
+To avoid having to modify the page allocator to distinguish TDX and
+non-TDX memory allocation, current kernel guarantees all pages managed by
+the page allocator are TDX memory.  This means any hot-added memory to
+the page allocator will break such guarantee thus should be prevented.
+
+There are basically two memory hot-add cases that need to be prevented:
+ACPI memory hot-add and driver managed memory hot-add.  The kernel
+rejectes the driver managed memory hot-add too when TDX is enabled by
+BIOS.  For instance, dmesg shows below error when using kmem driver to
+add a legacy PMEM as system RAM:
+
+|  [..] tdx: Unable to add memory [0x580000000, 0x600000000) on TDX enabled platform.
+|  [..] kmem dax0.0: mapping0: 0x580000000-0x5ffffffff memory add failed
+
+However, adding new memory to ZONE_DEVICE should not be prevented as
+those pages are not managed by the page allocator.  Therefore,
+memremap_pages() variants are still allowed although they internally
+also uses memory hotplug functions.
+
+Kexec()
+~~~~~~~
+
+TDX (and MKTME) doesn't guarantee cache coherency among different KeyIDs.
+If the TDX module is ever initialized, the kernel needs to flush dirty
+cachelines associated with any TDX private KeyID, otherwise they may
+slightly corrupt the new kernel.
+
+Similar to SME support, the kernel uses wbinvd() to flush cache in
+stop_this_cpu().
+
+The current TDX module architecture doesn't play nicely with kexec().
+The TDX module can only be initialized once during its lifetime, and
+there is no SEAMCALL to reset the module to give a new clean slate to
+the new kernel.  Therefore, ideally, if the module is ever initialized,
+it's better to shut down the module.  The new kernel won't be able to
+use TDX anyway (as it needs to go through the TDX module initialization
+process which will fail immediately at the first step).
+
+However, there's no guarantee CPU is in VMX operation during kexec(), so
+it's impractical to shut down the module.  Current kernel just leaves the
+module in open state.
+
+TDX Guest Support
+=================
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
 implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +188,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +198,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +209,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +220,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +241,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +261,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
 value with a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -107,7 +275,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +295,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 An access to private mappings can also cause a #VE.  Since all kernel
 memory is also private memory, the kernel might theoretically need to
@@ -145,7 +313,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +335,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
 which is not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
 mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +357,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However, some kernel users like device
-- 
2.36.1


  parent reply	other threads:[~2022-06-22 11:19 UTC|newest]

Thread overview: 114+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-22 11:15 [PATCH v5 00/22] TDX host kernel support Kai Huang
2022-06-22 11:15 ` [PATCH v5 01/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2022-06-23  5:57   ` Chao Gao
2022-06-23  9:23     ` Kai Huang
2022-08-02  2:01   ` [PATCH v5 1/22] " Wu, Binbin
2022-08-03  9:25     ` Kai Huang
2022-06-22 11:15 ` [PATCH v5 02/22] cc_platform: Add new attribute to prevent ACPI CPU hotplug Kai Huang
2022-06-22 11:42   ` Rafael J. Wysocki
2022-06-23  0:01     ` Kai Huang
2022-06-27  8:01       ` Igor Mammedov
2022-06-28 10:04         ` Kai Huang
2022-06-28 11:52           ` Igor Mammedov
2022-06-28 17:33           ` Rafael J. Wysocki
2022-06-28 23:41             ` Kai Huang
2022-06-24 18:57   ` Dave Hansen
2022-06-27  5:05     ` Kai Huang
2022-07-13 11:09       ` Kai Huang
2022-07-19 17:46         ` Dave Hansen
2022-07-19 23:54           ` Kai Huang
2022-08-03  3:40       ` Binbin Wu
2022-08-03  9:20         ` Kai Huang
2022-06-29  5:33   ` Christoph Hellwig
2022-06-29  9:09     ` Kai Huang
2022-08-03  3:55   ` Binbin Wu
2022-08-03  9:21     ` Kai Huang
2022-06-22 11:15 ` [PATCH v5 03/22] cc_platform: Add new attribute to prevent ACPI memory hotplug Kai Huang
2022-06-22 11:45   ` Rafael J. Wysocki
2022-06-23  0:08     ` Kai Huang
2022-06-28 17:55       ` Rafael J. Wysocki
2022-06-28 12:01     ` Igor Mammedov
2022-06-28 23:49       ` Kai Huang
2022-06-29  8:48         ` Igor Mammedov
2022-06-29  9:13           ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 04/22] x86/virt/tdx: Prevent ACPI CPU hotplug and " Kai Huang
2022-06-24  1:41   ` Chao Gao
2022-06-24 11:21     ` Kai Huang
2022-06-29  8:35       ` Yuan Yao
2022-06-29  9:17         ` Kai Huang
2022-06-29 14:22       ` Dave Hansen
2022-06-29 23:02         ` Kai Huang
2022-06-30 15:44           ` Dave Hansen
2022-06-30 22:45             ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 05/22] x86/virt/tdx: Prevent hot-add driver managed memory Kai Huang
2022-06-24  2:12   ` Chao Gao
2022-06-24 11:23     ` Kai Huang
2022-06-24 19:01   ` Dave Hansen
2022-06-27  5:27     ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 06/22] x86/virt/tdx: Add skeleton to initialize TDX on demand Kai Huang
2022-06-24  2:39   ` Chao Gao
2022-06-24 11:27     ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function Kai Huang
2022-06-24 18:38   ` Dave Hansen
2022-06-27  5:23     ` Kai Huang
2022-06-27 20:58       ` Dave Hansen
2022-06-27 22:10         ` Kai Huang
2022-07-19 19:39           ` Dan Williams
2022-07-19 23:28             ` Kai Huang
2022-07-20 10:18           ` Kai Huang
2022-07-20 16:48             ` Dave Hansen
2022-07-21  1:52               ` Kai Huang
2022-07-27  0:34                 ` Kai Huang
2022-07-27  0:50                   ` Dave Hansen
2022-07-27 12:46                     ` Kai Huang
2022-08-03  2:37                 ` Kai Huang
2022-08-03 14:20                   ` Dave Hansen
2022-08-03 22:35                     ` Kai Huang
2022-08-04 10:06                       ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 08/22] x86/virt/tdx: Shut down TDX module in case of error Kai Huang
2022-06-24 18:50   ` Dave Hansen
2022-06-27  5:26     ` Kai Huang
2022-06-27 20:46       ` Dave Hansen
2022-06-27 22:34         ` Kai Huang
2022-06-27 22:56           ` Dave Hansen
2022-06-27 23:59             ` Kai Huang
2022-06-28  0:03               ` Dave Hansen
2022-06-28  0:11                 ` Kai Huang
2022-06-22 11:16 ` [PATCH v5 09/22] x86/virt/tdx: Detect TDX module by doing module global initialization Kai Huang
2022-06-22 11:16 ` [PATCH v5 10/22] x86/virt/tdx: Do logical-cpu scope TDX module initialization Kai Huang
2022-06-22 11:17 ` [PATCH v5 11/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2022-06-22 11:17 ` [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory Kai Huang
2022-06-24 19:40   ` Dave Hansen
2022-06-27  6:16     ` Kai Huang
2022-07-07  2:37       ` Kai Huang
2022-07-07 14:26       ` Dave Hansen
2022-07-07 14:36         ` Juergen Gross
2022-07-07 23:42           ` Kai Huang
2022-07-07 23:34         ` Kai Huang
2022-08-03  1:30           ` Kai Huang
2022-08-03 14:22             ` Dave Hansen
2022-08-03 22:14               ` Kai Huang
2022-06-22 11:17 ` [PATCH v5 13/22] x86/virt/tdx: Add placeholder to construct TDMRs based on memblock Kai Huang
2022-06-22 11:17 ` [PATCH v5 14/22] x86/virt/tdx: Create TDMRs to cover all memblock memory regions Kai Huang
2022-06-22 11:17 ` [PATCH v5 15/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2022-06-24 20:13   ` Dave Hansen
2022-06-27 10:31     ` Kai Huang
2022-06-27 20:41       ` Dave Hansen
2022-06-27 22:50         ` Kai Huang
2022-06-27 22:57           ` Dave Hansen
2022-06-27 23:05             ` Kai Huang
2022-06-28  0:48         ` Xiaoyao Li
2022-06-28 17:03           ` Dave Hansen
2022-08-17 22:46   ` Sagi Shahar
2022-08-17 23:43     ` Huang, Kai
2022-06-22 11:17 ` [PATCH v5 16/22] x86/virt/tdx: Set up reserved areas for all TDMRs Kai Huang
2022-06-22 11:17 ` [PATCH v5 17/22] x86/virt/tdx: Reserve TDX module global KeyID Kai Huang
2022-06-22 11:17 ` [PATCH v5 18/22] x86/virt/tdx: Configure TDX module with TDMRs and " Kai Huang
2022-06-22 11:17 ` [PATCH v5 19/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2022-06-22 11:17 ` [PATCH v5 20/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
2022-06-22 11:17 ` [PATCH v5 21/22] x86/virt/tdx: Support kexec() Kai Huang
2022-06-22 11:17 ` Kai Huang [this message]
2022-08-18  4:07   ` [PATCH v5 22/22] Documentation/x86: Add documentation for TDX host support Bagas Sanjaya
2022-08-18  9:33     ` Huang, Kai
2022-06-24 19:47 ` [PATCH v5 00/22] TDX host kernel support Dave Hansen
2022-06-27  4:09   ` Kai Huang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0712bc0b05a0c6c42437fba68f82d9268ab3113e.1655894131.git.kai.huang@intel.com \
    --to=kai.huang@intel.com \
    --cc=ak@linux.intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=isaku.yamahata@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rafael.j.wysocki@intel.com \
    --cc=reinette.chatre@intel.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=seanjc@google.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).