From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, seanjc@google.com, pbonzini@redhat.com,
dave.hansen@intel.com, dan.j.williams@intel.com,
rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com,
ying.huang@intel.com, reinette.chatre@intel.com,
len.brown@intel.com, tony.luck@intel.com, peterz@infradead.org,
ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com,
sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com,
sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
Subject: [PATCH v7 20/20] Documentation/x86: Add documentation for TDX host support
Date: Mon, 21 Nov 2022 13:26:42 +1300
Message-ID: <661183935202155894bb669930d483a555a73a7b.1668988357.git.kai.huang@intel.com>
In-Reply-To: <cover.1668988357.git.kai.huang@intel.com>
Add documentation for TDX host kernel support. Documentation/x86/tdx.rst
already documents TDX guest internals, so reuse that file for TDX host
kernel support.
Move the existing material under a new top-level section "TDX Guest
Support", and add a new section for TDX host kernel support.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
v6 -> v7:
- Changed "TDX Memory Policy" and "Kexec()" sections.
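
Note for reviewers on the "1/256th of system RAM" metadata figure in the
new documentation: below is a rough userspace sketch of that estimate, in
case it helps when sanity-checking the PAMT page count printed in dmesg.
The helper name pamt_pages_estimate() and the round-up to 4K pages are
illustrative assumptions only. They are not taken from this series or
from the TDX module specification, and the real PAMT size is determined
by the TDX module and will differ.

```c
#include <stdint.h>

#define PAGE_SIZE_4K 4096ULL

/*
 * Hypothetical helper: approximate the number of 4K pages needed for TDX
 * metadata, assuming it is ~1/256th of RAM rounded up to whole pages.
 * This is an estimate only, not the kernel's actual calculation.
 */
static uint64_t pamt_pages_estimate(uint64_t ram_bytes)
{
	uint64_t pamt_bytes = ram_bytes / 256;

	return (pamt_bytes + PAGE_SIZE_4K - 1) / PAGE_SIZE_4K;
}
```

For a hypothetical machine with 64 GiB of RAM,
pamt_pages_estimate(64ULL << 30) gives 65536 pages, which is the same
order of magnitude as the 65667 pages in the example dmesg output.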
---
Documentation/x86/tdx.rst | 181 +++++++++++++++++++++++++++++++++++---
1 file changed, 170 insertions(+), 11 deletions(-)
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index dc8d9fd2c3f7..35092e7c60f7 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -10,6 +10,165 @@ encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed to by the SEAM Range Register (SEAMRR). A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionalities to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized. The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot. The following dmesg line shows that TDX is enabled by the BIOS::
+
+ [..] tdx: TDX enabled by BIOS. TDX private KeyID range: [16, 64).
+
+TDX module detection and initialization
+---------------------------------------
+
+There is no CPUID or MSR to detect the TDX module. The kernel detects it
+by initializing it.
+
+The kernel talks to the TDX module via the new SEAMCALL instruction. The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
+Initializing the TDX module consumes roughly 1/256th of system RAM as
+'metadata' for the TDX memory. It also takes additional CPU time to
+initialize that metadata along with the TDX module itself. Neither cost
+is trivial, so the kernel initializes the TDX module at runtime on
+demand. The caller needs to call tdx_enable() to initialize the module::
+
+	ret = tdx_enable();
+	if (ret)
+		goto no_tdx;
+	// TDX is ready to use
+
+Initializing the TDX module requires all logical CPUs to be online.
+tdx_enable() temporarily disables CPU hotplug internally to prevent any
+CPU from going offline, but the caller still needs to guarantee all
+present CPUs are online before calling tdx_enable().
+
+Also, tdx_enable() requires all CPUs to already be in VMX operation
+(a requirement for making SEAMCALLs). Currently, tdx_enable() doesn't
+handle VMXON internally, but depends on the caller to guarantee that. So
+far KVM is the only user of TDX and KVM already handles VMXON.
+
+Users can consult dmesg to see whether the TDX module is present and
+whether it has been initialized.
+
+If the TDX module is not loaded, dmesg shows::
+
+ [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like below::
+
+ [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+ [..] tdx: 65667 pages allocated for PAMT.
+ [..] tdx: TDX module initialized.
+
+If the TDX module fails to initialize, dmesg shows::
+
+ [..] tdx: Failed to initialize TDX module. Shut it down.
+
+TDX Interaction with Other Kernel Components
+--------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate
+all memory regions that can possibly be used by the TDX module, but they
+are not automatically usable by the TDX module. As one step of
+initializing the TDX module, the kernel needs to choose a list of memory
+regions (out of the convertible memory regions) that the TDX module can
+use and pass those regions to the TDX module. Once this is done, those
+"TDX-usable" memory regions are fixed for the module's lifetime. No
+more TDX-usable memory can be added to the TDX module after that.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory. Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and at the same time refuses to add any non-TDX memory
+via memory hotplug.
+
+This can be enhanced in the future, e.g. by allowing non-TDX memory to
+be added to a separate NUMA node. In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Note that TDX assumes convertible memory is always physically present
+during the machine's runtime. A non-buggy BIOS should never support
+hot-removal of any convertible memory. This implementation doesn't handle
+ACPI memory removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX. A non-buggy BIOS should never support hot-add/removal of
+physical CPUs. Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note that TDX works with logical CPU online/offline, thus the kernel
+still allows a logical CPU to be taken offline and brought online again.
+
+Kexec()
+~~~~~~~
+
+There are two problems in terms of using kexec() to boot to a new kernel
+when the old kernel has enabled TDX: 1) Some memory pages are still TDX
+private pages (i.e. metadata used by the TDX module, and any TDX guest
+memory if kexec() is executed when there are live TDX guests).
+2) There might be dirty cachelines associated with TDX private pages.
+
+Because the hardware doesn't guarantee cache coherency among different
+KeyIDs, the old kernel needs to flush the cache (of TDX private pages)
+before booting to the new kernel. Also, the kernel doesn't convert all
+TDX private pages back to normal because of the following considerations:
+
+1) The kernel doesn't have existing infrastructure to track which pages
+   are TDX private pages.
+2) The number of TDX private pages can be large, and converting all of
+   them (cache flush + using MOVDIR64B to clear the page) can be time
+   consuming.
+3) The new kernel will almost exclusively use KeyID 0 to access memory.
+   KeyID 0 doesn't support integrity checks, so it's OK.
+4) The kernel doesn't (and may never) support MKTME. If any 3rd party
+   kernel ever supports MKTME, it should do MOVDIR64B to clear the page
+   with the new MKTME KeyID (just like TDX does) before using it.
+
+The current TDX module architecture doesn't play nicely with kexec().
+The TDX module can only be initialized once during its lifetime, and
+there is no SEAMCALL to reset the module to give a new clean slate to
+the new kernel. Therefore, ideally, if the module has ever been
+initialized, it's better to shut it down. The new kernel won't be able to
+use TDX anyway (as it needs to go through the TDX module initialization
+process which will fail immediately at the first step).
+
+However, there's no guarantee the CPU is in VMX operation during kexec(),
+so it's impractical to shut down the module. Currently, the kernel just
+leaves the module in the open state.
+
+TDX Guest Support
+=================
Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +179,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.
New TDX Exceptions
-==================
+------------------
TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +189,7 @@ Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.
Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
- Port I/O (INS, OUTS, IN, OUT)
- HLT
@@ -41,7 +200,7 @@ Instruction-based #VE
- CPUID*
Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +211,7 @@ Instruction-based #GP
- RDMSR*,WRMSR*
RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
MSR access behavior falls into three categories:
@@ -73,7 +232,7 @@ trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
CPUID Behavior
---------------
+~~~~~~~~~~~~~~
For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +252,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
#VE on Memory Accesses
-======================
+----------------------
There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
@@ -107,7 +266,7 @@ entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
#VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +286,7 @@ be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.
#VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
An access to private mappings can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
@@ -145,7 +304,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
to handle the exception.
Linux #VE handler
-=================
+-----------------
Just like page faults or #GP's, #VE exceptions can be either handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +326,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
MMIO handling
-=============
+-------------
In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +348,7 @@ MMIO access via other means (like structure overlays) may result in an
oops.
Shared Memory Conversions
-=========================
+-------------------------
All TDX guest memory starts out as private at boot. This memory can not
be accessed by the hypervisor. However, some kernel users like device
--
2.38.1