linux-kernel.vger.kernel.org archive mirror
From: Kai Huang <kai.huang@intel.com>
To: Dave Hansen <dave.hansen@intel.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, len.brown@intel.com,
	tony.luck@intel.com, rafael.j.wysocki@intel.com,
	reinette.chatre@intel.com, dan.j.williams@intel.com,
	peterz@infradead.org, ak@linux.intel.com,
	kirill.shutemov@linux.intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	isaku.yamahata@intel.com
Subject: Re: [PATCH v3 00/21] TDX host kernel support
Date: Wed, 27 Apr 2022 13:15:02 +1200
Message-ID: <ecf718abf864bbb2366209f00d4315ada090aedc.camel@intel.com>
In-Reply-To: <522e37eb-68fc-35db-44d5-479d0088e43f@intel.com>

On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > SEAM VMX root operation is designed to host a CPU-attested, software
> > module called the 'TDX module' which implements functions to manage
> > crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
> 
> "crypto protected"?  What the heck is that?

How about "crypto-protected"?  A quick search suggests the term is already in
use elsewhere.

> 
> > designed to host a CPU-attested, software module called the 'Intel
> > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> > 
> > Host kernel transits to either the P-SEAMLDR or the TDX module via a new
> 
>  ^ The

Thanks.

> 
> > SEAMCALL instruction.  SEAMCALLs are host-side interface functions
> > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> > instruction.  They are similar to a hypercall, except they are made by
> > host kernel to the SEAM software modules.
> 
> This is still missing some important high-level things, like that the
> TDX module is protected from the untrusted VMM.  Heck, it forgets to
> mention that the VMM itself is untrusted and the TDX module replaces
> things that the VMM usually does.
> 
> It would also be nice to mention here how this compares with SEV-SNP.
> Where is the TDX module in that design?  Why doesn't SEV need all this code?
> 
> > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> > protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
> > KeyIDs, which can only be used by software runs in SEAM.  The physical
> 
> 					    ^ which

Thanks.

> 
> > address bits for encoding TDX private KeyID are treated as reserved bits
> > when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
> > private KeyIDs is configured by BIOS.
> > 
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized using SEAMCALLs defined by TDX architecture.
> > This series assumes both the P-SEAMLDR and the TDX module are loaded by
> > BIOS before the kernel boots.
> > 
> > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> > SEAMCALL to detect P-SEAMLDR.  The success of this SEAMCALL means the
> > P-SEAMLDR is loaded.  The P-SEAMLDR information returned by this
> > SEAMCALL further tells whether TDX module is loaded.
> 
> There's a bit of information missing here.  The kernel might not know
> the state of things being loaded.  A previous kernel might have loaded
> it and left it in an unknown state.
> 
> > The TDX module is initialized in multiple steps:
> > 
> >         1) Global initialization;
> >         2) Logical-CPU scope initialization;
> >         3) Enumerate the TDX module capabilities;
> >         4) Configure the TDX module about usable memory ranges and
> >            global KeyID information;
> >         5) Package-scope configuration for the global KeyID;
> >         6) Initialize TDX metadata for usable memory ranges based on 4).
> > 
> > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> > table) logical cpus, otherwise step 4) will fail.  Step 5) requires
> > calling SEAMCALL on at least one cpu on all packages.
> > 
> > TDX module can also be shut down at any time during module's lifetime, by
> > calling SEAMCALL on all "BIOS-enabled" logical cpus.
> > 
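
To make the ordering constraints in the quoted steps concrete, here is a
minimal userspace sketch.  It is not code from this series: struct platform,
its fields and tdx_init_steps() are hypothetical names, and the only rules
modeled are the two stated above (step 4 fails unless step 2 ran on every
BIOS-enabled logical cpu; step 5 must run on at least one cpu of every package).

```c
#include <assert.h>

/* Hypothetical step numbers, matching steps 1) to 6) in the quoted text. */
enum tdx_init_step {
	STEP_GLOBAL_INIT = 1,
	STEP_LP_INIT,
	STEP_ENUMERATE,
	STEP_CONFIGURE,
	STEP_KEY_CONFIG,
	STEP_INIT_TDMRS
};

/* Hypothetical summary of how far the per-cpu work has progressed. */
struct platform {
	int bios_enabled_cpus;   /* logical cpus enabled in the MADT */
	int lp_inited_cpus;      /* cpus on which step 2) was done */
	int packages;            /* number of packages */
	int key_configured_pkgs; /* packages on which step 5) was done */
};

/* Returns 0 if all steps can complete, else the number of the failing step. */
static int tdx_init_steps(const struct platform *p)
{
	/* Steps 1) to 3) have no precondition modeled in this sketch. */

	/* Step 4) fails unless step 2) covered all BIOS-enabled cpus. */
	if (p->lp_inited_cpus != p->bios_enabled_cpus)
		return STEP_CONFIGURE;

	/* Step 5) must have been done on one cpu of every package. */
	if (p->key_configured_pkgs != p->packages)
		return STEP_KEY_CONFIG;

	/* Step 6) has no further precondition modeled here. */
	return 0;
}
```

With 4 BIOS-enabled cpus of which only 3 did step 2), the sketch reports
step 4 as the failing step, which is the situation the quoted text warns about.
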
> > == Design Considerations ==
> > 
> > 1. Lazy TDX module initialization on-demand by caller
> 
> This doesn't really tell us what "lazy" is or what the alternatives are.
> 
> There are basically two ways the TDX module could be loaded.  Either:
>   * In early boot
> or
>   * At runtime just before the first TDX guest is run
> 
> This series implements the runtime loading.

OK will do.
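
For the runtime option, the pattern is roughly "initialize once, on first
use, and cache the outcome".  A minimal userspace sketch of that pattern
follows; it is not the implementation in this series.  tdx_enable_on_demand(),
do_module_init() and the state enum are hypothetical names, and caching a
failure is an assumption of the sketch.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical states for the cached result of the one-time initialization. */
enum tdx_state { TDX_UNKNOWN, TDX_READY, TDX_FAILED };

static pthread_mutex_t tdx_lock = PTHREAD_MUTEX_INITIALIZER;
static enum tdx_state tdx_state = TDX_UNKNOWN;
static int init_attempts; /* how many times the real work was run */

/* Stand-in for the real multi-step module initialization. */
static int do_module_init(void)
{
	init_attempts++;
	return 0; /* pretend the initialization succeeded */
}

/* Called by the first user (e.g. before creating the first TD guest). */
static int tdx_enable_on_demand(void)
{
	int ret;

	pthread_mutex_lock(&tdx_lock);
	/* Only the first caller does the work; later callers see the cache. */
	if (tdx_state == TDX_UNKNOWN)
		tdx_state = do_module_init() ? TDX_FAILED : TDX_READY;
	ret = (tdx_state == TDX_READY) ? 0 : -1;
	pthread_mutex_unlock(&tdx_lock);

	return ret;
}
```

Calling tdx_enable_on_demand() twice runs do_module_init() only once, so
nothing is paid for TDX until a caller actually asks for it.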

> 
> > None of the steps in the TDX module initialization process must be done
> > during kernel boot.  This series doesn't initialize TDX at boot time, but
> > instead, provides two functions to allow caller to detect and initialize
> > TDX on demand:
> > 
> >         if (tdx_detect())
> >                 goto no_tdx;
> >         if (tdx_init())
> >                 goto no_tdx;
> > 
> > This approach has below pros:
> > 
> > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> > metadata.  Enabling TDX on demand allows only to consume this memory when
> > TDX is truly needed (i.e. when KVM wants to create TD guests).
> > 
> > 2) Both detecting and initializing the TDX module require calling
> > SEAMCALL.  However, SEAMCALL requires CPU being already in VMX operation
> > (VMXON has been done).  So far, KVM is the only user of TDX, and it
> > already handles VMXON/VMXOFF.  Therefore, letting KVM to initialize TDX
> > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> > core-kernel.  Also, in long term, likely a reference based VMXON/VMXOFF
> > approach is needed since more kernel components will need to handle
> > VMXON/VMXOFF.
> > 
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series).  After updating to the new module at runtime, kernel needs
> > to go through the initialization process again.  For the new module,
> > it's possible the metadata allocated for the old module cannot be reused
> > for the new module, and needs to be re-allocated again.
> > 
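
To put the "~1/256th of system RAM" figure in pro 1) into numbers, here is a
small sketch.  tdx_metadata_estimate() is a hypothetical helper, and the exact
divisor is only the approximation quoted above, not a value taken from the
TDX specification.

```c
#include <assert.h>
#include <stdint.h>

/* Rough metadata cost, using the ~1/256 ratio quoted in the cover letter. */
static uint64_t tdx_metadata_estimate(uint64_t ram_bytes)
{
	return ram_bytes / 256;
}
```

Under that approximation a machine with 1 TiB of RAM would set aside about
4 GiB, which is the memory that on-demand initialization avoids consuming
when no TD guest is ever created.
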
> > 2. Kernel policy on TDX memory
> > 
> > Host kernel is responsible for choosing which memory regions can be used
> > as TDX memory, and configuring those memory regions to the TDX module by
> > using an array of "TD Memory Regions" (TDMR), which is a data structure
> > defined by TDX architecture.
> 
> 
> This is putting the cart before the horse.  Don't define the details up
> front.
> 
> 	The TDX architecture allows the VMM to designate specific memory
> 	as usable for TDX private memory.  This series chooses to
> 	designate _all_ system RAM as TDX to avoid having to modify the
> 	page allocator to distinguish TDX and non-TDX-capable memory
> 
> ... then go on to explain the details.

Thanks.  Will update.

> 
> > The first generation of TDX essentially guarantees that all system RAM
> > memory regions (excluding the memory below 1MB) can be used as TDX
> > memory.  To avoid having to modify the page allocator to distinguish TDX
> > and non-TDX allocation, this series chooses to use all system RAM as TDX
> > memory.
> > 
> > E820 table is used to find all system RAM entries.  Following
> > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> > types are treated as TDX memory, and contiguous ranges in the same NUMA
> > node are merged together (similar to memblock_add()) before trimming the
> > non-page-aligned part.
> 
> This e820 cruft is too much detail for a cover letter.  In general, once
> you start talking about individual functions, you've gone too far in the
> cover letter.

Will remove.
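
For reference, the "merge contiguous ranges in the same NUMA node, then trim
the non-page-aligned part" handling described in the quoted paragraph can be
sketched as below.  This is a standalone illustration, not the code in the
patches: struct ram_range, merge_and_trim() and the 4 KiB page size are
assumptions of the sketch, and the input is assumed to be sorted by start
address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size for this sketch */

/* Hypothetical descriptor of one RAM range: [start, end) on NUMA node nid. */
struct ram_range {
	uint64_t start;
	uint64_t end;
	int nid;
};

/*
 * Merge adjacent or overlapping ranges on the same node, then trim each
 * result to page boundaries.  Works in place; returns the new range count.
 */
static size_t merge_and_trim(struct ram_range *r, size_t n)
{
	size_t merged = 0, kept = 0;

	/* Pass 1: merge ranges that touch or overlap within one node. */
	for (size_t i = 0; i < n; i++) {
		if (merged && r[merged - 1].nid == r[i].nid &&
		    r[i].start <= r[merged - 1].end) {
			if (r[i].end > r[merged - 1].end)
				r[merged - 1].end = r[i].end;
		} else {
			r[merged++] = r[i];
		}
	}

	/* Pass 2: round start up and end down; drop ranges left empty. */
	for (size_t i = 0; i < merged; i++) {
		uint64_t s = (r[i].start + SKETCH_PAGE_SIZE - 1) &
			     ~(SKETCH_PAGE_SIZE - 1);
		uint64_t e = r[i].end & ~(SKETCH_PAGE_SIZE - 1);

		if (s < e) {
			r[kept].start = s;
			r[kept].end = e;
			r[kept].nid = r[i].nid;
			kept++;
		}
	}

	return kept;
}
```

Merging first matters: two ranges that meet at an unaligned address form one
aligned range after merging, whereas trimming them individually would lose
the page containing the boundary.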

> 
> > 3. Memory hotplug
> > 
> > The first generation of TDX architecturally doesn't support memory
> > hotplug.  And the first generation of TDX-capable platforms don't support
> > physical memory hotplug.  Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> > 
> > A special case of memory hotplug is adding NVDIMM as system RAM using
> > kmem driver.  However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
> 
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

I forgot to add the text below (it is already in the documentation patch):

"This can be enhanced when future generation of TDX starts to support ACPI
memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
same platform."

Is this acceptable?

> 
> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM.  To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > 
> > 4. CPU hotplug
> > 
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either.  Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.
> > 
> > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > online.  After the initialization, any logical cpu can be brought down
> > and brought up to online again later.  Therefore this series doesn't
> > change logical CPU hotplug either.
> > 
> > 5. TDX interaction with kexec()
> > 
> > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > of TDX private memory, including PAMTs, used by TDX module need to be
> > flushed before transiting to the new kernel otherwise they may silently
> > corrupt the new kernel.  Similar to SME, this series flushes cache in
> > stop_this_cpu().
> 
> What does this have to do with kexec()?  What's a PAMT?

The point is that the dirty cachelines of TDX private memory must be flushed;
otherwise they may silently corrupt the new kexec()-ed kernel.

Will use "TDX metadata" instead of "PAMT".  The former has already been
mentioned above.

> 
> > The TDX module can be initialized only once during its lifetime.  The
> > first generation of TDX doesn't have interface to reset TDX module to
> 
> 				      ^ an

Thanks.

> 
> > uninitialized state so it can be initialized again.
> > 
> > This implies:
> > 
> >   - If the old kernel fails to initialize TDX, the new kernel cannot
> >     use TDX too unless the new kernel fixes the bug which leads to
> >     initialization failure in the old kernel and can resume from where
> >     the old kernel stops. This requires certain coordination between
> >     the two kernels.
> 
> OK, but what does this *MEAN*?

This means we would need to extend the information that the old kernel passes
to the new kernel.  But I don't think that is feasible.  I'll refine this
kexec() section to make it more concise in the next version.

> 
> >   - If the old kernel has initialized TDX successfully, the new kernel
> >     may be able to use TDX if the two kernels have the exactly same
> >     configurations on the TDX module. It further requires the new kernel
> >     to reserve the TDX metadata pages (allocated by the old kernel) in
> >     its page allocator. It also requires coordination between the two
> >     kernels.  Furthermore, if kexec() is done when there are active TD
> >     guests running, the new kernel cannot use TDX because it's extremely
> >     hard for the old kernel to pass all TDX private pages to the new
> >     kernel.
> > 
> > Given that, this series doesn't support TDX after kexec() (except the
> > old kernel doesn't attempt to initialize TDX at all).
> > 
> > And this series doesn't shut down TDX module but leaves it open during
> > kexec().  It is because shutting down TDX module requires CPU being in
> > VMX operation but there's no guarantee of this during kexec().  Leaving
> > the TDX module open is not the best case, but it is OK since the new
> > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > at all).
> 
> tl;dr: kexec() doesn't work with this code.
> 
> Right?
> 
> That doesn't seem good.

In my understanding it can work.  We just need to flush the cache before
booting into the new kernel.


-- 
Thanks,
-Kai