Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init

From: "Huang, Kai" <kai.huang@intel.com>
To: "kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	"mhkelley58@gmail.com" <mhkelley58@gmail.com>,
	"Cui, Dexuan" <decui@microsoft.com>,
	"jpiotrowski@linux.microsoft.com"
	<jpiotrowski@linux.microsoft.com>
Cc: "cascardo@canonical.com" <cascardo@canonical.com>,
	"tim.gardner@canonical.com" <tim.gardner@canonical.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	"roxana.nicolescu@canonical.com" <roxana.nicolescu@canonical.com>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"haiyangz@microsoft.com" <haiyangz@microsoft.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"stefan.bader@canonical.com" <stefan.bader@canonical.com>,
	"nik.borisov@suse.com" <nik.borisov@suse.com>,
	"kys@microsoft.com" <kys@microsoft.com>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>,
	"sashal@kernel.org" <sashal@kernel.org>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"bp@alien8.de" <bp@alien8.de>, "x86@kernel.org" <x86@kernel.org>
Subject: Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init
Date: Tue, 5 Dec 2023 13:26:04 +0000	[thread overview]
Message-ID: <7b725783f1f9102c176737667bfec12f75099961.camel@intel.com> (raw)
In-Reply-To: <02e079e8-cc72-49d8-9191-8a753526eb18@linux.microsoft.com>

> 
> > > > 
> > > > Hm. Okay.
> > > > 
> > > > Can we take a step back? What is bigger picture here? What enlightenment
> > > > do you expect from the guest when everything is in-place?
> > > > 
> > > 
> > > All the functional enlightenment are already in place in the kernel and
> > > everything works (correct me if I'm wrong Dexuan/Michael). The enlightenments
> > > are that TDX VMCALLs are needed for MSR manipulation and vmbus operations,
> > > encrypted bit needs to be manipulated in the page tables and page
> > > visibility propagated to VMM.
> > 
> > Not quite family with hyperv enlightenments, but are these enlightenments TDX
> > guest specific?  Because if they are not, then they should be able to be
> > emulated by the normal hyperv, thus the hyperv as L1 (which is TDX guest) can
> > emulate them w/o letting the L2 know the hypervisor it runs on is actually a TDX
> > guest.
> 
> I would say that these hyperv enlightenments are confidential guest specific
> (TDX/SNP) when running with TD-partitioning/VMPL. In both cases there are TDX/SNP
> specific ways to exit directly to L0 (when needed) and native privileged instructions
> trap to the paravisor.
> 
> L1 is not hyperv and no one wants to emulate the I/O path. The L2 guest knows that
> it's confidential so that it can explicitly use swiotlb, toggle page visibility
> and notify the host (L0) on the I/O path without incurring additional emulation
> overhead.
> 
> > 
> > Btw, even if there's performance concern here, as you mentioned the TDVMCALL is
> > actually made to the L0 which means L0 must be aware such VMCALL is from L2 and
> > needs to be injected to L1 to handle, which IMHO not only complicates the L0 but
> > also may not have any performance benefits.
> 
> The TDVMCALLs are related to the I/O path (networking/block io) into the L2 guest, and
> so they intentionally go straight to L0 and are never injected to L1. L1 is not
> involved in that path at all.
> 
> Using something different than TDVMCALLs here would lead to additional traps to L1 and
> just add latency/complexity.

Looks by default you assume we should use TDX partitioning as "paravisor L1" +
"L0 device I/O emulation".

I think we are lacking background of this usage model and how it works.  For
instance, typically L2 is created by L1, and L1 is responsible for L2's device
I/O emulation.  I don't quite understand how could L0 emulate L2's device I/O?

Can you provide more information?

> 
> > 
> > > 
> > > Whats missing is the tdx_guest flag is not exposed to userspace in /proc/cpuinfo,
> > > and as a result dmesg does not currently display:
> > > "Memory Encryption Features active: Intel TDX".
> > > 
> > > That's what I set out to correct.
> > > 
> > > > So far I see that you try to get kernel think that it runs as TDX guest,
> > > > but not really. This is not very convincing model.
> > > > 
> > > 
> > > No that's not accurate at all. The kernel is running as a TDX guest so I
> > > want the kernel to know that. 
> > > 
> > 
> > But it isn't.  It runs on a hypervisor which is a TDX guest, but this doesn't
> > make itself a TDX guest.> 
> 
> That depends on your definition of "TDX guest". The TDX 1.5 TD partitioning spec
> talks of TDX-enlightened L1 VMM, (optionally) TDX-enlightened L2 VM and Unmodified
> Legacy L2 VM. Here we're dealing with a TDX-enlightened L2 VM.
> 
> If a guest runs inside an Intel TDX protected TD, is aware of memory encryption and
> issues TDVMCALLs - to me that makes it a TDX guest.

The thing I don't quite understand is what enlightenment(s) requires L2 to issue
TDVMCALL and know "encryption bit".

The reason that I can think of is:

If device I/O emulation of L2 is done by L0 then I guess it's reasonable to make
L2 aware of the "encryption bit" because L0 can only write emulated data to
shared buffer.  The shared buffer must be initially converted by the L2 by using
MAP_GPA TDVMCALL to L0 (to zap private pages in S-EPT etc), and L2 needs to know
the "encryption bit" to set up its page table properly.  L1 must be aware of
such private <-> shared conversion too to setup page table properly so L1 must
also be notified.

The concern I am having is whether there's other usage model(s) that we need to
consider.  For instance, running both unmodified L2 and enlightened L2.  Or some
L2 only needs TDVMCALL enlightenment but no "encryption bit".

In other words, that seems pretty much L1 hypervisor/paravisor implementation
specific.  I am wondering whether we can completely hide the enlightenment(s)
logic to hypervisor/paravisor specific code but not generically mark L2 as TDX
guest but still need to disable TDCALL sort of things.

Hope we are getting closer to be on the same page.