From: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
To: "Huang, Kai" <kai.huang@intel.com>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	"mhkelley58@gmail.com" <mhkelley58@gmail.com>,
	"Cui, Dexuan" <decui@microsoft.com>
Cc: "cascardo@canonical.com" <cascardo@canonical.com>,
	"tim.gardner@canonical.com" <tim.gardner@canonical.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	"roxana.nicolescu@canonical.com" <roxana.nicolescu@canonical.com>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"haiyangz@microsoft.com" <haiyangz@microsoft.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"stefan.bader@canonical.com" <stefan.bader@canonical.com>,
	"nik.borisov@suse.com" <nik.borisov@suse.com>,
	"kys@microsoft.com" <kys@microsoft.com>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>,
	"sashal@kernel.org" <sashal@kernel.org>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"bp@alien8.de" <bp@alien8.de>, "x86@kernel.org" <x86@kernel.org>
Subject: Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init
Date: Wed, 6 Dec 2023 19:47:12 +0100	[thread overview]
Message-ID: <fa86fbd1-998b-456b-971f-a5a94daeca28@linux.microsoft.com> (raw)
In-Reply-To: <7b725783f1f9102c176737667bfec12f75099961.camel@intel.com>

On 05/12/2023 14:26, Huang, Kai wrote:
>>
>>>>>
>>>>> Hm. Okay.
>>>>>
>>>>> Can we take a step back? What is the bigger picture here? What
>>>>> enlightenment do you expect from the guest when everything is in place?
>>>>>
>>>>
>>>> All the functional enlightenments are already in place in the kernel and
>>>> everything works (correct me if I'm wrong, Dexuan/Michael). The enlightenments
>>>> are that TDX VMCALLs are needed for MSR manipulation and vmbus operations,
>>>> the encrypted bit needs to be manipulated in the page tables, and page
>>>> visibility needs to be propagated to the VMM.
>>>
>>> Not quite familiar with hyperv enlightenments, but are these enlightenments
>>> TDX guest specific?  Because if they are not, then normal hyperv should be
>>> able to emulate them, thus hyperv as L1 (which is a TDX guest) can emulate
>>> them w/o letting L2 know that the hypervisor it runs on is actually a TDX
>>> guest.
>>
>> I would say that these hyperv enlightenments are confidential guest specific
>> (TDX/SNP) when running with TD-partitioning/VMPL. In both cases there are TDX/SNP
>> specific ways to exit directly to L0 (when needed) and native privileged instructions
>> trap to the paravisor.
>>
>> L1 is not hyperv and no one wants to emulate the I/O path. The L2 guest knows that
>> it's confidential so that it can explicitly use swiotlb, toggle page visibility
>> and notify the host (L0) on the I/O path without incurring additional emulation
>> overhead.
>>
>>>
>>> Btw, even if there's a performance concern here, as you mentioned the
>>> TDVMCALL is actually made to L0, which means L0 must be aware that such a
>>> VMCALL is from L2 and needs to be injected into L1 to handle. IMHO that not
>>> only complicates L0 but also may not have any performance benefit.
>>
>> The TDVMCALLs are related to the I/O path (networking/block io) into the L2 guest, and
>> so they intentionally go straight to L0 and are never injected to L1. L1 is not
>> involved in that path at all.
>>
>> Using something different than TDVMCALLs here would lead to additional traps to L1 and
>> just add latency/complexity.
> 
> It looks like by default you assume we should use TDX partitioning as
> "paravisor L1" + "L0 device I/O emulation".
> 

I don't actually want to impose this model on anyone, but it is the one that
could use some refactoring. I intend to rework these patches so that decisions
no longer hinge on a single "td_partitioning_active" flag.

> I think we are lacking the background on this usage model and how it works.
> For instance, typically L2 is created by L1, and L1 is responsible for L2's
> device I/O emulation.  I don't quite understand how L0 could emulate L2's
> device I/O.
> 
> Can you provide more information?

Let's differentiate between fast and slow I/O. The whole point of the paravisor
in L1 is to provide device emulation for slow I/O: TPM, RTC, NVRAM, IO-APIC,
serial ports.

Fast I/O, on the other hand, is designed to bypass it and go straight to L0.
Hyper-V uses paravirtual vmbus devices for fast I/O (net/block). The vmbus
protocol has page-visibility awareness built in and uses the native mechanisms
(GHCI on TDX, GHCB on SNP) for notifications. So once everything is set up
(rings/buffers in swiotlb), the I/O for fast devices does not involve L1 at
all. This is only possible when the VM manages the C-bit itself.

I think the same thing could work for virtio if someone were to "enlighten" the
vring notification calls (instead of using I/O or MMIO instructions).
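As a purely hypothetical sketch of what that could look like (notify_l0() is
invented here - the real doorbell would be whatever GHCI/GHCB-based mechanism
the host side defines, nothing like this exists today):

#include <linux/virtio.h>

/* hypothetical GHCI/GHCB-backed doorbell to L0, does not exist today */
int notify_l0(struct virtio_device *vdev, unsigned int queue_index);

/*
 * An "enlightened" vring notify callback: instead of an I/O port or MMIO
 * write that would trap, ring the doorbell with a hypercall straight to
 * L0.  This would be registered as the notify callback when the transport
 * creates its virtqueues (e.g. via vring_create_virtqueue()).
 */
static bool vring_notify_via_hypercall(struct virtqueue *vq)
{
        /* L0 only needs to learn which device/queue was kicked */
        return notify_l0(vq->vdev, vq->index) == 0;
}

The rest of the ring setup would stay as it is, provided the rings and buffers
themselves live in shared (decrypted) memory.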

> 
>>
>>>
>>>>
>>>> What's missing is that the tdx_guest flag is not exposed to userspace in
>>>> /proc/cpuinfo, and as a result dmesg does not currently display:
>>>> "Memory Encryption Features active: Intel TDX".
>>>>
>>>> That's what I set out to correct.
>>>>
>>>>> So far I see that you are trying to get the kernel to think that it runs
>>>>> as a TDX guest, but it doesn't really. This is not a very convincing model.
>>>>>
>>>>
>>>> No, that's not accurate at all. The kernel is running as a TDX guest, so I
>>>> want the kernel to know that.
>>>>
>>>
>>> But it isn't.  It runs on a hypervisor which is a TDX guest, but this
>>> doesn't make it a TDX guest itself.
>>
>> That depends on your definition of "TDX guest". The TDX 1.5 TD partitioning spec
>> talks of TDX-enlightened L1 VMM, (optionally) TDX-enlightened L2 VM and Unmodified
>> Legacy L2 VM. Here we're dealing with a TDX-enlightened L2 VM.
>>
>> If a guest runs inside an Intel TDX protected TD, is aware of memory encryption and
>> issues TDVMCALLs - to me that makes it a TDX guest.
> 
> The thing I don't quite understand is which enlightenment(s) require L2 to
> issue TDVMCALLs and know the "encryption bit".
> 
> The reason that I can think of is:
> 
> If device I/O emulation of L2 is done by L0 then I guess it's reasonable to
> make L2 aware of the "encryption bit", because L0 can only write emulated
> data to a shared buffer.  The shared buffer must initially be converted by
> L2 using the MAP_GPA TDVMCALL to L0 (to zap private pages in the S-EPT
> etc.), and L2 needs to know the "encryption bit" to set up its page tables
> properly.  L1 must be aware of such private <-> shared conversions too, in
> order to set up its own page tables properly, so L1 must also be notified.

Your description is correct, except that L2 uses a hypercall
(hv_mark_gpa_visibility()) to notify L1, and L1 issues the MAP_GPA TDVMCALL
to L0.

C-bit awareness is necessary to set up the whole swiotlb pool as host-visible
for DMA.
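In rough terms, the L2 side of that flow looks like this (a simplified sketch
only - error handling and the real x86_platform callback plumbing are
omitted):

#include <linux/set_memory.h>

/*
 * Simplified sketch: make a buffer visible to the host before using it
 * for DMA/emulated I/O.
 */
static int share_buffer_with_host(void *buf, int numpages)
{
        /*
         * set_memory_decrypted() adjusts the encryption attribute in L2's
         * page tables and, on Hyper-V, ends up in hv_mark_gpa_visibility():
         * the hypercall telling the L1 paravisor that these GPAs should be
         * host-visible.  L1 then issues the MAP_GPA TDVMCALL to L0, which
         * converts the pages from private to shared.
         */
        return set_memory_decrypted((unsigned long)buf, numpages);
}

The swiotlb pool goes through the same conversion once at boot, which is why
the C-bit awareness is needed that early.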

> 
> The concern I am having is whether there are other usage model(s) that we
> need to consider.  For instance, running both an unmodified L2 and an
> enlightened L2.  Or an L2 that only needs the TDVMCALL enlightenment but not
> the "encryption bit".
> 

Presumably unmodified L2 and enlightened L2 are already covered by the current
code, but they require excessive trapping to L1.

I can't see a use case for TDVMCALLs without the "encryption bit".

> In other words, that seems pretty much L1 hypervisor/paravisor
> implementation specific.  I am wondering whether we can completely hide the
> enlightenment logic in hypervisor/paravisor specific code rather than
> generically marking L2 as a TDX guest, while still disabling TDCALL and that
> sort of thing.

That's how it currently works - all the enlightenments are in
hypervisor/paravisor specific code in arch/x86/hyperv and drivers/hv, and the
VM is not marked with X86_FEATURE_TDX_GUEST.

But without X86_FEATURE_TDX_GUEST, userspace has no unified way to discover
that the environment is protected by TDX, and the VM also gets classified as
"AMD SEV" in dmesg. This is because CC_ATTR_GUEST_MEM_ENCRYPT is set but
X86_FEATURE_TDX_GUEST is not.
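The dmesg classification comes from logic roughly like the following
(paraphrased from the current mem_encrypt feature-info printout, not a
verbatim copy):

static void print_mem_encrypt_feature_info(void)
{
        pr_info("Memory Encryption Features active:");

        /* Never taken on the partitioned L2 today: the flag isn't set */
        if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
                pr_cont(" Intel TDX\n");
                return;
        }

        pr_cont(" AMD");

        /*
         * This is what fires instead, because the Hyper-V vTOM setup
         * advertises CC_ATTR_GUEST_MEM_ENCRYPT - hence "AMD SEV".
         */
        if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
                pr_cont(" SEV");

        pr_cont("\n");
}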

> 
> Hope we are getting closer to be on the same page.
> 

I feel we are getting there.

