Re: [PATCH v1 00/15] Add support for Nitro Enclaves

From: Liran Alon <liran.alon@oracle.com>
To: "Paraschiv, Andra-Irina" <andraprs@amazon.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	linux-kernel@vger.kernel.org
Cc: Anthony Liguori <aliguori@amazon.com>,
	Benjamin Herrenschmidt <benh@amazon.com>,
	Colm MacCarthaigh <colmmacc@amazon.com>,
	Bjoern Doebel <doebel@amazon.de>,
	David Woodhouse <dwmw@amazon.co.uk>,
	Frank van der Linden <fllinden@amazon.com>,
	Alexander Graf <graf@amazon.de>,
	Martin Pohlack <mpohlack@amazon.de>, Matt Wilson <msw@amazon.com>,
	Balbir Singh <sblbir@amazon.com>,
	Stewart Smith <trawets@amazon.com>,
	Uwe Dannowski <uwed@amazon.de>,
	kvm@vger.kernel.org, ne-devel-upstream@amazon.com
Subject: Re: [PATCH v1 00/15] Add support for Nitro Enclaves
Date: Mon, 27 Apr 2020 14:44:09 +0300
Message-ID: <26111e31-8ff5-8358-1e05-6d7df0441ab1@oracle.com>
In-Reply-To: <eb92ba4e-113e-d7ec-4633-f6b5ac54796b@amazon.com>

On 27/04/2020 10:56, Paraschiv, Andra-Irina wrote:
>
> On 25/04/2020 18:25, Liran Alon wrote:
>>
>> On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote:
>>>
>>> The memory and CPUs are carved out of the primary VM, they are 
>>> dedicated for the enclave. The Nitro hypervisor running on the host 
>>> ensures memory and CPU isolation between the primary VM and the 
>>> enclave VM.
>> I hope you properly take into consideration Hyper-Threading 
>> speculative side-channel vulnerabilities here.
>> i.e. Usually cloud providers designate each CPU core to be assigned 
>> to run only vCPUs of specific guest. To avoid sharing a single CPU 
>> core between multiple guests.
>> To handle this properly, you need to use some kind of core-scheduling 
>> mechanism (Such that each CPU core either runs only vCPUs of enclave 
>> or only vCPUs of primary VM at any given point in time).
>>
>> In addition, can you elaborate more on how the enclave memory is 
>> carved out of the primary VM?
>> Does this involve performing a memory hot-unplug operation from 
>> primary VM or just unmap enclave-assigned guest physical pages from 
>> primary VM's SLAT (EPT/NPT) and map them now only in enclave's SLAT?
>
> Correct, we take into consideration the HT setup. The enclave gets 
> dedicated physical cores. The primary VM and the enclave VM don't run 
> on CPU siblings of a physical core.
The way I would imagine this to work is that Primary-VM just specifies 
how many vCPUs the Enclave-VM will have, and those vCPUs will be set with 
affinity to run on the same physical CPU cores as Primary-VM.
But with the exception that the scheduler is modified to not run vCPUs of 
Primary-VM and Enclave-VM as siblings on the same physical CPU core 
(core-scheduling). i.e. This is different from Primary-VM losing
physical CPU cores permanently as long as the Enclave-VM is running.
Or maybe this should even be controlled by a knob in the virtual PCI device 
interface, to give the customer the flexibility to decide whether Enclave-VM 
needs dedicated CPU cores or whether it is OK to share them with Primary-VM
as long as core-scheduling is used to guarantee proper isolation.
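
To make the core-scheduling constraint concrete, here is a minimal sketch of the rule I have in mind. It is only an illustration under stated assumptions: the two-sibling topology, the `core_state` layout and the function names are hypothetical and are not taken from the patch series or from any existing scheduler.

```python
from typing import Dict, List, Optional

# Hypothetical model: core id -> owner of each hyperthread sibling.
# None means the sibling is idle; a string is the name of the VM running there.
CoreState = Dict[int, List[Optional[str]]]


def can_schedule(core_state: CoreState, core: int, vm: str) -> bool:
    """Return True if a vCPU of `vm` may run on `core`.

    The core must have an idle sibling, and every busy sibling must already
    belong to the same VM, so two VMs never share one physical core.
    """
    siblings = core_state[core]
    has_idle = any(owner is None for owner in siblings)
    same_vm_only = all(owner in (None, vm) for owner in siblings)
    return has_idle and same_vm_only


def schedule(core_state: CoreState, core: int, vm: str) -> bool:
    """Place a vCPU of `vm` on the first idle sibling of `core` if allowed."""
    if not can_schedule(core_state, core, vm):
        return False
    idle_index = core_state[core].index(None)
    core_state[core][idle_index] = vm
    return True
```

With this rule the Primary-VM does not lose a core permanently: once all siblings of a core go idle, either VM may claim it again.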
>
> Regarding the memory carve out, the logic includes page table entries 
> handling.
As I thought. Thanks for confirmation.
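
For clarity, this is roughly the SLAT-based carve-out I was asking about, as a toy sketch. Everything here is an assumption for illustration: the SLAT is modelled as a plain dict from guest frame number to host frame number, and the names `carve_out` and `Slat` are hypothetical. It does not describe how the Nitro hypervisor actually implements this.

```python
from typing import Dict, Iterable

# Hypothetical model of a SLAT (EPT/NPT):
# guest-physical frame number -> host-physical frame number.
Slat = Dict[int, int]


def carve_out(primary: Slat, enclave: Slat, gfns: Iterable[int]) -> None:
    """Move the host frames backing `gfns` from the primary VM to the enclave.

    Each frame is unmapped from the primary VM's SLAT and mapped only in the
    enclave's SLAT, at the next free enclave guest frame number.
    """
    gfns = list(gfns)
    missing = [gfn for gfn in gfns if gfn not in primary]
    if missing:
        # Refuse the whole request rather than carving out a partial range.
        raise KeyError(f"frames not mapped in primary VM: {missing}")
    next_enclave_gfn = max(enclave, default=-1) + 1
    for offset, gfn in enumerate(gfns):
        enclave[next_enclave_gfn + offset] = primary.pop(gfn)
```

After the call, no host frame is reachable from both SLATs, which is the isolation property that matters here.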
>
> IIRC, memory hot-unplug can be used for the memory blocks that were 
> previously hot-plugged.
>
> https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html
>
>>
>> I don't quite understand why Enclave VM needs to be 
>> provisioned/teardown during primary VM's runtime.
>>
>> For example, an alternative could have been to just provision both 
>> primary VM and Enclave VM on primary VM startup.
>> Then, wait for primary VM to setup a communication channel with 
>> Enclave VM (E.g. via virtio-vsock).
>> Then, primary VM is free to request Enclave VM to perform various 
>> tasks when required on the isolated environment.
>>
>> Such setup will mimic a common Enclave setup. Such as Microsoft 
>> Windows VBS EPT-based Enclaves (That all runs on VTL1). It is also 
>> similar to TEEs running on ARM TrustZone.
>> i.e. In my alternative proposed solution, the Enclave VM is similar 
>> to VTL1/TrustZone.
>> It will also avoid requiring introducing a new PCI device and driver.
>
> True, this can be another option, to provision the primary VM and the 
> enclave VM at launch time.
>
> In the proposed setup, the primary VM starts with the initial 
> allocated resources (memory, CPUs). The launch path of the enclave VM, 
> as it's spawned on the same host, is done via the ioctl interface - 
> PCI device - host hypervisor path. Short-running or long-running 
> enclave can be bootstrapped during primary VM lifetime. Depending on 
> the use case, a custom set of resources (memory and CPUs) is set for 
> an enclave and then given back when the enclave is terminated; these 
> resources can be used for another enclave spawned later on or the 
> primary VM tasks.
>
Yes, I already understood this is how the mechanism works. I'm 
questioning whether this is indeed a good approach that should also be 
taken by upstream.

The use-case of using Nitro Enclaves is for a Confidential-Computing 
service. i.e. The ability to provision a compute instance that can be 
trusted to perform a bunch of computation on sensitive
information with high confidence that it cannot be compromised as it's 
highly isolated. Some technologies such as Intel SGX and AMD SEV 
attempted to achieve this even with guarantees that
the computation is isolated from the hardware and hypervisor itself.

I would have expected that for the vast majority of real customer 
use-cases, the customer will provision a compute instance that runs some 
confidential-computing task in an enclave which it
keeps running for the entire life-time of the compute instance. As the 
sole purpose of the compute instance is to just expose a service that 
performs some confidential-computing task.
For those cases, it should have been sufficient to just pre-provision a 
single Enclave-VM that performs this task, together with the compute 
instance and connect them via virtio-vsock.
Without introducing any new virtual PCI device, guest PCI driver and 
unique semantics of stealing resources (CPUs and Memory) from primary-VM 
at runtime.

In this Nitro Enclave architecture, we de-facto put Compute 
control-plane abilities in the hands of the guest VM. Instead of 
introducing new control-plane primitives that allow building
the data-plane architecture desired by the customer in a flexible manner.
* What if the customer prefers to have its Enclave VM polling S3 bucket 
for new tasks and produce results to S3 as well? Without having any 
"Primary-VM" or virtio-vsock connection of any kind?
* What if for some use-cases customer wants Enclave-VM to have dedicated 
compute power (i.e. Not share physical CPU cores with primary-VM. Not 
even with core-scheduling) but for other
use-cases, customer prefers to share physical CPU cores with Primary-VM 
(Together with core-scheduling guarantees)? (Although this could be 
addressed by extending the virtual PCI device
interface with a knob to control this)

An alternative would have been to have the following new control-plane 
primitives:
* Ability to provision a VM without boot-volume, but instead from an 
Image that is used to boot from memory. Allowing to provision disk-less VMs.
   (E.g. Can be useful for other use-cases such as VMs not requiring EBS 
at all which could allow cheaper compute instance)
* Ability to provision a group of VMs together as a group such that they 
are guaranteed to launch as sibling VMs on the same host.
* Ability to create a fast-path connection between sibling VMs on the 
same host with virtio-vsock. Or even with another shared-memory mechanism.
* Extend AWS Fargate with ability to run multiple microVMs as a group 
(Similar to above) connected with virtio-vsock. To allow on-demand scale 
of confidential-computing task.

Having said that, I do see a similar architecture to Nitro Enclaves 
virtual PCI device used for a different purpose: For hypervisor-based 
security isolation (Such as Windows VBS).
E.g. Linux boot-loader can detect the presence of this virtual PCI 
device and use it to provision multiple VM security domains. Such that 
when a security domain is created,
it is specified which hardware resources it has access to (Guest 
memory pages, IOPorts, MSRs, etc.) and the blob it should run to 
bootstrap. Similar to, but superior to,
Hyper-V VSM. In addition, some security domains will be given special 
abilities to control other security domains (For example, to control the 
+XS,+XU EPT bits of other security
domains to enforce code-integrity. Similar to Windows VBS HVCI). Just an 
idea... :)

-Liran