* [RFC PATCH 00/10] Basic KVM SGX Virtualization support
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

Hi,

This RFC series is the KVM part of basic KVM SGX virtualization support (EPC
static partitioning + Launch Control + SGX2 support). Qemu also needs to be
changed to support KVM SGX virtualization; the Qemu part will be sent out
separately in the future.

You can also find this series and the Qemu changes at the github repos below:

https://github.com/01org/kvm-sgx.git
https://github.com/01org/qemu-sgx.git

KVM SGX virtualization needs to work with the host SGX driver (explained
below), which has not been upstreamed yet, so part of this series depends on
the SGX driver. You can find the SGX driver at the github repo below.

https://github.com/jsakkine-intel/linux-sgx

The SGX specification can be found in the latest Intel SDM as Volume 3D (below).

https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf

The SGX specification is relatively complicated (it takes the entire Volume
3D) and it is unrealistic to list all hardware details here. Below is a brief
SGX overview (which I think is necessary background for discussing the design)
and the high level design. Please help to review and give comments. Thanks!

============================   SGX Overview   ===========================

- Enclave

Intel Software Guard Extensions (SGX) is a set of instructions and memory
access mechanisms that provide security protection for sensitive applications
and data. SGX allows an application to set aside part of its address space as
an *enclave*, a protected area that provides confidentiality and integrity
even in the presence of privileged malware. Accesses to the enclave memory
area from any software not resident in the enclave are prevented, including
accesses from privileged software. The diagram below illustrates the presence
of an enclave in an application.

	|-----------------------|
	|                       |
	|   |---------------|   |
	|   |   OS kernel   |   |       |-----------------------|
	|   |---------------|   |       |                       |
	|   |               |   |       |   |---------------|   |
	|   |---------------|   |       |   | Entry table   |   |
	|   |   Enclave     |---|-----> |   |---------------|   |
	|   |---------------|   |       |   | Enclave stack |   |
	|   |   App code    |   |       |   |---------------|   |
	|   |---------------|   |       |   | Enclave heap  |   |
	|   |   Enclave     |   |       |   |---------------|   |
	|   |---------------|   |       |   | Enclave code  |   |
	|   |   App code    |   |       |   |---------------|   |
	|   |---------------|   |       |                       |
	|           |           |       |-----------------------|
	|-----------------------|

SGX comes in two sets of extensions: SGX1 provides basic enclave support, and
SGX2 adds flexibility in runtime management of enclave resources and thread
execution within an enclave.

- Enclave Page Cache

The Enclave Page Cache (EPC) is the physical resource used to back enclave
memory. EPC is divided into 4K pages, each always aligned on a 4K boundary.
Hardware performs additional access control checks to restrict access to EPC
pages. The Enclave Page Cache Map (EPCM) is a secure structure, invisible to
software, that holds one entry per EPC page and is used by hardware to track
the status of each EPC page. Typically EPC and EPCM are reserved by BIOS as
Processor Reserved Memory, but the actual amount, size, and layout of EPC are
model-specific and depend on BIOS settings. EPC is enumerated via the new SGX
CPUID, and is reported as reserved memory.

An EPC page is either invalid or valid. There are 4 valid EPC page types in
SGX1: regular EPC page, SGX Enclave Control Structure (SECS) page, Thread
Control Structure (TCS) page, and Version Array (VA) page. SGX2 adds the
Trimmed EPC page. Each enclave is associated with one SECS page, and each
thread in an enclave is associated with one TCS page. VA pages are used in
EPC page eviction and reload. A Trimmed EPC page is a 4K enclave page that is
about to be freed (trimmed).

- ENCLS and ENCLU

Two new instructions, ENCLS and ENCLU, are introduced to manage enclaves and
EPC. ENCLS can only run in ring 0, while ENCLU can only run in ring 3. Both
have multiple leaf functions, with EAX indicating the specific leaf function.
The specification of ENCLS and ENCLU can be found in SDM Chapter 41, SGX
Instruction References.
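
For illustration, a minimal sketch (not code from this series; the 'encls'
helper is made up) of invoking an ENCLS leaf from C, with EAX selecting the
leaf function and RBX/RCX/RDX carrying leaf-specific parameters:

	/* Illustrative sketch only: ENCLS is a single opcode (0F 01 CF). */
	static inline int encls(u32 leaf, unsigned long rbx, unsigned long rcx,
				unsigned long rdx)
	{
		int ret;

		asm volatile(".byte 0x0f, 0x01, 0xcf"
			     : "=a" (ret)
			     : "a" (leaf), "b" (rbx), "c" (rcx), "d" (rdx)
			     : "memory", "cc");

		return ret;	/* 0 on success, or an SGX error code */
	}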

- Discovering SGX capability

CPUID.0x7.0:EBX.SGX[bit 2] reports the availability of SGX, and detailed SGX
info can be enumerated via the new CPUID leaf 0x12.

CPUID.0x12.0x0 enumerates SGX capability (ex, SGX1, SGX2), including enclave
instruction opcode support. CPUID.0x12.0x1 enumerates SGX capability of
processor state configuration and enclave configuration in the SECS structure.
CPUID.0x12.0x2 (and following indexes, if valid) enumerates EPC resources:
starting from CPUID.0x12.0x2, each index reports one valid EPC section (base,
size), until CPUID reports an invalid EPC section. Typically multiple EPC
sections only exist on multi-socket server machines (which don't exist yet),
and a client machine or single-socket server reports just one EPC section.

Please refer to Chapter 37.7.2 Intel SGX Resource Enumeration Leaves for
detailed info of SGX CPUID.
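
As an illustrative sketch (kernel-style C; bit layouts per the SDM chapter
above), the EPC enumeration loop looks roughly like:

	u32 eax, ebx, ecx, edx;
	u64 base, size;
	int i;

	/* Walk CPUID.0x12 sub-leaves starting at 2 to find EPC sections. */
	for (i = 2; ; i++) {
		cpuid_count(0x12, i, &eax, &ebx, &ecx, &edx);
		if ((eax & 0xf) != 0x1)	/* sub-leaf type 1: valid EPC section */
			break;
		base = ((u64)(ebx & 0xfffff) << 32) | (eax & 0xfffff000);
		size = ((u64)(edx & 0xfffff) << 32) | (ecx & 0xfffff000);
		/* (base, size) describes one physical EPC section */
	}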

On processors that support SGX, SGX can also be opted in or out via the
SGX_ENABLE bit (bit 18) of the IA32_FEATURE_CONTROL MSR. If the SGX_ENABLE
bit is cleared while IA32_FEATURE_CONTROL is locked, SGX is disabled on the
processor. SGX CPUID leaf 0x12 is still available if SGX is opted out via
IA32_FEATURE_CONTROL. The SDM doesn't specify the exact info that SGX CPUID
0x12 will report in this case, but likely it will report invalid SGX info. If
SGX is opted in, SGX CPUID 0x12 reports valid SGX info. ENCLS and ENCLU will
either #UD or #GP, depending on the values of CPUID.0x7.0:EBX.SGX,
IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK.

Please refer to Chapter 37.7.1 Intel SGX Opt-in Configuration for detailed info.
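
A minimal sketch of the opt-in check, assuming the bit positions above (the
helper name is made up; SGX_ENABLE has no kernel define yet):

	/* Illustrative sketch only. */
	static bool sgx_opted_in(void)
	{
		u64 fc;

		rdmsrl(MSR_IA32_FEATURE_CONTROL, fc);

		/* SGX is usable only if bit 18 is set and the MSR is locked. */
		return (fc & FEATURE_CONTROL_LOCKED) && (fc & (1ULL << 18));
	}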

- SGX Launch Control

On processors that support SGX, the IA32_SGXLEPUBKEYHASH[0-3] MSRs contain the
hash of an RSA public key. A Launch Enclave (LE) can only run if it is signed
with the corresponding RSA private key. Without SGX Launch Control, hardware
can only run an LE signed with Intel's RSA key. SGX Launch Control allows
software to change IA32_SGXLEPUBKEYHASHn at runtime, allowing the processor
to run a 3rd party's LE.

SGX Launch Control adds a new SGX_LAUNCH_CONTROL_ENABLE bit (bit 17) to the
IA32_FEATURE_CONTROL MSR. If SGX_LAUNCH_CONTROL_ENABLE[bit 17] is 1,
IA32_SGXLEPUBKEYHASHn remain writable at runtime after
IA32_FEATURE_CONTROL.LOCK is set; otherwise they are read-only. Typically
BIOS allows the user to set up a 3rd party's IA32_SGXLEPUBKEYHASHn before
IA32_FEATURE_CONTROL is locked, and to choose whether IA32_SGXLEPUBKEYHASHn
remain changeable at runtime as well. However this depends on the BIOS
implementation.

CPUID.0x7.0:ECX[bit 30] reports the availability of bit 17 of
IA32_FEATURE_CONTROL, meaning the processor only supports SGX Launch Control
when CPUID.0x7.0:ECX[bit 30] is 1.
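
As a sketch (the MSR index macro and helper are made up; 0x8C is
IA32_SGXLEPUBKEYHASH0 per the SDM), switching the launch key hash at runtime
would look like:

	#define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008c

	/* Only legal if IA32_FEATURE_CONTROL bit 17 was set before locking. */
	static void set_le_pubkey_hash(const u64 hash[4])
	{
		int i;

		for (i = 0; i < 4; i++)
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);
	}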

- SGX interaction with VMX

A new 64-bit ENCLS-exiting bitmap control field is added to the VMCS (encoding
202EH) to control VMEXIT on ENCLS leaf functions, and a new "Enable ENCLS
exiting" control bit (bit 15) is defined in the secondary processor-based VM
execution controls. Setting "Enable ENCLS exiting" to 1 enables the
ENCLS-exiting bitmap control.

Support for the 1-setting of "Enable ENCLS exiting" is enumerated from
IA32_VMX_PROCBASED_CTLS2[bit 47], which tracks CPUID.[EAX=0x7,ECX=0].EBX.SGX.

A new ENCLS VM exit reason (60) is also added to the Basic Exit Reasons.

The pseudocode below shows how the above execution control works:

    IF ( (in VMX non-root operation) and (Enable_ENCLS_EXITING = 1) )
        Then
            IF ( ((EAX < 63) and (ENCLS_EXITING_Bitmap[EAX] = 1)) or
                 ((EAX > 62) and (ENCLS_EXITING_Bitmap[63] = 1)) )
            Then
                Set VMCS.EXIT_REASON = ENCLS;
                Deliver VM exit;
            FI;
    FI;

VM exits that originate within an enclave set the following two bits before
delivering the VM exit to the VMM:
    - Bit 27 in the Exit Reason field of the Basic VM-exit information.
    - Bit 4 in the Interruptibility State of the Guest Non-Register State of
      the VMCS.

Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information, and
27.3.4 Saving Non-Register State.
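
In KVM terms, enabling the trap looks roughly like the sketch below
(SECONDARY_EXEC_ENCLS_EXITING is defined in patch 02 of this series;
ENCLS_EXITING_BITMAP here stands in for the new 202EH VMCS field):

	/* Illustrative sketch: trap on every ENCLS leaf. */
	vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_ENCLS_EXITING);
	vmcs_write64(ENCLS_EXITING_BITMAP, ~0ULL);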

=========================   High Level Design   ==========================

- Qemu Changes

EPC is a limited resource. Typically the EPC and EPCM together are 32M, 64M,
or 128M, configurable in BIOS. In order to use EPC more efficiently between
different KVM guests, we add additional Qemu parameters to allow the
administrator to specify a guest's EPC size when it is created. We also add
two additional parameters for SGX Launch Control. Specifically, the following
SGX parameters are added:

	# qemu-system-x86_64 -sgx epc=<size>,lehash='256-bit value string',lewr

The 'epc' parameter specifies the guest's EPC size; any MB-aligned value is
supported. 'lehash' specifies the guest's initial IA32_SGXLEPUBKEYHASHn
value, and 'lewr' specifies whether the guest's IA32_SGXLEPUBKEYHASHn are
writable by the guest OS. 'epc' is mandatory; both 'lehash' and 'lewr' are
optional. Normally with 'lewr' specified, 'lehash' is not needed (the default
value is Intel's hash), as the guest OS is able to change
IA32_SGXLEPUBKEYHASHn as it wishes.

With the 'epc' parameter, Qemu is responsible for notifying KVM of the guest's
EPC base and size. The EPC base address is calculated by Qemu internally
(according to chip type, memory size, etc).

With 'lehash' specified, Qemu sets the guest's IA32_SGXLEPUBKEYHASHn to the
value specified. With 'lewr' specified, Qemu sets bit 17 of the guest's
IA32_FEATURE_CONTROL to 1.

- Expose SGX to guest

The SGX feature is exposed to the guest via the SGX CPUID. Looking at the SGX
CPUID, we can report the same CPUID info to the guest as on native for most of
the SGX CPUID. By reporting the same CPUID the guest is able to use the full
capacity of SGX, and KVM doesn't need to emulate that info.

There are two exceptions. The first is that KVM obviously cannot report
physical EPC to the guest, but should report the guest's (virtual) EPC base
and size (which are notified by Qemu as mentioned above). The second is
SECS.ATTRIBUTES, which is reported by CPUID.0x12.0x1:EAX-EDX. In particular,
it is SECS.ATTRIBUTES.XFRM (bits 127:64) that needs emulation. It reports
which XFRM bits can be set when creating an enclave via ENCLS[ECREATE]. As the
guest may not support all XFRM bits supported by hardware,
CPUID.0x12.0x1:ECX-EDX should also only report the guest's supported XFRM
bits.

All other CPUID info can be reported to the guest just as on native.

And we only report one EPC section to the guest (only CPUID.0x12.0x2 is
valid).
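
A sketch of the intended XFRM emulation (host_xfrm stands for what hardware's
CPUID.0x12.0x1:ECX-EDX reports; guest_supported_xcr0 is KVM's existing
per-vcpu mask of XSAVE features exposed to the guest):

	/* Illustrative sketch: clamp reported XFRM to what guest can use. */
	u64 guest_xfrm = host_xfrm & vcpu->arch.guest_supported_xcr0;

	entry->ecx = (u32)guest_xfrm;		/* XFRM bits 31:0  */
	entry->edx = (u32)(guest_xfrm >> 32);	/* XFRM bits 63:32 */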

- Initializing SGX for guest

As mentioned above, the guest's EPC base and size are determined by Qemu, and
KVM needs Qemu to notify it of that info before it can initialize SGX for the
guest. To avoid a new IOCTL for this purpose (ex, KVM_SET_EPC), KVM
initializes the guest's SGX in KVM_SET_CPUID2, where Qemu passes the guest's
SGX CPUID, which includes the guest's EPC base and size.

Also, the SDM says the SGX CPUID is actually thread-specific: software cannot
assume all logical processors report the same SGX CPUID. Initializing the
guest's SGX in KVM_SET_CPUID2 provides an opportunity for KVM to check whether
the SGX CPUID passed by Qemu is valid and consistent across all VCPUs.

- EPC management

On the host side there's the SGX driver, which serves host SGX applications
from userspace. It detects the SGX feature and manages all EPC pages. To work
with the SGX driver simultaneously, we have to use a 'unified model', in which
the SGX driver still manages EPC and KVM calls the driver's APIs to
allocate/free EPC pages, etc. However KVM cannot call the driver's APIs
directly: on machines without the SGX feature, the SGX driver won't be loaded,
and calling the driver's APIs directly would make KVM unable to be loaded
either. Instead, KVM uses symbol_get to get the driver's APIs at runtime,
which avoids this issue.

For KVM guests, there are two approaches to managing EPC: static partitioning
and oversubscription. In static partitioning, all EPC pages are allocated to
the guest when it is created and are freed only when the guest is destroyed.
In oversubscription, EPC pages are allocated to the guest on demand, and EPC
pages allocated to a guest can be evicted by KVM and reassigned to other
guests. An access to a guest EPC page with no physical EPC mapped causes an
EPT violation (or a PF in case of shadowing), upon which a physical EPC page
is allocated to the guest (and reloaded into the enclave if required).

-- Static partitioning

Static partitioning is a simple approach. KVM only needs to allocate all EPC
pages when the guest is created and set up the mapping. All ENCLS leaf
functions will run perfectly in the guest, so KVM doesn't need to turn on
ENCLS VMEXIT.

However, KVM needs to turn on ENCLS VMEXIT if KVM doesn't expose SGX to the
guest, or if the guest has turned off SGX via IA32_FEATURE_CONTROL.SGX_ENABLE,
as in such cases ENCLS run in the guest may behave differently than on native:
on hardware SGX is indeed enabled, but according to the SDM, running ENCLS
while the guest's SGX environment is abnormal should cause #UD or #GP. KVM
needs to trap ENCLS to emulate this behavior.
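
A sketch of that emulation (guest_cpuid_has_sgx is added in a later patch;
kvm_queue_exception{_e} are KVM's existing exception injection helpers):

	/* Illustrative sketch of an ENCLS exit handler for this case. */
	static int handle_encls(struct kvm_vcpu *vcpu)
	{
		if (!guest_cpuid_has_sgx(vcpu))
			/* Guest doesn't see SGX at all: ENCLS must #UD. */
			kvm_queue_exception(vcpu, UD_VECTOR);
		else
			/* SGX disabled via guest's IA32_FEATURE_CONTROL: #GP. */
			kvm_queue_exception_e(vcpu, GP_VECTOR, 0);

		return 1;	/* resume the guest */
	}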

-- Oversubscription

While oversubscription is better in terms of functionality, it needs a more
complicated implementation. Below is a brief explanation of what needs to be
done to support EPC oversubscription between guests.

Below is the sequence to evict a regular EPC page:

	1) Select one or multiple regular EPC pages from one enclave
	2) Remove EPT/PT mapping for selected EPC pages
	3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
	4) EBLOCK on selected EPC pages
	5) ETRACK on enclave's SECS page
	6) allocate one available slot (8-byte) in VA page
	7) EWB on selected EPC pages

With EWB taking:

	- a VA slot, to store the eviction version info.
	- one normal 4K page in memory, to store the encrypted content of the
	  EPC page.
	- one struct PCMD in memory, to store metadata.
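
Putting the eviction steps together as a sketch (every helper below is
hypothetical, standing in for the corresponding ENCLS leaf or mapping
operation; step numbers refer to the sequence above):

	/* Illustrative sketch of evicting one regular EPC page. */
	static int evict_epc_page(struct sgx_epc_page *epg,
				  struct sgx_epc_page *secs,
				  struct sgx_pageinfo *pginfo, void *va_slot)
	{
		int ret;

		zap_and_flush_mappings(epg);		/* steps 2 and 3 */
		ret = __eblock(epc_addr(epg));		/* step 4: EBLOCK */
		if (ret)
			return ret;
		ret = __etrack(epc_addr(secs));		/* step 5: ETRACK */
		if (ret)
			return ret;
		/* step 7: pginfo carries the backing 4K page and PCMD buffer */
		return __ewb(pginfo, epc_addr(epg), va_slot);
	}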

And below is the sequence to evict an SECS page or VA page:

	1) locate the SECS (or VA) page
	2) remove EPT/PT mapping for the SECS (or VA) page
	3) Send IPIs to remote CPUs to flush TLB
	4) allocate one available slot (8-byte) in a VA page
	5) EWB on the SECS (or VA) page

And to evict an SECS page, all regular EPC pages that belong to that SECS must
be evicted first, otherwise EWB returns the SGX_CHILD_PRESENT error.

And to reload an EPC page:

	1) ELDU/ELDB on the EPC page
	2) set up EPT/PT mapping

With ELDU/ELDB taking:

	- the location of the SECS page
	- the linear address of the enclave's 4K page (that we are going to
	  reload)
	- the VA slot (used in EWB)
	- the 4K page in memory (used in EWB)
	- the struct PCMD in memory (used in EWB)

Therefore, to support EPC oversubscription for guests, KVM needs to know:

	1) the EPC page type (SECS, regular page, VA page, etc)
	2) the EPC page status (whether blocked) -- the guest may already
	   have run EBLOCK
	3) the location of the SECS page -- both eviction & reload need it.

Besides the above, KVM also needs to manage the allocation of VA slots; a VA
page is itself an EPC page and could potentially trigger EPC oversubscription.

To get the above info, KVM needs to trap ENCLS from all guests, and maintain
info of all EPC pages and all enclaves from all guests. Specifically, KVM
needs to turn on ENCLS VMEXIT for all guests, and upon ENCLS VMEXIT, KVM needs
to parse the ENCLS parameters (so that we can update EPC/enclave info
according to which ENCLS leaf the guest is running, and its parameters). KVM
also needs to either run ENCLS on behalf of the guest (and skip this ENCLS),
or use MTF to return to the guest and let the guest run this ENCLS again. For
the former, KVM needs to reconstruct the guest's ENCLS parameters and remap
the guest's virtual addresses to KVM kernel addresses (as all addresses in the
guest's ENCLS parameters are guest virtual addresses), and run ENCLS in KVM on
behalf of the guest. For the latter, upon ENCLS VMEXIT, KVM needs to
temporarily turn off ENCLS VMEXIT, turn on MTF VMEXIT, and enter the guest to
let it run this ENCLS again. This time ENCLS VMEXIT won't happen and MTF
VMEXIT will happen after ENCLS is executed. Upon MTF VMEXIT, we turn ENCLS
VMEXIT back on and turn MTF VMEXIT off again.

The diagrams below compare the two approaches: running ENCLS in KVM, and using MTF.


	--------------------------------------------------------------
				|	ENCLS		|
	--------------------------------------------------------------
				|	   	    /|\
		ENCLS VMEXIT	|			| VMENTRY
				|			|
			       \|/			|
		
		1) parse ENCLS parameters
		2) reconstruct(remap) guest's ENCLS parameters
		3) run ENCLS on behalf of guest (and skip ENCLS)
		4) on success, update EPC/enclave info, or inject error

		   	1) Run ENCLS in KVM


	--------------------------------------------------------------
			 	 |	ENCLS		|
	--------------------------------------------------------------
				| /|\		       |/|\
			ENCLS	|  | VMENTRY    MTF    | | VMENTRY
		       VMEXIT	|  |	       VMEXIT  | |
		       	       \|/ |		      \|/|
	1) Turn off ENCLS VMEXIT	   1) Turn off MTF VMEXIT
	2) Turn on MTF VMEXIT		   2) Turn on ENCLS VMEXIT
	3) cache ENCLS parameters          3) check ENCLS succeeds or not, and
	   (as ENCLS will change RAX-RDX)     only on success, parse cached
	   				      ENCLS parameters, and update
					      EPC/enclave info

			2) Using MTF

Note that in the MTF approach, checking the ENCLS status (success or not) is
tricky, as ENCLS can either return an error via the EAX register, or just
cause #UD or #GP. The former case is relatively easy for KVM to check, but for
the latter KVM needs to trap #UD and #GP from the guest, and also needs to
check whether the #UD or #GP happened while running ENCLS.
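
A sketch of the VMCS toggling in the MTF approach (vmcs_{set,clear}_bits and
CPU_BASED_MONITOR_TRAP_FLAG already exist in KVM; SECONDARY_EXEC_ENCLS_EXITING
is defined in patch 02):

	/* On ENCLS VMEXIT: let the guest re-execute ENCLS, then trap back. */
	vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
			SECONDARY_EXEC_ENCLS_EXITING);
	vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, CPU_BASED_MONITOR_TRAP_FLAG);

	/* On the subsequent MTF VMEXIT: restore ENCLS trapping. */
	vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
		      SECONDARY_EXEC_ENCLS_EXITING);
	vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
			CPU_BASED_MONITOR_TRAP_FLAG);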

In this patch series we only support 'static partitioning'; 'oversubscription'
can be added when it is required. We currently do support nested SGX
(mentioned below), and with 'oversubscription' supporting nested SGX would be
very complicated.

- Guest's EPC memory slot implementation

The guest's (virtual) EPC is implemented as a private memory slot in KVM; Qemu
is not aware of the existence of this EPC slot. Using a private slot, we avoid
an mmap in Qemu to get the EPC slot's host virtual address, and KVM doesn't
need to handle such an mmap from Qemu for the EPC slot. We don't want to
implement such mmap support in the SGX driver either.

A dedicated kvm_epc_ops is added for the VMA of the EPC slot, and EPC pages
are allocated via vma->vm_ops->fault. This is the natural way to support
'oversubscription' (if we need to support it in the future) and works nicely
for 'static partitioning' as well.

- Nested SGX

Currently, nested SGX is also supported for 'static partitioning'. As
mentioned above, in the normal case KVM (L0) doesn't need to turn on ENCLS
VMEXIT, but KVM cannot assume the L1 hypervisor's behavior, so if ENCLS VMEXIT
is turned on in L1, KVM (L0) must also turn on ENCLS VMEXIT, but let L1 handle
such ENCLS VMEXITs from the L2 guest.

Supporting nested SGX with 'oversubscription' would be very complicated, as
both L0 and L1 may turn on ENCLS VMEXIT, and both L0 and L1 need to maintain
and update EPC/enclave info from guests as explained above.

Kai Huang (10):
  x86: add SGX Launch Control definition to cpufeature
  kvm: vmx: add ENCLS VMEXIT detection
  kvm: vmx: detect presence of host SGX driver
  kvm: sgx: new functions to init and destroy SGX for guest
  kvm: x86: add KVM_GET_SUPPORTED_CPUID SGX support
  kvm: x86: add KVM_SET_CPUID2 SGX support
  kvm: vmx: add SGX IA32_FEATURE_CONTROL MSR emulation
  kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  kvm: vmx: handle ENCLS VMEXIT
  kvm: vmx: handle VMEXIT from SGX Enclave

 arch/x86/include/asm/cpufeatures.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   9 +-
 arch/x86/include/asm/msr-index.h   |   7 +
 arch/x86/include/asm/vmx.h         |   4 +
 arch/x86/include/uapi/asm/vmx.h    |   5 +-
 arch/x86/kvm/Makefile              |   2 +-
 arch/x86/kvm/cpuid.c               |  21 +-
 arch/x86/kvm/cpuid.h               |  22 ++
 arch/x86/kvm/sgx.c                 | 463 +++++++++++++++++++++++
 arch/x86/kvm/sgx.h                 | 105 ++++++
 arch/x86/kvm/svm.c                 |  11 +-
 arch/x86/kvm/vmx.c                 | 752 +++++++++++++++++++++++++++++++++++--
 12 files changed, 1362 insertions(+), 40 deletions(-)
 create mode 100644 arch/x86/kvm/sgx.c
 create mode 100644 arch/x86/kvm/sgx.h

-- 
2.11.0


* [PATCH 01/10] x86: add SGX Launch Control definition to cpufeature
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

For Intel CPUs that support SGX Launch Control, CPUID.0x7.0:ECX[bit 30]
reports the availability of the 1-setting of bit 17 of the
IA32_FEATURE_CONTROL MSR, which enables runtime configuration of SGX Launch
Control via IA32_SGXLEPUBKEYHASHn.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 61eba9423b5c..e31c06ac3c65 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -292,6 +292,7 @@
 #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW */
 #define X86_FEATURE_LA57	(16*32+16) /* 5-level page tables */
 #define X86_FEATURE_RDPID	(16*32+22) /* RDPID instruction */
+#define X86_FEATURE_SGX_LAUNCH_CONTROL (16*32+30) /* SGX Launch Control */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (ebx), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery support */
-- 
2.11.0


* [PATCH 02/10] kvm: vmx: add ENCLS VMEXIT detection
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

This patch detects whether ENCLS VMEXIT is supported. A new bool module
parameter 'enable_sgx' is also added to control whether to enable SGX
virtualization. SGX virtualization is disabled if hardware doesn't support
ENCLS VMEXIT.

ENCLS VMEXIT is disabled in vmx_secondary_exec_control; turning it on or off
is done in a later patch.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/vmx.h |  1 +
 arch/x86/kvm/vmx.c         | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index cc54b7026567..f7ac249ce83d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 #define SECONDARY_EXEC_SHADOW_VMCS              0x00004000
+#define SECONDARY_EXEC_ENCLS_EXITING		0x00008000
 #define SECONDARY_EXEC_ENABLE_PML               0x00020000
 #define SECONDARY_EXEC_XSAVES			0x00100000
 #define SECONDARY_EXEC_TSC_SCALING              0x02000000
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 259e9b28ccf8..050a143414e1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -108,6 +108,9 @@ static u64 __read_mostly host_xss;
 static bool __read_mostly enable_pml = 1;
 module_param_named(pml, enable_pml, bool, S_IRUGO);
 
+static bool __read_mostly enable_sgx = 1;
+module_param_named(sgx, enable_sgx, bool, S_IRUGO);
+
 #define KVM_VMX_TSC_MULTIPLIER_MAX     0xffffffffffffffffULL
 
 /* Guest_tsc -> host_tsc conversion requires 64-bit division.  */
@@ -1123,6 +1126,12 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void)
 		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 }
 
+static inline bool cpu_has_vmx_encls_vmexit(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_ENCLS_EXITING;
+}
+
 /*
  * Comment's format: document - errata name - stepping - processor name.
  * Refer from
@@ -3585,7 +3594,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_SHADOW_VMCS |
 			SECONDARY_EXEC_XSAVES |
 			SECONDARY_EXEC_ENABLE_PML |
-			SECONDARY_EXEC_TSC_SCALING;
+			SECONDARY_EXEC_TSC_SCALING |
+			SECONDARY_EXEC_ENCLS_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -5160,6 +5170,13 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	if (!enable_pml)
 		exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
 
+	/*
+	 * ENCLS VMEXIT is controlled in vmx_cpuid_update. This function is
+	 * called in many places, and we don't want ENCLS VMEXIT to be
+	 * enabled unexpectedly.
+	 */
+	exec_control &= ~SECONDARY_EXEC_ENCLS_EXITING;
+
 	return exec_control;
 }
 
@@ -6655,6 +6672,9 @@ static __init int hardware_setup(void)
 
 	kvm_mce_cap_supported |= MCG_LMCE_P;
 
+	if (!cpu_has_vmx_encls_vmexit())
+		enable_sgx = 0;
+
 	return alloc_kvm_area();
 
 out:
-- 
2.11.0


* [PATCH 03/10] kvm: vmx: detect presence of host SGX driver
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

On the host side there's the SGX driver, which serves host SGX applications.
It detects the SGX feature and manages all EPC pages. KVM needs to cooperate
with the SGX driver in terms of EPC management, because they are both EPC
consumers. We should go for a 'unified model', in which the SGX driver manages
all EPC pages and KVM simply calls the driver's APIs to allocate/free EPC
pages, etc. However KVM cannot call the driver's APIs directly: on machines
without the SGX feature, the SGX driver won't be loaded, and calling the
driver's APIs directly would make KVM unable to be loaded either. Instead,
KVM uses symbol_get to get the driver's APIs at runtime, which avoids this
issue.

This patch adds new functions to initialize and destroy KVM SGX support;
currently KVM simply calls symbol_get{put} for all necessary driver APIs. The
symbols are only released when KVM exits, to prevent the SGX driver from being
unloaded during KVM's lifetime. Note KVM completely trusts the SGX driver for
SGX feature and EPC resource detection and won't detect SGX by itself.

Two new files, arch/x86/kvm/sgx.c{h}, are added to hold the bulk of the KVM SGX code.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/kvm/Makefile |   2 +-
 arch/x86/kvm/sgx.c    | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/sgx.h    |  34 +++++++++++
 arch/x86/kvm/vmx.c    |  10 ++++
 4 files changed, 208 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/sgx.c
 create mode 100644 arch/x86/kvm/sgx.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 3bff20710471..015712e666fd 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -17,7 +17,7 @@ kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 
 kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT)	+= assigned-dev.o iommu.o
 
-kvm-intel-y		+= vmx.o pmu_intel.o
+kvm-intel-y		+= vmx.o pmu_intel.o sgx.o
 kvm-amd-y		+= svm.o pmu_amd.o
 
 obj-$(CONFIG_KVM)	+= kvm.o
diff --git a/arch/x86/kvm/sgx.c b/arch/x86/kvm/sgx.c
new file mode 100644
index 000000000000..4b65b1bb1f30
--- /dev/null
+++ b/arch/x86/kvm/sgx.c
@@ -0,0 +1,163 @@
+/*
+ * KVM SGX Virtualization support.
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Author:	Kai Huang <kai.huang@linux.intel.com>
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/cpufeature.h>	/* boot_cpu_has */
+#include <asm/processor.h>	/* cpuid */
+#include <linux/smp.h>
+#include <linux/module.h>
+#include "sgx.h"
+
+/* Debug helpers... */
+#define	sgx_debug(fmt, ...)	\
+	printk(KERN_DEBUG "KVM: SGX: %s: "fmt, __func__, ## __VA_ARGS__)
+#define	sgx_info(fmt, ...)	\
+	printk(KERN_INFO "KVM: SGX: "fmt, ## __VA_ARGS__)
+#define	sgx_err(fmt, ...)	\
+	printk(KERN_ERR "KVM: SGX: "fmt, ## __VA_ARGS__)
+
+/*
+ * EPC pages are managed by SGX driver. KVM needs to call SGX driver's APIs
+ * to allocate/free EPC page, etc.
+ *
+ * However KVM cannot call the SGX driver's APIs directly: on machines
+ * without SGX support, the SGX driver cannot be loaded, therefore if KVM
+ * called the driver's APIs directly, KVM couldn't be loaded either, which
+ * is not acceptable. Instead, KVM uses the symbol_get{put} pair to get the
+ * driver's APIs at runtime, and simply disables SGX if those symbols
+ * cannot be found.
+ */
+struct required_sgx_driver_symbols {
+	struct sgx_epc_page *(*alloc_epc_page)(unsigned int flags);
+	/*
+	 * Currently SGX driver's sgx_free_page has 'struct sgx_encl *encl'
+	 * as parameter. We need to honor that.
+	 */
+	int (*free_epc_page)(struct sgx_epc_page *epg, struct sgx_encl *encl);
+	/*
+	 * get/put (map/unmap) kernel virtual address of given EPC page.
+	 * The namings are aligned to SGX driver's APIs.
+	 */
+	void *(*get_epc_page)(struct sgx_epc_page *epg);
+	void (*put_epc_page)(void *epc_page_vaddr);
+};
+
+static struct required_sgx_driver_symbols sgx_driver_symbols = {
+	.alloc_epc_page = NULL,
+	.free_epc_page = NULL,
+	.get_epc_page = NULL,
+	.put_epc_page = NULL,
+};
+
+static inline struct sgx_epc_page *sgx_alloc_epc_page(unsigned int flags)
+{
+	struct sgx_epc_page *epg;
+
+	BUG_ON(!sgx_driver_symbols.alloc_epc_page);
+
+	epg = sgx_driver_symbols.alloc_epc_page(flags);
+
+	/* sgx_alloc_page returns ERR_PTR(error_code) instead of NULL */
+	return IS_ERR_OR_NULL(epg) ? NULL : epg;
+}
+
+static inline void sgx_free_epc_page(struct sgx_epc_page *epg)
+{
+	BUG_ON(!sgx_driver_symbols.free_epc_page);
+
+	sgx_driver_symbols.free_epc_page(epg, NULL);
+}
+
+static inline void *sgx_kmap_epc_page(struct sgx_epc_page *epg)
+{
+	BUG_ON(!sgx_driver_symbols.get_epc_page);
+
+	return sgx_driver_symbols.get_epc_page(epg);
+}
+
+static inline void sgx_kunmap_epc_page(void *addr)
+{
+	BUG_ON(!sgx_driver_symbols.put_epc_page);
+
+	sgx_driver_symbols.put_epc_page(addr);
+}
+
+static inline u64 sgx_epc_page_to_pfn(struct sgx_epc_page *epg)
+{
+	return (u64)(epg->pa >> PAGE_SHIFT);
+}
+
+static void put_sgx_driver_symbols(void);
+
+static int get_sgx_driver_symbols(void)
+{
+	sgx_driver_symbols.alloc_epc_page = symbol_get(sgx_alloc_page);
+	if (!sgx_driver_symbols.alloc_epc_page)
+		goto error;
+	sgx_driver_symbols.free_epc_page = symbol_get(sgx_free_page);
+	if (!sgx_driver_symbols.free_epc_page)
+		goto error;
+	sgx_driver_symbols.get_epc_page = symbol_get(sgx_get_page);
+	if (!sgx_driver_symbols.get_epc_page)
+		goto error;
+	sgx_driver_symbols.put_epc_page = symbol_get(sgx_put_page);
+	if (!sgx_driver_symbols.put_epc_page)
+		goto error;
+
+	return 0;
+
+error:
+	put_sgx_driver_symbols();
+	return -EFAULT;
+}
+
+static void put_sgx_driver_symbols(void)
+{
+	if (sgx_driver_symbols.alloc_epc_page)
+		symbol_put(sgx_alloc_page);
+	if (sgx_driver_symbols.free_epc_page)
+		symbol_put(sgx_free_page);
+	if (sgx_driver_symbols.get_epc_page)
+		symbol_put(sgx_get_page);
+	if (sgx_driver_symbols.put_epc_page)
+		symbol_put(sgx_put_page);
+
+	memset(&sgx_driver_symbols, 0, sizeof (sgx_driver_symbols));
+}
+
+int sgx_init(void)
+{
+	int r;
+
+	r = get_sgx_driver_symbols();
+	if (r) {
+		sgx_err("SGX driver is not loaded.\n");
+		return r;
+	}
+
+	sgx_info("SGX virtualization supported.\n");
+
+	return 0;
+}
+
+void sgx_destroy(void)
+{
+	put_sgx_driver_symbols();
+}
diff --git a/arch/x86/kvm/sgx.h b/arch/x86/kvm/sgx.h
new file mode 100644
index 000000000000..ff2766eeae33
--- /dev/null
+++ b/arch/x86/kvm/sgx.h
@@ -0,0 +1,34 @@
+/*
+ * KVM SGX Virtualization support.
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Author:	Kai Huang <kai.huang@linux.intel.com>
+ */
+
+#ifndef	ARCH_X86_KVM_SGX_H
+#define	ARCH_X86_KVM_SGX_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/bitops.h>
+#include <linux/kvm_host.h>
+#include <asm/sgx.h>
+
+int sgx_init(void);
+void sgx_destroy(void);
+
+#endif
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 050a143414e1..4b368a0af9bd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -52,6 +52,8 @@
 #include "trace.h"
 #include "pmu.h"
 
+#include "sgx.h"
+
 #define __ex(x) __kvm_handle_fault_on_reboot(x)
 #define __ex_clear(x, reg) \
 	____kvm_handle_fault_on_reboot(x, "xor " reg " , " reg)
@@ -11657,6 +11659,11 @@ static int __init vmx_init(void)
 	if (r)
 		return r;
 
+	if (enable_sgx) {
+		if (sgx_init())
+			enable_sgx = 0;
+	}
+
 #ifdef CONFIG_KEXEC_CORE
 	rcu_assign_pointer(crash_vmclear_loaded_vmcss,
 			   crash_vmclear_local_loaded_vmcss);
@@ -11672,6 +11679,9 @@ static void __exit vmx_exit(void)
 	synchronize_rcu();
 #endif
 
+	if (enable_sgx)
+		sgx_destroy();
+
 	kvm_exit();
 }
 
-- 
2.11.0


* [PATCH 04/10] kvm: sgx: new functions to init and destroy SGX for guest
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

Add a new kvm_sgx structure to keep per-guest SGX state, including the guest's
CPUID and EPC slot info. The initialization function checks the consistency of
the SGX cpuid info from Qemu (returning an error in case Qemu did something
wrong) and creates the EPC slot (only once, when first called). If anything
goes wrong, by returning an error to Qemu we allow it to stop creating vcpus
or just kill the guest.

The EPC slot is implemented as a private memory slot by KVM. This is the
easiest way, as we don't expose a 'file' to userspace for Qemu to mmap to get
a userspace virtual address for the EPC slot, and we don't want to use the SGX
driver's mmap for this purpose either.

EPC pages are actually allocated via the vma->vm_ops->fault associated with
the EPC slot's vma, to comply with the current hva_to_pfn implementation, so
that hva_to_pfn works for EPC as well.

A new kvm_epc structure is also added to represent the guest's EPC slot info,
and a new kvm_epc_page structure is added to track the status of each guest
EPC page. It keeps track of all physical EPC pages allocated to the guest,
which is needed when KVM wants to free them all upon guest destruction. Btw,
the SGX driver doesn't have sgx_epc_pfn_to_page, so KVM needs to do the
bookkeeping. What's more, we can expand it in the future, ex, to support EPC
oversubscription between KVM guests (where a guest's EPC page will have more
states).

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/sgx.c              | 300 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/sgx.h              |  71 ++++++++++
 3 files changed, 374 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74ef58c8ff53..1d622334fc0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -40,7 +40,7 @@
 #define KVM_MAX_VCPU_ID 1023
 #define KVM_USER_MEM_SLOTS 509
 /* memory slots that are not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 3
+#define KVM_PRIVATE_MEM_SLOTS 4
 #define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 
 #define KVM_PIO_PAGE_OFFSET 1
@@ -817,6 +817,8 @@ struct kvm_arch {
 
 	bool x2apic_format;
 	bool x2apic_broadcast_quirk_disabled;
+
+	void *priv;	/* x86 vendor specific data */
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/sgx.c b/arch/x86/kvm/sgx.c
index 4b65b1bb1f30..a7040e6380a5 100644
--- a/arch/x86/kvm/sgx.c
+++ b/arch/x86/kvm/sgx.c
@@ -104,6 +104,306 @@ static inline u64 sgx_epc_page_to_pfn(struct sgx_epc_page *epg)
 	return (u64)(epg->pa >> PAGE_SHIFT);
 }
 
+static int __sgx_eremove(struct sgx_epc_page *epg)
+{
+	void *addr;
+	int r;
+
+	addr = sgx_kmap_epc_page(epg);
+	r = __eremove(addr);
+	sgx_kunmap_epc_page(addr);
+	if (unlikely(r)) {
+		sgx_err("__eremove error: EPC pfn 0x%lx, r %d\n",
+				(unsigned long)sgx_epc_page_to_pfn(epg),
+				r);
+	}
+
+	return r;
+}
+
+/* By the time we reach here, the mmap_sem should already be held */
+static int kvm_epc_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct kvm_sgx *sgx = (struct kvm_sgx *)vma->vm_private_data;
+	struct kvm *kvm;
+	struct sgx_epc_page *epg;
+	struct kvm_epc_page *gepg;
+	u64 gfn, pfn;
+
+	BUG_ON(!sgx);
+	kvm = sgx->kvm;
+
+	gfn = to_epc(sgx)->base_gfn + (((unsigned long)vmf->address -
+				vma->vm_start) >> PAGE_SHIFT);
+	gepg = gfn_to_guest_epc_page(kvm, gfn);
+
+	/*
+	 * SGX driver doesn't support recycling EPC pages back from KVM
+	 * guests yet, and it doesn't support out-of-EPC killer either,
+	 * therefore if we don't use SGX_ALLOC_ATOMIC here, this function
+	 * may never return in case SGX driver cannot recycle enough EPC
+	 * pages from host SGX applications.
+	 */
+	epg = sgx_alloc_epc_page(SGX_ALLOC_ATOMIC);
+	if (!epg) {
+		/* Unable to allocate EPC. Kill the guest */
+		sgx_err("kvm 0x%p, gfn 0x%lx: out of EPC when trying to "
+				"map EPC to guest.\n", kvm, (unsigned long)gfn);
+		goto error;
+	}
+
+	pfn = sgx_epc_page_to_pfn(epg);
+	if (vm_insert_pfn(vma, (unsigned long)vmf->address,
+			(unsigned long)pfn)) {
+		sgx_err("kvm 0x%p, gfn 0x%lx: failed to install host mapping "
+				"on: hva 0x%lx, pfn 0x%lx\n", kvm,
+				(unsigned long)gfn,
+				(unsigned long)vmf->address,
+				(unsigned long)pfn);
+		sgx_free_epc_page(epg);
+		goto error;
+	}
+
+	/* Book keeping physical EPC page allocated/mapped to particular GFN */
+	gepg->epg = epg;
+
+	return VM_FAULT_NOPAGE;	/* EPC has not 'struct page' associated */
+error:
+	return VM_FAULT_SIGBUS;
+}
+
+static void kvm_epc_close(struct vm_area_struct *vma)
+{
+}
+
+static struct vm_operations_struct kvm_epc_ops =  {
+	.fault = kvm_epc_fault,
+	/* close to prevent vma to be merged. */
+	.close = kvm_epc_close,
+};
+
+static void kvm_init_epc_table(struct kvm_epc_page *epc_table, u64 npages)
+{
+	u64 i;
+
+	for (i = 0; i < npages; i++)  {
+		struct kvm_epc_page *gepg = epc_table + i;
+
+		gepg->epg = NULL;
+	}
+}
+
+static void kvm_destroy_epc_table(struct kvm_epc_page *epc_table,
+		u64 npages)
+{
+	u64 i;
+	int r;
+
+	/*
+	 * We need to call EREMOVE explicitly, rather than sgx_free_epc_page,
+	 * in the first round, as sgx_free_page (which sgx_free_epc_page
+	 * calls) provided by the SGX driver always does EREMOVE and adds the
+	 * EPC page back to sgx_free_list if there's no error. We don't keep
+	 * SECS pages on a temporary list but rely on sgx_free_epc_page to
+	 * free all EPC pages in the second round, so just use EREMOVE in the
+	 * first round.
+	 */
+	for (i = 0; i < npages; i++) {
+		struct kvm_epc_page *gepg = epc_table + i;
+		struct sgx_epc_page *epg;
+
+		if (!gepg->epg)
+			continue;
+
+		epg = gepg->epg;
+		r = __sgx_eremove(epg);
+		if (r == SGX_CHILD_PRESENT) {
+			sgx_debug("EREMOVE SECS (0x%lx) prior to regular EPC\n",
+				(unsigned long)sgx_epc_page_to_pfn(epg));
+		}
+	}
+
+	/*
+	 * EREMOVE on invalid EPC (which has been removed from enclave) will
+	 * simply return success.
+	 */
+	for (i = 0; i < npages; i++) {
+		struct kvm_epc_page *gepg = epc_table + i;
+		struct sgx_epc_page *epg;
+
+		if (!gepg->epg)
+			continue;
+
+		epg = gepg->epg;
+		sgx_free_epc_page(epg);
+	}
+}
+
+static int kvm_init_epc(struct kvm *kvm, u64 epc_base_pfn, u64 epc_npages)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+	struct vm_area_struct *vma;
+	struct kvm_memory_slot *slot;
+	struct kvm_epc_page *epc_table;
+	int r;
+
+	r = x86_set_memory_region(kvm, SGX_EPC_MEMSLOT,
+			epc_base_pfn << PAGE_SHIFT, epc_npages << PAGE_SHIFT);
+	if (r) {
+		sgx_debug("x86_set_memory_region failed: %d\n", r);
+		return r;
+	}
+
+	slot = id_to_memslot(kvm_memslots(kvm), SGX_EPC_MEMSLOT);
+	BUG_ON(!slot);
+
+	epc_table = alloc_pages_exact(epc_npages * sizeof (struct kvm_epc_page),
+			GFP_KERNEL);
+	if (!epc_table) {
+		sgx_debug("unable to alloc guest EPC table.\n");
+		x86_set_memory_region(kvm, SGX_EPC_MEMSLOT, 0, 0);
+		return -ENOMEM;
+	}
+
+	kvm_init_epc_table(epc_table, epc_npages);
+
+	sgx->epc.epc_table = epc_table;
+	sgx->epc.base_gfn = slot->base_gfn;
+	sgx->epc.npages = slot->npages;
+
+	vma = find_vma_intersection(kvm->mm, slot->userspace_addr,
+			slot->userspace_addr + 1);
+	BUG_ON(!vma);
+
+	/* EPC has no 'struct page' associated */
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags &= ~(VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_MAYSHARE);
+	vma->vm_ops = &kvm_epc_ops;
+	vma->vm_private_data = (void *)sgx;
+
+	return 0;
+}
+
+static void kvm_destroy_epc(struct kvm *kvm)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+	struct kvm_epc_page *epc_table = to_epc(sgx)->epc_table;
+	u64 npages = to_epc(sgx)->npages;
+
+	/*
+	 * See kvm_arch_destroy_vm, which is also the reason that we don't
+	 * keep slot in kvm_epc structure, as slot may already have been
+	 * destroyed during abnormal exit.
+	 */
+	if (current->mm == kvm->mm)
+		x86_set_memory_region(kvm, SGX_EPC_MEMSLOT, 0, 0);
+
+	kvm_destroy_epc_table(epc_table, npages);
+
+	free_pages_exact(epc_table, npages * sizeof (struct kvm_epc_page));
+}
+
+static int kvm_populate_epc(struct kvm *kvm, u64 epc_base_pfn,
+		u64 epc_npages)
+{
+	int i;
+
+	for (i = 0; i < epc_npages; i++) {
+		gfn_t gfn = epc_base_pfn + i;
+		/* This will trigger vma->vm_ops->fault to populate EPC */
+		kvm_pfn_t pfn = gfn_to_pfn(kvm, gfn);
+		if (is_error_pfn(pfn))
+			return -EFAULT;	/* Cannot use ENOMEM */
+	}
+	return 0;
+}
+
+/*
+ * Initialize SGX for a particular guest. This function may be called several
+ * times by the caller. If the guest's SGX has not been initialized (this
+ * function is called for the first time), we create the kvm_sgx structure
+ * and initialize it. If the guest's SGX has already been initialized, we
+ * then check whether the SGX cpuid from Qemu is consistent with the existing
+ * one. If Qemu did something wrong, by returning an error here we allow Qemu
+ * to stop creating vcpus, or just kill the guest. We also populate all EPC
+ * for the guest, as oversubscription is not supported.
+ */
+int kvm_init_sgx(struct kvm *kvm, struct sgx_cpuinfo *sgxinfo)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+	u64 epc_base_pfn, epc_npages;
+	int r;
+
+	if (!sgxinfo)
+		return -EINVAL;
+
+	if (sgx) {
+		/*
+		 * Already inited? We then check whether EPC base and size
+		 * equal to saved value.
+		 */
+
+		if (memcmp(&(sgx->sgxinfo), sgxinfo,
+					sizeof(struct sgx_cpuinfo))) {
+			sgx_debug("SGX CPUID inconsistency from Qemu\n");
+			return -EINVAL;
+		}
+		else
+			return 0;
+	}
+
+	epc_base_pfn = sgxinfo->epc_base  >> PAGE_SHIFT;
+	epc_npages = sgxinfo->epc_size >> PAGE_SHIFT;
+
+	sgx = kzalloc(sizeof(struct kvm_sgx), GFP_KERNEL);
+	if (!sgx) {
+		sgx_debug("out of memory\n");
+		return -ENOMEM;
+	}
+	sgx->kvm = kvm;
+	memcpy(&(sgx->sgxinfo), sgxinfo, sizeof(struct sgx_cpuinfo));
+	/* Make to_sgx(kvm) work */
+	kvm->arch.priv = sgx;
+
+	/* Init EPC for guest */
+	r = kvm_init_epc(kvm, epc_base_pfn, epc_npages);
+	if (r) {
+		sgx_debug("kvm_create_epc_slot failed.\n");
+		kfree(sgx);
+		kvm->arch.priv = NULL;
+		return r;
+	}
+
+	/* Populate all EPC pages for guest when it is created. */
+	r = kvm_populate_epc(kvm, epc_base_pfn, epc_npages);
+	if (r) {
+		sgx_debug("kvm_populate_epc failed.\n");
+		/* EPC slot will be destroyed when guest is destroyed */
+		kvm_destroy_epc(kvm);
+		kfree(sgx);
+		kvm->arch.priv = NULL;
+		return r;
+	}
+
+	return 0;
+}
+
+void kvm_destroy_sgx(struct kvm *kvm)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+
+	if (sgx) {
+		kvm_destroy_epc(kvm);
+		kfree(sgx);
+	}
+
+	kvm->arch.priv = NULL;
+}
+
 static void put_sgx_driver_symbols(void);
 
 static int get_sgx_driver_symbols(void)
diff --git a/arch/x86/kvm/sgx.h b/arch/x86/kvm/sgx.h
index ff2766eeae33..8a8f1235c19c 100644
--- a/arch/x86/kvm/sgx.h
+++ b/arch/x86/kvm/sgx.h
@@ -27,8 +27,79 @@
 #include <linux/bitops.h>
 #include <linux/kvm_host.h>
 #include <asm/sgx.h>
+#include <uapi/asm/sgx.h>	/* ENCLS error code */
 
 int sgx_init(void);
 void sgx_destroy(void);
 
+struct kvm_epc_page {
+	/* valid if physical EPC page is mapped to guest EPC gfn */
+	struct sgx_epc_page *epg;
+};
+
+struct kvm_epc {
+	u64 base_gfn;
+	u64 npages;
+	struct kvm_epc_page *epc_table;
+};
+
+/*
+ * SGX capability from SGX CPUID.
+ */
+struct sgx_cpuinfo {
+#define SGX_CAP_SGX1    (1UL << 0)
+#define SGX_CAP_SGX2    (1UL << 1)
+    u32 cap;
+    u32 miscselect;
+    u32 max_enclave_size64;
+    u32 max_enclave_size32;
+    u32 secs_attr_bitmask[4];
+    u64 epc_base;
+    u64 epc_size;
+};
+
+/*
+ * SGX per-VM structure
+ */
+struct kvm_sgx {
+	struct kvm *kvm;
+	struct sgx_cpuinfo sgxinfo;
+	struct kvm_epc epc;
+};
+
+#define	to_sgx(_kvm)	((struct kvm_sgx *)((_kvm)->arch.priv))
+#define	to_epc(_sgx)	((struct kvm_epc *)(&((_sgx)->epc)))
+
+static inline bool is_valid_epc_gfn(struct kvm *kvm, u64 gfn)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+	struct kvm_epc *epc = to_epc(sgx);
+
+	return ((gfn >= epc->base_gfn) && (gfn < epc->base_gfn + epc->npages));
+}
+
+static inline struct kvm_epc_page *gfn_to_guest_epc_page(struct kvm *kvm, u64 gfn)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+	struct kvm_epc *epc = to_epc(sgx);
+
+	BUG_ON(!is_valid_epc_gfn(kvm, gfn));
+
+	return epc->epc_table + (gfn - epc->base_gfn);
+}
+
+static inline u64 guest_epc_page_to_gfn(struct kvm *kvm, struct kvm_epc_page *gepg)
+{
+	struct kvm_sgx *sgx = to_sgx(kvm);
+	struct kvm_epc *epc = to_epc(sgx);
+
+	return epc->base_gfn + (gepg - epc->epc_table);
+}
+
+/* EPC slot is created by KVM as private slot. */
+#define SGX_EPC_MEMSLOT		(KVM_USER_MEM_SLOTS + 3)
+
+int kvm_init_sgx(struct kvm *kvm, struct sgx_cpuinfo *sgxinfo);
+void kvm_destroy_sgx(struct kvm *kvm);
+
 #endif
-- 
2.11.0


* [PATCH 05/10] kvm: x86: add KVM_GET_SUPPORTED_CPUID SGX support
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

This patch adds SGX CPUID support for the KVM_GET_SUPPORTED_CPUID IOCTL. We
must only expose the SGX CPUID when enable_sgx is true; enable_sgx may be
false, for example, when the user deliberately disables SGX, or when SGX
initialization fails, in which cases hardware still reports valid SGX CPUID.

As enable_sgx is not exposed to arch/x86/kvm/cpuid.c, we need to handle the
SGX related CPUID in vmx.c, for which kvm_x86_ops->set_supported_cpuid is
extended to meet SGX's needs, and do_cpuid_1_ent is also exposed to VMX.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/cpuid.c            | 13 ++++----
 arch/x86/kvm/cpuid.h            |  2 ++
 arch/x86/kvm/svm.c              |  5 ++-
 arch/x86/kvm/vmx.c              | 71 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1d622334fc0e..d7254f36b17d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -978,7 +978,8 @@ struct kvm_x86_ops {
 
 	void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
 
-	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
+	int (*set_supported_cpuid)(u32 func, u32 index,
+			struct kvm_cpuid_entry2 *entry, int *nent, int maxnent);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index efde6cc50875..d2c396b0b32f 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -274,7 +274,7 @@ static void cpuid_mask(u32 *word, int wordnum)
 	*word &= boot_cpu_data.x86_capability[wordnum];
 }
 
-static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
+void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			   u32 index)
 {
 	entry->function = function;
@@ -283,6 +283,7 @@ static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		    &entry->eax, &entry->ebx, &entry->ecx, &entry->edx);
 	entry->flags = 0;
 }
+EXPORT_SYMBOL_GPL(do_cpuid_1_ent);
 
 static int __do_cpuid_ent_emulated(struct kvm_cpuid_entry2 *entry,
 				   u32 func, u32 index, int *nent, int maxnent)
@@ -402,7 +403,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	switch (function) {
 	case 0:
-		entry->eax = min(entry->eax, (u32)0xd);
+		entry->eax = min(entry->eax, (u32)0x12);
 		break;
 	case 1:
 		entry->edx &= kvm_cpuid_1_edx_x86_features;
@@ -573,6 +574,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		}
 		break;
 	}
+	case 0x12:
+		  /* Intel SGX CPUID. Passthrough to VMX to handle. */
+		  break;
 	case KVM_CPUID_SIGNATURE: {
 		static const char signature[12] = "KVMKVMKVM\0\0";
 		const u32 *sigptr = (const u32 *)signature;
@@ -651,10 +655,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		break;
 	}
 
-	kvm_x86_ops->set_supported_cpuid(function, entry);
-
-	r = 0;
-
+	r = kvm_x86_ops->set_supported_cpuid(function, index, entry, nent, maxnent);
 out:
 	put_cpu();
 
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 35058c2c0eea..de658f4fa1c6 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -6,6 +6,8 @@
 
 int kvm_update_cpuid(struct kvm_vcpu *vcpu);
 bool kvm_mpx_supported(void);
+void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
+			   u32 index);
 struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
 					      u32 function, u32 index);
 int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5fba70646c32..678b30d2a188 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4988,7 +4988,8 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 		entry->ecx &= ~bit(X86_FEATURE_X2APIC);
 }
 
-static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+static int svm_set_supported_cpuid(u32 func, u32 index,
+		struct kvm_cpuid_entry2 *entry, int *nent, int maxnent)
 {
 	switch (func) {
 	case 0x1:
@@ -5017,6 +5018,8 @@ static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 
 		break;
 	}
+
+	return 0;
 }
 
 static int svm_get_lpage_level(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4b368a0af9bd..31de95986dbd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9493,10 +9493,75 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 		nested_vmx_cr_fixed1_bits_update(vcpu);
 }
 
-static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+static int vmx_set_supported_cpuid(u32 func, u32 index,
+		struct kvm_cpuid_entry2 *entry, int *nent, int maxnent)
 {
-	if (func == 1 && nested)
-		entry->ecx |= bit(X86_FEATURE_VMX);
+	int r = -E2BIG;
+
+	switch (func) {
+	case 0x1:
+		if (nested)
+			entry->ecx |= bit(X86_FEATURE_VMX);
+		break;
+	case 0x7:
+		if (index == 0 && enable_sgx) {
+			entry->ebx |= bit(X86_FEATURE_SGX);
+			if (boot_cpu_has(X86_FEATURE_SGX_LAUNCH_CONTROL))
+				entry->ecx |=
+					bit(X86_FEATURE_SGX_LAUNCH_CONTROL);
+		}
+		break;
+	case 0x12: {
+		WARN_ON(index != 0);
+
+		if (enable_sgx) {
+			if (*nent >= maxnent)
+				goto out;
+
+			/* do_cpuid_1_ent has already been called for index 0 */
+			entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+
+			/* Index 1: SECS.ATTRIBUTES */
+			do_cpuid_1_ent(++entry, 0x12, 0x1);
+			entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+			++*nent;
+
+			if (*nent >= maxnent)
+				goto out;
+
+			/*
+			 * Index 2: EPC section
+			 *
+			 * Note: We only report one EPC section as userspace
+			 * doesn't need to know physical EPC info. In fact,
+			 * KVM_SET_CPUID2 should contain guest's virtual EPC
+			 * base & size, in which case one virtual EPC section
+			 * is obviously enough for guest.
+			 */
+			do_cpuid_1_ent(++entry, 0x12, 0x2);
+			entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+			/*
+			 * Don't report physical EPC info as userspace doesn't
+			 * need to know.
+			 */
+			entry->eax &= 0xf;
+			entry->ebx = 0;
+			entry->ecx &= 0xf;
+			entry->edx = 0;
+			++*nent;
+		}
+		else
+			entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+
+		break;
+	}
+	default:
+		break;
+	}
+
+	r = 0;
+out:
+	return r;
 }
 
 static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
-- 
2.11.0


* [PATCH 06/10] kvm: x86: add KVM_SET_CPUID2 SGX support
From: Kai Huang @ 2017-05-08  5:24 UTC
  To: pbonzini, rkrcmar, kvm

This patch adds SGX CPUID support for KVM_SET_CPUID2. Besides setting up
guest's SGX CPUID, guest's SGX is initialized in KVM_SET_CPUID2 as well.
This is because, to avoid adding a new IOCTL to set guest's EPC base & size,
we need to get such info from KVM_SET_CPUID2 (where userspace will set up
guest's EPC base & size), and guest's SGX can only be initialized after KVM
knows such info.

Initializing guest's SGX may fail, so kvm_x86_ops->cpuid_update is changed
to return an integer to reflect whether guest's SGX was initialized
successfully or not. kvm_update_cpuid is also moved to be called before
kvm_x86_ops->cpuid_update, as guest's SGX CPUID.0x12.1 depends on
vcpu->arch.guest_supported_xcr0 (which is set in kvm_update_cpuid).

Also a new kvm_x86_ops->vm_destroy is added for VMX, in which guest's SGX is
destroyed.
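
For illustration, below is a minimal userspace-side sketch (not part of this
patch; the helper and struct names are made up) of how the guest's virtual
EPC base & size could be encoded into CPUID.0x12 index 2. The field layout
mirrors the decoding done in vmx_cpuid_get_sgx_cpuinfo() in this patch:

#include <stdint.h>

struct cpuid_leaf {
	uint32_t eax, ebx, ecx, edx;
};

static void encode_virt_epc_leaf(struct cpuid_leaf *leaf,
				 uint64_t epc_base, uint64_t epc_size)
{
	/* EAX[3:0] = 1: this sub-leaf describes a valid EPC section */
	leaf->eax = 0x1 | (uint32_t)(epc_base & 0xfffff000);
	/* EBX[19:0]: bits 51:32 of the EPC base */
	leaf->ebx = (uint32_t)((epc_base >> 32) & 0xfffff);
	/* ECX[3:0] = 1: section properties; ECX[31:12]: size bits 31:12 */
	leaf->ecx = 0x1 | (uint32_t)(epc_size & 0xfffff000);
	/* EDX[19:0]: bits 51:32 of the EPC size */
	leaf->edx = (uint32_t)((epc_size >> 32) & 0xfffff);
}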

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   2 +-
 arch/x86/kvm/cpuid.c            |   8 ++-
 arch/x86/kvm/cpuid.h            |  20 +++++++
 arch/x86/kvm/svm.c              |   6 ++-
 arch/x86/kvm/vmx.c              | 113 +++++++++++++++++++++++++++++++++++++++-
 5 files changed, 143 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7254f36b17d..38cbb1eb652f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -894,7 +894,7 @@ struct kvm_x86_ops {
 	void (*hardware_unsetup)(void);            /* __exit */
 	bool (*cpu_has_accelerated_tpr)(void);
 	bool (*cpu_has_high_real_mode_segbase)(void);
-	void (*cpuid_update)(struct kvm_vcpu *vcpu);
+	int (*cpuid_update)(struct kvm_vcpu *vcpu);
 
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index d2c396b0b32f..11a13afef373 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -220,8 +220,10 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
 	vcpu->arch.cpuid_nent = cpuid->nent;
 	cpuid_fix_nx_cap(vcpu);
 	kvm_apic_set_version(vcpu);
-	kvm_x86_ops->cpuid_update(vcpu);
 	r = kvm_update_cpuid(vcpu);
+	if (r)
+		goto out;
+	r = kvm_x86_ops->cpuid_update(vcpu);
 
 out:
 	vfree(cpuid_entries);
@@ -243,8 +245,10 @@ int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu,
 		goto out;
 	vcpu->arch.cpuid_nent = cpuid->nent;
 	kvm_apic_set_version(vcpu);
-	kvm_x86_ops->cpuid_update(vcpu);
 	r = kvm_update_cpuid(vcpu);
+	if (r)
+		goto out;
+	r = kvm_x86_ops->cpuid_update(vcpu);
 out:
 	return r;
 }
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index de658f4fa1c6..7d10f2884779 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -155,6 +155,26 @@ static inline bool guest_cpuid_has_rdtscp(struct kvm_vcpu *vcpu)
 }
 
 /*
+ * Only checks CPUID.0x7.0x0:EBX.SGX. SDM says if this bit is 1, the logical
+ * processor supports SGX and the SGX CPUID leaf 0x12.
+ */
+static inline bool guest_cpuid_has_sgx(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x7, 0);
+	return best && (best->ebx & bit(X86_FEATURE_SGX));
+}
+
+static inline bool guest_cpuid_has_sgx_launch_control(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x7, 0);
+	return best && (best->ecx & bit(X86_FEATURE_SGX_LAUNCH_CONTROL));
+}
+
+/*
  * NRIPS is provided through cpuidfn 0x8000000a.edx bit 3
  */
 #define BIT_NRIPS	3
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 678b30d2a188..d5a5410ba623 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4972,7 +4972,7 @@ static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	return 0;
 }
 
-static void svm_cpuid_update(struct kvm_vcpu *vcpu)
+static int svm_cpuid_update(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_cpuid_entry2 *entry;
@@ -4981,11 +4981,13 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 	svm->nrips_enabled = !!guest_cpuid_has_nrips(&svm->vcpu);
 
 	if (!kvm_vcpu_apicv_active(vcpu))
-		return;
+		return 0;
 
 	entry = kvm_find_cpuid_entry(vcpu, 1, 0);
 	if (entry)
 		entry->ecx &= ~bit(X86_FEATURE_X2APIC);
+
+	return 0;
 }
 
 static int svm_set_supported_cpuid(u32 func, u32 index,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 31de95986dbd..3c1cc94e7e6d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9447,11 +9447,109 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 #undef cr4_fixed1_update
 }
 
-static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
+/* This should be called after vcpu's SGX CPUID has been properly set up */
+static void vmx_cpuid_get_sgx_cpuinfo(struct kvm_vcpu *vcpu, struct
+		sgx_cpuinfo *sgxinfo)
+{
+	struct kvm_cpuid_entry2 *best;
+	u64 base, size;
+
+	BUG_ON(!sgxinfo);
+
+	/* See comments in detect_sgx... */
+	memset(sgxinfo, 0, sizeof(struct sgx_cpuinfo));
+
+	best = kvm_find_cpuid_entry(vcpu, 0x12, 0);
+	if (!best)
+		goto not_supported;
+	if (!(best->eax & SGX_CAP_SGX1))
+		goto not_supported;
+
+	sgxinfo->cap = best->eax;
+	sgxinfo->miscselect = best->ebx;
+	sgxinfo->max_enclave_size32 = best->edx & 0xff;
+	sgxinfo->max_enclave_size64 = (best->edx & 0xff00) >> 8;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x12, 1);
+	if (!best)
+		goto not_supported;
+
+	sgxinfo->secs_attr_bitmask[0] = best->eax;
+	sgxinfo->secs_attr_bitmask[1] = best->ebx;
+	sgxinfo->secs_attr_bitmask[2] = best->ecx;
+	sgxinfo->secs_attr_bitmask[3] = best->edx;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x12, 2);
+	if (!best)
+		goto not_supported;
+	if (!(best->eax & 0x1) || !(best->ecx & 0x1))
+		goto not_supported;
+
+	base = (((u64)(best->ebx & 0xfffff)) << 32) | (best->eax & 0xfffff000);
+	size = (((u64)(best->edx & 0xfffff)) << 32) | (best->ecx & 0xfffff000);
+	if (!base || !size)
+		goto not_supported;
+
+	sgxinfo->epc_base = base;
+	sgxinfo->epc_size = size;
+
+	return;
+
+not_supported:
+	memset(sgxinfo, 0, sizeof(struct sgx_cpuinfo));
+}
+
+static int vmx_cpuid_update_sgx(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+	struct sgx_cpuinfo si;
+	int r;
+
+	/* Nothing to check if SGX is not enabled for guest */
+	if (!guest_cpuid_has_sgx(vcpu))
+		return 0;
+
+	/*
+	 * Update CPUID.0x12.0x1 according to vcpu->arch.guest_supported_xcr0,
+	 * which is calculated in kvm_update_cpuid. This is the reason we
+	 * change the order of kvm_x86_ops->cpuid_update and kvm_update_cpuid.
+	 */
+	best = kvm_find_cpuid_entry(vcpu, 0x12, 0x1);
+	if (!best)
+		return -EFAULT;
+	best->ecx &= (unsigned int)(vcpu->arch.guest_supported_xcr0 & 0xffffffff);
+	best->ecx |= 0x3;
+	best->edx &= (unsigned int)(vcpu->arch.guest_supported_xcr0 >> 32);
+
+	/*
+	 * Make sure all SGX CPUIDs are properly set in KVM_SET_CPUID2 from
+	 * userspace. vmx_cpuid_get_sgx_cpuinfo will report invalid SGX if
+	 * any SGX CPUID is not properly setup in KVM_SET_CPUID2.
+	 */
+	vmx_cpuid_get_sgx_cpuinfo(vcpu, &si);
+	if (!(si.cap & SGX_CAP_SGX1))
+		return -EFAULT;
+
+	/*
+	 * Initialize guest's SGX stuff here. To avoid adding a new IOCTL
+	 * for userspace to pass in guest's (virtual) EPC base and size, KVM
+	 * gets such info from KVM_SET_CPUID2. Initializing guest's SGX here
+	 * also provides a way for KVM to check whether userspace did
+	 * everything right about the SGX CPUID (e.g., inconsistent SGX
+	 * CPUID between vcpus being passed to KVM). In case of any error,
+	 * we return an error to reflect that userspace got the CPUID wrong.
+	 */
+	r = kvm_init_sgx(vcpu->kvm, &si);
+	if (r)
+		return r;
+
+	return 0;
+}
+
+static int vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 secondary_exec_ctl = vmx_secondary_exec_control(vmx);
+	int r = 0;
 
 	if (vmx_rdtscp_supported()) {
 		bool rdtscp_enabled = guest_cpuid_has_rdtscp(vcpu);
@@ -9491,6 +9589,12 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 
 	if (nested_vmx_allowed(vcpu))
 		nested_vmx_cr_fixed1_bits_update(vcpu);
+
+	r = vmx_cpuid_update_sgx(vcpu);
+	if (r)
+		return r;
+
+	return 0;
 }
 
 static int vmx_set_supported_cpuid(u32 func, u32 index,
@@ -11589,6 +11693,11 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 			~FEATURE_CONTROL_LMCE;
 }
 
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+	kvm_destroy_sgx(kvm);
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -11600,6 +11709,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.cpu_has_high_real_mode_segbase = vmx_has_high_real_mode_segbase,
 
+	.vm_destroy = vmx_vm_destroy,
+
 	.vcpu_create = vmx_create_vcpu,
 	.vcpu_free = vmx_free_vcpu,
 	.vcpu_reset = vmx_vcpu_reset,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 07/10] kvm: vmx: add SGX IA32_FEATURE_CONTROL MSR emulation
  2017-05-08  5:24 [RFC PATCH 00/10] Basic KVM SGX Virtualization support Kai Huang
                   ` (5 preceding siblings ...)
  2017-05-08  5:24 ` [PATCH 06/10] kvm: x86: add KVM_SET_CPUID2 " Kai Huang
@ 2017-05-08  5:24 ` Kai Huang
  2017-05-08  5:24 ` [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support Kai Huang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 78+ messages in thread
From: Kai Huang @ 2017-05-08  5:24 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm

If CPUID.0x7.0:EBX.SGX == 1, IA32_FEATURE_CONTROL bit 18 (SGX enable) is valid.
If CPUID.0x7.0:ECX[bit30] = 1, IA32_FEATURE_CONTROL bit 17 (SGX Launch Control)
is valid. This patch emulates the two new bits of the IA32_FEATURE_CONTROL MSR.
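
For reference, a guest-side sketch (not part of this patch) of the check a
guest kernel is expected to perform before using SGX, using the
FEATURE_CONTROL_* defines added below:

static bool sgx_enabled_in_bios(void)
{
	u64 fc;

	rdmsrl(MSR_IA32_FEATURE_CONTROL, fc);

	/* Both the lock bit and the SGX enable bit must be set */
	return (fc & (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_SGX_ENABLE))
	       == (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_SGX_ENABLE);
}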

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h |  2 ++
 arch/x86/kvm/vmx.c               | 28 ++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d8b5f8ab8ef9..e3770f570bb9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -422,6 +422,8 @@
 #define FEATURE_CONTROL_LOCKED				(1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX	(1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX	(1<<2)
+#define FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE	(1<<17)
+#define FEATURE_CONTROL_SGX_ENABLE			(1<<18)
 #define FEATURE_CONTROL_LMCE				(1<<20)
 
 #define MSR_IA32_APICBASE		0x0000001b
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3c1cc94e7e6d..a16539594a99 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3329,6 +3329,20 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vmx->msr_ia32_feature_control = data;
 		if (msr_info->host_initiated && data == 0)
 			vmx_leave_nested(vcpu);
+		/*
+		 * If guest's FEATURE_CONTROL_SGX_ENABLE is disabled, shall
+		 * we also clear vcpu's SGX CPUID? SDM (chapter 37.7.7.1)
+		 * says FEATURE_CONTROL_SGX_ENABLE bit doesn't reflect SGX
+		 * CPUID but in reality seems if FEATURE_CONTROL_SGX_ENABLE
+		 * is disabled, SGX CPUID will reports (at least) invalid EPC.
+		 * But looks we cannot just simply clear vcpu's SGX CPUID,
+		 * as Qemu may write IA32_FEATURE_CONTROL *before* or *after*
+		 * KVM_SET_CPUID2. If KVM_SET_CPUID2 is called first, and we
+		 * clear vcpu's SGX CPUID here, we will not be able to enable
+		 * SGX again as SGX CPUID info has already lost. Therefore do
+		 * nothing here. We assume guest will always check whether
+		 * SGX has been enabled in BIOS before using SGX.
+		 */
 		break;
 	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
 		if (!msr_info->host_initiated)
@@ -9594,6 +9608,20 @@ static int vmx_cpuid_update(struct kvm_vcpu *vcpu)
 	if (r)
 		return r;
 
+	if (guest_cpuid_has_sgx(vcpu)) {
+		/*
+		 * If CPUID.0x7.0:EBX.SGX = 1, SGX can be opted in/out in BIOS
+		 * via IA32_FEATURE_CONTROL bit 18. If CPUID.0x7.0:ECX[bit30]
+		 * = 1, IA32_FEATURE_CONTROL bit 17 is valid to enable runtime
+		 * SGX Launch Control.
+		 */
+		to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
+			FEATURE_CONTROL_SGX_ENABLE;
+		if (guest_cpuid_has_sgx_launch_control(vcpu))
+			to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
+				FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE;
+	}
+
 	return 0;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-08  5:24 [RFC PATCH 00/10] Basic KVM SGX Virtualization support Kai Huang
                   ` (6 preceding siblings ...)
  2017-05-08  5:24 ` [PATCH 07/10] kvm: vmx: add SGX IA32_FEATURE_CONTROL MSR emulation Kai Huang
@ 2017-05-08  5:24 ` Kai Huang
  2017-05-12  0:32   ` Huang, Kai
  2017-05-08  5:24 ` [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT Kai Huang
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Kai Huang @ 2017-05-08  5:24 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm

If SGX runtime launch control is enabled on the host (IA32_FEATURE_CONTROL[17]
is set), KVM can support running multiple guests, each running an LE signed
with a different RSA pubkey. KVM traps IA32_SGXLEPUBKEYHASHn MSR writes from
the guest and keeps the values internally in the vcpu; when the vcpu is
scheduled in, KVM writes those values to the real IA32_SGXLEPUBKEYHASHn MSRs.
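
For illustration, userspace could set the guest's hash MSRs with a
host-initiated KVM_SET_MSRS call along these lines (hypothetical helper;
vcpu_fd is assumed to be an open KVM vcpu file descriptor):

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

static int set_guest_le_hash(int vcpu_fd, const uint64_t hash[4])
{
	struct {
		struct kvm_msrs header;
		struct kvm_msr_entry entries[4];
	} msrs;
	int i;

	memset(&msrs, 0, sizeof(msrs));
	msrs.header.nmsrs = 4;
	for (i = 0; i < 4; i++) {
		/* MSR_IA32_SGXLEPUBKEYHASH0..3 are 0x8c..0x8f */
		msrs.entries[i].index = 0x8c + i;
		msrs.entries[i].data = hash[i];
	}

	/* Returns the number of MSRs actually set */
	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}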

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h |   5 ++
 arch/x86/kvm/vmx.c               | 123 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e3770f570bb9..70482b951b0f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -417,6 +417,11 @@
 #define MSR_IA32_TSC_ADJUST             0x0000003b
 #define MSR_IA32_BNDCFGS		0x00000d90
 
+#define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008c
+#define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008d
+#define MSR_IA32_SGXLEPUBKEYHASH2	0x0000008e
+#define MSR_IA32_SGXLEPUBKEYHASH3	0x0000008f
+
 #define MSR_IA32_XSS			0x00000da0
 
 #define FEATURE_CONTROL_LOCKED				(1<<0)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a16539594a99..c96332b9dd44 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -656,6 +656,9 @@ struct vcpu_vmx {
 	 */
 	u64 msr_ia32_feature_control;
 	u64 msr_ia32_feature_control_valid_bits;
+
+	/* SGX Launch Control public key hash */
+	u64 msr_ia32_sgxlepubkeyhash[4];
 };
 
 enum segment_cache_field {
@@ -2244,6 +2247,70 @@ static void decache_tsc_multiplier(struct vcpu_vmx *vmx)
 	vmcs_write64(TSC_MULTIPLIER, vmx->current_tsc_ratio);
 }
 
+static bool cpu_sgx_lepubkeyhash_writable(void)
+{
+	u64 val, sgx_lc_enabled_mask = (FEATURE_CONTROL_LOCKED |
+			FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE);
+
+	rdmsrl(MSR_IA32_FEATURE_CONTROL, val);
+
+	return ((val & sgx_lc_enabled_mask) == sgx_lc_enabled_mask);
+}
+
+static bool vmx_sgx_lc_disabled_in_bios(struct kvm_vcpu *vcpu)
+{
+	return (to_vmx(vcpu)->msr_ia32_feature_control & FEATURE_CONTROL_LOCKED)
+		&& (!(to_vmx(vcpu)->msr_ia32_feature_control &
+				FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE));
+}
+
+#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH0		0xa6053e051270b7ac
+#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH1	        0x6cfbe8ba8b3b413d
+#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH2		0xc4916d99f2b3735d
+#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH3		0xd4f8c05909f9bb3b
+
+static void vmx_sgx_init_lepubkeyhash(struct kvm_vcpu *vcpu)
+{
+	u64 h0, h1, h2, h3;
+
+	/*
+	 * If runtime launch control is enabled (IA32_SGXLEPUBKEYHASHn is
+	 * writable), we set guest's default value to be Intel's default
+	 * hash (which is a fixed value and can be hard-coded). Otherwise,
+	 * guest can only use machine's IA32_SGXLEPUBKEYHASHn so set guest's
+	 * default to that.
+	 */
+	if (cpu_sgx_lepubkeyhash_writable()) {
+		h0 = SGX_INTEL_DEFAULT_LEPUBKEYHASH0;
+		h1 = SGX_INTEL_DEFAULT_LEPUBKEYHASH1;
+		h2 = SGX_INTEL_DEFAULT_LEPUBKEYHASH2;
+		h3 = SGX_INTEL_DEFAULT_LEPUBKEYHASH3;
+	}
+	else {
+		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH0, h0);
+		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH1, h1);
+		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH2, h2);
+		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH3, h3);
+	}
+
+	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[0] = h0;
+	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[1] = h1;
+	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[2] = h2;
+	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[3] = h3;
+}
+
+static void vmx_sgx_lepubkeyhash_load(struct kvm_vcpu *vcpu)
+{
+	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0,
+			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[0]);
+	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH1,
+			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[1]);
+	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH2,
+			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[2]);
+	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH3,
+			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[3]);
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -2316,6 +2383,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	vmx_vcpu_pi_load(vcpu, cpu);
 	vmx->host_pkru = read_pkru();
+
+	/*
+	 * Load guest's SGX LE pubkey hash if runtime launch control is
+	 * enabled.
+	 */
+	if (guest_cpuid_has_sgx_launch_control(vcpu) &&
+			cpu_sgx_lepubkeyhash_writable())
+		vmx_sgx_lepubkeyhash_load(vcpu);
 }
 
 static void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
@@ -3225,6 +3300,19 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_FEATURE_CONTROL:
 		msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
 		break;
+	case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
+		/*
+		 * SDM 35.1 Model-Specific Registers, table 35-2.
+		 * Read permitted if CPUID.0x12.0:EAX[0] = 1. (We have
+		 * guaranteed this will be true if guest_cpuid_has_sgx
+		 * is true.)
+		 */
+		if (!guest_cpuid_has_sgx(vcpu))
+			return 1;
+		msr_info->data =
+			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[msr_info->index -
+			MSR_IA32_SGXLEPUBKEYHASH0];
+		break;
 	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
 		if (!nested_vmx_allowed(vcpu))
 			return 1;
@@ -3344,6 +3432,37 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		 * SGX has been enabled in BIOS before using SGX.
 		 */
 		break;
+	case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
+		/*
+		 * SDM 35.1 Model-Specific Registers, table 35-2.
+		 * - If CPUID.0x7.0:ECX[30] = 1, FEATURE_CONTROL[17] is
+		 * available.
+		 * - Write permitted if CPUID.0x12.0:EAX[0] = 1 &&
+		 * FEATURE_CONTROL[17] = 1 && FEATURE_CONTROL[0] = 1.
+		 */
+		if (!guest_cpuid_has_sgx(vcpu) ||
+				!guest_cpuid_has_sgx_launch_control(vcpu))
+			return 1;
+		/*
+		 * Don't let userspace set guest's IA32_SGXLEPUBKEYHASHn,
+		 * if machine's IA32_SGXLEPUBKEYHASHn cannot be changed at
+		 * runtime. Note to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash are
+		 * set to default in vmx_create_vcpu, therefore the guest is
+		 * still able to read the machine's IA32_SGXLEPUBKEYHASHn via
+		 * rdmsr.
+		 */
+		if (!cpu_sgx_lepubkeyhash_writable())
+			return 1;
+		/*
+		 * If guest's FEATURE_CONTROL[17] is not set, guest's
+		 * IA32_SGXLEPUBKEYHASHn are not writable from the guest.
+		 */
+		if (vmx_sgx_lc_disabled_in_bios(vcpu) &&
+				!msr_info->host_initiated)
+			return 1;
+		to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[msr_index -
+			MSR_IA32_SGXLEPUBKEYHASH0] = data;
+		break;
 	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
 		if (!msr_info->host_initiated)
 			return 1; /* they are read-only */
@@ -9305,6 +9424,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 		vmx->nested.vpid02 = allocate_vpid();
 	}
 
+	/* Set vcpu's default IA32_SGXLEPUBKEYHASHn */
+	if (enable_sgx && boot_cpu_has(X86_FEATURE_SGX_LAUNCH_CONTROL))
+		vmx_sgx_init_lepubkeyhash(&vmx->vcpu);
+
 	vmx->nested.posted_intr_nv = -1;
 	vmx->nested.current_vmptr = -1ull;
 	vmx->nested.current_vmcs12 = NULL;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT
  2017-05-08  5:24 [RFC PATCH 00/10] Basic KVM SGX Virtualization support Kai Huang
                   ` (7 preceding siblings ...)
  2017-05-08  5:24 ` [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support Kai Huang
@ 2017-05-08  5:24 ` Kai Huang
  2017-05-08  8:08   ` Paolo Bonzini
  2017-05-08  5:24 ` [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave Kai Huang
  2017-05-08  5:24 ` [PATCH 11/11] kvm: vmx: workaround FEATURE_CONTROL[17] is not set by BIOS Kai Huang
  10 siblings, 1 reply; 78+ messages in thread
From: Kai Huang @ 2017-05-08  5:24 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm

This patch handles ENCLS VMEXIT. ENCLS VMEXIT doesn't always need to be
turned on; in fact it should not be turned on in most cases, as the guest can
run ENCLS perfectly well in non-root mode. However there are some cases where
we need to trap ENCLS and emulate it, as in those cases ENCLS in the guest
may behave differently than on native hardware (for example, when hardware
supports SGX but SGX is not exposed to the guest, and the guest runs ENCLS
deliberately anyway).

For nested SGX support, we need to turn on ENCLS VMEXIT if the L1 hypervisor
has turned it on, and such an ENCLS VMEXIT from L2 (the nested guest) will be
handled by the L1 hypervisor.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/vmx.h      |   2 +
 arch/x86/include/uapi/asm/vmx.h |   4 +-
 arch/x86/kvm/vmx.c              | 265 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 270 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f7ac249ce83d..2f24290b7f9d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -202,6 +202,8 @@ enum vmcs_field {
 	XSS_EXIT_BITMAP_HIGH            = 0x0000202D,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	ENCLS_EXITING_BITMAP		= 0x0000202E,
+	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 14458658e988..2bcd967d5c83 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -77,6 +77,7 @@
 #define EXIT_REASON_XSETBV              55
 #define EXIT_REASON_APIC_WRITE          56
 #define EXIT_REASON_INVPCID             58
+#define EXIT_REASON_ENCLS		60
 #define EXIT_REASON_PML_FULL            62
 #define EXIT_REASON_XSAVES              63
 #define EXIT_REASON_XRSTORS             64
@@ -130,7 +131,8 @@
 	{ EXIT_REASON_INVVPID,               "INVVPID" }, \
 	{ EXIT_REASON_INVPCID,               "INVPCID" }, \
 	{ EXIT_REASON_XSAVES,                "XSAVES" }, \
-	{ EXIT_REASON_XRSTORS,               "XRSTORS" }
+	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
+	{ EXIT_REASON_ENCLS,		     "ENCLS" }
 
 #define VMX_ABORT_SAVE_GUEST_MSR_FAIL        1
 #define VMX_ABORT_LOAD_HOST_PDPTE_FAIL       2
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c96332b9dd44..b5f37982e975 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -254,6 +254,7 @@ struct __packed vmcs12 {
 	u64 eoi_exit_bitmap2;
 	u64 eoi_exit_bitmap3;
 	u64 xss_exit_bitmap;
+	u64 encls_exiting_bitmap;
 	u64 guest_physical_address;
 	u64 vmcs_link_pointer;
 	u64 guest_ia32_debugctl;
@@ -780,6 +781,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
 	FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
 	FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
+	FIELD64(ENCLS_EXITING_BITMAP, encls_exiting_bitmap),
 	FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
 	FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
 	FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
@@ -1402,6 +1404,11 @@ static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12)
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2312,6 +2319,128 @@ static void vmx_sgx_lepubkeyhash_load(struct kvm_vcpu *vcpu)
 }
 
 /*
+ * Setup ENCLS VMEXIT on current VMCS according to encls_vmexit_bitmap.
+ * If encls_vmexit_bitmap is 0, we also disable ENCLS VMEXIT in secondary
+ * execution control. Otherwise we enable ENCLS VMEXIT.
+ *
+ * Must be called after vcpu is loaded.
+ */
+static void vmx_set_encls_vmexit_bitmap(struct kvm_vcpu *vcpu, u64
+		encls_vmexit_bitmap)
+{
+	u32 secondary_exec_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+
+	if (encls_vmexit_bitmap)
+		secondary_exec_ctl |= SECONDARY_EXEC_ENCLS_EXITING;
+	else
+		secondary_exec_ctl &= ~SECONDARY_EXEC_ENCLS_EXITING;
+
+	vmcs_write64(ENCLS_EXITING_BITMAP, encls_vmexit_bitmap);
+	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, secondary_exec_ctl);
+}
+
+static void vmx_enable_encls_vmexit_all(struct kvm_vcpu *vcpu)
+{
+	vmx_set_encls_vmexit_bitmap(vcpu, -1ULL);
+}
+
+/* Disable ENCLS VMEXIT on current VMCS. Must be called after vcpu is loaded. */
+static void vmx_disable_encls_vmexit(struct kvm_vcpu *vcpu)
+{
+	vmx_set_encls_vmexit_bitmap(vcpu, 0);
+}
+
+static bool vmx_sgx_enabled_in_bios(struct kvm_vcpu *vcpu)
+{
+	u32 sgx_opted_in = FEATURE_CONTROL_SGX_ENABLE | FEATURE_CONTROL_LOCKED;
+
+	return (to_vmx(vcpu)->msr_ia32_feature_control & sgx_opted_in) ==
+		sgx_opted_in;
+}
+
+static void vmx_update_encls_vmexit(struct kvm_vcpu *vcpu)
+{
+	/* Hardware doesn't support SGX */
+	if (!cpu_has_vmx_encls_vmexit())
+		return;
+
+	/*
+	 * ENCLS error check sequence:
+	 *
+	 * 1) IF CR0.PE = 0 (real mode), or RFLAGS.VM = 1 (virtual-8086 mode),
+	 *    or SMM mode, or CPUID.0x12.0x0:EAX.SGX1 = 0
+	 *	#UD
+	 *
+	 * 2) IF CPL > 0
+	 *	#UD
+	 *
+	 * 3) VMEXIT if enabled
+	 *
+	 * 4) IA32_FEATURE_CONTROL.LOCK, or IA32_FEATURE_CONTROL.SGX_ENABLE = 0
+	 *	#GP
+	 *
+	 * 5) IF RAX = invalid leaf function
+	 *	#GP
+	 *
+	 * 6) IF CR0.PG = 0 (paging disabled)
+	 *	#GP
+	 *
+	 * 7) IF not in 64-bit mode, and DS.type is expand-down data
+	 *	#GP
+	 *
+	 *    Note: non 64-bit mode (32-bit mode) means:
+	 *	- protected mode
+	 *	- IA32e mode's compatibility mode (IA32_EFER.LMA = 1, CS.L = 1)
+	 *
+	 *    Currently KVM doesn't do anything in terms of compatibility mode
+	 *    (SECONDARY_VM_EXEC_CONTROL[bit 2] (descriptor-table exiting) is not
+	 *    enabled, so KVM won't trap any segment register operation in
+	 *    guest). We don't have to trap ENCLS for compatibility mode as
+	 *    ENCLS will behave just the same in the guest.
+	 *
+	 * So, to correctly emulate ENCLS, below ENCLS VMEXIT policy is applied:
+	 *
+	 * - For 1), in real mode, SMM mode, no need to trap ENCLS (we cannot
+	 *   actually, as this check happens before VMEXIT).
+	 *
+	 * - If SGX is not exposed to guest (guest_cpuid_has_sgx(vcpu) == 0), or
+	 *   SGX is not enabled in guest's BIOS (vmx->msr_ia32_feature_control
+	 *   doesn't have SGX_ENABLE or LOCK bit set), we need to turn on ENCLS
+	 *   VMEXIT for protected mode and long mode. The reason is, we need
+	 *   to inject #UD for the former and inject #GP for the latter. The
+	 *   hardware actually has SGX support and it is indeed enabled in
+	 *   the physical BIOS, so ENCLS may behave differently from what the
+	 *   SDM describes when running in the guest.
+	 *
+	 * - For 5), 6), 7), no need to trap ENCLS, as ENCLS will just cause
+	 *   #GP while running in guest.
+	 *
+	 * Most importantly:
+	 *
+	 * - If guest supports SGX, and SGX is enabled in guest's BIOS, on the
+	 *   contrary we don't want to turn on ENCLS VMEXIT, as ENCLS can
+	 *   perfectly run in guest while having the same hardware behavior.
+	 *   Trapping ENCLS from the guest is pointless and only hurts performance.
+	 */
+
+	/* It's pointless to update ENCLS VMEXIT while guest in real mode */
+	if (to_vmx(vcpu)->rmode.vm86_active)
+		return;
+
+	if (!guest_cpuid_has_sgx(vcpu) || !vmx_sgx_enabled_in_bios(vcpu)) {
+		vmx_enable_encls_vmexit_all(vcpu);
+		return;
+	}
+
+	/* If ENCLS VMEXIT is turned on nested, don't disable it */
+	if (nested && is_guest_mode(vcpu) &&
+			nested_cpu_has_encls_exit(get_vmcs12(vcpu)))
+		return;
+
+	vmx_disable_encls_vmexit(vcpu);
+}
+
+/*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
  */
@@ -3417,6 +3546,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vmx->msr_ia32_feature_control = data;
 		if (msr_info->host_initiated && data == 0)
 			vmx_leave_nested(vcpu);
+
+		/* SGX may be enabled/disabled in guest's BIOS */
+		vmx_update_encls_vmexit(vcpu);
+
 		/*
 		 * If guest's FEATURE_CONTROL_SGX_ENABLE is disabled, shall
 		 * we also clear vcpu's SGX CPUID? SDM (chapter 37.7.7.1)
@@ -4131,6 +4264,9 @@ static void vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 		msr->data = efer & ~EFER_LME;
 	}
 	setup_msrs(vmx);
+
+	/* Possible mode change */
+	vmx_update_encls_vmexit(vcpu);
 }
 
 #ifdef CONFIG_X86_64
@@ -4337,6 +4473,9 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 
 	/* depends on vcpu->arch.cr0 to be set to a new value */
 	vmx->emulation_required = emulation_required(vcpu);
+
+	/* Possible mode change */
+	vmx_update_encls_vmexit(vcpu);
 }
 
 static u64 construct_eptp(unsigned long root_hpa)
@@ -4548,6 +4687,9 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
 
 out:
 	vmx->emulation_required = emulation_required(vcpu);
+
+	/* Possible mode change */
+	vmx_update_encls_vmexit(vcpu);
 }
 
 static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
@@ -7992,6 +8134,73 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int nested_handle_encls_exit(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	if (guest_cpuid_has_sgx(vcpu)) {
+		/*
+		 * Which means SGX is exposed to L1 but is disabled in
+		 * L1's BIOS. We should inject #GP according to SDM
+		 * (Chapter 37.7.1 Intel SGX Opt-in Configuration).
+		 *
+		 * nested_cpu_has_encls_exit cannot be true as in this
+		 * case we have allowed L1 to handle ENCLS VMEXIT.
+		 */
+		BUG_ON(nested_cpu_has_encls_exit(vmcs12));
+
+		kvm_inject_gp(vcpu, 0);
+	}
+	else {
+		/*
+		 * Which means we didn't expose SGX to L1 at all. Inject
+		 * #UD according to SDM.
+		 */
+		kvm_queue_exception(vcpu, UD_VECTOR);
+	}
+
+	return 1;
+}
+
+/*
+ * Handle an ENCLS VMEXIT due to unexpected ENCLS in the guest, i.e.
+ * executing ENCLS when SGX is not exposed to the guest, or when SGX is
+ * disabled in the guest's BIOS.
+ *
+ * Return 1 if handled, 0 if not handled
+ */
+static int handle_unexpected_encls(struct kvm_vcpu *vcpu)
+{
+	if (guest_cpuid_has_sgx(vcpu) && vmx_sgx_enabled_in_bios(vcpu))
+		return 0;
+
+	if (!guest_cpuid_has_sgx(vcpu))
+		kvm_queue_exception(vcpu, UD_VECTOR);
+	else	/* !vmx_sgx_enabled_in_bios(vcpu)) */
+		kvm_inject_gp(vcpu, 0);
+
+	kvm_x86_ops->skip_emulated_instruction(vcpu);
+
+	return 1;
+}
+
+static int handle_encls(struct kvm_vcpu *vcpu)
+{
+	/* Handle ENCLS VMEXIT from L2 */
+	if (is_guest_mode(vcpu))
+		return nested_handle_encls_exit(vcpu);
+
+	/*
+	 * Handle unexpected ENCLS VMEXIT. If successfully handled we can
+	 * just return to guest to run.
+	 */
+	if (handle_unexpected_encls(vcpu))
+		return 1;
+
+	/* So far ENCLS is not trapped in normal cases. */
+	return -EFAULT;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -8043,6 +8252,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
 	[EXIT_REASON_PML_FULL]		      = handle_pml_full,
 	[EXIT_REASON_PREEMPTION_TIMER]	      = handle_preemption_timer,
+	[EXIT_REASON_ENCLS]		      = handle_encls,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -8356,6 +8566,43 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 	case EXIT_REASON_PML_FULL:
 		/* We don't expose PML support to L1. */
 		return false;
+	case EXIT_REASON_ENCLS:
+		/*
+		 * So far we don't trap ENCLS in normal case (meaning SGX is
+		 * exposed to guest and SGX is enabled in guest's BIOS).
+		 * If SGX is enabled in L1 hypervisor properly, L1 hypervisor
+		 * should take care of this ENCLS VMEXIT, otherwise L0
+		 * hypervisor should handle this ENCLS VMEXIT and inject proper
+		 * error (#UD or #GP) according to ENCLS behavior in abnormal
+		 * SGX environment.
+		 */
+		if (guest_cpuid_has_sgx(vcpu) &&
+				vmx_sgx_enabled_in_bios(vcpu)) {
+			/*
+			 * As explained above, if SGX in L1 hypervisor is
+			 * normal, ENCLS VMEXIT from L2 guest should be due
+			 * to L1 turned on ENCLS VMEXIT, as L0 won't turn on
+			 * ENCLS VMEXIT in this case. We don't want to handle
+			 * this case in L0 as we really don't know how to,
+			 * and instead, we depend on L1 hypervisor to handle.
+			 */
+			WARN_ON(!nested_cpu_has_encls_exit(vmcs12));
+			return true;
+		}
+		else if (guest_cpuid_has_sgx(vcpu)) {
+			/*
+			 * If SGX is exposed to L1 but SGX is not turned on
+			 * in L1's BIOS, then L1 may or may not turn on ENCLS
+			 * VMEXIT. If ENCLS VMEXIT is turned on in L1, VMEXIT
+			 * happens prior to FEATURE_CONTROL check, so we inject
+			 * ENCLS VMEXIT to L1. Otherwise we let L0 inject #GP
+			 * directly to L2.
+			 */
+			return nested_cpu_has_encls_exit(vmcs12);
+		}
+		else {
+			return false;
+		}
 	default:
 		return true;
 	}
@@ -9743,8 +9990,19 @@ static int vmx_cpuid_update(struct kvm_vcpu *vcpu)
 		if (guest_cpuid_has_sgx_launch_control(vcpu))
 			to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
 				FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE;
+
+		/*
+		 * To reflect hardware behavior, we must allow the guest to
+		 * set ENCLS exiting if we expose SGX to the guest.
+		 */
+		if (nested_vmx_allowed(vcpu))
+			to_vmx(vcpu)->nested.nested_vmx_secondary_ctls_high |=
+				SECONDARY_EXEC_ENCLS_EXITING;
 	}
 
+	/* SGX CPUID may be changed */
+	vmx_update_encls_vmexit(vcpu);
+
 	return 0;
 }
 
@@ -10491,6 +10749,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)
 			vmcs_write64(APIC_ACCESS_ADDR, -1ull);
 
+		/* If L1 has turned on ENCLS vmexit, we need to honor that. */
+		if (nested_cpu_has_encls_exit(vmcs12)) {
+			exec_control |= SECONDARY_EXEC_ENCLS_EXITING;
+			vmcs_write64(ENCLS_EXITING_BITMAP,
+					vmcs12->encls_exiting_bitmap);
+		}
+
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave
  2017-05-08  5:24 [RFC PATCH 00/10] Basic KVM SGX Virtualization support Kai Huang
                   ` (8 preceding siblings ...)
  2017-05-08  5:24 ` [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT Kai Huang
@ 2017-05-08  5:24 ` Kai Huang
  2017-05-08  8:22   ` Paolo Bonzini
  2017-05-08  5:24 ` [PATCH 11/11] kvm: vmx: workaround FEATURE_CONTROL[17] is not set by BIOS Kai Huang
  10 siblings, 1 reply; 78+ messages in thread
From: Kai Huang @ 2017-05-08  5:24 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm

VMX adds a new bit to both exit_reason and the guest interruptibility state
to indicate whether a VMEXIT happened inside an enclave. Several instructions
are also invalid or behave differently inside an enclave according to the
SDM. This patch handles those cases.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/vmx.h      |   1 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/vmx.c              | 120 +++++++++++++++++++++++++++++++++-------
 3 files changed, 103 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 2f24290b7f9d..ec91f68f4511 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -351,6 +351,7 @@ enum vmcs_field {
 #define GUEST_INTR_STATE_MOV_SS		0x00000002
 #define GUEST_INTR_STATE_SMI		0x00000004
 #define GUEST_INTR_STATE_NMI		0x00000008
+#define GUEST_INTR_STATE_ENCLAVE_INTR	0x00000010
 
 /* GUEST_ACTIVITY_STATE flags */
 #define GUEST_ACTIVITY_ACTIVE		0
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 2bcd967d5c83..6f18898c003d 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -26,6 +26,7 @@
 
 
 #define VMX_EXIT_REASONS_FAILED_VMENTRY         0x80000000
+#define VMX_EXIT_REASON_FROM_ENCLAVE		0x08000000
 
 #define EXIT_REASON_EXCEPTION_NMI       0
 #define EXIT_REASON_EXTERNAL_INTERRUPT  1
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b5f37982e975..1022295ba925 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2628,6 +2628,24 @@ static void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
 		vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, interruptibility);
 }
 
+static bool vmx_exit_from_enclave(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * We have 2 bits to indicate whether a VMEXIT happened from an
+	 * enclave -- bit 27 in VM_EXIT_REASON, and bit 4 in
+	 * GUEST_INTERRUPTIBILITY_INFO. Currently we use the latter. Note
+	 * that we never clear this bit; we assume hardware clears it when a
+	 * VMEXIT happens outside an enclave, which should be the case.
+	 *
+	 * Alternatively, we could use bit 27 in VM_EXIT_REASON, by adding a
+	 * bool in vmx, setting it in vmx_handle_exit when that bit is set,
+	 * and clearing it right before vmentry to the guest.
+	 */
+	return vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+		GUEST_INTR_STATE_ENCLAVE_INTR ? true : false;
+}
+
 static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	unsigned long rip;
@@ -5457,6 +5475,25 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	return exec_control;
 }
 
+static void vmcs_set_secondary_exec_control(u32 new_ctl)
+{
+	/*
+	 * These bits in the secondary execution controls field
+	 * are dynamic, the others are mostly based on the hypervisor
+	 * architecture and the guest's CPUID.  Do not touch the
+	 * dynamic bits.
+	 */
+	u32 mask =
+		SECONDARY_EXEC_SHADOW_VMCS |
+		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+
+	u32 cur_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+
+	vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
+		     (new_ctl & ~mask) | (cur_ctl & mask));
+}
+
 static void ept_set_mmio_spte_mask(void)
 {
 	/*
@@ -6305,6 +6342,12 @@ static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 
 static int handle_cpuid(struct kvm_vcpu *vcpu)
 {
+	/* CPUID is invalid in enclave */
+	if (vmx_exit_from_enclave(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
 	return kvm_emulate_cpuid(vcpu);
 }
 
@@ -6378,6 +6421,16 @@ static int handle_vmcall(struct kvm_vcpu *vcpu)
 
 static int handle_invd(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * SDM 39.6.5 INVD Handling when Enclaves Are Enabled.
+	 *
+	 * Spec says INVD causes #GP if EPC is enabled.
+	 */
+	if (vmx_exit_from_enclave(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
@@ -6399,6 +6452,18 @@ static int handle_rdpmc(struct kvm_vcpu *vcpu)
 
 static int handle_wbinvd(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * SDM 39.6.5 INVD Handling when Enclaves Are Enabled.
+	 *
+	 * Spec says INVD causes #GP if EPC is enabled.
+	 *
+	 * FIXME: Does this also apply to WBINVD?
+	 */
+	if (vmx_exit_from_enclave(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
 	return kvm_emulate_wbinvd(vcpu);
 }
 
@@ -6977,6 +7042,31 @@ static __exit void hardware_unsetup(void)
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * SDM 39.6.3 PAUSE Instruction.
+	 *
+	 * SDM suggests, if VMEXIT caused by 'PAUSE-loop exiting', VMM should
+	 * disable 'PAUSE-loop exiting' so PAUSE can be executed in Enclave
+	 * again without further PAUSE-looping VMEXIT.
+	 *
+	 * SDM suggests, if VMEXIT caused by 'PAUSE exiting', VMM should disable
+	 * 'PAUSE exiting' so PAUSE can be executed in Enclave again without
+	 * further PAUSE VMEXIT.
+	 */
+	if (vmx_exit_from_enclave(vcpu)) {
+		u32 exec_ctl, secondary_exec_ctl;
+
+		exec_ctl = vmx_exec_control(to_vmx(vcpu));
+		exec_ctl &= ~CPU_BASED_PAUSE_EXITING;
+		vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_ctl);
+
+		secondary_exec_ctl = vmx_secondary_exec_control(to_vmx(vcpu));
+		secondary_exec_ctl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+		vmcs_set_secondary_exec_control(secondary_exec_ctl);
+
+		return 1;
+	}
+
 	if (ple_gap)
 		grow_ple_window(vcpu);
 
@@ -8876,6 +8966,17 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
+	/* Bit 27 of exit_reason will be set if the VMEXIT is from an SGX enclave. */
+	if (exit_reason & VMX_EXIT_REASON_FROM_ENCLAVE) {
+		/*
+		 * Need to clear bit 27, otherwise the subsequent lookup in
+		 * kvm_vmx_exit_handlers would fail. From here on we rely on
+		 * bit 4 of GUEST_INTERRUPTIBILITY_INFO to determine whether
+		 * the VMEXIT came from an enclave.
+		exit_reason &= ~VMX_EXIT_REASON_FROM_ENCLAVE;
+	}
+
 	/*
 	 * Note:
 	 * Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
@@ -9768,25 +9869,6 @@ static int vmx_get_lpage_level(void)
 		return PT_PDPE_LEVEL;
 }
 
-static void vmcs_set_secondary_exec_control(u32 new_ctl)
-{
-	/*
-	 * These bits in the secondary execution controls field
-	 * are dynamic, the others are mostly based on the hypervisor
-	 * architecture and the guest's CPUID.  Do not touch the
-	 * dynamic bits.
-	 */
-	u32 mask =
-		SECONDARY_EXEC_SHADOW_VMCS |
-		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
-		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-
-	u32 cur_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
-
-	vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
-		     (new_ctl & ~mask) | (cur_ctl & mask));
-}
-
 /*
  * Generate MSR_IA32_VMX_CR{0,4}_FIXED1 according to CPUID. Only set bits
  * (indicating "allowed-1") if they are supported in the guest's CPUID.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 11/11] kvm: vmx: workaround FEATURE_CONTROL[17] is not set by BIOS
  2017-05-08  5:24 [RFC PATCH 00/10] Basic KVM SGX Virtualization support Kai Huang
                   ` (9 preceding siblings ...)
  2017-05-08  5:24 ` [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave Kai Huang
@ 2017-05-08  5:24 ` Kai Huang
  2017-05-08  5:29   ` Huang, Kai
  10 siblings, 1 reply; 78+ messages in thread
From: Kai Huang @ 2017-05-08  5:24 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm

Even if this bit is not set by BIOS, the current ucode patch allows writes
to IA32_SGXLEPUBKEYHASHn.
---
 arch/x86/kvm/vmx.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1022295ba925..9e687ce45b48 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2254,12 +2254,20 @@ static void decache_tsc_multiplier(struct vcpu_vmx *vmx)
 	vmcs_write64(TSC_MULTIPLIER, vmx->current_tsc_ratio);
 }
 
+#define	UCODE_PATCH
 static bool cpu_sgx_lepubkeyhash_writable(void)
 {
 	u64 val, sgx_lc_enabled_mask = (FEATURE_CONTROL_LOCKED |
 			FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE);
 
 	rdmsrl(MSR_IA32_FEATURE_CONTROL, val);
+#ifdef UCODE_PATCH
+	/*
+	 * current ucode patch can support write to IA32_SGXLEPUBKEYHASHn
+	 * even if FEATURE_CONTROL[17] is not set.
+	 */
+	val |=  FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE;
+#endif
 
 	return ((val & sgx_lc_enabled_mask) == sgx_lc_enabled_mask);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] kvm: vmx: workaround FEATURE_CONTROL[17] is not set by BIOS
  2017-05-08  5:24 ` [PATCH 11/11] kvm: vmx: workaround FEATURE_CONTROL[17] is not set by BIOS Kai Huang
@ 2017-05-08  5:29   ` Huang, Kai
  0 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-08  5:29 UTC (permalink / raw)
  To: Kai Huang, pbonzini, rkrcmar, kvm

Oops.. Please ignore this patch :)

Thanks,
-Kai

On 5/8/2017 5:24 PM, Kai Huang wrote:
> Even if this bit is not set by BIOS, the current ucode patch allows writes
> to IA32_SGXLEPUBKEYHASHn.
> ---
>  arch/x86/kvm/vmx.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1022295ba925..9e687ce45b48 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2254,12 +2254,20 @@ static void decache_tsc_multiplier(struct vcpu_vmx *vmx)
>  	vmcs_write64(TSC_MULTIPLIER, vmx->current_tsc_ratio);
>  }
>
> +#define	UCODE_PATCH
>  static bool cpu_sgx_lepubkeyhash_writable(void)
>  {
>  	u64 val, sgx_lc_enabled_mask = (FEATURE_CONTROL_LOCKED |
>  			FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE);
>
>  	rdmsrl(MSR_IA32_FEATURE_CONTROL, val);
> +#ifdef UCODE_PATCH
> +	/*
> +	 * current ucode patch can support write to IA32_SGXLEPUBKEYHASHn
> +	 * even if FEATURE_CONTROL[17] is not set.
> +	 */
> +	val |=  FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE;
> +#endif
>
>  	return ((val & sgx_lc_enabled_mask) == sgx_lc_enabled_mask);
>  }
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT
  2017-05-08  5:24 ` [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT Kai Huang
@ 2017-05-08  8:08   ` Paolo Bonzini
  2017-05-10  1:30     ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2017-05-08  8:08 UTC (permalink / raw)
  To: Kai Huang, rkrcmar, kvm



On 08/05/2017 07:24, Kai Huang wrote:
> This patch handles ENCLS VMEXIT. ENCLS VMEXIT doesn't need to be always turned
> on, actually it should not be turned on in most cases, as guest can run ENCLS
> perfectly in non-root mode. However there are some cases we need to trap ENCLS
> and emulate as in those cases ENCLS in guest may behavor differently with
> in native (for example, when hardware supports SGX but SGX is not exposed to
> guest, and if guest runs ENCLS deliberately, it may have different behavior to
> on native).
> 
> In case of nested SGX support, we need to turn on ENCLS VMEXIT if L1 hypervisor
> has turned on ENCLS VMEXIT, and such ENCLS VMEXIT from L2 (nested guest) will
> be handled by L1 hypervisor.
> 
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> ---
>  arch/x86/include/asm/vmx.h      |   2 +
>  arch/x86/include/uapi/asm/vmx.h |   4 +-
>  arch/x86/kvm/vmx.c              | 265 ++++++++++++++++++++++++++++++++++++++++

Please try to move more code to sgx.c.

Paolo

>  3 files changed, 270 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index f7ac249ce83d..2f24290b7f9d 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -202,6 +202,8 @@ enum vmcs_field {
>  	XSS_EXIT_BITMAP_HIGH            = 0x0000202D,
>  	TSC_MULTIPLIER                  = 0x00002032,
>  	TSC_MULTIPLIER_HIGH             = 0x00002033,
> +	ENCLS_EXITING_BITMAP		= 0x0000202E,
> +	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
> index 14458658e988..2bcd967d5c83 100644
> --- a/arch/x86/include/uapi/asm/vmx.h
> +++ b/arch/x86/include/uapi/asm/vmx.h
> @@ -77,6 +77,7 @@
>  #define EXIT_REASON_XSETBV              55
>  #define EXIT_REASON_APIC_WRITE          56
>  #define EXIT_REASON_INVPCID             58
> +#define EXIT_REASON_ENCLS		60
>  #define EXIT_REASON_PML_FULL            62
>  #define EXIT_REASON_XSAVES              63
>  #define EXIT_REASON_XRSTORS             64
> @@ -130,7 +131,8 @@
>  	{ EXIT_REASON_INVVPID,               "INVVPID" }, \
>  	{ EXIT_REASON_INVPCID,               "INVPCID" }, \
>  	{ EXIT_REASON_XSAVES,                "XSAVES" }, \
> -	{ EXIT_REASON_XRSTORS,               "XRSTORS" }
> +	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
> +	{ EXIT_REASON_ENCLS,		     "ENCLS" }
>  
>  #define VMX_ABORT_SAVE_GUEST_MSR_FAIL        1
>  #define VMX_ABORT_LOAD_HOST_PDPTE_FAIL       2

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave
  2017-05-08  5:24 ` [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave Kai Huang
@ 2017-05-08  8:22   ` Paolo Bonzini
  2017-05-11  9:34     ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2017-05-08  8:22 UTC (permalink / raw)
  To: Kai Huang, rkrcmar, kvm



On 08/05/2017 07:24, Kai Huang wrote:
> @@ -6977,6 +7042,31 @@ static __exit void hardware_unsetup(void)
>   */
>  static int handle_pause(struct kvm_vcpu *vcpu)
>  {
> +	/*
> +	 * SDM 39.6.3 PAUSE Instruction.
> +	 *
> +	 * SDM suggests, if VMEXIT caused by 'PAUSE-loop exiting', VMM should
> +	 * disable 'PAUSE-loop exiting' so PAUSE can be executed in Enclave
> +	 * again without further PAUSE-looping VMEXIT.
> +	 *
> +	 * SDM suggests, if VMEXIT caused by 'PAUSE exiting', VMM should disable
> +	 * 'PAUSE exiting' so PAUSE can be executed in Enclave again without
> +	 * further PAUSE VMEXIT.
> +	 */

How is PLE re-enabled?

I don't understand the interaction of the internal control registers
(paragraph 41.1.4) with VMX.  How can you migrate the VM between EENTER
and EEXIT?

In addition, paragraph 41.1.4 does not include the parts of CR_SAVE_FS*
and CR_SAVE_GS* (base, limit, access rights) and does not include
CR_ENCLAVE_ENTRY_IP.

Paolo

> +	if (vmx_exit_from_enclave(vcpu)) {
> +		u32 exec_ctl, secondary_exec_ctl;
> +
> +		exec_ctl = vmx_exec_control(to_vmx(vcpu));
> +		exec_ctl &= ~CPU_BASED_PAUSE_EXITING;
> +		vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_ctl);
> +
> +		secondary_exec_ctl = vmx_secondary_exec_control(to_vmx(vcpu));
> +		secondary_exec_ctl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> +		vmcs_set_secondary_exec_control(secondary_exec_ctl);
> +
> +		return 1;
> +	}
> +

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT
  2017-05-08  8:08   ` Paolo Bonzini
@ 2017-05-10  1:30     ` Huang, Kai
  0 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-10  1:30 UTC (permalink / raw)
  To: Paolo Bonzini, Kai Huang, rkrcmar, kvm



On 5/8/2017 8:08 PM, Paolo Bonzini wrote:
>
>
> On 08/05/2017 07:24, Kai Huang wrote:
>> This patch handles ENCLS VMEXIT. ENCLS VMEXIT doesn't need to be always turned
>> on, actually it should not be turned on in most cases, as guest can run ENCLS
>> perfectly in non-root mode. However there are some cases we need to trap ENCLS
>> and emulate as in those cases ENCLS in guest may behavor differently with
>> in native (for example, when hardware supports SGX but SGX is not exposed to
>> guest, and if guest runs ENCLS deliberately, it may have different behavior to
>> on native).
>>
>> In case of nested SGX support, we need to turn on ENCLS VMEXIT if L1 hypervisor
>> has turned on ENCLS VMEXIT, and such ENCLS VMEXIT from L2 (nested guest) will
>> be handled by L1 hypervisor.
>>
>> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
>> ---
>>  arch/x86/include/asm/vmx.h      |   2 +
>>  arch/x86/include/uapi/asm/vmx.h |   4 +-
>>  arch/x86/kvm/vmx.c              | 265 ++++++++++++++++++++++++++++++++++++++++
>
> Please try to move more code to sgx.c.

Hi Paolo,

Thanks for comments. Will try to do this in next version.

Thanks,
-Kai
>
> Paolo
>
>>  3 files changed, 270 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index f7ac249ce83d..2f24290b7f9d 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -202,6 +202,8 @@ enum vmcs_field {
>>  	XSS_EXIT_BITMAP_HIGH            = 0x0000202D,
>>  	TSC_MULTIPLIER                  = 0x00002032,
>>  	TSC_MULTIPLIER_HIGH             = 0x00002033,
>> +	ENCLS_EXITING_BITMAP		= 0x0000202E,
>> +	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
>>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>>  	VMCS_LINK_POINTER               = 0x00002800,
>> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
>> index 14458658e988..2bcd967d5c83 100644
>> --- a/arch/x86/include/uapi/asm/vmx.h
>> +++ b/arch/x86/include/uapi/asm/vmx.h
>> @@ -77,6 +77,7 @@
>>  #define EXIT_REASON_XSETBV              55
>>  #define EXIT_REASON_APIC_WRITE          56
>>  #define EXIT_REASON_INVPCID             58
>> +#define EXIT_REASON_ENCLS		60
>>  #define EXIT_REASON_PML_FULL            62
>>  #define EXIT_REASON_XSAVES              63
>>  #define EXIT_REASON_XRSTORS             64
>> @@ -130,7 +131,8 @@
>>  	{ EXIT_REASON_INVVPID,               "INVVPID" }, \
>>  	{ EXIT_REASON_INVPCID,               "INVPCID" }, \
>>  	{ EXIT_REASON_XSAVES,                "XSAVES" }, \
>> -	{ EXIT_REASON_XRSTORS,               "XRSTORS" }
>> +	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
>> +	{ EXIT_REASON_ENCLS,		     "ENCLS" }
>>
>>  #define VMX_ABORT_SAVE_GUEST_MSR_FAIL        1
>>  #define VMX_ABORT_LOAD_HOST_PDPTE_FAIL       2
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave
  2017-05-08  8:22   ` Paolo Bonzini
@ 2017-05-11  9:34     ` Huang, Kai
  2017-06-19  5:02       ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-05-11  9:34 UTC (permalink / raw)
  To: Paolo Bonzini, Kai Huang, rkrcmar, kvm



On 5/8/2017 8:22 PM, Paolo Bonzini wrote:
>
>
> On 08/05/2017 07:24, Kai Huang wrote:
>> @@ -6977,6 +7042,31 @@ static __exit void hardware_unsetup(void)
>>   */
>>  static int handle_pause(struct kvm_vcpu *vcpu)
>>  {
>> +	/*
>> +	 * SDM 39.6.3 PAUSE Instruction.
>> +	 *
>> +	 * SDM suggests, if VMEXIT caused by 'PAUSE-loop exiting', VMM should
>> +	 * disable 'PAUSE-loop exiting' so PAUSE can be executed in Enclave
>> +	 * again without further PAUSE-looping VMEXIT.
>> +	 *
>> +	 * SDM suggests, if VMEXIT caused by 'PAUSE exiting', VMM should disable
>> +	 * 'PAUSE exiting' so PAUSE can be executed in Enclave again without
>> +	 * further PAUSE VMEXIT.
>> +	 */
>
> How is PLE re-enabled?

Currently it will not be enabled again. Perhaps we could re-enable it on 
another VMEXIT, provided that VMEXIT is not a PLE VMEXIT?
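
A rough sketch of that idea (hypothetical, not part of this series; it
assumes the caller knows the exit reason and uses vmx_exit_from_enclave()
from patch 10):

static void vmx_maybe_reenable_pause_exiting(struct kvm_vcpu *vcpu,
		u32 exit_reason)
{
	u32 exec_ctl;

	/* Don't re-arm on PAUSE exits or while exiting from an enclave */
	if (exit_reason == EXIT_REASON_PAUSE_INSTRUCTION ||
	    vmx_exit_from_enclave(vcpu))
		return;

	exec_ctl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
	exec_ctl |= CPU_BASED_PAUSE_EXITING;
	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_ctl);

	/* SECONDARY_EXEC_PAUSE_LOOP_EXITING could be re-armed similarly */
}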

>
> I don't understand the interaction of the internal control registers
> (paragraph 41.1.4) with VMX.  How can you migrate the VM between EENTER
> and EEXIT?

Current SGX hardware architecture doesn't support live migration, as 
SGX's key architecture is not migratable. For example, some keys are 
persistent and bound to hardware (sealing and attestation). Therefore, 
right now, if SGX is exposed to a guest, live migration is not supported.

>
> In addition, paragraph 41.1.4 does not include the parts of CR_SAVE_FS*
> and CR_SAVE_GS* (base, limit, access rights) and does not include
> CR_ENCLAVE_ENTRY_IP.

The CPU can exit an enclave via EEXIT, or by an Asynchronous Enclave Exit 
(AEX). All non-EEXIT enclave exits are referred to as AEX. When an AEX 
happens, a so-called "synthetic state" is created on the CPU to prevent 
any software from observing *secrets* in the CPU state at AEX time. What 
exactly goes into the "synthetic state" is described in SDM 40.3.

So in my understanding, the CPU won't put something like 
"CR_ENCLAVE_ENTRY_IP" into RIP. Actually, during AEX, the Asynchronous 
Exit Pointer (AEP), which is in normal memory, is pushed onto the stack, 
and IRET returns to the AEP to continue running. The AEP typically points 
to a small piece of code which basically calls ERESUME so that we can go 
back into the enclave to run.
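
As an illustration only (hypothetical user-space code, simplified): on an
AEX the synthetic state already loads RAX with the ERESUME leaf (3), RBX
with the TCS address and RCX with the AEP itself, so the AEP trampoline
can be a bare ENCLU that re-enters the enclave:

/* Top-level asm: the AEP points at this label */
asm(".pushsection .text\n"
    ".global aep_trampoline\n"
    "aep_trampoline:\n\t"
    "enclu\n"		/* ENCLU[ERESUME] resumes the interrupted enclave */
    ".popsection\n");
extern void aep_trampoline(void);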

Hope my reply answered your questions?

Thanks,
-Kai

>
> Paolo
>
>> +	if (vmx_exit_from_enclave(vcpu)) {
>> +		u32 exec_ctl, secondary_exec_ctl;
>> +
>> +		exec_ctl = vmx_exec_control(to_vmx(vcpu));
>> +		exec_ctl &= ~CPU_BASED_PAUSE_EXITING;
>> +		vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_ctl);
>> +
>> +		secondary_exec_ctl = vmx_secondary_exec_control(to_vmx(vcpu));
>> +		secondary_exec_ctl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>> +		vmcs_set_secondary_exec_control(secondary_exec_ctl);
>> +
>> +		return 1;
>> +	}
>> +
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-08  5:24 ` [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support Kai Huang
@ 2017-05-12  0:32   ` Huang, Kai
  2017-05-12  3:28     ` [intel-sgx-kernel-dev] " Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-05-12  0:32 UTC (permalink / raw)
  To: Kai Huang, pbonzini, rkrcmar, kvm, intel-sgx-kernel-dev
  Cc: jarkko.sakkinen, sean.j.christopherson, haim.cohen, haitao.huang

Hi Paolo/Radim,

I'd like to start a discussion regarding IA32_SGXLEPUBKEYHASHn handling 
here. I also copied the SGX driver mailing list (which it looks like I 
should have done when sending out this series, sorry) and Sean, Haim and 
Haitao from Intel so we can have a better discussion.

Basically, IA32_SGXLEPUBKEYHASHn (or, more generally speaking, SGX 
Launch Control) allows us to run different Launch Enclaves (LEs) signed 
with different RSA keys. An LE can only be initialized successfully -- 
specifically, by EINIT -- when the value of IA32_SGXLEPUBKEYHASHn 
matches the key used to sign the LE. So before calling EINIT for an LE, 
we have to make sure IA32_SGXLEPUBKEYHASHn contains the matching value. 
One relevant fact is that only EINIT uses IA32_SGXLEPUBKEYHASHn; after 
EINIT, the other ENCLS/ENCLU leaf functions (e.g. EGETKEY) run 
correctly even if the MSRs are changed to other values.

To support KVM guests running their own LEs, KVM traps 
IA32_SGXLEPUBKEYHASHn MSR writes and keeps the values in the vcpu 
internally, and KVM needs to write the cached values to the real MSRs 
before the guest runs EINIT. The problem is that on the host side we 
also run LEs, probably multiple LEs (it seems the SGX driver currently 
plans to run a single in-kernel LE, but I am not familiar with the 
details, and IMO we should not assume the host will only run one LE). 
Therefore, if KVM changes the physical MSRs for a guest, the host may 
not be able to run its LE, as it may not write the right MSR values 
back. There are two approaches to make the host and KVM guests work 
together:

1. Anyone who wants to run an LE is responsible for writing the 
correct values to IA32_SGXLEPUBKEYHASHn.

My current patch is based on this assumption. For a KVM guest, 
naturally, we write the cached values to the real MSRs when the vcpu is 
scheduled in. For the host, the SGX driver should write its own values 
to the MSRs when it performs EINIT for an LE.

One argument against this approach is that a KVM guest should never 
have an impact on the host side, meaning the host should not be aware 
of such MSR changes. In that case, if the host does some performance 
optimization that doesn't actively update the MSRs, the physical MSRs 
may contain incorrect values when the host runs EINIT. Instead, KVM 
should be responsible for restoring the original MSRs, which brings us 
to approach 2 below.

2. KVM should restore the MSRs after changing them for the guest.

To do this, the simplest way for KVM is: 1) save the original physical 
MSR values and load the guest's values before VMENTRY; 2) on VMEXIT, 
write the original values back to the physical MSRs.
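
A rough sketch of what I mean (the host_sgxlepubkeyhash field below is 
hypothetical, just for illustration):

	/* Before VMENTRY: save host values, load guest values. */
	static void vmx_sgx_lepubkeyhash_swap_in(struct vcpu_vmx *vmx)
	{
		int i;

		for (i = 0; i < 4; i++) {
			rdmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i,
			       vmx->host_sgxlepubkeyhash[i]);
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i,
			       vmx->msr_ia32_sgxlepubkeyhash[i]);
		}
	}

	/* On VMEXIT: write the saved host values back. */
	static void vmx_sgx_lepubkeyhash_swap_out(struct vcpu_vmx *vmx)
	{
		int i;

		for (i = 0; i < 4; i++)
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i,
			       vmx->host_sgxlepubkeyhash[i]);
	}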

To me this approach is also arguable, as a KVM guest is actually just 
a normal process (OK, maybe not that normal), and a KVM guest should be 
treated the same as other processes that run LEs, which means approach 
1 is also reasonable.

And approach 2 has more performance impact than approach 1 for KVM, as 
it reads/writes IA32_SGXLEPUBKEYHASHn on each VMEXIT/VMENTRY, while 
approach 1 only writes the MSRs when a vcpu is scheduled in, which is 
less frequent.

I'd like to hear all your comments, and hopefully we can reach some 
agreement on this.

Another thing, not quite related to selecting an approach above: 
whichever approach we choose, KVM still suffers the performance loss of 
writing (and/or reading) the IA32_SGXLEPUBKEYHASHn MSRs, either when a 
vcpu is scheduled in or on each VMEXIT/VMENTRY. Given that 
IA32_SGXLEPUBKEYHASHn is only used by EINIT, we can actually optimize 
by trapping EINIT from the guest and only updating the MSRs on an EINIT 
VMEXIT. This works for approach 1, but for approach 2 we would have to 
do something tricky during VMEXIT/VMENTRY to check whether the MSRs 
have been changed by an EINIT VMEXIT, and only restore the original 
values if an EINIT VMEXIT has happened. The guest's LE continues to run 
even after the physical MSRs are changed back to the original values.

But trapping ENCLS requires either 1) KVM to run ENCLS on behalf of 
the guest, in which case we have to reconstruct and remap the guest's 
ENCLS parameters and skip the ENCLS for the guest; or 2) using MTF to 
let the guest run ENCLS again, while still trapping ENCLS. Either case 
would introduce more complicated code and potentially more bugs, and I 
don't think we should do this just to save the time of writing the 
MSRs. If we need to turn on ENCLS VMEXIT anyway, we can optimize this.

Thank you in advance.

Thanks,
-Kai

On 5/8/2017 5:24 PM, Kai Huang wrote:
> If SGX runtime launch control is enabled on the host (IA32_FEATURE_CONTROL[17]
> is set), KVM can support running multiple guests, each running an LE signed
> with a different RSA pubkey. KVM traps IA32_SGXLEPUBKEYHASHn MSR writes from
> the guest and keeps the values in the vcpu internally, and when the vcpu is
> scheduled in, KVM writes those values to the real IA32_SGXLEPUBKEYHASHn MSRs.
>
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> ---
>  arch/x86/include/asm/msr-index.h |   5 ++
>  arch/x86/kvm/vmx.c               | 123 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 128 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e3770f570bb9..70482b951b0f 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -417,6 +417,11 @@
>  #define MSR_IA32_TSC_ADJUST             0x0000003b
>  #define MSR_IA32_BNDCFGS		0x00000d90
>
> +#define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008c
> +#define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008d
> +#define MSR_IA32_SGXLEPUBKEYHASH2	0x0000008e
> +#define MSR_IA32_SGXLEPUBKEYHASH3	0x0000008f
> +
>  #define MSR_IA32_XSS			0x00000da0
>
>  #define FEATURE_CONTROL_LOCKED				(1<<0)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a16539594a99..c96332b9dd44 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -656,6 +656,9 @@ struct vcpu_vmx {
>  	 */
>  	u64 msr_ia32_feature_control;
>  	u64 msr_ia32_feature_control_valid_bits;
> +
> +	/* SGX Launch Control public key hash */
> +	u64 msr_ia32_sgxlepubkeyhash[4];
>  };
>
>  enum segment_cache_field {
> @@ -2244,6 +2247,70 @@ static void decache_tsc_multiplier(struct vcpu_vmx *vmx)
>  	vmcs_write64(TSC_MULTIPLIER, vmx->current_tsc_ratio);
>  }
>
> +static bool cpu_sgx_lepubkeyhash_writable(void)
> +{
> +	u64 val, sgx_lc_enabled_mask = (FEATURE_CONTROL_LOCKED |
> +			FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE);
> +
> +	rdmsrl(MSR_IA32_FEATURE_CONTROL, val);
> +
> +	return ((val & sgx_lc_enabled_mask) == sgx_lc_enabled_mask);
> +}
> +
> +static bool vmx_sgx_lc_disabled_in_bios(struct kvm_vcpu *vcpu)
> +{
> +	return (to_vmx(vcpu)->msr_ia32_feature_control & FEATURE_CONTROL_LOCKED)
> +		&& (!(to_vmx(vcpu)->msr_ia32_feature_control &
> +				FEATURE_CONTROL_SGX_LAUNCH_CONTROL_ENABLE));
> +}
> +
> +#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH0		0xa6053e051270b7ac
> +#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH1	        0x6cfbe8ba8b3b413d
> +#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH2		0xc4916d99f2b3735d
> +#define	SGX_INTEL_DEFAULT_LEPUBKEYHASH3		0xd4f8c05909f9bb3b
> +
> +static void vmx_sgx_init_lepubkeyhash(struct kvm_vcpu *vcpu)
> +{
> +	u64 h0, h1, h2, h3;
> +
> +	/*
> +	 * If runtime launch control is enabled (IA32_SGXLEPUBKEYHASHn is
> +	 * writable), we set guest's default value to be Intel's default
> +	 * hash (which is fixed value and can be hard-coded). Otherwise,
> +	 * guest can only use machine's IA32_SGXLEPUBKEYHASHn so set guest's
> +	 * default to that.
> +	 */
> +	if (cpu_sgx_lepubkeyhash_writable()) {
> +		h0 = SGX_INTEL_DEFAULT_LEPUBKEYHASH0;
> +		h1 = SGX_INTEL_DEFAULT_LEPUBKEYHASH1;
> +		h2 = SGX_INTEL_DEFAULT_LEPUBKEYHASH2;
> +		h3 = SGX_INTEL_DEFAULT_LEPUBKEYHASH3;
> +	}
> +	else {
> +		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH0, h0);
> +		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH1, h1);
> +		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH2, h2);
> +		rdmsrl(MSR_IA32_SGXLEPUBKEYHASH3, h3);
> +	}
> +
> +	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[0] = h0;
> +	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[1] = h1;
> +	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[2] = h2;
> +	to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[3] = h3;
> +}
> +
> +static void vmx_sgx_lepubkeyhash_load(struct kvm_vcpu *vcpu)
> +{
> +	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0,
> +			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[0]);
> +	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH1,
> +			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[1]);
> +	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH2,
> +			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[2]);
> +	wrmsrl(MSR_IA32_SGXLEPUBKEYHASH3,
> +			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[3]);
> +}
> +
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>   * vcpu mutex is already taken.
> @@ -2316,6 +2383,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>
>  	vmx_vcpu_pi_load(vcpu, cpu);
>  	vmx->host_pkru = read_pkru();
> +
> +	/*
> +	 * Load guset's SGX LE pubkey hash if runtime launch control is
> +	 * enabled.
> +	 */
> +	if (guest_cpuid_has_sgx_launch_control(vcpu) &&
> +			cpu_sgx_lepubkeyhash_writable())
> +		vmx_sgx_lepubkeyhash_load(vcpu);
>  }
>
>  static void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> @@ -3225,6 +3300,19 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_FEATURE_CONTROL:
>  		msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
>  		break;
> +	case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
> +		/*
> +		 * SDM 35.1 Model-Specific Registers, table 35-2.
> +		 * Read permitted if CPUID.0x12.0:EAX[0] = 1. (We have
> +		 * guaranteed this will be true if guest_cpuid_has_sgx
> +		 * is true.)
> +		 */
> +		if (!guest_cpuid_has_sgx(vcpu))
> +			return 1;
> +		msr_info->data =
> +			to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[msr_info->index -
> +			MSR_IA32_SGXLEPUBKEYHASH0];
> +		break;
>  	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
>  		if (!nested_vmx_allowed(vcpu))
>  			return 1;
> @@ -3344,6 +3432,37 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		 * SGX has been enabled in BIOS before using SGX.
>  		 */
>  		break;
> +	case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
> +		/*
> +		 * SDM 35.1 Model-Specific Registers, table 35-2.
> +		 * - If CPUID.0x7.0:ECX[30] = 1, FEATURE_CONTROL[17] is
> +		 * available.
> +		 * - Write permitted if CPUID.0x12.0:EAX[0] = 1 &&
> +		 * FEATURE_CONTROL[17] = 1 && FEATURE_CONTROL[0] = 1.
> +		 */
> +		if (!guest_cpuid_has_sgx(vcpu) ||
> +				!guest_cpuid_has_sgx_launch_control(vcpu))
> +			return 1;
> +		/*
> +		 * Don't let userspace set guest's IA32_SGXLEPUBKEYHASHn,
> +		 * if machine's IA32_SGXLEPUBKEYHASHn cannot be changed at
> +		 * runtime. Note to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash are
> +		 * set to default in vmx_create_vcpu therefore guest is able
> +		 * to get the machine's IA32_SGXLEPUBKEYHASHn by rdmsr in
> +		 * guest.
> +		 */
> +		if (!cpu_sgx_lepubkeyhash_writable())
> +			return 1;
> +		/*
> +		 * If guest's FEATURE_CONTROL[17] is not set, guest's
> +		 * IA32_SGXLEPUBKEYHASHn are not writeable from guest.
> +		 */
> +		if (!vmx_sgx_lc_disabled_in_bios(vcpu) &&
> +				!msr_info->host_initiated)
> +			return 1;
> +		to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash[msr_index -
> +			MSR_IA32_SGXLEPUBKEYHASH0] = data;
> +		break;
>  	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
>  		if (!msr_info->host_initiated)
>  			return 1; /* they are read-only */
> @@ -9305,6 +9424,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>  		vmx->nested.vpid02 = allocate_vpid();
>  	}
>
> +	/* Set vcpu's default IA32_SGXLEPUBKEYHASHn */
> +	if (enable_sgx && boot_cpu_has(X86_FEATURE_SGX_LAUNCH_CONTROL))
> +		vmx_sgx_init_lepubkeyhash(&vmx->vcpu);
> +
>  	vmx->nested.posted_intr_nv = -1;
>  	vmx->nested.current_vmptr = -1ull;
>  	vmx->nested.current_vmcs12 = NULL;
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  0:32   ` Huang, Kai
@ 2017-05-12  3:28     ` Andy Lutomirski
  2017-05-12  4:56       ` Huang, Kai
  2017-05-15 12:46       ` Jarkko Sakkinen
  0 siblings, 2 replies; 78+ messages in thread
From: Andy Lutomirski @ 2017-05-12  3:28 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, haim.cohen

[resending due to some kind of kernel.org glitch -- sorry if anyone
gets duplicates]

On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> My current patch is based on this assumption. For KVM guest, naturally, we
> will write the cached value to real MSRs when vcpu is scheduled in. For
> host, SGX driver should write its own value to MSRs when it performs EINIT
> for LE.

This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
would propose a totally different solution:

Have a percpu variable that stores the current SGXLEPUBKEYHASH along
with whatever lock is needed (probably just a mutex).  Users of EINIT
will take the mutex, compare the percpu variable to the desired value,
and, if it's different, do WRMSR and update the percpu variable.
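
Roughly, and completely untested (all names made up; callers hold the 
mutex so the compare-and-write plus the subsequent EINIT are atomic 
with respect to other EINIT users, and preemption must be off so the 
percpu cache matches the CPU doing EINIT; the cache would be primed 
from the real MSRs at boot):

	static DEFINE_PER_CPU(u64 [4], sgx_lepubkeyhash_cache);
	static DEFINE_MUTEX(sgx_lepubkeyhash_lock);

	/* Call with sgx_lepubkeyhash_lock held, right before EINIT. */
	static void sgx_update_lepubkeyhash(const u64 *hash)
	{
		u64 *cache = this_cpu_ptr(sgx_lepubkeyhash_cache);
		int i;

		for (i = 0; i < 4; i++) {
			if (cache[i] != hash[i]) {
				wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i,
				       hash[i]);
				cache[i] = hash[i];
			}
		}
	}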

KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
support the same handling as the host.  There is no action required at
all on KVM guest entry and exit.

FWIW, I think that KVM will, in the long run, want to trap EINIT for
other reasons: someone is going to want to implement policy for what
enclaves are allowed that applies to guests as well as the host.
Also, some day Intel may fix its architectural design flaw [1] by
allowing EINIT to personalize the enclave's keying, and, if it's done
by a new argument to EINIT instead of an MSR, KVM will have to trap
EINIT to handle it.

>
> One argument against this approach is KVM guest should never have impact on
> host side, meaning host should not be aware of such MSR change

As a somewhat generic comment, I don't like this approach to KVM
development.  KVM mucks with lots of important architectural control
registers, and, in all too many cases, it tries to do so independently
of the other arch/x86 code.  This ends up causing all kinds of grief.

Can't KVM and the real x86 arch code cooperate for real?  The host and
the KVM code are in arch/x86 in the same source tree.

>
> 2. KVM should restore MSRs after changing for guest.

No, this is IMO silly.  Don't restore it on each exit, in the user
return hook, or anywhere else.  Just make sure the host knows it was
changed.

> Another thing is, not quite related to selecting which approach above, and
> either we choose approach 1 or approach 2, KVM still suffers the performance
> loss of writing (and/or reading) to IA32_SGXLEPUBKEYHASHn MSRs, either when
> vcpu scheduled in or during each VMEXIT/VMENTRY. Given the fact that the
> IA32_SGXLEPUBKEYHASHn will only be used by EINIT, We can actually do some
> optimization by trapping EINIT from guest and only update MSRs in EINIT
> VMEXIT.

Yep.

> But trapping ENCLS requires either 1) KVM to run ENCLS on behalf of guest,
> in which case we have to reconstruct and remap guest's ENCLS parameters and
> skip the ENCLS for guest; 2) using MTF to let guest to run ENCLS again,
> while still trapping ENCLS.

I would advocate for the former approach.  (But you can't remap the
parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
don't see why this is any more complicated than emulating any other
instruction that accesses memory.)

If necessary for some reason, trap EINIT when the SGXLEPUBKEYHASH is
wrong and then clear the exit flag once the MSRs are in sync.  You'll
need to be careful to avoid races in which the host's value leaks into
the guest.  I think you'll find that this is more complicated, less
flexible, and less performant than just handling ENCLS[EINIT] directly
in the host.

[1] Guests that steal sealed data from each other or from the host can
manipulate that data without compromising the hypervisor by simply
loading the same enclave that its rightful owner would use.  If you're
trying to use SGX to protect your crypto credentials so that, if
stolen, they can't be used outside the guest, I would consider this to
be a major flaw.  It breaks the security model in a multi-tenant cloud
situation.  I've complained about it before.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  3:28     ` [intel-sgx-kernel-dev] " Andy Lutomirski
@ 2017-05-12  4:56       ` Huang, Kai
  2017-05-12  6:11         ` Andy Lutomirski
  2017-05-15 12:46       ` Jarkko Sakkinen
  1 sibling, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-05-12  4:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, haim.cohen

Hi Andy,

Thanks for your helpful comments! See my reply below, and some questions 
as well.

On 5/12/2017 3:28 PM, Andy Lutomirski wrote:
> [resending due to some kind of kernel.org glitch -- sorry if anyone
> gets duplicates]
>
> On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>> My current patch is based on this assumption. For KVM guest, naturally, we
>> will write the cached value to real MSRs when vcpu is scheduled in. For
>> host, SGX driver should write its own value to MSRs when it performs EINIT
>> for LE.
>
> This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
> would propose a totally different solution:

I am not sure whether the cost of writing to 4 MSRs would be 
*extremely* slow, as when a vcpu is scheduled in, KVM is already doing 
vmcs_load, writing to several MSRs, etc.

>
> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> with whatever lock is needed (probably just a mutex).  Users of EINIT
> will take the mutex, compare the percpu variable to the desired value,
> and, if it's different, do WRMSR and update the percpu variable.
>
> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
> support the same handling as the host.  There is no action required at
> all on KVM guest entry and exit.

This is doable, but the SGX driver needs to do those things and expose 
interfaces for KVM to use. In terms of the percpu data, it is nice to 
have, but I am not sure whether it is mandatory, as IMO EINIT is not 
even on a performance-critical path. We can simply read the old values 
out of the MSRs and compare whether the old equals the new.

>
> FWIW, I think that KVM will, in the long run, want to trap EINIT for
> other reasons: someone is going to want to implement policy for what
> enclaves are allowed that applies to guests as well as the host.

I am not very convinced why "what enclaves are allowed" on the host 
would apply to the guest. Can you elaborate? I mean, in general, 
virtualization just focuses on emulating hardware behavior. If a native 
machine is able to run any LE, the virtual machine should be able to as 
well (of course, with the guest's IA32_FEATURE_CONTROL[bit 17] set).

> Also, some day Intel may fix its architectural design flaw [1] by
> allowing EINIT to personalize the enclave's keying, and, if it's done
> by a new argument to EINIT instead of an MSR, KVM will have to trap
> EINIT to handle it.

This flaw looks like a different issue from the one above (host 
enclave policy applying to the guest)?

>
>>
>> One argument against this approach is KVM guest should never have impact on
>> host side, meaning host should not be aware of such MSR change
>
> As a somewhat generic comment, I don't like this approach to KVM
> development.  KVM mucks with lots of important architectural control
> registers, and, in all too many cases, it tries to do so independently
> of the other arch/x86 code.  This ends up causing all kinds of grief.
>
> Can't KVM and the real x86 arch code cooperate for real?  The host and
> the KVM code are in arch/x86 in the same source tree.

Currently the host-side SGX driver, which is pretty much 
self-contained, implements all SGX related stuff.

>
>>
>> 2. KVM should restore MSRs after changing for guest.
>
> No, this is IMO silly.  Don't restore it on each exit, in the user
> return hook, or anywhere else.  Just make sure the host knows it was
> changed.
>
>> Another thing is, not quite related to selecting which approach above, and
>> either we choose approach 1 or approach 2, KVM still suffers the performance
>> loss of writing (and/or reading) to IA32_SGXLEPUBKEYHASHn MSRs, either when
>> vcpu scheduled in or during each VMEXIT/VMENTRY. Given the fact that the
>> IA32_SGXLEPUBKEYHASHn will only be used by EINIT, We can actually do some
>> optimization by trapping EINIT from guest and only update MSRs in EINIT
>> VMEXIT.
>
> Yep.
>
>> But trapping ENCLS requires either 1) KVM to run ENCLS on behalf of guest,
>> in which case we have to reconstruct and remap guest's ENCLS parameters and
>> skip the ENCLS for guest; 2) using MTF to let guest to run ENCLS again,
>> while still trapping ENCLS.
>
> I would advocate for the former approach.  (But you can't remap the
> parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
> don't see why this is any more complicated than emulating any other
> instruction that accesses memory.)

No, you cannot just copy: all addresses in the guest's ENCLS 
parameters are guest virtual addresses, so we cannot use them to 
execute ENCLS in KVM. If any guest virtual address is used in the ENCLS 
parameters, for example PAGEINFO.SECS, PAGEINFO.SECINFO/PCMD, etc., you 
have to remap it to a KVM virtual address.

Btw, what is the TOCTOU issue? Would you also elaborate on the locking 
issue?

>
> If necessary for some reason, trap EINIT when the SGXLEPUBKEYHASH is
> wrong and then clear the exit flag once the MSRs are in sync.  You'll
> need to be careful to avoid races in which the host's value leaks into
> the guest.  I think you'll find that this is more complicated, less
> flexible, and less performant than just handling ENCLS[EINIT] directly
> in the host.

Sorry, I don't quite follow this part. Why would the host's value leak 
into the guest? I suppose the *value* means the host's 
IA32_SGXLEPUBKEYHASHn? The guest's MSR reads/writes are always trapped 
and emulated by KVM.

>
> [1] Guests that steal sealed data from each other or from the host can
> manipulate that data without compromising the hypervisor by simply
> loading the same enclave that its rightful owner would use.  If you're
> trying to use SGX to protect your crypto credentials so that, if
> stolen, they can't be used outside the guest, I would consider this to
> be a major flaw.  It breaks the security model in a multi-tenant cloud
> situation.  I've complained about it before.
>

It looks like potentially only the guest's IA32_SGXLEPUBKEYHASHn may 
be leaked? In that case, even if it is leaked, it seems we cannot dig 
anything out beyond the hash value itself?

Thanks,
-Kai

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  4:56       ` Huang, Kai
@ 2017-05-12  6:11         ` Andy Lutomirski
  2017-05-12 18:48           ` Christopherson, Sean J
                             ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Andy Lutomirski @ 2017-05-12  6:11 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> I am not sure whether the cost of writing to 4 MSRs would be *extremely*
> slow, as when vcpu is scheduled in, KVM is already doing vmcs_load, writing
> to several MSRs, etc.

I'm speculating that these MSRs may be rather unoptimized and hence
unusually slow.

>
>>
>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>> will take the mutex, compare the percpu variable to the desired value,
>> and, if it's different, do WRMSR and update the percpu variable.
>>
>> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
>> support the same handling as the host.  There is no action required at
>> all on KVM guest entry and exit.
>
>
> This is doable, but SGX driver needs to do those things and expose
> interfaces for KVM to use. In terms of the percpu data, it is nice to have,
> but I am not sure whether it is mandatory, as IMO EINIT is not even in
> performance critical path. We can simply read old value from MSRs out and
> compare whether the old equals to the new.

I think the SGX driver should probably live in arch/x86, and the
interface could be a simple percpu variable that is exported (from the
main kernel image, not from a module).

>
>>
>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>> other reasons: someone is going to want to implement policy for what
>> enclaves are allowed that applies to guests as well as the host.
>
>
> I am not very convinced why "what enclaves are allowed" in host would apply
> to guest. Can you elaborate? I mean in general virtualization just focus
> emulating hardware behavior. If a native machine is able to run any LE, the
> virtual machine should be able to as well (of course, with guest's
> IA32_FEATURE_CONTROL[bit 17] set).

I strongly disagree.  I can imagine two classes of sensible policies
for launch control:

1. Allow everything.  This seems quite sensible to me.

2. Allow some things, and make sure that VMs have at least as
restrictive a policy as host root has.  After all, what's the point of
restricting enclaves in the host if host code can simply spawn a
little VM to run otherwise-disallowed enclaves?

>
>> Also, some day Intel may fix its architectural design flaw [1] by
>> allowing EINIT to personalize the enclave's keying, and, if it's done
>> by a new argument to EINIT instead of an MSR, KVM will have to trap
>> EINIT to handle it.
>
>
> Looks this flaw is not the same issue as above (host enclave policy applies
> to guest)?

It's related.  Without this flaw, it might make sense to apply a looser
policy in the guest than in the host.  With this flaw, I think your
policy fails to have any real effect if you don't enforce it on
guests.

>
>>
>>>
>>> One argument against this approach is KVM guest should never have impact
>>> on
>>> host side, meaning host should not be aware of such MSR change
>>
>>
>> As a somewhat generic comment, I don't like this approach to KVM
>> development.  KVM mucks with lots of important architectural control
>> registers, and, in all too many cases, it tries to do so independently
>> of the other arch/x86 code.  This ends up causing all kinds of grief.
>>
>> Can't KVM and the real x86 arch code cooperate for real?  The host and
>> the KVM code are in arch/x86 in the same source tree.
>
>
> Currently on host SGX driver, which is pretty much self-contained,
> implements all SGX related stuff.

I will probably NAK this if it comes my way for inclusion upstream.
Just because it can be self-contained doesn't mean it should be
self-contained.

>>
>> I would advocate for the former approach.  (But you can't remap the
>> parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
>> don't see why this is any more complicated than emulating any other
>> instruction that accesses memory.)
>
>
> No you cannot just copy. Because all address in guest's ENCLS parameters are
> guest's virtual address, we cannot use them to execute ENCLS in KVM. If any
> guest virtual addresses is used in ENCLS parameters, for example,
> PAGEINFO.SECS, PAGEINFO.SECINFO/PCMD, etc, you have to remap them to KVM's
> virtual address.
>
> Btw, what is TOCTOU issue? would you also elaborate locking issue?

I was partially mis-remembering how this worked.  It looks like
SIGSTRUCT and EINITTOKEN could be copied but SECS would have to be
mapped.  If KVM applied some policy to the launchable enclaves, it
would want to make sure that it only looks at fields that are copied
to make sure that the enclave that gets launched is the one it
verified.  The locking issue I'm imagining is that the SECS (or
whatever else might be mapped) doesn't disappear and get reused for
something else while it's mapped in the host.  Presumably KVM has an
existing mechanism for this, but maybe SECS is special because it's
not quite normal memory IIRC.

>
>>
>> If necessary for some reason, trap EINIT when the SGXLEPUBKEYHASH is
>> wrong and then clear the exit flag once the MSRs are in sync.  You'll
>> need to be careful to avoid races in which the host's value leaks into
>> the guest.  I think you'll find that this is more complicated, less
>> flexible, and less performant than just handling ENCLS[EINIT] directly
>> in the host.
>
>
> Sorry I don't quite follow this part. Why would host's value leaks into
> guest? I suppose the *value* means host's IA32_SGXLEPUBKEYHASHn? guest's MSR
> read/write is always trapped and emulated by KVM.

You'd need to make sure that this sequence of events doesn't happen:

 - Guest does EINIT and it exits.
 - Host updates the MSRs and the ENCLS-exiting bitmap.
 - Guest is preempted before it retries EINIT.
 - A different host thread launches an enclave, thus changing the MSRs.
 - Guest resumes and runs EINIT without exiting with the wrong MSR values.
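
(For illustration, a rough, untested sketch of the direct-emulation 
variant that avoids this race by construction -- a fragment of KVM's 
EINIT exit handler, where sgx_update_lepubkeyhash() and the mutex are 
the assumed names from the percpu scheme above, __einit() stands in for 
the driver's ENCLS[EINIT] wrapper, and sigstruct/token have already 
been copied from guest memory and the SECS mapped:

	mutex_lock(&sgx_lepubkeyhash_lock);
	preempt_disable();
	sgx_update_lepubkeyhash(to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash);
	ret = __einit(sigstruct, token, secs);	/* ENCLS[EINIT] */
	preempt_enable();
	mutex_unlock(&sgx_lepubkeyhash_lock);

No host thread can change the MSRs between the WRMSRs and the EINIT, 
and the guest never runs EINIT without exiting.)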

>
>>
>> [1] Guests that steal sealed data from each other or from the host can
>> manipulate that data without compromising the hypervisor by simply
>> loading the same enclave that its rightful owner would use.  If you're
>> trying to use SGX to protect your crypto credentials so that, if
>> stolen, they can't be used outside the guest, I would consider this to
>> be a major flaw.  It breaks the security model in a multi-tenant cloud
>> situation.  I've complained about it before.
>>
>
> Looks potentially only guest's IA32_SGXLEPUBKEYHASHn may be leaked? In this
> case even it is leaked looks we cannot dig anything out just the hash value?

Not sure what you mean.  Are you asking about the lack of guest personalization?

Concretely, imagine I write an enclave that seals my TLS client
certificate's private key and offers an API to sign TLS certificate
requests with it.  This way, if my system is compromised, an attacker
can use the certificate only so long as they have access to my
machine.  If I kick them out or if they merely get the ability to read
the sealed data but not to execute code, the private key should still
be safe.  But, if this system is a VM guest, the attacker could run
the exact same enclave on another guest on the same physical CPU and
sign using my key.  Whoops!

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  6:11         ` Andy Lutomirski
@ 2017-05-12 18:48           ` Christopherson, Sean J
  2017-05-12 20:50             ` Christopherson, Sean J
                               ` (2 more replies)
  2017-05-16  0:48           ` Huang, Kai
  2017-07-19 15:04           ` Sean Christopherson
  2 siblings, 3 replies; 78+ messages in thread
From: Christopherson, Sean J @ 2017-05-12 18:48 UTC (permalink / raw)
  To: 'Andy Lutomirski', Huang, Kai
  Cc: kvm list, Radim Krcmar, Cohen, Haim, intel-sgx-kernel-dev, Paolo Bonzini

Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> > I am not sure whether the cost of writing to 4 MSRs would be *extremely*
> > slow, as when vcpu is scheduled in, KVM is already doing vmcs_load, writing
> > to several MSRs, etc.
> 
> I'm speculating that these MSRs may be rather unoptimized and hence
> unusually slow.
> 

Good speculation :)  We've been told to expect that writing the hash MSRs
will be at least 2.5x slower than normal MSRs.

> >
> >>
> >> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> >> with whatever lock is needed (probably just a mutex).  Users of EINIT
> >> will take the mutex, compare the percpu variable to the desired value,
> >> and, if it's different, do WRMSR and update the percpu variable.
> >>
> >> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
> >> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
> >> support the same handling as the host.  There is no action required at
> >> all on KVM guest entry and exit.
> >
> >
> > This is doable, but SGX driver needs to do those things and expose
> > interfaces for KVM to use. In terms of the percpu data, it is nice to have,
> > but I am not sure whether it is mandatory, as IMO EINIT is not even in
> > performance critical path. We can simply read old value from MSRs out and
> > compare whether the old equals to the new.
> 
> I think the SGX driver should probably live in arch/x86, and the
> interface could be a simple percpu variable that is exported (from the
> main kernel image, not from a module).
> 

Agreed, this would make life easier for future SGX code that can't be
self-contained in the driver, e.g. EPC cgroup.  Future architectural
enhancements might also require tighter integration with the kernel.


> >>
> >> I would advocate for the former approach.  (But you can't remap the
> >> parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
> >> don't see why this is any more complicated than emulating any other
> >> instruction that accesses memory.)
> >
> >
> > No you cannot just copy. Because all address in guest's ENCLS parameters are
> > guest's virtual address, we cannot use them to execute ENCLS in KVM. If any
> > guest virtual addresses is used in ENCLS parameters, for example,
> > PAGEINFO.SECS, PAGEINFO.SECINFO/PCMD, etc, you have to remap them to KVM's
> > virtual address.
> >
> > Btw, what is TOCTOU issue? would you also elaborate locking issue?
> 
> I was partially mis-remembering how this worked.  It looks like
> SIGSTRUCT and EINITTOKEN could be copied but SECS would have to be
> mapped.  If KVM applied some policy to the launchable enclaves, it
> would want to make sure that it only looks at fields that are copied
> to make sure that the enclave that gets launched is the one it
> verified.  The locking issue I'm imagining is that the SECS (or
> whatever else might be mapped) doesn't disappear and get reused for
> something else while it's mapped in the host.  Presumably KVM has an
> existing mechanism for this, but maybe SECS is special because it's
> not quite normal memory IIRC.
> 

Mapping the SECS in the host should not be an issue, AFAIK there aren't
any restrictions on the VA passed to EINIT as long as it resolves to a
SECS page in the EPCM, e.g. the SGX driver maps the SECS for EINIT with
an arbitrary VA.

I don't think emulating EINIT introduces any TOCTOU race conditions that
wouldn't already exist.  Evicting the SECS or modifying the page tables
on a different thread while executing EINIT is either a guest kernel bug
or bizarre behavior that the guest can already handle.  Similarly, KVM
would need special handling for evicting a guest's SECS, regardless of
EINIT emulation.

> >> [1] Guests that steal sealed data from each other or from the host can
> >> manipulate that data without compromising the hypervisor by simply
> >> loading the same enclave that its rightful owner would use.  If you're
> >> trying to use SGX to protect your crypto credentials so that, if
> >> stolen, they can't be used outside the guest, I would consider this to
> >> be a major flaw.  It breaks the security model in a multi-tenant cloud
> >> situation.  I've complained about it before.
> >>
> >
> > Looks potentially only guest's IA32_SGXLEPUBKEYHASHn may be leaked? In this
> > case even it is leaked looks we cannot dig anything out just the hash value?
> 
> Not sure what you mean.  Are you asking about the lack of guest
> personalization?
> 
> Concretely, imagine I write an enclave that seals my TLS client
> certificate's private key and offers an API to sign TLS certificate
> requests with it.  This way, if my system is compromised, an attacker
> can use the certificate only so long as they have access to my
> machine.  If I kick them out or if they merely get the ability to read
> the sealed data but not to execute code, the private key should still
> be safe.  But, if this system is a VM guest, the attacker could run
> the exact same enclave on another guest on the same physical CPU and
> sign using my key.  Whoops!

I know this issue has been raised internally as well, but I don't know
the status of the situation.  I'll follow up and provide any information
I can.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12 18:48           ` Christopherson, Sean J
@ 2017-05-12 20:50             ` Christopherson, Sean J
  2017-05-16  0:59             ` Huang, Kai
  2017-05-16  1:22             ` Huang, Kai
  2 siblings, 0 replies; 78+ messages in thread
From: Christopherson, Sean J @ 2017-05-12 20:50 UTC (permalink / raw)
  To: Christopherson, Sean J, 'Andy Lutomirski', Huang, Kai
  Cc: Paolo Bonzini, Cohen, Haim, intel-sgx-kernel-dev, kvm list, Radim Krcmar

Christopherson, Sean J <sean.j.christopherson@intel.com> wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
> > On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
> > >> [1] Guests that steal sealed data from each other or from the host can
> > >> manipulate that data without compromising the hypervisor by simply
> > >> loading the same enclave that its rightful owner would use.  If you're
> > >> trying to use SGX to protect your crypto credentials so that, if
> > >> stolen, they can't be used outside the guest, I would consider this to
> > >> be a major flaw.  It breaks the security model in a multi-tenant cloud
> > >> situation.  I've complained about it before.
> > >>
> > >
> > > Looks potentially only guest's IA32_SGXLEPUBKEYHASHn may be leaked? In
> > > this case even it is leaked looks we cannot dig anything out just the
> > > hash value?
> > 
> > Not sure what you mean.  Are you asking about the lack of guest
> > personalization?
> > 
> > Concretely, imagine I write an enclave that seals my TLS client
> > certificate's private key and offers an API to sign TLS certificate
> > requests with it.  This way, if my system is compromised, an attacker
> > can use the certificate only so long as they have access to my
> > machine.  If I kick them out or if they merely get the ability to read
> > the sealed data but not to execute code, the private key should still
> > be safe.  But, if this system is a VM guest, the attacker could run
> > the exact same enclave on another guest on the same physical CPU and
> > sign using my key.  Whoops!
> 
> I know this issue has been raised internally as well, but I don't know
> the status of the situation.  I'll follow up and provide any information
> I can.

So, the key players are well aware of the value added by per-VM keys,
but, ultimately, shipping this feature is dependent on having strong
requests from customers.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  3:28     ` [intel-sgx-kernel-dev] " Andy Lutomirski
  2017-05-12  4:56       ` Huang, Kai
@ 2017-05-15 12:46       ` Jarkko Sakkinen
  2017-05-15 23:56         ` Huang, Kai
  1 sibling, 1 reply; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-05-15 12:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Huang, Kai, kvm list, Radim Krcmar, haim.cohen,
	intel-sgx-kernel-dev, Paolo Bonzini

On Thu, May 11, 2017 at 08:28:37PM -0700, Andy Lutomirski wrote:
> [resending due to some kind of kernel.org glitch -- sorry if anyone
> gets duplicates]
> 
> On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> > My current patch is based on this assumption. For KVM guest, naturally, we
> > will write the cached value to real MSRs when vcpu is scheduled in. For
> > host, SGX driver should write its own value to MSRs when it performs EINIT
> > for LE.
> 
> This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
> would propose a totally different solution:
> 
> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> with whatever lock is needed (probably just a mutex).  Users of EINIT
> will take the mutex, compare the percpu variable to the desired value,
> and, if it's different, do WRMSR and update the percpu variable.

This is exactly what I've been suggesting internally: trap EINIT and
check the value and write conditionally.

I think this would be the best starting point.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-15 12:46       ` Jarkko Sakkinen
@ 2017-05-15 23:56         ` Huang, Kai
  2017-05-16 14:23           ` Paolo Bonzini
                             ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-15 23:56 UTC (permalink / raw)
  To: Jarkko Sakkinen, Andy Lutomirski
  Cc: kvm list, Radim Krcmar, haim.cohen, intel-sgx-kernel-dev, Paolo Bonzini



On 5/16/2017 12:46 AM, Jarkko Sakkinen wrote:
> On Thu, May 11, 2017 at 08:28:37PM -0700, Andy Lutomirski wrote:
>> [resending due to some kind of kernel.org glitch -- sorry if anyone
>> gets duplicates]
>>
>> On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>> My current patch is based on this assumption. For KVM guest, naturally, we
>>> will write the cached value to real MSRs when vcpu is scheduled in. For
>>> host, SGX driver should write its own value to MSRs when it performs EINIT
>>> for LE.
>>
>> This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
>> would propose a totally different solution:
>>
>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>> will take the mutex, compare the percpu variable to the desired value,
>> and, if it's different, do WRMSR and update the percpu variable.
>
> This is exactly what I've been suggesting internally: trap EINIT and
> check the value and write conditionally.
>
> I think this would be the best starting point.

OK. Assuming we are going to have this percpu variable for 
IA32_SGXLEPUBKEYHASHn, I suppose KVM will also update the guest's value 
into this percpu variable after KVM writes the guest's value to the 
hardware MSRs? And the host (SGX driver) needs to do the same thing 
(check the value and write conditionally), correct?

Thanks,
-Kai

>
> /Jarkko
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  6:11         ` Andy Lutomirski
  2017-05-12 18:48           ` Christopherson, Sean J
@ 2017-05-16  0:48           ` Huang, Kai
  2017-05-16 14:21             ` Paolo Bonzini
  2017-05-17  0:09             ` Andy Lutomirski
  2017-07-19 15:04           ` Sean Christopherson
  2 siblings, 2 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-16  0:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, haim.cohen



On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>> I am not sure whether the cost of writing to 4 MSRs would be *extremely*
>> slow, as when vcpu is scheduled in, KVM is already doing vmcs_load, writing
>> to several MSRs, etc.
>
> I'm speculating that these MSRs may be rather unoptimized and hence
> unusually slow.
>
>>
>>>
>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>> will take the mutex, compare the percpu variable to the desired value,
>>> and, if it's different, do WRMSR and update the percpu variable.
>>>
>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
>>> support the same handling as the host.  There is no action required at
>>> all on KVM guest entry and exit.
>>
>>
>> This is doable, but SGX driver needs to do those things and expose
>> interfaces for KVM to use. In terms of the percpu data, it is nice to have,
>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>> performance critical path. We can simply read old value from MSRs out and
>> compare whether the old equals to the new.
>
> I think the SGX driver should probably live in arch/x86, and the
> interface could be a simple percpu variable that is exported (from the
> main kernel image, not from a module).
>
>>
>>>
>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>> other reasons: someone is going to want to implement policy for what
>>> enclaves are allowed that applies to guests as well as the host.
>>
>>
>> I am not very convinced why "what enclaves are allowed" in host would apply
>> to guest. Can you elaborate? I mean in general virtualization just focus
>> emulating hardware behavior. If a native machine is able to run any LE, the
>> virtual machine should be able to as well (of course, with guest's
>> IA32_FEATURE_CONTROL[bit 17] set).
>
> I strongly disagree.  I can imagine two classes of sensible policies
> for launch control:
>
> 1. Allow everything.  This seems quite sensible to me.
>
> 2. Allow some things, and make sure that VMs have at least as
> restrictive a policy as host root has.  After all, what's the point of
> restricting enclaves in the host if host code can simply spawn a
> little VM to run otherwise-disallowed enclaves?

What's the current SGX driver launch control policy? Yes, 'allow 
everything' works for KVM, so let's skip this. Are we going to support 
allowing several LEs, or just one single LE? I know Jarkko is doing the 
in-kernel LE stuff, but I don't know the details.

I am trying to find a way that both doesn't break the host's launch 
control policy and is consistent with HW behavior (from the guest's 
view). Currently we can create a KVM guest with runtime changes to 
IA32_SGXLEPUBKEYHASHn either enabled or disabled. I introduced a Qemu 
parameter 'lewr' for this purpose. Actually, I introduced the below 
Qemu SGX parameters for creating a guest:

	-sgx epc=<size>,lehash='SHA-256 hash',lewr

where 'epc' specifies the guest's EPC size, 'lehash' specifies the 
(initial) value of the guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' 
specifies whether the guest is allowed to change its 
IA32_SGXLEPUBKEYHASHn at runtime.

If the host only allows one single LE to run, KVM can add a 
restriction that guests can only be created with runtime changes to 
IA32_SGXLEPUBKEYHASHn disabled, so that only the host-allowed (single) 
hash can be used by a guest. From the guest's view, it simply has 
IA32_FEATURE_CONTROL[bit17] cleared and has IA32_SGXLEPUBKEYHASHn with 
the default value being the host-allowed (single) hash.

If the host allows several LEs (but not everything), and we create a 
guest with 'lewr', then the behavior is not consistent with HW 
behavior: from the guest's point of view its hardware can actually run 
any LE, but we have to tell the guest that it is only allowed to change 
IA32_SGXLEPUBKEYHASHn to some specific values. One compromise solution 
is to not allow creating a guest with 'lewr' specified and, at the same 
time, to only allow creating guests with host-approved hashes specified 
in 'lehash'. This makes the guest's behavior consistent with HW 
behavior, but only allows the guest to run one LE (the one specified by 
'lehash' when the guest is created).

I'd like to hear comments from you guys.

Paolo, do you also have comments here from KVM's side?

Thanks,
-Kai

>
>>
>>> Also, some day Intel may fix its architectural design flaw [1] by
>>> allowing EINIT to personalize the enclave's keying, and, if it's done
>>> by a new argument to EINIT instead of an MSR, KVM will have to trap
>>> EINIT to handle it.
>>
>>
>> Looks this flaw is not the same issue as above (host enclave policy applies
>> to guest)?
>
> It's related.  Without this flaw, it might make sense to apply looser
> policy in the guest as in the host.  With this flaw, I think your
> policy fails to have any real effect if you don't enforce it on
> guests.
>
>>
>>>
>>>>
>>>> One argument against this approach is KVM guest should never have impact
>>>> on
>>>> host side, meaning host should not be aware of such MSR change
>>>
>>>
>>> As a somewhat generic comment, I don't like this approach to KVM
>>> development.  KVM mucks with lots of important architectural control
>>> registers, and, in all too many cases, it tries to do so independently
>>> of the other arch/x86 code.  This ends up causing all kinds of grief.
>>>
>>> Can't KVM and the real x86 arch code cooperate for real?  The host and
>>> the KVM code are in arch/x86 in the same source tree.
>>
>>
>> Currently on host SGX driver, which is pretty much self-contained,
>> implements all SGX related stuff.
>
> I will probably NAK this if it comes my way for inclusion upstream.
> Just because it can be self-contained doesn't mean it should be
> self-contained.
>
>>>
>>> I would advocate for the former approach.  (But you can't remap the
>>> parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
>>> don't see why this is any more complicated than emulating any other
>>> instruction that accesses memory.)
>>
>>
>> No you cannot just copy. Because all address in guest's ENCLS parameters are
>> guest's virtual address, we cannot use them to execute ENCLS in KVM. If any
>> guest virtual addresses is used in ENCLS parameters, for example,
>> PAGEINFO.SECS, PAGEINFO.SECINFO/PCMD, etc, you have to remap them to KVM's
>> virtual address.
>>
>> Btw, what is TOCTOU issue? would you also elaborate locking issue?
>
> I was partially mis-remembering how this worked.  It looks like
> SIGSTRUCT and EINITTOKEN could be copied but SECS would have to be
> mapped.  If KVM applied some policy to the launchable enclaves, it
> would want to make sure that it only looks at fields that are copied
> to make sure that the enclave that gets launched is the one it
> verified.  The locking issue I'm imagining is that the SECS (or
> whatever else might be mapped) doesn't disappear and get reused for
> something else while it's mapped in the host.  Presumably KVM has an
> existing mechanism for this, but maybe SECS is special because it's
> not quite normal memory IIRC.
>
>>
>>>
>>> If necessary for some reason, trap EINIT when the SGXLEPUBKEYHASH is
>>> wrong and then clear the exit flag once the MSRs are in sync.  You'll
>>> need to be careful to avoid races in which the host's value leaks into
>>> the guest.  I think you'll find that this is more complicated, less
>>> flexible, and less performant than just handling ENCLS[EINIT] directly
>>> in the host.
>>
>>
>> Sorry I don't quite follow this part. Why would host's value leaks into
>> guest? I suppose the *value* means host's IA32_SGXLEPUBKEYHASHn? guest's MSR
>> read/write is always trapped and emulated by KVM.
>
> You'd need to make sure that this sequence of events doesn't happen:
>
>  - Guest does EINIT and it exits.
>  - Host updates the MSRs and the ENCLS-exiting bitmap.
>  - Guest is preempted before it retries EINIT.
>  - A different host thread launches an enclave, thus changing the MSRs.
>  - Guest resumes and runs EINIT without exiting with the wrong MSR values.
>
>>
>>>
>>> [1] Guests that steal sealed data from each other or from the host can
>>> manipulate that data without compromising the hypervisor by simply
>>> loading the same enclave that its rightful owner would use.  If you're
>>> trying to use SGX to protect your crypto credentials so that, if
>>> stolen, they can't be used outside the guest, I would consider this to
>>> be a major flaw.  It breaks the security model in a multi-tenant cloud
>>> situation.  I've complained about it before.
>>>
>>
>> Looks potentially only guest's IA32_SGXLEPUBKEYHASHn may be leaked? In this
>> case even it is leaked looks we cannot dig anything out just the hash value?
>
> Not sure what you mean.  Are you asking about the lack of guest personalization?
>
> Concretely, imagine I write an enclave that seals my TLS client
> certificate's private key and offers an API to sign TLS certificate
> requests with it.  This way, if my system is compromised, an attacker
> can use the certificate only so long as they have access to my
> machine.  If I kick them out or if they merely get the ability to read
> the sealed data but not to execute code, the private key should still
> be safe.  But, if this system is a VM guest, the attacker could run
> the exact same enclave on another guest on the same physical CPU and
> sign using my key.  Whoops!
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12 18:48           ` Christopherson, Sean J
  2017-05-12 20:50             ` Christopherson, Sean J
@ 2017-05-16  0:59             ` Huang, Kai
  2017-05-16  1:22             ` Huang, Kai
  2 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-16  0:59 UTC (permalink / raw)
  To: Christopherson, Sean J, 'Andy Lutomirski'
  Cc: kvm list, Radim Krcmar, Cohen, Haim, intel-sgx-kernel-dev, Paolo Bonzini



On 5/13/2017 6:48 AM, Christopherson, Sean J wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>> I am not sure whether the cost of writing to 4 MSRs would be *extremely*
>>> slow, as when vcpu is scheduled in, KVM is already doing vmcs_load, writing
>>> to several MSRs, etc.
>>
>> I'm speculating that these MSRs may be rather unoptimized and hence
>> unusually slow.
>>
>
> Good speculation :)  We've been told to expect that writing the hash MSRs
> will be at least 2.5x slower than normal MSRs.
>
>>>
>>>>
>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>> will take the mutex, compare the percpu variable to the desired value,
>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>
>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
>>>> support the same handling as the host.  There is no action required at
>>>> all on KVM guest entry and exit.
>>>
>>>
>>> This is doable, but SGX driver needs to do those things and expose
>>> interfaces for KVM to use. In terms of the percpu data, it is nice to have,
>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>> performance critical path. We can simply read old value from MSRs out and
>>> compare whether the old equals to the new.
>>
>> I think the SGX driver should probably live in arch/x86, and the
>> interface could be a simple percpu variable that is exported (from the
>> main kernel image, not from a module).
>>
>
> Agreed, this would make life easier for future SGX code that can't be
> self-contained in the driver, e.g. EPC cgroup.  Future architectural
> enhancements might also require tighter integration with the kernel.


>
>
>>>>
>>>> I would advocate for the former approach.  (But you can't remap the
>>>> parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
>>>> don't see why this is any more complicated than emulating any other
>>>> instruction that accesses memory.)
>>>
>>>
>>> No you cannot just copy. Because all address in guest's ENCLS parameters are
>>> guest's virtual address, we cannot use them to execute ENCLS in KVM. If any
>>> guest virtual addresses is used in ENCLS parameters, for example,
>>> PAGEINFO.SECS, PAGEINFO.SECINFO/PCMD, etc, you have to remap them to KVM's
>>> virtual address.
>>>
>>> Btw, what is TOCTOU issue? would you also elaborate locking issue?
>>
>> I was partially mis-remembering how this worked.  It looks like
>> SIGSTRUCT and EINITTOKEN could be copied but SECS would have to be
>> mapped.  If KVM applied some policy to the launchable enclaves, it
>> would want to make sure that it only looks at fields that are copied
>> to make sure that the enclave that gets launched is the one it
>> verified.  The locking issue I'm imagining is that the SECS (or
>> whatever else might be mapped) doesn't disappear and get reused for
>> something else while it's mapped in the host.  Presumably KVM has an
>> existing mechanism for this, but maybe SECS is special because it's
>> not quite normal memory IIRC.

I am thinking we might not need to check the values in SIGSTRUCT or 
EINITTOKEN, as KVM needs to emulate the guest's IA32_SGXLEPUBKEYHASHn 
writes anyway. If we decide we should apply the host's policy to KVM 
guests, it seems we can do the check when trapping the guest's writes to 
IA32_SGXLEPUBKEYHASHn, so that only host-approved values (maybe from a 
white-list or something?) are allowed to be written by the guest.
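For illustration, the check might be sketched roughly like below in 
KVM's WRMSR emulation path. sgx_le_hash_allowed() is a purely 
hypothetical host-policy hook and msr_ia32_sgxlepubkeyhash[] an assumed 
shadow field (and note that rejecting the write means injecting #GP, 
which has its own problems, see further down-thread):

/*
 * Sketch only: validate guest writes to IA32_SGXLEPUBKEYHASH{0-3}
 * against a host white-list before updating the vcpu's shadow copy.
 * sgx_le_hash_allowed() and msr_ia32_sgxlepubkeyhash[] are assumed.
 */
static int vmx_set_sgx_lepubkeyhash(struct kvm_vcpu *vcpu,
                                    struct msr_data *msr_info)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);
        u32 i = msr_info->index - MSR_IA32_SGXLEPUBKEYHASH0;

        /* Host policy check: hypothetical white-list hook. */
        if (!sgx_le_hash_allowed(vcpu->kvm, i, msr_info->data))
                return 1;       /* non-zero return injects #GP */

        vmx->msr_ia32_sgxlepubkeyhash[i] = msr_info->data;
        return 0;
}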

Thanks,
-Kai

>>
>
> Mapping the SECS in the host should not be an issue, AFAIK there aren't
> any restrictions on the VA passed to EINIT as long as it resolves to a
> SECS page in the EPCM, e.g. the SGX driver maps the SECS for EINIT with
> an arbitrary VA.
>
> I don't think emulating EINIT introduces any TOCTOU race conditions that
> wouldn't already exist.  Evicting the SECS or modifying the page tables
> on a different thread while executing EINIT is either a guest kernel bug
> or bizarre behavior that the guest can already handle.  Similarly, KVM
> would need special handling for evicting a guest's SECS, regardless of
> EINIT emulation.

Agreed.

>
>>>> [1] Guests that steal sealed data from each other or from the host can
>>>> manipulate that data without compromising the hypervisor by simply
>>>> loading the same enclave that its rightful owner would use.  If you're
>>>> trying to use SGX to protect your crypto credentials so that, if
>>>> stolen, they can't be used outside the guest, I would consider this to
>>>> be a major flaw.  It breaks the security model in a multi-tenant cloud
>>>> situation.  I've complained about it before.
>>>>
>>>
>>> Looks like potentially only the guest's IA32_SGXLEPUBKEYHASHn may be leaked?
>>> In that case, even if it is leaked, it looks like we cannot dig anything out
>>> beyond the hash value?
>>
>> Not sure what you mean.  Are you asking about the lack of guest
>> personalization?
>>
>> Concretely, imagine I write an enclave that seals my TLS client
>> certificate's private key and offers an API to sign TLS certificate
>> requests with it.  This way, if my system is compromised, an attacker
>> can use the certificate only so long as they have access to my
>> machine.  If I kick them out or if they merely get the ability to read
>> the sealed data but not to execute code, the private key should still
>> be safe.  But, if this system is a VM guest, the attacker could run
>> the exact same enclave on another guest on the same physical CPU and
>> sign using my key.  Whoops!
>
> I know this issue has been raised internally as well, but I don't know
> the status of the situation.  I'll follow up and provide any information
> I can.
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12 18:48           ` Christopherson, Sean J
  2017-05-12 20:50             ` Christopherson, Sean J
  2017-05-16  0:59             ` Huang, Kai
@ 2017-05-16  1:22             ` Huang, Kai
  2 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-16  1:22 UTC (permalink / raw)
  To: Christopherson, Sean J, 'Andy Lutomirski'
  Cc: kvm list, Radim Krcmar, Cohen, Haim, intel-sgx-kernel-dev, Paolo Bonzini



On 5/13/2017 6:48 AM, Christopherson, Sean J wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>> I am not sure whether the cost of writing to 4 MSRs would be *extremely*
>>> slow, as when vcpu is scheduled in, KVM is already doing vmcs_load, writing
>>> to several MSRs, etc.
>>
>> I'm speculating that these MSRs may be rather unoptimized and hence
>> unusually slow.
>>
>
> Good speculation :)  We've been told to expect that writing the hash MSRs
> will be at least 2.5x slower than normal MSRs.
>
>>>
>>>>
>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>> will take the mutex, compare the percpu variable to the desired value,
>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>
>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
>>>> support the same handling as the host.  There is no action required at
>>>> all on KVM guest entry and exit.
>>>
>>>
>>> This is doable, but SGX driver needs to do those things and expose
>>> interfaces for KVM to use. In terms of the percpu data, it is nice to have,
>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>> performance critical path. We can simply read old value from MSRs out and
>>> compare whether the old equals to the new.
>>
>> I think the SGX driver should probably live in arch/x86, and the
>> interface could be a simple percpu variable that is exported (from the
>> main kernel image, not from a module).
>>
>
> Agreed, this would make life easier for future SGX code that can't be
> self-contained in the driver, e.g. EPC cgroup.  Future architectural
> enhancements might also require tighter integration with the kernel.

I think this is better as well. In this way we can leverage the SGX code 
more easily. Some SGX detection code can be done in arch/x86/ as well, 
so that other code can access, e.g., the SGX capabilities quickly.

Another thing is that the SDM actually says SGX CPUID is enumerated 
per-thread, and we should not assume SGX CPUID will report the same info 
on all processors. I think it's better to check this as well. Moving SGX 
detection to identify_cpu makes this easier.
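For illustration, such a consistency check during CPU identification 
could be sketched as below (the sgx_capabilities field on cpuinfo_x86 is 
an assumed addition, not existing code):

/*
 * Sketch only: SGX CPUID (leaf 0x12) is enumerated per-thread, so
 * verify each logical CPU reports the same capabilities as the boot
 * CPU, and disable SGX otherwise.
 */
static void detect_sgx(struct cpuinfo_x86 *c)
{
        unsigned int eax, ebx, ecx, edx;

        if (!cpu_has(c, X86_FEATURE_SGX))
                return;

        cpuid_count(0x12, 0, &eax, &ebx, &ecx, &edx);

        if (c == &boot_cpu_data) {
                c->sgx_capabilities = eax;      /* SGX1/SGX2 bits */
        } else if (eax != boot_cpu_data.sgx_capabilities) {
                pr_warn_once("SGX: inconsistent CPUID across CPUs, disabling\n");
                clear_cpu_cap(c, X86_FEATURE_SGX);
        }
}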

Thanks,
-Kai
>
>
>>>>
>>>> I would advocate for the former approach.  (But you can't remap the
>>>> parameters due to TOCTOU issues, locking, etc.  Just copy them.  I
>>>> don't see why this is any more complicated than emulating any other
>>>> instruction that accesses memory.)
>>>
>>>
>>> No, you cannot just copy. Because all addresses in guest's ENCLS parameters are
>>> guest virtual addresses, we cannot use them to execute ENCLS in KVM. If any
>>> guest virtual address is used in ENCLS parameters, for example,
>>> PAGEINFO.SECS, PAGEINFO.SECINFO/PCMD, etc, you have to remap them to KVM's
>>> virtual addresses.
>>>
>>> Btw, what is the TOCTOU issue? Would you also elaborate on the locking issue?
>>
>> I was partially mis-remembering how this worked.  It looks like
>> SIGSTRUCT and EINITTOKEN could be copied but SECS would have to be
>> mapped.  If KVM applied some policy to the launchable enclaves, it
>> would want to make sure that it only looks at fields that are copied
>> to make sure that the enclave that gets launched is the one it
>> verified.  The locking issue I'm imagining is that the SECS (or
>> whatever else might be mapped) doesn't disappear and get reused for
>> something else while it's mapped in the host.  Presumably KVM has an
>> existing mechanism for this, but maybe SECS is special because it's
>> not quite normal memory IIRC.
>>
>
> Mapping the SECS in the host should not be an issue, AFAIK there aren't
> any restrictions on the VA passed to EINIT as long as it resolves to a
> SECS page in the EPCM, e.g. the SGX driver maps the SECS for EINIT with
> an arbitrary VA.
>
> I don't think emulating EINIT introduces any TOCTOU race conditions that
> wouldn't already exist.  Evicting the SECS or modifying the page tables
> on a different thread while executing EINIT is either a guest kernel bug
> or bizarre behavior that the guest can already handle.  Similarly, KVM
> would need special handling for evicting a guest's SECS, regardless of
> EINIT emulation.
>
>>>> [1] Guests that steal sealed data from each other or from the host can
>>>> manipulate that data without compromising the hypervisor by simply
>>>> loading the same enclave that its rightful owner would use.  If you're
>>>> trying to use SGX to protect your crypto credentials so that, if
>>>> stolen, they can't be used outside the guest, I would consider this to
>>>> be a major flaw.  It breaks the security model in a multi-tenant cloud
>>>> situation.  I've complained about it before.
>>>>
>>>
>>> Looks like potentially only the guest's IA32_SGXLEPUBKEYHASHn may be leaked?
>>> In that case, even if it is leaked, it looks like we cannot dig anything out
>>> beyond the hash value?
>>
>> Not sure what you mean.  Are you asking about the lack of guest
>> personalization?
>>
>> Concretely, imagine I write an enclave that seals my TLS client
>> certificate's private key and offers an API to sign TLS certificate
>> requests with it.  This way, if my system is compromised, an attacker
>> can use the certificate only so long as they have access to my
>> machine.  If I kick them out or if they merely get the ability to read
>> the sealed data but not to execute code, the private key should still
>> be safe.  But, if this system is a VM guest, the attacker could run
>> the exact same enclave on another guest on the same physical CPU and
>> sign using my key.  Whoops!
>
> I know this issue has been raised internally as well, but I don't know
> the status of the situation.  I'll follow up and provide any information
> I can.
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-16  0:48           ` Huang, Kai
@ 2017-05-16 14:21             ` Paolo Bonzini
  2017-05-18  7:54               ` Huang, Kai
  2017-05-17  0:09             ` Andy Lutomirski
  1 sibling, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2017-05-16 14:21 UTC (permalink / raw)
  To: Huang, Kai, Andy Lutomirski
  Cc: Kai Huang, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev@lists.01.org, haim.cohen



On 16/05/2017 02:48, Huang, Kai wrote:
> 
> 
> If host only allows one single LE to run, KVM can add a restriction that
> only allows to create KVM guest with runtime change to
> IA32_SGXLEPUBKEYHASHn disabled, so that only host allowed (single) hash
> can be used by guest. From guest's view, it simply has
> IA32_FEATURE_CONTROL[bit17] cleared and has IA32_SGXLEPUBKEYHASHn with
> default value to be host allowed (single) hash.
> 
> If host allows several LEs (but not everything), and if we create guest
> with 'lewr', then the behavior is not consistent with HW behavior, as
> from guest's hardware's point of view, we can actually run any LE but we
> have to tell guest that you are only allowed to change
> IA32_SGXLEPUBKEYHASHn to some specific values. One compromise solution
> is we don't allow to create guest with 'lewr' specified, and at the
> meantime, only allow to create guest with host approved hashes specified
> in 'lehash'. This will make guest's behavior consistent to HW behavior
> but only allows guest to run one LE (which is specified by 'lehash' when
> guest is created).
> 
> I'd like to hear comments from you guys.
> 
> Paolo, do you also have comments here from KVM's side?

I would start with read-only LE hash (same as the host), which is a
valid configuration anyway.  Then later we can trap EINIT to emulate
IA32_SGXLEPUBKEYHASHn.

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-15 23:56         ` Huang, Kai
@ 2017-05-16 14:23           ` Paolo Bonzini
  2017-05-17 14:21           ` Sean Christopherson
  2017-05-20 13:23           ` Jarkko Sakkinen
  2 siblings, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2017-05-16 14:23 UTC (permalink / raw)
  To: Huang, Kai, Jarkko Sakkinen, Andy Lutomirski
  Cc: kvm list, Radim Krcmar, haim.cohen, intel-sgx-kernel-dev@lists.01.org



On 16/05/2017 01:56, Huang, Kai wrote:
>>>
>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>> will take the mutex, compare the percpu variable to the desired value,
>>> and, if it's different, do WRMSR and update the percpu variable.
>>
>> This is exactly what I've been suggesting internally: trap EINIT and
>> check the value and write conditionally.
>>
>> I think this would be the best starting point.
> 
> OK. Assuming we are going to have this percpu variable for
> IA32_SGXLEPUBKEYHASHn, I suppose KVM also will update guest's value to
> this percpu variable after KVM writes guest's value to hardware MSR? And
> host (SGX driver) needs to do the same thing (check the value and write
> conditionally), correct?

The percpu variable is just an optimization.  If EINIT is not
performance critical, you could even do the WRMSR unconditionally; what
matters is having a mutex that covers both WRMSR and EINIT.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-16  0:48           ` Huang, Kai
  2017-05-16 14:21             ` Paolo Bonzini
@ 2017-05-17  0:09             ` Andy Lutomirski
  2017-05-18  7:45               ` Huang, Kai
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-05-17  0:09 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Mon, May 15, 2017 at 5:48 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>
>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
>> wrote:
>>>
>>> I am not sure whether the cost of writing to 4 MSRs would be *extremely*
>>> slow, as when vcpu is scheduled in, KVM is already doing vmcs_load,
>>> writing
>>> to several MSRs, etc.
>>
>>
>> I'm speculating that these MSRs may be rather unoptimized and hence
>> unusually slow.
>>
>>>
>>>>
>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>> will take the mutex, compare the percpu variable to the desired value,
>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>
>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
>>>> support the same handling as the host.  There is no action required at
>>>> all on KVM guest entry and exit.
>>>
>>>
>>>
>>> This is doable, but SGX driver needs to do those things and expose
>>> interfaces for KVM to use. In terms of the percpu data, it is nice to
>>> have,
>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>> performance critical path. We can simply read old value from MSRs out and
>>> compare whether the old equals to the new.
>>
>>
>> I think the SGX driver should probably live in arch/x86, and the
>> interface could be a simple percpu variable that is exported (from the
>> main kernel image, not from a module).
>>
>>>
>>>>
>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>> other reasons: someone is going to want to implement policy for what
>>>> enclaves are allowed that applies to guests as well as the host.
>>>
>>>
>>>
>>> I am not very convinced why "what enclaves are allowed" in host would
>>> apply
>>> to guest. Can you elaborate? I mean in general virtualization just focuses on
>>> emulating hardware behavior. If a native machine is able to run any LE,
>>> the
>>> virtual machine should be able to as well (of course, with guest's
>>> IA32_FEATURE_CONTROL[bit 17] set).
>>
>>
>> I strongly disagree.  I can imagine two classes of sensible policies
>> for launch control:
>>
>> 1. Allow everything.  This seems quite sensible to me.
>>
>> 2. Allow some things, and make sure that VMs have at least as
>> restrictive a policy as host root has.  After all, what's the point of
>> restricting enclaves in the host if host code can simply spawn a
>> little VM to run otherwise-disallowed enclaves?
>
>
> What's the current SGX driver launch control policy? Yes allow everything
> works for KVM so let's skip this. Are we going to support allowing several
> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
> stuff, but I don't know the details.
>
> I am trying to find a way that we can both not break host launch control
> policy, and be consistent to HW behavior (from guest's view). Currently we
> can create a KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn either
> enabled or disabled. I introduced a Qemu parameter 'lewr' for this purpose.
> Actually I introduced below Qemu SGX parameters for creating guest:
>
>         -sgx epc=<size>,lehash='SHA-256 hash',lewr
>
> where 'epc' specifies guest's EPC size, lehash specifies (initial) value of
> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether guest is allowed
> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>
> If host only allows one single LE to run, KVM can add a restriction that only
> allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
> disabled, so that only host allowed (single) hash can be used by guest. From
> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
> IA32_SGXLEPUBKEYHASHn with default value to be host allowed (single) hash.
>
> If host allows several LEs (but not everything), and if we create guest with
> 'lewr', then the behavior is not consistent with HW behavior, as from
> guest's hardware's point of view, we can actually run any LE but we have to
> tell guest that you are only allowed to change IA32_SGXLEPUBKEYHASHn to some
> specific values. One compromise solution is we don't allow to create guest
> with 'lewr' specified, and at the same time, only allow to create guest with
> host approved hashes specified in 'lehash'. This will make guest's behavior
> consistent to HW behavior but only allows guest to run one LE (which is
> specified by 'lehash' when guest is created).

I'm not sure I entirely agree for a couple reasons.

1. I wouldn't be surprised if the kernel ends up implementing a policy
in which it checks all enclaves (not just LEs) for acceptability.  In
fact, if the kernel sticks with the "no LE at all or just
kernel-internal LE", then checking enclaves directly against some
admin- or distro-provided signer list seems reasonable.  This type of
policy can't be forwarded to a guest by restricting allowed LE
signers.  But this is mostly speculation since AFAIK no one has
seriously proposed any particular policy support and the plan was to
not have this for the initial implementation.

2. While matching hardware behavior is nice in principle, there
doesn't seem to be useful hardware behavior to match here.  If the
host had a list of five allowed LE signers, how exactly would it
restrict the MSRs?  They're not written atomically, so you can't
directly tell what's being written.  Also, the only way to fail an MSR
write is to send #GP, and Windows (and even Linux) may not expect
that.  Linux doesn't panic due to #GP on MSR writes these days, but
you still get a big fat warning.  I wouldn't be at all surprised if
Windows BSODs.  ENCLS[EINIT], on the other hand, returns an actual
error code.  I'm not sure that a sensible error code exists
("SGX_HYPERVISOR_SAID_NO?", perhaps), but SGX_INVALID_EINITTOKEN seems
to mean, more or less, "the CPU thinks you're not authorized to do
this", so forcing that error code could be entirely reasonable.

If the host policy is to allow a list of LE signers, you could return
SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
the list.
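
For illustration, the trapped-EINIT policy check could be sketched as
below, where host_le_signer_allowed() and sgx_emulate_einit() are
assumed helpers rather than existing code:

/*
 * Sketch only: on a trapped ENCLS[EINIT], consult host policy and
 * force SGX_INVALID_EINITTOKEN in RAX for a disallowed LE instead of
 * injecting an exception the guest doesn't expect.
 */
static int handle_encls_einit(struct kvm_vcpu *vcpu)
{
        if (!host_le_signer_allowed(vcpu)) {
                kvm_register_write(vcpu, VCPU_REGS_RAX,
                                   SGX_INVALID_EINITTOKEN);
                return kvm_skip_emulated_instruction(vcpu);
        }

        return sgx_emulate_einit(vcpu);         /* normal emulation path */
}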

--Andy

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-15 23:56         ` Huang, Kai
  2017-05-16 14:23           ` Paolo Bonzini
@ 2017-05-17 14:21           ` Sean Christopherson
  2017-05-18  8:14             ` Huang, Kai
  2017-05-20 13:23           ` Jarkko Sakkinen
  2 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2017-05-17 14:21 UTC (permalink / raw)
  To: Huang, Kai, Jarkko Sakkinen, Andy Lutomirski
  Cc: Paolo Bonzini, haim.cohen, intel-sgx-kernel-dev, kvm list, Radim Krcmar

On Tue, 2017-05-16 at 11:56 +1200, Huang, Kai wrote:
> 
> On 5/16/2017 12:46 AM, Jarkko Sakkinen wrote:
> > 
> > On Thu, May 11, 2017 at 08:28:37PM -0700, Andy Lutomirski wrote:
> > > 
> > > [resending due to some kind of kernel.org glitch -- sorry if anyone
> > > gets duplicates]
> > > 
> > > On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com>
> > > wrote:
> > > > 
> > > > My current patch is based on this assumption. For KVM guest, naturally,
> > > > we
> > > > will write the cached value to real MSRs when vcpu is scheduled in. For
> > > > host, SGX driver should write its own value to MSRs when it performs
> > > > EINIT
> > > > for LE.
> > > This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
> > > would propose a totally different solution:
> > > 
> > > Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> > > with whatever lock is needed (probably just a mutex).  Users of EINIT
> > > will take the mutex, compare the percpu variable to the desired value,
> > > and, if it's different, do WRMSR and update the percpu variable.
> > This is exactly what I've been suggesting internally: trap EINIT and
> > check the value and write conditionally.
> > 
> > I think this would be the best starting point.
> OK. Assuming we are going to have this percpu variable for 
> IA32_SGXLEPUBKEYHASHn, I suppose KVM also will update guest's value to 
> this percpu variable after KVM writes guest's value to hardware MSR? And 
> host (SGX driver) needs to do the same thing (check the value and write
> conditionally), correct?
> 
> Thanks,
> -Kai

Yes, the percpu variable is simply a cache so that the kernel doesn't have to do
four RDMSRs every time it wants to do EINIT.  KVM would still maintain shadow
copies of the MSRs for each vcpu for emulating RDMSR, WRMSR and EINIT.  I don't
think KVM would even need to be aware of the percpu variable, i.e. the entire
lock->(rd/wr)msr->EINIT->unlock sequence can probably be encapsulated in a
single function that is called from both the primary SGX driver and from KVM.
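
For illustration, a minimal sketch of that shared helper could look like
the below, where __einit() stands in for an ENCLS[EINIT] wrapper and the
per-cpu cache is assumed to be initialized from the real MSRs at boot:

/*
 * Sketch only: per-cpu shadow of the IA32_SGXLEPUBKEYHASH MSRs plus a
 * mutex serializing WRMSR and EINIT.  Preemption is disabled so the
 * task cannot migrate between updating this CPU's MSRs and executing
 * EINIT on them.
 */
static DEFINE_PER_CPU(u64, sgx_lepubkeyhash_cache[4]);
static DEFINE_MUTEX(sgx_einit_lock);

int sgx_einit(void *sigstruct, void *einittoken, void *secs,
              const u64 hash[4])
{
        int i, ret;

        mutex_lock(&sgx_einit_lock);
        preempt_disable();

        for (i = 0; i < 4; i++) {
                if (this_cpu_read(sgx_lepubkeyhash_cache[i]) != hash[i]) {
                        wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);
                        this_cpu_write(sgx_lepubkeyhash_cache[i], hash[i]);
                }
        }

        ret = __einit(sigstruct, einittoken, secs);

        preempt_enable();
        mutex_unlock(&sgx_einit_lock);

        return ret;
}

KVM's trapped-EINIT path would then call sgx_einit() with the vcpu's
shadowed hash values, and the driver with its own.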

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-17  0:09             ` Andy Lutomirski
@ 2017-05-18  7:45               ` Huang, Kai
  2017-06-06 20:52                 ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-05-18  7:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, haim.cohen



On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>
>>
>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>
>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
>>> wrote:
>>>>
>>>> I am not sure whether the cost of writing to 4 MSRs would be *extremely*
>>>> slow, as when vcpu is scheduled in, KVM is already doing vmcs_load,
>>>> writing
>>>> to several MSRs, etc.
>>>
>>>
>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>> unusually slow.
>>>
>>>>
>>>>>
>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>>> will take the mutex, compare the percpu variable to the desired value,
>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>
>>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
>>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
>>>>> support the same handling as the host.  There is no action required at
>>>>> all on KVM guest entry and exit.
>>>>
>>>>
>>>>
>>>> This is doable, but SGX driver needs to do those things and expose
>>>> interfaces for KVM to use. In terms of the percpu data, it is nice to
>>>> have,
>>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>>> performance critical path. We can simply read old value from MSRs out and
>>>> compare whether the old equals to the new.
>>>
>>>
>>> I think the SGX driver should probably live in arch/x86, and the
>>> interface could be a simple percpu variable that is exported (from the
>>> main kernel image, not from a module).
>>>
>>>>
>>>>>
>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>>> other reasons: someone is going to want to implement policy for what
>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>
>>>>
>>>>
>>>> I am not very convinced why "what enclaves are allowed" in host would
>>>> apply
>>>> to guest. Can you elaborate? I mean in general virtualization just focuses on
>>>> emulating hardware behavior. If a native machine is able to run any LE,
>>>> the
>>>> virtual machine should be able to as well (of course, with guest's
>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>
>>>
>>> I strongly disagree.  I can imagine two classes of sensible policies
>>> for launch control:
>>>
>>> 1. Allow everything.  This seems quite sensible to me.
>>>
>>> 2. Allow some things, and make sure that VMs have at least as
>>> restrictive a policy as host root has.  After all, what's the point of
>>> restricting enclaves in the host if host code can simply spawn a
>>> little VM to run otherwise-disallowed enclaves?
>>
>>
>> What's the current SGX driver launch control policy? Yes allow everything
>> works for KVM so let's skip this. Are we going to support allowing several
>> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
>> stuff, but I don't know the details.
>>
>> I am trying to find a way that we can both not break host launch control
>> policy, and be consistent to HW behavior (from guest's view). Currently we
>> can create a KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn either
>> enabled or disabled. I introduced a Qemu parameter 'lewr' for this purpose.
>> Actually I introduced below Qemu SGX parameters for creating guest:
>>
>>         -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>
>> where 'epc' specifies guest's EPC size, lehash specifies (initial) value of
>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether guest is allowed
>> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>
>> If host only allows one single LE to run, KVM can add a restriction that only
>> allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>> disabled, so that only host allowed (single) hash can be used by guest. From
>> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed (single) hash.
>>
>> If host allows several LEs (but not everything), and if we create guest with
>> 'lewr', then the behavior is not consistent with HW behavior, as from
>> guest's hardware's point of view, we can actually run any LE but we have to
>> tell guest that you are only allowed to change IA32_SGXLEPUBKEYHASHn to some
>> specific values. One compromise solution is we don't allow to create guest
>> with 'lewr' specified, and at the same time, only allow to create guest with
>> host approved hashes specified in 'lehash'. This will make guest's behavior
>> consistent to HW behavior but only allows guest to run one LE (which is
>> specified by 'lehash' when guest is created).
>
> I'm not sure I entirely agree for a couple reasons.
>
> 1. I wouldn't be surprised if the kernel ends up implementing a policy
> in which it checks all enclaves (not just LEs) for acceptability.  In
> fact, if the kernel sticks with the "no LE at all or just
> kernel-internal LE", then checking enclaves directly against some
> admin- or distro-provided signer list seems reasonable.  This type of
> policy can't be forwarded to a guest by restricting allowed LE
> signers.  But this is mostly speculation since AFAIK no one has
> seriously proposed any particular policy support and the plan was to
> not have this for the initial implementation.
>
> 2. While matching hardware behavior is nice in principle, there
> doesn't seem to be useful hardware behavior to match here.  If the
> host had a list of five allowed LE signers, how exactly would it
> restrict the MSRs?  They're not written atomically, so you can't
> directly tell what's being written.

In this case I actually plan to only allow creating a guest with runtime 
change to IA32_SGXLEPUBKEYHASHn disabled (i.e., without 'lewr' 
specified). If 'lewr' is specified, creating the guest will fail. And we 
only allow creating a guest with host-allowed hash values (via 
'lehash=hash-value'); if the 'hash-value' specified by 'lehash' is not 
allowed by the host, we also fail to create the guest.

We only allow creating a guest with 'lewr' specified when the host 
allows everything.

But in this way we are restricting the guest OS's ability to run LEs, as 
only the one LE specified by the 'lehash' parameter can be run. I think 
this won't hurt much, though, as multiple guests are still able to run 
different LEs?

> Also, the only way to fail an MSR
> write is to send #GP, and Windows (and even Linux) may not expect
> that.  Linux doesn't panic due to #GP on MSR writes these days, but
> you still get a big fat warning.  I wouldn't be at all surprised if
> Windows BSODs.

We cannot let writes of some particular values to the MSRs succeed while 
injecting #GP for writes of other values to the same MSRs. So #GP is not 
an option.

> ENCLS[EINIT], on the other hand, returns an actual
> error code.  I'm not sure that a sensible error code exists
> ("SGX_HYPERVISOR_SAID_NO?", perhaps),

Looks like no such error code exists. And we cannot return such an error 
code to the guest, as it would only be valid when ENCLS is run in the 
hypervisor.

> but SGX_INVALID_EINITTOKEN seems
> to mean, more or less, "the CPU thinks you're not authorized to do
> this", so forcing that error code could be entirely reasonable.
>
> If the host policy is to allow a list of LE signers, you could return
> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
> the list.

But this would be inconsistent with HW behavior. If the hash value in 
the guest's IA32_SGXLEPUBKEYHASHn matches the one used by EINIT, EINIT 
is not supposed to return SGX_INVALID_EINITTOKEN.

I think from the VMM's perspective, emulating HW behavior consistently 
with real HW behavior is very important.

Paolo, would you provide your comments?

>
> --Andy
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-16 14:21             ` Paolo Bonzini
@ 2017-05-18  7:54               ` Huang, Kai
  2017-05-18  8:58                 ` Paolo Bonzini
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-05-18  7:54 UTC (permalink / raw)
  To: Paolo Bonzini, Andy Lutomirski
  Cc: Kai Huang, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev@lists.01.org, haim.cohen



On 5/17/2017 2:21 AM, Paolo Bonzini wrote:
>
>
> On 16/05/2017 02:48, Huang, Kai wrote:
>>
>>
>> If host only allows one single LE to run, KVM can add a restriction that
>> only allows to create KVM guest with runtime change to
>> IA32_SGXLEPUBKEYHASHn disabled, so that only host allowed (single) hash
>> can be used by guest. From guest's view, it simply has
>> IA32_FEATURE_CONTROL[bit17] cleared and has IA32_SGXLEPUBKEYHASHn with
>> default value to be host allowed (single) hash.
>>
>> If host allows several LEs (but not everything), and if we create guest
>> with 'lewr', then the behavior is not consistent with HW behavior, as
>> from guest's hardware's point of view, we can actually run any LE but we
>> have to tell guest that you are only allowed to change
>> IA32_SGXLEPUBKEYHASHn to some specific values. One compromise solution
>> is we don't allow to create guest with 'lewr' specified, and at the
>> same time, only allow to create guest with host approved hashes specified
>> in 'lehash'. This will make guest's behavior consistent to HW behavior
>> but only allows guest to run one LE (which is specified by 'lehash' when
>> guest is created).
>>
>> I'd like to hear comments from you guys.
>>
>> Paolo, do you also have comments here from KVM's side?
>
> I would start with read-only LE hash (same as the host), which is a
> valid configuration anyway.  Then later we can trap EINIT to emulate
> IA32_SGXLEPUBKEYHASHn.

You mean we can start with creating the guest without Qemu 'lewr' 
parameter support, always disallowing the guest to change 
IA32_SGXLEPUBKEYHASHn? Even in this case, KVM still needs to emulate 
IA32_SGXLEPUBKEYHASHn (allowing MSR reads but not writes), and to write 
the guest's value to the physical MSRs when running the guest (trapping 
EINIT and writing the MSRs during EINIT is really just a performance 
optimization), because the host can run multiple LEs and change the 
MSRs. Your suggestion only works when runtime change to 
IA32_SGXLEPUBKEYHASHn is disabled on the host (meaning the physical 
machine).
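
For illustration, the read-only emulation side could be a sketch like
the below, with msr_ia32_sgxlepubkeyhash[] as an assumed shadow field on
vcpu_vmx:

/*
 * Sketch only: the guest may read the hash MSRs, but writes take #GP
 * because the guest sees IA32_FEATURE_CONTROL bit 17 clear.  The shadow
 * value is still what gets loaded into the real MSRs around EINIT.
 */
static int sgx_lepubkeyhash_get(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        msr->data = vmx->msr_ia32_sgxlepubkeyhash[msr->index -
                                                  MSR_IA32_SGXLEPUBKEYHASH0];
        return 0;
}

static int sgx_lepubkeyhash_set(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
        return 1;       /* non-zero return injects #GP */
}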

Thanks,
-Kai
>
> Paolo
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-17 14:21           ` Sean Christopherson
@ 2017-05-18  8:14             ` Huang, Kai
  2017-05-20 21:55               ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-05-18  8:14 UTC (permalink / raw)
  To: Sean Christopherson, Jarkko Sakkinen, Andy Lutomirski
  Cc: Paolo Bonzini, haim.cohen, intel-sgx-kernel-dev, kvm list, Radim Krcmar



On 5/18/2017 2:21 AM, Sean Christopherson wrote:
> On Tue, 2017-05-16 at 11:56 +1200, Huang, Kai wrote:
>>
>> On 5/16/2017 12:46 AM, Jarkko Sakkinen wrote:
>>>
>>> On Thu, May 11, 2017 at 08:28:37PM -0700, Andy Lutomirski wrote:
>>>>
>>>> [resending due to some kind of kernel.org glitch -- sorry if anyone
>>>> gets duplicates]
>>>>
>>>> On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>> wrote:
>>>>>
>>>>> My current patch is based on this assumption. For KVM guest, naturally,
>>>>> we
>>>>> will write the cached value to real MSRs when vcpu is scheduled in. For
>>>>> host, SGX driver should write its own value to MSRs when it performs
>>>>> EINIT
>>>>> for LE.
>>>> This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
>>>> would propose a totally different solution:
>>>>
>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>> will take the mutex, compare the percpu variable to the desired value,
>>>> and, if it's different, do WRMSR and update the percpu variable.
>>> This is exactly what I've been suggesting internally: trap EINIT and
>>> check the value and write conditionally.
>>>
>>> I think this would be the best starting point.
>> OK. Assuming we are going to have this percpu variable for
>> IA32_SGXLEPUBKEYHASHn, I suppose KVM also will update guest's value to
>> this percpu variable after KVM writes guest's value to hardware MSR? And
>> host (SGX driver) needs to do the same thing (check the value and write
>> conditionally), correct?
>>
>> Thanks,
>> -Kai
>
> Yes, the percpu variable is simply a cache so that the kernel doesn't have to do
> four RDMSRs every time it wants to do EINIT.  KVM would still maintain shadow
> copies of the MSRs for each vcpu for emulating RDMSR, WRMSR and EINIT.  I don't
> think KVM would even need to be aware of the percpu variable, i.e. the entire
> lock->(rd/wr)msr->EINIT->unlock sequence can probably be encapsulated in a
> single function that is called from both the primary SGX driver and from KVM.
>

You are making the assumption that KVM will run ENCLS on behalf of the 
guest. :)

If we don't need to look into guest's SIGSTRUCT, EINITTOKEN, etc, then I 
actually prefer using MTF, as with MTF we don't have to do all the 
remapping of guest virtual addresses to KVM virtual addresses. But if we 
need to look into guest's ENCLS parameters, for example, to locate the 
physical SECS page, or to update a physical EPC page's info (which KVM 
needs to maintain), maybe we can choose running ENCLS on behalf of the 
guest.

But if we are going to run ENCLS on behalf of the guest, I think 
providing a single function which does the MSR writes and EINIT for KVM 
should be a good idea.
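
For ordinary guest memory (SIGSTRUCT, PAGEINFO and friends) the
remapping step might look roughly like the sketch below; guest SECS/EPC
pages are special (not normal memory) and would need KVM's own EPC
bookkeeping instead:

/*
 * Sketch only: translate a guest virtual address from an ENCLS
 * parameter into a host mapping before running ENCLS in root mode.
 * The caller must kunmap() and put_page() when done; error handling
 * for the exception case is elided.
 */
static void *sgx_map_guest_page(struct kvm_vcpu *vcpu, gva_t gva)
{
        struct x86_exception ex;
        struct page *page;
        gpa_t gpa;

        gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, &ex);
        if (gpa == UNMAPPED_GVA)
                return NULL;

        page = gfn_to_page(vcpu->kvm, gpa_to_gfn(gpa));
        if (is_error_page(page))
                return NULL;

        return kmap(page) + offset_in_page(gpa);
}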

Thanks,
-Kai

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-18  7:54               ` Huang, Kai
@ 2017-05-18  8:58                 ` Paolo Bonzini
  0 siblings, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2017-05-18  8:58 UTC (permalink / raw)
  To: Huang, Kai, Andy Lutomirski
  Cc: Kai Huang, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev@lists.01.org, haim.cohen



On 18/05/2017 09:54, Huang, Kai wrote:
>>
>> I would start with read-only LE hash (same as the host), which is a
>> valid configuration anyway.  Then later we can trap EINIT to emulate
>> IA32_SGXLEPUBKEYHASHn.
> 
> You mean we can start with creating guest without Qemu 'lewr' parameter
> support, and always disallowing guest to change IA32_SGXLEPUBKEYHASHn?
> Even in this way, KVM still needs to emulate IA32_SGXLEPUBKEYHASHn (just
> allow MSR reading but not writing), and write guest's value to physical
> MSRs when running guest (trapping EINIT and write MSRs during EINIT is
> really just performance optimization). Because host can run multiple LEs
> and change MSRs.

Oh, I didn't know this.  So I guess there isn't much benefit in skipping
the trapping of EINIT.

Paolo

> Your suggestion only works when runtime change to
> IA32_SGXLEPUBKEYHASHn is disabled on host (meaning physical machine).

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-15 23:56         ` Huang, Kai
  2017-05-16 14:23           ` Paolo Bonzini
  2017-05-17 14:21           ` Sean Christopherson
@ 2017-05-20 13:23           ` Jarkko Sakkinen
  2 siblings, 0 replies; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-05-20 13:23 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, kvm list, Radim Krcmar, haim.cohen,
	intel-sgx-kernel-dev, Paolo Bonzini

On Tue, May 16, 2017 at 11:56:38AM +1200, Huang, Kai wrote:
> 
> 
> On 5/16/2017 12:46 AM, Jarkko Sakkinen wrote:
> > On Thu, May 11, 2017 at 08:28:37PM -0700, Andy Lutomirski wrote:
> > > [resending due to some kind of kernel.org glitch -- sorry if anyone
> > > gets duplicates]
> > > 
> > > On Thu, May 11, 2017 at 5:32 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> > > > My current patch is based on this assumption. For KVM guest, naturally, we
> > > > will write the cached value to real MSRs when vcpu is scheduled in. For
> > > > host, SGX driver should write its own value to MSRs when it performs EINIT
> > > > for LE.
> > > 
> > > This seems unnecessarily slow (perhaps *extremely* slow) to me.  I
> > > would propose a totally different solution:
> > > 
> > > Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> > > with whatever lock is needed (probably just a mutex).  Users of EINIT
> > > will take the mutex, compare the percpu variable to the desired value,
> > > and, if it's different, do WRMSR and update the percpu variable.
> > 
> > This is exactly what I've been suggesting internally: trap EINIT and
> > check the value and write conditionally.
> > 
> > I think this would be the best starting point.
> 
> OK. Assuming we are going to have this percpu variable for
> IA32_SGXLEPUBKEYHASHn, I suppose KVM also will update guest's value to this
> percpu variable after KVM writes guest's value to hardware MSR? And host
> (SGX driver) needs to do the same thing (check the value and write
> conditionally), correct?
> 
> Thanks,
> -Kai

This is how I would understand it, yes.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-18  8:14             ` Huang, Kai
@ 2017-05-20 21:55               ` Andy Lutomirski
  2017-05-23  5:43                 ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-05-20 21:55 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Sean Christopherson, Jarkko Sakkinen, Andy Lutomirski,
	Paolo Bonzini, haim.cohen, intel-sgx-kernel-dev, kvm list,
	Radim Krcmar

On Thu, May 18, 2017 at 1:14 AM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> You are making the assumption that KVM will run ENCLS on behalf of the guest. :)
>
> If we don't need to look into guest's SIGSTRUCT, EINITTOKEN, etc, then I
> actually prefer using MTF, as with MTF we don't have to do all the
> remapping guest's virtual address to KVM's virtual address thing, if we
> don't need to look into guest's ENCLS parameter. But if we need to look into
> guest's ENCLS parameters, for example, to locate physical SECS page, or to
> update physical EPC page's info (that KVM needs to maintain), maybe we can
> choose running ENCLS on behalf of guest.

After thinking about this a bit, I don't see how MTF helps.
Currently, KVM works kind of like this:

local_irq_disable();
set up stuff;
VMRESUME;
restore some host state;
local_irq_enable();

If the guest is going to run with the EINIT-exiting bit clear, the
only way I see this working is to modify KVM along the lines of:

local_irq_disable();
set up stuff;
if (condition here) {
  WRMSR to SGXLEPUBKEYHASH;
  update percpu shadow copy;
  clear EINIT-exiting bit;
} else {
  set EINIT-exiting bit;
}
VMRESUME;
restore some host state;
local_irq_enable();

where "condition here" might be something like "the last VMRESUME
exited due to EINIT".

I don't see how MTF helps much.  And if I were the KVM maintainer, I
would probably prefer to trap EINIT instead of adding a special case
to the main vm entry code.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-20 21:55               ` Andy Lutomirski
@ 2017-05-23  5:43                 ` Huang, Kai
  2017-05-23  5:55                   ` Huang, Kai
  2017-05-23 16:34                   ` Andy Lutomirski
  0 siblings, 2 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-23  5:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Jarkko Sakkinen, Paolo Bonzini, haim.cohen,
	intel-sgx-kernel-dev, kvm list, Radim Krcmar



On 5/21/2017 9:55 AM, Andy Lutomirski wrote:
> On Thu, May 18, 2017 at 1:14 AM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>> You are making the assumption that KVM will run ENCLS on behalf of the guest. :)
>>
>> If we don't need to look into guest's SIGSTRUCT, EINITTOKEN, etc, then I
>> actually prefer using MTF, as with MTF we don't have to do all the
>> remapping guest's virtual address to KVM's virtual address thing, if we
>> don't need to look into guest's ENCLS parameter. But if we need to look into
>> guest's ENCLS parameters, for example, to locate physical SECS page, or to
>> update physical EPC page's info (that KVM needs to maintain), maybe we can
>> choose running ENCLS on behalf of guest.
>
> After thinking about this a bit, I don't see how MTF helps.
> Currently, KVM works kind of like this:
>
> local_irq_disable();
> set up stuff;
> VMRESUME;
> restore some host state;
> local_irq_enable();
>
> If the guest is going to run with the EINIT-exiting bit clear, the
> only way I see this working is to modify KVM along the lines of:
>
> local_irq_disable();
> set up stuff;
> if (condition here) {
>   WRMSR to SGXLEPUBKEYHASH;
>   update percpu shadow copy;
>   clear EINIT-exiting bit;
> } else {
>   set EINIT-exiting bit;
> }
> VMRESUME;
> restore some host state;
> local_irq_enable();
>
> where "condition here" might be something like "the last VMRESUME
> exited due to EINIT".
>
> I don't see how MTF helps much.  And if I were the KVM maintainer, I
> would probably prefer to trap EINIT instead of adding a special case
> to the main vm entry code.
>

Hi Andy,

Thanks for your comments. However, I didn't intend to use MTF in that 
way. The idea of using MTF (along with ENCLS VMEXIT) is that, by turning 
on MTF VMEXIT upon ENCLS VMEXIT, we are able to take a single-step 
VMEXIT right after ENCLS, so that ENCLS runs in the guest single-stepped.

Let me explain below how the two approaches work in general, so that we 
can decide which is better. Only trapping EINIT in order to update 
IA32_SGXLEPUBKEYHASHn is relatively simple, but I'd compare the two in a 
more general way, assuming we may want to trap more ENCLS leaves in 
order to, e.g., track EPC/enclave status/info, in the future to support, 
e.g., EPC oversubscription between KVM guests.

The diagrams below show the basic idea of the two approaches.


	--------------------------------------------------------------
			|	ENCLS		|
	--------------------------------------------------------------
			|	   	       /|\
	ENCLS VMEXIT	|			| VMENTRY
			|			|
		       \|/			|
		
     		1) identify which ENCLS leaf (RAX)
     		2) reconstruct/remap guest's ENCLS parameters, ex:
			- remap any guest VA (virtual address) to KVM VA
                 	- reconstruct PAGEINFO
         	3) do whatever needed before ENCLS, ex:
			- updating MSRs before EINIT
     		4) run ENCLS on behalf of guest, and skip ENCLS
		5) emulate ENCLS result (succeeded or not)
			- update guest's RAX-RDX.
			- and/or inject #GP (or #UD).
		6) do whatever needed after ENCLS, ex:
			- updating EPC/Enclave status/info

		   	1) Run ENCLS on behalf of guest


	--------------------------------------------------------------
			 |	ENCLS		   |
	--------------------------------------------------------------
			|/|\		          |/|\
	ENCLS VMEXIT	| | VMENTRY    MTF VMEXIT | | VMENTRY
		        | |	                  | |
		       \|/|		         \|/|
	1) Turn off ENCLS VMEXIT      1) Turn off MTF VMEXIT
	2) Turn on MTF VMEXIT	      2) Turn on ENCLS VMEXIT
	3) cache ENCLS parameters     3) check whether ENCLS has run
	   (ENCLS changes RAX)        4) check whether ENCLS succeeded
	4) do whatever needed before     or not.
            ENCLS                      5) do whatever needed after ENCLS

			2) Using MTF

The concern with running ENCLS on behalf of the guest is emulating ENCLS 
errors. KVM needs to *correctly* emulate ENCLS errors to the guest so 
that the error we inject reflects the right behavior, as if ENCLS had 
run in the guest. Running ENCLS in root mode may potentially behave 
differently from running ENCLS in non-root mode, therefore we have to go 
through all possible error codes to make sure we can emulate them. And 
for some error codes, e.g., SGX_LOCKFAIL, we can handle them in KVM and 
don't have to inject an error into the guest. So the point is we have to 
go through all error codes to make sure KVM can emulate them correctly 
for the guest. Another argument is that Intel may add new error codes in 
the future when more SGX functionality is introduced, so emulating error 
codes may become a burden.

Using MTF is also a little bit tricky, as when we turn on MTF VMEXIT 
upon ENCLS VMEXIT, the MTF won't necessarily be pending at the end of 
that ENCLS. For example, MTF may be pending at the end of an interrupt 
(cannot recall exactly) if an event is pending during the VMENTRY from 
the ENCLS VMEXIT. Therefore we have to do additional work to check 
whether this MTF VMEXIT really happened after ENCLS ran (step 3 above). 
And depending on what we need to do, we may need to check whether ENCLS 
succeeded or not in the guest, which is also tricky, as ENCLS can fail 
by either setting an error code in RAX, or by generating #GP or #UD 
(step 4 above). We may still need to do gva->gpa->hpa translation, e.g., 
in order to locate the EPC/SECS page and update its status, depending on 
the purpose of trapping ENCLS.

But by using MTF, we don't have to worry about ENCLS error emulation at 
all, as ENCLS runs in the guest, thus we don't need to worry about the 
root-mode vs. non-root-mode difference. I think this is the major reason 
to use MTF.
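
For illustration, the control toggling itself would be small, something
like the sketch below (with SECONDARY_EXEC_ENCLS_EXITING standing for
the architectural ENCLS-exiting bit; the tricky "did ENCLS really
run/succeed" checks described above are elided):

/* Sketch only: single-step ENCLS via MTF. */
static int handle_encls(struct kvm_vcpu *vcpu)
{
        vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
                        SECONDARY_EXEC_ENCLS_EXITING);
        vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
                      CPU_BASED_MONITOR_TRAP_FLAG);
        /* cache RAX (leaf) and RBX/RCX/RDX (parameters) for the MTF exit */
        return 1;       /* re-enter the guest; ENCLS now runs natively */
}

static int handle_monitor_trap(struct kvm_vcpu *vcpu)
{
        vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
                        CPU_BASED_MONITOR_TRAP_FLAG);
        vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
                      SECONDARY_EXEC_ENCLS_EXITING);
        /* verify ENCLS actually executed and inspect its RAX/#GP/#UD result */
        return 1;
}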

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-23  5:43                 ` Huang, Kai
@ 2017-05-23  5:55                   ` Huang, Kai
  2017-05-23 16:34                   ` Andy Lutomirski
  1 sibling, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-23  5:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Jarkko Sakkinen, Paolo Bonzini, haim.cohen,
	intel-sgx-kernel-dev, kvm list, Radim Krcmar



On 5/23/2017 5:43 PM, Huang, Kai wrote:
>
>
> On 5/21/2017 9:55 AM, Andy Lutomirski wrote:
>> On Thu, May 18, 2017 at 1:14 AM, Huang, Kai
>> <kai.huang@linux.intel.com> wrote:
>>> You are making the assumption that KVM will run ENCLS on behalf of the guest. :)
>>>
>>> If we don't need to look into guest's SIGSTRUCT, EINITTOKEN, etc, then I
>>> actually prefer using MTF, as with MTF we don't have to do all the
>>> remapping guest's virtual address to KVM's virtual address thing, if we
>>> don't need to look into guest's ENCLS parameter. But if we need to
>>> look into
>>> guest's ENCLS parameters, for example, to locate physical SECS page,
>>> or to
>>> update physical EPC page's info (that KVM needs to maintain), maybe
>>> we can
>>> choose running ENCLS on behalf of guest.
>>
>> After thinking about this a bit, I don't see how MTF helps.
>> Currently, KVM works kind of like this:
>>
>> local_irq_disable();
>> set up stuff;
>> VMRESUME;
>> restore some host state;
>> local_irq_enable();
>>
>> If the guest is going to run with the EINIT-exiting bit clear, the
>> only way I see this working is to modify KVM along the lines of:
>>
>> local_irq_disable();
>> set up stuff;
>> if (condition here) {
>>   WRMSR to SGXLEPUBKEYHASH;
>>   update percpu shadow copy;
>>   clear EINIT-exiting bit;
>> } else {
>>   set EINIT-exiting bit;
>> }
>> VMRESUME;
>> restore some host state;
>> local_irq_enable();
>>
>> where "condition here" might be something like "the last VMRESUME
>> exited due to EINIT".
>>
>> I don't see how MTF helps much.  And if I were the KVM maintainer, I
>> would probably prefer to trap EINIT instead of adding a special case
>> to the main vm entry code.
>>
>
> Hi Andy,
>
> Thanks for your comments. However I didn't intend to use MTF in your
> way. The idea of using MTF (along with ENCLS VMEXIT) is, by turning on
> MTF VMEXIT upon ENCLS VMEXIT, we are able to mark a single step VMEXIT
> after ENCLS so that ENCLS can run in guest as single step.
>
> Let me explain how the two approaches work below in general, so that we
> can decide which is better. Only trapping EINIT in order to update
> IA32_SGXLEPUBKEYHASHn is relatively simpler but I'd compare the two in
> more general way, assuming we may want to trap more ENCLS in order to,
> ex, track EPC/Enclave status/info, in the future to support, ex, EPC
> oversubscription between KVM guests.
>
> Below diagram shows the basic idea of the two approaches.
>
>
>     --------------------------------------------------------------
>                     |       ENCLS           |
>     --------------------------------------------------------------
>                     |                      /|\
>     ENCLS VMEXIT    |                       | VMENTRY
>                     |                       |
>                    \|/                      |

Looks like the diagrams got broken. Sorry, I need to verify before 
sending next time. But it looks like they are still understandable?

Thanks,
Kai
>
>             1) identify which ENCLS leaf (RAX)
>             2) reconstruct/remap guest's ENCLS parameters, ex:
>             - remap any guest VA (virtual address) to KVM VA
>                     - reconstruct PAGEINFO
>             3) do whatever needed before ENCLS, ex:
>             - updating MSRs before EINIT
>             4) run ENCLS on behalf of guest, and skip ENCLS
>         5) emulate ENCLS result (succeeded or not)
>             - update guest's RAX-RDX.
>             - and/or inject #GP (or #UD).
>         6) do whatever needed after ENCLS, ex:
>             - updating EPC/Enclave status/info
>
>                1) Run ENCLS on behalf of guest
>
>
>     --------------------------------------------------------------
>                      |      ENCLS              |
>     --------------------------------------------------------------
>                     |/|\                      |/|\
>     ENCLS VMEXIT    | | VMENTRY    MTF VMEXIT | | VMENTRY
>                     | |                       | |
>                    \|/|                      \|/|
>     1) Turn off ENCLS VMEXIT      1) Turn off MTF VMEXIT
>     2) Turn on MTF VMEXIT         2) Turn on ENCLS VMEXIT
>     3) cache ENCLS parameters     3) check whether ENCLS has run
>        (ENCLS changes RAX)        4) check whether ENCLS succeeded
>     4) do whatever needed before     or not
>        ENCLS                      5) do whatever needed after ENCLS
>
>             2) Using MTF
>
> The concern of running ENCLS on behalf of guest is emulating ENCLS
> error. KVM needs to *correctly* emulate ENCLS error to guest so that the
> error we inject to guest can reflect the right behavior as if ENCLS run
> in guest. Running ENCLS in root-mode may be potentially different
> running ENCLS in non-root mode, therefore we have to go through all
> possible error codes to make sure we can emulate. And for some error
> code, ex, SGX_LOCKFAIL, we can handle it in KVM and don't have to inject
> error to guest. So the point is we have to go through all error code to
> make sure KVM can emulate ENCLS error code correctly for guest.
> Another argument is Intel may add new error codes in the future when
> more SGX functionalities are introduced, so emulating error code may be
> a burden.
>
> Using MTF is also a little bit tricky, as when we turn on MTF VMEXIT
> upon ENCLS VMEXIT, the MTF won't be absolutely pending at end of that
> ENCLS. For example, MTF may be pending at end of interrupt (cannot
> recall exactly) if event is pending during VMENTRY from ENCLS VMEXIT.
> Therefore we have to do additional thing to check whether this MTF
> VMEXIT really happens after ENCLS run (step 3 above). And depending on
> what we need to do, we may need to check whether ENCLS succeeded or not
> in guest, which is also tricky, as ENCLS can fail in either setting
> error code in RAX, or generating #GP or #UD (step 4 above). We may still
> need to do gva->gpa->hpa, ex, in order to locate EPC/SECS page and
> update status, depending on the purpose of trapping ENCLS.
>
> But by using MTF, we don't have to worry about ENCLS error emulation, as
> ENCLS runs in guest, thus we don't need to worry about this root-mode
> and non-root mode difference. I think this is the major reason that we
> want to use MTF.
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-23  5:43                 ` Huang, Kai
  2017-05-23  5:55                   ` Huang, Kai
@ 2017-05-23 16:34                   ` Andy Lutomirski
  2017-05-23 16:43                     ` Paolo Bonzini
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-05-23 16:34 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Sean Christopherson, Jarkko Sakkinen,
	Paolo Bonzini, haim.cohen, intel-sgx-kernel-dev, kvm list,
	Radim Krcmar

On Mon, May 22, 2017 at 10:43 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 5/21/2017 9:55 AM, Andy Lutomirski wrote:
>>
>> On Thu, May 18, 2017 at 1:14 AM, Huang, Kai <kai.huang@linux.intel.com>
>> wrote:
>>>
>>> You are making the assumption that KVM will run ENCLS on behalf of the guest. :)
>>>
>>> If we don't need to look into guest's SIGSTRUCT, EINITTOKEN, etc, then I
>>> actually prefer using MTF, as with MTF we don't have to do all the
>>> remapping guest's virtual address to KVM's virtual address thing, if we
>>> don't need to look into guest's ENCLS parameter. But if we need to look
>>> into
>>> guest's ENCLS parameters, for example, to locate physical SECS page, or
>>> to
>>> update physical EPC page's info (that KVM needs to maintain), maybe we
>>> can
>>> choose running ENCLS on behalf of guest.
>>
>>
>> After thinking about this a bit, I don't see how MTF helps.
>> Currently, KVM works kind of like this:
>>
>> local_irq_disable();
>> set up stuff;
>> VMRESUME;
>> restore some host state;
>> local_irq_enable();
>>
>> If the guest is going to run with the EINIT-exiting bit clear, the
>> only way I see this working is to modify KVM along the lines of:
>>
>> local_irq_disable();
>> set up stuff;
>> if (condition here) {
>>   WRMSR to SGXLEPUBKEYHASH;
>>   update percpu shadow copy;
>>   clear EINIT-exiting bit;
>> } else {
>>   set EINIT-exiting bit;
>> }
>> VMRESUME;
>> restore some host state;
>> local_irq_enable();
>>
>> where "condition here" might be something like "the last VMRESUME
>> exited due to EINIT".
>>
>> I don't see how MTF helps much.  And if I were the KVM maintainer, I
>> would probably prefer to trap EINIT instead of adding a special case
>> to the main vm entry code.
>>
>
> Hi Andy,
>
> Thanks for your comments. However, I didn't intend to use MTF in that way.
> The idea of using MTF (along with ENCLS VMEXIT) is that, by turning on MTF
> VMEXIT upon ENCLS VMEXIT, we are able to arm a single-step VMEXIT after
> ENCLS, so that ENCLS runs in the guest as a single step.
>
> Let me explain below how the two approaches work in general, so that we can
> decide which is better. Trapping only EINIT in order to update
> IA32_SGXLEPUBKEYHASHn is relatively simple, but I'd rather compare the two
> more generally, assuming we may want to trap more ENCLS leaf functions in
> the future, e.g., to track EPC/enclave status in order to support EPC
> oversubscription between KVM guests.
>
> The diagram below shows the basic idea of the two approaches.
>

...

>
>         --------------------------------------------------------------
>                          |      ENCLS              |
>         --------------------------------------------------------------
>                         |/|\                      |/|\
>         ENCLS VMEXIT    | | VMENTRY    MTF VMEXIT | | VMENTRY
>                         | |                       | |
>                        \|/|                      \|/|
>         1) Turn off ENCLS VMEXIT      1) Turn off MTF VMEXIT
>         2) turn on MTF VMEXIT         2) Turn on ENCLS VMEXIT
>         3) cache ENCLS parameters     3) check whether ENCLS has run
>            (ENCLS changes RAX)        4) check whether ENCLS succeeded
>         4) do whatever needed before     or not.
>            ENCLS                      5) do whatever needed after ENCLS
>
>                         2) Using MTF
>

...

> Using MTF is also a little bit tricky: when we turn on MTF VMEXIT upon
> ENCLS VMEXIT, the MTF won't necessarily be pending at the end of that
> ENCLS. For example, MTF may instead be pending at the end of an interrupt
> (I cannot recall exactly) if an event is pending during the VMENTRY from
> the ENCLS VMEXIT. Therefore we have to do additional work to check whether
> this MTF VMEXIT really happened after ENCLS ran (step 3 above). And
> depending on what we need to do, we may need to check whether ENCLS
> succeeded in the guest, which is also tricky, as ENCLS can fail either by
> setting an error code in RAX or by generating #GP or #UD (step 4 above).
> We may still need to do the gva->gpa->hpa translation, e.g., in order to
> locate the EPC/SECS page and update its status, depending on the purpose
> of trapping ENCLS.

I think there are some issues here.

First, you're making a big assumption that, when you resume the guest
with MTF set, the instruction that gets executed is still
ENCLS[EINIT].  That's not guaranteed as is -- you could race against
another vCPU that changes the instruction, the instruction could be in
IO space, host userspace could be messing with you, etc.  Second, I
don't think there's any precedent at all in KVM for doing this.
Third, you still need to make sure that the MSRs retain the value you
want them to have by the time ENCLS happens.  I think that, by the
time you resolve all of these issues, it'll look a lot like the
pseudocode I emailed out, and MTF won't be necessary any more.

>
> But by using MTF, we don't have to worry about ENCLS error emulation:
> ENCLS runs in the guest, so we don't need to worry about the root-mode
> vs. non-root-mode difference. I think this is the major reason to use
> MTF.

I don't see why error emulation is hard.  If the host does ENCLS on
behalf of the guest and it returns an error, can't you return exactly
the same error to the guest with no further processing?  The only
tricky case is where the host rejects due to its own policy and you
have to choose an error code.
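
(A minimal sketch of that "forward the error" flow, assuming a
hypothetical host-side sgx_einit() helper that runs ENCLS[EINIT] with
the guest's already-translated parameters and returns EINIT's RAX error
code, 0 on success:)

        static int handle_einit(struct kvm_vcpu *vcpu)
        {
                int ret = sgx_einit(vcpu);      /* host runs ENCLS[EINIT] */

                /* Host policy rejection has no architectural error code;
                 * SGX_INVALID_EINITTOKEN is arguably the closest fit. */
                if (ret == -EACCES)
                        ret = SGX_INVALID_EINITTOKEN;

                /* The guest sees exactly the RAX value EINIT produced. */
                kvm_register_write(vcpu, VCPU_REGS_RAX, ret);
                return kvm_skip_emulated_instruction(vcpu);
        }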

--Andy

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-23 16:34                   ` Andy Lutomirski
@ 2017-05-23 16:43                     ` Paolo Bonzini
  2017-05-24  8:20                       ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2017-05-23 16:43 UTC (permalink / raw)
  To: Andy Lutomirski, Huang, Kai
  Cc: Sean Christopherson, Jarkko Sakkinen, haim.cohen,
	intel-sgx-kernel-dev@lists.01.org, kvm list, Radim Krcmar



On 23/05/2017 18:34, Andy Lutomirski wrote:
> 
>> Using MTF is also a little bit tricky: when we turn on MTF VMEXIT upon
>> ENCLS VMEXIT, the MTF won't necessarily be pending at the end of that
>> ENCLS. For example, MTF may instead be pending at the end of an interrupt
>> (I cannot recall exactly) if an event is pending during the VMENTRY from
>> the ENCLS VMEXIT. Therefore we have to do additional work to check whether
>> this MTF VMEXIT really happened after ENCLS ran (step 3 above). And
>> depending on what we need to do, we may need to check whether ENCLS
>> succeeded in the guest, which is also tricky, as ENCLS can fail either by
>> setting an error code in RAX or by generating #GP or #UD (step 4 above).
>> We may still need to do the gva->gpa->hpa translation, e.g., in order to
>> locate the EPC/SECS page and update its status, depending on the purpose
>> of trapping ENCLS.
> I think there are some issues here.
> 
> First, you're making a big assumption that, when you resume the guest
> with MTF set, the instruction that gets executed is still
> ENCLS[EINIT].  That's not guaranteed as is -- you could race against
> another vCPU that changes the instruction, the instruction could be in
> IO space, host userspace could be messing with you, etc.  Second, I
> don't think there's any precedent at all in KVM for doing this.
> Third, you still need to make sure that the MSRs retain the value you
> want them to have by the time ENCLS happens.  I think that, by the
> time you resolve all of these issues, it'll look a lot like the
> pseudocode I emailed out, and MTF won't be necessary any more.

Agreed.  Emulation in the host is better.

Paolo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-23 16:43                     ` Paolo Bonzini
@ 2017-05-24  8:20                       ` Huang, Kai
  0 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-05-24  8:20 UTC (permalink / raw)
  To: Paolo Bonzini, Andy Lutomirski
  Cc: Sean Christopherson, Jarkko Sakkinen, haim.cohen,
	intel-sgx-kernel-dev@lists.01.org, kvm list, Radim Krcmar



On 5/24/2017 4:43 AM, Paolo Bonzini wrote:
>
>
> On 23/05/2017 18:34, Andy Lutomirski wrote:
>>
>>> Using MTF is also a little bit tricky: when we turn on MTF VMEXIT upon
>>> ENCLS VMEXIT, the MTF won't necessarily be pending at the end of that
>>> ENCLS. For example, MTF may instead be pending at the end of an interrupt
>>> (I cannot recall exactly) if an event is pending during the VMENTRY from
>>> the ENCLS VMEXIT. Therefore we have to do additional work to check whether
>>> this MTF VMEXIT really happened after ENCLS ran (step 3 above). And
>>> depending on what we need to do, we may need to check whether ENCLS
>>> succeeded in the guest, which is also tricky, as ENCLS can fail either by
>>> setting an error code in RAX or by generating #GP or #UD (step 4 above).
>>> We may still need to do the gva->gpa->hpa translation, e.g., in order to
>>> locate the EPC/SECS page and update its status, depending on the purpose
>>> of trapping ENCLS.
>> I think there are some issues here.
>>
>> First, you're making a big assumption that, when you resume the guest
>> with MTF set, the instruction that gets executed is still
>> ENCLS[EINIT].  That's not guaranteed as is -- you could race against
>> another vCPU that changes the instruction, the instruction could be in
>> IO space, host userspace could be messing with you, etc.  Second, I
>> don't think there's any precedent at all in KVM for doing this.
>> Third, you still need to make sure that the MSRs retain the value you
>> want them to have by the time ENCLS happens.  I think that, by the
>> time you resolve all of these issues, it'll look a lot like the
>> pseudocode I emailed out, and MTF won't be necessary any more.
>
> Agreed.  Emulation in the host is better.

Hi Andy/Paolo,

Thanks for the comments. I'll follow your suggestion in v2.

Thanks,
-Kai

>
> Paolo
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-18  7:45               ` Huang, Kai
@ 2017-06-06 20:52                 ` Huang, Kai
  2017-06-06 21:22                   ` Andy Lutomirski
  2017-06-08 12:31                   ` Jarkko Sakkinen
  0 siblings, 2 replies; 78+ messages in thread
From: Huang, Kai @ 2017-06-06 20:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, haim.cohen, Jarkko Sakkinen



On 5/18/2017 7:45 PM, Huang, Kai wrote:
> 
> 
> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai 
>> <kai.huang@linux.intel.com> wrote:
>>>
>>>
>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>
>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>> wrote:
>>>>>
>>>>> I am not sure whether the cost of writing to 4 MSRs would be
>>>>> *extremely* slow, as when a vcpu is scheduled in, KVM is already
>>>>> doing vmcs_load, writing to several MSRs, etc.
>>>>
>>>>
>>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>>> unusually slow.
>>>>
>>>>>
>>>>>>
>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>>>> will take the mutex, compare the percpu variable to the desired 
>>>>>> value,
>>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>>
>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its 
>>>>>> in-memory
>>>>>> state but *not* changing the MSRs.  KVM will trap and emulate 
>>>>>> EINIT to
>>>>>> support the same handling as the host.  There is no action 
>>>>>> required at
>>>>>> all on KVM guest entry and exit.
>>>>>
>>>>>
>>>>>
>>>>> This is doable, but the SGX driver needs to do those things and
>>>>> expose interfaces for KVM to use. In terms of the percpu data, it is
>>>>> nice to have, but I am not sure whether it is mandatory, as IMO EINIT
>>>>> is not even in the performance-critical path. We can simply read the
>>>>> old values out of the MSRs and compare whether the old equals the new.
>>>>
>>>>
>>>> I think the SGX driver should probably live in arch/x86, and the
>>>> interface could be a simple percpu variable that is exported (from the
>>>> main kernel image, not from a module).
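
(For concreteness, a minimal sketch of the percpu cache idea above. All
names here, including the MSR constant, are made up for illustration,
and it assumes EINIT users are already serialized by the mutex as
suggested:)

        struct sgx_lepubkeyhash {
                u64 hash[4];
        };
        static DEFINE_PER_CPU(struct sgx_lepubkeyhash, sgx_lehash_cache);

        /* Write IA32_SGXLEPUBKEYHASH0-3 only if they actually changed. */
        static void sgx_update_lepubkeyhash(const u64 *hash)
        {
                struct sgx_lepubkeyhash *cache;
                int i;

                preempt_disable();
                cache = this_cpu_ptr(&sgx_lehash_cache);
                for (i = 0; i < 4; i++) {
                        if (cache->hash[i] != hash[i]) {
                                wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);
                                cache->hash[i] = hash[i];
                        }
                }
                preempt_enable();
        }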
>>>>
>>>>>
>>>>>>
>>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>>>> other reasons: someone is going to want to implement policy for what
>>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>>
>>>>>
>>>>>
>>>>> I am not very convinced why "what enclaves are allowed" in the host
>>>>> would apply to the guest. Can you elaborate? I mean, in general,
>>>>> virtualization just focuses on emulating hardware behavior. If a
>>>>> native machine is able to run any LE, the virtual machine should be
>>>>> able to as well (of course, with the guest's
>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>
>>>>
>>>> I strongly disagree.  I can imagine two classes of sensible policies
>>>> for launch control:
>>>>
>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>
>>>> 2. Allow some things, and make sure that VMs have at least as
>>>> restrictive a policy as host root has.  After all, what's the point of
>>>> restricting enclaves in the host if host code can simply spawn a
>>>> little VM to run otherwise-disallowed enclaves?
>>>
>>>
>>> What's the current SGX driver launch control policy? Yes, "allow
>>> everything" works for KVM, so let's skip this. Are we going to support
>>> allowing several LEs, or just allowing one single LE? I know Jarkko is
>>> doing in-kernel LE stuff but I don't know the details.
>>>
>>> I am trying to find a way that both does not break the host launch
>>> control policy and is consistent with HW behavior (from the guest's
>>> view). Currently we can create a KVM guest with runtime changes to
>>> IA32_SGXLEPUBKEYHASHn either enabled or disabled. I introduced a Qemu
>>> parameter 'lewr' for this purpose. Actually I introduced the Qemu SGX
>>> parameters below for creating a guest:
>>>
>>>         -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>
>>> where 'epc' specifies the guest's EPC size, 'lehash' specifies the
>>> (initial) value of the guest's IA32_SGXLEPUBKEYHASHn, and 'lewr'
>>> specifies whether the guest is allowed to change its
>>> IA32_SGXLEPUBKEYHASHn at runtime.
>>>
>>> If the host only allows one single LE to run, KVM can add a restriction
>>> that only allows creating a KVM guest with runtime changes to
>>> IA32_SGXLEPUBKEYHASHn disabled, so that only the host-allowed (single)
>>> hash can be used by the guest. From the guest's view, it simply has
>>> IA32_FEATURE_CONTROL[bit17] cleared and has IA32_SGXLEPUBKEYHASHn with
>>> the default value being the host-allowed (single) hash.
>>>
>>> If the host allows several LEs (but not everything), and if we create a
>>> guest with 'lewr', then the behavior is not consistent with HW
>>> behavior, as from the guest's hardware point of view it can actually
>>> run any LE, but we have to tell the guest that it is only allowed to
>>> change IA32_SGXLEPUBKEYHASHn to some specific values. One compromise
>>> solution is to not allow creating a guest with 'lewr' specified and, at
>>> the same time, only allow creating a guest with host-approved hashes
>>> specified in 'lehash'. This will make the guest's behavior consistent
>>> with HW behavior, but only allows the guest to run one LE (the one
>>> specified by 'lehash' when the guest is created).
>>
>> I'm not sure I entirely agree for a couple reasons.
>>
>> 1. I wouldn't be surprised if the kernel ends up implementing a policy
>> in which it checks all enclaves (not just LEs) for acceptability.  In
>> fact, if the kernel sticks with the "no LE at all or just
>> kernel-internal LE", then checking enclaves directly against some
>> admin- or distro-provided signer list seems reasonable.  This type of
>> policy can't be forwarded to a guest by restricting allowed LE
>> signers.  But this is mostly speculation since AFAIK no one has
>> seriously proposed any particular policy support and the plan was to
>> not have this for the initial implementation.
>>
>> 2. While matching hardware behavior is nice in principle, there
>> doesn't seem to be useful hardware behavior to match here.  If the
>> host had a list of five allowed LE signers, how exactly would it
>> restrict the MSRs?  They're not written atomically, so you can't
>> directly tell what's being written.
> 
> In this case I actually plan to just allow creating a guest with writes
> to the guest's IA32_SGXLEPUBKEYHASHn disabled (without 'lewr'
> specified). If 'lewr' is specified, creating the guest will fail. And we
> only allow creating a guest with host-allowed hash values (via
> 'lehash=hash-value'), and if the 'hash-value' specified by 'lehash' is
> not allowed by the host, we also fail to create the guest.
> 
> We can only allow creating a guest with 'lewr' specified when the host
> allows anything.
> 
> But in this way, we are restricting the guest OS's ability to run LEs,
> as only the one LE specified by the 'lehash' parameter can be run. But I
> think this won't hurt much, as multiple guests are still able to run
> different LEs?
> 
> Also, the only way to fail an MSR
>> write is to send #GP, and Windows (and even Linux) may not expect
>> that.  Linux doesn't panic due to #GP on MSR writes these days, but
>> you still get a big fat warning.  I wouldn't be at all surprised if
>> Windows BSODs.
> 
> We cannot allow writing some particular values to the MSRs successfully
> while injecting #GP when writing other values to the same MSRs. So #GP
> is not an option.
> 
> ENCLS[EINIT], on the other hand, returns an actual
>> error code.  I'm not sure that a sensible error code exists
>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
> 
> It looks like no such error code exists. And we cannot return such an
> error code to the guest, as it is only supposed to be valid when ENCLS
> is run in the hypervisor.
> 
> but SGX_INVALID_EINITTOKEN seems
>> to mean, more or less, "the CPU thinks you're not authorized to do
>> this", so forcing that error code could be entirely reasonable.
>>
>> If the host policy is to allow a list of LE signers, you could return
>> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
>> the list.
> 
> But this would be inconsistent with HW behavior. If the hash value in
> the guest's IA32_SGXLEPUBKEYHASHn matches the one passed to EINIT, EINIT
> is not supposed to return SGX_INVALID_EINITTOKEN.
> 
> I think that, from the VMM's perspective, emulating behavior consistent
> with real HW behavior is very important.
> 
> Paolo, would you provide your comments?

Hi all,

This has been quiet for a while and I'd like to restart the discussion.
Jarkko told me that currently he only supports one LE in the SGX driver,
but I am not sure whether he is going to extend this in the future. I
think this might also depend on requirements from customers.

Andy,

If we only support one LE in the driver, then we can only support the
same LE for all KVM guests, per your comment that the host kernel launch
control policy should also apply to KVM guests? Would you comment more?

Jarkko,

Could you help clarify the whole host-side launch control policy so that
we can have a better understanding together?

Thanks,
-Kai

> 
>>
>> --Andy
>>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-06 20:52                 ` Huang, Kai
@ 2017-06-06 21:22                   ` Andy Lutomirski
  2017-06-06 22:51                     ` Huang, Kai
  2017-06-08 12:31                   ` Jarkko Sakkinen
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-06 21:22 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen, Jarkko Sakkinen

On Tue, Jun 6, 2017 at 1:52 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 5/18/2017 7:45 PM, Huang, Kai wrote:
>>
>>
>>
>> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>>>
>>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai <kai.huang@linux.intel.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>>
>>>>>
>>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> I am not sure whether the cost of writing to 4 MSRs would be
>>>>>> *extremely*
>>>>>> slow, as when a vcpu is scheduled in, KVM is already doing vmcs_load,
>>>>>> writing
>>>>>> to several MSRs, etc.
>>>>>
>>>>>
>>>>>
>>>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>>>> unusually slow.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>>>>> will take the mutex, compare the percpu variable to the desired
>>>>>>> value,
>>>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>>>
>>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its
>>>>>>> in-memory
>>>>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT
>>>>>>> to
>>>>>>> support the same handling as the host.  There is no action required
>>>>>>> at
>>>>>>> all on KVM guest entry and exit.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is doable, but SGX driver needs to do those things and expose
>>>>>> interfaces for KVM to use. In terms of the percpu data, it is nice to
>>>>>> have,
>>>>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>>>>> performance critical path. We can simply read old value from MSRs out
>>>>>> and
>>>>>> compare whether the old equals to the new.
>>>>>
>>>>>
>>>>>
>>>>> I think the SGX driver should probably live in arch/x86, and the
>>>>> interface could be a simple percpu variable that is exported (from the
>>>>> main kernel image, not from a module).
>>>>>
>>>>>>
>>>>>>>
>>>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>>>>> other reasons: someone is going to want to implement policy for what
>>>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am not very convinced why "what enclaves are allowed" in host would
>>>>>> apply
>>>>>> to guest. Can you elaborate? I mean in general virtualization just
>>>>>> focus
>>>>>> emulating hardware behavior. If a native machine is able to run any
>>>>>> LE,
>>>>>> the
>>>>>> virtual machine should be able to as well (of course, with guest's
>>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>>
>>>>>
>>>>>
>>>>> I strongly disagree.  I can imagine two classes of sensible policies
>>>>> for launch control:
>>>>>
>>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>>
>>>>> 2. Allow some things, and make sure that VMs have at least as
>>>>> restrictive a policy as host root has.  After all, what's the point of
>>>>> restricting enclaves in the host if host code can simply spawn a
>>>>> little VM to run otherwise-disallowed enclaves?
>>>>
>>>>
>>>>
>>>> What's the current SGX driver launch control policy? Yes allow
>>>> everything
>>>> works for KVM so lets skip this. Are we going to support allowing
>>>> several
>>>> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
>>>> stuff but I don't know the details.
>>>>
>>>> I am trying to find a way that we can both not break host launch control
>>>> policy, and be consistent to HW behavior (from guest's view). Currently
>>>> we
>>>> can create a KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>> either
>>>> enabled or disabled. I introduced a Qemu parameter 'lewr' for this
>>>> purpose.
>>>> Actually I introduced below Qemu SGX parameters for creating guest:
>>>>
>>>>         -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>>
>>>> where 'epc' specifies guest's EPC size, lehash specifies (initial) value
>>>> of
>>>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether guest is
>>>> allowed
>>>> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>>>
>>>> If host only allows one single LE to run, KVM can add a restriction that
>>>> only
>>>> allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>> disabled, so that only host allowed (single) hash can be used by guest.
>>>> From
>>>> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
>>>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed (single)
>>>> hash.
>>>>
>>>> If host allows several LEs (but not everything), and if we create guest
>>>> with
>>>> 'lewr', then the behavior is not consistent with HW behavior, as from
>>>> guest's hardware's point of view, we can actually run any LE but we have
>>>> to
>>>> tell guest that you are only allowed to change IA32_SGXLEPUBKEYHASHn to
>>>> some
>>>> specific values. One compromise solution is we don't allow to create
>>>> guest
>>>> with 'lewr' specified, and at the same time, only allow to create guest
>>>> with
>>>> host approved hashes specified in 'lehash'. This will make guest's
>>>> behavior
>>>> consistent to HW behavior but only allows guest to run one LE (which is
>>>> specified by 'lehash' when guest is created).
>>>
>>>
>>> I'm not sure I entirely agree for a couple reasons.
>>>
>>> 1. I wouldn't be surprised if the kernel ends up implementing a policy
>>> in which it checks all enclaves (not just LEs) for acceptability.  In
>>> fact, if the kernel sticks with the "no LE at all or just
>>> kernel-internal LE", then checking enclaves directly against some
>>> admin- or distro-provided signer list seems reasonable.  This type of
>>> policy can't be forwarded to a guest by restricting allowed LE
>>> signers.  But this is mostly speculation since AFAIK no one has
>>> seriously proposed any particular policy support and the plan was to
>>> not have this for the initial implementation.
>>>
>>> 2. While matching hardware behavior is nice in principle, there
>>> doesn't seem to be useful hardware behavior to match here.  If the
>>> host had a list of five allowed LE signers, how exactly would it
>>> restrict the MSRs?  They're not written atomically, so you can't
>>> directly tell what's being written.
>>
>>
>> In this case I actually plan to just allow creating guest with guest's
>> IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified). If 'lewr' is
>> specified, creating guest will fail. And we only allow creating guest with
>> host allowed hash values (with 'lehash=hash-value'), and if 'hash-value'
>> specified by 'lehash' is not allowed by host, we also fail to create guest.
>>
>> We can only allow creating guest with 'lewr' specified when host allows
>> anything.
>>
>> But in this way, we are restricting guest OS's ability to run LE, as only
>> one LE, that is specified by 'lehash' parameter, can be run. But I think
>> this won't hurt much, as multiple guests still are able to run different
>> LEs?
>>
>> Also, the only way to fail an MSR
>>>
>>> write is to send #GP, and Windows (and even Linux) may not expect
>>> that.  Linux doesn't panic due to #GP on MSR writes these days, but
>>> you still get a big fat warning.  I wouldn't be at all surprised if
>>> Windows BSODs.
>>
>>
>> We cannot allow writing some particular value to MSRs successfully, while
>> injecting #GP when writing other values to the same MSRs. So #GP is not
>> an option.
>>
>> ENCLS[EINIT], on the other hand, returns an actual
>>>
>>> error code.  I'm not sure that a sensible error code exists
>>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
>>
>>
>> It looks like no such error code exists. And we cannot return such an
>> error code to the guest, as it is only supposed to be valid when ENCLS
>> is run in the hypervisor.
>>
>> but SGX_INVALID_EINITTOKEN seems
>>>
>>> to mean, more or less, "the CPU thinks you're not authorized to do
>>> this", so forcing that error code could be entirely reasonable.
>>>
>>> If the host policy is to allow a list of LE signers, you could return
>>> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
>>> the list.
>>
>>
>> But this would be inconsistent with HW behavior. If the hash value in
>> the guest's IA32_SGXLEPUBKEYHASHn matches the one passed to EINIT, EINIT
>> is not supposed to return SGX_INVALID_EINITTOKEN.
>>
>> I think from VMM's perspective, emulating HW behavior to be consistent
>> with real HW behavior is very important.
>>
>> Paolo, would you provide your comments?
>
>
> Hi all,
>
>> This has been quiet for a while and I'd like to restart the discussion.
> Jarkko told me that currently he only supports one LE in SGX driver, but I
> am not sure whether he is going to extend in the future or not. I think this
> might also depend on requirements from customers.
>
> Andy,
>
> If we only support one LE in driver, then we can only support the same LE
> for all KVM guests, according to your comments that host kernel launch
>> control policy should also apply to KVM guests? Would you comment more?

I'm not at all convinced that, going forward, Linux's host-side launch
control policy will be entirely contained in the LE.  I'm also not
convinced that non-Linux guests will function at all under this type
of policy -- what if FreeBSD's LE is different for whatever reason?

>
> Jarkko,
>
> Could you help to clarify the whole launch control policy in host side so
> that we can have a better understanding together?
>
> Thanks,
> -Kai
>
>>
>>>
>>> --Andy
>>>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-06 21:22                   ` Andy Lutomirski
@ 2017-06-06 22:51                     ` Huang, Kai
  2017-06-07 14:45                       ` Cohen, Haim
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-06-06 22:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, haim.cohen, Jarkko Sakkinen



On 6/7/2017 9:22 AM, Andy Lutomirski wrote:
> On Tue, Jun 6, 2017 at 1:52 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>
>>
>> On 5/18/2017 7:45 PM, Huang, Kai wrote:
>>>
>>>
>>>
>>> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I am not sure whether the cost of writing to 4 MSRs would be
>>>>>>> *extremely*
>>>>>>> slow, as when a vcpu is scheduled in, KVM is already doing vmcs_load,
>>>>>>> writing
>>>>>>> to several MSRs, etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>>>>> unusually slow.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>>>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>>>>>> will take the mutex, compare the percpu variable to the desired
>>>>>>>> value,
>>>>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>>>>
>>>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its
>>>>>>>> in-memory
>>>>>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT
>>>>>>>> to
>>>>>>>> support the same handling as the host.  There is no action required
>>>>>>>> at
>>>>>>>> all on KVM guest entry and exit.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is doable, but SGX driver needs to do those things and expose
>>>>>>> interfaces for KVM to use. In terms of the percpu data, it is nice to
>>>>>>> have,
>>>>>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>>>>>> performance critical path. We can simply read old value from MSRs out
>>>>>>> and
>>>>>>> compare whether the old equals to the new.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think the SGX driver should probably live in arch/x86, and the
>>>>>> interface could be a simple percpu variable that is exported (from the
>>>>>> main kernel image, not from a module).
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>>>>>> other reasons: someone is going to want to implement policy for what
>>>>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am not very convinced why "what enclaves are allowed" in host would
>>>>>>> apply
>>>>>>> to guest. Can you elaborate? I mean in general virtualization just
>>>>>>> focus
>>>>>>> emulating hardware behavior. If a native machine is able to run any
>>>>>>> LE,
>>>>>>> the
>>>>>>> virtual machine should be able to as well (of course, with guest's
>>>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>>>
>>>>>>
>>>>>>
>>>>>> I strongly disagree.  I can imagine two classes of sensible policies
>>>>>> for launch control:
>>>>>>
>>>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>>>
>>>>>> 2. Allow some things, and make sure that VMs have at least as
>>>>>> restrictive a policy as host root has.  After all, what's the point of
>>>>>> restricting enclaves in the host if host code can simply spawn a
>>>>>> little VM to run otherwise-disallowed enclaves?
>>>>>
>>>>>
>>>>>
>>>>> What's the current SGX driver launch control policy? Yes allow
>>>>> everything
>>>>> works for KVM so lets skip this. Are we going to support allowing
>>>>> several
>>>>> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
>>>>> stuff but I don't know the details.
>>>>>
>>>>> I am trying to find a way that we can both not break host launch control
>>>>> policy, and be consistent to HW behavior (from guest's view). Currently
>>>>> we
>>>>> can create a KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>>> either
>>>>> enabled or disabled. I introduced a Qemu parameter 'lewr' for this
>>>>> purpose.
>>>>> Actually I introduced below Qemu SGX parameters for creating guest:
>>>>>
>>>>>          -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>>>
>>>>> where 'epc' specifies guest's EPC size, lehash specifies (initial) value
>>>>> of
>>>>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether guest is
>>>>> allowed
>>>>> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>>>>
>>>>> If host only allows one single LE to run, KVM can add a restriction that
>>>>> only
>>>>> allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>>> disabled, so that only host allowed (single) hash can be used by guest.
>>>>> From
>>>>> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
>>>>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed (single)
>>>>> hash.
>>>>>
>>>>> If host allows several LEs (but not everything), and if we create guest
>>>>> with
>>>>> 'lewr', then the behavior is not consistent with HW behavior, as from
>>>>> guest's hardware's point of view, we can actually run any LE but we have
>>>>> to
>>>>> tell guest that you are only allowed to change IA32_SGXLEPUBKEYHASHn to
>>>>> some
>>>>> specific values. One compromise solution is we don't allow to create
>>>>> guest
>>>>> with 'lewr' specified, and at the same time, only allow to create guest
>>>>> with
>>>>> host approved hashes specified in 'lehash'. This will make guest's
>>>>> behavior
>>>>> consistent to HW behavior but only allows guest to run one LE (which is
>>>>> specified by 'lehash' when guest is created).
>>>>
>>>>
>>>> I'm not sure I entirely agree for a couple reasons.
>>>>
>>>> 1. I wouldn't be surprised if the kernel ends up implementing a policy
>>>> in which it checks all enclaves (not just LEs) for acceptability.  In
>>>> fact, if the kernel sticks with the "no LE at all or just
>>>> kernel-internal LE", then checking enclaves directly against some
>>>> admin- or distro-provided signer list seems reasonable.  This type of
>>>> policy can't be forwarded to a guest by restricting allowed LE
>>>> signers.  But this is mostly speculation since AFAIK no one has
>>>> seriously proposed any particular policy support and the plan was to
>>>> not have this for the initial implementation.
>>>>
>>>> 2. While matching hardware behavior is nice in principle, there
>>>> doesn't seem to be useful hardware behavior to match here.  If the
>>>> host had a list of five allowed LE signers, how exactly would it
>>>> restrict the MSRs?  They're not written atomically, so you can't
>>>> directly tell what's being written.
>>>
>>>
>>> In this case I actually plan to just allow creating guest with guest's
>>> IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified). If 'lewr' is
>>> specified, creating guest will fail. And we only allow creating guest with
>>> host allowed hash values (with 'lehash=hash-value'), and if 'hash-value'
>>> specified by 'lehash' is not allowed by host, we also fail to create guest.
>>>
>>> We can only allow creating guest with 'lewr' specified when host allows
>>> anything.
>>>
>>> But in this way, we are restricting guest OS's ability to run LE, as only
>>> one LE, that is specified by 'lehash' parameter, can be run. But I think
>>> this won't hurt much, as multiple guests still are able to run different
>>> LEs?
>>>
>>> Also, the only way to fail an MSR
>>>>
>>>> write is to send #GP, and Windows (and even Linux) may not expect
>>>> that.  Linux doesn't panic due to #GP on MSR writes these days, but
>>>> you still get a big fat warning.  I wouldn't be at all surprised if
>>>> Windows BSODs.
>>>
>>>
>>> We cannot allow writing some particular value to MSRs successfully, while
>>> injecting #GP when writing other values to the same MSRs. So #GP is not
>>> an option.
>>>
>>> ENCLS[EINIT], on the other hand, returns an actual
>>>>
>>>> error code.  I'm not sure that a sensible error code exists
>>>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
>>>
>>>
>>> It looks like no such error code exists. And we cannot return such an
>>> error code to the guest, as it is only supposed to be valid when ENCLS
>>> is run in the hypervisor.
>>>
>>> but SGX_INVALID_EINITTOKEN seems
>>>>
>>>> to mean, more or less, "the CPU thinks you're not authorized to do
>>>> this", so forcing that error code could be entirely reasonable.
>>>>
>>>> If the host policy is to allow a list of LE signers, you could return
>>>> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
>>>> the list.
>>>
>>>
>>> But this would be inconsistent with HW behavior. If the hash value in
>>> the guest's IA32_SGXLEPUBKEYHASHn matches the one passed to EINIT, EINIT
>>> is not supposed to return SGX_INVALID_EINITTOKEN.
>>>
>>> I think from VMM's perspective, emulating HW behavior to be consistent
>>> with real HW behavior is very important.
>>>
>>> Paolo, would you provide your comments?
>>
>>
>> Hi all,
>>
>> This has been quiet for a while and I'd like to restart the discussion.
>> Jarkko told me that currently he only supports one LE in SGX driver, but I
>> am not sure whether he is going to extend in the future or not. I think this
>> might also depend on requirements from customers.
>>
>> Andy,
>>
>> If we only support one LE in driver, then we can only support the same LE
>> for all KVM guests, according to your comments that host kernel launch
>> control policy should also apply to KVM guests? Would you comment more?
> 
> I'm not at all convinced that, going forward, Linux's host-side launch
> control policy will be entirely contained in the LE.  I'm also not
> convinced that non-Linux guests will function at all under this type
> of policy -- what if FreeBSD's LE is different for whatever reason?

I am not convinced either. I think we need Jarkko to elaborate on how
the host-side launch control policy is implemented, or whether there is
any policy at all. I also tried to read the SGX driver code, but it
looks like I couldn't find any implementation of this.

Hi Jarkko,

Can you elaborate on this?

Thanks,
-Kai
> 
>>
>> Jarkko,
>>
>> Could you help to clarify the whole launch control policy in host side so
>> that we can have a better understanding together?
>>
>> Thanks,
>> -Kai
>>
>>>
>>>>
>>>> --Andy
>>>>
>>
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-06 22:51                     ` Huang, Kai
@ 2017-06-07 14:45                       ` Cohen, Haim
  0 siblings, 0 replies; 78+ messages in thread
From: Cohen, Haim @ 2017-06-07 14:45 UTC (permalink / raw)
  To: Huang, Kai, Andy Lutomirski
  Cc: Kai Huang, Paolo Bonzini, Radim Krcmar, kvm list,
	intel-sgx-kernel-dev, Jarkko Sakkinen, Cohen, Haim

On 6/6/2017 6:52 PM, Huang, Kai wrote:
>On 6/7/2017 9:22 AM, Andy Lutomirski wrote:
>> On Tue, Jun 6, 2017 at 1:52 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>>
>>>
>>> On 5/18/2017 7:45 PM, Huang, Kai wrote:
>>>>
>>>>
>>>>
>>>> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>>>>>
>>>>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai
>>>>> <kai.huang@linux.intel.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai
>>>>>>> <kai.huang@linux.intel.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure whether the cost of writing to 4 MSRs would be
>>>>>>>> *extremely*
>>>>>>>> slow, as when a vcpu is scheduled in, KVM is already doing
>>>>>>>> vmcs_load, writing to several MSRs, etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>>>>>> unusually slow.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>>>>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>>>>>>> will take the mutex, compare the percpu variable to the desired
>>>>>>>>> value,
>>>>>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>>>>>
>>>>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its
>>>>>>>>> in-memory
>>>>>>>>> state but *not* changing the MSRs.  KVM will trap and emulate EINIT
>>>>>>>>> to
>>>>>>>>> support the same handling as the host.  There is no action required
>>>>>>>>> at
>>>>>>>>> all on KVM guest entry and exit.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This is doable, but SGX driver needs to do those things and expose
>>>>>>>> interfaces for KVM to use. In terms of the percpu data, it is nice to
>>>>>>>> have,
>>>>>>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>>>>>>> performance critical path. We can simply read old value from MSRs out
>>>>>>>> and
>>>>>>>> compare whether the old equals to the new.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think the SGX driver should probably live in arch/x86, and the
>>>>>>> interface could be a simple percpu variable that is exported (from the
>>>>>>> main kernel image, not from a module).
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>>>>>>> other reasons: someone is going to want to implement policy for what
>>>>>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not very convinced why "what enclaves are allowed" in host would
>>>>>>>> apply
>>>>>>>> to guest. Can you elaborate? I mean in general virtualization just
>>>>>>>> focus
>>>>>>>> emulating hardware behavior. If a native machine is able to run any
>>>>>>>> LE,
>>>>>>>> the
>>>>>>>> virtual machine should be able to as well (of course, with guest's
>>>>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I strongly disagree.  I can imagine two classes of sensible policies
>>>>>>> for launch control:
>>>>>>>
>>>>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>>>>
>>>>>>> 2. Allow some things, and make sure that VMs have at least as
>>>>>>> restrictive a policy as host root has.  After all, what's the point of
>>>>>>> restricting enclaves in the host if host code can simply spawn a
>>>>>>> little VM to run otherwise-disallowed enclaves?
>>>>>>
>>>>>>
>>>>>>
>>>>>> What's the current SGX driver launch control policy? Yes allow
>>>>>> everything
>>>>>> works for KVM so lets skip this. Are we going to support allowing
>>>>>> several
>>>>>> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
>>>>>> stuff but I don't know the details.
>>>>>>
>>>>>> I am trying to find a way that we can both not break host launch control
>>>>>> policy, and be consistent to HW behavior (from guest's view). Currently
>>>>>> we
>>>>>> can create a KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>>>> either
>>>>>> enabled or disabled. I introduced a Qemu parameter 'lewr' for this
>>>>>> purpose.
>>>>>> Actually I introduced below Qemu SGX parameters for creating guest:
>>>>>>
>>>>>>          -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>>>>
>>>>>> where 'epc' specifies guest's EPC size, lehash specifies (initial) value
>>>>>> of
>>>>>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether guest is
>>>>>> allowed
>>>>>> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>>>>>
>>>>>> If host only allows one single LE to run, KVM can add a restriction that
>>>>>> only
>>>>>> allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>>>> disabled, so that only host allowed (single) hash can be used by guest.
>>>>>> From
>>>>>> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
>>>>>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed (single)
>>>>>> hash.
>>>>>>
>>>>>> If host allows several LEs (but not everything), and if we create guest
>>>>>> with
>>>>>> 'lewr', then the behavior is not consistent with HW behavior, as from
>>>>>> guest's hardware's point of view, we can actually run any LE but we have
>>>>>> to
>>>>>> tell guest that you are only allowed to change IA32_SGXLEPUBKEYHASHn to
>>>>>> some
>>>>>> specific values. One compromise solution is we don't allow to create
>>>>>> guest
>>>>>> with 'lewr' specified, and at the same time, only allow to create guest
>>>>>> with
>>>>>> host approved hashes specified in 'lehash'. This will make guest's
>>>>>> behavior
>>>>>> consistent to HW behavior but only allows guest to run one LE (which is
>>>>>> specified by 'lehash' when guest is created).
>>>>>
>>>>>
>>>>> I'm not sure I entirely agree for a couple reasons.
>>>>>
>>>>> 1. I wouldn't be surprised if the kernel ends up implementing a policy
>>>>> in which it checks all enclaves (not just LEs) for acceptability.  In
>>>>> fact, if the kernel sticks with the "no LE at all or just
>>>>> kernel-internal LE", then checking enclaves directly against some
>>>>> admin- or distro-provided signer list seems reasonable.  This type of
>>>>> policy can't be forwarded to a guest by restricting allowed LE
>>>>> signers.  But this is mostly speculation since AFAIK no one has
>>>>> seriously proposed any particular policy support and the plan was to
>>>>> not have this for the initial implementation.
>>>>>
>>>>> 2. While matching hardware behavior is nice in principle, there
>>>>> doesn't seem to be useful hardware behavior to match here.  If the
>>>>> host had a list of five allowed LE signers, how exactly would it
>>>>> restrict the MSRs?  They're not written atomically, so you can't
>>>>> directly tell what's being written.
>>>>
>>>>
>>>> In this case I actually plan to just allow creating guest with guest's
>>>> IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified). If 'lewr' is
>>>> specified, creating guest will fail. And we only allow creating guest with
>>>> host allowed hash values (with 'lehash=hash-value'), and if 'hash-value'
>>>> specified by 'lehash' is not allowed by host, we also fail to create guest.
>>>>
>>>> We can only allow creating guest with 'lewr' specified when host allows
>>>> anything.
>>>>
>>>> But in this way, we are restricting guest OS's ability to run LE, as only
>>>> one LE, that is specified by 'lehash' parameter, can be run. But I think
>>>> this won't hurt much, as multiple guests still are able to run different
>>>> LEs?
>>>>
>>>> Also, the only way to fail an MSR
>>>>>
>>>>> write is to send #GP, and Windows (and even Linux) may not expect
>>>>> that.  Linux doesn't panic due to #GP on MSR writes these days, but
>>>>> you still get a big fat warning.  I wouldn't be at all surprised if
>>>>> Windows BSODs.
>>>>
>>>>
>>>> We cannot allow writing some particular value to MSRs successfully, while
>>>> injecting #GP when writing other values to the same MSRs. So #GP is not
>>>> an option.
>>>>
>>>> ENCLS[EINIT], on the other hand, returns an actual
>>>>>
>>>>> error code.  I'm not sure that a sensible error code exists
>>>>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
>>>>
>>>>
>>>> It looks like no such error code exists. And we cannot return such an
>>>> error code to the guest, as it is only supposed to be valid when ENCLS
>>>> is run in the hypervisor.
>>>>
>>>> but SGX_INVALID_EINITTOKEN seems
>>>>>
>>>>> to mean, more or less, "the CPU thinks you're not authorized to do
>>>>> this", so forcing that error code could be entirely reasonable.
>>>>>
>>>>> If the host policy is to allow a list of LE signers, you could return
>>>>> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
>>>>> the list.
>>>>
>>>>
>>>> But this would be inconsistent with HW behavior. If the hash value in
>>>> the guest's IA32_SGXLEPUBKEYHASHn matches the one passed to EINIT, EINIT
>>>> is not supposed to return SGX_INVALID_EINITTOKEN.
>>>>
>>>> I think from VMM's perspective, emulating HW behavior to be consistent
>>>> with real HW behavior is very important.
>>>>
>>>> Paolo, would you provide your comments?
>>>
>>>
>>> Hi all,
>>>
>>> This has been quiet for a while and I'd like to restart the discussion.
>>> Jarkko told me that currently he only supports one LE in SGX driver, but I
>>> am not sure whether he is going to extend in the future or not. I think this
>>> might also depend on requirements from customers.
>>>
>>> Andy,
>>>
>>> If we only support one LE in driver, then we can only support the same LE
>>> for all KVM guests, according to your comments that host kernel launch
>>> control policy should also apply to KVM guests? Would you comment more?
>>
>> I'm not at all convinced that, going forward, Linux's host-side launch
>> control policy will be entirely contained in the LE.  I'm also not
>> convinced that non-Linux guests will function at all under this type
>> of policy -- what if FreeBSD's LE is different for whatever reason?
>
>I am not convinced either. I think we need Jarkko to elaborate on how
>the host-side launch control policy is implemented, or whether there is
>any policy at all. I also tried to read the SGX driver code, but it
>looks like I couldn't find any implementation of this.
>
>Hi Jarkko,
>
>Can you elaborate on this?
>
>Thanks,
>-Kai

I don't think the kernel's LE policy is relevant here.
As long as you allow the guest OS to set the IA32_SGXLEPUBKEYHASHn MSRs, either directly or via the VMM 'lehash' value, you don't need support for more than one LE in the kernel.
The host kernel will have one LE, and the guest kernel will have another "one" LE that may have a different hash value.
I agree with Andy that different OS distributions are likely to have different LE hash values, so the guest OS may require a different hash setting.
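
(For example, with the proposed Qemu syntax quoted above, a guest whose
distro signs its LE with a different key could be started along these
lines; the EPC size and the hash are placeholders, not real values:)

        qemu-system-x86_64 ... \
                -sgx epc=64M,lehash='<SHA-256 of the guest LE signer key>',lewr

(KVM/Qemu would then expose that hash as the guest's initial
IA32_SGXLEPUBKEYHASHn values, independently of the host's own LE.)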

>>
>>>
>>> Jarkko,
>>>
>>> Could you help to clarify the whole launch control policy in host side so
>>> that we can have a better understanding together?
>>>
>>> Thanks,
>>> -Kai
>>>
>>>>
>>>>>
>>>>> --Andy
>>>>>
>>>
>>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-06 20:52                 ` Huang, Kai
  2017-06-06 21:22                   ` Andy Lutomirski
@ 2017-06-08 12:31                   ` Jarkko Sakkinen
  2017-06-08 23:47                     ` Huang, Kai
  1 sibling, 1 reply; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-08 12:31 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Wed, Jun 07, 2017 at 08:52:42AM +1200, Huang, Kai wrote:
> 
> 
> On 5/18/2017 7:45 PM, Huang, Kai wrote:
> > 
> > 
> > On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
> > > On Mon, May 15, 2017 at 5:48 PM, Huang, Kai
> > > <kai.huang@linux.intel.com> wrote:
> > > > 
> > > > 
> > > > On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
> > > > > 
> > > > > On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
> > > > > wrote:
> > > > > > 
> > > > > > I am not sure whether the cost of writing to 4 MSRs
> > > > > > would be *extremely*
> > > > > > slow, as when a vcpu is scheduled in, KVM is already doing vmcs_load,
> > > > > > writing
> > > > > > to several MSRs, etc.
> > > > > 
> > > > > 
> > > > > I'm speculating that these MSRs may be rather unoptimized and hence
> > > > > unusually slow.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> > > > > > > with whatever lock is needed (probably just a mutex).  Users of EINIT
> > > > > > > will take the mutex, compare the percpu variable to
> > > > > > > the desired value,
> > > > > > > and, if it's different, do WRMSR and update the percpu variable.
> > > > > > > 
> > > > > > > KVM will implement writes to SGXLEPUBKEYHASH by
> > > > > > > updating its in-memory
> > > > > > > state but *not* changing the MSRs.  KVM will trap
> > > > > > > and emulate EINIT to
> > > > > > > support the same handling as the host.  There is no
> > > > > > > action required at
> > > > > > > all on KVM guest entry and exit.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > This is doable, but SGX driver needs to do those things and expose
> > > > > > interfaces for KVM to use. In terms of the percpu data, it is nice to
> > > > > > have,
> > > > > > but I am not sure whether it is mandatory, as IMO EINIT is not even in
> > > > > > performance critical path. We can simply read old value
> > > > > > from MSRs out and
> > > > > > compare whether the old equals to the new.
> > > > > 
> > > > > 
> > > > > I think the SGX driver should probably live in arch/x86, and the
> > > > > interface could be a simple percpu variable that is exported (from the
> > > > > main kernel image, not from a module).
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > FWIW, I think that KVM will, in the long run, want to trap EINIT for
> > > > > > > other reasons: someone is going to want to implement policy for what
> > > > > > > enclaves are allowed that applies to guests as well as the host.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > I am not very convinced why "what enclaves are allowed" in host would
> > > > > > apply
> > > > > > to guest. Can you elaborate? I mean in general
> > > > > > virtualization just focus
> > > > > > emulating hardware behavior. If a native machine is able
> > > > > > to run any LE,
> > > > > > the
> > > > > > virtual machine should be able to as well (of course, with guest's
> > > > > > IA32_FEATURE_CONTROL[bit 17] set).
> > > > > 
> > > > > 
> > > > > I strongly disagree.  I can imagine two classes of sensible policies
> > > > > for launch control:
> > > > > 
> > > > > 1. Allow everything.  This seems quite sensible to me.
> > > > > 
> > > > > 2. Allow some things, and make sure that VMs have at least as
> > > > > restrictive a policy as host root has.  After all, what's the point of
> > > > > restricting enclaves in the host if host code can simply spawn a
> > > > > little VM to run otherwise-disallowed enclaves?
> > > > 
> > > > 
> > > > What's the current SGX driver launch control policy? Yes allow
> > > > everything
> > > > works for KVM so lets skip this. Are we going to support
> > > > allowing several
> > > > LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
> > > > stuff but I don't know the details.
> > > > 
> > > > I am trying to find a way that we can both not break host launch control
> > > > policy, and be consistent to HW behavior (from guest's view).
> > > > Currently we
> > > > can create a KVM guest with runtime change to
> > > > IA32_SGXLEPUBKEYHASHn either
> > > > enabled or disabled. I introduced a Qemu parameter 'lewr' for
> > > > this purpose.
> > > > Actually I introduced below Qemu SGX parameters for creating guest:
> > > > 
> > > >         -sgx epc=<size>,lehash='SHA-256 hash',lewr
> > > > 
> > > > where 'epc' specifies guest's EPC size, lehash specifies
> > > > (initial) value of
> > > > guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether
> > > > guest is allowed
> > > > to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
> > > > 
> > > > If host only allows one single LE to run, KVM can add a restriction
> > > > that only
> > > > allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
> > > > disabled, so that only host allowed (single) hash can be used by
> > > > guest. From
> > > > guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
> > > > IA32_SGXLEPUBKEYHASHn with default value to be host allowed
> > > > (single) hash.
> > > > 
> > > > If host allows several LEs (but not everything), and if we
> > > > create guest with
> > > > 'lewr', then the behavior is not consistent with HW behavior, as from
> > > > guest's hardware's point of view, we can actually run any LE but
> > > > we have to
> > > > tell guest that you are only allowed to change
> > > > IA32_SGXLEPUBKEYHASHn to some
> > > > specific values. One compromise solution is we don't allow to
> > > > create guest
> > > > with 'lewr' specified, and at the same time, only allow to
> > > > guest with
> > > > host approved hashes specified in 'lehash'. This will make
> > > > guest's behavior
> > > > consistent to HW behavior but only allows guest to run one LE (which is
> > > > specified by 'lehash' when guest is created).
> > > 
> > > I'm not sure I entirely agree for a couple reasons.
> > > 
> > > 1. I wouldn't be surprised if the kernel ends up implementing a policy
> > > in which it checks all enclaves (not just LEs) for acceptability.  In
> > > fact, if the kernel sticks with the "no LE at all or just
> > > kernel-internal LE", then checking enclaves directly against some
> > > admin- or distro-provided signer list seems reasonable.  This type of
> > > policy can't be forwarded to a guest by restricting allowed LE
> > > signers.  But this is mostly speculation since AFAIK no one has
> > > seriously proposed any particular policy support and the plan was to
> > > not have this for the initial implementation.
> > > 
> > > 2. While matching hardware behavior is nice in principle, there
> > > doesn't seem to be useful hardware behavior to match here.  If the
> > > host had a list of five allowed LE signers, how exactly would it
> > > restrict the MSRs?  They're not written atomically, so you can't
> > > directly tell what's being written.
> > 
> > In this case I actually plan to just allow creating guest with guest's
> > IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified). If 'lewr' is
> > specified, creating guest will fail. And we only allow creating guest
> > with host allowed hash values (with 'lehash=hash-value'), and if
> > 'hash-value' specified by 'lehash' is not allowed by host, we also fail
> > to create guest.
> > 
> > We can only allow creating guest with 'lewr' specified when host allows
> > anything.
> > 
> > But in this way, we are restricting guest OS's ability to run LE, as
> > only one LE, that is specified by 'lehash' parameter, can be run. But I
> > think this won't hurt much, as multiple guests still are able to run
> > different LEs?
> > 
> > Also, the only way to fail an MSR
> > > write is to send #GP, and Windows (and even Linux) may not expect
> > > that.  Linux doesn't panic due to #GP on MSR writes these days, but
> > > you still get a big fat warning.  I wouldn't be at all surprised if
> > > Windows BSODs.
> > 
> > We cannot allow writing some particular value to MSRs successfully,
> > while injecting #GP when writing other values to the same MSRs. So #GP
> > is not option.
> > 
> > ENCLS[EINIT], on the other hand, returns an actual
> > > error code.  I'm not sure that a sensible error code exists
> > > ("SGX_HYPERVISOR_SAID_NO?", perhaps),
> > 
> > Looks no such error code exists. And we cannot return such error code to
> > guest as such error code is only supposed to be valid when ENCLS is run
> > in hypervisor.
> > 
> > but SGX_INVALID_EINITTOKEN seems
> > > to mean, more or less, "the CPU thinks you're not authorized to do
> > > this", so forcing that error code could be entirely reasonable.
> > > 
> > > If the host policy is to allow a list of LE signers, you could return
> > > SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
> > > the list.
> > 
> > But this would be inconsistent with HW behavior. If the hash value in
> > guest's IA32_SGXLEPUBKEYHASHn is matched with the one passed by EINIT,
> > EINIT is not supposed to return SGX_INVALID_EINITTOKEN.
> > 
> > I think from VMM's perspective, emulating HW behavior to be consistent
> > with real HW behavior is very important.
> > 
> > Paolo, would you provide your comments?
> 
> Hi all,
> 
> This has been quite for a while and I'd like to start discussion again.
> Jarkko told me that currently he only supports one LE in SGX driver, but I
> am not sure whether he is going to extend in the future or not. I think this
> might also depend on requirements from customers.
> 
> Andy,
> 
> If we only support one LE in driver, then we can only support the same LE
> for all KVM guests, according to your comments that host kernel launch
> control policy should also apply to KVM guests? WOuld you comments more?
> 
> Jarkko,
> 
> Could you help to clarify the whole launch control policy in host side so
> that we can have a better understanding together?
> 
> Thanks,
> -Kai

So, I have a pass-through LE. It creates an EINITTOKEN for anything.
Couldn't the VMM keep virtual values for the MSRs and ask the host-side LE
to create a token when it needs one?

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-08 12:31                   ` Jarkko Sakkinen
@ 2017-06-08 23:47                     ` Huang, Kai
  2017-06-08 23:53                       ` Andy Lutomirski
  2017-06-10 12:23                       ` Jarkko Sakkinen
  0 siblings, 2 replies; 78+ messages in thread
From: Huang, Kai @ 2017-06-08 23:47 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/9/2017 12:31 AM, Jarkko Sakkinen wrote:
> On Wed, Jun 07, 2017 at 08:52:42AM +1200, Huang, Kai wrote:
>>
>>
>> On 5/18/2017 7:45 PM, Huang, Kai wrote:
>>>
>>>
>>> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>>>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai
>>>> <kai.huang@linux.intel.com> wrote:
>>>>>
>>>>>
>>>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I am not sure whether the cost of writing to 4 MSRs
>>>>>>> would be *extremely*
>>>>>>> slow, as when vcpu is schedule in, KVM is already doing vmcs_load,
>>>>>>> writing
>>>>>>> to several MSRs, etc.
>>>>>>
>>>>>>
>>>>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>>>>> unusualy slow.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH along
>>>>>>>> with whatever lock is needed (probably just a mutex).  Users of EINIT
>>>>>>>> will take the mutex, compare the percpu variable to
>>>>>>>> the desired value,
>>>>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>>>>
>>>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by
>>>>>>>> updating its in-memory
>>>>>>>> state but *not* changing the MSRs.  KVM will trap
>>>>>>>> and emulate EINIT to
>>>>>>>> support the same handling as the host.  There is no
>>>>>>>> action required at
>>>>>>>> all on KVM guest entry and exit.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is doable, but SGX driver needs to do those things and expose
>>>>>>> interfaces for KVM to use. In terms of the percpu data, it is nice to
>>>>>>> have,
>>>>>>> but I am not sure whether it is mandatory, as IMO EINIT is not even in
>>>>>>> performance critical path. We can simply read old value
>>>>>>> from MSRs out and
>>>>>>> compare whether the old equals to the new.
>>>>>>
>>>>>>
>>>>>> I think the SGX driver should probably live in arch/x86, and the
>>>>>> interface could be a simple percpu variable that is exported (from the
>>>>>> main kernel image, not from a module).
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT for
>>>>>>>> other reasons: someone is going to want to implement policy for what
>>>>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am not very convinced why "what enclaves are allowed" in host would
>>>>>>> apply
>>>>>>> to guest. Can you elaborate? I mean in general
>>>>>>> virtualization just focus
>>>>>>> emulating hardware behavior. If a native machine is able
>>>>>>> to run any LE,
>>>>>>> the
>>>>>>> virtual machine should be able to as well (of course, with guest's
>>>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>>>
>>>>>>
>>>>>> I strongly disagree.  I can imagine two classes of sensible policies
>>>>>> for launch control:
>>>>>>
>>>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>>>
>>>>>> 2. Allow some things, and make sure that VMs have at least as
>>>>>> restrictive a policy as host root has.  After all, what's the point of
>>>>>> restricting enclaves in the host if host code can simply spawn a
>>>>>> little VM to run otherwise-disallowed enclaves?
>>>>>
>>>>>
>>>>> What's the current SGX driver launch control policy? Yes allow
>>>>> everything
>>>>> works for KVM so lets skip this. Are we going to support
>>>>> allowing several
>>>>> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel LE
>>>>> staff but I don't know details.
>>>>>
>>>>> I am trying to find a way that we can both not break host launch control
>>>>> policy, and be consistent to HW behavior (from guest's view).
>>>>> Currently we
>>>>> can create a KVM guest with runtime change to
>>>>> IA32_SGXLEPUBKEYHASHn either
>>>>> enabled or disabled. I introduced an Qemu parameter 'lewr' for
>>>>> this purpose.
>>>>> Actually I introduced below Qemu SGX parameters for creating guest:
>>>>>
>>>>>          -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>>>
>>>>> where 'epc' specifies guest's EPC size, lehash specifies
>>>>> (initial) value of
>>>>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether
>>>>> guest is allowed
>>>>> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>>>>
>>>>> If host only allows one single LE to run, KVM can add a restrict
>>>>> that only
>>>>> allows to create KVM guest with runtime change to IA32_SGXLEPUBKEYHASHn
>>>>> disabled, so that only host allowed (single) hash can be used by
>>>>> guest. From
>>>>> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and has
>>>>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed
>>>>> (single) hash.
>>>>>
>>>>> If host allows several LEs (not but everything), and if we
>>>>> create guest with
>>>>> 'lewr', then the behavior is not consistent with HW behavior, as from
>>>>> guest's hardware's point of view, we can actually run any LE but
>>>>> we have to
>>>>> tell guest that you are only allowed to change
>>>>> IA32_SGXLEPUBKEYHASHn to some
>>>>> specific values. One compromise solution is we don't allow to
>>>>> create guest
>>>>> with 'lewr' specified, and at the meantime, only allow to create
>>>>> guest with
>>>>> host approved hashes specified in 'lehash'. This will make
>>>>> guest's behavior
>>>>> consistent to HW behavior but only allows guest to run one LE (which is
>>>>> specified by 'lehash' when guest is created).
>>>>
>>>> I'm not sure I entirely agree for a couple reasons.
>>>>
>>>> 1. I wouldn't be surprised if the kernel ends up implementing a policy
>>>> in which it checks all enclaves (not just LEs) for acceptability.  In
>>>> fact, if the kernel sticks with the "no LE at all or just
>>>> kernel-internal LE", then checking enclaves directly against some
>>>> admin- or distro-provided signer list seems reasonable.  This type of
>>>> policy can't be forwarded to a guest by restricting allowed LE
>>>> signers.  But this is mostly speculation since AFAIK no one has
>>>> seriously proposed any particular policy support and the plan was to
>>>> not have this for the initial implementation.
>>>>
>>>> 2. While matching hardware behavior is nice in principle, there
>>>> doesn't seem to be useful hardware behavior to match here.  If the
>>>> host had a list of five allowed LE signers, how exactly would it
>>>> restrict the MSRs?  They're not written atomically, so you can't
>>>> directly tell what's being written.
>>>
>>> In this case I actually plan to just allow creating guest with guest's
>>> IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified). If 'lewr' is
>>> specified, creating guest will fail. And we only allow creating guest
>>> with host allowed hash values (with 'lehash=hash-value'), and if
>>> 'hash-value' specified by 'lehash' is not allowed by host, we also fail
>>> to create guest.
>>>
>>> We can only allow creating guest with 'lewr' specified when host allows
>>> anything.
>>>
>>> But in this way, we are restricting guest OS's ability to run LE, as
>>> only one LE, that is specified by 'lehash' parameter, can be run. But I
>>> think this won't hurt much, as multiple guests still are able to run
>>> different LEs?
>>>
>>> Also, the only way to fail an MSR
>>>> write is to send #GP, and Windows (and even Linux) may not expect
>>>> that.  Linux doesn't panic due to #GP on MSR writes these days, but
>>>> you still get a big fat warning.  I wouldn't be at all surprised if
>>>> Windows BSODs.
>>>
>>> We cannot allow writing some particular value to MSRs successfully,
>>> while injecting #GP when writing other values to the same MSRs. So #GP
>>> is not option.
>>>
>>> ENCLS[EINIT], on the other hand, returns an actual
>>>> error code.  I'm not sure that a sensible error code exists
>>>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
>>>
>>> Looks no such error code exists. And we cannot return such error code to
>>> guest as such error code is only supposed to be valid when ENCLS is run
>>> in hypervisor.
>>>
>>> but SGX_INVALID_EINITTOKEN seems
>>>> to mean, more or less, "the CPU thinks you're not authorized to do
>>>> this", so forcing that error code could be entirely reasonable.
>>>>
>>>> If the host policy is to allow a list of LE signers, you could return
>>>> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
>>>> the list.
>>>
>>> But this would be inconsistent with HW behavior. If the hash value in
>>> guest's IA32_SGXLEPUBKEYHASHn is matched with the one passed by EINIT,
>>> EINIT is not supposed to return SGX_INVALID_EINITTOKEN.
>>>
>>> I think from VMM's perspective, emulating HW behavior to be consistent
>>> with real HW behavior is very important.
>>>
>>> Paolo, would you provide your comments?
>>
>> Hi all,
>>
>> This has been quite for a while and I'd like to start discussion again.
>> Jarkko told me that currently he only supports one LE in SGX driver, but I
>> am not sure whether he is going to extend in the future or not. I think this
>> might also depend on requirements from customers.
>>
>> Andy,
>>
>> If we only support one LE in driver, then we can only support the same LE
>> for all KVM guests, according to your comments that host kernel launch
>> control policy should also apply to KVM guests? WOuld you comments more?
>>
>> Jarkko,
>>
>> Could you help to clarify the whole launch control policy in host side so
>> that we can have a better understanding together?
>>
>> Thanks,
>> -Kai
> 
> So. I have pass through LE. It creates EINITTOKEN for anything. Couldn't
> VMM keep virtual values for MSRs and ask host side LE create token when
> it needs to?

Hi Jarkko,

Thanks for replying. The VMM doesn't need the driver to generate an 
EINITTOKEN; the EINITTOKEN comes from the guest too, obtained when the VMM 
traps EINIT.

I think Andy's comment is: if the host SGX driver only allows certain 
parties' LEs to run (e.g. LEs from Intel, Redhat, etc.), then the SGX 
driver should govern LEs from KVM guests as well, by checking the SIGSTRUCT 
from the KVM guest (which KVM obtains by trapping the guest's EINIT).

In my understanding, although you only allow one LE in the kernel, you 
don't limit whose LE can be run (basically the kernel can run an LE signed 
by anyone, just only one LE while the kernel is running), so I don't see 
any limitation on KVM guests here.

But it may still be better if the SGX driver can provide a function like:

     int sgx_validate_sigstruct(struct sigstruct *sig);

for KVM to call, in case the driver is changed (e.g. to only allow LEs from 
some particular signers to run), but this is not necessary now. The KVM 
changes can be done later when the driver makes that change.
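
Just to illustrate what I have in mind, here is a rough sketch of how KVM's 
EINIT trap path could use such a hook (everything except 
sgx_validate_sigstruct() is a made-up name for the sketch, not existing 
code):

	/*
	 * Hypothetical sketch: KVM's EINIT trap handler asks the host
	 * driver whether the guest's SIGSTRUCT is approved before running
	 * EINIT on the guest's behalf.
	 */
	static int handle_guest_einit(struct kvm_vcpu *vcpu,
				      struct sigstruct *sigstruct,
				      struct einittoken *token)
	{
		int ret;

		ret = sgx_validate_sigstruct(sigstruct);
		if (ret)
			/* Signer not approved by the host: report an error
			 * (e.g. SGX_INVALID_EINITTOKEN) to the guest. */
			return sgx_inject_einit_error(vcpu, ret);

		/* Approved: run EINIT for the guest and inject whatever
		 * result the hardware returns. */
		return kvm_run_einit(vcpu, sigstruct, token);
	}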

Andy,

Am I understanding correctly? Does this make sense to you?

Thanks,
-Kai

> 
> /Jarkko
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-08 23:47                     ` Huang, Kai
@ 2017-06-08 23:53                       ` Andy Lutomirski
  2017-06-09 15:38                         ` Cohen, Haim
  2017-06-10 12:23                       ` Jarkko Sakkinen
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-08 23:53 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Jarkko Sakkinen, Andy Lutomirski, Kai Huang, Paolo Bonzini,
	Radim Krcmar, kvm list, intel-sgx-kernel-dev, haim.cohen

On Thu, Jun 8, 2017 at 4:47 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 6/9/2017 12:31 AM, Jarkko Sakkinen wrote:
>>
>> On Wed, Jun 07, 2017 at 08:52:42AM +1200, Huang, Kai wrote:
>>>
>>>
>>>
>>> On 5/18/2017 7:45 PM, Huang, Kai wrote:
>>>>
>>>>
>>>>
>>>> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>>>>>
>>>>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai
>>>>> <kai.huang@linux.intel.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai
>>>>>>> <kai.huang@linux.intel.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure whether the cost of writing to 4 MSRs
>>>>>>>> would be *extremely*
>>>>>>>> slow, as when vcpu is schedule in, KVM is already doing vmcs_load,
>>>>>>>> writing
>>>>>>>> to several MSRs, etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm speculating that these MSRs may be rather unoptimized and hence
>>>>>>> unusualy slow.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH
>>>>>>>>> along
>>>>>>>>> with whatever lock is needed (probably just a mutex).  Users of
>>>>>>>>> EINIT
>>>>>>>>> will take the mutex, compare the percpu variable to
>>>>>>>>> the desired value,
>>>>>>>>> and, if it's different, do WRMSR and update the percpu variable.
>>>>>>>>>
>>>>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by
>>>>>>>>> updating its in-memory
>>>>>>>>> state but *not* changing the MSRs.  KVM will trap
>>>>>>>>> and emulate EINIT to
>>>>>>>>> support the same handling as the host.  There is no
>>>>>>>>> action required at
>>>>>>>>> all on KVM guest entry and exit.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This is doable, but SGX driver needs to do those things and expose
>>>>>>>> interfaces for KVM to use. In terms of the percpu data, it is nice
>>>>>>>> to
>>>>>>>> have,
>>>>>>>> but I am not sure whether it is mandatory, as IMO EINIT is not even
>>>>>>>> in
>>>>>>>> performance critical path. We can simply read old value
>>>>>>>> from MSRs out and
>>>>>>>> compare whether the old equals to the new.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think the SGX driver should probably live in arch/x86, and the
>>>>>>> interface could be a simple percpu variable that is exported (from
>>>>>>> the
>>>>>>> main kernel image, not from a module).
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> FWIW, I think that KVM will, in the long run, want to trap EINIT
>>>>>>>>> for
>>>>>>>>> other reasons: someone is going to want to implement policy for
>>>>>>>>> what
>>>>>>>>> enclaves are allowed that applies to guests as well as the host.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not very convinced why "what enclaves are allowed" in host
>>>>>>>> would
>>>>>>>> apply
>>>>>>>> to guest. Can you elaborate? I mean in general
>>>>>>>> virtualization just focus
>>>>>>>> emulating hardware behavior. If a native machine is able
>>>>>>>> to run any LE,
>>>>>>>> the
>>>>>>>> virtual machine should be able to as well (of course, with guest's
>>>>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I strongly disagree.  I can imagine two classes of sensible policies
>>>>>>> for launch control:
>>>>>>>
>>>>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>>>>
>>>>>>> 2. Allow some things, and make sure that VMs have at least as
>>>>>>> restrictive a policy as host root has.  After all, what's the point
>>>>>>> of
>>>>>>> restricting enclaves in the host if host code can simply spawn a
>>>>>>> little VM to run otherwise-disallowed enclaves?
>>>>>>
>>>>>>
>>>>>>
>>>>>> What's the current SGX driver launch control policy? Yes allow
>>>>>> everything
>>>>>> works for KVM so lets skip this. Are we going to support
>>>>>> allowing several
>>>>>> LEs, or just allowing one single LE? I know Jarkko is doing in-kernel
>>>>>> LE
>>>>>> staff but I don't know details.
>>>>>>
>>>>>> I am trying to find a way that we can both not break host launch
>>>>>> control
>>>>>> policy, and be consistent to HW behavior (from guest's view).
>>>>>> Currently we
>>>>>> can create a KVM guest with runtime change to
>>>>>> IA32_SGXLEPUBKEYHASHn either
>>>>>> enabled or disabled. I introduced an Qemu parameter 'lewr' for
>>>>>> this purpose.
>>>>>> Actually I introduced below Qemu SGX parameters for creating guest:
>>>>>>
>>>>>>          -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>>>>
>>>>>> where 'epc' specifies guest's EPC size, lehash specifies
>>>>>> (initial) value of
>>>>>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether
>>>>>> guest is allowed
>>>>>> to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>>>>>
>>>>>> If host only allows one single LE to run, KVM can add a restrict
>>>>>> that only
>>>>>> allows to create KVM guest with runtime change to
>>>>>> IA32_SGXLEPUBKEYHASHn
>>>>>> disabled, so that only host allowed (single) hash can be used by
>>>>>> guest. From
>>>>>> guest's view, it simply has IA32_FEATURE_CONTROL[bit17] cleared and
>>>>>> has
>>>>>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed
>>>>>> (single) hash.
>>>>>>
>>>>>> If host allows several LEs (not but everything), and if we
>>>>>> create guest with
>>>>>> 'lewr', then the behavior is not consistent with HW behavior, as from
>>>>>> guest's hardware's point of view, we can actually run any LE but
>>>>>> we have to
>>>>>> tell guest that you are only allowed to change
>>>>>> IA32_SGXLEPUBKEYHASHn to some
>>>>>> specific values. One compromise solution is we don't allow to
>>>>>> create guest
>>>>>> with 'lewr' specified, and at the meantime, only allow to create
>>>>>> guest with
>>>>>> host approved hashes specified in 'lehash'. This will make
>>>>>> guest's behavior
>>>>>> consistent to HW behavior but only allows guest to run one LE (which
>>>>>> is
>>>>>> specified by 'lehash' when guest is created).
>>>>>
>>>>>
>>>>> I'm not sure I entirely agree for a couple reasons.
>>>>>
>>>>> 1. I wouldn't be surprised if the kernel ends up implementing a policy
>>>>> in which it checks all enclaves (not just LEs) for acceptability.  In
>>>>> fact, if the kernel sticks with the "no LE at all or just
>>>>> kernel-internal LE", then checking enclaves directly against some
>>>>> admin- or distro-provided signer list seems reasonable.  This type of
>>>>> policy can't be forwarded to a guest by restricting allowed LE
>>>>> signers.  But this is mostly speculation since AFAIK no one has
>>>>> seriously proposed any particular policy support and the plan was to
>>>>> not have this for the initial implementation.
>>>>>
>>>>> 2. While matching hardware behavior is nice in principle, there
>>>>> doesn't seem to be useful hardware behavior to match here.  If the
>>>>> host had a list of five allowed LE signers, how exactly would it
>>>>> restrict the MSRs?  They're not written atomically, so you can't
>>>>> directly tell what's being written.
>>>>
>>>>
>>>> In this case I actually plan to just allow creating guest with guest's
>>>> IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified). If 'lewr' is
>>>> specified, creating guest will fail. And we only allow creating guest
>>>> with host allowed hash values (with 'lehash=hash-value'), and if
>>>> 'hash-value' specified by 'lehash' is not allowed by host, we also fail
>>>> to create guest.
>>>>
>>>> We can only allow creating guest with 'lewr' specified when host allows
>>>> anything.
>>>>
>>>> But in this way, we are restricting guest OS's ability to run LE, as
>>>> only one LE, that is specified by 'lehash' parameter, can be run. But I
>>>> think this won't hurt much, as multiple guests still are able to run
>>>> different LEs?
>>>>
>>>> Also, the only way to fail an MSR
>>>>>
>>>>> write is to send #GP, and Windows (and even Linux) may not expect
>>>>> that.  Linux doesn't panic due to #GP on MSR writes these days, but
>>>>> you still get a big fat warning.  I wouldn't be at all surprised if
>>>>> Windows BSODs.
>>>>
>>>>
>>>> We cannot allow writing some particular value to MSRs successfully,
>>>> while injecting #GP when writing other values to the same MSRs. So #GP
>>>> is not option.
>>>>
>>>> ENCLS[EINIT], on the other hand, returns an actual
>>>>>
>>>>> error code.  I'm not sure that a sensible error code exists
>>>>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
>>>>
>>>>
>>>> Looks no such error code exists. And we cannot return such error code to
>>>> guest as such error code is only supposed to be valid when ENCLS is run
>>>> in hypervisor.
>>>>
>>>> but SGX_INVALID_EINITTOKEN seems
>>>>>
>>>>> to mean, more or less, "the CPU thinks you're not authorized to do
>>>>> this", so forcing that error code could be entirely reasonable.
>>>>>
>>>>> If the host policy is to allow a list of LE signers, you could return
>>>>> SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE that isn't in
>>>>> the list.
>>>>
>>>>
>>>> But this would be inconsistent with HW behavior. If the hash value in
>>>> guest's IA32_SGXLEPUBKEYHASHn is matched with the one passed by EINIT,
>>>> EINIT is not supposed to return SGX_INVALID_EINITTOKEN.
>>>>
>>>> I think from VMM's perspective, emulating HW behavior to be consistent
>>>> with real HW behavior is very important.
>>>>
>>>> Paolo, would you provide your comments?
>>>
>>>
>>> Hi all,
>>>
>>> This has been quite for a while and I'd like to start discussion again.
>>> Jarkko told me that currently he only supports one LE in SGX driver, but
>>> I
>>> am not sure whether he is going to extend in the future or not. I think
>>> this
>>> might also depend on requirements from customers.
>>>
>>> Andy,
>>>
>>> If we only support one LE in driver, then we can only support the same LE
>>> for all KVM guests, according to your comments that host kernel launch
>>> control policy should also apply to KVM guests? WOuld you comments more?
>>>
>>> Jarkko,
>>>
>>> Could you help to clarify the whole launch control policy in host side so
>>> that we can have a better understanding together?
>>>
>>> Thanks,
>>> -Kai
>>
>>
>> So. I have pass through LE. It creates EINITTOKEN for anything. Couldn't
>> VMM keep virtual values for MSRs and ask host side LE create token when
>> it needs to?
>
>
> Hi Jarkko,
>
> Thanks for replying. VMM doesn't need driver to generate EINITTOKEN. The
> EINITTOKEN is from guest too, upon VMM traps EINIT.
>
> I think Andy's comments is, if host SGX driver only allows someone's LE to
> run (ex, LE from Intel, Redhat, etc...), then SGX driver should also govern
> LEs from KVM guests as well, by checking SIGSTRUCT from KVM guest (the
> SIGSTRUCT is provided by KVM by trapping guest's EINIT).
>
> In my understanding, although you only allows one LE in kernel, but you
> won't limit who's LE can be run (basically kernel can run LE signed by
> anyone, but just one LE when kernel is running), so I don't see there is any
> limitation to KVM guests here.
>
> But it may still be better if SGX driver can provide function like:
>
>     int sgx_validate_sigstruct(struct sigstruct *sig);
>
> for KVM to call, in case driver is changed (ex, to only allows LEs from some
> particular ones to run), but this is not necessary now. KVM changes can be
> done later when driver make the changes.
>
> Andy,
>
> Am I understanding correctly? Does this make sense to you?

My understanding is that the kernel will not (at least initially)
allow users to supply an LE.  The kernel will handle launches all by
itself.  How it does so is an implementation detail, and it seems to
me that it will cause compatibility issues if guests have to use the
host's LE.

sgx_validate_sigstruct(...) sounds like it could be a good approach to me.

>
> Thanks,
> -Kai
>
>>
>> /Jarkko
>>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-08 23:53                       ` Andy Lutomirski
@ 2017-06-09 15:38                         ` Cohen, Haim
  0 siblings, 0 replies; 78+ messages in thread
From: Cohen, Haim @ 2017-06-09 15:38 UTC (permalink / raw)
  To: Andy Lutomirski, Huang, Kai
  Cc: Jarkko Sakkinen, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, Cohen, Haim

On 6/8/2017 7:53 PM, Andy Lutomirski wrote:
>
>On Thu, Jun 8, 2017 at 4:47 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>
>>
>> On 6/9/2017 12:31 AM, Jarkko Sakkinen wrote:
>>>
>>> On Wed, Jun 07, 2017 at 08:52:42AM +1200, Huang, Kai wrote:
>>>>
>>>>
>>>>
>>>> On 5/18/2017 7:45 PM, Huang, Kai wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 5/17/2017 12:09 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> On Mon, May 15, 2017 at 5:48 PM, Huang, Kai
>>>>>> <kai.huang@linux.intel.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 5/12/2017 6:11 PM, Andy Lutomirski wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai
>>>>>>>> <kai.huang@linux.intel.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure whether the cost of writing to 4 MSRs would be
>>>>>>>>> *extremely* slow, as when vcpu is schedule in, KVM is already
>>>>>>>>> doing vmcs_load, writing to several MSRs, etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm speculating that these MSRs may be rather unoptimized and
>>>>>>>> hence unusualy slow.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Have a percpu variable that stores the current SGXLEPUBKEYHASH
>>>>>>>>>> along with whatever lock is needed (probably just a mutex).
>>>>>>>>>> Users of EINIT will take the mutex, compare the percpu
>>>>>>>>>> variable to the desired value, and, if it's different, do
>>>>>>>>>> WRMSR and update the percpu variable.
>>>>>>>>>>
>>>>>>>>>> KVM will implement writes to SGXLEPUBKEYHASH by updating its
>>>>>>>>>> in-memory state but *not* changing the MSRs.  KVM will trap
>>>>>>>>>> and emulate EINIT to support the same handling as the host.
>>>>>>>>>> There is no action required at all on KVM guest entry and
>>>>>>>>>> exit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is doable, but SGX driver needs to do those things and
>>>>>>>>> expose interfaces for KVM to use. In terms of the percpu data,
>>>>>>>>> it is nice to have, but I am not sure whether it is mandatory,
>>>>>>>>> as IMO EINIT is not even in performance critical path. We can
>>>>>>>>> simply read old value from MSRs out and compare whether the old
>>>>>>>>> equals to the new.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think the SGX driver should probably live in arch/x86, and the
>>>>>>>> interface could be a simple percpu variable that is exported
>>>>>>>> (from the main kernel image, not from a module).
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> FWIW, I think that KVM will, in the long run, want to trap
>>>>>>>>>> EINIT for other reasons: someone is going to want to implement
>>>>>>>>>> policy for what enclaves are allowed that applies to guests as
>>>>>>>>>> well as the host.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not very convinced why "what enclaves are allowed" in host
>>>>>>>>> would apply to guest. Can you elaborate? I mean in general
>>>>>>>>> virtualization just focus emulating hardware behavior. If a
>>>>>>>>> native machine is able to run any LE, the virtual machine
>>>>>>>>> should be able to as well (of course, with guest's
>>>>>>>>> IA32_FEATURE_CONTROL[bit 17] set).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I strongly disagree.  I can imagine two classes of sensible
>>>>>>>> policies for launch control:
>>>>>>>>
>>>>>>>> 1. Allow everything.  This seems quite sensible to me.
>>>>>>>>
>>>>>>>> 2. Allow some things, and make sure that VMs have at least as
>>>>>>>> restrictive a policy as host root has.  After all, what's the
>>>>>>>> point of restricting enclaves in the host if host code can
>>>>>>>> simply spawn a little VM to run otherwise-disallowed enclaves?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> What's the current SGX driver launch control policy? Yes allow
>>>>>>> everything works for KVM so lets skip this. Are we going to
>>>>>>> support allowing several LEs, or just allowing one single LE? I
>>>>>>> know Jarkko is doing in-kernel LE staff but I don't know details.
>>>>>>>
>>>>>>> I am trying to find a way that we can both not break host launch
>>>>>>> control policy, and be consistent to HW behavior (from guest's
>>>>>>> view).
>>>>>>> Currently we
>>>>>>> can create a KVM guest with runtime change to
>>>>>>> IA32_SGXLEPUBKEYHASHn either enabled or disabled. I introduced an
>>>>>>> Qemu parameter 'lewr' for this purpose.
>>>>>>> Actually I introduced below Qemu SGX parameters for creating guest:
>>>>>>>
>>>>>>>          -sgx epc=<size>,lehash='SHA-256 hash',lewr
>>>>>>>
>>>>>>> where 'epc' specifies guest's EPC size, lehash specifies
>>>>>>> (initial) value of
>>>>>>> guest's IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether guest
>>>>>>> is allowed to change guest's IA32_SGXLEPUBKEYHASHn at runtime.
>>>>>>>
>>>>>>> If host only allows one single LE to run, KVM can add a restrict
>>>>>>> that only allows to create KVM guest with runtime change to
>>>>>>> IA32_SGXLEPUBKEYHASHn disabled, so that only host allowed
>>>>>>> (single) hash can be used by guest. From guest's view, it simply
>>>>>>> has IA32_FEATURE_CONTROL[bit17] cleared and has
>>>>>>> IA32_SGXLEPUBKEYHASHn with default value to be host allowed
>>>>>>> (single) hash.
>>>>>>>
>>>>>>> If host allows several LEs (not but everything), and if we create
>>>>>>> guest with 'lewr', then the behavior is not consistent with HW
>>>>>>> behavior, as from guest's hardware's point of view, we can
>>>>>>> actually run any LE but we have to tell guest that you are only
>>>>>>> allowed to change IA32_SGXLEPUBKEYHASHn to some specific values.
>>>>>>> One compromise solution is we don't allow to create guest with
>>>>>>> 'lewr' specified, and at the meantime, only allow to create guest
>>>>>>> with host approved hashes specified in 'lehash'. This will make
>>>>>>> guest's behavior consistent to HW behavior but only allows guest
>>>>>>> to run one LE (which is specified by 'lehash' when guest is
>>>>>>> created).
>>>>>>
>>>>>>
>>>>>> I'm not sure I entirely agree for a couple reasons.
>>>>>>
>>>>>> 1. I wouldn't be surprised if the kernel ends up implementing a
>>>>>> policy in which it checks all enclaves (not just LEs) for
>>>>>> acceptability.  In fact, if the kernel sticks with the "no LE at
>>>>>> all or just kernel-internal LE", then checking enclaves directly
>>>>>> against some
>>>>>> admin- or distro-provided signer list seems reasonable.  This type
>>>>>> of policy can't be forwarded to a guest by restricting allowed LE
>>>>>> signers.  But this is mostly speculation since AFAIK no one has
>>>>>> seriously proposed any particular policy support and the plan was
>>>>>> to not have this for the initial implementation.
>>>>>>
>>>>>> 2. While matching hardware behavior is nice in principle, there
>>>>>> doesn't seem to be useful hardware behavior to match here.  If the
>>>>>> host had a list of five allowed LE signers, how exactly would it
>>>>>> restrict the MSRs?  They're not written atomically, so you can't
>>>>>> directly tell what's being written.
>>>>>
>>>>>
>>>>> In this case I actually plan to just allow creating guest with
>>>>> guest's IA32_SGXLEPUBKEYHASHn disabled (without 'lewr' specified).
>>>>> If 'lewr' is specified, creating guest will fail. And we only allow
>>>>> creating guest with host allowed hash values (with
>>>>> 'lehash=hash-value'), and if 'hash-value' specified by 'lehash' is
>>>>> not allowed by host, we also fail to create guest.
>>>>>
>>>>> We can only allow creating guest with 'lewr' specified when host
>>>>> allows anything.
>>>>>
>>>>> But in this way, we are restricting guest OS's ability to run LE,
>>>>> as only one LE, that is specified by 'lehash' parameter, can be
>>>>> run. But I think this won't hurt much, as multiple guests still are
>>>>> able to run different LEs?
>>>>>
>>>>> Also, the only way to fail an MSR
>>>>>>
>>>>>> write is to send #GP, and Windows (and even Linux) may not expect
>>>>>> that.  Linux doesn't panic due to #GP on MSR writes these days,
>>>>>> but you still get a big fat warning.  I wouldn't be at all
>>>>>> surprised if Windows BSODs.
>>>>>
>>>>>
>>>>> We cannot allow writing some particular value to MSRs successfully,
>>>>> while injecting #GP when writing other values to the same MSRs. So
>>>>> #GP is not option.
>>>>>
>>>>> ENCLS[EINIT], on the other hand, returns an actual
>>>>>>
>>>>>> error code.  I'm not sure that a sensible error code exists
>>>>>> ("SGX_HYPERVISOR_SAID_NO?", perhaps),
>>>>>
>>>>>
>>>>> Looks no such error code exists. And we cannot return such error
>>>>> code to guest as such error code is only supposed to be valid when
>>>>> ENCLS is run in hypervisor.
>>>>>
>>>>> but SGX_INVALID_EINITTOKEN seems
>>>>>>
>>>>>> to mean, more or less, "the CPU thinks you're not authorized to do
>>>>>> this", so forcing that error code could be entirely reasonable.
>>>>>>
>>>>>> If the host policy is to allow a list of LE signers, you could
>>>>>> return SGX_INVALID_EINITTOKEN if the guest tries to EINIT an LE
>>>>>> that isn't in the list.
>>>>>
>>>>>
>>>>> But this would be inconsistent with HW behavior. If the hash value
>>>>> in guest's IA32_SGXLEPUBKEYHASHn is matched with the one passed by
>>>>> EINIT, EINIT is not supposed to return SGX_INVALID_EINITTOKEN.
>>>>>
>>>>> I think from VMM's perspective, emulating HW behavior to be
>>>>> consistent with real HW behavior is very important.
>>>>>
>>>>> Paolo, would you provide your comments?
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> This has been quite for a while and I'd like to start discussion again.
>>>> Jarkko told me that currently he only supports one LE in SGX driver,
>>>> but I am not sure whether he is going to extend in the future or
>>>> not. I think this might also depend on requirements from customers.
>>>>
>>>> Andy,
>>>>
>>>> If we only support one LE in driver, then we can only support the
>>>> same LE for all KVM guests, according to your comments that host
>>>> kernel launch control policy should also apply to KVM guests? WOuld you
>comments more?
>>>>
>>>> Jarkko,
>>>>
>>>> Could you help to clarify the whole launch control policy in host
>>>> side so that we can have a better understanding together?
>>>>
>>>> Thanks,
>>>> -Kai
>>>
>>>
>>> So. I have pass through LE. It creates EINITTOKEN for anything.
>>> Couldn't VMM keep virtual values for MSRs and ask host side LE create
>>> token when it needs to?
>>
>>
>> Hi Jarkko,
>>
>> Thanks for replying. VMM doesn't need driver to generate EINITTOKEN.
>> The EINITTOKEN is from guest too, upon VMM traps EINIT.
>>
>> I think Andy's comments is, if host SGX driver only allows someone's
>> LE to run (ex, LE from Intel, Redhat, etc...), then SGX driver should
>> also govern LEs from KVM guests as well, by checking SIGSTRUCT from
>> KVM guest (the SIGSTRUCT is provided by KVM by trapping guest's EINIT).
>>
>> In my understanding, although you only allows one LE in kernel, but
>> you won't limit who's LE can be run (basically kernel can run LE
>> signed by anyone, but just one LE when kernel is running), so I don't
>> see there is any limitation to KVM guests here.
>>
>> But it may still be better if SGX driver can provide function like:
>>
>>     int sgx_validate_sigstruct(struct sigstruct *sig);
>>
>> for KVM to call, in case driver is changed (ex, to only allows LEs
>> from some particular ones to run), but this is not necessary now. KVM
>> changes can be done later when driver make the changes.
>>
>> Andy,
>>
>> Am I understanding correctly? Does this make sense to you?
>
>My understanding is that the kernel will not (at least initially) allow users to supply
>an LE.  The kernel will handle launches all by itself.  How it does so is an
>implementation detail, and it seems to me that it will cause compatibility issues if
>guests have to use the host's LE.
>
>sgx_validate_sigstruct(...) sounds like it could be a good approach to me.
>

I don't think the kernel needs to support more than one LE.
To my understanding, the guest kernel may include a different LE and will 
require a different LE hash, set either by configuring 'lehash' when 
launching the guest, or by the guest kernel writing the hash value itself.
I assume the VMM should not block these MSR writes and should allow the 
guest to set the hash values for its own LE (by caching the MSR writes and 
applying the values on EINIT), as sketched below.
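
Roughly what I mean, as a sketch (the field and MSR constant names are my 
guesses here, not the actual patch code):

	/*
	 * Hypothetical sketch: a guest WRMSR to IA32_SGXLEPUBKEYHASH0..3
	 * is cached in vcpu state; the real MSRs are only written later,
	 * when EINIT is trapped.
	 */
	static int vmx_set_sgxlepubkeyhash(struct kvm_vcpu *vcpu,
					   u32 msr_index, u64 data)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		vmx->msr_ia32_sgxlepubkeyhash[msr_index -
					      MSR_IA32_SGXLEPUBKEYHASH0] = data;
		return 0;
	}

Guest RDMSR would likewise be served from the cached values.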

So I don't see why the host kernel should provide an EINITTOKEN for the guests.
Moreover, I guess the approach of the LE providing tokens to any enclave is 
only the initial implementation; different distributions may decide to add 
some logic to the token generation, so it might be that the host LE or the 
guest LE will not provide a token to some enclave, and I don't think we'll 
want to override this kernel logic.

>>
>> Thanks,
>> -Kai
>>
>>>
>>> /Jarkko
>>>
>>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-08 23:47                     ` Huang, Kai
  2017-06-08 23:53                       ` Andy Lutomirski
@ 2017-06-10 12:23                       ` Jarkko Sakkinen
  2017-06-11 22:45                         ` Huang, Kai
  1 sibling, 1 reply; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-10 12:23 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Fri, Jun 09, 2017 at 11:47:13AM +1200, Huang, Kai wrote:
> In my understanding, although you only allows one LE in kernel, but you
> won't limit who's LE can be run (basically kernel can run LE signed by
> anyone, but just one LE when kernel is running), so I don't see there is any
> limitation to KVM guests here.
> 
> But it may still be better if SGX driver can provide function like:
> 
>     int sgx_validate_sigstruct(struct sigstruct *sig);
> 
> for KVM to call, in case driver is changed (ex, to only allows LEs from some
> particular ones to run), but this is not necessary now. KVM changes can be
> done later when driver make the changes.
> 
> Andy,
> 
> Am I understanding correctly? Does this make sense to you?
> 
> Thanks,
> -Kai

Nope. I don't even understand the *beginnings* of what that function would
do. I don't understand what the validation means here and what the VMM
would do if that function reports "success".

How would that work on a system where the MSRs cannot be changed?

In that kind of system the host OS must generate an EINITTOKEN for the LE
running inside the guest and maintain completely virtualized MSR values
for the guest.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-10 12:23                       ` Jarkko Sakkinen
@ 2017-06-11 22:45                         ` Huang, Kai
  2017-06-12  8:36                           ` Jarkko Sakkinen
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-06-11 22:45 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/11/2017 12:23 AM, Jarkko Sakkinen wrote:
> On Fri, Jun 09, 2017 at 11:47:13AM +1200, Huang, Kai wrote:
>> In my understanding, although you only allows one LE in kernel, but you
>> won't limit who's LE can be run (basically kernel can run LE signed by
>> anyone, but just one LE when kernel is running), so I don't see there is any
>> limitation to KVM guests here.
>>
>> But it may still be better if SGX driver can provide function like:
>>
>>      int sgx_validate_sigstruct(struct sigstruct *sig);
>>
>> for KVM to call, in case driver is changed (ex, to only allows LEs from some
>> particular ones to run), but this is not necessary now. KVM changes can be
>> done later when driver make the changes.
>>
>> Andy,
>>
>> Am I understanding correctly? Does this make sense to you?
>>
>> Thanks,
>> -Kai
> 
> Nope. I don't even understand the *beginnings* what that function would
> do. I don't understand what the validation means here and what VMM would
> do if that functions reports "success".

The validation means that either sigstruct->modulus or 
SHA256(sigstruct->modulus) should be in an 'approved whitelist' maintained 
by the kernel (which I know doesn't exist now, but which Andy has sort of 
suggested we may, or should, have in the future); otherwise the function 
returns an error to indicate the LE from the guest is "unapproved by the 
host kernel/driver".

Andy, would you explain here?

> 
> How that would work on a system where MSRs cannot be changed?

This is simple: we just won't allow the guest to choose its own 
IA32_SGXLEPUBKEYHASHn by specifying a 'lehash' value in the Qemu parameters 
when creating the guest.

To elaborate, in my current design Qemu has the new parameters below to 
support SGX:

	# qemu-system-x86_64 -sgx epc=<size>,lehash=<sha-256 hash>,lewr

'epc=<size>' obviously specifies the guest's EPC size, 'lehash' specifies 
the guest's initial IA32_SGXLEPUBKEYHASHn (similar to the value configured 
in the BIOS of a real machine), and 'lewr' specifies whether the guest's 
IA32_SGXLEPUBKEYHASHn can be changed by the guest OS at runtime. Both 
'lehash' and 'lewr' are optional.

If the MSRs cannot be changed on the physical machine, then guest creation 
will fail if either 'lehash' or 'lewr' is specified; a rough sketch of 
that check follows.
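
Roughly, on the Qemu side (host_sgx_lc_writable(), SGXOptions and its 
fields are all made-up names, just to show the intended check):

	/*
	 * Hypothetical sketch: refuse 'lehash'/'lewr' at guest creation
	 * when the host's IA32_SGXLEPUBKEYHASHn MSRs are fixed.
	 */
	static int sgx_check_launch_options(SGXOptions *sgx_opts, Error **errp)
	{
		if (!host_sgx_lc_writable() &&
		    (sgx_opts->lehash_set || sgx_opts->lewr)) {
			error_setg(errp, "host launch-control hash MSRs are "
				   "fixed; 'lehash' and 'lewr' are not "
				   "supported on this machine");
			return -1;
		}
		return 0;
	}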

> 
> In that kind of system the host OS must generate EINITTOKEN for the LE
> running on inside the guest and maintain completely virtualized MSR
> values for the guest.

The host OS will not generate an EINITTOKEN for the guest under any 
circumstances, as the EINITTOKEN always comes from the guest's EINIT 
instruction. KVM traps EINIT from the guest, gets both the SIGSTRUCT and 
the EINITTOKEN from the EINIT leaf, updates the MSRs, and runs EINIT on 
behalf of the guest.

Btw, the purpose of having KVM trap EINIT is to write the guest's virtual 
IA32_SGXLEPUBKEYHASHn to the physical MSRs before running EINIT (a rough 
sketch follows). In fact KVM doesn't even need to trap EINIT: it could 
simply write the guest's MSR values to the real MSRs when the vcpu is 
scheduled in, as long as the SGX driver updates the host LE's hash in the 
MSRs before EINIT on the host side. KVM is not trying to guarantee that 
EINIT runs successfully here, but simply to emulate the guest's 
IA32_SGXLEPUBKEYHASHn and EINIT in the guest.
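
A rough sketch of that trap-and-run flow (handle_encls_einit(), 
kvm_inject_einit_result() and __einit(), a thin ENCLS[EINIT] wrapper, are 
all invented names for the sketch, not the actual patch code):

	/*
	 * Hypothetical sketch: on a trapped EINIT, sync the guest's
	 * virtual hash into the real MSRs, run EINIT on the guest's
	 * behalf, and report the hardware's result back unchanged.
	 * (This ignores the percpu-cache optimization discussed earlier.)
	 */
	static int handle_encls_einit(struct kvm_vcpu *vcpu,
				      struct sigstruct *sigstruct,
				      struct einittoken *token, void *secs)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		int i, ret;

		for (i = 0; i < 4; i++)
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i,
			       vmx->msr_ia32_sgxlepubkeyhash[i]);

		ret = __einit(sigstruct, token, secs);

		/* Success or error code, the guest sees exactly what the
		 * hardware returned, matching real HW behavior. */
		return kvm_inject_einit_result(vcpu, ret);
	}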

Thanks,
-Kai

> 
> /Jarkko
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-11 22:45                         ` Huang, Kai
@ 2017-06-12  8:36                           ` Jarkko Sakkinen
  2017-06-12  9:53                             ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-12  8:36 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Mon, Jun 12, 2017 at 10:45:07AM +1200, Huang, Kai wrote:
> > > But it may still be better if SGX driver can provide function like:
> > > 
> > >      int sgx_validate_sigstruct(struct sigstruct *sig);
> > > 
> > > for KVM to call, in case driver is changed (ex, to only allows LEs from some
> > > particular ones to run), but this is not necessary now. KVM changes can be
> > > done later when driver make the changes.
> > > 
> > > Andy,
> > > 
> > > Am I understanding correctly? Does this make sense to you?
> > > 
> > > Thanks,
> > > -Kai
> > 
> > Nope. I don't even understand the *beginnings* what that function would
> > do. I don't understand what the validation means here and what VMM would
> > do if that functions reports "success".
> 
> The validation means either the sigstruct->modulus or
> SHA256(sigstruct->modulus) should be in a 'approved white-list' maintained
> by kernel (which I know doesn't exist now, but Andy some kind suggested we
> may or should have, in the future I guess), otherwise the function returns
> error to indicate the LE from guest is "unapproved by host kernel/driver".
> 
> Andy, would you explain here?

That can be considered, but I still have zero idea what this function is
and what its relation to the whitelist would be.

> > How that would work on a system where MSRs cannot be changed?
> 
> This is simple, we simply won't allow guest to choose its own
> IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
> creating the guest.

Why not? You could keep virtual MSR values and ask the host LE to generate
a token if they match the modulus.

> To elaborate, currently in my design Qemu has below new parameters to
> support SGX:
> 
> 	# qemu-system-x86_64 -sgx, epc=<size>,lehash=<sha-256 hash>,lewr
> 
> The 'epc=<size>' specifies guest's EPC size obviously, lehash specifies
> guest's initial IA32_SGXLEPUBKEYHASHn (similar to the value configured in
> BIOS for real machine), and 'lewr' specifies whether guest's
> IA32_SGXLEPUBKEYHASHn can be changed by OS at runtime. The 'lehash' and
> 'lewr' are optional.
> 
> If MSRs cannot be changed on physical machine, then we will fail to create
> guest if either 'lehash' or 'lewr' is specified when creating the guest.
> 
> > 
> > In that kind of system the host OS must generate EINITTOKEN for the LE
> > running on inside the guest and maintain completely virtualized MSR
> > values for the guest.
> 
> The host OS will not generate EINITTOKEN for guest in any circumstances, as
> EINITTOKEN will always be from guest's EINIT instruction. KVM traps EINIT
> from guest and gets both SIGSTRUCT and EINITTOKEN from the EINIT leaf,
> update MSRs, and run EINIT on behalf of guest.

That seriously sounds like a stupid constraint, or I'm not getting
something (which might also be the case). If you trap EINIT anyway, you
could create a special case for the guest LE.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-12  8:36                           ` Jarkko Sakkinen
@ 2017-06-12  9:53                             ` Huang, Kai
  2017-06-12 16:24                               ` Andy Lutomirski
  2017-06-13 18:57                               ` Jarkko Sakkinen
  0 siblings, 2 replies; 78+ messages in thread
From: Huang, Kai @ 2017-06-12  9:53 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/12/2017 8:36 PM, Jarkko Sakkinen wrote:
> On Mon, Jun 12, 2017 at 10:45:07AM +1200, Huang, Kai wrote:
>>>> But it may still be better if SGX driver can provide function like:
>>>>
>>>>       int sgx_validate_sigstruct(struct sigstruct *sig);
>>>>
>>>> for KVM to call, in case driver is changed (ex, to only allows LEs from some
>>>> particular ones to run), but this is not necessary now. KVM changes can be
>>>> done later when driver make the changes.
>>>>
>>>> Andy,
>>>>
>>>> Am I understanding correctly? Does this make sense to you?
>>>>
>>>> Thanks,
>>>> -Kai
>>>
>>> Nope. I don't even understand the *beginnings* what that function would
>>> do. I don't understand what the validation means here and what VMM would
>>> do if that functions reports "success".
>>
>> The validation means either the sigstruct->modulus or
>> SHA256(sigstruct->modulus) should be in a 'approved white-list' maintained
>> by kernel (which I know doesn't exist now, but Andy some kind suggested we
>> may or should have, in the future I guess), otherwise the function returns
>> error to indicate the LE from guest is "unapproved by host kernel/driver".
>>
>> Andy, would you explain here?
> 
> That can be considered but I still have zero idea what this function is
> and what its relation to whitelist would be.

The relation is that this function only returns success when 
sigstruct->modulus (or SHA256(sigstruct->modulus)) matches a value in the 
whitelist...

If the function returns success, the VMM does nothing special: it continues 
to run EINIT on behalf of the guest, and injects the result into the guest 
(either success or failure; the VMM doesn't care about the result, but it 
needs to report it to the guest to reflect HW behavior).

If the function returns an error, the VMM reports some error code to the 
guest; it doesn't even need to run EINIT anymore.

But IMO we don't have to add this function now. If your driver chooses to 
have such a whitelist, we can do this in the future.

> 
>>> How that would work on a system where MSRs cannot be changed?
>>
>> This is simple, we simply won't allow guest to choose its own
>> IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
>> creating the guest.
> 
> Why not? You could have virtual MSRs and ask host LE to generate token
> if they match to modulus.

The guest has its own LE running inside, and the guest's LE will generate 
tokens for enclaves in the guest. The host will not generate tokens for the 
guest under any circumstances, because this is entirely the guest's behavior.

Virtualization is only about emulating the hardware's behavior, not about 
assuming, depending on, or changing the guest's SW behavior. We are not 
trying to make sure EINIT runs successfully in the guest; instead, we are 
trying to make sure EINIT shows exactly the same behavior as it would on a 
physical machine -- either success or failure, according to the guest's 
sigstruct and token.

If EINIT in the guest is supposed to fail (e.g. an incorrect token was 
generated by the guest's LE), KVM needs to inject the corresponding error 
into the guest, to reflect real hardware behavior. One more example: if KVM 
chose not to trap EINIT at all, how could you provide a token generated by 
the host LE to the guest? You simply cannot. The host LE will never 
generate a token for the guest, under any circumstance.

> 
>> To elaborate, currently in my design Qemu has below new parameters to
>> support SGX:
>>
>> 	# qemu-system-x86_64 -sgx, epc=<size>,lehash=<sha-256 hash>,lewr
>>
>> The 'epc=<size>' specifies guest's EPC size obviously, lehash specifies
>> guest's initial IA32_SGXLEPUBKEYHASHn (similar to the value configured in
>> BIOS for real machine), and 'lewr' specifies whether guest's
>> IA32_SGXLEPUBKEYHASHn can be changed by OS at runtime. The 'lehash' and
>> 'lewr' are optional.
>>
>> If MSRs cannot be changed on physical machine, then we will fail to create
>> guest if either 'lehash' or 'lewr' is specified when creating the guest.
>>
>>>
>>> In that kind of system the host OS must generate EINITTOKEN for the LE
>>> running on inside the guest and maintain completely virtualized MSR
>>> values for the guest.
>>
>> The host OS will not generate EINITTOKEN for guest in any circumstances, as
>> EINITTOKEN will always be from guest's EINIT instruction. KVM traps EINIT
>> from guest and gets both SIGSTRUCT and EINITTOKEN from the EINIT leaf,
>> update MSRs, and run EINIT on behalf of guest.
> 
> Seriously sounds like a stupid constraint or I'm not getting something
> (which also might be the case). If you anyway trap EINIT, you could
> create a special case for guest LE.

This is not a constraint; KVM has to emulate the hardware correctly. For 
this part please see my explanation above.

And let me explain the purpose of trapping EINIT again here.

When the guest is about to run EINIT, and the guest's 
SHA256(sigstruct->modulus) matches the guest's virtual 
IA32_SGXLEPUBKEYHASHn (and the other fields of the sigstruct and token are 
correctly populated as well), KVM needs to make sure that EINIT will run 
successfully in the guest, even if the physical IA32_SGXLEPUBKEYHASHn are 
not equal to the guest's virtual MSRs at that particular time. This is 
because, given the same conditions, EINIT would run successfully on a 
physical machine. KVM needs to emulate the right HW behavior.

How do we make sure the guest's EINIT runs successfully in this case? We 
need to propagate the guest's virtual MSRs to the physical MSRs before the 
guest runs EINIT. The whole purpose of having KVM trap EINIT from the 
guest is to guarantee that KVM can update the physical MSRs from the 
guest's virtual MSRs before EINIT.

Like I said before, KVM doesn't even need to trap EINIT for this purpose; 
for example, KVM can write the guest's virtual MSRs to the real MSRs when 
the vcpu is scheduled in (which is exactly what I did in this patch, btw), 
but this is considered performance-unfriendly, so Andy and Sean suggested 
we do it when trapping EINIT from the guest.

Another problem with updating the MSRs when the vcpu is scheduled in is 
that it will break the SGX driver if the driver applies a performance 
optimization to its own MSR updates, such as keeping per-cpu variables 
holding the MSRs' current values and only writing an MSR when the 
per-cpu value differs from the value to be written. Because of this, 
Andy and Sean also suggested that the SGX driver provide a function that 
updates the MSRs and runs EINIT together, protected by a mutex, and that 
KVM simply call that function (with KVM providing the sigstruct and 
token it obtains by trapping EINIT from the guest). This covers the MSR 
updates and EINIT for both the host and KVM guests. IMO this is a good 
idea as well, and you should consider doing it this way.
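
Below is a minimal sketch of what such a driver-owned helper could look 
like, assuming a per-cpu cache of the MSRs' current values. The names 
(sgx_einit(), sgx_lepubkeyhash_cache, __einit() as the ENCLS[EINIT] 
wrapper) are assumptions for illustration, not the driver's actual API:

#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <asm/msr.h>

static DEFINE_MUTEX(sgx_einit_lock);
static DEFINE_PER_CPU(u64 [4], sgx_lepubkeyhash_cache);

int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken *token,
              void *secs, const u64 pubkeyhash[4])
{
        u64 *cache;
        int i, ret;

        mutex_lock(&sgx_einit_lock);
        preempt_disable();  /* keep the MSR writes and EINIT on one CPU */

        /* Only write IA32_SGXLEPUBKEYHASH0..3 when the cached value differs. */
        cache = per_cpu(sgx_lepubkeyhash_cache, smp_processor_id());
        for (i = 0; i < 4; i++) {
                if (cache[i] != pubkeyhash[i]) {
                        wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, pubkeyhash[i]);
                        cache[i] = pubkeyhash[i];
                }
        }

        ret = __einit(sigstruct, token, secs);  /* hypothetical ENCLS wrapper */

        preempt_enable();
        mutex_unlock(&sgx_einit_lock);
        return ret;
}

Both the host driver's own EINIT path and KVM's trap handler would go 
through this one function, so the per-cpu cache can never go stale 
behind the driver's back.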

Thanks,
-Kai

> 
> /Jarkko
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-12  9:53                             ` Huang, Kai
@ 2017-06-12 16:24                               ` Andy Lutomirski
  2017-06-12 22:08                                 ` Huang, Kai
  2017-06-13 18:57                               ` Jarkko Sakkinen
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-12 16:24 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Jarkko Sakkinen, Andy Lutomirski, Kai Huang, Paolo Bonzini,
	Radim Krcmar, kvm list, intel-sgx-kernel-dev, haim.cohen

On Mon, Jun 12, 2017 at 2:53 AM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>
>>>> How that would work on a system where MSRs cannot be changed?
>>>
>>>
>>> This is simple, we simply won't allow guest to choose its own
>>> IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
>>> creating the guest.
>>
>>
>> Why not? You could have virtual MSRs and ask host LE to generate token
>> if they match to modulus.
>
>
> The guest has its own LE running inside, and guest's LE will generate token
> for enclaves in guest. The host will not generate token for guest in any
> circumstances, because this is totally guest's behavior.
>
> Virtualization is only about to emulate hardware's behavior, but not to
> assume, or depend on, or change guest's SW behavior. We are not trying to
> make sure EINIT will run successfully in guest, instead, we are trying to
> make sure EINIT will be just the same behavior as it runs on physical
> machine -- either success or failure, according to guest's sigstruct and
> token.

I disagree.  Virtualization can do whatever it wants, but a pretty
strong constraint is that the guest should be at least reasonably
functional.  A Windows guest, for example, shouldn't BSOD.  But the host
most certainly can restrict what the guest can do.  If a guest is
given pass-through access to a graphics card, the host is well within
its rights to impose thermal policy, prevent reflashing, etc.
Similarly, if a guest is given access to SGX, the host can and should
impose required policy on the guest.  If this means that an EINIT that
would have succeeded at host CPL 0 fails in the guest, so be it.

Of course, there isn't much in the way of host policy right now, so
this may not require any particular action until interesting host
policy shows up.

> This is not constraint, but KVM has to emulate hardware correctly. For this
> part please see my explanation above.
>
> And let me explain the purpose of trapping EINIT again here.
>
> When guest is about to run EINIT, if guest's SHA256(sigstruct->modulus)
> matches guest's virtual IA32_SGXLEPUBKEYHASHn (and if others are correctly
> populated in sigstruct and token as well), KVM needs to make sure that EINIT
> will run successfully in guest, even physical IA32_SGXLEPUBKEYHASHn are not
> equal to guest's virtual MSRs at this particular time.

True, so long as it doesn't contradict host policy to do so.

> This is because given
> the same condition, the EINIT will run successfully on physical machine. KVM
> needs to emulate the right HW behavior.

No.  The host needs to do this because KVM needs to work and be
useful, not because KVM needs to precisely match CPU behavior as seen
by VMX root.

To avoid confusion, I don't believe I've ever said that guests should
be restricted in which LEs they can use.  The types of restrictions
I'm talking about are that, if the host prevents user code from
running, say, a provisioning enclave that isn't whitelisted, then the
guest should have the same restriction applied.  This type of
restriction *can't* be usefully done by restricting acceptable MSR
values, but it's trivial by trapping EINIT.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-12 16:24                               ` Andy Lutomirski
@ 2017-06-12 22:08                                 ` Huang, Kai
  2017-06-12 23:00                                   ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-06-12 22:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jarkko Sakkinen, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/13/2017 4:24 AM, Andy Lutomirski wrote:
> On Mon, Jun 12, 2017 at 2:53 AM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>>
>>>>> How that would work on a system where MSRs cannot be changed?
>>>>
>>>>
>>>> This is simple, we simply won't allow guest to choose its own
>>>> IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
>>>> creating the guest.
>>>
>>>
>>> Why not? You could have virtual MSRs and ask host LE to generate token
>>> if they match to modulus.
>>
>>
>> The guest has its own LE running inside, and guest's LE will generate token
>> for enclaves in guest. The host will not generate token for guest in any
>> circumstances, because this is totally guest's behavior.
>>
>> Virtualization is only about to emulate hardware's behavior, but not to
>> assume, or depend on, or change guest's SW behavior. We are not trying to
>> make sure EINIT will run successfully in guest, instead, we are trying to
>> make sure EINIT will be just the same behavior as it runs on physical
>> machine -- either success or failure, according to guest's sigstruct and
>> token.
> 
> I disagree.  Virtualization can do whatever it wants, but a pretty
> strong constraint is that the guest should be at least reasonably
> functional.

Virtualization can only do whatever it wants on the parts that it can 
trap and emulate, and such *whatever* should not break the HW behavior 
presented to the guest. This is a fundamental principle of 
virtualization. I don't know whether there are some *minor* cases where 
the HW behavior emulation is not fully respected, but I believe that, if 
they exist, such cases are extremely rare, and we certainly have a good 
reason why they are OK.

Anyway, I'll leave this part to the KVM maintainers.

Paolo, Radim,

Sorry this thread has gotten a little long by now. Can you comment on this?

> A Windows guest, for example, shouldn't BSOD.  But the host
> most certainly can restrict what the guest can do.  If a guest is
> given pass-through access to a graphics card, the host is well within
> its rights to impose thermal policy, prevent reflashing, etc.

You need to look at whether those policies are implemented via PCIe 
configuration space or via the device's registers. If the policies are 
implemented via PCIe configuration space (which is trapped and emulated 
by the VMM/Qemu), you certainly can apply restrictions by emulating 
PCIe configuration space accesses. But if those policies are controlled 
through device registers, which the driver controls entirely, it's 
totally up to the driver, and the host cannot apply any policies. You 
can choose whether or not to pass through specific BARs of the device 
(e.g., you cannot pass through the BARs containing the MSI registers), 
but once the BARs are passed through to the guest, you cannot control 
the guest's behavior.


> Similarly, if a guest is given access to SGX, the host can and should
> impose required policy on the guest.  If this means that an EINIT that
> would have succeeded at host CPL 0 fails in the guest, so be it.

This is completely different, and I disagree. If EINIT can run at host 
CPL 0, it can run at the guest's CPL 0 as well -- unless the hardware 
doesn't support this, in which case we cannot support SGX virtualization 
at all.

The exception is if the HW provides, for example, specific capability 
bits that control whether EINIT can be run at CPL 0, and the hypervisor 
is able to trap those bits; then the hypervisor can manipulate them to 
make the guest think the HW doesn't allow EINIT to run at CPL 0, in 
which case it is quite reasonable that EINIT cannot run at CPL 0 in the 
guest (because that is HW behavior).

> 
> Of course, there isn't much in the way of host policy right now, so
> this may not require any particular action until interesting host
> policy shows up.
> 
>> This is not constraint, but KVM has to emulate hardware correctly. For this
>> part please see my explanation above.
>>
>> And let me explain the purpose of trapping EINIT again here.
>>
>> When guest is about to run EINIT, if guest's SHA256(sigstruct->modulus)
>> matches guest's virtual IA32_SGXLEPUBKEYHASHn (and if others are correctly
>> populated in sigstruct and token as well), KVM needs to make sure that EINIT
>> will run successfully in guest, even physical IA32_SGXLEPUBKEYHASHn are not
>> equal to guest's virtual MSRs at this particular time.
> 
> True, so long as it doesn't contradict host policy to do so.
> 
>> This is because given
>> the same condition, the EINIT will run successfully on physical machine. KVM
>> needs to emulate the right HW behavior.
> 
> No.  The host needs to do this because KVM needs to work and be
> useful, not because KVM needs to precisely match CPU behavior as seen
> by VMX root.

If we need to break HW behavior to make SGX useful, the maintainers may 
choose not to support SGX virtualization at all, on the grounds that 
this feature simply cannot be virtualized faithfully.

Anyway, I'll leave this to the KVM maintainers to determine.

> 
> To avoid confusion, I don't believe I've ever said that guests should
> be restricted in which LEs they can use.  The types of restrictions
> I'm talking about are that, if the host prevents user code from
> running, say, a provisioning enclave that isn't whitelisted, then the
> guest should have the same restriction applied.  This type of
> restriction *can't* be usefully done by restricting acceptable MSR
> values, but it's trivial by trapping EINIT.

OK. Sorry I didn't get your point before. I thought it was about 
restricting the LE.

I don't know whether the SGX driver will restrict running the 
provisioning enclave. In my understanding the provisioning enclave 
always comes from Intel. However, I am not an expert here and may well 
be wrong. Can you point out *exactly* which host restrictions must or 
should be applied to the guest, so that Jarkko can know whether he will 
support them or not? Otherwise I don't think we even need to talk about 
this topic at the current stage.

Thanks,
-Kai

> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-12 22:08                                 ` Huang, Kai
@ 2017-06-12 23:00                                   ` Andy Lutomirski
  2017-06-16  3:46                                     ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-12 23:00 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Jarkko Sakkinen, Kai Huang, Paolo Bonzini,
	Radim Krcmar, kvm list, intel-sgx-kernel-dev, haim.cohen

On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
> I don't know whether SGX driver will have restrict on running provisioning
> enclave. In my understanding provisioning enclave is always from Intel.
> However I am not expert here and probably be wrong. Can you point out
> *exactly* what restricts in host must/should be applied to guest so that
> Jarkko can know whether he will support those restricts or not? Otherwise I
> don't think we even need to talk about this topic at current stage.
>

The whole point is that I don't know.  But here are two types of
restriction I can imagine demand for:

1. Only a particular approved provisioning enclave may run (be it
Intel's or otherwise -- with a non-Intel LE, I think you can launch a
non-Intel provisioning enclave).  This would be done to restrict what
types of remote attestation can be done. (Intel supplies a remote
attestation service that uses some contractual policy that I don't
know.  Maybe a system owner wants a different policy applied to ISVs.)
 Imposing this policy on guests more or less requires filtering EINIT.

2. For kiosk-ish or single-purpose applications, I can imagine that
you would want to allow a specific list of enclave signers or even
enclave hashes. Maybe you would allow exactly one enclave hash.  You
could kludge this up with a restrictive LE policy, but you could also
do it for real by implementing the specific restriction in the kernel.
Then you'd want to impose it on the guest, and you'd do it by
filtering EINIT.

For the time being, I don't expect either policy to be implemented
right away, but I bet that something like this will eventually happen.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-12  9:53                             ` Huang, Kai
  2017-06-12 16:24                               ` Andy Lutomirski
@ 2017-06-13 18:57                               ` Jarkko Sakkinen
  2017-06-13 19:05                                 ` Jarkko Sakkinen
  2017-06-13 23:28                                 ` Huang, Kai
  1 sibling, 2 replies; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-13 18:57 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Mon, Jun 12, 2017 at 09:53:41PM +1200, Huang, Kai wrote:
> > > This is simple, we simply won't allow guest to choose its own
> > > IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
> > > creating the guest.
> > 
> > Why not? You could have virtual MSRs and ask host LE to generate token
> > if they match to modulus.
> 
> The guest has its own LE running inside, and guest's LE will generate token
> for enclaves in guest. The host will not generate token for guest in any
> circumstances, because this is totally guest's behavior.

Why can't the host LE generate the token without the guest knowing it, 
and supply it with EINIT?

> > Seriously sounds like a stupid constraint or I'm not getting something
> > (which also might be the case). If you anyway trap EINIT, you could
> > create a special case for guest LE.
> 
> This is not constraint, but KVM has to emulate hardware correctly. For this
> part please see my explanation above.

I'm being totally honest with you: your explanation makes absolutely
zero sense to me. You don't need 1000+ words to explain the scenarios
where the "host as a delegate LE" approach would go wrong.

Please just pinpoint the scenarios where it goes wrong. I'll ignore
the text below.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-13 18:57                               ` Jarkko Sakkinen
@ 2017-06-13 19:05                                 ` Jarkko Sakkinen
  2017-06-13 20:13                                   ` Sean Christopherson
  2017-06-13 23:28                                 ` Huang, Kai
  1 sibling, 1 reply; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-13 19:05 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Tue, Jun 13, 2017 at 09:57:18PM +0300, Jarkko Sakkinen wrote:
> On Mon, Jun 12, 2017 at 09:53:41PM +1200, Huang, Kai wrote:
> > > > This is simple, we simply won't allow guest to choose its own
> > > > IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
> > > > creating the guest.
> > > 
> > > Why not? You could have virtual MSRs and ask host LE to generate token
> > > if they match to modulus.
> > 
> > The guest has its own LE running inside, and guest's LE will generate token
> > for enclaves in guest. The host will not generate token for guest in any
> > circumstances, because this is totally guest's behavior.
> 
> Why can't host LE generate the token without guest knowning it and
> supply it with EINIT?
> > > Seriously sounds like a stupid constraint or I'm not getting something
> > > (which also might be the case). If you anyway trap EINIT, you could
> > > create a special case for guest LE.
> > 
> > This is not constraint, but KVM has to emulate hardware correctly. For this
> > part please see my explanation above.
> 
> I'm being now totally honest to your: your explanation makes absolutely
> zero sense to me. You don't need a 1000+ words to explain the scenarios
> where "host as a delegate LE" approach would go wrong.
> 
> Please just pinpoint the scenarios where it goes wrong. I'll ignore
> the text below.
> 
> /Jarkko

While reading this discussion, the biggest lesson for me has been that
this is a new argument for having an in-kernel LE, in addition to what
Andy has stated before: the MSRs *never* need to be updated on behalf
of the guest.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-13 19:05                                 ` Jarkko Sakkinen
@ 2017-06-13 20:13                                   ` Sean Christopherson
  2017-06-14  9:37                                     ` Jarkko Sakkinen
  0 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2017-06-13 20:13 UTC (permalink / raw)
  To: Jarkko Sakkinen, Huang, Kai
  Cc: kvm list, Radim Krcmar, intel-sgx-kernel-dev, Paolo Bonzini

On Tue, 2017-06-13 at 22:05 +0300, Jarkko Sakkinen wrote:
> On Tue, Jun 13, 2017 at 09:57:18PM +0300, Jarkko Sakkinen wrote:
> > 
> > On Mon, Jun 12, 2017 at 09:53:41PM +1200, Huang, Kai wrote:
> > > 
> > > > 
> > > > > 
> > > > > This is simple, we simply won't allow guest to choose its own
> > > > > IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter
> > > > > when
> > > > > creating the guest.
> > > > Why not? You could have virtual MSRs and ask host LE to generate token
> > > > if they match to modulus.
> > > The guest has its own LE running inside, and guest's LE will generate
> > > token
> > > for enclaves in guest. The host will not generate token for guest in any
> > > circumstances, because this is totally guest's behavior.
> > Why can't host LE generate the token without guest knowning it and
> > supply it with EINIT?
> > > 
> > > > 
> > > > Seriously sounds like a stupid constraint or I'm not getting something
> > > > (which also might be the case). If you anyway trap EINIT, you could
> > > > create a special case for guest LE.
> > > This is not constraint, but KVM has to emulate hardware correctly. For
> > > this
> > > part please see my explanation above.
> > I'm being now totally honest to your: your explanation makes absolutely
> > zero sense to me. You don't need a 1000+ words to explain the scenarios
> > where "host as a delegate LE" approach would go wrong.
> > 
> > Please just pinpoint the scenarios where it goes wrong. I'll ignore
> > the text below.
> > 
> > /Jarkko
> When I've been reading this discussion the biggest lesson for me has
> been that this is a new argument for having in-kernel LE in addition
> to what Andy has stated before: the MSRs *never* need to be updated on
> behalf of the guest.
> 
> /Jarkko

The MSRs need to be written to run a LE in the guest, because an 
EINITTOKEN can't be used to EINIT an enclave that requests access to the 
EINITTOKENKEY, i.e. a LE.  Preventing the guest from running its own LE 
is not an option, as the owner of the LE, e.g. the guest kernel or a 
userspace daemon, will likely disable SGX if its LE fails to run 
(including any ECALLS into the LE).  Allowing a guest to run a LE 
doesn't mean the host can't ignore/discard the guest's EINITTOKENs, 
assuming the host traps EINIT.
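
To illustrate the last point: a host that traps EINIT can let the 
guest's LE launch (that launch is gated by the hash MSRs, not by a 
token) while still applying its own policy to token-based launches. A 
hedged sketch, with host_policy_allows() as a purely hypothetical hook:

#include <linux/types.h>

#define SGX_ATTR_EINITTOKENKEY  (1ULL << 5)

static bool host_policy_allows(u64 attributes, const void *token); /* hypothetical */

/*
 * Hypothetical filter run by the host when EINIT is trapped from a
 * guest. An enclave requesting the EINITTOKENKEY attribute is a launch
 * enclave; it cannot be launched via a token anyway, so it is passed
 * through and gated only by the IA32_SGXLEPUBKEYHASHn comparison that
 * hardware performs. Everything else launches via an EINITTOKEN, which
 * the host is free to vet, ignore, or reject.
 */
static bool host_allows_guest_einit(u64 sigstruct_attributes,
                                    const void *einittoken)
{
        if (sigstruct_attributes & SGX_ATTR_EINITTOKENKEY)
                return true;  /* guest LE: let hardware decide */

        return host_policy_allows(sigstruct_attributes, einittoken);
}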

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-13 18:57                               ` Jarkko Sakkinen
  2017-06-13 19:05                                 ` Jarkko Sakkinen
@ 2017-06-13 23:28                                 ` Huang, Kai
  2017-06-14  9:44                                   ` Jarkko Sakkinen
  1 sibling, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-06-13 23:28 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/14/2017 6:57 AM, Jarkko Sakkinen wrote:
> On Mon, Jun 12, 2017 at 09:53:41PM +1200, Huang, Kai wrote:
>>>> This is simple, we simply won't allow guest to choose its own
>>>> IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
>>>> creating the guest.
>>>
>>> Why not? You could have virtual MSRs and ask host LE to generate token
>>> if they match to modulus.
>>
>> The guest has its own LE running inside, and guest's LE will generate token
>> for enclaves in guest. The host will not generate token for guest in any
>> circumstances, because this is totally guest's behavior.
> 
> Why can't host LE generate the token without guest knowning it and
> supply it with EINIT?

I have said many times: virtualization is only about emulating hardware 
behavior. The same software that runs on a native machine is supposed to 
run in the guest. I don't care how you implement the host side -- 
whether the LE delegates tokens for other enclaves or not. Your 
implementation works on the host, and the exact same driver works in the 
guest. I have said several times that I don't even have to trap EINIT, 
in which case you simply cannot generate a token for a guest EINIT, 
because there is no EINIT coming from the guest at all!


>>> Seriously sounds like a stupid constraint or I'm not getting something
>>> (which also might be the case). If you anyway trap EINIT, you could
>>> create a special case for guest LE.
>>
>> This is not constraint, but KVM has to emulate hardware correctly. For this
>> part please see my explanation above.
> 
> I'm being now totally honest to your: your explanation makes absolutely
> zero sense to me. You don't need a 1000+ words to explain the scenarios
> where "host as a delegate LE" approach would go wrong.

This doesn't require 1000+ words. This is simple, and I have explained it:

This is not a constraint; KVM has to emulate the hardware correctly.

The additional 1000+ words that I spent a lot of time typing were trying 
to explain to you why we (at least Andy, Sean, and Haim, I believe) 
agreed to trap EINIT.


> 
> Please just pinpoint the scenarios where it goes wrong. I'll ignore
> the text below.

Of course you can, if you already understand why we agreed to trap 
EINIT.

I think Andy's comments just made things more complicated. He is trying 
to solve problems that don't exist in your implementation, and I believe 
we can handle them in the future, if they come up.

So let's focus on how to handle updating the MSRs and EINIT. I believe 
Andy, Sean, and I are all on the same page; I'm not sure whether you are.

For my part, I am perfectly fine not trapping EINIT (which is exactly 
what this patch does), but then you have to guarantee the correctness of 
the host side.

> 
> /Jarkko
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-13 20:13                                   ` Sean Christopherson
@ 2017-06-14  9:37                                     ` Jarkko Sakkinen
  2017-06-14 15:11                                       ` Christopherson, Sean J
  0 siblings, 1 reply; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-14  9:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Huang, Kai, kvm list, Radim Krcmar, intel-sgx-kernel-dev, Paolo Bonzini

On Tue, Jun 13, 2017 at 01:13:04PM -0700, Sean Christopherson wrote:
 
> The MSRs need to be written to run a LE in the guest, EINITTOKEN can't be used
> to EINIT an enclave that is requesting access to the EINITTOKENKEY, i.e. a LE.
> Preventing the guest from running its own LE is not an option, as the owner of
> the LE, e.g. guest kernel or userspace daemon, will likely disable SGX if its LE
> fails to run (including any ECALLS into the LE).  Allowing a guest to run a LE
> doesn't mean the host can't ignore/discard the guest's EINITTOKENs, assuming the
> host traps EINIT.

[I started a one-week leave today but will peek at the MLs occasionally, 
so expect some delay in my follow-up responses]

Please, let's not use the term ECALL in these discussions. It's neither 
a hardware nor a kernel-specific concept. It's an abstraction that 
exists only in the Intel SDK. I have neither ECALLs nor OCALLs in my LE, 
for example. There are enough moving parts without that abstraction.

I'm looking at the section "EINIT - Initialize an Enclave for Execution"
from the SDM. I'm not seeing a branch in the pseudo code that checks for
ATTRIBUTES.EINITTOKENKEY.

39.1.4 states that "Only Launch Enclaves are allowed to launch without a
valid token." I'm not sure what I should deduce from that because that
statement is *incorrect*. If you control the MSRs, you can launch
anything you want to launch. I guess we should make a bug report of this
section as it's complete nonsense?

Table 41-56 does not show any key material bound to the key hash defined
in the MSRs.

Instead of teaching me stuff that I already know, I would just like to 
have it pinpointed where the "side-effect" is that creates the 
constraint you are claiming. I can then update the documentation so that 
we don't have to go through this discussion anymore :-)

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-13 23:28                                 ` Huang, Kai
@ 2017-06-14  9:44                                   ` Jarkko Sakkinen
  0 siblings, 0 replies; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-14  9:44 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen

On Wed, Jun 14, 2017 at 11:28:32AM +1200, Huang, Kai wrote:
> 
> 
> On 6/14/2017 6:57 AM, Jarkko Sakkinen wrote:
> > On Mon, Jun 12, 2017 at 09:53:41PM +1200, Huang, Kai wrote:
> > > > > This is simple, we simply won't allow guest to choose its own
> > > > > IA32_SGXLEPUBKEYHASHn by specifying 'lehash' value in Qemu parameter when
> > > > > creating the guest.
> > > > 
> > > > Why not? You could have virtual MSRs and ask host LE to generate token
> > > > if they match to modulus.
> > > 
> > > The guest has its own LE running inside, and guest's LE will generate token
> > > for enclaves in guest. The host will not generate token for guest in any
> > > circumstances, because this is totally guest's behavior.
> > 
> > Why can't host LE generate the token without guest knowning it and
> > supply it with EINIT?
> 
> I have said many times, virtualization is only about to emulate hardware
> behavior. The same software runs in native machine is supposed to run in
> guest. I don't care on host how you implement -- whether LE delegates token
> for other enclave or not. Your implementation works on host, the exact
> driver works in guest. I said several times, I even don't have to trap
> EINIT, in which case you simply cannot generate token for EINIT from guest,
> because there is no EINIT from guest!
> 
> 
> > > > Seriously sounds like a stupid constraint or I'm not getting something
> > > > (which also might be the case). If you anyway trap EINIT, you could
> > > > create a special case for guest LE.
> > > 
> > > This is not constraint, but KVM has to emulate hardware correctly. For this
> > > part please see my explanation above.
> > 
> > I'm being now totally honest to your: your explanation makes absolutely
> > zero sense to me. You don't need a 1000+ words to explain the scenarios
> > where "host as a delegate LE" approach would go wrong.
> 
> This doesn't require 1000+ words.. This is simple, and I have explained:
> 
> This is not constraint, but KVM has to emulate hardware correctly.
> 
> The additional 1000+ words that I spent lots of time on typing is trying to
> explain to you -- why we (at least Andy, Sean, Haim I believe) agreed to
> choose to trap EINIT.
> 
> 
> > 
> > Please just pinpoint the scenarios where it goes wrong. I'll ignore
> > the text below.
> 
> Of course you can, if you already understand why we agreed to choose to trap
> EINIT.
> 
> I think Andy's comments just made things more complicated. He is trying to
> solve problems that don't exist in your implementation, and I believe we can
> handle this in the future, if those problems come up.
> 
> So let's focus on how to handle updating MSRs and EINIT. I believe Andy, me,
> and Sean are all on the same page. Not sure whether you are or not.
> 
> For me I am perfectly fine not to trap EINIT (which is exactly this patch
> did), but you have to guarantee the correctness of host side.
> 
> > 
> > /Jarkko
> > 

I'm not yet seeing why the MSRs would ever need to be updated.

See my response to Sean for details.

There's probably some detail in the SDM that I'm not observing.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-14  9:37                                     ` Jarkko Sakkinen
@ 2017-06-14 15:11                                       ` Christopherson, Sean J
  2017-06-14 17:03                                         ` Jarkko Sakkinen
  0 siblings, 1 reply; 78+ messages in thread
From: Christopherson, Sean J @ 2017-06-14 15:11 UTC (permalink / raw)
  To: 'Jarkko Sakkinen'
  Cc: Huang, Kai, kvm list, Radim Krcmar, intel-sgx-kernel-dev, Paolo Bonzini

Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> On Tue, Jun 13, 2017 at 01:13:04PM -0700, Sean Christopherson wrote:
>  
> > The MSRs need to be written to run a LE in the guest, EINITTOKEN can't be
> > used to EINIT an enclave that is requesting access to the EINITTOKENKEY,
> > i.e. a LE. Preventing the guest from running its own LE is not an option,
> > as the owner of the LE, e.g. guest kernel or userspace daemon, will likely
> > disable SGX if its LE fails to run (including any ECALLS into the LE).
> > Allowing a guest to run a LE doesn't mean the host can't ignore/discard the
> > guest's EINITTOKENs, assuming the host traps EINIT.
> 
> [I started one week leave today but will peek MLs seldomly so except
> some delay in my follow up responses]
> 
> Please, lets not use the term ECALL in these discussions. It's neither
> hardware nor kernel specific concept. It's abstraction that exists only
> in the Intel SDK. I have neither ECALLs nor OCALLs in my LE for example.
> There are enough moving parts without such abstraction.
> 
> I'm looking at the section "EINIT - Initialize an Enclave for Execution"
> from the SDM. I'm not seeing a branch in the pseudo code that checks for
> ATTRIBUTES.EINITTOKENKEY.

(* if controlled ATTRIBUTES are set, SIGSTRUCT must be signed using an authorized key *)
CONTROLLED_ATTRIBUTES <- 0000000000000020H;
IF (((DS:RCX.ATTRIBUTES & CONTROLLED_ATTRIBUTES) != 0) and (TMP_MRSIGNER != IA32_SGXLEPUBKEYHASH))
    RFLAG.ZF <- 1;
    RAX <- SGX_INVALID_ATTRIBUTE;
    GOTO EXIT;
FI;

Bit 5, i.e. 20H, corresponds to the EINITTOKENKEY.  This is also covered in the
text description under Intel SGX Launch Control Configuration - "The hash of the
public key used to sign the SIGSTRUCT of the Launch Enclave must equal the value
in the IA32_SGXLEPUBKEYHASH MSRs."
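
Rendered as plain C, the branch above is essentially the following, 
where TMP_MRSIGNER is the SHA-256 of SIGSTRUCT.MODULUS. This is a 
paraphrase of the SDM pseudocode, not driver or KVM code:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SGX_ATTR_EINITTOKENKEY  (1ULL << 5)  /* the controlled attribute, 20H */

/*
 * If the SIGSTRUCT requests a controlled attribute (today just
 * EINITTOKENKEY), EINIT fails with SGX_INVALID_ATTRIBUTE unless
 * SHA-256(SIGSTRUCT.MODULUS), i.e. TMP_MRSIGNER, equals the value in
 * the IA32_SGXLEPUBKEYHASHn MSRs.
 */
static bool controlled_attributes_permitted(uint64_t sigstruct_attributes,
                                            const uint8_t mrsigner[32],
                                            const uint8_t pubkeyhash[32])
{
        if (!(sigstruct_attributes & SGX_ATTR_EINITTOKENKEY))
                return true;  /* no controlled attribute requested */

        return memcmp(mrsigner, pubkeyhash, 32) == 0;
}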


> 39.1.4 states that "Only Launch Enclaves are allowed to launch without a
> valid token." I'm not sure what I should deduce from that because that
> statement is *incorrect*. If you control the MSRs, you can launch
> anything you want to launch. I guess we should make a bug report of this
> section as it's complete nonsense?

I wouldn't call it complete nonsense; there are far more egregious ambiguities
in the SDM.  If you read the statement in the context of someone learning about
SGX, it makes perfect sense: if it's not a launch enclave, it needs a token.
Sure, rewording the statement to something like "Only enclaves whose public key
hash equals the value in the IA32_SGXLEPUBKEYHASH MSRs are allowed to launch
without a token." is technically more accurate, but I wouldn't describe the
current wording as "complete nonsense".


> The table 41-56 does not show any key material bound to key hash defined
> in the MSRs.
> 
> Instead of teaching me stuff that I already know I would just like to
> get pinpointed where is the "side-effect" that makes the constraint that
> you are claiming. I can then update the documentation so that we don't
> have to go through this discussion anymore :-)
> 
> /Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-14 15:11                                       ` Christopherson, Sean J
@ 2017-06-14 17:03                                         ` Jarkko Sakkinen
  0 siblings, 0 replies; 78+ messages in thread
From: Jarkko Sakkinen @ 2017-06-14 17:03 UTC (permalink / raw)
  To: Christopherson, Sean J
  Cc: Huang, Kai, kvm list, Radim Krcmar, intel-sgx-kernel-dev, Paolo Bonzini

On Wed, Jun 14, 2017 at 03:11:34PM +0000, Christopherson, Sean J wrote:
> Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > On Tue, Jun 13, 2017 at 01:13:04PM -0700, Sean Christopherson wrote:
> >  
> > > The MSRs need to be written to run a LE in the guest, EINITTOKEN can't be
> > > used to EINIT an enclave that is requesting access to the EINITTOKENKEY,
> > > i.e. a LE. Preventing the guest from running its own LE is not an option,
> > > as the owner of the LE, e.g. guest kernel or userspace daemon, will likely
> > > disable SGX if its LE fails to run (including any ECALLS into the LE).
> > > Allowing a guest to run a LE doesn't mean the host can't ignore/discard the
> > > guest's EINITTOKENs, assuming the host traps EINIT.
> > 
> > [I started one week leave today but will peek MLs seldomly so except
> > some delay in my follow up responses]
> > 
> > Please, lets not use the term ECALL in these discussions. It's neither
> > hardware nor kernel specific concept. It's abstraction that exists only
> > in the Intel SDK. I have neither ECALLs nor OCALLs in my LE for example.
> > There are enough moving parts without such abstraction.
> > 
> > I'm looking at the section "EINIT - Initialize an Enclave for Execution"
> > from the SDM. I'm not seeing a branch in the pseudo code that checks for
> > ATTRIBUTES.EINITTOKENKEY.
> 
> (* if controlled ATTRIBUTES are set, SIGSTRUCT must be signed using an authorized key *)
> CONTROLLED_ATTRIBUTES <- 0000000000000020H;
> IF (((DS:RCX.ATTRIBUTES & CONTROLLED_ATTRIBUTES) != 0) and (TMP_MRSIGNER != IA32_SGXLEPUBKEYHASH))
>     RFLAG.ZF <- 1;
>     RAX <- SGX_INVALID_ATTRIBUTE;
>     GOTO EXIT;
> FI;
> 
> Bit 5, i.e. 20H, corresponds to the EINITTOKENKEY.  This is also covered in the
> text description under Intel SGX Launch Control Configuration - "The hash of the
> public key used to sign the SIGSTRUCT of the Launch Enclave must equal the value
> in the IA32_SGXLEPUBKEYHASH MSRs."

Thanks. I wonder why the naming is ambiguous (the value is exactly the
same as that of ATTRIBUTES.EINITTOKENKEY, but the name is different),
but there it is.

> > 39.1.4 states that "Only Launch Enclaves are allowed to launch without a
> > valid token." I'm not sure what I should deduce from that because that
> > statement is *incorrect*. If you control the MSRs, you can launch
> > anything you want to launch. I guess we should make a bug report of this
> > section as it's complete nonsense?
> 
> I wouldn't call it complete nonsense, there are far more egregious ambiguities
> in the SDM.  If you read the statement in the context of someone learning about
> SGX, it makes perfect sense: if it's not a launch enclave, it needs a token.
> Sure, rewording the statement to something like "Only enclaves whose public key
> hash equals the value in the IA32_SGXLEPUBKEYHASH MSRs are allowed to launch
> without a token." is technically more accurate, but I wouldn't describe the
> current wording as "complete nonsense".  

Agreed! That was a harsh overstatement.

I think that in this kind of material, accuracy still matters, 
especially when cryptography is involved.

I'll make updates to intel_sgx.rst. It's good to have this documented by 
the time the virtualization stuff is upstreamed.

/Jarkko

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-12 23:00                                   ` Andy Lutomirski
@ 2017-06-16  3:46                                     ` Huang, Kai
  2017-06-16  4:11                                       ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-06-16  3:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jarkko Sakkinen, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>
>> I don't know whether SGX driver will have restrict on running provisioning
>> enclave. In my understanding provisioning enclave is always from Intel.
>> However I am not expert here and probably be wrong. Can you point out
>> *exactly* what restricts in host must/should be applied to guest so that
>> Jarkko can know whether he will support those restricts or not? Otherwise I
>> don't think we even need to talk about this topic at current stage.
>>
> 
> The whole point is that I don't know.  But here are two types of
> restriction I can imagine demand for:
> 
> 1. Only a particular approved provisioning enclave may run (be it
> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
> non-Intel provisioning enclave).  This would be done to restrict what
> types of remote attestation can be done. (Intel supplies a remote
> attestation service that uses some contractual policy that I don't
> know.  Maybe a system owner wants a different policy applied to ISVs.)
>   Imposing this policy on guests more or less requires filtering EINIT.

Hi Andy,

Sorry for the late reply.

What is the issue if the host and guest run provisioning enclaves from 
different vendors -- for example, the host runs Intel's provisioning 
enclave and the guest runs another vendor's? Or different guests run 
provisioning enclaves from different vendors?

One reason I am asking is that, on Xen (where we don't have the concept 
of a *host*), it's likely that we won't apply any policy at the Xen 
hypervisor level at all, and guests will be able to run any enclave from 
any signer as they wish.

Sorry, I don't quite understand (or have somewhat forgotten) the issues here.

> 
> 2. For kiosk-ish or single-purpose applications, I can imagine that
> you would want to allow a specific list of enclave signers or even
> enclave hashes. Maybe you would allow exactly one enclave hash.  You
> could kludge this up with a restrictive LE policy, but you could also
> do it for real by implementing the specific restriction in the kernel.
> Then you'd want to impose it on the guest, and you'd do it by
> filtering EINIT.

Assuming the enclave hash means the measurement of the enclave, and 
assuming we have a policy that only allows enclaves from one signer to 
run, would you also elaborate on the issue if the host and guest run 
enclaves from different signers? If the host has such a policy, and we 
allow creating guests on such a host, I think we will typically have the 
same policy in the guest (vetted by the guest's kernel). The owner of 
that host should be aware of the risk (if there is any) of creating a 
guest and running enclaves inside it.

Thanks,
-Kai

> 
> For the time being, I don't expect either policy to be implemented
> right away, but I bet that something like this will eventually happen.
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16  3:46                                     ` Huang, Kai
@ 2017-06-16  4:11                                       ` Andy Lutomirski
  2017-06-16  4:33                                         ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-16  4:11 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Jarkko Sakkinen, Kai Huang, Paolo Bonzini,
	Radim Krcmar, kvm list, intel-sgx-kernel-dev, haim.cohen

On Thu, Jun 15, 2017 at 8:46 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
>>
>> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com>
>> wrote:
>>>
>>>
>>> I don't know whether SGX driver will have restrict on running
>>> provisioning
>>> enclave. In my understanding provisioning enclave is always from Intel.
>>> However I am not expert here and probably be wrong. Can you point out
>>> *exactly* what restricts in host must/should be applied to guest so that
>>> Jarkko can know whether he will support those restricts or not? Otherwise
>>> I
>>> don't think we even need to talk about this topic at current stage.
>>>
>>
>> The whole point is that I don't know.  But here are two types of
>> restriction I can imagine demand for:
>>
>> 1. Only a particular approved provisioning enclave may run (be it
>> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
>> non-Intel provisioning enclave).  This would be done to restrict what
>> types of remote attestation can be done. (Intel supplies a remote
>> attestation service that uses some contractual policy that I don't
>> know.  Maybe a system owner wants a different policy applied to ISVs.)
>>   Imposing this policy on guests more or less requires filtering EINIT.
>
>
> Hi Andy,
>
> Sorry for late reply.
>
> What is the issue if host and guest run provisioning enclave from different
> vendor, for example, host runs intel's provisioning enclave, and guest runs
> other vendor's provisioning enclave? Or different guests run provisioning
> enclaves from different vendors?

There's no issue unless someone has tried to impose a policy.  There
is clearly at least some interest in having policies that affect what
enclaves can run -- otherwise there wouldn't be LEs in the first
place.

>
> One reason I am asking is that, on Xen (where we don't have concept of
> *host*), it's likely that we won't apply any policy at Xen hypervisor at
> all, and guests will be able to run any enclave from any signer as their
> wish.

That seems entirely reasonable.  Someone may eventually ask Xen to add
support for SGX enclave restrictions, in which case you'll either have
to tell them that it won't happen or implement it.

>
> Sorry that I don't understand (or kind of forgot) the issues here.
>
>>
>> 2. For kiosk-ish or single-purpose applications, I can imagine that
>> you would want to allow a specific list of enclave signers or even
>> enclave hashes. Maybe you would allow exactly one enclave hash.  You
>> could kludge this up with a restrictive LE policy, but you could also
>> do it for real by implementing the specific restriction in the kernel.
>> Then you'd want to impose it on the guest, and you'd do it by
>> filtering EINIT.
>
> Assuming the enclave hash means measurement of enclave, and assuming we have
> a policy that we only allow enclave from one signer to run, would you also
> elaborate the issue that, if host and guest run enclaves from different
> signer? If host has such policy, and we are allowing creating guests on such
> host, I think that typically we will have the same policy in the guest

Yes, I presume this too, but.

> (vetted by guest's kernel). The owner of that host should be aware of the
> risk (if there's any) by creating guest and run enclave inside it.

No.  The host does not trust the guest in general.  If the host has a
policy that the only enclave that shall run is X, that doesn't mean
that the host shall reject all enclaves requested by the normal
userspace API except X but that, if /dev/kvm is used, then the user is
magically trusted to not load a guest that fails to respect the host
policy.  It means that the only enclave that shall run is X regardless
of what interface is used.  The host must only allow X to be loaded by
its userspace and the host must only allow X to be loaded by a guest.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16  4:11                                       ` Andy Lutomirski
@ 2017-06-16  4:33                                         ` Huang, Kai
  2017-06-16  9:34                                           ` Huang, Kai
                                                             ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Huang, Kai @ 2017-06-16  4:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jarkko Sakkinen, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/16/2017 4:11 PM, Andy Lutomirski wrote:
> On Thu, Jun 15, 2017 at 8:46 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>>
>>
>> On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
>>>
>>> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com>
>>> wrote:
>>>>
>>>>
>>>> I don't know whether SGX driver will have restrict on running
>>>> provisioning
>>>> enclave. In my understanding provisioning enclave is always from Intel.
>>>> However I am not expert here and probably be wrong. Can you point out
>>>> *exactly* what restricts in host must/should be applied to guest so that
>>>> Jarkko can know whether he will support those restricts or not? Otherwise
>>>> I
>>>> don't think we even need to talk about this topic at current stage.
>>>>
>>>
>>> The whole point is that I don't know.  But here are two types of
>>> restriction I can imagine demand for:
>>>
>>> 1. Only a particular approved provisioning enclave may run (be it
>>> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
>>> non-Intel provisioning enclave).  This would be done to restrict what
>>> types of remote attestation can be done. (Intel supplies a remote
>>> attestation service that uses some contractual policy that I don't
>>> know.  Maybe a system owner wants a different policy applied to ISVs.)
>>>    Imposing this policy on guests more or less requires filtering EINIT.
>>
>>
>> Hi Andy,
>>
>> Sorry for late reply.
>>
>> What is the issue if host and guest run provisioning enclave from different
>> vendor, for example, host runs intel's provisioning enclave, and guest runs
>> other vendor's provisioning enclave? Or different guests run provisioning
>> enclaves from different vendors?
> 
> There's no issue unless someone has tried to impose a policy.  There
> is clearly at least some interest in having policies that affect what
> enclaves can run -- otherwise there wouldn't be LEs in the first
> place.
> 
>>
>> One reason I am asking is that, on Xen (where we don't have concept of
>> *host*), it's likely that we won't apply any policy at Xen hypervisor at
>> all, and guests will be able to run any enclave from any signer as their
>> wish.
> 
> That seems entirely reasonable.  Someone may eventually ask Xen to add
> support for SGX enclave restrictions, in which case you'll either have
> to tell them that it won't happen or implement it.
> 
>>
>> Sorry that I don't understand (or kind of forgot) the issues here.
>>
>>>
>>> 2. For kiosk-ish or single-purpose applications, I can imagine that
>>> you would want to allow a specific list of enclave signers or even
>>> enclave hashes. Maybe you would allow exactly one enclave hash.  You
>>> could kludge this up with a restrictive LE policy, but you could also
>>> do it for real by implementing the specific restriction in the kernel.
>>> Then you'd want to impose it on the guest, and you'd do it by
>>> filtering EINIT.
>>
>> Assuming the enclave hash means measurement of enclave, and assuming we have
>> a policy that we only allow enclave from one signer to run, would you also
>> elaborate the issue that, if host and guest run enclaves from different
>> signer? If host has such policy, and we are allowing creating guests on such
>> host, I think that typically we will have the same policy in the guest
> 
> Yes, I presume this too, but.
> 
>> (vetted by guest's kernel). The owner of that host should be aware of the
>> risk (if there's any) by creating guest and run enclave inside it.
> 
> No.  The host does not trust the guest in general.  If the host has a

I agree.

> policy that the only enclave that shall run is X, that doesn't mean
> that the host shall reject all enclaves requested by the normal
> userspace API except X but that, if /dev/kvm is used, then the user is
> magically trusted to not load a guest that fails to respect the host
> policy.  It means that the only enclave that shall run is X regardless
> of what interface is used.  The host must only allow X to be loaded by
> its userspace and the host must only allow X to be loaded by a guest.
> 

This is a theoretical thing. I think your statement makes sense only if 
we have a specific example proving there is an actual risk in allowing 
the guest to exceed the X approved by the host.

I will dig through your previous emails to see whether you have listed 
such real cases (I have somewhat forgotten, sorry), but if you don't 
mind, you can list such cases here.

Thanks,
-Kai

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16  4:33                                         ` Huang, Kai
@ 2017-06-16  9:34                                           ` Huang, Kai
  2017-06-16 16:03                                           ` Andy Lutomirski
  2017-06-16 16:25                                           ` Andy Lutomirski
  2 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-06-16  9:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jarkko Sakkinen, Kai Huang, Paolo Bonzini, Radim Krcmar,
	kvm list, intel-sgx-kernel-dev, haim.cohen



On 6/16/2017 4:33 PM, Huang, Kai wrote:
> 
> 
> On 6/16/2017 4:11 PM, Andy Lutomirski wrote:
>> On Thu, Jun 15, 2017 at 8:46 PM, Huang, Kai 
>> <kai.huang@linux.intel.com> wrote:
>>>
>>>
>>> On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> I don't know whether SGX driver will have restrict on running
>>>>> provisioning
>>>>> enclave. In my understanding provisioning enclave is always from 
>>>>> Intel.
>>>>> However I am not expert here and probably be wrong. Can you point out
>>>>> *exactly* what restricts in host must/should be applied to guest so 
>>>>> that
>>>>> Jarkko can know whether he will support those restricts or not? 
>>>>> Otherwise
>>>>> I
>>>>> don't think we even need to talk about this topic at current stage.
>>>>>
>>>>
>>>> The whole point is that I don't know.  But here are two types of
>>>> restriction I can imagine demand for:
>>>>
>>>> 1. Only a particular approved provisioning enclave may run (be it
>>>> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
>>>> non-Intel provisioning enclave).  This would be done to restrict what
>>>> types of remote attestation can be done. (Intel supplies a remote
>>>> attestation service that uses some contractual policy that I don't
>>>> know.  Maybe a system owner wants a different policy applied to ISVs.)
>>>>    Imposing this policy on guests more or less requires filtering 
>>>> EINIT.
>>>
>>>
>>> Hi Andy,
>>>
>>> Sorry for late reply.
>>>
>>> What is the issue if host and guest run provisioning enclave from 
>>> different
>>> vendor, for example, host runs intel's provisioning enclave, and 
>>> guest runs
>>> other vendor's provisioning enclave? Or different guests run 
>>> provisioning
>>> enclaves from different vendors?
>>
>> There's no issue unless someone has tried to impose a policy.  There
>> is clearly at least some interest in having policies that affect what
>> enclaves can run -- otherwise there wouldn't be LEs in the first
>> place.
>>
>>>
>>> One reason I am asking is that, on Xen (where we don't have concept of
>>> *host*), it's likely that we won't apply any policy at Xen hypervisor at
>>> all, and guests will be able to run any enclave from any signer as their
>>> wish.
>>
>> That seems entirely reasonable.  Someone may eventually ask Xen to add
>> support for SGX enclave restrictions, in which case you'll either have
>> to tell them that it won't happen or implement it.
>>
>>>
>>> Sorry that I don't understand (or kind of forgot) the issues here.
>>>
>>>>
>>>> 2. For kiosk-ish or single-purpose applications, I can imagine that
>>>> you would want to allow a specific list of enclave signers or even
>>>> enclave hashes. Maybe you would allow exactly one enclave hash.  You
>>>> could kludge this up with a restrictive LE policy, but you could also
>>>> do it for real by implementing the specific restriction in the kernel.
>>>> Then you'd want to impose it on the guest, and you'd do it by
>>>> filtering EINIT.
>>>
>>> Assuming the enclave hash means measurement of enclave, and assuming 
>>> we have
>>> a policy that we only allow enclave from one signer to run, would you 
>>> also
>>> elaborate the issue that, if host and guest run enclaves from different
>>> signer? If host has such policy, and we are allowing creating guests 
>>> on such
>>> host, I think that typically we will have the same policy in the guest
>>
>> Yes, I presume this too, but.
>>
>>> (vetted by guest's kernel). The owner of that host should be aware of 
>>> the
>>> risk (if there's any) by creating guest and run enclave inside it.
>>
>> No.  The host does not trust the guest in general.  If the host has a
> 
> I agree.
> 
>> policy that the only enclave that shall run is X, that doesn't mean
>> that the host shall reject all enclaves requested by the normal
>> userspace API except X but that, if /dev/kvm is used, then the user is
>> magically trusted to not load a guest that fails to respect the host
>> policy.  It means that the only enclave that shall run is X regardless
>> of what interface is used.  The host must only allow X to be loaded by
>> its userspace and the host must only allow X to be loaded by a guest.
>>
> 
> This is theoretical thing. I think your statement makes sense only if we 
> have specific example that can prove there's actual risk when allowing 
> guest to exceed X approved by host.
> 
> I will dig more in your previous emails to see whether you have listed 
> such real cases (I some kind forgot sorry) but if you don't mind, you 
> can list such cases here.

Hi Andy,

I found an example you listed in a previous email, but it is related to 
SGX's key architecture rather than to host policy. I quote it below:

"Concretely, imagine I write an enclave that seals my TLS client
certificate's private key and offers an API to sign TLS certificate
requests with it.  This way, if my system is compromised, an attacker
can use the certificate only so long as they have access to my
machine.  If I kick them out or if they merely get the ability to read
the sealed data but not to execute code, the private key should still
be safe.  But, if this system is a VM guest, the attacker could run
the exact same enclave on another guest on the same physical CPU and
sign using my key.  Whoops!"

I think you will have this problem even if you apply the strictest 
policy on both host and guest -- only allowing one enclave from one 
signer to run. This is indeed a flaw, but virtualization cannot do 
anything to solve it -- unless we don't support virtualization at all :)

Sorry, I am just trying to find out whether there is a real case that 
genuinely requires applying the host's policy to the guest, and that 
causes problems if we don't.

Thanks,
-Kai

> 
> Thanks,
> -Kai

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16  4:33                                         ` Huang, Kai
  2017-06-16  9:34                                           ` Huang, Kai
@ 2017-06-16 16:03                                           ` Andy Lutomirski
  2017-06-16 16:25                                           ` Andy Lutomirski
  2 siblings, 0 replies; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-16 16:03 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Jarkko Sakkinen, Kai Huang, Paolo Bonzini,
	Radim Krcmar, kvm list, intel-sgx-kernel-dev, haim.cohen

On Thu, Jun 15, 2017 at 9:33 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 6/16/2017 4:11 PM, Andy Lutomirski wrote:
>>
>> On Thu, Jun 15, 2017 at 8:46 PM, Huang, Kai <kai.huang@linux.intel.com>
>> wrote:
>>>
>>>
>>>
>>> On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
>>>>
>>>>
>>>> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> I don't know whether the SGX driver will have restrictions on running
>>>>> the provisioning enclave. In my understanding the provisioning enclave
>>>>> always comes from Intel. However, I am not an expert here and may be
>>>>> wrong. Can you point out *exactly* which restrictions on the host
>>>>> must/should be applied to the guest, so that Jarkko can know whether he
>>>>> will support those restrictions or not? Otherwise I don't think we even
>>>>> need to talk about this topic at the current stage.
>>>>>
>>>>
>>>> The whole point is that I don't know.  But here are two types of
>>>> restriction I can imagine demand for:
>>>>
>>>> 1. Only a particular approved provisioning enclave may run (be it
>>>> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
>>>> non-Intel provisioning enclave).  This would be done to restrict what
>>>> types of remote attestation can be done. (Intel supplies a remote
>>>> attestation service that uses some contractual policy that I don't
>>>> know.  Maybe a system owner wants a different policy applied to ISVs.)
>>>>    Imposing this policy on guests more or less requires filtering EINIT.
>>>
>>>
>>>
>>> Hi Andy,
>>>
>>> Sorry for the late reply.
>>>
>>> What is the issue if the host and guest run provisioning enclaves from
>>> different vendors? For example, the host runs Intel's provisioning
>>> enclave while a guest runs another vendor's provisioning enclave. Or
>>> different guests run provisioning enclaves from different vendors?
>>
>>
>> There's no issue unless someone has tried to impose a policy.  There
>> is clearly at least some interest in having policies that affect what
>> enclaves can run -- otherwise there wouldn't be LEs in the first
>> place.
>>
>>>
>>> One reason I am asking is that, on Xen (where we don't have the concept
>>> of a *host*), it's likely that we won't apply any policy in the Xen
>>> hypervisor at all, and guests will be able to run any enclave from any
>>> signer as they wish.
>>
>>
>> That seems entirely reasonable.  Someone may eventually ask Xen to add
>> support for SGX enclave restrictions, in which case you'll either have
>> to tell them that it won't happen or implement it.
>>
>>>
>>> Sorry that I don't understand (or kind of forgot) the issues here.
>>>
>>>>
>>>> 2. For kiosk-ish or single-purpose applications, I can imagine that
>>>> you would want to allow a specific list of enclave signers or even
>>>> enclave hashes. Maybe you would allow exactly one enclave hash.  You
>>>> could kludge this up with a restrictive LE policy, but you could also
>>>> do it for real by implementing the specific restriction in the kernel.
>>>> Then you'd want to impose it on the guest, and you'd do it by
>>>> filtering EINIT.
>>>
>>>
>>> Assuming the enclave hash means the measurement of the enclave, and
>>> assuming we have a policy that only allows enclaves from one signer to
>>> run, would you also elaborate on the issue if the host and guest run
>>> enclaves from different signers? If the host has such a policy, and we
>>> allow creating guests on such a host, I think that typically we will
>>> have the same policy in the guest
>>
>>
>> Yes, I presume this too, but.
>>
>>> (vetted by the guest's kernel). The owner of that host should be aware
>>> of the risk (if there is any) of creating a guest and running enclaves
>>> inside it.
>>
>>
>> No.  The host does not trust the guest in general.  If the host has a
>
>
> I agree.
>
>> policy that the only enclave that shall run is X, that doesn't mean
>> that the host shall reject all enclaves requested by the normal
>> userspace API except X but that, if /dev/kvm is used, then the user is
>> magically trusted to not load a guest that fails to respect the host
>> policy.  It means that the only enclave that shall run is X regardless
>> of what interface is used.  The host must only allow X to be loaded by
>> its userspace and the host must only allow X to be loaded by a guest.
>>
>
> This is a theoretical thing. I think your statement makes sense only if
> we have a specific example proving there is an actual risk when we allow
> a guest to exceed the X approved by the host.

I would turn this around.  Can you come up with any example where the
host would have a restrictive policy but where that policy should not be
enforced for guests?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16  4:33                                         ` Huang, Kai
  2017-06-16  9:34                                           ` Huang, Kai
  2017-06-16 16:03                                           ` Andy Lutomirski
@ 2017-06-16 16:25                                           ` Andy Lutomirski
  2017-06-16 16:31                                             ` Christopherson, Sean J
  2 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-16 16:25 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Andy Lutomirski, Jarkko Sakkinen, Kai Huang, Paolo Bonzini,
	Radim Krcmar, kvm list, intel-sgx-kernel-dev, haim.cohen

On Thu, Jun 15, 2017 at 9:33 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
>
>
> On 6/16/2017 4:11 PM, Andy Lutomirski wrote:
>>
>> On Thu, Jun 15, 2017 at 8:46 PM, Huang, Kai <kai.huang@linux.intel.com>
>> wrote:
>>>
>>>
>>>
>>> On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
>>>>
>>>>
>>>> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> I don't know whether the SGX driver will have restrictions on running
>>>>> the provisioning enclave. In my understanding the provisioning enclave
>>>>> always comes from Intel. However, I am not an expert here and may be
>>>>> wrong. Can you point out *exactly* which restrictions on the host
>>>>> must/should be applied to the guest, so that Jarkko can know whether he
>>>>> will support those restrictions or not? Otherwise I don't think we even
>>>>> need to talk about this topic at the current stage.
>>>>>
>>>>
>>>> The whole point is that I don't know.  But here are two types of
>>>> restriction I can imagine demand for:
>>>>
>>>> 1. Only a particular approved provisioning enclave may run (be it
>>>> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
>>>> non-Intel provisioning enclave).  This would be done to restrict what
>>>> types of remote attestation can be done. (Intel supplies a remote
>>>> attestation service that uses some contractual policy that I don't
>>>> know.  Maybe a system owner wants a different policy applied to ISVs.)
>>>>    Imposing this policy on guests more or less requires filtering EINIT.
>>>
>>>
>>>
>>> Hi Andy,
>>>
>>> Sorry for the late reply.
>>>
>>> What is the issue if the host and guest run provisioning enclaves from
>>> different vendors? For example, the host runs Intel's provisioning
>>> enclave while a guest runs another vendor's provisioning enclave. Or
>>> different guests run provisioning enclaves from different vendors?
>>
>>
>> There's no issue unless someone has tried to impose a policy.  There
>> is clearly at least some interest in having policies that affect what
>> enclaves can run -- otherwise there wouldn't be LEs in the first
>> place.
>>
>>>
>>> One reason I am asking is that, on Xen (where we don't have the concept
>>> of a *host*), it's likely that we won't apply any policy in the Xen
>>> hypervisor at all, and guests will be able to run any enclave from any
>>> signer as they wish.
>>
>>
>> That seems entirely reasonable.  Someone may eventually ask Xen to add
>> support for SGX enclave restrictions, in which case you'll either have
>> to tell them that it won't happen or implement it.
>>
>>>
>>> Sorry that I don't understand (or kind of forgot) the issues here.
>>>
>>>>
>>>> 2. For kiosk-ish or single-purpose applications, I can imagine that
>>>> you would want to allow a specific list of enclave signers or even
>>>> enclave hashes. Maybe you would allow exactly one enclave hash.  You
>>>> could kludge this up with a restrictive LE policy, but you could also
>>>> do it for real by implementing the specific restriction in the kernel.
>>>> Then you'd want to impose it on the guest, and you'd do it by
>>>> filtering EINIT.
>>>
>>>
>>> Assuming the enclave hash means the measurement of the enclave, and
>>> assuming we have a policy that only allows enclaves from one signer to
>>> run, would you also elaborate on the issue if the host and guest run
>>> enclaves from different signers? If the host has such a policy, and we
>>> allow creating guests on such a host, I think that typically we will
>>> have the same policy in the guest
>>
>>
>> Yes, I presume this too, but.
>>
>>> (vetted by the guest's kernel). The owner of that host should be aware
>>> of the risk (if there is any) of creating a guest and running enclaves
>>> inside it.
>>
>>
>> No.  The host does not trust the guest in general.  If the host has a
>
>
> I agree.
>
>> policy that the only enclave that shall run is X, that doesn't mean
>> that the host shall reject all enclaves requested by the normal
>> userspace API except X but that, if /dev/kvm is used, then the user is
>> magically trusted to not load a guest that fails to respect the host
>> policy.  It means that the only enclave that shall run is X regardless
>> of what interface is used.  The host must only allow X to be loaded by
>> its userspace and the host must only allow X to be loaded by a guest.
>>
>
> This is a theoretical thing. I think your statement makes sense only if
> we have a specific example proving there is an actual risk when we allow
> a guest to exceed the X approved by the host.
>
> I will dig through your previous emails to see whether you have listed
> such real cases (sorry, I somewhat forgot), but if you don't mind, you
> can list such cases here.

I'm operating under the assumption that some kind of policy exists in
the first place.  I can imagine everything working fairly well without
any real policy, but apparently there are vendors who want restrictive
policies.  What I can't imagine is anyone who wants a restrictive
policy but is then okay with the host only partially enforcing it.

>
> Thanks,
> -Kai

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16 16:25                                           ` Andy Lutomirski
@ 2017-06-16 16:31                                             ` Christopherson, Sean J
  2017-06-16 16:43                                               ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Christopherson, Sean J @ 2017-06-16 16:31 UTC (permalink / raw)
  To: 'Andy Lutomirski', Huang, Kai
  Cc: intel-sgx-kernel-dev, kvm list, Radim Krcmar, Paolo Bonzini

Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Jun 15, 2017 at 9:33 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> >
> >
> > On 6/16/2017 4:11 PM, Andy Lutomirski wrote:
> >>
> >> On Thu, Jun 15, 2017 at 8:46 PM, Huang, Kai <kai.huang@linux.intel.com>
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 6/13/2017 11:00 AM, Andy Lutomirski wrote:
> >>>>
> >>>>
> >>>> On Mon, Jun 12, 2017 at 3:08 PM, Huang, Kai <kai.huang@linux.intel.com>
> >>>> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> I don't know whether the SGX driver will have restrictions on running
> >>>>> the provisioning enclave. In my understanding the provisioning enclave
> >>>>> always comes from Intel. However, I am not an expert here and may be
> >>>>> wrong. Can you point out *exactly* which restrictions on the host
> >>>>> must/should be applied to the guest, so that Jarkko can know whether he
> >>>>> will support those restrictions or not? Otherwise I don't think we even
> >>>>> need to talk about this topic at the current stage.
> >>>>>
> >>>>
> >>>> The whole point is that I don't know.  But here are two types of
> >>>> restriction I can imagine demand for:
> >>>>
> >>>> 1. Only a particular approved provisioning enclave may run (be it
> >>>> Intel's or otherwise -- with a non-Intel LE, I think you can launch a
> >>>> non-Intel provisioning enclave).  This would be done to restrict what
> >>>> types of remote attestation can be done. (Intel supplies a remote
> >>>> attestation service that uses some contractual policy that I don't
> >>>> know.  Maybe a system owner wants a different policy applied to ISVs.)
> >>>>    Imposing this policy on guests more or less requires filtering EINIT.
> >>>
> >>>
> >>>
> >>> Hi Andy,
> >>>
> >>> Sorry for the late reply.
> >>>
> >>> What is the issue if the host and guest run provisioning enclaves from
> >>> different vendors? For example, the host runs Intel's provisioning
> >>> enclave while a guest runs another vendor's provisioning enclave. Or
> >>> different guests run provisioning enclaves from different vendors?
> >>
> >>
> >> There's no issue unless someone has tried to impose a policy.  There
> >> is clearly at least some interest in having policies that affect what
> >> enclaves can run -- otherwise there wouldn't be LEs in the first
> >> place.
> >>
> >>>
> >>> One reason I am asking is that, on Xen (where we don't have the concept
> >>> of a *host*), it's likely that we won't apply any policy in the Xen
> >>> hypervisor at all, and guests will be able to run any enclave from any
> >>> signer as they wish.
> >>
> >>
> >> That seems entirely reasonable.  Someone may eventually ask Xen to add
> >> support for SGX enclave restrictions, in which case you'll either have
> >> to tell them that it won't happen or implement it.
> >>
> >>>
> >>> Sorry that I don't understand (or kind of forgot) the issues here.
> >>>
> >>>>
> >>>> 2. For kiosk-ish or single-purpose applications, I can imagine that
> >>>> you would want to allow a specific list of enclave signers or even
> >>>> enclave hashes. Maybe you would allow exactly one enclave hash.  You
> >>>> could kludge this up with a restrictive LE policy, but you could also
> >>>> do it for real by implementing the specific restriction in the kernel.
> >>>> Then you'd want to impose it on the guest, and you'd do it by
> >>>> filtering EINIT.
> >>>
> >>>
> >>> Assuming the enclave hash means the measurement of the enclave, and
> >>> assuming we have a policy that only allows enclaves from one signer to
> >>> run, would you also elaborate on the issue if the host and guest run
> >>> enclaves from different signers? If the host has such a policy, and we
> >>> allow creating guests on such a host, I think that typically we will
> >>> have the same policy in the guest
> >>
> >>
> >> Yes, I presume this too, but.
> >>
> >>> (vetted by the guest's kernel). The owner of that host should be aware
> >>> of the risk (if there is any) of creating a guest and running enclaves
> >>> inside it.
> >>
> >>
> >> No.  The host does not trust the guest in general.  If the host has a
> >
> >
> > I agree.
> >
> >> policy that the only enclave that shall run is X, that doesn't mean
> >> that the host shall reject all enclaves requested by the normal
> >> userspace API except X but that, if /dev/kvm is used, then the user is
> >> magically trusted to not load a guest that fails to respect the host
> >> policy.  It means that the only enclave that shall run is X regardless
> >> of what interface is used.  The host must only allow X to be loaded by
> >> its userspace and the host must only allow X to be loaded by a guest.
> >>
> >
> > This is a theoretical thing. I think your statement makes sense only if
> > we have a specific example proving there is an actual risk when we allow
> > a guest to exceed the X approved by the host.
> >
> > I will dig through your previous emails to see whether you have listed
> > such real cases (sorry, I somewhat forgot), but if you don't mind, you
> > can list such cases here.
> 
> I'm operating under the assumption that some kind of policy exists in
> the first place.  I can imagine everything working fairly well without
> any real policy, but apparently there are vendors who want restrictive
> policies.  What I can't imagine is anyone who wants a restrictive
> policy but is then okay with the host only partially enforcing it.

I think there is a certain amount of inception going on here, i.e. the only
reason we're discussing LE enforced policies in the kernel is because the LE
architecture exists and can't be disabled.  The LE, as originally designed,
is intended to be a way for *userspace* to control what code can run on the
system, e.g. to provide a hook for anti-virus/malware to inspect an enclave
since it's impossible to inspect an enclave once it is running.

The kernel doesn't need an LE to restrict what enclaves can run, e.g. it can
perform inspection at any point during the initialization process.  This is
true for guest enclaves as well since the kernel can trap EINIT.  By making
the LE kernel-only we've bastardized the concept of the LE and have negated
the primary value provided by an LE[1][2].  In my opinion, the discussion of
the kernel's launch policies is much ado about nothing, e.g. if supported by
hardware, I think we'd opt to disable launch control completely.

[1] On a system with unlocked IA32_SGXLEPUBKEYHASH MSRs, the only value added
by using an LE to enforce the kernel's policies is defense-in-depth, e.g. an
attacker can't hide malicious code in an enclave even if it gains control of
the kernel.  I think this is a very minor benefit since running in an enclave
doesn't grant any new privileges and doesn't persist across system reset.

[2] I think it's safe to assume that any use case that requires locked hash
MSRs is out of scope for this discussion, given that the upstream kernel will
require unlocked MSRs.
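
To make the EINIT-trapping point concrete, the filtering could look
roughly like the sketch below.  This is illustration only -- the helper
names (sgx_read_guest_sigstruct, sgx_signer_allowed, sgx_emulate_einit,
sgx_inject_einit_error) are invented, not a real driver or KVM API:

/*
 * Hypothetical sketch: filter ENCLS[EINIT] from a guest so the host's
 * launch policy also applies to guest enclaves.
 */
static int handle_encls_einit(struct kvm_vcpu *vcpu)
{
	struct sgx_sigstruct sig;
	gva_t gva = kvm_register_read(vcpu, VCPU_REGS_RBX); /* SIGSTRUCT */

	if (sgx_read_guest_sigstruct(vcpu, gva, &sig))
		return 1;	/* fault injected, re-enter the guest */

	/* Enforce the host policy on the guest's enclave, too. */
	if (!sgx_signer_allowed(&sig))
		return sgx_inject_einit_error(vcpu, SGX_INVALID_EINITTOKEN);

	/* Policy passed: execute EINIT on the guest's behalf. */
	return sgx_emulate_einit(vcpu, &sig);
}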

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-06-16 16:31                                             ` Christopherson, Sean J
@ 2017-06-16 16:43                                               ` Andy Lutomirski
  0 siblings, 0 replies; 78+ messages in thread
From: Andy Lutomirski @ 2017-06-16 16:43 UTC (permalink / raw)
  To: Christopherson, Sean J
  Cc: Andy Lutomirski, Huang, Kai, intel-sgx-kernel-dev, kvm list,
	Radim Krcmar, Paolo Bonzini

On Fri, Jun 16, 2017 at 9:31 AM, Christopherson, Sean J
<sean.j.christopherson@intel.com> wrote:
> I think there is a certain amount of inception going on here, i.e. the only
> reason we're discussing LE enforced policies in the kernel is because the LE
> architecture exists and can't be disabled.  The LE, as originally designed,
> is intended to be a way for *userspace* to control what code can run on the
> system, e.g. to provide a hook for anti-virus/malware to inspect an enclave
> since it's impossible to inspect an enclave once it is running.
>
> The kernel doesn't need an LE to restrict what enclaves can run, e.g. it can
> perform inspection at any point during the initialization process.  This is
> true for guest enclaves as well since the kernel can trap EINIT.  By making
> the LE kernel-only we've bastardized the concept of the LE and have negated
> the primary value provided by an LE[1][2].  In my opinion, the discussion of
> the kernel's launch policies is much ado about nothing, e.g. if supported by
> hardware, I think we'd opt to disable launch control completely.

Agreed.

I don't think I've ever said that the kernel should implement
restrictions on what enclaves should run [1].  All I've said is that
(a) if the kernel does implement restrictions like this, it should
apply them to guests as well and (b) that the kernel should probably
trap EINIT because that's the most sensible way to deal with the MSRs.

[1] With the possible exception of provisioning enclaves.  I'm still
not convinced that anyone except root should be allowed to run an
enclave with the provision bit set, as that bit gives access to the
provisioning key, which is rather special.  From memory, it bypasses
the owner epoch, and it may have privacy issues.  Maybe this is a
nonissue, but I'd like to see someone seriously analyze how
provisioning enclaves that may not be signed by Intel affect the
overall security of the system and how Linux should handle them.  SGX
was designed under the assumption that provisioning enclaves would
only ever be signed by Intel, and that's not the case any more, and
dealing with this intelligently may require some thought.
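
The kind of gate I have in mind would be small.  A sketch (names are
assumed, with the attribute bit taken from the SDM's SECS.ATTRIBUTES
layout):

#include <linux/bitops.h>
#include <linux/capability.h>
#include <linux/types.h>

/* SECS.ATTRIBUTES.PROVISIONKEY is bit 4 per the SDM. */
#define SGX_ATTR_PROVISIONKEY	BIT_ULL(4)

/* Sketch: allow the PROVISIONKEY attribute only for privileged callers. */
static int sgx_check_attributes(u64 attributes)
{
	if ((attributes & SGX_ATTR_PROVISIONKEY) && !capable(CAP_SYS_ADMIN))
		return -EPERM;

	return 0;
}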

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave
  2017-05-11  9:34     ` Huang, Kai
@ 2017-06-19  5:02       ` Huang, Kai
  2017-06-27 15:29         ` Radim Krčmář
  0 siblings, 1 reply; 78+ messages in thread
From: Huang, Kai @ 2017-06-19  5:02 UTC (permalink / raw)
  To: Paolo Bonzini, Kai Huang, rkrcmar, kvm



On 5/11/2017 9:34 PM, Huang, Kai wrote:
> 
> 
> On 5/8/2017 8:22 PM, Paolo Bonzini wrote:
>>
>>
>> On 08/05/2017 07:24, Kai Huang wrote:
>>> @@ -6977,6 +7042,31 @@ static __exit void hardware_unsetup(void)
>>>   */
>>>  static int handle_pause(struct kvm_vcpu *vcpu)
>>>  {
>>> +    /*
>>> +     * SDM 39.6.3 PAUSE Instruction.
>>> +     *
>>> +     * SDM suggests, if VMEXIT caused by 'PAUSE-loop exiting', VMM should
>>> +     * disable 'PAUSE-loop exiting' so PAUSE can be executed in Enclave
>>> +     * again without further PAUSE-looping VMEXIT.
>>> +     *
>>> +     * SDM suggests, if VMEXIT caused by 'PAUSE exiting', VMM should disable
>>> +     * 'PAUSE exiting' so PAUSE can be executed in Enclave again without
>>> +     * further PAUSE VMEXIT.
>>> +     */
>>
>> How is PLE re-enabled?
> 
> Currently it will not be enabled again. Perhaps we could re-enable it at 
> another VMEXIT, if that VMEXIT is not a PLE VMEXIT?

Hi Paolo, all,

Sorry for the late reply.

Do you think it is feasible to turn PLE back on at a later VMEXIT, or at 
a VMEXIT that is not from an enclave?

Any suggestions, so that I can improve the next version of the RFC?

> 
>>
>> I don't understand the interaction of the internal control registers
>> (paragraph 41.1.4) with VMX.  How can you migrate the VM between EENTER
>> and EEXIT?
> 
> The current SGX hardware architecture doesn't support live migration, as 
> the key architecture of SGX is not migratable. For example, some keys are 
> persistent and bound to the hardware (sealing and attestation). Therefore, 
> right now, if SGX is exposed to a guest, live migration is not supported.

We recently had a discussion on this. We figured out that we are able to 
support SGX live migration with some kind of workaround -- basically, the 
idea is that we can ignore the source VM's EPC and depend on the 
destination VM's SGX driver and userspace SW stack to handle the *sudden 
loss of EPC*. But this will cause some inconsistency with HW behavior, 
and will depend on the driver's ability. I'll elaborate on this in the 
next version's design and RFC, and we can discuss whether to support it 
or not (along with snapshot support). But maybe we can also have a 
detailed discussion now, if you want to start?

Thanks,
-Kai
> 
>>
>> In addition, paragraph 41.1.4 does not include the parts of CR_SAVE_FS*
>> and CR_SAVE_GS* (base, limit, access rights) and does not include
>> CR_ENCLAVE_ENTRY_IP.
> 
> The CPU can exit an enclave via EEXIT, or by Asynchronous Enclave Exit 
> (AEX). All non-EEXIT enclave exits are referred to as AEX. When an AEX 
> happens, a so-called "synthetic state" is loaded onto the CPU to prevent 
> any software from observing enclave *secrets* in the CPU state at the 
> AEX. Exactly what is put into the "synthetic state" is described in SDM 
> 40.3.
> 
> So in my understanding, the CPU won't put something like 
> "CR_ENCLAVE_ENTRY_IP" into RIP. Actually, during an AEX, the Asynchronous 
> Exit Pointer (AEP), which is in normal memory, is pushed onto the stack, 
> and IRET returns to the AEP to continue running. The AEP typically points 
> to a small piece of code which basically calls ERESUME, so that we can go 
> back into the enclave.
> 
> I hope my reply answered your questions.
> 
> Thanks,
> -Kai
> 
>>
>> Paolo
>>
>>> +    if (vmx_exit_from_enclave(vcpu)) {
>>> +        u32 exec_ctl, secondary_exec_ctl;
>>> +
>>> +        exec_ctl = vmx_exec_control(to_vmx(vcpu));
>>> +        exec_ctl &= ~CPU_BASED_PAUSE_EXITING;
>>> +        vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_ctl);
>>> +
>>> +        secondary_exec_ctl = vmx_secondary_exec_control(to_vmx(vcpu));
>>> +        secondary_exec_ctl &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>> +        vmcs_set_secondary_exec_control(secondary_exec_ctl);
>>> +
>>> +        return 1;
>>> +    }
>>> +
>>
> 
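
For completeness: the AEP trampoline described above can be a single
instruction, because the AEX synthetic state already loads RAX with the
ERESUME leaf (3) and RBX/RCX with the TCS and AEP.  Illustrative only,
not part of this series:

/* Userspace AEP trampoline: after an AEX the kernel IRETs here, and
 * ENCLU re-enters the enclave since RAX = 3 (ERESUME), RBX = TCS and
 * RCX = AEP were set by the AEX synthetic state. */
asm(".global sgx_aep\n"
    "sgx_aep:\n\t"
    "enclu\n");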

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave
  2017-06-19  5:02       ` Huang, Kai
@ 2017-06-27 15:29         ` Radim Krčmář
  2017-06-28 22:22           ` Huang, Kai
  0 siblings, 1 reply; 78+ messages in thread
From: Radim Krčmář @ 2017-06-27 15:29 UTC (permalink / raw)
  To: Huang, Kai; +Cc: Paolo Bonzini, Kai Huang, kvm

2017-06-19 17:02+1200, Huang, Kai:
> Hi Paolo, all,
> 
> Sorry for the late reply.

I'm sorry as well,

> Do you think it is feasible to turn PLE back on at a later VMEXIT, or at
> a VMEXIT that is not from an enclave?
> 
> Any suggestions, so that I can improve the next version of the RFC?

KVM doesn't enable "PAUSE exiting", it enables "PAUSE loop exiting".
SDM recommends disabling "PAUSE exiting" because the VM exits
(fault-like) on every PAUSE and needs intervention in order to progress,
but SDM doesn't say to disable "PAUSE loop exiting".

Being inside an enclave doesn't change the usefulness of PLE (yielding
the CPU to a task that isn't blocked), so I think it would be best to do
nothing with it, thanks.
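
Concretely, the enclave special case can just be deleted and the handler
kept as it is today -- roughly the following, simplified from memory
(details of the real code may differ):

/* A PAUSE-loop exit just yields the CPU and resumes the guest; nothing
 * enclave-specific is needed and the PLE controls stay enabled. */
static int handle_pause(struct kvm_vcpu *vcpu)
{
	if (ple_gap)
		grow_ple_window(vcpu);

	kvm_vcpu_on_spin(vcpu);
	return kvm_skip_emulated_instruction(vcpu);
}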

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave
  2017-06-27 15:29         ` Radim Krčmář
@ 2017-06-28 22:22           ` Huang, Kai
  0 siblings, 0 replies; 78+ messages in thread
From: Huang, Kai @ 2017-06-28 22:22 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Paolo Bonzini, Kai Huang, kvm



On 6/28/2017 3:29 AM, Radim Krčmář wrote:
> 2017-06-19 17:02+1200, Huang, Kai:
>> Hi Paolo, all,
>>
>> Sorry for the late reply.
> 
> I'm sorry as well,
> 
>> Do you think it is feasible to turn PLE back on at a later VMEXIT, or at
>> a VMEXIT that is not from an enclave?
>>
>> Any suggestions, so that I can improve the next version of the RFC?
> 
> KVM doesn't enable "PAUSE exiting", it enables "PAUSE loop exiting".
> SDM recommends disabling "PAUSE exiting" because the VM exits
> (fault-like) on every PAUSE and needs intervention in order to progress,
> but SDM doesn't say to disable "PAUSE loop exiting".
> 
> Being inside an enclave doesn't change the usefulness of PLE (yielding
> the CPU to a task that isn't blocked), so I think it would be best to do
> nothing with it, thanks.

Hi Radim,

Thanks for the feedback. You are right; obviously I didn't read the SDM 
carefully enough. Will fix in the next version. :)

Thanks,
-Kai
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [intel-sgx-kernel-dev] [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  2017-05-12  6:11         ` Andy Lutomirski
  2017-05-12 18:48           ` Christopherson, Sean J
  2017-05-16  0:48           ` Huang, Kai
@ 2017-07-19 15:04           ` Sean Christopherson
  2 siblings, 0 replies; 78+ messages in thread
From: Sean Christopherson @ 2017-07-19 15:04 UTC (permalink / raw)
  To: Andy Lutomirski, Huang, Kai, Jarkko Sakkinen
  Cc: kvm list, Radim Krcmar, haim.cohen, intel-sgx-kernel-dev, Paolo Bonzini

On Thu, 2017-05-11 at 23:11 -0700, Andy Lutomirski wrote:
> On Thu, May 11, 2017 at 9:56 PM, Huang, Kai <kai.huang@linux.intel.com> wrote:
> > 
> > > Have a percpu variable that stores the current SGXLEPUBKEYHASH along
> > > with whatever lock is needed (probably just a mutex).  Users of EINIT
> > > will take the mutex, compare the percpu variable to the desired value,
> > > and, if it's different, do WRMSR and update the percpu variable.
> > > 
> > > KVM will implement writes to SGXLEPUBKEYHASH by updating its in-memory
> > > state but *not* changing the MSRs.  KVM will trap and emulate EINIT to
> > > support the same handling as the host.  There is no action required at
> > > all on KVM guest entry and exit.
> > 
> > This is doable, but the SGX driver needs to do those things and expose
> > interfaces for KVM to use. As for the percpu data, it is nice to have,
> > but I am not sure whether it is mandatory, as IMO EINIT is not even on a
> > performance-critical path. We could simply read the old values out of the
> > MSRs and compare whether they equal the new ones.
> I think the SGX driver should probably live in arch/x86, and the
> interface could be a simple percpu variable that is exported (from the
> main kernel image, not from a module).

Jarkko, what are your thoughts on moving the SGX code into arch/x86 and removing
the option to build it as a module?  This would simplify the KVM and EPC cgroup
implementations.
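
For reference, the lazy-update scheme Andy describes could look roughly
like the sketch below.  Names are placeholders and __einit() stands in
for the ENCLS[EINIT] wrapper -- this is not the actual driver code, and
it assumes the cache is initialized from the real MSR values at boot:

#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/types.h>
#include <asm/msr.h>

struct sgx_le_hash {
	u64 h[4];	/* cached IA32_SGXLEPUBKEYHASH0..3 */
};

static DEFINE_PER_CPU(struct sgx_le_hash, sgx_le_hash_cache);
static DEFINE_MUTEX(sgx_einit_lock);

static int sgx_einit(void *sigstruct, void *token, void *secs,
		     const u64 hash[4])
{
	struct sgx_le_hash *cache;
	int i, ret;

	mutex_lock(&sgx_einit_lock);
	preempt_disable();	/* keep the WRMSRs + EINIT on one CPU */

	cache = this_cpu_ptr(&sgx_le_hash_cache);
	for (i = 0; i < 4; i++) {
		if (cache->h[i] != hash[i]) {
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);
			cache->h[i] = hash[i];
		}
	}

	ret = __einit(sigstruct, token, secs);	/* ENCLS[EINIT] wrapper */

	preempt_enable();
	mutex_unlock(&sgx_einit_lock);
	return ret;
}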

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2017-07-19 15:04 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-08  5:24 [RFC PATCH 00/10] Basic KVM SGX Virtualization support Kai Huang
2017-05-08  5:24 ` [PATCH 01/10] x86: add SGX Launch Control definition to cpufeature Kai Huang
2017-05-08  5:24 ` [PATCH 02/10] kvm: vmx: add ENCLS VMEXIT detection Kai Huang
2017-05-08  5:24 ` [PATCH 03/10] kvm: vmx: detect presence of host SGX driver Kai Huang
2017-05-08  5:24 ` [PATCH 04/10] kvm: sgx: new functions to init and destory SGX for guest Kai Huang
2017-05-08  5:24 ` [PATCH 05/10] kvm: x86: add KVM_GET_SUPPORTED_CPUID SGX support Kai Huang
2017-05-08  5:24 ` [PATCH 06/10] kvm: x86: add KVM_SET_CPUID2 " Kai Huang
2017-05-08  5:24 ` [PATCH 07/10] kvm: vmx: add SGX IA32_FEATURE_CONTROL MSR emulation Kai Huang
2017-05-08  5:24 ` [PATCH 08/10] kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support Kai Huang
2017-05-12  0:32   ` Huang, Kai
2017-05-12  3:28     ` [intel-sgx-kernel-dev] " Andy Lutomirski
2017-05-12  4:56       ` Huang, Kai
2017-05-12  6:11         ` Andy Lutomirski
2017-05-12 18:48           ` Christopherson, Sean J
2017-05-12 20:50             ` Christopherson, Sean J
2017-05-16  0:59             ` Huang, Kai
2017-05-16  1:22             ` Huang, Kai
2017-05-16  0:48           ` Huang, Kai
2017-05-16 14:21             ` Paolo Bonzini
2017-05-18  7:54               ` Huang, Kai
2017-05-18  8:58                 ` Paolo Bonzini
2017-05-17  0:09             ` Andy Lutomirski
2017-05-18  7:45               ` Huang, Kai
2017-06-06 20:52                 ` Huang, Kai
2017-06-06 21:22                   ` Andy Lutomirski
2017-06-06 22:51                     ` Huang, Kai
2017-06-07 14:45                       ` Cohen, Haim
2017-06-08 12:31                   ` Jarkko Sakkinen
2017-06-08 23:47                     ` Huang, Kai
2017-06-08 23:53                       ` Andy Lutomirski
2017-06-09 15:38                         ` Cohen, Haim
2017-06-10 12:23                       ` Jarkko Sakkinen
2017-06-11 22:45                         ` Huang, Kai
2017-06-12  8:36                           ` Jarkko Sakkinen
2017-06-12  9:53                             ` Huang, Kai
2017-06-12 16:24                               ` Andy Lutomirski
2017-06-12 22:08                                 ` Huang, Kai
2017-06-12 23:00                                   ` Andy Lutomirski
2017-06-16  3:46                                     ` Huang, Kai
2017-06-16  4:11                                       ` Andy Lutomirski
2017-06-16  4:33                                         ` Huang, Kai
2017-06-16  9:34                                           ` Huang, Kai
2017-06-16 16:03                                           ` Andy Lutomirski
2017-06-16 16:25                                           ` Andy Lutomirski
2017-06-16 16:31                                             ` Christopherson, Sean J
2017-06-16 16:43                                               ` Andy Lutomirski
2017-06-13 18:57                               ` Jarkko Sakkinen
2017-06-13 19:05                                 ` Jarkko Sakkinen
2017-06-13 20:13                                   ` Sean Christopherson
2017-06-14  9:37                                     ` Jarkko Sakkinen
2017-06-14 15:11                                       ` Christopherson, Sean J
2017-06-14 17:03                                         ` Jarkko Sakkinen
2017-06-13 23:28                                 ` Huang, Kai
2017-06-14  9:44                                   ` Jarkko Sakkinen
2017-07-19 15:04           ` Sean Christopherson
2017-05-15 12:46       ` Jarkko Sakkinen
2017-05-15 23:56         ` Huang, Kai
2017-05-16 14:23           ` Paolo Bonzini
2017-05-17 14:21           ` Sean Christopherson
2017-05-18  8:14             ` Huang, Kai
2017-05-20 21:55               ` Andy Lutomirski
2017-05-23  5:43                 ` Huang, Kai
2017-05-23  5:55                   ` Huang, Kai
2017-05-23 16:34                   ` Andy Lutomirski
2017-05-23 16:43                     ` Paolo Bonzini
2017-05-24  8:20                       ` Huang, Kai
2017-05-20 13:23           ` Jarkko Sakkinen
2017-05-08  5:24 ` [PATCH 09/10] kvm: vmx: handle ENCLS VMEXIT Kai Huang
2017-05-08  8:08   ` Paolo Bonzini
2017-05-10  1:30     ` Huang, Kai
2017-05-08  5:24 ` [PATCH 10/10] kvm: vmx: handle VMEXIT from SGX Enclave Kai Huang
2017-05-08  8:22   ` Paolo Bonzini
2017-05-11  9:34     ` Huang, Kai
2017-06-19  5:02       ` Huang, Kai
2017-06-27 15:29         ` Radim Krčmář
2017-06-28 22:22           ` Huang, Kai
2017-05-08  5:24 ` [PATCH 11/11] kvm: vmx: workaround FEATURE_CONTROL[17] is not set by BIOS Kai Huang
2017-05-08  5:29   ` Huang, Kai
