RFC on deprivileged x86 hypervisor device models

* RFC on deprivileged x86 hypervisor device models
@ 2015-07-17 10:09 Ben Catterall
  2015-07-17 10:27 ` Fabio Fantoni
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ben Catterall @ 2015-07-17 10:09 UTC (permalink / raw)
  To: xen-devel, Andrew Cooper, JBeulich

Hi all,

I'm working on an x86 proof-of-concept series to evaluate if it is 
feasible to move device models currently running in the hypervisor and 
x86 emulation code for HVM guests into a deprivileged context.

I've put together the following document as I have been considering 
several different ways this could be achieved and was hoping to get 
feedback from maintainers before I go ahead.

Many thanks in advance,
Ben

Context
-------
The aim is to run device models, which are already running inside the 
hypervisor (e.g. x86 emulate), in deprivileged user mode for HVM guests, 
using suitably mapped page tables. A simple hypercall convention is 
needed to pass data between these two modes of operation and a mechanism 
to move between them.

This is intended as a proof-of-concept, with the aim of determining if 
this idea is feasible within performance constraints.

Motivation
----------
The motivation for moving the device models and x86 emulation code into 
ring 3 is to mitigate a system compromise due a bug in any of these 
systems. These systems are currently part of the hypervisor and, 
consequently, a bug in any of these could allow an attacker to gain 
control (or perform a DOS) of Xen and/or guests.

Moving between privilege levels
--------------------------------
The general process is to determine if we need to run a device model (or 
similar) and then, if so, switch into deprivileged mode. The operation 
is performed by deprivileged code which calls into the hypervisor as and 
when needed. After the operation completes, we return to the hypervisor.

If deprivileged mode needs to make any hypervisor requests, it can do 
these using a syscall interface, possibly placing an operation code into 
a register to indicate the operation. This would allow it to get data 
to/from the hypervisor.

I am currently considering three different methods regarding the context 
switch and would be grateful of any feedback.

Method One
----------
This method works by building on top of the QEMU emulation path code. 
This currently operates as a state machine, using flags to determine the 
current emulation state. These states are used to determine which code 
paths to take when calling out of and into the hypervisor before and 
after emulation.

The intention would be to add new states and then follow the same 
process as existing code does except, rather than blocking the vcpu, we 
switch into deprivileged mode and process the request on the current 
vcpu. This is different to QEMU which blocks the current vcpu so 
additional code is needed to support this context switch. There may be 
other code paths which have not been written in this way which would 
require rewriting.

When moving into deprivileged mode, we need to be careful to ensure that 
when we leave, we can redo the call into the hypervisor after the device 
model completes without causing problems. Thus, we need to be _certain_ 
that the same call path is followed on the re-entry and that the 
system's state can handle this. This may mean undoing operations such as 
memory allocations.

Method Two
----------
At the point of detecting the need to perform a deprivileged operation, 
we take a copy of the current stack from the current stack position up 
to the point where the guest entered Xen and save it. Subsequently, we 
move the stack pointer back.

This effectively gives us a clean stack as though we had just entered 
Xen. We then put the deprivileged context onto this new stack and enter 
deprivileged mode.

Upon returning, we restore the previous stack with the guest's and Xen's 
context then jump to the saved rip and continue execution. Xen will then 
perform the necessary processing, determining if the operation was 
successful or not.

We are effectively context switching out Xen for deprivileged code and 
then bringing Xen back in once we're done.

As Xen is non-preemptive, the Xen stack won't be updated whilst we're in 
deprivileged mode. If it may be updated (I'm speculating here), e.g. an 
interrupt, then we can pause deprivileged mode by hooking the interrupt 
and restoring the Xen stack, then handle the interrupt and finally go 
back to deprivileged mode.

Problem: If the device model or emulator edit the saved guest registers 
and these are touched by Xen on the return path after finishing 
servicing the deprivileged operation, then the guest will use these 
values not those the deprivileged mode provided.

This is not a problem if the code doesn't do this. If it does, we could 
give higher precedence to deprivileged changes. So, deprivileged mode 
pushes the changes into the hypervisor which caches them and then, just 
before guest context is restored, makes those changes, thus discarding 
any Xen made.

Method Three
------------
A per vcpu stack is maintained for user mode and supervisor mode. We 
then don't need to do any copying, just switch to user mode at the point 
when deprivileged code needs to run.

When deprivileged mode is done, we move back to supervisor mode, restore 
the previous context and continue execution of the code path that 
followed the call to move into deprivileged mode.

Method Evaluation
-----------------
In method one, similarly to the QEMU path, we need to move up and down 
the call stack twice. We pay the cost of running the entry and exit 
code, which all methods will. Then we pay the cost of the code paths for 
moving into deprivileged mode from the call site and for moving from 
deprivileged mode back to the call site to handle the result. This means 
that we also destroy and then rebuild the stack. We also pay any 
allocation and deallocation costs twice, unless we can re-write the code 
paths so that these can be avoided. A potential issue would be if any 
changes are made to Xen's state on the first entry which mean that on 
the second entry (returning from deprivileged mode), we take a different 
call path.

As mentioned, QEMU appears to do something similar so we can reuse much 
of this. The call tree is quite deep and broad so great care will need 
to be taken when making these changes to examine state-changing calls. 
Furthermore, such a change will be needed for each device, although this 
will be simpler after the first device is added.

The second method requires copying the stack and then restoring it. It 
doesn't pay the costs of following a return path into deprivileged mode 
or moving back to the call site as it, effectively, skips all of this. 
Memory accesses on the stack are roughly the same as the first method 
but, we do need enough storage to hold a copy of the stack for each 
vcpu. The edits to intermediate callers are likely to be simpler than 
method one, as we don't need to worry about there being two different 
return paths. Adding a new device model would most likely be easier than 
method one.

Method two appears to require fewer edits to the original source code 
and I suspect would be more efficient computationally than moving up and 
down the stack twice with multiple flag tests breaking code up. However, 
this has already been done for QEMU call paths so this may prove less 
troublesome/ expensive than expected.

The third method _may_ require significant code refactoring as 
currently, there is only one stack per pcpu so this may be a large change.

Summary
-------
Just to reiterate, this is intended as a proof-of-concept to measure how 
feasible such a feature is.

I'm currently on the fence between method one and method two.

Method one will require more attention to existing code paths and is 
less like a context-switch approach.

Method two will require less attention to existing code paths and is 
more like a context-switching approach.

I am unsure of method three as I suspect it would be a significant change.

Are there any potential issues or things which I have overlooked? 
Additionally, which (if any) of the above would you recommend pursuing 
or do you have any ideas regarding alternatives?

^ permalink raw reply	[flat|nested] 10+ messages in thread