xen-devel.lists.xenproject.org archive mirror
* RFC on deprivileged x86 hypervisor device models
@ 2015-07-17 10:09 Ben Catterall
  2015-07-17 10:27 ` Fabio Fantoni
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ben Catterall @ 2015-07-17 10:09 UTC (permalink / raw)
  To: xen-devel, Andrew Cooper, JBeulich

Hi all,

I'm working on an x86 proof-of-concept series to evaluate whether it is 
feasible to move the device models and x86 emulation code for HVM 
guests, which currently run inside the hypervisor, into a deprivileged 
context.

I've put together the following document as I have been considering 
several different ways this could be achieved and was hoping to get 
feedback from maintainers before I go ahead.

Many thanks in advance,
Ben

Context
-------
The aim is to run device models, which are already running inside the 
hypervisor (e.g. x86 emulate), in deprivileged user mode for HVM guests, 
using suitably mapped page tables. A simple hypercall convention is 
needed to pass data between these two modes of operation and a mechanism 
to move between them.

This is intended as a proof-of-concept, with the aim of determining if 
this idea is feasible within performance constraints.

Motivation
----------
The motivation for moving the device models and x86 emulation code into 
ring 3 is to mitigate a system compromise due to a bug in any of these 
systems. These systems are currently part of the hypervisor and, 
consequently, a bug in any of them could allow an attacker to gain 
control of Xen and/or guests, or mount a denial-of-service attack.


Moving between privilege levels
--------------------------------
The general process is to determine if we need to run a device model (or 
similar) and then, if so, switch into deprivileged mode. The operation 
is performed by deprivileged code which calls into the hypervisor as and 
when needed. After the operation completes, we return to the hypervisor.

If deprivileged mode needs to make any hypervisor requests, it can do 
so using a syscall interface, possibly placing an operation code into a 
register to select the operation. This would allow it to pass data 
to/from the hypervisor.

I am currently considering three different methods for the context 
switch and would be grateful for any feedback.

Method One
----------
This method builds on top of the QEMU emulation path code. That code 
currently operates as a state machine, using flags to determine the 
current emulation state. These states determine which code paths to 
take when calling out of and back into the hypervisor before and after 
emulation.

The intention would be to add new states and then follow the same 
process as the existing code, except that, rather than blocking the 
vcpu, we switch into deprivileged mode and process the request on the 
current vcpu. This differs from the QEMU path, which blocks the 
current vcpu, so additional code is needed to support this context 
switch. There may be other code paths which have not been written in 
this way and would require rewriting.

When moving into deprivileged mode, we need to be careful to ensure 
that, once the device model completes, the call into the hypervisor can 
be redone without causing problems. Thus, we need to be _certain_ that 
the same call path is followed on re-entry and that the system's state 
can handle this. This may mean undoing operations such as memory 
allocations.
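
To make this concrete, here is a minimal sketch of the kind of state 
extension and dispatch I have in mind (all identifiers are hypothetical 
and simplified; the real emulation/ioreq states differ):

/* Existing QEMU-path states (simplified), plus two new ones. */
enum emul_state {
    EMUL_STATE_NONE,
    EMUL_STATE_IOREQ_INPROCESS,   /* request sent to QEMU; vcpu blocked */
    EMUL_STATE_IORESP_READY,      /* QEMU's response is ready */
    EMUL_STATE_DEPRIV_INPROCESS,  /* new: depriv mode runs on this vcpu */
    EMUL_STATE_DEPRIV_RESP_READY, /* new: depriv result ready on re-entry */
};

struct vcpu_ctx { enum emul_state emul_state; /* ... */ };

/* Stubs standing in for the mode switch and the completion path. */
static void depriv_mode_enter(struct vcpu_ctx *v) { (void)v; }
static void complete_emulation(struct vcpu_ctx *v) { (void)v; }

static void handle_emulation(struct vcpu_ctx *v)
{
    switch ( v->emul_state )
    {
    case EMUL_STATE_DEPRIV_INPROCESS:
        /* Unlike the QEMU path, don't block the vcpu: switch it into
         * deprivileged mode and service the request there. */
        depriv_mode_enter(v);
        break;
    case EMUL_STATE_DEPRIV_RESP_READY:
        /* Re-entry after the device model completed; this must follow
         * the same call path as the first entry (see caveat above). */
        complete_emulation(v);
        break;
    default:
        /* existing QEMU/ioreq handling */
        break;
    }
}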


Method Two
----------
At the point of detecting the need to perform a deprivileged operation, 
we copy the current stack, from the current stack position up to the 
point where the guest entered Xen, and save it aside. We then move the 
stack pointer back up to that entry point.

This effectively gives us a clean stack as though we had just entered 
Xen. We then put the deprivileged context onto this new stack and enter 
deprivileged mode.

Upon returning, we restore the previous stack with the guest's and 
Xen's context, then jump to the saved rip and continue execution. Xen 
will then perform the necessary processing, determining whether the 
operation was successful.

We are effectively context switching out Xen for deprivileged code and 
then bringing Xen back in once we're done.
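
A minimal sketch of the save/restore mechanics (structure and function 
names are hypothetical, and the actual privilege/stack switches are 
left as comments):

#include <stdint.h>
#include <string.h>

#define DEPRIV_STACK_SAVE_SIZE (32 * 1024) /* illustrative bound */

struct depriv_saved_stack {
    uint8_t  buf[DEPRIV_STACK_SAVE_SIZE];
    size_t   len;
    uint64_t rsp;   /* stack pointer at the point of the switch */
    uint64_t rip;   /* where to resume in Xen on return */
};

/* Copy [rsp, stack_top) away, where stack_top is the stack pointer
 * value at the point the guest entered Xen, then reset rsp to
 * stack_top as though we had just entered Xen. */
static int depriv_save_and_enter(struct depriv_saved_stack *s,
                                 uint64_t rsp, uint64_t stack_top,
                                 uint64_t resume_rip)
{
    s->len = stack_top - rsp;
    if ( s->len > sizeof(s->buf) )
        return -1;              /* stack deeper than we can save */
    s->rsp = rsp;
    s->rip = resume_rip;
    memcpy(s->buf, (void *)(uintptr_t)rsp, s->len);
    /* set rsp = stack_top, push the depriv context, drop to CPL 3 */
    return 0;
}

/* On return: put the saved Xen/guest context back on the primary stack
 * and jump to the saved rip to continue where we left off. */
static void depriv_restore_and_resume(struct depriv_saved_stack *s)
{
    memcpy((void *)(uintptr_t)s->rsp, s->buf, s->len);
    /* restore rsp = s->rsp and jmp *s->rip (asm in a real version) */
}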

As Xen is non-preemptive, the Xen stack won't be updated whilst we're 
in deprivileged mode. If it can be updated (I'm speculating here), e.g. 
by an interrupt, then we can pause deprivileged mode by hooking the 
interrupt and restoring the Xen stack, handle the interrupt, and 
finally go back to deprivileged mode.

Problem: if the device model or emulator edits the saved guest 
registers and these are then touched by Xen on the return path, after 
the deprivileged operation has been serviced, the guest will see Xen's 
values rather than those deprivileged mode provided.

This is not a problem if the code doesn't do this. If it does, we 
could give higher precedence to deprivileged changes: deprivileged 
mode pushes the changes into the hypervisor, which caches them and 
then, just before guest context is restored, applies them, thus 
discarding any changes Xen made.
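
A small sketch of that precedence scheme (hypothetical names and a 
deliberately tiny register subset; the real hook point would be just 
before the return-to-guest path restores guest state):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest register frame; Xen's real one is cpu_user_regs. */
struct guest_regs { uint64_t rax, rip, rflags; };

/* Edits pushed up by deprivileged mode via the syscall interface. */
struct depriv_reg_override {
    bool pending;
    struct guest_regs regs;
};

/* Syscall handler side: cache the deprivileged mode's register edits. */
static void depriv_cache_regs(struct depriv_reg_override *ov,
                              const struct guest_regs *edits)
{
    ov->regs = *edits;
    ov->pending = true;
}

/* Just before guest context is restored: deprivileged edits win,
 * overwriting anything Xen wrote on the return path. */
static void depriv_apply_regs(struct depriv_reg_override *ov,
                              struct guest_regs *r)
{
    if ( !ov->pending )
        return;
    *r = ov->regs;
    ov->pending = false;
}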



Method Three
------------
A per-vcpu stack is maintained for each of user mode and supervisor 
mode. We then don't need to do any copying; we just switch to user 
mode at the point where deprivileged code needs to run.

When deprivileged mode is done, we move back to supervisor mode, restore 
the previous context and continue execution of the code path that 
followed the call to move into deprivileged mode.
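
Sketched out (everything here is hypothetical; today Xen keeps one 
primary stack per pcpu, so none of these fields exist):

#include <stdint.h>

/* Each vcpu carries its own stack pair instead of sharing the pcpu's. */
struct vcpu_stacks {
    uint8_t *supervisor_stack;  /* ring 0; stays intact while in ring 3 */
    uint8_t *user_stack;        /* ring 3, for the deprivileged code */
    uint64_t saved_rsp;         /* supervisor rsp at the switch point */
};

/* Enter deprivileged mode: no copying; record where we are on the
 * supervisor stack and point the hardware at the user stack.  TSS.rsp0
 * must keep naming the supervisor stack so ring-0 re-entry (syscall,
 * interrupt, fault) lands back on it. */
static void depriv_enter(struct vcpu_stacks *s, uint64_t current_rsp)
{
    s->saved_rsp = current_rsp;
    /* load the user rsp and drop to CPL 3 (asm in a real version) */
}

/* Leave deprivileged mode: resume on the supervisor stack exactly
 * where the switch was made and continue the interrupted code path. */
static void depriv_leave(struct vcpu_stacks *s)
{
    (void)s;
    /* restore rsp = s->saved_rsp and return (asm in a real version) */
}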



Method Evaluation
-----------------
In method one, similarly to the QEMU path, we need to move up and down 
the call stack twice. We pay the cost of running the entry and exit 
code, which all methods do. Then we pay the cost of the code paths for 
moving into deprivileged mode from the call site and for moving from 
deprivileged mode back to the call site to handle the result. This 
means that we also destroy and then rebuild the stack. We also pay any 
allocation and deallocation costs twice, unless we can rewrite the 
code paths so that these can be avoided. A potential issue arises if 
changes made to Xen's state on the first entry mean that, on the 
second entry (returning from deprivileged mode), we take a different 
call path.

As mentioned, QEMU appears to do something similar so we can reuse much 
of this. The call tree is quite deep and broad so great care will need 
to be taken when making these changes to examine state-changing calls. 
Furthermore, such a change will be needed for each device, although this 
will be simpler after the first device is added.

The second method requires copying the stack and then restoring it. It 
doesn't pay the cost of following a return path into deprivileged mode 
or of moving back to the call site as it, effectively, skips all of 
this. Memory accesses on the stack are roughly the same as in the 
first method, but we do need enough storage to hold a copy of the 
stack for each vcpu. The edits to intermediate callers are likely to 
be simpler than for method one, as we don't need to worry about there 
being two different return paths. Adding a new device model would most 
likely be easier than in method one.

Method two appears to require fewer edits to the original source code, 
and I suspect it would be computationally more efficient than moving 
up and down the stack twice with multiple flag tests breaking the code 
up. However, that machinery already exists for the QEMU call paths, so 
method one may prove less troublesome/expensive than expected.

The third method _may_ require significant code refactoring: there is 
currently only one Xen stack per pcpu, so this could be a large change.


Summary
-------
Just to reiterate, this is intended as a proof-of-concept to measure how 
feasible such a feature is.

I'm currently on the fence between method one and method two.

Method one will require more attention to existing code paths and is 
less like a context-switch approach.

Method two will require less attention to existing code paths and is 
more like a context-switch approach.

I am unsure of method three as I suspect it would be a significant change.

Are there any potential issues or things which I have overlooked? 
Additionally, which (if any) of the above would you recommend pursuing 
or do you have any ideas regarding alternatives?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 10:09 RFC on deprivileged x86 hypervisor device models Ben Catterall
@ 2015-07-17 10:27 ` Fabio Fantoni
  2015-07-17 10:29   ` Andrew Cooper
  2015-07-17 12:32 ` Paul Durrant
  2015-07-17 14:20 ` Jan Beulich
  2 siblings, 1 reply; 10+ messages in thread
From: Fabio Fantoni @ 2015-07-17 10:27 UTC (permalink / raw)
  To: Ben Catterall, xen-devel, Andrew Cooper, JBeulich

On 17/07/2015 12:09, Ben Catterall wrote:
> Hi all,
>
> I'm working on an x86 proof-of-concept series to evaluate if it is 
> feasible to move device models currently running in the hypervisor and 
> x86 emulation code for HVM guests into a deprivileged context.
>
> I've put together the following document as I have been considering 
> several different ways this could be achieved and was hoping to get 
> feedback from maintainers before I go ahead.

Have you already taken a look at this patch series?
http://lists.xen.org/archives/html/xen-devel/2015-07/msg00108.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 10:27 ` Fabio Fantoni
@ 2015-07-17 10:29   ` Andrew Cooper
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Cooper @ 2015-07-17 10:29 UTC (permalink / raw)
  To: Fabio Fantoni, Ben Catterall, xen-devel, JBeulich

On 17/07/15 11:27, Fabio Fantoni wrote:
> On 17/07/2015 12:09, Ben Catterall wrote:
>> Hi all,
>>
>> I'm working on an x86 proof-of-concept series to evaluate if it is
>> feasible to move device models currently running in the hypervisor
>> and x86 emulation code for HVM guests into a deprivileged context.
>>
>> I've put together the following document as I have been considering
>> several different ways this could be achieved and was hoping to get
>> feedback from maintainers before I go ahead.
>
> Have you already taken a look at this patch series?
> http://lists.xen.org/archives/html/xen-devel/2015-07/msg00108.html
>

Running qemu as non-root has nothing whatsoever to do with this proposal.

This proposal concerns the bits which are emulated *in the hypervisor*.

~Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 10:09 RFC on deprivileged x86 hypervisor device models Ben Catterall
  2015-07-17 10:27 ` Fabio Fantoni
@ 2015-07-17 12:32 ` Paul Durrant
  2015-07-17 14:20 ` Jan Beulich
  2 siblings, 0 replies; 10+ messages in thread
From: Paul Durrant @ 2015-07-17 12:32 UTC (permalink / raw)
  To: Ben Catterall (Intern), xen-devel, Andrew Cooper, JBeulich

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ben Catterall
> Sent: 17 July 2015 11:10
> To: xen-devel@lists.xen.org; Andrew Cooper; JBeulich@suse.com
> Subject: [Xen-devel] RFC on deprivileged x86 hypervisor device models
> 
> Hi all,
> 
> I'm working on an x86 proof-of-concept series to evaluate if it is
> feasible to move device models currently running in the hypervisor and
> x86 emulation code for HVM guests into a deprivileged context.
> 

Why is that better than, say, moving the device models into a dedicated monolithic VM (like a global stub domain) and running them there? It gives you the depriv aspect and there's prior art.

  Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 10:09 RFC on deprivileged x86 hypervisor device models Ben Catterall
  2015-07-17 10:27 ` Fabio Fantoni
  2015-07-17 12:32 ` Paul Durrant
@ 2015-07-17 14:20 ` Jan Beulich
  2015-07-17 15:19   ` Ben Catterall
  2 siblings, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2015-07-17 14:20 UTC (permalink / raw)
  To: Ben Catterall; +Cc: Andrew Cooper, xen-devel

>>> On 17.07.15 at 12:09, <Ben.Catterall@citrix.com> wrote:
> Moving between privilege levels
> --------------------------------
> The general process is to determine if we need to run a device model (or 
> similar) and then, if so, switch into deprivileged mode. The operation 
> is performed by deprivileged code which calls into the hypervisor as and 
> when needed. After the operation completes, we return to the hypervisor.
> 
> If deprivileged mode needs to make any hypervisor requests, it can do 
> these using a syscall interface, possibly placing an operation code into 
> a register to indicate the operation. This would allow it to get data 
> to/from the hypervisor.

What I didn't understand from this as well as the individual models'
descriptions is in whose address space the device model is to be
run. Since you're hijacking the vCPU, it sounds like you're intending
Xen's address space to be re-used, just such that the code gets
run at CPL 3. Which would potentially even allow for read-only data
sharing (so that calls back into the hypervisor would be needed only
when data needs to be updated). But perhaps I guessed wrong?

If not, then method 2 would seem quite a bit less troublesome than
method 1, yet method 3 would (even if more involved in terms of
changes to be done) perhaps result in the most elegant result.

Again if not, whose runtime environment would the device model
use? It hardly would be qemu you intend to run that way, but
custom code would likely still require some runtime library code to
assist it. Do you mean to re-use hypervisor code for that (perhaps
again utilizing read-only [and executable] data sharing)?

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 14:20 ` Jan Beulich
@ 2015-07-17 15:19   ` Ben Catterall
  2015-07-17 15:38     ` Jan Beulich
  0 siblings, 1 reply; 10+ messages in thread
From: Ben Catterall @ 2015-07-17 15:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel



On 17/07/15 15:20, Jan Beulich wrote:
>>>> On 17.07.15 at 12:09, <Ben.Catterall@citrix.com> wrote:
>> Moving between privilege levels
>> --------------------------------
>> The general process is to determine if we need to run a device model (or
>> similar) and then, if so, switch into deprivileged mode. The operation
>> is performed by deprivileged code which calls into the hypervisor as and
>> when needed. After the operation completes, we return to the hypervisor.
>>
>> If deprivileged mode needs to make any hypervisor requests, it can do
>> these using a syscall interface, possibly placing an operation code into
>> a register to indicate the operation. This would allow it to get data
>> to/from the hypervisor.
> What I didn't understand from this as well as the individual models'
> descriptions is in whose address space the device model is to be
> run. Since you're hijacking the vCPU, it sounds like you're intending
> Xen's address space to be re-used, just such that the code gets
> run at CPL 3.
Yep, this will be in Xen's address space using a monitor table patch, 
mapping in the code for ring 3 execution.

> Which would potentially even allow for read-only data
> sharing (so that calls back into the hypervisor would be needed only
> when data needs to be updated). But perhaps I guessed wrong?
Yep, that sounds like a good idea for read-only data. I should be able 
to do page aliasing for this if I make the data and code page-aligned 
in their own sections, provided the code is compiled to be position 
independent and the data is accessed via a pointer. Andrew mentioned 
mapping in all of Xen's .text section so that I can make use of small 
helpers. More involved functionality would still need a hypercall due 
to hardcoded virtual addresses for data access.
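
For what it's worth, a sketch of the layout I have in mind (the section 
names, attribute macros and linker fragment are all assumptions on my 
part, not existing Xen code):

/* Tag depriv code/data so the linker groups it into its own sections. */
#define __depriv_text __attribute__((section(".text.depriv")))
#define __depriv_data __attribute__((section(".data.depriv")))

struct depriv_state { unsigned long scratch; };

/* Instance lives in .data.depriv; Xen would pass the ring-3 alias
 * address of this to the handler below. */
static struct depriv_state depriv_state_store __depriv_data;

/* Compiled with -fPIC; all data is reached via the passed-in pointer
 * so the code works at whatever alias address it is mapped for ring 3. */
int __depriv_text depriv_handle_portio(struct depriv_state *s,
                                       unsigned int port, int dir)
{
    s->scratch = port + dir;
    return 0;
}

/* Matching xen.lds.S-style fragment (again hypothetical):
 *   . = ALIGN(PAGE_SIZE);
 *   .text.depriv : { *(.text.depriv) }
 *   . = ALIGN(PAGE_SIZE);
 *   .data.depriv : { *(.data.depriv) }
 */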

>
> If not, then method 2 would seem quite a bit less troublesome than
> method 1, yet method 3 would (even if more involved in terms of
> changes to be done) perhaps result in the most elegant result.
I agree that method three is more elegant. If both you and Andrew are 
OK with going in a per-vcpu stack direction for Xen in general, then 
I'll write a per-vcpu stack patch first and then another patch which 
adds the ring 3 feature on top of that.

>
> Again if not, whose runtime environment would the device model
> use? It hardly would be qemu you intend to run that way, but
> custom code would likely still require some runtime library code to
> assist it. Do you mean to re-use hypervisor code for that (perhaps
> again utilizing read-only [and executable] data sharing)?
>
> Jan
>
Thanks once again,
Ben

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 15:19   ` Ben Catterall
@ 2015-07-17 15:38     ` Jan Beulich
  2015-07-20 13:43       ` Andrew Cooper
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2015-07-17 15:38 UTC (permalink / raw)
  To: Ben Catterall; +Cc: Andrew Cooper, xen-devel

>>> On 17.07.15 at 17:19, <Ben.Catterall@citrix.com> wrote:
> On 17/07/15 15:20, Jan Beulich wrote:
>> If not, then method 2 would seem quite a bit less troublesome than
>> method 1, yet method 3 would (even if more involved in terms of
>> changes to be done) perhaps result in the most elegant result.
> I agree that method three is more elegant. If both you and Andrew are ok 
> with going in a per-vcpu stack direction for Xen in general then I'll 
> write a per-vcpu patch first and then do another patch which adds the 
> ring 3 feature on-top of that.

Actually improvements to common/wait.c have also been thought of
long ago already, for whenever per-vCPU stacks would be available.
The few users of these interfaces never resulted in this becoming
important enough a work item, unfortunately.

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-17 15:38     ` Jan Beulich
@ 2015-07-20 13:43       ` Andrew Cooper
  2015-07-20 13:58         ` Jan Beulich
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2015-07-20 13:43 UTC (permalink / raw)
  To: Jan Beulich, Ben Catterall; +Cc: xen-devel

On 17/07/15 16:38, Jan Beulich wrote:
>>>> On 17.07.15 at 17:19, <Ben.Catterall@citrix.com> wrote:
>> On 17/07/15 15:20, Jan Beulich wrote:
>>> If not, then method 2 would seem quite a bit less troublesome than
>>> method 1, yet method 3 would (even if more involved in terms of
>>> changes to be done) perhaps result in the most elegant result.
>> I agree that method three is more elegant. If both you and Andrew are ok 
>> with going in a per-vcpu stack direction for Xen in general then I'll 
>> write a per-vcpu patch first and then do another patch which adds the 
>> ring 3 feature on-top of that.
> Actually improvements to common/wait.c have also been thought of
> long ago already, for whenever per-vCPU stacks would be available.
> The few users of these interfaces never resulted in this becoming
> important enough a work item, unfortunately.

While per-vcpu stacks would be nice, there are a number of challenges to
be overcome before they can sensibly be used.  Off the top of my head:
per-vcpu state living at the top of the stack, hard-coded stack
addresses in the emulation stubs, IST stacks moving back onto the
primary stack, and splitting out a separate irq stack if the ring0 stack
wants to be left with partial state on it.

Therefore I recommend method 2, to reuse the existing kludge we have.
That will allow you to actually investigate some of the depriv aspects
in the time you have.

~Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-20 13:43       ` Andrew Cooper
@ 2015-07-20 13:58         ` Jan Beulich
  2015-07-20 14:10           ` Ben Catterall
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2015-07-20 13:58 UTC (permalink / raw)
  To: Andrew Cooper, Ben Catterall; +Cc: xen-devel

>>> On 20.07.15 at 15:43, <andrew.cooper3@citrix.com> wrote:
> On 17/07/15 16:38, Jan Beulich wrote:
>>>>> On 17.07.15 at 17:19, <Ben.Catterall@citrix.com> wrote:
>>> On 17/07/15 15:20, Jan Beulich wrote:
>>>> If not, then method 2 would seem quite a bit less troublesome than
>>>> method 1, yet method 3 would (even if more involved in terms of
>>>> changes to be done) perhaps result in the most elegant result.
>>> I agree that method three is more elegant. If both you and Andrew are ok 
>>> with going in a per-vcpu stack direction for Xen in general then I'll 
>>> write a per-vcpu patch first and then do another patch which adds the 
>>> ring 3 feature on-top of that.
>> Actually improvements to common/wait.c have also been thought of
>> long ago already, for whenever per-vCPU stacks would be available.
>> The few users of these interfaces never resulted in this becoming
>> important enough a work item, unfortunately.
> 
> While per-vcpu stacks would be nice, there are a number of challenges to
> be overcome before they can sensibly be used.  Off the top of my head:
> per-vcpu state living at the top of the stack, hard-coded stack
> addresses in the emulation stubs, IST stacks moving back onto the
> primary stack, and splitting out a separate irq stack if the ring0 stack
> wants to be left with partial state on it.
> 
> Therefore I recommend method 2, to reuse the existing kludge we have.
> That will allow you to actually investigate some of the depriv aspects
> in the time you have.

Yeah, I too meant to point out that this per-pCPU stack work is
likely too much to be done as a preparatory thing here; I simply
forgot.

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: RFC on deprivileged x86 hypervisor device models
  2015-07-20 13:58         ` Jan Beulich
@ 2015-07-20 14:10           ` Ben Catterall
  0 siblings, 0 replies; 10+ messages in thread
From: Ben Catterall @ 2015-07-20 14:10 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: xen-devel



On 20/07/15 14:58, Jan Beulich wrote:
>>>> On 20.07.15 at 15:43, <andrew.cooper3@citrix.com> wrote:
>> On 17/07/15 16:38, Jan Beulich wrote:
>>>>>> On 17.07.15 at 17:19, <Ben.Catterall@citrix.com> wrote:
>>>> On 17/07/15 15:20, Jan Beulich wrote:
>>>>> If not, then method 2 would seem quite a bit less troublesome than
>>>>> method 1, yet method 3 would (even if more involved in terms of
>>>>> changes to be done) perhaps result in the most elegant result.
>>>> I agree that method three is more elegant. If both you and Andrew are ok
>>>> with going in a per-vcpu stack direction for Xen in general then I'll
>>>> write a per-vcpu patch first and then do another patch which adds the
>>>> ring 3 feature on-top of that.
>>> Actually improvements to common/wait.c have also been thought of
>>> long ago already, for whenever per-vCPU stacks would be available.
>>> The few users of these interfaces never resulted in this becoming
>>> important enough a work item, unfortunately.
>> While per-vcpu stacks would be nice, there are a number of challenges to
>> be overcome before they can sensibly be used.  Off the top of my head:
>> per-vcpu state living at the top of the stack, hard-coded stack
>> addresses in the emulation stubs, IST stacks moving back onto the
>> primary stack, and splitting out a separate irq stack if the ring0 stack
>> wants to be left with partial state on it.
>>
>> Therefore I recommend method 2, to reuse the existing kludge we have.
>> That will allow you to actually investigate some of the depriv aspects
>> in the time you have.
> Yeah, I too meant to point out that this per-pCPU stack work is
> likely too much to be done as a preparatory thing here; I simply
> forgot.
>
> Jan
>
Ok. It sounds like I should go with method two in that case. Thanks for 
the feedback!

Ben

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread

Thread overview: 10+ messages
2015-07-17 10:09 RFC on deprivileged x86 hypervisor device models Ben Catterall
2015-07-17 10:27 ` Fabio Fantoni
2015-07-17 10:29   ` Andrew Cooper
2015-07-17 12:32 ` Paul Durrant
2015-07-17 14:20 ` Jan Beulich
2015-07-17 15:19   ` Ben Catterall
2015-07-17 15:38     ` Jan Beulich
2015-07-20 13:43       ` Andrew Cooper
2015-07-20 13:58         ` Jan Beulich
2015-07-20 14:10           ` Ben Catterall
