* Best way to use altp2m to support VMFUNC EPT-switching?
@ 2023-03-15  2:01 Johnson, Ethan
  2023-03-15  9:22 ` Andrew Cooper
  0 siblings, 1 reply; 7+ messages in thread
From: Johnson, Ethan @ 2023-03-15  2:01 UTC (permalink / raw)
  To: xen-devel

Hi all,

I'm looking for some pointers on how Xen's altp2m system works and how it's meant to be used with Intel's VMFUNC EPT-switching for secure isolation within an HVM/PVH guest's kernelspace.

Specifically, I am attempting to modify Xen to create (on request by an already-booted, cooperative guest with a duly modified Linux kernel) a second set of extended page tables that have access to additional privileged regions of host-physical memory (specifically, a page or two to store some sensitive data that we don't want the guest kernel to be able to overwrite, plus some host-physical MMIO ranges, specifically the xAPIC region). The idea is that the guest kernel will use VMFUNC to switch to the alternate EPTs and call "secure functions" provided (by the hypervisor) as read-only code to be executed in non-root mode on the alternate EPT, allowing certain VM-exit scenarios (namely, sending an IPI to another vCPU of the same domain) to be handled without exiting non-root mode. Hence, these extra privileged pages should only be visible to the alternative p2m that the "secure realm" functions live in. (Transitions between the secure- and insecure-realm EPTs will be through special read-only "trampoline" code pages that ensure the untrusted guest kernel can only enter the secure realm at designated entry points.)

Looking at Xen's existing altp2m code, I get the sense that Xen is already designed to support something at least vaguely like this. I have not, however, been able to find much in the way of documentation on altp2m, so I am reaching out to see if anyone can offer pointers on how to best use it.

What is the intended workflow (either in the toolstack or within the hypervisor itself) for creating and configuring an altp2m that should have access to additional host-physical frames that are not present in the guest's main p2m?

FWIW, once the altp2m has been set up in this fashion, we don't anticipate needing to fiddle with its mappings any further as long as the guest is running (so I'm thinking *maybe* the "external" altp2m mode will suffice for this). In fact, we may not even need to have any "overlap" between the primary and alternative p2m except the trampoline pages themselves (although this aspect of our design is still somewhat in flux).

I've noticed a function, do_altp2m_op(), in the hypervisor (xen/arch/x86/hvm/hvm.c) that seems to implement a number of altp2m-related hypercalls intended to be called from the dom0. Do these hypercalls already provide a straightforward way to achieve my goals described above entirely via (a potentially modified version of) the dom0 toolstack? Or would I be better off creating and configuring the altp2m from within the hypervisor itself, since I want to map low-level stuff like xAPIC MMIO ranges into the altp2m?

Thank you in advance for your time and assistance!

Sincerely,

Ethan Johnson
Computer Science PhD candidate, Systems group, University of Rochester
mailto:ejohns48@cs.rochester.edu



* Re: Best way to use altp2m to support VMFUNC EPT-switching?
  2023-03-15  2:01 Best way to use altp2m to support VMFUNC EPT-switching? Johnson, Ethan
@ 2023-03-15  9:22 ` Andrew Cooper
  2023-03-15 21:41   ` Johnson, Ethan
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2023-03-15  9:22 UTC (permalink / raw)
  To: Johnson, Ethan, xen-devel

Hello,

There's a lot to unpack here, but before I do so, one question.  In your
usecase, are you wanting to map any frames with reduced permissions
(i.e. such that you'd get a #VE exception), or are you just looking to
add new frames with RWX perms into an alternative view?

I suspect the latter, but it's not completely clear, and changes the answer.

~Andrew



* RE: Best way to use altp2m to support VMFUNC EPT-switching?
  2023-03-15  9:22 ` Andrew Cooper
@ 2023-03-15 21:41   ` Johnson, Ethan
  2023-03-16  2:14     ` Andrew Cooper
  0 siblings, 1 reply; 7+ messages in thread
From: Johnson, Ethan @ 2023-03-15 21:41 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel

Yes, the latter is correct: I am looking to add new frames with RWX perms
into an alternative view. I don't currently envision needing #VE in any form
for this work.

(We're using a modified PVH Linux guest for this, so rather than needing to
intercept and react to EPT faults via #VE, we can expect the guest kernel to
explicitly call our secure-realm functions via VMFUNC, replacing what would
otherwise be some hypercalls out to Xen in root mode. I suppose supporting
unmodified kernels by intercepting #VE could be an interesting enhancement
for future work, but for now we're content to work with a cooperative
modified PVH guest as a proof of concept. :-))

Basically, the primary p2m will be (largely) as it is normally, and the
untrusted guest kernel and userspace will run on it as an HVM/PVH guest
normally would. The alternate p2m will have some additional private code and
data pages mapped in (RWX in the altp2m, but either read-only or completely
unmapped in the primary p2m), as well as the host's xAPIC MMIO range so it
can send IPIs to other vCPUs without having to VM-exit. To facilitate safe
transitions between these two "realms", we'll be adding a couple of
R/X-permissioned "trampoline pages" that will contain the VMFUNC instructions
themselves and will be present in both p2ms.

Thanks,

Ethan Johnson
Computer Science PhD candidate, Systems group, University of Rochester
ejohns48@cs.rochester.edu


* Re: Best way to use altp2m to support VMFUNC EPT-switching?
  2023-03-15 21:41   ` Johnson, Ethan
@ 2023-03-16  2:14     ` Andrew Cooper
  2023-03-30  2:29       ` Johnson, Ethan
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2023-03-16  2:14 UTC (permalink / raw)
  To: Johnson, Ethan, xen-devel

Ok, so there is a lot here.  Apologies in advance for the overly long
answer.

First, while altp2m was developed in parallel with EPTP-switching, we
took care to split the vendor neutral parts from the vendor specific
bits.  So while we do have VMFUNC support, that's considered "just" a
hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.

But before you start, it is important to understand your security
boundaries.  You've found external mode, and this is all about
controlling which aspects of altp2m the guest can invoke itself, and
modes other than external let the guest issue HVMOP_altp2m ops itself.

If you permit the guest to change views itself, either with VMFUNC, or
HVMOP_altp2m_switch_p2m, you have to realise that these are just
"regular" CPL0 actions, and can be invoked by any kernel code, not just
your driver.  i.e. the union of all primary and alternative views is one
single security domain.

For some usecases this is fine, but yours doesn't look like it fits in
this category.  In particular, no amount of protection on the trampoline
pages stops someone writing a VMFUNC instruction elsewhere in kernel
space and executing it.
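
To make the point above concrete: VMFUNC is the three-byte sequence 0F 01 D4, so
any tooling that tries to keep it out of untrusted kernel code has to look for
those bytes at every offset, including inside immediates and across instruction
boundaries. A minimal sketch of such a scan follows. The function name
count_vmfunc_patterns() is hypothetical and is not part of Xen or any existing
tool, and a byte scan on its own does not amount to a security boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* VMFUNC is encoded as the three bytes 0F 01 D4. */
static const uint8_t VMFUNC_OPCODE[3] = { 0x0f, 0x01, 0xd4 };

/*
 * Hypothetical helper: count every offset in buf at which the VMFUNC byte
 * pattern occurs, whether or not that offset is an instruction boundary.
 */
size_t count_vmfunc_patterns(const uint8_t *buf, size_t len)
{
    size_t hits = 0;

    for (size_t i = 0; i + sizeof(VMFUNC_OPCODE) <= len; i++) {
        if (buf[i] == VMFUNC_OPCODE[0] &&
            buf[i + 1] == VMFUNC_OPCODE[1] &&
            buf[i + 2] == VMFUNC_OPCODE[2])
            hits++;
    }

    return hits;
}
```

For example, the bytes B8 0F 01 D4 00 (a mov of an immediate into eax) contain
the pattern once, even though no VMFUNC instruction was intended there.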

(I have seen plenty of research papers try to construct a security
boundary around VMFUNC.  I have yet to see one that does so robustly, but I
do enjoy being surprised on occasion...)

The first production use of this technology I'm aware of was Bitdefender's
HVMI, where the guest had no control at all, and was subject to the
permission restrictions imposed on it by the agent in dom0.  The agent
trapped everything it considered sensitive, including writes to
sensitive areas of memory using reduced EPT permissions, and either
permitted execution to continue, or took other preventative action.

This highlights another key point.  Some entity in the system needs to
deal with faults that occur when the guest accidentally (or otherwise)
violates the reduced EPT permissions.  #VE is, again, an optimisation to
let violations be handled in guest context, rather than taking a VMExit,
but even with #VE the complicated corner cases are left to the external
agent.

With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to
mitigate the perf hit from Windows' Meltdown mitigation electing to use
LOCK'd BTS/BTC operations on pagetables (which were write protected
behind the scenes), but I'm reliably informed that the hoops required to
jump through to make that work, and in particular avoid the notice of
PatchGuard, were substantial.

Perhaps a more accessible example is
https://github.com/intel/kernel-fuzzer-for-xen-project and the
underlying libvmi.  There is also a very basic example in
tools/misc/xen-access.c in the Xen tree.

For your question specifically about mapping other frames, we do have
hypercalls to map other frames (it's necessary for e.g. mapping BARs of
passed-through PCI devices), but for obvious reasons, it's restricted to
control software (Qemu) in dom0.  I suspect we don't actually have a
hypercall to map MMIO into an alternative view, but it shouldn't be hard
to add (if you still decide you want it by the end of this email).


But on to the specifics of mapping the xAPIC page.  Sorry, but
irrespective of altp2m, that is a non-starter, for reasons that date
back to ~1997 or thereabouts.

It's worth saying that AMD can fully virtualise IPI delivery from one
vCPU to another without taking a VMExit in the common case, since
Zen1 (IIRC).  Intel has a similar capability since Sapphire Rapids
(IIRC).  Xen doesn't support either yet, because there are only so many
hours in the day...

It is technically possible to map the xAPIC window into a guest, and
such a guest could interact with the real interrupt controller.  But now
you've got the problem that two bits of software (Xen, and your magic
piece of guest kernel) are trying to drive the same single interrupt
controller.

Even if you were to say that the guest would only use ICR to send
interrupts, that still doesn't work.  In xAPIC, ICR is formed of two
half registers, as it dates from the days of 32bit processors, with a
large stride between the two half registers.

Therefore, it is a minimum of two separate instructions (set destination
in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.

A common bug in kernels is to try and send IPIs when interrupts are
enabled, or in NMI context, both of which could interrupt an IPI
sequence.  This results in a sequence of writes (from the LAPIC's point
of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI
to be sent with the wrong destination.
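
The interleaving described above can be reproduced with a toy model. This is
only a sketch: struct mock_lapic and the helper names are made up for
illustration and model nothing beyond "ICR_HI holds the destination in bits
31:24, and the IPI is sent when ICR_LO is written".

```c
#include <stdint.h>

/* Toy model of the xAPIC ICR: two 32-bit halves, IPI sent on the ICR_LO write. */
struct mock_lapic {
    uint32_t icr_hi;      /* destination lives in bits 31:24 */
    uint8_t  last_dest;   /* destination of the most recently sent IPI */
    uint8_t  last_vector; /* vector of the most recently sent IPI */
};

static void write_icr_hi(struct mock_lapic *l, uint8_t dest)
{
    l->icr_hi = (uint32_t)dest << 24;
}

static void write_icr_lo(struct mock_lapic *l, uint8_t vector)
{
    /* The IPI goes to whatever destination ICR_HI holds at this moment. */
    l->last_dest = (uint8_t)(l->icr_hi >> 24);
    l->last_vector = vector;
}

/*
 * The outer sender is interrupted between its two writes by a nested sender.
 * Returns the destination that the outer IPI actually ends up using.
 */
uint8_t outer_dest_after_interleave(uint8_t outer_dest, uint8_t nested_dest)
{
    struct mock_lapic l = { 0 };

    write_icr_hi(&l, outer_dest);   /* outer:  ICR_HI */
    write_icr_hi(&l, nested_dest);  /* nested: ICR_HI */
    write_icr_lo(&l, 0x41);         /* nested: ICR_LO, sent correctly */
    write_icr_lo(&l, 0x40);         /* outer:  ICR_LO, uses stale ICR_HI */

    return l.last_dest;
}
```

With outer_dest = 1 and nested_dest = 2, the outer IPI is delivered to
destination 2 rather than 1.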

Guests always execute with IRQs enabled, but can take a VMExit on any
arbitrary instruction boundary for other reasons, so the guest kernel
can never be sure that ICR_HI hasn't been modified by Xen in the
background, even if it used two adjacent instructions to send the IPI.

Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was
making ICR a single register, so it could be written atomically.  But
now you have an MSR based interface, not an MMIO based interface.
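
For comparison, here is a sketch of how the single 64-bit x2APIC ICR value is
laid out: destination APIC ID in bits 63:32, vector in bits 7:0, written in one
go to MSR 0x830. The function name x2apic_icr_fixed() is hypothetical, it only
builds the value for a Fixed-delivery, physical-destination IPI, and it does not
perform the WRMSR itself.

```c
#include <stdint.h>

#define MSR_X2APIC_ICR 0x830u  /* x2APIC ICR MSR address */

/*
 * Hypothetical helper: build the 64-bit ICR value for a Fixed-delivery IPI
 * with a physical destination.  All other fields are left as zero.
 */
uint64_t x2apic_icr_fixed(uint32_t dest_apic_id, uint8_t vector)
{
    /* Bits 63:32 = destination APIC ID, bits 10:8 = 000 (Fixed), bits 7:0 = vector. */
    return ((uint64_t)dest_apic_id << 32) | vector;
}
```

Because destination and command travel in one register write, there is no
window in which another writer can replace the destination half.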

It's also worth noting that any system with >254 CPUs is necessarily
operating in x2APIC mode (so there isn't an xAPIC window to map, even if
you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake
and later CPUs are locked into x2APIC mode by firmware, with no option
to revert back into xAPIC mode even on smaller systems.

On top of that, you've still got the problem of determining the
destination.  Even if the guest could send an IPI, it still has to know
the physical APIC ID of the CPU the target vCPU is currently scheduled
on.  And you'd have to ignore things like the logical mode or
destination shorthands, because multi/broadcast IPIs will hit incorrect
targets.

On top of that, even if you can determine the right destination, how
does the target receive the interrupt?  There can only be one entity in
the system receiving INTR, and that's Xen.  So you've got to pick some
vector that Xen knows what to do with, but isn't otherwise using.

Not to mention there's a(nother) giant security hole... A guest able to
issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU
back into real mode behind Xen's back.  Xen will not take kindly to this.
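
To illustrate what would have to be policed here, a sketch of a check on the
low ICR word follows: the delivery mode sits in bits 10:8 (101b is INIT, 110b is
Start-up) and the destination shorthand in bits 19:18. The function name
icr_lo_is_plain_fixed_ipi() is hypothetical, and passing this check does nothing
about the destination, vector, or ownership problems described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define ICR_DELIVERY_MODE(icr)  (((icr) >> 8) & 0x7u)   /* bits 10:8 */
#define ICR_DEST_SHORTHAND(icr) (((icr) >> 18) & 0x3u)  /* bits 19:18 */

#define ICR_DM_FIXED 0x0u  /* 101b would be INIT, 110b Start-up (SIPI) */

/*
 * Hypothetical policy check: accept only Fixed delivery with no destination
 * shorthand, which rejects INIT, SIPI, NMI, SMI and broadcast shorthands.
 */
bool icr_lo_is_plain_fixed_ipi(uint32_t icr_lo)
{
    return ICR_DELIVERY_MODE(icr_lo) == ICR_DM_FIXED &&
           ICR_DEST_SHORTHAND(icr_lo) == 0;
}
```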


So while I expect there's plenty of room to innovate on the realm switch
aspect of EPTP-switching, trying to send IPIs from within guest context
is something that I will firmly suggest you avoid.  There are good
reasons why it is so complicated to get VMExit-less guest IPIs working.

~Andrew



* RE: Best way to use altp2m to support VMFUNC EPT-switching?
  2023-03-16  2:14     ` Andrew Cooper
@ 2023-03-30  2:29       ` Johnson, Ethan
  2023-03-31 21:06         ` Andrew Cooper
  2023-04-03 13:40         ` Tamas K Lengyel
  0 siblings, 2 replies; 7+ messages in thread
From: Johnson, Ethan @ 2023-03-30  2:29 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel

Thank you for the detailed answers and context. I am somewhat encouraged to
note that most of the roadblocks you mentioned are issues we've specifically
considered (and think we have solutions for) in our design. :-) We're using
some rather exotic compiler-based instrumentation on the guest kernel (plus
some tricks with putting the "secure realm"'s page tables in a nonoverlapping
guest-physical address range that isn't present in the primary p2m used by
untrusted code) to prevent the guest from doing things it isn't supposed to
with VMFUNC and (x2)APIC access, despite running in ring 0 within non-root
mode.

On a more concrete level, I am looking to do the following from within the
hypervisor (specifically, from within a new hypercall I've added):

1) Get some (host-)physical memory frames from the domain heap and "pin" them
to make sure they won't be swapped out.

2) Create an altp2m for the calling (current) domain.

3) Map some of the newly-allocated physical frames into both the domain's
primary p2m and its altp2m, with R/X permissions.

4) Map the rest of the physical frames into only the altp2m (as R/W), at a
guest-physical address higher than the end of the main p2m's mapped range 
(such that when the primary p2m is active, the guest cannot access these
pages without taking a hard VM-exit fault).

I've been poring through Xen's p2m code (e.g. xen/arch/x86/mm/p2m.c) to try
to understand how to achieve these goals, but with little success. Comments
in the p2m code seem to be rather sparse, and mostly unhelpful for
understanding (without pre-understood context) what many of the functions do
and what is the intended workflow for using them. For instance,
similarly-named functions like guest_remove_page() and
guest_physmap_remove_page() seem to operate at different levels of
abstraction (in terms of memory management, refcount bookkeeping, etc.) but
it isn't externally obvious how they're meant to all fit together and be used
by client code.

Any suggestions on which p2m (or other) APIs I should be focusing on, and how
they're meant to be used, would be greatly appreciated. I suppose in theory I
could just bypass p2m entirely, and populate one of the VMCS's EPTP-switching
array's slots directly with my own manually constructed paging hierarchy
(since I'm envisioning the memory layout of our "secure realm" as being quite
simple - it only needs a handful of pages). But I'd rather "color within the
lines" of the existing APIs if possible, especially since some of the pages
will need to be mapped into the existing primary p2m (for the "insecure
realm") as well.

Much thanks,

Ethan Johnson
Computer Science PhD candidate, Systems group, University of Rochester
ejohns48@cs.rochester.edu

* Re: Best way to use altp2m to support VMFUNC EPT-switching?
  2023-03-30  2:29       ` Johnson, Ethan
@ 2023-03-31 21:06         ` Andrew Cooper
  2023-04-03 13:40         ` Tamas K Lengyel
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2023-03-31 21:06 UTC (permalink / raw)
  To: Johnson, Ethan, xen-devel

On 30/03/2023 3:29 am, Johnson, Ethan wrote:
> On 2023-03-16 02:14:18 +0000, Andrew Cooper wrote:
>> Ok, so there is a lot here.  Apologies in advance for the overly long
>> answer.
>>
>> First, while altp2m was developed in parallel with EPTP-switching, we
>> took care to split the vendor neutral parts from the vendor specific
>> bits.  So while we do have VMFUNC support, that's considered "just" a
>> hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.
>>
>> But before you start, it is important to understand your security
>> boundaries.  You've found external mode, and this is all about
>> controlling which aspects of altp2m the guest can invoke itself, and
>> modes other than external let the guest issue HVMOP_altp2m ops itself.
>>
>> If you permit the guest to change views itself, either with VMFUNC, or
>> HVMOP_altp2m_switch_p2m, you have to realise that these are just
>> "regular" CPL0 actions, and can be invoked by any kernel code, not just
>> your driver.  i.e. the union of all primary and alternative views is one
>> single security domain.
>>
>> For some usecases this is fine, but yours doesn't look like it fits in
>> this category.  In particular, no amount of protection on the trampoline
>> pages stops someone writing a VMFUNC instruction elsewhere in kernel
>> space and executing it.
>>
>> (I have seen plenty of research papers try to construct a security
>> boundary around VMFUNC.  I have yet to see one that does so robustly, but I
>> do enjoy being surprised on occasion...)
>>
>> The first production use of this technology I'm aware of was Bitdefender's
>> HVMI, where the guest had no control at all, and was subject to the
>> permission restrictions imposed on it by the agent in dom0.  The agent
>> trapped everything it considered sensitive, including writes to
>> sensitive areas of memory using reduced EPT permissions, and either
>> permitted execution to continue, or took other preventative action.
>>
>> This highlights another key point.  Some entity in the system needs to
>> deal with faults that occur when the guest accidentally (or otherwise)
>> violates the reduced EPT permissions.  #VE is, again, an optimisation to
>> let violations be handled in guest context, rather than taking a VMExit,
>> but even with #VE the complicated corner cases are left to the external
>> agent.
>>
>> With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to
>> mitigate the perf hit from Windows' Meltdown mitigation electing to use
>> LOCK'd BTS/BTC operations on pagetables (which were write protected
>> behind the scenes), but I'm reliably informed that the hoops required to
>> jump through to make that work, and in particular avoid the notice of
>> PatchGuard, were substantial.
>>
>> Perhaps a more accessible example is
>> https://github.com/intel/kernel-fuzzer-for-xen-project and the
>> underlying libvmi.  There is also a very basic example in
>> tools/misc/xen-access.c in the Xen tree.
>>
>> For your question specifically about mapping other frames, we do have
>> hypercalls to map other frames (it's necessary for e.g. mapping BARs of
>> passed-through PCI devices), but for obvious reasons, it's restricted to
>> control software (Qemu) in dom0.  I suspect we don't actually have a
>> hypercall to map MMIO into an alternative view, but it shouldn't be hard
>> to add (if you still decide you want it by the end of this email).
>>
>>
>> But on to the specifics of mapping the xAPIC page.  Sorry, but
>> irrespective of altp2m, that is a non-starter, for reasons that date
>> back to ~1997 or thereabouts.
>>
>> It's worth saying that AMD can fully virtualise IPI delivery from one
>> vCPU to another without either taking a VMExit in the common case, since
>> Zen1 (IIRC).  Intel has a similar capability since Sapphire Rapids
>> (IIRC).  Xen doesn't support either yet, because there are only so many
>> hours in the day...
>>
>> It is technically possible to map the xAPIC window into a guest, and
>> such a guest could interact with the real interrupt controller.  But now
>> you've got the problem that two bits of software (Xen, and your magic
>> piece of guest kernel) are trying to drive the same single interrupt
>> controller.
>>
>> Even if you were to say that the guest would only use ICR to send
>> interrupts, that still doesn't work.  In xAPIC, ICR is formed of two
>> half registers, as it dates from the days of 32bit processors, with a
>> large stride between the two half registers.
>>
>> Therefore, it is a minimum of two separate instructions (set destination
>> in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.
>>
>> A common bug in kernels is to try and send IPIs when interrupts are
>> enabled, or in NMI context, both of which could interrupt an IPI
>> sequence.  This results in a sequence of writes (from the LAPIC's point
>> of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI
>> to be sent with the wrong destination.
>>
>> Guests always execute with IRQs enabled, but can take a VMExit on any
>> arbitrary instruction boundary for other reasons, so the guest kernel
>> can never be sure that ICR_HI hasn't been modified by Xen in the
>> background, even if it used two adjacent instructions to send the IPI.
>>
>> Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was
>> making ICR a single register, so it could be written atomically.  But
>> now you have an MSR based interface, not an MMIO based interface.
>>
>> It's also worth noting that any system with >254 CPUs is necessarily
>> operating in x2APIC mode (so there isn't an xAPIC window to map, even if
>> you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake
>> and later CPUs are locked into x2APIC mode by firmware, with no option
>> to revert back into xAPIC mode even on smaller systems.
>>
>> On top of that, you've still got the problem of determining the
>> destination.  Even if the guest could send an IPI, it still has to know
>> the physical APIC ID of the CPU the target vCPU is currently scheduled
>> on.  And you'd have to ignore things like the logical mode or
>> destination shorthands, because multi/broadcast IPIs will hit incorrect
>> targets.
>>
>> On top of that, even if you can determine the right destination, how
>> does the target receive the interrupt?  There can only be one entity in
>> the system receiving INTR, and that's Xen.  So you've got to pick some
>> vector that Xen knows what to do with, but isn't otherwise using.
>>
>> Not to mention there's a(nother) giant security hole... A guest able to
>> issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU
>> back into real mode behind Xen's back.  Xen will not take kindly to this.
>>
>>
>> So while I expect there's plenty of room to innovate on the realm switch
>> aspect of EPTP-switching, trying to send IPIs from within guest context
>> is something that I will firmly suggest you avoid.  There are good
>> reasons why it is so complicated to get VMExit-less guest IPIs working.
>>
>> ~Andrew
> Thank you for the detailed answers and context. I am somewhat encouraged to
> note that most of the roadblocks you mentioned are issues we've specifically
> considered (and think we have solutions for) in our design. :-) We're using
> some rather exotic compiler-based instrumentation on the guest kernel (plus
> some tricks with putting the "secure realm"'s page tables in a nonoverlapping
> guest-physical address range that isn't present in the primary p2m used by
> untrusted code) to prevent the guest from doing things it isn't supposed to
> with VMFUNC and (x2)APIC access, despite running in ring 0 within non-root
> mode.
>
> On a more concrete level, I am looking to do the following from within the
> hypervisor (specifically, from within a new hypercall I've added):
>
> 1) Get some (host-)physical memory frames from the domain heap and "pin" them
> to make sure they won't be swapped out.

Xen doesn't have paging, owing to not having a disk driver.

There is a paging subsystem which you've probably found already in the
code, but it's a decade old and never got beyond experimental status, so
for most intents and purposes you can pretend that it doesn't exist.

i.e. nothing allocated in Xen moves around unexpectedly behind your back.

However, pages that are allocated to a guest (PGT_allocated) are
reference counted, and can be freed when the refcount drops to zero. 
This can include explicit guest actions such as a decrease_reservation()
hypercall.  You have to be aware of this if you want to point any other
non-refcounted thing at the memory, but I suspect it won't matter for
your cases here.
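
To make the refcount point concrete, here is a toy model (not Xen code;
struct toy_page, toy_get_page and toy_put_page are made-up names for
illustration) of a page that is freed once its last reference is dropped,
which is why anything pointing at it without holding a reference can go
stale:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a guest-owned page; not Xen's struct page_info. */
struct toy_page {
    unsigned int refcount;  /* number of live references */
    bool allocated;         /* false once the page has gone back to the heap */
};

/* Take a reference; fails if the page has already been freed. */
static bool toy_get_page(struct toy_page *pg)
{
    if (!pg->allocated)
        return false;
    pg->refcount++;
    return true;
}

/* Drop a reference; the page is freed when the count reaches zero. */
static void toy_put_page(struct toy_page *pg)
{
    assert(pg->refcount > 0);
    if (--pg->refcount == 0)
        pg->allocated = false;
}
```

In this sketch, a user that wants the page to stay put takes its own
reference first and only drops it when it is done with the memory.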

> 2) Create an altp2m for the calling (current) domain.
>
> 3) Map some of the newly-allocated physical frames into both the domain's
> primary p2m and its altp2m, with R/X permissions.
>
> 4) Map the rest of the physical frames into only the altp2m (as R/W), at a
> guest-physical address higher than the end of the main p2m's mapped range 
> (such that when the primary p2m is active, the guest cannot access these
> pages without taking a hard VM-exit fault).
>
> I've been poring through Xen's p2m code (e.g. xen/arch/x86/mm/p2m.c) to try
> to understand how to achieve these goals, but with little success. Comments
> in the p2m code seem to be rather sparse, and mostly unhelpful for
> understanding (without pre-understood context) what many of the functions do
> and what is the intended workflow for using them. For instance,
> similarly-named functions like guest_remove_page() and
> guest_physmap_remove_page() seem to operate at different levels of
> abstraction (in terms of memory management, refcount bookkeeping, etc.) but
> it isn't externally obvious how they're meant to all fit together and be used
> by client code.

Don't feel too bad...  Not even the maintainers can agree on where that
split is either.

It's mostly a matter of history.  Originally Xen had paravirtual guests
(dom0 still runs in this mode) which were aware they were running under
Xen, and had to manage their own memory, including whatever idea they
had about their layout.

Then HVM guests came along and Xen had to start managing the guest
physical address space on behalf of the guest, and this was (dubiously)
called the physical_to_machine or P2M.

Notice how the guest_physmap_* functions have paging_mode_translate()
checks and do two totally different things.  Read
paging_mode_translate() as is_hvm_domain() and it might help.  The
guest_physmap_* functions are for doing logically-the-same operation on
PV or HVM guests, where PV is often a no-op, and HVM is quite involved.

The p2m functions are all for HVM guests specifically.
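
As a rough illustration of that split, here is a toy sketch (all names are
hypothetical; the 'translated' flag merely plays the role of
paging_mode_translate(), and the PV path is assumed to be a pure no-op,
which is only "often" true in the real code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy domain; 'translated' stands in for paging_mode_translate(). */
struct toy_domain {
    bool translated;          /* true for HVM-style guests */
    unsigned int p2m_updates; /* toy bookkeeping: p2m entries removed */
};

/* Hypothetical HVM-only helper: would tear down the gfn -> mfn entry. */
static int toy_p2m_remove(struct toy_domain *d, unsigned long gfn)
{
    (void)gfn;
    d->p2m_updates++;
    return 0;
}

/* Logically the same operation for both guest types. */
static int toy_physmap_remove(struct toy_domain *d, unsigned long gfn)
{
    if (!d->translated)
        return 0;  /* PV: the guest manages its own layout, nothing to do */
    return toy_p2m_remove(d, gfn);
}
```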

But yes - the APIs are a mess and you're not the only person to have
noticed.

> Any suggestions on which p2m (or other) APIs I should be focusing on, and how
> they're meant to be used, would be greatly appreciated. I suppose in theory I
> could just bypass p2m entirely, and populate one of the VMCS's EPTP-switching
> array's slots directly with my own manually constructed paging hierarchy
> (since I'm envisioning the memory layout of our "secure realm" as being quite
> simple - it only needs a handful of pages). But I'd rather "color within the
> lines" of the existing APIs if possible, especially since some of the pages
> will need to be mapped into the existing primary p2m (for the "insecure
> realm") as well.

Taking your analogy, I'm afraid you're probably going to have to start
with a pencil and draw some more lines.

The altp2m work got as far as minor {i,d}TLB bifurcation (to
stealth-breakpoint code under analysis), but didn't ever get to "I'd
like something totally different in different views".

There has to be an authoritative idea of what the guest physmap
(singular) looks like, and that's the host p2m.  (Not relevant to your
case, but to highlight a point.  Consider trying to migrate a VM with a
multi-view setup.  The logdirty bitmap is expressed as a bit per gfn,
and all those gfn bits had better come from the same view, not the
alternate view which happened to be active.)

I suspect what you might want to do is create the guest with all memory
(but mark the secure realm's memory as either E820_RESERVED, or remove
the entry entirely), and create two altp2m's; one for the insecure realm
and one for the secure realm.

IIRC, views are populated copy-on-write style from the hostp2m as the
vCPU executes in that view, but you can make modifications using
HVMOP_altp2m_set_mem_access{,_multi} to give it specific perms or
HVMOP_altp2m_change_gfn to bifurcate.

I suspect what you want to do is set the default perm to no access (i.e.
disable CoW) and use HVMOP_altp2m_set_mem_access_multi to explicitly create
the subset of mappings you want in each view.
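
A toy model of that arrangement, purely as a sketch: it follows the
description above rather than Xen's actual implementation, and every name
in it (toy_view, toy_hostp2m, toy_set_mem_access, toy_view_access) is made
up.  Entries are copied lazily from the host p2m with the view's default
access, so a default of "no access" means only explicitly set gfns are
usable:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_GFNS 8

/* Access bits; TOY_N means no access at all. */
enum toy_access { TOY_N = 0, TOY_R = 1, TOY_W = 2, TOY_X = 4,
                  TOY_RW = 3, TOY_RX = 5, TOY_RWX = 7 };

/* Authoritative physmap: which gfns are backed at all. */
struct toy_hostp2m {
    bool present[TOY_NR_GFNS];
};

struct toy_view {
    enum toy_access default_access;  /* applied to lazily copied entries */
    bool populated[TOY_NR_GFNS];
    enum toy_access access[TOY_NR_GFNS];
};

/* Explicitly give one gfn specific permissions in a view. */
static void toy_set_mem_access(struct toy_view *v, unsigned int gfn,
                               enum toy_access a)
{
    v->populated[gfn] = true;
    v->access[gfn] = a;
}

/* True if 'want' is permitted; copies from the host p2m on first touch. */
static bool toy_view_access(struct toy_view *v, const struct toy_hostp2m *h,
                            unsigned int gfn, enum toy_access want)
{
    if (gfn >= TOY_NR_GFNS || !h->present[gfn])
        return false;  /* not in the host p2m: no view can map it */
    if (!v->populated[gfn])
        toy_set_mem_access(v, gfn, v->default_access);
    return (v->access[gfn] & want) == want;
}
```

Note that in this model the secure realm's pages still have to be present
in the host p2m; the views only differ in what permissions they grant.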

But honestly, you're beyond my experience of using altp2m.  Good luck :)

~Andrew


* Re: Best way to use altp2m to support VMFUNC EPT-switching?
  2023-03-30  2:29       ` Johnson, Ethan
  2023-03-31 21:06         ` Andrew Cooper
@ 2023-04-03 13:40         ` Tamas K Lengyel
  1 sibling, 0 replies; 7+ messages in thread
From: Tamas K Lengyel @ 2023-04-03 13:40 UTC (permalink / raw)
  To: Johnson, Ethan; +Cc: Andrew Cooper, xen-devel

On Wed, Mar 29, 2023 at 10:29 PM Johnson, Ethan <ejohns48@cs.rochester.edu>
wrote:
>
> On 2023-03-16 02:14:18 +0000, Andrew Cooper wrote:
> > Ok, so there is a lot here.  Apologies in advance for the overly long
> > answer.
> >
> > First, while altp2m was developed in parallel with EPTP-switching, we
> > took care to split the vendor neutral parts from the vendor specific
> > bits.  So while we do have VMFUNC support, that's considered "just" a
> > hardware optimisation to speed up the HVMOP_altp2m_switch_p2m hypercall.
> >
> > But before you start, it is important to understand your security
> > boundaries.  You've found external mode, and this is all about
> > controlling which aspects of altp2m the guest can invoke itself, and
> > modes other than external let the guest issue HVMOP_altp2m ops itself.
> >
> > If you permit the guest to change views itself, either with VMFUNC, or
> > HVMOP_altp2m_switch_p2m, you have to realise that these are just
> > "regular" CPL0 actions, and can be invoked by any kernel code, not just
> > your driver.  i.e. the union of all primary and alternative views is one
> > single security domain.
> >
> > For some usecases this is fine, but yours doesn't look like it fits in
> > this category.  In particular, no amount of protection on the trampoline
> > pages stops someone writing a VMFUNC instruction elsewhere in kernel
> > space and executing it.
> >
> > (I have seen plenty of research papers try to construct a security
> > boundary around VMFUNC.  I have yet to see one that does so robustly, but I
> > do enjoy being surprised on occasion...)
> >
> > The first production use of this technology I'm aware of was Bitdefender's
> > HVMI, where the guest had no control at all, and was subject to the
> > permission restrictions imposed on it by the agent in dom0.  The agent
> > trapped everything it considered sensitive, including writes to
> > sensitive areas of memory using reduced EPT permissions, and either
> > permitted execution to continue, or took other preventative action.
> >
> > This highlights another key point.  Some entity in the system needs to
> > deal with faults that occur when the guest accidentally (or otherwise)
> > violates the reduced EPT permissions.  #VE is, again, an optimisation to
> > let violations be handled in guest context, rather than taking a VMExit,
> > but even with #VE the complicated corner cases are left to the external
> > agent.
> >
> > With HVMI, #VE (but not VMFUNC IIRC) did get used as an optimisation to
> > mitigate the perf hit from Windows' Meltdown mitigation electing to use
> > LOCK'd BTS/BTC operations on pagetables (which were write protected
> > behind the scenes), but I'm reliably informed that the hoops required to
> > jump through to make that work, and in particular avoid the notice of
> > PatchGuard, were substantial.
> >
> > Perhaps a more accessible example is
> > https://github.com/intel/kernel-fuzzer-for-xen-project and the
> > underlying libvmi.  There is also a very basic example in
> > tools/misc/xen-access.c in the Xen tree.
> >
> > For your question specifically about mapping other frames, we do have
> > hypercalls to map other frames (it's necessary for e.g. mapping BARs of
> > passed-through PCI devices), but for obvious reasons, it's restricted to
> > control software (Qemu) in dom0.  I suspect we don't actually have a
> > hypercall to map MMIO into an alternative view, but it shouldn't be hard
> > to add (if you still decide you want it by the end of this email).
> >
> >
> > But on to the specifics of mapping the xAPIC page.  Sorry, but
> > irrespective of altp2m, that is a non-starter, for reasons that date
> > back to ~1997 or thereabouts.
> >
> > It's worth saying that AMD can fully virtualise IPI delivery from one
> > vCPU to another without either taking a VMExit in the common case, since
> > Zen1 (IIRC).  Intel has a similar capability since Sapphire Rapids
> > (IIRC).  Xen doesn't support either yet, because there are only so many
> > hours in the day...
> >
> > It is technically possible to map the xAPIC window into a guest, and
> > such a guest could interact with the real interrupt controller.  But now
> > you've got the problem that two bits of software (Xen, and your magic
> > piece of guest kernel) are trying to drive the same single interrupt
> > controller.
> >
> > Even if you were to say that the guest would only use ICR to send
> > interrupts, that still doesn't work.  In xAPIC, ICR is formed of two
> > half registers, as it dates from the days of 32bit processors, with a
> > large stride between the two half registers.
> >
> > Therefore, it is a minimum of two separate instructions (set destination
> > in ICR_HI, set type/mode/etc in ICR_LO) to send an interrupt.
> >
> > A common bug in kernels is to try and send IPIs when interrupts are
> > enabled, or in NMI context, both of which could interrupt an IPI
> > sequence.  This results in a sequence of writes (from the LAPIC's point
> > of view) of ICR_HI, ICR_HI, ICR_LO, ICR_LO, which causes the outer IPI
> > to be sent with the wrong destination.
> >
> > Guests always execute with IRQs enabled, but can take a VMExit on any
> > arbitrary instruction boundary for other reasons, so the guest kernel
> > can never be sure that ICR_HI hasn't been modified by Xen in the
> > background, even if it used two adjacent instructions to send the IPI.
> >
> > Now, if you were to swap xAPIC for x2APIC, one of the bigger changes was
> > making ICR a single register, so it could be written atomically.  But
> > now you have an MSR based interface, not an MMIO based interface.
> >
> > It's also worth noting that any system with >254 CPUs is necessarily
> > operating in x2APIC mode (so there isn't an xAPIC window to map, even if
> > you wanted to try), and because of the ÆPIC Leak vulnerability, IceLake
> > and later CPUs are locked into x2APIC mode by firmware, with no option
> > to revert back into xAPIC mode even on smaller systems.
> >
> > On top of that, you've still got the problem of determining the
> > destination.  Even if the guest could send an IPI, it still has to know
> > the physical APIC ID of the CPU the target vCPU is currently scheduled
> > on.  And you'd have to ignore things like the logical mode or
> > destination shorthands, because multi/broadcast IPIs will hit incorrect
> > targets.
> >
> > On top of that, even if you can determine the right destination, how
> > does the target receive the interrupt?  There can only be one entity in
> > the system receiving INTR, and that's Xen.  So you've got to pick some
> > vector that Xen knows what to do with, but isn't otherwise using.
> >
> > Not to mention there's a(nother) giant security hole... A guest able to
> > issue interrupts could just send INIT-SIPI-SIPI and reset the target CPU
> > back into real mode behind Xen's back.  Xen will not take kindly to this.
> >
> >
> > So while I expect there's plenty of room to innovate on the realm switch
> > aspect of EPTP-switching, trying to send IPIs from within guest context
> > is something that I will firmly suggest you avoid.  There are good
> > reasons why it is so complicated to get VMExit-less guest IPIs working.
> >
> > ~Andrew
>
> Thank you for the detailed answers and context. I am somewhat encouraged to
> note that most of the roadblocks you mentioned are issues we've specifically
> considered (and think we have solutions for) in our design. :-) We're using
> some rather exotic compiler-based instrumentation on the guest kernel (plus
> some tricks with putting the "secure realm"'s page tables in a nonoverlapping
> guest-physical address range that isn't present in the primary p2m used by
> untrusted code) to prevent the guest from doing things it isn't supposed to
> with VMFUNC and (x2)APIC access, despite running in ring 0 within non-root
> mode.
>
> On a more concrete level, I am looking to do the following from within the
> hypervisor (specifically, from within a new hypercall I've added):
>
> 1) Get some (host-)physical memory frames from the domain heap and "pin" them
> to make sure they won't be swapped out.
>
> 2) Create an altp2m for the calling (current) domain.
>
> 3) Map some of the newly-allocated physical frames into both the domain's
> primary p2m and its altp2m, with R/X permissions.
>
> 4) Map the rest of the physical frames into only the altp2m (as R/W), at a
> guest-physical address higher than the end of the main p2m's mapped range
> (such that when the primary p2m is active, the guest cannot access these
> pages without taking a hard VM-exit fault).
>
> I've been poring through Xen's p2m code (e.g. xen/arch/x86/mm/p2m.c) to try
> to understand how to achieve these goals, but with little success. Comments
> in the p2m code seem to be rather sparse, and mostly unhelpful for
> understanding (without pre-understood context) what many of the functions do
> and what is the intended workflow for using them. For instance,
> similarly-named functions like guest_remove_page() and
> guest_physmap_remove_page() seem to operate at different levels of
> abstraction (in terms of memory management, refcount bookkeeping, etc.) but
> it isn't externally obvious how they're meant to all fit together and be used
> by client code.
>
> Any suggestions on which p2m (or other) APIs I should be focusing on, and how
> they're meant to be used, would be greatly appreciated. I suppose in theory I
> could just bypass p2m entirely, and populate one of the VMCS's EPTP-switching
> array's slots directly with my own manually constructed paging hierarchy
> (since I'm envisioning the memory layout of our "secure realm" as being quite
> simple - it only needs a handful of pages). But I'd rather "color within the
> lines" of the existing APIs if possible, especially since some of the pages
> will need to be mapped into the existing primary p2m (for the "insecure
> realm") as well.

You can find an example work-flow here to create altp2m's and change memory
permissions in the different views:
https://github.com/xen-project/xen/blob/master/tools/misc/xen-access.c#L517.
To add a new page to the VM you can use xc_domain_populate_physmap_exact.
If you add the page after the VM has already booted, the main kernel is
unaware of the extra pages, but that doesn't mean it can't try to poke
them. Similarly, relying on any type of memory map to keep the kernel away
from these pages is just wishful thinking; the memory map is, after all,
just a hint to the OS about what to look for, not an access-control
mechanism.

Also keep in mind that altp2m's get CoW populated from the hostp2m. You can
still get your altp2m to be "only a couple pages" by either 1) ensuring no
other pages ever get touched while running the vCPU with the altp2m, so as
not to trigger the CoW mechanism; or 2) manually changing the memaccess
permissions to n on every page you want to be inaccessible in the altp2m.
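
For option 2, a hypothetical helper could compute the list of gfns to mark
n, given the handful of gfns the altp2m should keep. This is only a sketch:
toy_is_kept and toy_build_deny_list are made-up names, and treating the
guest physmap as a flat range [0, nr_gfns) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if gfn appears in the keep list. */
static bool toy_is_kept(const uint64_t *keep, size_t nkeep, uint64_t gfn)
{
    for (size_t i = 0; i < nkeep; i++)
        if (keep[i] == gfn)
            return true;
    return false;
}

/*
 * Fill 'deny' with every gfn in [0, nr_gfns) that is not in 'keep'.
 * Returns the number of entries written (at most 'cap').
 */
static size_t toy_build_deny_list(const uint64_t *keep, size_t nkeep,
                                  uint64_t nr_gfns,
                                  uint64_t *deny, size_t cap)
{
    size_t n = 0;

    for (uint64_t gfn = 0; gfn < nr_gfns && n < cap; gfn++)
        if (!toy_is_kept(keep, nkeep, gfn))
            deny[n++] = gfn;
    return n;
}
```

The resulting array is the kind of input one would hand to a bulk
permission-setting call for the altp2m view.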

You'll likely want pages such as those holding the IDT and GDT mapped into
the altp2m, alongside the pagetable pages. An easy way to check which pages
are needed for execution in a given code context is to use the VM forking
mechanism: create a fork at the point where the code you want to run in the
altp2m starts, singlestep the fork a single instruction, then examine the
fork's EPT using xl debug-keys D. Anything you see that got mapped into the
fork's memory would similarly need to be accessible in the altp2m.

Cheers,
Tamas

Thread overview: 7+ messages
2023-03-15  2:01 Best way to use altp2m to support VMFUNC EPT-switching? Johnson, Ethan
2023-03-15  9:22 ` Andrew Cooper
2023-03-15 21:41   ` Johnson, Ethan
2023-03-16  2:14     ` Andrew Cooper
2023-03-30  2:29       ` Johnson, Ethan
2023-03-31 21:06         ` Andrew Cooper
2023-04-03 13:40         ` Tamas K Lengyel
