Xen-Devel Archive on lore.kernel.org
* [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
@ 2019-08-16 19:51 Johnson, Ethan
  2019-08-17 11:04 ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread
From: Johnson, Ethan @ 2019-08-16 19:51 UTC (permalink / raw)
  To: xen-devel

Hi all,

I have some follow-up questions about Xen's usage and layout of memory, 
building on the ones I asked here a few weeks ago (which were quite 
helpfully answered: see 
https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01513.html 
for reference). For context on why I'm asking these questions, I'm using 
Xen as a research platform for enforcing novel memory protection schemes 
on hypervisors and guests.

1. Xen itself lives in the memory region from (on x86-64) 0xffff 8000 
0000 0000 - 0xffff 8777 ffff ffff, regardless of whether it's in PV mode 
or HVM/PVH. Clearly, in PV mode a separate set of page tables (i.e. CR3 
root pointer) must be used for each guest. Is that also true of the host 
(non-extended, i.e. CR3 in VMX root mode) page tables when an HVM/PVH 
guest is running? Or is the dom0 page table left in place, assuming the 
dom0 is PV, when an HVM/PVH guest is running, since extended paging is 
now being used to provide the guest's view of memory? Does this change 
if the dom0 is PVH?

Or, to ask this from another angle: is there ever anything *but* Xen 
living in the host-virtual address space when an HVM/PVH guest is 
active? And is the answer to this different depending on whether the 
HVM/PVH guest is a domU vs. a PVH dom0?

2. Do the mappings in Xen's slice of the host-virtual address space 
differ at all between the host page tables corresponding to different 
guests? If the mappings are in fact the same, does Xen therefore share 
lower-level page table pages between the page tables corresponding to 
different guests? Is any of this different for PV vs. HVM/PVH?

3. Under what circumstances, and for what purposes, does Xen use its 
ability to access guest memory through its direct map of host-physical 
memory? Similarly, to what extent does the dom0 (or other such 
privileged domain) utilize "foreign memory maps" to reach into another 
guest's memory? I understand that this is necessary when creating a 
guest, for live migration, and for QEMU to emulate stuff for HVM guests; 
but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" 
access a guest's memory?

(I ask because the research project I'm working on is seeking to protect 
guests from a compromised hypervisor and dom0, so I need to limit 
outside access to a guest's memory to explicitly shared pages that the 
guest will treat as untrusted - not storing any secrets there, vetting 
input as necessary, etc.)

4. What facilities/processes does Xen provide for PV(H) guests to 
explicitly/voluntarily share memory pages with Xen and other domains 
(dom0, etc.)? From what I can gather from the documentation, it sounds 
like "grant tables" are involved in this - is that how a PV-aware guest 
is expected to set up shared memory regions for communication with other 
domains (ring buffers, etc.)? Does a PV(H) guest need to voluntarily 
establish all external access to its pages, or is there ever a situation 
where it's the other way around - where Xen itself establishes/defines a 
region as shared and the guest is responsible for treating it accordingly?

Again, this mostly boils down to: under what circumstances, if ever, 
does Xen ever "force" access to any part of a guest's memory? 
(Particularly for PV(H). Clearly that must happen for HVM since, by 
definition, the guest is unaware there's a hypervisor controlling its 
world and emulating hardware behavior, and thus is in no position to 
cooperatively/voluntarily give the hypervisor and dom0 access to its 
memory.)

Thanks again in advance for any help anyone can offer!

Sincerely,
Ethan Johnson

-- 
Ethan J. Johnson
Computer Science PhD student, Systems group, University of Rochester
ejohns48@cs.rochester.edu
ethanjohnson@acm.org
PGP public key available from public directory or on request

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-16 19:51 [Xen-devel] More questions about Xen memory layout/usage, access to guest memory Johnson, Ethan
@ 2019-08-17 11:04 ` Andrew Cooper
  2019-08-22  2:06   ` Johnson, Ethan
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2019-08-17 11:04 UTC (permalink / raw)
  To: Johnson, Ethan, xen-devel

On 16/08/2019 20:51, Johnson, Ethan wrote:
> Hi all,
>
> I have some follow-up questions about Xen's usage and layout of memory, 
> building on the ones I asked here a few weeks ago (which were quite 
> helpfully answered: see 
> https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01513.html 
> for reference). For context on why I'm asking these questions, I'm using 
> Xen as a research platform for enforcing novel memory protection schemes 
> on hypervisors and guests.
>
> 1. Xen itself lives in the memory region from (on x86-64) 0xffff 8000 
> 0000 0000 - 0xffff 8777 ffff ffff, regardless of whether it's in PV mode 
> or HVM/PVH. Clearly, in PV mode a separate set of page tables (i.e. CR3 
> root pointer) must be used for each guest.

More than that.  Each vCPU.

PV guests manage their own pagetables, and have a vCR3 which the guest
kernel controls, and we must honour.

For 64bit PV guests, each time a new L4 pagetable is created, Xen sets
up its own 16 slots appropriately.  As a result, Xen itself is able to
function correctly on all pagetable hierarchies the PV guest creates.
See init_xen_l4_slots() which does this.
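As a rough model of that (the slot numbers match the x86-64 layout
discussed above, but the function itself is an illustration, not Xen's
actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified model of the idea behind init_xen_l4_slots(): whenever a
 * 64-bit PV guest creates a new L4 pagetable, Xen copies its reserved
 * slots into it from a master (idle) L4, so Xen's own mappings are
 * present in every pagetable hierarchy the guest builds.
 */
#define L4_ENTRIES     512
#define XEN_FIRST_SLOT 256  /* 0xffff800000000000 >> 39, low 9 bits */
#define XEN_NR_SLOTS    16  /* the 16 Xen slots mentioned above     */

static void model_init_xen_l4_slots(uint64_t *new_l4,
                                    const uint64_t *idle_l4)
{
    /* Only the Xen slots are copied; guest slots are left alone. */
    memcpy(&new_l4[XEN_FIRST_SLOT], &idle_l4[XEN_FIRST_SLOT],
           XEN_NR_SLOTS * sizeof(*new_l4));
}
```
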

For 32bit PV guests, things are a tad more complicated.  Each vCR3 is
actually a PAE-quad of pagetable entries.  Because Xen is still
operating in 64bit mode with 4-level paging, we enforce that guests
allocate a full 4k page for the pagetable (rather than the 32 bytes it
would normally be).

In Xen, we allocate what is called a monitor table, which is per-vcpu
(set up with all the correct details for Xen), and we rewrite slot 0
each time the vCPU changes vCR3.


Not related to this question, but important for future answers.  All
pagetables are actually at a minimum per-domain, because we have
per-domain mappings to simplify certain tasks.  Contained within these
are various structures, including the hypercall compatibility
translation area.  This per-domain restriction can in principle be
lifted if we alter the way Xen chooses to lay out its memory.

> Is that also true of the host 
> (non-extended, i.e. CR3 in VMX root mode) page tables when an HVM/PVH 
> guest is running?

Historical context is important to answer this question.

When the first HVM support came along, there was no EPT or NPT in
hardware.  Hypervisors were required to virtualise the guest's
pagetable structure, which is called Shadow Paging in Xen.  The shadow
pagetables themselves are organised per-domain so as to form a single
coherent guest physical address space, but CPUs operating in non-root
mode still needed the real CR3 pointing at the shadow of the logical
vCPU's CR3 which was being virtualised.

In practice, we still allocate a monitor pagetable per vcpu for HVM
guests, even with HAP support.  I can't think of any restrictions which
would prevent us from doing this differently.

> Or is the dom0 page table left in place, assuming the 
> dom0 is PV, when an HVM/PVH guest is running, since extended paging is 
> now being used to provide the guest's view of memory? Does this change 
> if the dom0 is PVH?

Here is some (prototype) documentation prepared since your last round of
questions.

https://andrewcoop-xen.readthedocs.io/en/docs-devel/admin-guide/introduction.html

Dom0 is just a VM, like every other domU in the system.  There is
nothing special about how it is virtualised.

Dom0 defaults to having full permissions, so can successfully issue a
whole range of more interesting hypercalls, but you could easily create
dom1, set the is_priv boolean in Xen, and give dom1 all the same
permissions that dom0 has, if you wished.

> Or, to ask this from another angle: is there ever anything *but* Xen 
> living in the host-virtual address space when an HVM/PVH guest is 
> active?

No, depending on how you classify Xen's directmap in this context.

> And is the answer to this different depending on whether the 
> HVM/PVH guest is a domU vs. a PVH dom0?

Dom0 vs domU has no relevance to the question.

> 2. Do the mappings in Xen's slice of the host-virtual address space 
> differ at all between the host page tables corresponding to different 
> guests?

No (ish).

Xen has a mostly flat address space, so most of the mappings are the
same.  There is a per-domain mapping slot which is common to each vcpu
in a domain but different across domains, a self-linear map for easy
modification of the PTEs in the current pagetable hierarchy, and a
shadow-linear map for easy modification of the shadow PTEs (which
describe an address space that Xen itself is not part of).
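The self-linear map trick can be illustrated with a small arithmetic
model: if L4 slot R points back at the L4 itself, a page walk through
addresses built from R "loses" one level, so the PTE mapping any
virtual address becomes addressable at a computable virtual address.
The slot number used below is illustrative, not Xen's actual choice.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the virtual address at which the L1 entry mapping `va` is
 * visible through a recursive (self-linear) slot `slot`. */
static uint64_t linear_l1e_va(unsigned int slot, uint64_t va)
{
    /* bits 47:12 of va, shifted down one level, 8-byte aligned */
    uint64_t addr = ((uint64_t)slot << 39) |
                    ((va >> 9) & 0x7ffffffff8ULL);

    /* canonicalise (sign-extend from bit 47) */
    if (addr & (1ULL << 47))
        addr |= 0xffff000000000000ULL;
    return addr;
}
```
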

> If the mappings are in fact the same, does Xen therefore share 
> lower-level page table pages between the page tables corresponding to 
> different guests?

We have many different L4's (the monitor tables, every L4 a PV guest has
allocated) which can run Xen.  Most parts of Xen's address space
converge at L3 (the M2P, the directmap, Xen
text/data/bss/fixmap/vmap/heaps/misc), and are common to all contexts.

The per-domain mapping converges at L3 and is shared between vcpus of
the same guest, but not shared across guests.

One aspect I haven't really covered is XPTI for Meltdown mitigation for
PV guests.  Here, we have a per-CPU private pagetable which ends up
being a merge of most of the guest's L4, but with some pre-constructed
CPU-private pagetable hierarchy to hide the majority of data in the Xen
region.

> Is any of this different for PV vs. HVM/PVH?

PV guests control their parts of their address space, and can do largely
whatever they choose.  HVM guests have nothing in the lower canonical
half, but do have an extended directmap (which in practice only makes a
difference on a >5TB machine).

> 3. Under what circumstances, and for what purposes, does Xen use its 
> ability to access guest memory through its direct map of host-physical 
> memory?

That is a very broad question, and currently has the unfortunate answer
of "whenever speculation goes awry in an attacker's favour."  There are
steps under way to reduce the usage of the directmap so we can run
without it, and prevent this kind of leakage.

As for when Xen would normally access memory, the most common answer is
for hypercall parameters which mostly use a virtual address based ABI. 
Also, any time we need to emulate an instruction, we need to read a fair
amount of guest state, including reading the instruction under %rip.
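A toy model of the range check that such a virtual-address ABI implies:
before dereferencing a guest pointer, a copy routine must confirm the
range doesn't wrap and stays out of the hypervisor region.  The bound
and helper name below are made up for illustration; the real Xen
primitives are copy_from_guest() and friends, with per-guest-type
checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bound: the start of the Xen region described earlier
 * in this thread.  Real Xen uses per-guest-type limits. */
#define GUEST_ACCESSIBLE_END 0xffff800000000000ULL

/* Return non-zero if [va, va+len) is safe to dereference on the
 * guest's behalf. */
static int guest_range_ok(uint64_t va, size_t len)
{
    return va + len >= va &&                  /* no wraparound    */
           va + len <= GUEST_ACCESSIBLE_END;  /* below Xen region */
}
```

A real copy routine would return -EFAULT when this check fails, before
touching the memory at all.
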

> Similarly, to what extent does the dom0 (or other such 
> privileged domain) utilize "foreign memory maps" to reach into another 
> guest's memory? I understand that this is necessary when creating a 
> guest, for live migration, and for QEMU to emulate stuff for HVM guests; 
> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" 
> access a guest's memory?

I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
if it chooses.  There is no "force" about it.

Debuggers and/or Introspection are other reasons why dom0 might choose to
map guest RAM, but I think you've covered the common cases.

> (I ask because the research project I'm working on is seeking to protect 
> guests from a compromised hypervisor and dom0, so I need to limit 
> outside access to a guest's memory to explicitly shared pages that the 
> guest will treat as untrusted - not storing any secrets there, vetting 
> input as necessary, etc.)

Sorry to come along with roadblocks, but how on earth do you intend to
prevent a compromised Xen from accessing guest memory?  A compromised
Xen can do almost anything it likes, and without recourse.  This is
ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
are coming along, because only the hardware itself is in a position to
isolate an untrusted hypervisor/kernel from guest data.

For dom0, that's perhaps easier.  You could reference count the number
of foreign mappings into the domain as it is created, and refuse to
unpause the guest's vcpus until the foreign map count has dropped to 0.

> 4. What facilities/processes does Xen provide for PV(H) guests to 
> explicitly/voluntarily share memory pages with Xen and other domains 
> (dom0, etc.)? From what I can gather from the documentation, it sounds 
> like "grant tables" are involved in this - is that how a PV-aware guest 
> is expected to set up shared memory regions for communication with other 
> domains (ring buffers, etc.)?

Yes.  Grant Tables is Xen's mechanism for the coordinated setup of
shared memory between two consenting domains.

> Does a PV(H) guest need to voluntarily 
> establish all external access to its pages, or is there ever a situation 
> where it's the other way around - where Xen itself establishes/defines a 
> region as shared and the guest is responsible for treating it accordingly?

During domain construction, two grants/events are constructed
automatically.  One is for the xenstore ring, and one is for the console
ring.  The latter is so it can get debugging out from very early code,
while both are, in practice, done like this because the guest has no
a-priori way to establish the grants/events itself.

For all other shared interfaces, the guests are expected to negotiate
which grants/events/rings/details to use via Xenstore.
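To illustrate the granting side, here is a sketch modelled on the v1
grant entry from Xen's public ABI (xen/include/public/grant_table.h):
the guest fills in the target domain and frame first, and only then
sets GTF_permit_access, so the backend never observes a
half-initialised entry.  Treat the barrier and helper as illustrative
rather than as the real guest-side implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Modelled on grant_entry_v1 from the public Xen ABI. */
typedef struct {
    uint16_t flags;   /* GTF_* */
    uint16_t domid;   /* domain being granted access      */
    uint32_t frame;   /* guest frame number being shared  */
} grant_entry_v1_t;

#define GTF_permit_access 1
#define GTF_readonly      4

static void model_grant_access(grant_entry_v1_t *e, uint16_t domid,
                               uint32_t frame, int readonly)
{
    e->domid = domid;
    e->frame = frame;
    /* Order the writes above before publishing the entry; stands in
     * for the real write barrier. */
    __asm__ __volatile__("" ::: "memory");
    e->flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
}
```
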

> Again, this mostly boils down to: under what circumstances, if ever, 
> does Xen ever "force" access to any part of a guest's memory? 
> (Particularly for PV(H). Clearly that must happen for HVM since, by 
> definition, the guest is unaware there's a hypervisor controlling its 
> world and emulating hardware behavior, and thus is in no position to 
> cooperatively/voluntarily give the hypervisor and dom0 access to its 
> memory.)

There are cases for all guest types where Xen will need to emulate
instructions.  Xen will access guest memory in order to perform
architecturally correct actions, which generally starts with reading the
instruction under %rip.

For PV guests, this is almost entirely restricted to guest-kernel
operations which are privileged in nature.  Access to MSRs, writes to
pagetables, etc.

For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
mean a complete absence of emulation.  The Local APIC is emulated by Xen
in most cases, as a bare minimum, but for example, the LMSW instruction
on AMD hardware doesn't have any intercept decoding to help the
hypervisor out when a guest uses the instruction.

~Andrew


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-17 11:04 ` Andrew Cooper
@ 2019-08-22  2:06   ` Johnson, Ethan
  2019-08-22 13:51     ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread
From: Johnson, Ethan @ 2019-08-22  2:06 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel

On 8/17/2019 7:04 AM, Andrew Cooper wrote:
>> Similarly, to what extent does the dom0 (or other such
>> privileged domain) utilize "foreign memory maps" to reach into another
>> guest's memory? I understand that this is necessary when creating a
>> guest, for live migration, and for QEMU to emulate stuff for HVM guests;
>> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly"
>> access a guest's memory?
> I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
> if it chooses.  There is no "force" about it.
>
> Debuggers and/or Introspection are other reasons why dom0 might choose to
> map guest RAM, but I think you've covered the common cases.
>
>> (I ask because the research project I'm working on is seeking to protect
>> guests from a compromised hypervisor and dom0, so I need to limit
>> outside access to a guest's memory to explicitly shared pages that the
>> guest will treat as untrusted - not storing any secrets there, vetting
>> input as necessary, etc.)
> Sorry to come along with roadblocks, but how on earth do you intend to
> prevent a compromised Xen from accessing guest memory?  A compromised
> Xen can do almost anything it likes, and without recourse.  This is
> ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
> are coming along, because only the hardware itself is in a position to
> isolate an untrusted hypervisor/kernel from guest data.
>
> For dom0, that's perhaps easier.  You could reference count the number
> of foreign mappings into the domain as it is created, and refuse to
> unpause the guest's vcpus until the foreign map count has dropped to 0.

We're using a technique where privileged system software (in this case, 
the hypervisor) is compiled to a virtual instruction set (based on LLVM 
IR) that limits its access to hardware features and its view of 
available memory. These limitations are/can be enforced in a variety of 
ways but the main techniques we're employing are software fault 
isolation (i.e., memory loads and stores in privileged code are 
instrumented with checks to ensure they aren't accessing forbidden 
regions), and mediation of page table updates (by modifying privileged 
software to make page table updates through a virtual instruction set 
interface, very similarly to how Xen PV guests make page table updates 
through hypercalls, which gives Xen the opportunity to ensure mappings 
aren't made to protected regions).
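A minimal sketch of such a store check (the region bounds and helper
are entirely hypothetical; in practice the checks would be emitted by
the compiler over LLVM IR, not written by hand):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical protected region that instrumented hypervisor code
 * must not touch: [PROT_START, PROT_END). */
#define PROT_START 0x40000000ULL
#define PROT_END   0x80000000ULL

/* The check inserted before each store by the SFI instrumentation:
 * the access is allowed only if it lies entirely outside the
 * protected region. */
static int sfi_store_allowed(uint64_t addr, size_t size)
{
    return addr + size <= PROT_START || addr >= PROT_END;
}
```
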

Our technique is based on that used by the "Virtual Ghost" project (see 
https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF 
link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf), 
which does something similar to protect applications from a compromised 
operating system kernel without relying on something like a hypervisor 
operating at a higher privileged level. We're looking to extend that 
approach to hypervisors to protect guest VMs from a compromised hypervisor.

>> Again, this mostly boils down to: under what circumstances, if ever,
>> does Xen ever "force" access to any part of a guest's memory?
>> (Particularly for PV(H). Clearly that must happen for HVM since, by
>> definition, the guest is unaware there's a hypervisor controlling its
>> world and emulating hardware behavior, and thus is in no position to
>> cooperatively/voluntarily give the hypervisor and dom0 access to its
>> memory.)
> There are cases for all guest types where Xen will need to emulate
> instructions.  Xen will access guest memory in order to perform
> architecturally correct actions, which generally starts with reading the
> instruction under %rip.
>
> For PV guests, this is almost entirely restricted to guest-kernel
> operations which are privileged in nature.  Access to MSRs, writes to
> pagetables, etc.
>
> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
> mean a complete absence of emulation.  The Local APIC is emulated by Xen
> in most cases, as a bare minimum, but for example, the LMSW instruction
> on AMD hardware doesn't have any intercept decoding to help the
> hypervisor out when a guest uses the instruction.
>
> ~Andrew

I've found a number of files in the Xen source tree which seem to be 
related to instruction/x86 platform emulation:

arch/x86/x86_emulate.c
arch/x86/hvm/emulate.c
arch/x86/hvm/vmx/realmode.c
arch/x86/hvm/svm/emulate.c
arch/x86/pv/emulate.c
arch/x86/pv/emul-priv-op.c
arch/x86/x86_emulate/x86_emulate.c

The last of these, in particular, looks especially hairy (it seems to 
support emulation of essentially the entire x86 instruction set through 
a quite impressive edifice of switch statements).

How does all of this fit into the big picture of how Xen virtualizes the 
different types of VMs (PV/HVM/PVH)?

My impression (from reading the original "Xen and the Art of 
Virtualization" SOSP '03 paper that describes the basic architecture) 
had been that PV guests, in particular, used hypercalls in place of all 
privileged operations that the guest kernel would otherwise need to 
execute in ring 0; and that all other (unprivileged) operations could 
execute natively on the CPU without requiring emulation. From what 
you're saying (and what I'm seeing in the source code), though, it 
sounds like in reality things are a bit fuzzier - that there are some 
operations that Xen traps and emulates instead of explicitly 
paravirtualizing.

Likewise, the Xen design described in the SOSP paper discussed guest I/O 
as something that's fully paravirtualized, taking place not through 
emulation of either memory-mapped or port I/O but rather through ring 
buffers shared between the guest and dom0 via grant tables. I was a bit 
confused to find I/O emulation code under arch/x86/pv (see e.g. 
arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and 
the like. Is this another example of things being fuzzier in reality 
than in the "theoretical" PV design? What devices, if any, are emulated 
rather than paravirtualized for a PV guest? I know that for PVH, you 
mentioned that the Local APIC is (at a minimum) emulated, along with 
some special instructions; is that true for classic PV as well?

For HVM, obviously anything that can't be virtualized natively by the 
hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't 
expected to be cooperative to issue PV hypercalls instead); but I would 
expect emulation to be limited to the relatively small subset of the ISA 
that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c 
supports emulating just about everything. Under what circumstances does 
Xen actually need to put all that emulation code to use?

I'm also wondering just how much of this is Xen's responsibility vs. 
QEMU's. I understand that when QEMU is used on its own (i.e., not with 
Xen), it uses dynamic binary recompilation to handle the parts of the 
ISA that can't be virtualized natively in lower-privilege modes. Does 
Xen only use QEMU for emulating off-CPU devices (interrupt controller, 
non-paravirtualized disk/network/graphics/etc.), or does it ever employ 
any of QEMU's x86 emulation support in addition to Xen's own emulation code?

Is there any particular place in the code where I can go to get a 
comprehensive "list" (or other such summary) of which parts of the ISA 
and off-CPU system are emulated for each respective guest type (PV, HVM, 
and PVH)? I realize that the difference between HVM and PVH is more of a 
continuum than a line; what I'm especially interested in is, what's the 
*bare minimum* of emulation required for a PVH guest that's using as 
much paravirtualization as possible? (That's the setting I'm looking to 
target for my research on protecting guests from a compromised 
hypervisor, since I'm trying to minimize the scope of interactions 
between the guest and hypervisor/dom0 that our virtual instruction set 
layer needs to mediate.)


On a somewhat related note, I also have a question about a particular 
piece of code in arch/x86/pv/emul-priv-op.c, namely the function 
io_emul_stub_setup(). It looks like it is, at runtime, crafting a 
function that switches to the guest register context, emulates a 
particular I/O operation, then switches back to the host register 
context. This caught our attention while we were implementing Control 
Flow Integrity (CFI) instrumentation for Xen (which is necessary for us 
to enforce the software fault isolation (SFI) instrumentation that 
provides our memory protections). Why does Xen use dynamically-generated 
code here? Is it just for implementation convenience (i.e., to improve 
the generalizability of the code)?

Thanks again for all your time and effort spent answering my questions. 
I know I'm throwing a lot of unusual questions out there - this 
back-and-forth has been very helpful for me in figuring out *what* 
questions I need to be asking in the first place to understand what's 
feasible to do in the Xen architecture and how I might go about doing 
it. :-)

Thanks,
Ethan Johnson

-- 
Ethan J. Johnson
Computer Science PhD student, Systems group, University of Rochester
ejohns48@cs.rochester.edu
ethanjohnson@acm.org
PGP public key available from public directory or on request


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22  2:06   ` Johnson, Ethan
@ 2019-08-22 13:51     ` Andrew Cooper
  2019-08-22 15:06       ` Rian Quinn
                         ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrew Cooper @ 2019-08-22 13:51 UTC (permalink / raw)
  To: Johnson, Ethan, xen-devel

On 22/08/2019 03:06, Johnson, Ethan wrote:
> On 8/17/2019 7:04 AM, Andrew Cooper wrote:
>>> Similarly, to what extent does the dom0 (or other such
>>> privileged domain) utilize "foreign memory maps" to reach into another
>>> guest's memory? I understand that this is necessary when creating a
>>> guest, for live migration, and for QEMU to emulate stuff for HVM guests;
>>> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly"
>>> access a guest's memory?
>> I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
>> if it chooses.  There is no "force" about it.
>>
>> Debuggers and/or Introspection are other reasons why dom0 might choose to
>> map guest RAM, but I think you've covered the common cases.
>>
>>> (I ask because the research project I'm working on is seeking to protect
>>> guests from a compromised hypervisor and dom0, so I need to limit
>>> outside access to a guest's memory to explicitly shared pages that the
>>> guest will treat as untrusted - not storing any secrets there, vetting
>>> input as necessary, etc.)
>> Sorry to come along with roadblocks, but how on earth do you intend to
>> prevent a compromised Xen from accessing guest memory?  A compromised
>> Xen can do almost anything it likes, and without recourse.  This is
>> ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
>> are coming along, because only the hardware itself is in a position to
>> isolate an untrusted hypervisor/kernel from guest data.
>>
>> For dom0, that's perhaps easier.  You could reference count the number
>> of foreign mappings into the domain as it is created, and refuse to
>> unpause the guest's vcpus until the foreign map count has dropped to 0.
> We're using a technique where privileged system software (in this case, 
> the hypervisor) is compiled to a virtual instruction set (based on LLVM 
> IR) that limits its access to hardware features and its view of 
> available memory. These limitations are/can be enforced in a variety of 
> ways but the main techniques we're employing are software fault 
> isolation (i.e., memory loads and stores in privileged code are 
> instrumented with checks to ensure they aren't accessing forbidden 
> regions), and mediation of page table updates (by modifying privileged 
> software to make page table updates through a virtual instruction set 
> interface, very similarly to how Xen PV guests make page table updates 
> through hypercalls, which gives Xen the opportunity to ensure mappings 
> aren't made to protected regions).
>
> Our technique is based on that used by the "Virtual Ghost" project (see 
> https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF 
> link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf), 
> which does something similar to protect applications from a compromised 
> operating system kernel without relying on something like a hypervisor 
> operating at a higher privileged level. We're looking to extend that 
> approach to hypervisors to protect guest VMs from a compromised hypervisor.

I have come across that paper before.

The extra language safety (which is effectively what this is) should
make it harder to compromise the hypervisor (and this is certainly a
good thing), but nothing at this level will get in the way of an
actually-compromised piece of ring 0 code from doing whatever it wants.

Suffice it to say that I'll be delighted if someone manages to prove me
wrong.

>
>>> Again, this mostly boils down to: under what circumstances, if ever,
>>> does Xen ever "force" access to any part of a guest's memory?
>>> (Particularly for PV(H). Clearly that must happen for HVM since, by
>>> definition, the guest is unaware there's a hypervisor controlling its
>>> world and emulating hardware behavior, and thus is in no position to
>>> cooperatively/voluntarily give the hypervisor and dom0 access to its
>>> memory.)
>> There are cases for all guest types where Xen will need to emulate
>> instructions.  Xen will access guest memory in order to perform
>> architecturally correct actions, which generally starts with reading the
>> instruction under %rip.
>>
>> For PV guests, this is almost entirely restricted to guest-kernel
>> operations which are privileged in nature.  Access to MSRs, writes to
>> pagetables, etc.
>>
>> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
>> mean a complete absence of emulation.  The Local APIC is emulated by Xen
>> in most cases, as a bare minimum, but for example, the LMSW instruction
>> on AMD hardware doesn't have any intercept decoding to help the
>> hypervisor out when a guest uses the instruction.
>>
>> ~Andrew
> I've found a number of files in the Xen source tree which seem to be 
> related to instruction/x86 platform emulation:
>
> arch/x86/x86_emulate.c
> arch/x86/hvm/emulate.c
> arch/x86/hvm/vmx/realmode.c
> arch/x86/hvm/svm/emulate.c
> arch/x86/pv/emulate.c
> arch/x86/pv/emul-priv-op.c
> arch/x86/x86_emulate/x86_emulate.c
>
> The last of these, in particular, looks especially hairy (it seems to 
> support emulation of essentially the entire x86 instruction set through 
> a quite impressive edifice of switch statements).

Lovely, isn't it.  For Introspection, we need to be able to emulate an
instruction which took a permission fault (including No Execute), was
sent to the analysis engine, and deemed ok to continue.

Other users of emulation are arch/x86/pv/ro-page-fault.c and
arch/x86/mm/shadow/multi.c

That said, most of these can be ignored in common cases.  vmx/realmode.c
is only for pre-Westmere Intel CPUs which lack the unrestricted_guest
feature.  svm/emulate.c is only for K8 hardware which lacks the NRIPS
feature.

> How does all of this fit into the big picture of how Xen virtualizes the 
> different types of VMs (PV/HVM/PVH)?

Consider this "core x86 support".  All areas which need to emulate an
instruction for whatever reason use this function.  (We previously had
multiple areas of code each doing subsets of x86 instruction
decode/execute, and it was an even bigger mess.)

> My impression (from reading the original "Xen and the Art of 
> Virtualization" SOSP '03 paper that describes the basic architecture) 
> had been that PV guests, in particular, used hypercalls in place of all 
> privileged operations that the guest kernel would otherwise need to 
> execute in ring 0; and that all other (unprivileged) operations could 
> execute natively on the CPU without requiring emulation. From what 
> you're saying (and what I'm seeing in the source code), though, it 
> sounds like in reality things are a bit fuzzier - that there are some 
> operations that Xen traps and emulates instead of explicitly 
> paravirtualizing.

Correct.  Few theories survive contact with the real world.

Some emulation, such as writeable_pagetable support, was added to make it
easier to port guests to PV.  In this case, writes to pagetables are
trapped and emulated, as if an equivalent hypercall had been made. 
Sure, it's slower than the hypercall, but it's far easier to get started with.

Some emulation is a consequence of CPUs changing in the 16 years
since that paper was published, and some emulation is a stopgap for
things which really should be paravirtualised properly.  A whole load of
speculative-security work fits into this category, as we haven't had time
to fix it nicely, following the panic of simply fixing it safely.

> Likewise, the Xen design described in the SOSP paper discussed guest I/O 
> as something that's fully paravirtualized, taking place not through 
> emulation of either memory-mapped or port I/O but rather through ring 
> buffers shared between the guest and dom0 via grant tables.

This is still correct and accurate.  Paravirtual split front/back driver
pairs for network and block are by far the most efficient way of
shuffling data in and out of the VM.

> I was a bit 
> confused to find I/O emulation code under arch/x86/pv (see e.g. 
> arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and 
> the like. Is this another example of things being fuzzier in reality 
> than in the "theoretical" PV design?

This is "general x86 architecture".  Xen handles all exceptions,
including from PV userspace (possibly being naughty), so at a bare
minimum needs to filter those which should be handed to the guest kernel
to deal with.

When it comes to x86 Port IO, it is a critical point of safety that Xen
runs with IOPL set to 0, or a guest kernel could modify the real
interrupt flag with a popf instruction.  As a result, all `in` and `out`
instructions trap with a #GP fault.

Guest userspace could use iopl() to logically gain access to IO
ports, after which `in` and `out` instructions would not fault.  Also,
these instructions don't fault in kernel context.  In both cases, Xen
has to filter between actually passing the IO request to hardware (if
the guest is suitably configured), or terminating it with default
behaviour, so it fails in a manner consistent with how x86 behaves.

For VT-x/SVM guests, filtering of #GP faults happens before the VMExit
so Xen doesn't have to handle those, but still has to handle all IO
accesses which are fine (permission wise) according to the guest kernel.

> What devices, if any, are emulated rather than paravirtualized for a PV guest?

Look for XEN_X86_EMU_* throughout the code.  Those are all the discrete
devices which Xen may emulate, for both kinds of guests.  There is a
very restricted set of valid combinations.

PV dom0s get an emulated PIT which partially forwards to real hardware.
ISTR it is legacy support for some laptops where DRAM refresh was still
configured off timer 1.  I doubt it is relevant these days.

> I know that for PVH, you 
> mentioned that the Local APIC is (at a minimum) emulated, along with 
> some special instructions; is that true for classic PV as well?

Classic PV guests don't get a Local APIC.  They are required to use the
event channel interface instead.

> For HVM, obviously anything that can't be virtualized natively by the 
> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't 
> expected to be cooperative to issue PV hypercalls instead); but I would 
> expect emulation to be limited to the relatively small subset of the ISA 
> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c 
> supports emulating just about everything. Under what circumstances does 
> Xen actually need to put all that emulation code to use?

Introspection, as I said earlier, which is potentially any instruction.

MMIO regions (including to the Local APIC when it is in xAPIC mode, and
hardware acceleration isn't available) can be the target of any
instruction with a memory operand.  While mov is by far the most common
instruction, other instructions such as and/or/xadd are used in some
cases.  Various vector moves (movups/movaps/movnti) are very
common with framebuffers.

The cfc/cf8 IO ports are used for PCI Config space accesses, which all
kernels try to use, and any kernel with real devices needs to use.  The
alternative is the MMCFG scheme, which is plain MMIO as above.

> I'm also wondering just how much of this is Xen's responsibility vs. 
> QEMU's. I understand that when QEMU is used on its own (i.e., not with 
> Xen), it uses dynamic binary recompilation to handle the parts of the 
> ISA that can't be virtualized natively in lower-privilege modes. Does 
> Xen only use QEMU for emulating off-CPU devices (interrupt controller, 
> non-paravirtualized disk/network/graphics/etc.), or does it ever employ 
> any of QEMU's x86 emulation support in addition to Xen's own emulation code?

We only use QEMU for off-CPU devices.  For performance reasons, some of
the interrupt emulation (the IO-APIC in particular) and timer emulation
(HPET, PIT) is done in Xen, even though it would logically be part of the
motherboard if we were looking for a clean delineation of where Xen
stops and QEMU starts.

> Is there any particular place in the code where I can go to get a 
> comprehensive "list" (or other such summary) of which parts of the ISA 
> and off-CPU system are emulated for each respective guest type (PV, HVM, 
> and PVH)?

XEN_X86_EMU_* should cover you here.

> I realize that the difference between HVM and PVH is more of a 
> continuum than a line; what I'm especially interested in is, what's the 
> *bare minimum* of emulation required for a PVH guest that's using as 
> much paravirtualization as possible? (That's the setting I'm looking to 
> target for my research on protecting guests from a compromised 
> hypervisor, since I'm trying to minimize the scope of interactions 
> between the guest and hypervisor/dom0 that our virtual instruction set 
> layer needs to mediate.)

If you are using PVH guests on not-ancient hardware, can persuade the
guest kernel to use x2APIC mode, and avoid any ins/outs instructions,
then you just might be able to get away without any x86_emulate() at all.

x2APIC mode has an MSR-based interface rather than an MMIO interface,
which means that the VMExit intercept information alone is sufficient to
work out exactly what to do, and ins/outs are the only other instructions
(which come to mind) liable to trap and need emulator support above and
beyond the intercept information.

That said, whatever you do here is going to have to cope with dom0 and
all the requirements for keeping the system running.  Depending on
exactly how you're approaching the problem, it might be possible to
declare that out of scope and leave it to one side.

> On a somewhat related note, I also have a question about a particular 
> piece of code in arch/x86/pv/emul-priv-op.c, namely the function 
> io_emul_stub_setup(). It looks like it is, at runtime, crafting a 
> function that switches to the guest register context, emulates a 
> particular I/O operation, then switches back to the host register 
> context. This caught our attention while we were implementing Control 
> Flow Integrity (CFI) instrumentation for Xen (which is necessary for us 
> to enforce the software fault isolation (SFI) instrumentation that 
> provides our memory protections). Why does Xen use dynamically-generated 
> code here? Is it just for implementation convenience (i.e., to improve 
> the generalizability of the code)?

This mechanism is for dom0 only, and exists because some firmware is
terrible.

Some AML in ACPI tables uses an IO port to generate an SMI, and has an
API which uses the GPRs.  It turns out things go rather wrong when Xen
intercepts the IO instruction and replays it to hardware in Xen's GPR
context, rather than the guest kernel's.

This bodge swaps Xen's and dom0's GPRs just around the IO instruction,
so the SMI API gets its parameters properly, and the results get fed
back properly into AML.

There is a related hypercall, SCHEDOP_pin_override, used by dom0,
because sometimes the AML really does need to execute on CPU0, and not
wherever dom0's vcpu0 happens to be executing.

> Thanks again for all your time and effort spent answering my questions. 
> I know I'm throwing a lot of unusual questions out there - this 
> back-and-forth has been very helpful for me in figuring out *what* 
> questions I need to be asking in the first place to understand what's 
> feasible to do in the Xen architecture and how I might go about doing 
> it. :-)

Not a problem in the slightest.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 13:51     ` Andrew Cooper
@ 2019-08-22 15:06       ` Rian Quinn
  2019-08-22 22:42         ` Andrew Cooper
  2019-08-22 17:36       ` Tamas K Lengyel
  2019-08-22 20:57       ` Rich Persaud
  2 siblings, 1 reply; 13+ messages in thread
From: Rian Quinn @ 2019-08-22 15:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Johnson, Ethan

[-- Attachment #1.1: Type: text/plain, Size: 16578 bytes --]

I can at least confirm that no emulation is needed to execute a Linux
guest, even with the Xen PVH interface, but I don't think that works out of
the box with Xen today; it is something we are currently working on, and we
will hopefully have some more data near the end of the year. x2APIC helps,
but it currently takes some work to convince Linux to use it. The trick is
to avoid PortIO and, where possible, MMIO interfaces.

Rian

On Thu, Aug 22, 2019 at 1:53 PM Andrew Cooper <andrew.cooper3@citrix.com>
wrote:

> On 22/08/2019 03:06, Johnson, Ethan wrote:
> > On 8/17/2019 7:04 AM, Andrew Cooper wrote:
> >>> Similarly, to what extent does the dom0 (or other such
> >>> privileged domain) utilize "foreign memory maps" to reach into another
> >>> guest's memory? I understand that this is necessary when creating a
> >>> guest, for live migration, and for QEMU to emulate stuff for HVM
> guests;
> >>> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly"
> >>> access a guest's memory?
> >> I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
> >> if it chooses.  There is no "force" about it.
> >>
> >> Debuggers and/or Introspection are other reasons why dom0 might chose to
> >> map guest RAM, but I think you've covered the common cases.
> >>
> >>> (I ask because the research project I'm working on is seeking to
> protect
> >>> guests from a compromised hypervisor and dom0, so I need to limit
> >>> outside access to a guest's memory to explicitly shared pages that the
> >>> guest will treat as untrusted - not storing any secrets there, vetting
> >>> input as necessary, etc.)
> >> Sorry to come along with roadblocks, but how on earth do you intend to
> >> prevent a compromised Xen from accessing guest memory?  A compromised
> >> Xen can do almost anything it likes, and without recourse.  This is
> >> ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
> >> are coming along, because only the hardware itself is in a position to
> >> isolate an untrusted hypervisor/kernel from guest data.
> >>
> >> For dom0, that's perhaps easier.  You could reference count the number
> >> of foreign mappings into the domain as it is created, and refuse to
> >> unpause the guests vcpus until the foreign map count has dropped to 0.
> > We're using a technique where privileged system software (in this case,
> > the hypervisor) is compiled to a virtual instruction set (based on LLVM
> > IR) that limits its access to hardware features and its view of
> > available memory. These limitations are/can be enforced in a variety of
> > ways but the main techniques we're employing are software fault
> > isolation (i.e., memory loads and stores in privileged code are
> > instrumented with checks to ensure they aren't accessing forbidden
> > regions), and mediation of page table updates (by modifying privileged
> > software to make page table updates through a virtual instruction set
> > interface, very similarly to how Xen PV guests make page table updates
> > through hypercalls which gives Xen the opportunity to ensure mappings
> > aren't made to protected regions).
> >
> > Our technique is based on that used by the "Virtual Ghost" project (see
> > https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF
> > link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf),
> > which does something similar to protect applications from a compromised
> > operating system kernel without relying on something like a hypervisor
> > operating at a higher privileged level. We're looking to extend that
> > approach to hypervisors to protect guest VMs from a compromised
> hypervisor.
>
> I have come across that paper before.
>
> The extra language safety (which is effectively what this is) should
> make it harder to compromise the hypervisor (and this is certainly a
> good thing), but nothing at this level will get in the way of an
> actually-compromised piece of ring 0 code from doing whatever it wants.
>
> Suffice it to say that I'll be delighted if someone managed to
> demonstrate me wrong.
>
> >
> >>> Again, this mostly boils down to: under what circumstances, if ever,
> >>> does Xen ever "force" access to any part of a guest's memory?
> >>> (Particularly for PV(H). Clearly that must happen for HVM since, by
> >>> definition, the guest is unaware there's a hypervisor controlling its
> >>> world and emulating hardware behavior, and thus is in no position to
> >>> cooperatively/voluntarily give the hypervisor and dom0 access to its
> >>> memory.)
> >> There are cases for all guest types where Xen will need to emulate
> >> instructions.  Xen will access guest memory in order to perfom
> >> architecturally correct actions, which generally starts with reading the
> >> instruction under %rip.
> >>
> >> For PV guests, this almost entirely restricted to guest-kernel
> >> operations which are privileged in nature.  Access to MSRs, writes to
> >> pagetables, etc.
> >>
> >> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
> >> be a complete absence of emulation.  The Local APIC is emulated by Xen
> >> in most cases, as a bare minimum, but for example, the LMSW instruction
> >> on AMD hardware doesn't have any intercept decoding to help the
> >> hypervisor out when a guest uses the instruction.
> >>
> >> ~Andrew
> > I've found a number of files in the Xen source tree which seem to be
> > related to instruction/x86 platform emulation:
> >
> > arch/x86/x86_emulate.c
> > arch/x86/hvm/emulate.c
> > arch/x86/hvm/vmx/realmode.c
> > arch/x86/hvm/svm/emulate.c
> > arch/x86/pv/emulate.c
> > arch/x86/pv/emul-priv-op.c
> > arch/x86/x86_emulate/x86_emulate.c
> >
> > The last of these, in particular, looks especially hairy (it seems to
> > support emulation of essentially the entire x86 instruction set through
> > a quite impressive edifice of switch statements).
>
> Lovely, isn't it.  For Introspection, we need to be able to emulate an
> instruction which took a permission fault (including No Execute), was
> sent to the analysis engine, and deemed ok to continue.
>
> Other users of emulation are arch/x86/pv/ro-page-fault.c and
> arch/x86/mm/shadow/multi.c
>
> That said, most of these can be ignored in common cases.  vmx/realmode.c
> is only for pre-Westmere Intel CPUs which lack the unrestricted_guest
> feature.  svm/emulate.c is only for K8 hardware which lacks the NRIPS
> feature.
>
> > How does all of this fit into the big picture of how Xen virtualizes the
> > different types of VMs (PV/HVM/PVH)?
>
> Consider this "core x86 support".  All areas which need to emulate an
> instruction for whatever reason use this function.  (We previously had
> multiple areas of code each doing subsets of x86 instruction
> decode/execute, and it was an even bigger mess.)
>
> > My impression (from reading the original "Xen and the Art of
> > Virtualization" SOSP '03 paper that describes the basic architecture)
> > had been that PV guests, in particular, used hypercalls in place of all
> > privileged operations that the guest kernel would otherwise need to
> > execute in ring 0; and that all other (unprivileged) operations could
> > execute natively on the CPU without requiring emulation. From what
> > you're saying (and what I'm seeing in the source code), though, it
> > sounds like in reality things are a bit fuzzier - that there are some
> > operations that Xen traps and emulates instead of explicitly
> > paravirtualizing.
>
> Correct.  Few theories survive contact with the real world.
>
> Some emulation, such as writeable_pagetable support was added to make it
> easier to port guests to being PV.  In this case, writes to pagetables
> are trapped an emulated, as if an equivalent hypercall had been made.
> Sure, its slower than the hypercall, but its far easier to get started
> with.
>
> Some emulation is a consequence of of CPUs changing in the 16 years
> since that paper was published, and some emulation is a stopgap for
> things which really should be paravirtualised properly.  A whole load of
> speculative security fits into this category, as we haven't had time to
> fix it nicely, following the panic of simply fixing it safely.
>
> > Likewise, the Xen design described in the SOSP paper discussed guest I/O
> > as something that's fully paravirtualized, taking place not through
> > emulation of either memory-mapped or port I/O but rather through ring
> > buffers shared between the guest and dom0 via grant tables.
>
> This is still correct and accurate.  Paravirtual split front/back driver
> pairs for network and block are by far the most efficient way of
> shuffling data in and out of the VM.
>
> > I was a bit
> > confused to find I/O emulation code under arch/x86/pv (see e.g.
> > arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and
> > the like. Is this another example of things being fuzzier in reality
> > than in the "theoretical" PV design?
>
> This is "general x86 architecture".  Xen handles all exceptions,
> including from PV userspace (possibly being naughty), so at a bare
> minimum needs to filter those which should be handed to the guest kernel
> to deal with.
>
> When it comes to x86 Port IO, it is a critical point of safety that Xen
> runs with IOPL set to 0, or a guest kernel could modify the real
> interrupt flag with a popf instruction.  As a result, all `in` and `out`
> instructions trap with a #GP fault.
>
> Guest userspace could use use iopl() to logically gain access to IO
> ports, after which `in` and `out` instructions would not fault.  Also,
> these instructions don't fault in kernel context.  In both cases, Xen
> has to filter between actually passing the IO request to hardware (if
> the guest is suitably configured), or terminating it defaults, so it
> fails in a manner consistent with how x86 behaves.
>
> For VT-x/SVM guests, filtering of #GP faults happens before the VMExit
> so Xen doesn't have to handle those, but still has to handle all IO
> accesses which are fine (permission wise) according to the guest kernel.
>
> > What devices, if any, are emulated rather than paravirtualized for a PV
> guest?
>
> Look for XEN_X86_EMU_* throughout the code.  Those are all the discrete
> devices which Xen may emulate, for both kinds of guests.  There is a
> very restricted set of valid combinations.
>
> PV dom0's get an emulated PIT to partially forward to real hardware.
> ISTR it is legacy for some laptops where DRAM refresh was still
> configured off timer 1.  I doubt it is revenant these days.
>
> > I know that for PVH, you
> > mentioned that the Local APIC is (at a minimum) emulated, along with
> > some special instructions; is that true for classic PV as well?
>
> Classic PV guests don't get a Local APIC.  They are required to use the
> event channel interface instead.
>
> > For HVM, obviously anything that can't be virtualized natively by the
> > hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
> > expected to be cooperative to issue PV hypercalls instead); but I would
> > expect emulation to be limited to the relatively small subset of the ISA
> > that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
> > supports emulating just about everything. Under what circumstances does
> > Xen actually need to put all that emulation code to use?
>
> Introspection, as I said earlier, which is potentially any instruction.
>
> MMIO regions (including to the Local APIC when it is in xAPIC mode, and
> hardware acceleration isn't available) can be the target of any
> instruction with a memory operand.  While mov is by far the most common
> instruction, other instructions such as and/or/xadd are used in some
> cases.  Various of the vector moves (movups/movaps/movnti) are very
> common with framebuffers.
>
> The cfc/cf8 IO ports are used for PCI Config space accesses, which all
> kernels try to use, and any kernel with real devices need to use.  The
> alternative is the the MMCFG scheme which is plain MMIO as above.
>
> > I'm also wondering just how much of this is Xen's responsibility vs.
> > QEMU's. I understand that when QEMU is used on its own (i.e., not with
> > Xen), it uses dynamic binary recompilation to handle the parts of the
> > ISA that can't be virtualized natively in lower-privilege modes. Does
> > Xen only use QEMU for emulating off-CPU devices (interrupt controller,
> > non-paravirtualized disk/network/graphics/etc.), or does it ever employ
> > any of QEMU's x86 emulation support in addition to Xen's own emulation
> code?
>
> We only use QEMU for off-CPU devices.  For performance reasons, some of
> the interrupt emulation (IO-APIC in particular), and timer emulation
> (HPET, PIT) is done in Xen, even when it would locally be part of the
> motherboard if we were looking for a clear delineation of where Xen
> stops and QEMU starts.
>
> > Is there any particular place in the code where I can go to get a
> > comprehensive "list" (or other such summary) of which parts of the ISA
> > and off-CPU system are emulated for each respective guest type (PV, HVM,
> > and PVH)?
>
> XEN_X86_EMU_* should cover you here.
>
> > I realize that the difference between HVM and PVH is more of a
> > continuum than a line; what I'm especially interested in is, what's the
> > *bare minimum* of emulation required for a PVH guest that's using as
> > much paravirtualization as possible? (That's the setting I'm looking to
> > target for my research on protecting guests from a compromised
> > hypervisor, since I'm trying to minimize the scope of interactions
> > between the guest and hypervisor/dom0 that our virtual instruction set
> > layer needs to mediate.)
>
> If you are using PVH guests, on not-ancient hardware, and you can
> persuade the guest kernel to use x2APIC mode, and without using any
> ins/outs instructions, then you just might be able to get away without
> any x86_emulate() at all.
>
> x2APIC mode has an MSR-based interface rather than an MMIO interface,
> which means that the VMExit intercept information alone is sufficient to
> work out exactly what to do, and ins/outs is the only other instructions
> (which come to mind) liable to trap and need emulator support above and
> beyond the intercept information.
>
> That said, whatever you do here is going to have to cope with dom0 and
> all the requirements for keeping the system running.  Depending on
> exactly how you're approaching the problem, it might be possible to
> declare that out of scope and leave it to one side.
>
> > On a somewhat related note, I also have a question about a particular
> > piece of code in arch/x86/pv/emul-priv-op.c, namely the function
> > io_emul_stub_setup(). It looks like it is, at runtime, crafting a
> > function that switches to the guest register context, emulates a
> > particular I/O operation, then switches back to the host register
> > context. This caught our attention while we were implementing Control
> > Flow Integrity (CFI) instrumentation for Xen (which is necessary for us
> > to enforce the software fault isolation (SFI) instrumentation that
> > provides our memory protections). Why does Xen use dynamically-generated
> > code here? Is it just for implementation convenience (i.e., to improve
> > the generalizability of the code)?
>
> This mechanism is for dom0 only, and exists because some firmware is
> terrible.
>
> Some AML in ACPI tables uses an IO port to generate an SMI, and has an
> API which uses the GPRs.  It turns out things go rather wrong when Xen
> intercepts the IO instruction, and replays it to hardware in Xen's GPR
> context, rather than the guest kernels.
>
> This bodge swaps Xen's and dom0's GPRs just around the IO instruction,
> so the SMI API gets its parameters properly, and the results get fed
> back properly into AML.
>
> There is a related hypercall, SCHEDOP_pin_override, used by dom0,
> because sometimes the AML really does need to execute on CPU0, and not
> wherever dom0's vcpu0 happens to be executing.
>
> > Thanks again for all your time and effort spent answering my questions.
> > I know I'm throwing a lot of unusual questions out there - this
> > back-and-forth has been very helpful for me in figuring out *what*
> > questions I need to be asking in the first place to understand what's
> > feasible to do in the Xen architecture and how I might go about doing
> > it. :-)
>
> Not a problem in the slightest.
>
> ~Andrew
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel

[-- Attachment #1.2: Type: text/html, Size: 19325 bytes --]

<div dir="ltr"><div>I can at least confirm that no emulation is needed to execute a Linux guest, even with the Xen PVH interface, but I don&#39;t think that works out of the box today with Xen, something we are currently working on and will hopefully have some more data near the end of the year. x2APIC helps, but it takes some work to convince Linux to use that currently. The trick is to avoid PortIO and, where possible, MMIO interfaces. <br></div><div><br></div><div>Rian<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 22, 2019 at 1:53 PM Andrew Cooper &lt;<a href="mailto:andrew.cooper3@citrix.com">andrew.cooper3@citrix.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 22/08/2019 03:06, Johnson, Ethan wrote:<br>
&gt; On 8/17/2019 7:04 AM, Andrew Cooper wrote:<br>
&gt;&gt;&gt; Similarly, to what extent does the dom0 (or other such<br>
&gt;&gt;&gt; privileged domain) utilize &quot;foreign memory maps&quot; to reach into another<br>
&gt;&gt;&gt; guest&#39;s memory? I understand that this is necessary when creating a<br>
&gt;&gt;&gt; guest, for live migration, and for QEMU to emulate stuff for HVM guests;<br>
&gt;&gt;&gt; but for PVH, is it ever necessary for Xen or the dom0 to &quot;forcibly&quot;<br>
&gt;&gt;&gt; access a guest&#39;s memory?<br>
&gt;&gt; I&#39;m not sure what you mean by forcibly.  Dom0 has the ability to do so,<br>
&gt;&gt; if it chooses.  There is no &quot;force&quot; about it.<br>
&gt;&gt;<br>
&gt;&gt; Debuggers and/or Introspection are other reasons why dom0 might chose to<br>
&gt;&gt; map guest RAM, but I think you&#39;ve covered the common cases.<br>
&gt;&gt;<br>
&gt;&gt;&gt; (I ask because the research project I&#39;m working on is seeking to protect<br>
&gt;&gt;&gt; guests from a compromised hypervisor and dom0, so I need to limit<br>
&gt;&gt;&gt; outside access to a guest&#39;s memory to explicitly shared pages that the<br>
&gt;&gt;&gt; guest will treat as untrusted - not storing any secrets there, vetting<br>
&gt;&gt;&gt; input as necessary, etc.)<br>
&gt;&gt; Sorry to come along with roadblocks, but how on earth do you intend to<br>
&gt;&gt; prevent a compromised Xen from accessing guest memory?  A compromised<br>
&gt;&gt; Xen can do almost anything it likes, and without recourse.  This is<br>
&gt;&gt; ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM<br>
&gt;&gt; are coming along, because only the hardware itself is in a position to<br>
&gt;&gt; isolate an untrusted hypervisor/kernel from guest data.<br>
&gt;&gt;<br>
&gt;&gt; For dom0, that&#39;s perhaps easier.  You could reference count the number<br>
&gt;&gt; of foreign mappings into the domain as it is created, and refuse to<br>
&gt;&gt; unpause the guests vcpus until the foreign map count has dropped to 0.<br>
&gt; We&#39;re using a technique where privileged system software (in this case, <br>
&gt; the hypervisor) is compiled to a virtual instruction set (based on LLVM <br>
&gt; IR) that limits its access to hardware features and its view of <br>
&gt; available memory. These limitations are/can be enforced in a variety of <br>
&gt; ways but the main techniques we&#39;re employing are software fault <br>
&gt; isolation (i.e., memory loads and stores in privileged code are <br>
&gt; instrumented with checks to ensure they aren&#39;t accessing forbidden <br>
&gt; regions), and mediation of page table updates (by modifying privileged <br>
&gt; software to make page table updates through a virtual instruction set <br>
&gt; interface, very similarly to how Xen PV guests make page table updates <br>
&gt; through hypercalls which gives Xen the opportunity to ensure mappings <br>
&gt; aren&#39;t made to protected regions).<br>
&gt;<br>
&gt; Our technique is based on that used by the &quot;Virtual Ghost&quot; project (see <br>
&gt; <a href="https://dl.acm.org/citation.cfm?id=2541986" rel="noreferrer" target="_blank">https://dl.acm.org/citation.cfm?id=2541986</a> for the paper; direct PDF <br>
&gt; link: <a href="http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf" rel="noreferrer" target="_blank">http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf</a>), <br>
&gt; which does something similar to protect applications from a compromised <br>
&gt; operating system kernel without relying on something like a hypervisor <br>
&gt; operating at a higher privileged level. We&#39;re looking to extend that <br>
&gt; approach to hypervisors to protect guest VMs from a compromised hypervisor.<br>
<br>
I have come across that paper before.<br>
<br>
The extra language safety (which is effectively what this is) should<br>
make it harder to compromise the hypervisor (and this is certainly a<br>
good thing), but nothing at this level will get in the way of an<br>
actually-compromised piece of ring 0 code from doing whatever it wants.<br>
<br>
Suffice it to say that I&#39;ll be delighted if someone managed to<br>
demonstrate me wrong.<br>
<br>
&gt;<br>
&gt;&gt;&gt; Again, this mostly boils down to: under what circumstances, if ever,<br>
&gt;&gt;&gt; does Xen ever &quot;force&quot; access to any part of a guest&#39;s memory?<br>
&gt;&gt;&gt; (Particularly for PV(H). Clearly that must happen for HVM since, by<br>
&gt;&gt;&gt; definition, the guest is unaware there&#39;s a hypervisor controlling its<br>
&gt;&gt;&gt; world and emulating hardware behavior, and thus is in no position to<br>
&gt;&gt;&gt; cooperatively/voluntarily give the hypervisor and dom0 access to its<br>
&gt;&gt;&gt; memory.)<br>
&gt;&gt; There are cases for all guest types where Xen will need to emulate<br>
&gt;&gt; instructions.  Xen will access guest memory in order to perfom<br>
&gt;&gt; architecturally correct actions, which generally starts with reading the<br>
&gt;&gt; instruction under %rip.<br>
&gt;&gt;<br>
&gt;&gt; For PV guests, this is almost entirely restricted to guest-kernel<br>
&gt;&gt; operations which are privileged in nature.  Access to MSRs, writes to<br>
&gt;&gt; pagetables, etc.<br>
&gt;&gt;<br>
&gt;&gt; For HVM and PVH guests, while PVH means &quot;HVM without Qemu&quot;, it doesn&#39;t<br>
&gt;&gt; mean a complete absence of emulation.  The Local APIC is emulated by Xen<br>
&gt;&gt; in most cases, as a bare minimum, but for example, the LMSW instruction<br>
&gt;&gt; on AMD hardware doesn&#39;t have any intercept decoding to help the<br>
&gt;&gt; hypervisor out when a guest uses the instruction.<br>
&gt;&gt;<br>
&gt;&gt; ~Andrew<br>
&gt; I&#39;ve found a number of files in the Xen source tree which seem to be <br>
&gt; related to instruction/x86 platform emulation:<br>
&gt;<br>
&gt; arch/x86/x86_emulate.c<br>
&gt; arch/x86/hvm/emulate.c<br>
&gt; arch/x86/hvm/vmx/realmode.c<br>
&gt; arch/x86/hvm/svm/emulate.c<br>
&gt; arch/x86/pv/emulate.c<br>
&gt; arch/x86/pv/emul-priv-op.c<br>
&gt; arch/x86/x86_emulate/x86_emulate.c<br>
&gt;<br>
&gt; The last of these, in particular, looks especially hairy (it seems to <br>
&gt; support emulation of essentially the entire x86 instruction set through <br>
&gt; a quite impressive edifice of switch statements).<br>
<br>
Lovely, isn&#39;t it.  For Introspection, we need to be able to emulate an<br>
instruction which took a permission fault (including No Execute), was<br>
sent to the analysis engine, and deemed ok to continue.<br>
<br>
Other users of emulation are arch/x86/pv/ro-page-fault.c and<br>
arch/x86/mm/shadow/multi.c<br>
<br>
That said, most of these can be ignored in common cases.  vmx/realmode.c<br>
is only for pre-Westmere Intel CPUs which lack the unrestricted_guest<br>
feature.  svm/emulate.c is only for K8 hardware which lacks the NRIPS<br>
feature.<br>
<br>
&gt; How does all of this fit into the big picture of how Xen virtualizes the <br>
&gt; different types of VMs (PV/HVM/PVH)?<br>
<br>
Consider this &quot;core x86 support&quot;.  All areas which need to emulate an<br>
instruction for whatever reason use this function.  (We previously had<br>
multiple areas of code each doing subsets of x86 instruction<br>
decode/execute, and it was an even bigger mess.)<br>
<br>
&gt; My impression (from reading the original &quot;Xen and the Art of <br>
&gt; Virtualization&quot; SOSP &#39;03 paper that describes the basic architecture) <br>
&gt; had been that PV guests, in particular, used hypercalls in place of all <br>
&gt; privileged operations that the guest kernel would otherwise need to <br>
&gt; execute in ring 0; and that all other (unprivileged) operations could <br>
&gt; execute natively on the CPU without requiring emulation. From what <br>
&gt; you&#39;re saying (and what I&#39;m seeing in the source code), though, it <br>
&gt; sounds like in reality things are a bit fuzzier - that there are some <br>
&gt; operations that Xen traps and emulates instead of explicitly <br>
&gt; paravirtualizing.<br>
<br>
Correct.  Few theories survive contact with the real world.<br>
<br>
Some emulation, such as writeable_pagetable support was added to make it<br>
easier to port guests to being PV.  In this case, writes to pagetables<br>
are trapped and emulated, as if an equivalent hypercall had been made.<br>
Sure, it&#39;s slower than the hypercall, but it&#39;s far easier to get started with.<br>
<br>
Some emulation is a consequence of CPUs changing in the 16 years<br>
since that paper was published, and some emulation is a stopgap for<br>
things which really should be paravirtualised properly.  A whole load of<br>
speculative security work fits into this category: after the initial panic<br>
of simply fixing it safely, we haven&#39;t yet had time to fix it nicely.<br>
<br>
&gt; Likewise, the Xen design described in the SOSP paper discussed guest I/O <br>
&gt; as something that&#39;s fully paravirtualized, taking place not through <br>
&gt; emulation of either memory-mapped or port I/O but rather through ring <br>
&gt; buffers shared between the guest and dom0 via grant tables.<br>
<br>
This is still correct and accurate.  Paravirtual split front/back driver<br>
pairs for network and block are by far the most efficient way of<br>
shuffling data in and out of the VM.<br>
<br>
&gt; I was a bit <br>
&gt; confused to find I/O emulation code under arch/x86/pv (see e.g. <br>
&gt; arch/x86/pv/emul-priv-op.c) that seems to be talking about &quot;ports&quot; and <br>
&gt; the like. Is this another example of things being fuzzier in reality <br>
&gt; than in the &quot;theoretical&quot; PV design?<br>
<br>
This is &quot;general x86 architecture&quot;.  Xen handles all exceptions,<br>
including from PV userspace (possibly being naughty), so at a bare<br>
minimum needs to filter those which should be handed to the guest kernel<br>
to deal with.<br>
<br>
When it comes to x86 Port IO, it is a critical point of safety that Xen<br>
runs with IOPL set to 0, or a guest kernel could modify the real<br>
interrupt flag with a popf instruction.  As a result, all `in` and `out`<br>
instructions trap with a #GP fault.<br>
<br>
Guest userspace could use iopl() to logically gain access to IO<br>
ports, after which `in` and `out` instructions would not fault.  Also,<br>
these instructions don&#39;t fault in kernel context.  In both cases, Xen<br>
has to filter between actually passing the IO request to hardware (if<br>
the guest is suitably configured), or terminating it with defaults, so it<br>
fails in a manner consistent with how x86 behaves.<br>
<br>
For VT-x/SVM guests, filtering of #GP faults happens before the VMExit<br>
so Xen doesn&#39;t have to handle those, but still has to handle all IO<br>
accesses which are fine (permission wise) according to the guest kernel.<br>
<br>
&gt; What devices, if any, are emulated rather than paravirtualized for a PV guest?<br>
<br>
Look for XEN_X86_EMU_* throughout the code.  Those are all the discrete<br>
devices which Xen may emulate, for both kinds of guests.  There is a<br>
very restricted set of valid combinations.<br>
<br>
PV dom0s get an emulated PIT to partially forward to real hardware.<br>
ISTR it is legacy for some laptops where DRAM refresh was still<br>
configured off timer 1.  I doubt it is relevant these days.<br>
<br>
&gt; I know that for PVH, you <br>
&gt; mentioned that the Local APIC is (at a minimum) emulated, along with <br>
&gt; some special instructions; is that true for classic PV as well?<br>
<br>
Classic PV guests don&#39;t get a Local APIC.  They are required to use the<br>
event channel interface instead.<br>
<br>
&gt; For HVM, obviously anything that can&#39;t be virtualized natively by the <br>
&gt; hardware needs to be emulated by Xen/QEMU (since the guest kernel isn&#39;t <br>
&gt; expected to be cooperative to issue PV hypercalls instead); but I would <br>
&gt; expect emulation to be limited to the relatively small subset of the ISA <br>
&gt; that VMX/SVM can&#39;t natively virtualize. Yet I see that x86_emulate.c <br>
&gt; supports emulating just about everything. Under what circumstances does <br>
&gt; Xen actually need to put all that emulation code to use?<br>
<br>
Introspection, as I said earlier, which is potentially any instruction.<br>
<br>
MMIO regions (including to the Local APIC when it is in xAPIC mode, and<br>
hardware acceleration isn&#39;t available) can be the target of any<br>
instruction with a memory operand.  While mov is by far the most common<br>
instruction, other instructions such as and/or/xadd are used in some<br>
cases.  Various of the vector moves (movups/movaps/movnti) are very<br>
common with framebuffers.<br>
<br>
The cfc/cf8 IO ports are used for PCI Config space accesses, which all<br>
kernels try to use, and any kernel with real devices needs to use.  The<br>
alternative is the MMCFG scheme, which is plain MMIO as above.<br>
<br>
&gt; I&#39;m also wondering just how much of this is Xen&#39;s responsibility vs. <br>
&gt; QEMU&#39;s. I understand that when QEMU is used on its own (i.e., not with <br>
&gt; Xen), it uses dynamic binary recompilation to handle the parts of the <br>
&gt; ISA that can&#39;t be virtualized natively in lower-privilege modes. Does <br>
&gt; Xen only use QEMU for emulating off-CPU devices (interrupt controller, <br>
&gt; non-paravirtualized disk/network/graphics/etc.), or does it ever employ <br>
&gt; any of QEMU&#39;s x86 emulation support in addition to Xen&#39;s own emulation code?<br>
<br>
We only use QEMU for off-CPU devices.  For performance reasons, some of<br>
the interrupt emulation (IO-APIC in particular), and timer emulation<br>
(HPET, PIT) is done in Xen, even when it would logically be part of the<br>
motherboard if we were looking for a clear delineation of where Xen<br>
stops and QEMU starts.<br>
<br>
&gt; Is there any particular place in the code where I can go to get a <br>
&gt; comprehensive &quot;list&quot; (or other such summary) of which parts of the ISA <br>
&gt; and off-CPU system are emulated for each respective guest type (PV, HVM, <br>
&gt; and PVH)?<br>
<br>
XEN_X86_EMU_* should cover you here.<br>
<br>
&gt; I realize that the difference between HVM and PVH is more of a <br>
&gt; continuum than a line; what I&#39;m especially interested in is, what&#39;s the <br>
&gt; *bare minimum* of emulation required for a PVH guest that&#39;s using as <br>
&gt; much paravirtualization as possible? (That&#39;s the setting I&#39;m looking to <br>
&gt; target for my research on protecting guests from a compromised <br>
&gt; hypervisor, since I&#39;m trying to minimize the scope of interactions <br>
&gt; between the guest and hypervisor/dom0 that our virtual instruction set <br>
&gt; layer needs to mediate.)<br>
<br>
If you are using PVH guests, on not-ancient hardware, and you can<br>
persuade the guest kernel to use x2APIC mode, and avoid using any<br>
ins/outs instructions, then you just might be able to get away without<br>
any x86_emulate() at all.<br>
<br>
x2APIC mode has an MSR-based interface rather than an MMIO interface,<br>
which means that the VMExit intercept information alone is sufficient to<br>
work out exactly what to do, and ins/outs are the only other instructions<br>
(which come to mind) liable to trap and need emulator support above and<br>
beyond the intercept information.<br>
<br>
That said, whatever you do here is going to have to cope with dom0 and<br>
all the requirements for keeping the system running.  Depending on<br>
exactly how you&#39;re approaching the problem, it might be possible to<br>
declare that out of scope and leave it to one side.<br>
<br>
&gt; On a somewhat related note, I also have a question about a particular <br>
&gt; piece of code in arch/x86/pv/emul-priv-op.c, namely the function <br>
&gt; io_emul_stub_setup(). It looks like it is, at runtime, crafting a <br>
&gt; function that switches to the guest register context, emulates a <br>
&gt; particular I/O operation, then switches back to the host register <br>
&gt; context. This caught our attention while we were implementing Control <br>
&gt; Flow Integrity (CFI) instrumentation for Xen (which is necessary for us <br>
&gt; to enforce the software fault isolation (SFI) instrumentation that <br>
&gt; provides our memory protections). Why does Xen use dynamically-generated <br>
&gt; code here? Is it just for implementation convenience (i.e., to improve <br>
&gt; the generalizability of the code)?<br>
<br>
This mechanism is for dom0 only, and exists because some firmware is<br>
terrible.<br>
<br>
Some AML in ACPI tables uses an IO port to generate an SMI, and has an<br>
API which uses the GPRs.  It turns out things go rather wrong when Xen<br>
intercepts the IO instruction, and replays it to hardware in Xen&#39;s GPR<br>
context, rather than the guest kernel&#39;s.<br>
<br>
This bodge swaps Xen&#39;s and dom0&#39;s GPRs just around the IO instruction,<br>
so the SMI API gets its parameters properly, and the results get fed<br>
back properly into AML.<br>
<br>
There is a related hypercall, SCHEDOP_pin_override, used by dom0,<br>
because sometimes the AML really does need to execute on CPU0, and not<br>
wherever dom0&#39;s vcpu0 happens to be executing.<br>
<br>
&gt; Thanks again for all your time and effort spent answering my questions. <br>
&gt; I know I&#39;m throwing a lot of unusual questions out there - this <br>
&gt; back-and-forth has been very helpful for me in figuring out *what* <br>
&gt; questions I need to be asking in the first place to understand what&#39;s <br>
&gt; feasible to do in the Xen architecture and how I might go about doing <br>
&gt; it. :-)<br>
<br>
Not a problem in the slightest.<br>
<br>
~Andrew<br>
<br>
_______________________________________________<br>
Xen-devel mailing list<br>
<a href="mailto:Xen-devel@lists.xenproject.org" target="_blank">Xen-devel@lists.xenproject.org</a><br>
<a href="https://lists.xenproject.org/mailman/listinfo/xen-devel" rel="noreferrer" target="_blank">https://lists.xenproject.org/mailman/listinfo/xen-devel</a></blockquote></div>


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 13:51     ` Andrew Cooper
  2019-08-22 15:06       ` Rian Quinn
@ 2019-08-22 17:36       ` Tamas K Lengyel
  2019-08-22 22:49         ` Andrew Cooper
  2019-08-22 20:57       ` Rich Persaud
  2 siblings, 1 reply; 13+ messages in thread
From: Tamas K Lengyel @ 2019-08-22 17:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Johnson, Ethan

> > I've found a number of files in the Xen source tree which seem to be
> > related to instruction/x86 platform emulation:
> >
> > arch/x86/x86_emulate.c
> > arch/x86/hvm/emulate.c
> > arch/x86/hvm/vmx/realmode.c
> > arch/x86/hvm/svm/emulate.c
> > arch/x86/pv/emulate.c
> > arch/x86/pv/emul-priv-op.c
> > arch/x86/x86_emulate/x86_emulate.c
> >
> > The last of these, in particular, looks especially hairy (it seems to
> > support emulation of essentially the entire x86 instruction set through
> > a quite impressive edifice of switch statements).
>
> Lovely, isn't it.  For Introspection, we need to be able to emulate an
> instruction which took a permission fault (including No Execute), was
> sent to the analysis engine, and deemed ok to continue.

That's not a requirement for introspection and I find that kind of use
of the emulation very hairy, especially for anything security related.
IMHO it's nothing more than a convenient hack.

Tamas


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 13:51     ` Andrew Cooper
  2019-08-22 15:06       ` Rian Quinn
  2019-08-22 17:36       ` Tamas K Lengyel
@ 2019-08-22 20:57       ` Rich Persaud
  2019-08-22 22:39         ` Andrew Cooper
  2 siblings, 1 reply; 13+ messages in thread
From: Rich Persaud @ 2019-08-22 20:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Johnson, Ethan

> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 
>> On 22/08/2019 03:06, Johnson, Ethan wrote:
>> 
>> For HVM, obviously anything that can't be virtualized natively by the 
>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't 
>> expected to be cooperative to issue PV hypercalls instead); but I would 
>> expect emulation to be limited to the relatively small subset of the ISA 
>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c 
>> supports emulating just about everything. Under what circumstances does 
>> Xen actually need to put all that emulation code to use?
> 
> Introspection, as I said earlier, which is potentially any instruction.

Could introspection-specific emulation code be disabled via KConfig?

Rich


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 20:57       ` Rich Persaud
@ 2019-08-22 22:39         ` Andrew Cooper
  2019-08-22 23:06           ` Tamas K Lengyel
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2019-08-22 22:39 UTC (permalink / raw)
  To: Rich Persaud; +Cc: xen-devel, Johnson, Ethan

On 22/08/2019 21:57, Rich Persaud wrote:
>> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>
>>> On 22/08/2019 03:06, Johnson, Ethan wrote:
>>>
>>> For HVM, obviously anything that can't be virtualized natively by the 
>>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't 
>>> expected to be cooperative to issue PV hypercalls instead); but I would 
>>> expect emulation to be limited to the relatively small subset of the ISA 
>>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c 
>>> supports emulating just about everything. Under what circumstances does 
>>> Xen actually need to put all that emulation code to use?
>> Introspection, as I said earlier, which is potentially any instruction.
> Could introspection-specific emulation code be disabled via KConfig?

Not really.

At the point something has trapped for emulation, we must complete it in
a manner consistent with the x86 architecture, or the guest will crash.

If you don't want emulation from introspection, don't start
introspecting in the first place, at which point guest actions won't
trap in the first place.

~Andrew


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 15:06       ` Rian Quinn
@ 2019-08-22 22:42         ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2019-08-22 22:42 UTC (permalink / raw)
  To: Rian Quinn; +Cc: xen-devel, Johnson, Ethan

On 22/08/2019 16:06, Rian Quinn wrote:
> I can at least confirm that no emulation is needed to execute a Linux
> guest, even with the Xen PVH interface, but I don't think that works
> out of the box today with Xen, something we are currently working on
> and will hopefully have some more data near the end of the year.
> x2APIC helps, but it takes some work to convince Linux to use that
> currently. The trick is to avoid PortIO and, where possible, MMIO
> interfaces.

There is a bit in the CPUID leaves which allegedly should cause Linux to
try and use x2apic, but I can easily see this having bitrotted.

It is certainly something we need to fix.

~Andrew


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 17:36       ` Tamas K Lengyel
@ 2019-08-22 22:49         ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2019-08-22 22:49 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Johnson, Ethan

On 22/08/2019 18:36, Tamas K Lengyel wrote:
>>> I've found a number of files in the Xen source tree which seem to be
>>> related to instruction/x86 platform emulation:
>>>
>>> arch/x86/x86_emulate.c
>>> arch/x86/hvm/emulate.c
>>> arch/x86/hvm/vmx/realmode.c
>>> arch/x86/hvm/svm/emulate.c
>>> arch/x86/pv/emulate.c
>>> arch/x86/pv/emul-priv-op.c
>>> arch/x86/x86_emulate/x86_emulate.c
>>>
>>> The last of these, in particular, looks especially hairy (it seems to
>>> support emulation of essentially the entire x86 instruction set through
>>> a quite impressive edifice of switch statements).
>> Lovely, isn't it.  For Introspection, we need to be able to emulate an
>> instruction which took a permission fault (including No Execute), was
>> sent to the analysis engine, and deemed ok to continue.
> That's not a requirement for introspection and I find that kind of use
> of the emulation very hairy, especially for anything security related.
> IMHO it's nothing more than a convenient hack.

Ok fine.  I was specialising to the form of introspection that I deal
with regularly.  Nothing in the Xen introspection APIs forces you to
take extra emulation.

However, when you're doing a proper product based on it, customers care
about it not being unusable slow.  In our case, that relies on not
falling back to completing instructions using the "pause all other
vcpus, unrestricted permissions, singlestep the vcpu, restrict 
permissions again" approach.

~Andrew


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 22:39         ` Andrew Cooper
@ 2019-08-22 23:06           ` Tamas K Lengyel
  2019-08-23  0:03             ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread
From: Tamas K Lengyel @ 2019-08-22 23:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Rich Persaud, Johnson, Ethan

On Thu, Aug 22, 2019 at 4:40 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 22/08/2019 21:57, Rich Persaud wrote:
> >> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>
> >>> On 22/08/2019 03:06, Johnson, Ethan wrote:
> >>>
> >>> For HVM, obviously anything that can't be virtualized natively by the
> >>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
> >>> expected to be cooperative to issue PV hypercalls instead); but I would
> >>> expect emulation to be limited to the relatively small subset of the ISA
> >>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
> >>> supports emulating just about everything. Under what circumstances does
> >>> Xen actually need to put all that emulation code to use?
> >> Introspection, as I said earlier, which is potentially any instruction.
> > Could introspection-specific emulation code be disabled via KConfig?
>
> Not really.
>
> At the point something has trapped for emulation, we must complete it in
> a manner consistent with the x86 architecture, or the guest will crash.
>
> If you don't want emulation from introspection, don't start
> introspecting in the first place, at which point guest actions won't
> trap in the first place.

That's incorrect, you can absolutely do introspection with vm_events
and NOT emulate anything. You can have altp2m in place with different
memory permissions set in different views and switch between the views
with MTF enabled to allow the system to continue executing. This does
not require emulation of anything. I would be behind a KCONFIG option
that turns off parts of the emulator that are only used by a subset of
introspection usecases. But this should not be an option that turns
off introspection itself, the two things are NOT inter-dependent.

Tamas


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22 23:06           ` Tamas K Lengyel
@ 2019-08-23  0:03             ` Andrew Cooper
  2019-08-23  1:12               ` Tamas K Lengyel
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2019-08-23  0:03 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Rich Persaud, Johnson, Ethan

On 23/08/2019 00:06, Tamas K Lengyel wrote:
> On Thu, Aug 22, 2019 at 4:40 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 22/08/2019 21:57, Rich Persaud wrote:
>>>> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>
>>>>> On 22/08/2019 03:06, Johnson, Ethan wrote:
>>>>>
>>>>> For HVM, obviously anything that can't be virtualized natively by the
>>>>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
>>>>> expected to be cooperative to issue PV hypercalls instead); but I would
>>>>> expect emulation to be limited to the relatively small subset of the ISA
>>>>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
>>>>> supports emulating just about everything. Under what circumstances does
>>>>> Xen actually need to put all that emulation code to use?
>>>> Introspection, as I said earlier, which is potentially any instruction.
>>> Could introspection-specific emulation code be disabled via KConfig?
>> Not really.
>>
>> At the point something has trapped for emulation, we must complete it in
>> a manner consistent with the x86 architecture, or the guest will crash.
>>
>> If you don't want emulation from introspection, don't start
>> introspecting in the first place, at which point guest actions won't
>> trap in the first place.
> That's incorrect, you can absolutely do introspection with vm_events
> and NOT emulate anything. You can have altp2m in place with different
> memory permissions set in different views and switch between the views
> with MTF enabled to allow the system to continue executing. This does
> not require emulation of anything. I would be behind a KCONFIG option
> that turns off parts of the emulator that are only used by a subset of
> introspection usecases. But this should not be an option that turns
> off introspection itself, the two things are NOT inter-dependent.

I fear we are getting slightly off track here, but I'll bite...

Introspection is a young technology, with vast potential.  This is great
- it means there is a lot of novel R&D going into it.  It doesn't mean
that all aspects of it are viable for use by customers today.

I'll have an easier time believing that altp2m is close to being
production ready when I no longer fine security-relevant bugs in it
every time I go looking, and someone has made a coherent attempt to
justify it being security supported.

None of this alters the fact that introspection in general is one key
factor as to why we have a mostly-complete x86_emulate() (even if "x86
emulate" is a slightly poor choice of name.  "decode and replay" would
be a far more apt description of what it does for the majority of
instructions.)

~Andrew


* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-23  0:03             ` Andrew Cooper
@ 2019-08-23  1:12               ` Tamas K Lengyel
  0 siblings, 0 replies; 13+ messages in thread
From: Tamas K Lengyel @ 2019-08-23  1:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Rich Persaud, Johnson, Ethan

On Thu, Aug 22, 2019 at 6:03 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 23/08/2019 00:06, Tamas K Lengyel wrote:
> > On Thu, Aug 22, 2019 at 4:40 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >> On 22/08/2019 21:57, Rich Persaud wrote:
> >>>> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>>>
> >>>>> On 22/08/2019 03:06, Johnson, Ethan wrote:
> >>>>>
> >>>>> For HVM, obviously anything that can't be virtualized natively by the
> >>>>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
> >>>>> expected to be cooperative to issue PV hypercalls instead); but I would
> >>>>> expect emulation to be limited to the relatively small subset of the ISA
> >>>>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
> >>>>> supports emulating just about everything. Under what circumstances does
> >>>>> Xen actually need to put all that emulation code to use?
> >>>> Introspection, as I said earlier, which is potentially any instruction.
> >>> Could introspection-specific emulation code be disabled via KConfig?
> >> Not really.
> >>
> >> At the point something has trapped for emulation, we must complete it in
> >> a manner consistent with the x86 architecture, or the guest will crash.
> >>
> >> If you don't want emulation from introspection, don't start
> >> introspecting in the first place, at which point guest actions won't
> >> trap in the first place.
> > That's incorrect, you can absolutely do introspection with vm_events
> > and NOT emulate anything. You can have altp2m in place with different
> > memory permissions set in different views and switch between the views
> > with MTF enabled to allow the system to continue executing. This does
> > not require emulation of anything. I would be behind a KCONFIG option
> > that turns off parts of the emulator that are only used by a subset of
> > introspection usecases. But this should not be an option that turns
> > off introspection itself, the two things are NOT inter-dependent.
>
> I fear we are getting slightly off track here, but I'll bite...
>
> Introspection is a young technology, with vast potential.  This is great
> - it means there is a lot of novel R&D going into it.  It doesn't mean
> that all aspects of it are viable for use by customers today.
>
> I'll have an easier time believing that altp2m is close to being
production ready when I no longer find security-relevant bugs in it
> every time I go looking, and someone has made a coherent attempt to
> justify it being security supported.

I didn't say altp2m is security supported or that it's "production
ready", only that it's a viable alternative to using the emulator.
With the external-only mode I added I don't see any additional attack
surface as compared to regular use of EPT, but of course I would be
very interested in the security bugs you seem to be finding left and
right. In my experience it's the emulator that's buggy (or simply
incomplete).

>
> None of this alters the fact that introspection in general is one key
> factor as to why we have a mostly-complete x86_emulate() (even if "x86
> emulate" is a slightly poor choice of name.  "decode and replay" would
> be a far more apt description of what it does for the majority of
> instructions.)

Which is fine, but if people find the presence of a full x86 emulator
troubling and want to disable as much of it as possible, saying that
it's needed for introspection is incorrect. It is not needed for
introspection. So I'm not OK with using that justification for keeping
it. Nor would I like to see an option that says that if you are doing
introspection you _must_ have that full emulator in place. You simply
don't.

Tamas


Thread overview: 13+ messages
2019-08-16 19:51 [Xen-devel] More questions about Xen memory layout/usage, access to guest memory Johnson, Ethan
2019-08-17 11:04 ` Andrew Cooper
2019-08-22  2:06   ` Johnson, Ethan
2019-08-22 13:51     ` Andrew Cooper
2019-08-22 15:06       ` Rian Quinn
2019-08-22 22:42         ` Andrew Cooper
2019-08-22 17:36       ` Tamas K Lengyel
2019-08-22 22:49         ` Andrew Cooper
2019-08-22 20:57       ` Rich Persaud
2019-08-22 22:39         ` Andrew Cooper
2019-08-22 23:06           ` Tamas K Lengyel
2019-08-23  0:03             ` Andrew Cooper
2019-08-23  1:12               ` Tamas K Lengyel
