* [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
@ 2019-08-16 19:51 Johnson, Ethan
  2019-08-17 11:04 ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread

From: Johnson, Ethan @ 2019-08-16 19:51 UTC (permalink / raw)
To: xen-devel

Hi all,

I have some follow-up questions about Xen's usage and layout of memory, building on the ones I asked here a few weeks ago (which were quite helpfully answered: see https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01513.html for reference). For context on why I'm asking these questions, I'm using Xen as a research platform for enforcing novel memory protection schemes on hypervisors and guests.

1. Xen itself lives in the memory region from (on x86-64) 0xffff 8000 0000 0000 - 0xffff 8777 ffff ffff, regardless of whether it's in PV mode or HVM/PVH. Clearly, in PV mode a separate set of page tables (i.e. CR3 root pointer) must be used for each guest. Is that also true of the host (non-extended, i.e. CR3 in VMX root mode) page tables when an HVM/PVH guest is running? Or is the dom0 page table left in place, assuming the dom0 is PV, when an HVM/PVH guest is running, since extended paging is now being used to provide the guest's view of memory? Does this change if the dom0 is PVH?

Or, to ask this from another angle: is there ever anything *but* Xen living in the host-virtual address space when an HVM/PVH guest is active? And is the answer to this different depending on whether the HVM/PVH guest is a domU vs. a PVH dom0?

2. Do the mappings in Xen's slice of the host-virtual address space differ at all between the host page tables corresponding to different guests? If the mappings are in fact the same, does Xen therefore share lower-level page table pages between the page tables corresponding to different guests? Is any of this different for PV vs. HVM/PVH?

3. Under what circumstances, and for what purposes, does Xen use its ability to access guest memory through its direct map of host-physical memory? Similarly, to what extent does the dom0 (or other such privileged domain) utilize "foreign memory maps" to reach into another guest's memory? I understand that this is necessary when creating a guest, for live migration, and for QEMU to emulate stuff for HVM guests; but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" access a guest's memory?

(I ask because the research project I'm working on is seeking to protect guests from a compromised hypervisor and dom0, so I need to limit outside access to a guest's memory to explicitly shared pages that the guest will treat as untrusted - not storing any secrets there, vetting input as necessary, etc.)

4. What facilities/processes does Xen provide for PV(H) guests to explicitly/voluntarily share memory pages with Xen and other domains (dom0, etc.)? From what I can gather from the documentation, it sounds like "grant tables" are involved in this - is that how a PV-aware guest is expected to set up shared memory regions for communication with other domains (ring buffers, etc.)? Does a PV(H) guest need to voluntarily establish all external access to its pages, or is there ever a situation where it's the other way around - where Xen itself establishes/defines a region as shared and the guest is responsible for treating it accordingly?

Again, this mostly boils down to: under what circumstances, if ever, does Xen ever "force" access to any part of a guest's memory? (Particularly for PV(H). Clearly that must happen for HVM since, by definition, the guest is unaware there's a hypervisor controlling its world and emulating hardware behavior, and thus is in no position to cooperatively/voluntarily give the hypervisor and dom0 access to its memory.)

Thanks again in advance for any help anyone can offer!

Sincerely,
Ethan Johnson

--
Ethan J. Johnson
Computer Science PhD student, Systems group, University of Rochester
ejohns48@cs.rochester.edu
ethanjohnson@acm.org
PGP public key available from public directory or on request

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-16 19:51 [Xen-devel] More questions about Xen memory layout/usage, access to guest memory Johnson, Ethan
@ 2019-08-17 11:04 ` Andrew Cooper
  2019-08-22  2:06   ` Johnson, Ethan
  0 siblings, 1 reply; 13+ messages in thread

From: Andrew Cooper @ 2019-08-17 11:04 UTC (permalink / raw)
To: Johnson, Ethan, xen-devel

On 16/08/2019 20:51, Johnson, Ethan wrote:
> Hi all,
>
> I have some follow-up questions about Xen's usage and layout of memory, building on the ones I asked here a few weeks ago (which were quite helpfully answered: see https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01513.html for reference). For context on why I'm asking these questions, I'm using Xen as a research platform for enforcing novel memory protection schemes on hypervisors and guests.
>
> 1. Xen itself lives in the memory region from (on x86-64) 0xffff 8000 0000 0000 - 0xffff 8777 ffff ffff, regardless of whether it's in PV mode or HVM/PVH. Clearly, in PV mode a separate set of page tables (i.e. CR3 root pointer) must be used for each guest.

More than that. Each vCPU. PV guests manage their own pagetables, and have a vCR3 which the guest kernel controls, and we must honour.

For 64bit PV guests, each time a new L4 pagetable is created, Xen sets up its own 16 slots appropriately. As a result, Xen itself is able to function appropriately on all pagetable hierarchies the PV guest creates. See init_xen_l4_slots() which does this.

For 32bit PV guests, things are a tad more complicated. Each vCR3 is actually a PAE-quad of pagetable entries. Because Xen is still operating in 64bit mode with 4-level paging, we enforce that guests allocate a full 4k page for the pagetable (rather than the 32 bytes it would normally be).
In Xen, we allocate what is called a monitor table, which is per-vcpu (set up with all the correct details for Xen), and we rewrite slot 0 each time the vCPU changes vCR3.

Not related to this question, but important for future answers: all pagetables are actually at a minimum per-domain, because we have per-domain mappings to simplify certain tasks. Contained within these are various structures, including the hypercall compatibility translation area. This per-domain restriction can in principle be lifted if we alter the way Xen chooses to lay out its memory.

> Is that also true of the host (non-extended, i.e. CR3 in VMX root mode) page tables when an HVM/PVH guest is running?

Historical context is important to answer this question. When the first HVM support came along, there was no EPT or NPT in hardware. Hypervisors were required to virtualise the guest's pagetable structure, which is called Shadow Paging in Xen. The shadow pagetables themselves are organised per-domain so as to form a single coherent guest physical address space, but CPUs operating in non-root mode still needed the real CR3 pointing at the logical vCPU's CR3 which was being virtualised.

In practice, we still allocate a monitor pagetable per vcpu for HVM guests, even with HAP support. I can't think of any restrictions which would prevent us from doing this differently.

> Or is the dom0 page table left in place, assuming the dom0 is PV, when an HVM/PVH guest is running, since extended paging is now being used to provide the guest's view of memory? Does this change if the dom0 is PVH?

Here is some (prototype) documentation prepared since your last round of questions:

https://andrewcoop-xen.readthedocs.io/en/docs-devel/admin-guide/introduction.html

Dom0 is just a VM, like every other domU in the system. There is nothing special about how it is virtualised.
Dom0 defaults to having full permissions, so can successfully issue a whole range of more interesting hypercalls, but you could easily create dom1, set the is_priv boolean in Xen, and give dom1 all the same permissions that dom0 has, if you wished.

> Or, to ask this from another angle: is there ever anything *but* Xen living in the host-virtual address space when an HVM/PVH guest is active?

No, depending on how you classify Xen's directmap in this context.

> And is the answer to this different depending on whether the HVM/PVH guest is a domU vs. a PVH dom0?

Dom0 vs domU has no relevance to the question.

> 2. Do the mappings in Xen's slice of the host-virtual address space differ at all between the host page tables corresponding to different guests?

No (ish). Xen has a mostly flat address space, so most of the mappings are the same. There is a per-domain mapping slot which is common to each vcpu in a domain, but different across domains; a self-linear map for easy modification of the PTEs for the current pagetable hierarchy; and a shadow-linear map for easy modification of the shadow PTEs, for guests where Xen is not in the address space at all.

> If the mappings are in fact the same, does Xen therefore share lower-level page table pages between the page tables corresponding to different guests?

We have many different L4's (the monitor tables, every L4 a PV guest has allocated) which can run Xen. Most parts of Xen's address space converge at L3 (the M2P, the directmap, Xen text/data/bss/fixmap/vmap/heaps/misc), and are common to all contexts. The per-domain mapping converges at L3 and is shared between vcpus of the same guest, but not shared across guests.

One aspect I haven't really covered is XPTI for Meltdown mitigation for PV guests. Here, we have a per-CPU private pagetable which ends up being a merge of most of the guest's L4, but with some pre-constructed CPU-private pagetable hierarchy to hide the majority of data in the Xen region.
> Is any of this different for PV vs. HVM/PVH?

PV guests control their parts of their address space, and can do largely whatever they choose. HVM guests have nothing in the lower canonical half, but do have an extended directmap (which in practice only makes a difference on a >5TB machine).

> 3. Under what circumstances, and for what purposes, does Xen use its ability to access guest memory through its direct map of host-physical memory?

That is a very broad question, and currently has the unfortunate answer of "whenever speculation goes awry in an attacker's favour." There are steps under way to reduce the usage of the directmap so we can run without it, and prevent this kind of leakage.

As for when Xen would normally access memory, the most common answer is for hypercall parameters, which mostly use a virtual-address-based ABI. Also, any time we need to emulate an instruction, we need to read a fair amount of guest state, including reading the instruction under %rip.

> Similarly, to what extent does the dom0 (or other such privileged domain) utilize "foreign memory maps" to reach into another guest's memory? I understand that this is necessary when creating a guest, for live migration, and for QEMU to emulate stuff for HVM guests; but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" access a guest's memory?

I'm not sure what you mean by forcibly. Dom0 has the ability to do so, if it chooses. There is no "force" about it.

Debuggers and/or Introspection are other reasons why dom0 might choose to map guest RAM, but I think you've covered the common cases.

> (I ask because the research project I'm working on is seeking to protect guests from a compromised hypervisor and dom0, so I need to limit outside access to a guest's memory to explicitly shared pages that the guest will treat as untrusted - not storing any secrets there, vetting input as necessary, etc.)
Sorry to come along with roadblocks, but how on earth do you intend to prevent a compromised Xen from accessing guest memory? A compromised Xen can do almost anything it likes, and without recourse. This is ultimately why technologies such as Intel SGX or AMD Secure Encrypted Virtualization are coming along, because only the hardware itself is in a position to isolate an untrusted hypervisor/kernel from guest data.

For dom0, that's perhaps easier. You could reference count the number of foreign mappings into the domain as it is created, and refuse to unpause the guest's vcpus until the foreign map count has dropped to 0.

> 4. What facilities/processes does Xen provide for PV(H) guests to explicitly/voluntarily share memory pages with Xen and other domains (dom0, etc.)? From what I can gather from the documentation, it sounds like "grant tables" are involved in this - is that how a PV-aware guest is expected to set up shared memory regions for communication with other domains (ring buffers, etc.)?

Yes. Grant Tables is Xen's mechanism for the coordinated setup of shared memory between two consenting domains.

> Does a PV(H) guest need to voluntarily establish all external access to its pages, or is there ever a situation where it's the other way around - where Xen itself establishes/defines a region as shared and the guest is responsible for treating it accordingly?

During domain construction, two grants/events are constructed automatically. One is for the xenstore ring, and one is for the console ring. The latter is so it can get debugging out from very early code, while both are, in practice, done like this because the guest has no a priori way to establish the grants/events itself. For all other shared interfaces, the guests are expected to negotiate which grants/events/rings/details to use via Xenstore.

> Again, this mostly boils down to: under what circumstances, if ever, does Xen ever "force" access to any part of a guest's memory?
> (Particularly for PV(H). Clearly that must happen for HVM since, by definition, the guest is unaware there's a hypervisor controlling its world and emulating hardware behavior, and thus is in no position to cooperatively/voluntarily give the hypervisor and dom0 access to its memory.)

There are cases for all guest types where Xen will need to emulate instructions. Xen will access guest memory in order to perform architecturally correct actions, which generally starts with reading the instruction under %rip.

For PV guests, this is almost entirely restricted to guest-kernel operations which are privileged in nature. Access to MSRs, writes to pagetables, etc.

For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't mean a complete absence of emulation. The Local APIC is emulated by Xen in most cases, as a bare minimum, but for example, the LMSW instruction on AMD hardware doesn't have any intercept decoding to help the hypervisor out when a guest uses the instruction.

~Andrew
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-17 11:04 ` Andrew Cooper
@ 2019-08-22  2:06   ` Johnson, Ethan
  2019-08-22 13:51     ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread

From: Johnson, Ethan @ 2019-08-22 2:06 UTC (permalink / raw)
To: Andrew Cooper, xen-devel

On 8/17/2019 7:04 AM, Andrew Cooper wrote:
>> Similarly, to what extent does the dom0 (or other such privileged domain) utilize "foreign memory maps" to reach into another guest's memory? I understand that this is necessary when creating a guest, for live migration, and for QEMU to emulate stuff for HVM guests; but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" access a guest's memory?
> I'm not sure what you mean by forcibly. Dom0 has the ability to do so, if it chooses. There is no "force" about it.
>
> Debuggers and/or Introspection are other reasons why dom0 might choose to map guest RAM, but I think you've covered the common cases.
>
>> (I ask because the research project I'm working on is seeking to protect guests from a compromised hypervisor and dom0, so I need to limit outside access to a guest's memory to explicitly shared pages that the guest will treat as untrusted - not storing any secrets there, vetting input as necessary, etc.)
> Sorry to come along with roadblocks, but how on earth do you intend to prevent a compromised Xen from accessing guest memory? A compromised Xen can do almost anything it likes, and without recourse. This is ultimately why technologies such as Intel SGX or AMD Secure Encrypted Virtualization are coming along, because only the hardware itself is in a position to isolate an untrusted hypervisor/kernel from guest data.
>
> For dom0, that's perhaps easier. You could reference count the number of foreign mappings into the domain as it is created, and refuse to unpause the guest's vcpus until the foreign map count has dropped to 0.
We're using a technique where privileged system software (in this case, the hypervisor) is compiled to a virtual instruction set (based on LLVM IR) that limits its access to hardware features and its view of available memory. These limitations can be enforced in a variety of ways, but the main techniques we're employing are software fault isolation (i.e., memory loads and stores in privileged code are instrumented with checks to ensure they aren't accessing forbidden regions), and mediation of page table updates (by modifying privileged software to make page table updates through a virtual instruction set interface, very similarly to how Xen PV guests make page table updates through hypercalls, which gives Xen the opportunity to ensure mappings aren't made to protected regions).

Our technique is based on that used by the "Virtual Ghost" project (see https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf), which does something similar to protect applications from a compromised operating system kernel without relying on something like a hypervisor operating at a higher privilege level. We're looking to extend that approach to hypervisors to protect guest VMs from a compromised hypervisor.

>> Again, this mostly boils down to: under what circumstances, if ever, does Xen ever "force" access to any part of a guest's memory? (Particularly for PV(H). Clearly that must happen for HVM since, by definition, the guest is unaware there's a hypervisor controlling its world and emulating hardware behavior, and thus is in no position to cooperatively/voluntarily give the hypervisor and dom0 access to its memory.)
> There are cases for all guest types where Xen will need to emulate instructions. Xen will access guest memory in order to perform architecturally correct actions, which generally starts with reading the instruction under %rip.
>
> For PV guests, this is almost entirely restricted to guest-kernel operations which are privileged in nature. Access to MSRs, writes to pagetables, etc.
>
> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't mean a complete absence of emulation. The Local APIC is emulated by Xen in most cases, as a bare minimum, but for example, the LMSW instruction on AMD hardware doesn't have any intercept decoding to help the hypervisor out when a guest uses the instruction.
>
> ~Andrew

I've found a number of files in the Xen source tree which seem to be related to instruction/x86 platform emulation:

arch/x86/x86_emulate.c
arch/x86/hvm/emulate.c
arch/x86/hvm/vmx/realmode.c
arch/x86/hvm/svm/emulate.c
arch/x86/pv/emulate.c
arch/x86/pv/emul-priv-op.c
arch/x86/x86_emulate/x86_emulate.c

The last of these, in particular, looks especially hairy (it seems to support emulation of essentially the entire x86 instruction set through a quite impressive edifice of switch statements). How does all of this fit into the big picture of how Xen virtualizes the different types of VMs (PV/HVM/PVH)?

My impression (from reading the original "Xen and the Art of Virtualization" SOSP '03 paper that describes the basic architecture) had been that PV guests, in particular, used hypercalls in place of all privileged operations that the guest kernel would otherwise need to execute in ring 0, and that all other (unprivileged) operations could execute natively on the CPU without requiring emulation. From what you're saying (and what I'm seeing in the source code), though, it sounds like in reality things are a bit fuzzier - that there are some operations that Xen traps and emulates instead of explicitly paravirtualizing.
Likewise, the Xen design described in the SOSP paper discussed guest I/O as something that's fully paravirtualized, taking place not through emulation of either memory-mapped or port I/O but rather through ring buffers shared between the guest and dom0 via grant tables. I was a bit confused to find I/O emulation code under arch/x86/pv (see e.g. arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and the like. Is this another example of things being fuzzier in reality than in the "theoretical" PV design? What devices, if any, are emulated rather than paravirtualized for a PV guest? I know that for PVH, you mentioned that the Local APIC is (at a minimum) emulated, along with some special instructions; is that true for classic PV as well?

For HVM, obviously anything that can't be virtualized natively by the hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't expected to be cooperative and issue PV hypercalls instead); but I would expect emulation to be limited to the relatively small subset of the ISA that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c supports emulating just about everything. Under what circumstances does Xen actually need to put all that emulation code to use?

I'm also wondering just how much of this is Xen's responsibility vs. QEMU's. I understand that when QEMU is used on its own (i.e., not with Xen), it uses dynamic binary recompilation to handle the parts of the ISA that can't be virtualized natively in lower-privilege modes. Does Xen only use QEMU for emulating off-CPU devices (interrupt controller, non-paravirtualized disk/network/graphics/etc.), or does it ever employ any of QEMU's x86 emulation support in addition to Xen's own emulation code?

Is there any particular place in the code where I can go to get a comprehensive "list" (or other such summary) of which parts of the ISA and off-CPU system are emulated for each respective guest type (PV, HVM, and PVH)?
I realize that the difference between HVM and PVH is more of a continuum than a line; what I'm especially interested in is: what's the *bare minimum* of emulation required for a PVH guest that's using as much paravirtualization as possible? (That's the setting I'm looking to target for my research on protecting guests from a compromised hypervisor, since I'm trying to minimize the scope of interactions between the guest and hypervisor/dom0 that our virtual instruction set layer needs to mediate.)

On a somewhat related note, I also have a question about a particular piece of code in arch/x86/pv/emul-priv-op.c, namely the function io_emul_stub_setup(). It looks like it is, at runtime, crafting a function that switches to the guest register context, emulates a particular I/O operation, then switches back to the host register context. This caught our attention while we were implementing Control Flow Integrity (CFI) instrumentation for Xen (which is necessary for us to enforce the software fault isolation (SFI) instrumentation that provides our memory protections). Why does Xen use dynamically-generated code here? Is it just for implementation convenience (i.e., to improve the generalizability of the code)?

Thanks again for all your time and effort spent answering my questions. I know I'm throwing a lot of unusual questions out there - this back-and-forth has been very helpful for me in figuring out *what* questions I need to be asking in the first place to understand what's feasible to do in the Xen architecture and how I might go about doing it. :-)

Thanks,
Ethan Johnson

--
Ethan J. Johnson
Computer Science PhD student, Systems group, University of Rochester
ejohns48@cs.rochester.edu
ethanjohnson@acm.org
PGP public key available from public directory or on request
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
  2019-08-22  2:06   ` Johnson, Ethan
@ 2019-08-22 13:51     ` Andrew Cooper
  2019-08-22 15:06       ` Rian Quinn
  ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread

From: Andrew Cooper @ 2019-08-22 13:51 UTC (permalink / raw)
To: Johnson, Ethan, xen-devel

On 22/08/2019 03:06, Johnson, Ethan wrote:
> On 8/17/2019 7:04 AM, Andrew Cooper wrote:
>>> Similarly, to what extent does the dom0 (or other such privileged domain) utilize "foreign memory maps" to reach into another guest's memory? I understand that this is necessary when creating a guest, for live migration, and for QEMU to emulate stuff for HVM guests; but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" access a guest's memory?
>> I'm not sure what you mean by forcibly. Dom0 has the ability to do so, if it chooses. There is no "force" about it.
>>
>> Debuggers and/or Introspection are other reasons why dom0 might choose to map guest RAM, but I think you've covered the common cases.
>>
>>> (I ask because the research project I'm working on is seeking to protect guests from a compromised hypervisor and dom0, so I need to limit outside access to a guest's memory to explicitly shared pages that the guest will treat as untrusted - not storing any secrets there, vetting input as necessary, etc.)
>> Sorry to come along with roadblocks, but how on earth do you intend to prevent a compromised Xen from accessing guest memory? A compromised Xen can do almost anything it likes, and without recourse. This is ultimately why technologies such as Intel SGX or AMD Secure Encrypted Virtualization are coming along, because only the hardware itself is in a position to isolate an untrusted hypervisor/kernel from guest data.
>>
>> For dom0, that's perhaps easier.
>> You could reference count the number of foreign mappings into the domain as it is created, and refuse to unpause the guest's vcpus until the foreign map count has dropped to 0.
> We're using a technique where privileged system software (in this case, the hypervisor) is compiled to a virtual instruction set (based on LLVM IR) that limits its access to hardware features and its view of available memory. These limitations can be enforced in a variety of ways, but the main techniques we're employing are software fault isolation (i.e., memory loads and stores in privileged code are instrumented with checks to ensure they aren't accessing forbidden regions), and mediation of page table updates (by modifying privileged software to make page table updates through a virtual instruction set interface, very similarly to how Xen PV guests make page table updates through hypercalls, which gives Xen the opportunity to ensure mappings aren't made to protected regions).
>
> Our technique is based on that used by the "Virtual Ghost" project (see https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf), which does something similar to protect applications from a compromised operating system kernel without relying on something like a hypervisor operating at a higher privilege level. We're looking to extend that approach to hypervisors to protect guest VMs from a compromised hypervisor.

I have come across that paper before. The extra language safety (which is effectively what this is) should make it harder to compromise the hypervisor (and this is certainly a good thing), but nothing at this level will stop an actually-compromised piece of ring 0 code from doing whatever it wants.

Suffice it to say that I'll be delighted if someone manages to demonstrate me wrong.
>
>>> Again, this mostly boils down to: under what circumstances, if ever, does Xen ever "force" access to any part of a guest's memory? (Particularly for PV(H). Clearly that must happen for HVM since, by definition, the guest is unaware there's a hypervisor controlling its world and emulating hardware behavior, and thus is in no position to cooperatively/voluntarily give the hypervisor and dom0 access to its memory.)
>> There are cases for all guest types where Xen will need to emulate instructions. Xen will access guest memory in order to perform architecturally correct actions, which generally starts with reading the instruction under %rip.
>>
>> For PV guests, this is almost entirely restricted to guest-kernel operations which are privileged in nature. Access to MSRs, writes to pagetables, etc.
>>
>> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't mean a complete absence of emulation. The Local APIC is emulated by Xen in most cases, as a bare minimum, but for example, the LMSW instruction on AMD hardware doesn't have any intercept decoding to help the hypervisor out when a guest uses the instruction.
>>
>> ~Andrew
> I've found a number of files in the Xen source tree which seem to be related to instruction/x86 platform emulation:
>
> arch/x86/x86_emulate.c
> arch/x86/hvm/emulate.c
> arch/x86/hvm/vmx/realmode.c
> arch/x86/hvm/svm/emulate.c
> arch/x86/pv/emulate.c
> arch/x86/pv/emul-priv-op.c
> arch/x86/x86_emulate/x86_emulate.c
>
> The last of these, in particular, looks especially hairy (it seems to support emulation of essentially the entire x86 instruction set through a quite impressive edifice of switch statements).

Lovely, isn't it.

For Introspection, we need to be able to emulate an instruction which took a permission fault (including No Execute), was sent to the analysis engine, and deemed ok to continue.
Other users of emulation are arch/x86/pv/ro-page-fault.c and arch/x86/mm/shadow/multi.c

That said, most of these can be ignored in common cases. vmx/realmode.c is only for pre-Westmere Intel CPUs which lack the unrestricted_guest feature. svm/emulate.c is only for K8 hardware which lacks the NRIPS feature.

> How does all of this fit into the big picture of how Xen virtualizes the different types of VMs (PV/HVM/PVH)?

Consider this "core x86 support". All areas which need to emulate an instruction for whatever reason use this function. (We previously had multiple areas of code each doing subsets of x86 instruction decode/execute, and it was an even bigger mess.)

> My impression (from reading the original "Xen and the Art of Virtualization" SOSP '03 paper that describes the basic architecture) had been that PV guests, in particular, used hypercalls in place of all privileged operations that the guest kernel would otherwise need to execute in ring 0; and that all other (unprivileged) operations could execute natively on the CPU without requiring emulation. From what you're saying (and what I'm seeing in the source code), though, it sounds like in reality things are a bit fuzzier - that there are some operations that Xen traps and emulates instead of explicitly paravirtualizing.

Correct. Few theories survive contact with the real world.

Some emulation, such as writeable_pagetable support, was added to make it easier to port guests to being PV. In this case, writes to pagetables are trapped and emulated, as if an equivalent hypercall had been made. Sure, it's slower than the hypercall, but it's far easier to get started with.

Some emulation is a consequence of CPUs changing in the 16 years since that paper was published, and some emulation is a stopgap for things which really should be paravirtualised properly.
A whole load of speculative security fits into this category, as we
haven't had time to fix it nicely, following the panic of simply fixing
it safely.

> Likewise, the Xen design described in the SOSP paper discussed guest I/O
> as something that's fully paravirtualized, taking place not through
> emulation of either memory-mapped or port I/O but rather through ring
> buffers shared between the guest and dom0 via grant tables.

This is still correct and accurate. Paravirtual split front/back driver
pairs for network and block are by far the most efficient way of
shuffling data in and out of the VM.

> I was a bit
> confused to find I/O emulation code under arch/x86/pv (see e.g.
> arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and
> the like. Is this another example of things being fuzzier in reality
> than in the "theoretical" PV design?

This is "general x86 architecture". Xen handles all exceptions,
including from PV userspace (possibly being naughty), so at a bare
minimum it needs to filter those which should be handed to the guest
kernel to deal with.

When it comes to x86 Port IO, it is a critical point of safety that Xen
runs with IOPL set to 0, or a guest kernel could modify the real
interrupt flag with a popf instruction. As a result, all `in` and `out`
instructions trap with a #GP fault.

Guest userspace could use iopl() to logically gain access to IO
ports, after which `in` and `out` instructions would not fault. Also,
these instructions don't fault in kernel context. In both cases, Xen
has to filter between actually passing the IO request to hardware (if
the guest is suitably configured), or terminating it with defaults, so
it fails in a manner consistent with how x86 behaves.

For VT-x/SVM guests, filtering of #GP faults happens before the VMExit,
so Xen doesn't have to handle those, but it still has to handle all IO
accesses which are fine (permission wise) according to the guest kernel.
> What devices, if any, are emulated rather than paravirtualized for a PV
> guest?

Look for XEN_X86_EMU_* throughout the code. Those are all the discrete
devices which Xen may emulate, for both kinds of guests. There is a
very restricted set of valid combinations.

PV dom0's get an emulated PIT to partially forward to real hardware.
ISTR it is legacy for some laptops where DRAM refresh was still
configured off timer 1. I doubt it is relevant these days.

> I know that for PVH, you
> mentioned that the Local APIC is (at a minimum) emulated, along with
> some special instructions; is that true for classic PV as well?

Classic PV guests don't get a Local APIC. They are required to use the
event channel interface instead.

> For HVM, obviously anything that can't be virtualized natively by the
> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
> expected to be cooperative to issue PV hypercalls instead); but I would
> expect emulation to be limited to the relatively small subset of the ISA
> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
> supports emulating just about everything. Under what circumstances does
> Xen actually need to put all that emulation code to use?

Introspection, as I said earlier, which is potentially any instruction.

MMIO regions (including to the Local APIC when it is in xAPIC mode, and
hardware acceleration isn't available) can be the target of any
instruction with a memory operand. While mov is by far the most common
instruction, other instructions such as and/or/xadd are used in some
cases. Various of the vector moves (movups/movaps/movnti) are very
common with framebuffers.

The cfc/cf8 IO ports are used for PCI Config space accesses, which all
kernels try to use, and any kernel with real devices needs to use. The
alternative is the MMCFG scheme, which is plain MMIO as above.

> I'm also wondering just how much of this is Xen's responsibility vs.
> QEMU's.
> I understand that when QEMU is used on its own (i.e., not with
> Xen), it uses dynamic binary recompilation to handle the parts of the
> ISA that can't be virtualized natively in lower-privilege modes. Does
> Xen only use QEMU for emulating off-CPU devices (interrupt controller,
> non-paravirtualized disk/network/graphics/etc.), or does it ever employ
> any of QEMU's x86 emulation support in addition to Xen's own emulation
> code?

We only use QEMU for off-CPU devices. For performance reasons, some of
the interrupt emulation (IO-APIC in particular), and timer emulation
(HPET, PIT) is done in Xen, even when it would logically be part of the
motherboard if we were looking for a clear delineation of where Xen
stops and QEMU starts.

> Is there any particular place in the code where I can go to get a
> comprehensive "list" (or other such summary) of which parts of the ISA
> and off-CPU system are emulated for each respective guest type (PV, HVM,
> and PVH)?

XEN_X86_EMU_* should cover you here.

> I realize that the difference between HVM and PVH is more of a
> continuum than a line; what I'm especially interested in is, what's the
> *bare minimum* of emulation required for a PVH guest that's using as
> much paravirtualization as possible? (That's the setting I'm looking to
> target for my research on protecting guests from a compromised
> hypervisor, since I'm trying to minimize the scope of interactions
> between the guest and hypervisor/dom0 that our virtual instruction set
> layer needs to mediate.)

If you are using PVH guests, on not-ancient hardware, and you can
persuade the guest kernel to use x2APIC mode, and without using any
ins/outs instructions, then you just might be able to get away without
any x86_emulate() at all.
x2APIC mode has an MSR-based interface rather than an MMIO interface,
which means that the VMExit intercept information alone is sufficient to
work out exactly what to do, and ins/outs are the only other instructions
(which come to mind) liable to trap and need emulator support above and
beyond the intercept information.

That said, whatever you do here is going to have to cope with dom0 and
all the requirements for keeping the system running. Depending on
exactly how you're approaching the problem, it might be possible to
declare that out of scope and leave it to one side.

> On a somewhat related note, I also have a question about a particular
> piece of code in arch/x86/pv/emul-priv-op.c, namely the function
> io_emul_stub_setup(). It looks like it is, at runtime, crafting a
> function that switches to the guest register context, emulates a
> particular I/O operation, then switches back to the host register
> context. This caught our attention while we were implementing Control
> Flow Integrity (CFI) instrumentation for Xen (which is necessary for us
> to enforce the software fault isolation (SFI) instrumentation that
> provides our memory protections). Why does Xen use dynamically-generated
> code here? Is it just for implementation convenience (i.e., to improve
> the generalizability of the code)?

This mechanism is for dom0 only, and exists because some firmware is
terrible.

Some AML in ACPI tables uses an IO port to generate an SMI, and has an
API which uses the GPRs. It turns out things go rather wrong when Xen
intercepts the IO instruction, and replays it to hardware in Xen's GPR
context, rather than the guest kernel's.

This bodge swaps Xen's and dom0's GPRs just around the IO instruction,
so the SMI API gets its parameters properly, and the results get fed
back properly into AML.

There is a related hypercall, SCHEDOP_pin_override, used by dom0,
because sometimes the AML really does need to execute on CPU0, and not
wherever dom0's vcpu0 happens to be executing.
> Thanks again for all your time and effort spent answering my questions.
> I know I'm throwing a lot of unusual questions out there - this
> back-and-forth has been very helpful for me in figuring out *what*
> questions I need to be asking in the first place to understand what's
> feasible to do in the Xen architecture and how I might go about doing
> it. :-)

Not a problem in the slightest.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory 2019-08-22 13:51 ` Andrew Cooper @ 2019-08-22 15:06 ` Rian Quinn 2019-08-22 22:42 ` Andrew Cooper 2019-08-22 17:36 ` Tamas K Lengyel 2019-08-22 20:57 ` Rich Persaud 2 siblings, 1 reply; 13+ messages in thread From: Rian Quinn @ 2019-08-22 15:06 UTC (permalink / raw) To: Andrew Cooper; +Cc: xen-devel, Johnson, Ethan

I can at least confirm that no emulation is needed to execute a Linux
guest, even with the Xen PVH interface, but I don't think that works out
of the box today with Xen; that is something we are currently working
on, and we will hopefully have some more data near the end of the year.
x2APIC helps, but it takes some work to convince Linux to use that
currently. The trick is to avoid PortIO and, where possible, MMIO
interfaces.

Rian

On Thu, Aug 22, 2019 at 1:53 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 22/08/2019 03:06, Johnson, Ethan wrote:
> > On 8/17/2019 7:04 AM, Andrew Cooper wrote:
> >>> Similarly, to what extent does the dom0 (or other such
> >>> privileged domain) utilize "foreign memory maps" to reach into another
> >>> guest's memory? I understand that this is necessary when creating a
> >>> guest, for live migration, and for QEMU to emulate stuff for HVM guests;
> >>> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly"
> >>> access a guest's memory?
> >> I'm not sure what you mean by forcibly. Dom0 has the ability to do so,
> >> if it chooses. There is no "force" about it.
> >>
> >> Debuggers and/or Introspection are other reasons why dom0 might choose to
> >> map guest RAM, but I think you've covered the common cases.
> >>
> >>> (I ask because the research project I'm working on is seeking to protect
> >>> guests from a compromised hypervisor and dom0, so I need to limit
> >>> outside access to a guest's memory to explicitly shared pages that the
> >>> guest will treat as untrusted - not storing any secrets there, vetting
> >>> input as necessary, etc.)
> >> Sorry to come along with roadblocks, but how on earth do you intend to
> >> prevent a compromised Xen from accessing guest memory? A compromised
> >> Xen can do almost anything it likes, and without recourse. This is
> >> ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
> >> are coming along, because only the hardware itself is in a position to
> >> isolate an untrusted hypervisor/kernel from guest data.
> >>
> >> For dom0, that's perhaps easier. You could reference count the number
> >> of foreign mappings into the domain as it is created, and refuse to
> >> unpause the guest's vcpus until the foreign map count has dropped to 0.
> > We're using a technique where privileged system software (in this case,
> > the hypervisor) is compiled to a virtual instruction set (based on LLVM
> > IR) that limits its access to hardware features and its view of
> > available memory. These limitations are/can be enforced in a variety of
> > ways, but the main techniques we're employing are software fault
> > isolation (i.e., memory loads and stores in privileged code are
> > instrumented with checks to ensure they aren't accessing forbidden
> > regions) and mediation of page table updates (by modifying privileged
> > software to make page table updates through a virtual instruction set
> > interface, very similarly to how Xen PV guests make page table updates
> > through hypercalls, which gives Xen the opportunity to ensure mappings
> > aren't made to protected regions).
> >
> > Our technique is based on that used by the "Virtual Ghost" project (see
> > https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF
> > link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf),
> > which does something similar to protect applications from a compromised
> > operating system kernel without relying on something like a hypervisor
> > operating at a higher privileged level. We're looking to extend that
> > approach to hypervisors to protect guest VMs from a compromised
> > hypervisor.
>
> I have come across that paper before.
>
> The extra language safety (which is effectively what this is) should
> make it harder to compromise the hypervisor (and this is certainly a
> good thing), but nothing at this level will get in the way of an
> actually-compromised piece of ring 0 code from doing whatever it wants.
>
> Suffice it to say that I'll be delighted if someone managed to
> demonstrate me wrong.
>
> [snip]
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory 2019-08-22 15:06 ` Rian Quinn @ 2019-08-22 22:42 ` Andrew Cooper 0 siblings, 0 replies; 13+ messages in thread From: Andrew Cooper @ 2019-08-22 22:42 UTC (permalink / raw) To: Rian Quinn; +Cc: xen-devel, Johnson, Ethan

On 22/08/2019 16:06, Rian Quinn wrote:
> I can at least confirm that no emulation is needed to execute a Linux
> guest, even with the Xen PVH interface, but I don't think that works
> out of the box today with Xen, something we are currently working on
> and will hopefully have some more data near the end of the year.
> x2APIC helps, but it takes some work to convince Linux to use that
> currently. The trick is to avoid PortIO and, where possible, MMIO
> interfaces.

There is a bit in the CPUID leaves which allegedly should cause Linux
to try and use x2apic, but I can easily see this having bitrotted. It
is certainly something we need to fix.

~Andrew
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory 2019-08-22 13:51 ` Andrew Cooper 2019-08-22 15:06 ` Rian Quinn @ 2019-08-22 17:36 ` Tamas K Lengyel 2019-08-22 22:49 ` Andrew Cooper 2019-08-22 20:57 ` Rich Persaud 2 siblings, 1 reply; 13+ messages in thread From: Tamas K Lengyel @ 2019-08-22 17:36 UTC (permalink / raw) To: Andrew Cooper; +Cc: xen-devel, Johnson, Ethan

> > I've found a number of files in the Xen source tree which seem to be
> > related to instruction/x86 platform emulation:
> >
> > arch/x86/x86_emulate.c
> > arch/x86/hvm/emulate.c
> > arch/x86/hvm/vmx/realmode.c
> > arch/x86/hvm/svm/emulate.c
> > arch/x86/pv/emulate.c
> > arch/x86/pv/emul-priv-op.c
> > arch/x86/x86_emulate/x86_emulate.c
> >
> > The last of these, in particular, looks especially hairy (it seems to
> > support emulation of essentially the entire x86 instruction set through
> > a quite impressive edifice of switch statements).
>
> Lovely, isn't it. For Introspection, we need to be able to emulate an
> instruction which took a permission fault (including No Execute), was
> sent to the analysis engine, and deemed ok to continue.

That's not a requirement for introspection, and I find that kind of use
of the emulation very hairy, especially for anything security related.
IMHO it's nothing more than a convenient hack.

Tamas
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory 2019-08-22 17:36 ` Tamas K Lengyel @ 2019-08-22 22:49 ` Andrew Cooper 0 siblings, 0 replies; 13+ messages in thread From: Andrew Cooper @ 2019-08-22 22:49 UTC (permalink / raw) To: Tamas K Lengyel; +Cc: xen-devel, Johnson, Ethan

On 22/08/2019 18:36, Tamas K Lengyel wrote:
>>> I've found a number of files in the Xen source tree which seem to be
>>> related to instruction/x86 platform emulation:
>>>
>>> arch/x86/x86_emulate.c
>>> arch/x86/hvm/emulate.c
>>> arch/x86/hvm/vmx/realmode.c
>>> arch/x86/hvm/svm/emulate.c
>>> arch/x86/pv/emulate.c
>>> arch/x86/pv/emul-priv-op.c
>>> arch/x86/x86_emulate/x86_emulate.c
>>>
>>> The last of these, in particular, looks especially hairy (it seems to
>>> support emulation of essentially the entire x86 instruction set through
>>> a quite impressive edifice of switch statements).
>> Lovely, isn't it. For Introspection, we need to be able to emulate an
>> instruction which took a permission fault (including No Execute), was
>> sent to the analysis engine, and deemed ok to continue.
> That's not a requirement for introspection and I find that kind of use
> of the emulation very hairy, especially for anything security related.
> IMHO it's nothing more than a convenient hack.

Ok fine. I was specialising to the form of introspection that I deal
with regularly.

Nothing in the Xen introspection APIs forces you to take extra
emulation. However, when you're doing a proper product based on it,
customers care about it not being unusably slow. In our case, that
relies on not falling back to completing instructions using the "pause
all other vcpus, unrestricted permissions, singlestep the vcpu,
restrict permissions again" approach.

~Andrew
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory 2019-08-22 13:51 ` Andrew Cooper 2019-08-22 15:06 ` Rian Quinn 2019-08-22 17:36 ` Tamas K Lengyel @ 2019-08-22 20:57 ` Rich Persaud 2019-08-22 22:39 ` Andrew Cooper 2 siblings, 1 reply; 13+ messages in thread From: Rich Persaud @ 2019-08-22 20:57 UTC (permalink / raw) To: Andrew Cooper; +Cc: xen-devel, Johnson, Ethan

> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
>> On 22/08/2019 03:06, Johnson, Ethan wrote:
>>
>> For HVM, obviously anything that can't be virtualized natively by the
>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
>> expected to be cooperative to issue PV hypercalls instead); but I would
>> expect emulation to be limited to the relatively small subset of the ISA
>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
>> supports emulating just about everything. Under what circumstances does
>> Xen actually need to put all that emulation code to use?
>
> Introspection, as I said earlier, which is potentially any instruction.

Could introspection-specific emulation code be disabled via KConfig?

Rich
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory 2019-08-22 20:57 ` Rich Persaud @ 2019-08-22 22:39 ` Andrew Cooper 2019-08-22 23:06 ` Tamas K Lengyel 0 siblings, 1 reply; 13+ messages in thread From: Andrew Cooper @ 2019-08-22 22:39 UTC (permalink / raw) To: Rich Persaud; +Cc: xen-devel, Johnson, Ethan

On 22/08/2019 21:57, Rich Persaud wrote:
>> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>
>>> On 22/08/2019 03:06, Johnson, Ethan wrote:
>>>
>>> For HVM, obviously anything that can't be virtualized natively by the
>>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
>>> expected to be cooperative to issue PV hypercalls instead); but I would
>>> expect emulation to be limited to the relatively small subset of the ISA
>>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
>>> supports emulating just about everything. Under what circumstances does
>>> Xen actually need to put all that emulation code to use?
>> Introspection, as I said earlier, which is potentially any instruction.
> Could introspection-specific emulation code be disabled via KConfig?

Not really.

At the point something has trapped for emulation, we must complete it in
a manner consistent with the x86 architecture, or the guest will crash.

If you don't want emulation from introspection, don't start
introspecting in the first place, at which point guest actions won't
trap in the first place.

~Andrew
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
From: Tamas K Lengyel @ 2019-08-22 23:06 UTC (permalink / raw)
To: Andrew Cooper; +Cc: xen-devel, Rich Persaud, Johnson, Ethan

On Thu, Aug 22, 2019 at 4:40 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 22/08/2019 21:57, Rich Persaud wrote:
> >> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>
> >>> On 22/08/2019 03:06, Johnson, Ethan wrote:
> >>>
> >>> For HVM, obviously anything that can't be virtualized natively by the
> >>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
> >>> expected to be cooperative to issue PV hypercalls instead); but I would
> >>> expect emulation to be limited to the relatively small subset of the ISA
> >>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
> >>> supports emulating just about everything. Under what circumstances does
> >>> Xen actually need to put all that emulation code to use?
> >> Introspection, as I said earlier, which is potentially any instruction.
> > Could introspection-specific emulation code be disabled via KConfig?
>
> Not really.
>
> At the point something has trapped for emulation, we must complete it in
> a manner consistent with the x86 architecture, or the guest will crash.
>
> If you don't want emulation from introspection, don't start
> introspecting in the first place, at which point guest actions won't
> trap in the first place.

That's incorrect, you can absolutely do introspection with vm_events
and NOT emulate anything. You can have altp2m in place with different
memory permissions set in different views and switch between the views
with MTF enabled to allow the system to continue executing. This does
not require emulation of anything.

I would be behind a KCONFIG option that turns off parts of the emulator
that are only used by a subset of introspection usecases. But this
should not be an option that turns off introspection itself, the two
things are NOT inter-dependent.

Tamas
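For readers unfamiliar with the mechanism Tamas describes, the emulation-free flow can be sketched roughly as follows. This is pseudocode only: the `xc_altp2m_*` names follow Xen's libxc toolstack interface, but the exact signatures, the event-handler names, and all error handling here are assumptions, not a working implementation.

```
/* Pseudocode: emulation-free introspection via altp2m views + MTF.
 * xc_altp2m_* names follow Xen's libxc interface; exact signatures
 * are assumed, and error handling is omitted. */

/* Setup: a restricted view in which the monitored page is not executable. */
xc_altp2m_set_domain_state(xch, domid, true);
xc_altp2m_create_view(xch, domid, XENMEM_access_rwx, &view);
xc_altp2m_set_mem_access(xch, domid, view, monitored_gfn, XENMEM_access_rw);
xc_altp2m_switch_to_view(xch, domid, view);

/* vm_event loop: on a violation, record it, then step past the faulting
 * instruction on real hardware instead of emulating it. */
on_mem_access_event(event):
    log_access(event);
    switch_vcpu_to_unrestricted_view(event->vcpu);  /* view 0: full rwx  */
    enable_monitor_trap_flag(event->vcpu);          /* trap after 1 insn */

on_singlestep_event(event):
    switch_vcpu_to_restricted_view(event->vcpu);    /* re-arm the trap   */
    disable_monitor_trap_flag(event->vcpu);
```

The design point is that the guest's own instruction executes natively between the two view switches, so no decode-and-replay engine is ever consulted.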
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
From: Andrew Cooper @ 2019-08-23 0:03 UTC (permalink / raw)
To: Tamas K Lengyel; +Cc: xen-devel, Rich Persaud, Johnson, Ethan

On 23/08/2019 00:06, Tamas K Lengyel wrote:
> On Thu, Aug 22, 2019 at 4:40 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 22/08/2019 21:57, Rich Persaud wrote:
>>>> On Aug 22, 2019, at 09:51, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>
>>>>> On 22/08/2019 03:06, Johnson, Ethan wrote:
>>>>>
>>>>> For HVM, obviously anything that can't be virtualized natively by the
>>>>> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
>>>>> expected to be cooperative to issue PV hypercalls instead); but I would
>>>>> expect emulation to be limited to the relatively small subset of the ISA
>>>>> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
>>>>> supports emulating just about everything. Under what circumstances does
>>>>> Xen actually need to put all that emulation code to use?
>>>> Introspection, as I said earlier, which is potentially any instruction.
>>> Could introspection-specific emulation code be disabled via KConfig?
>> Not really.
>>
>> At the point something has trapped for emulation, we must complete it in
>> a manner consistent with the x86 architecture, or the guest will crash.
>>
>> If you don't want emulation from introspection, don't start
>> introspecting in the first place, at which point guest actions won't
>> trap in the first place.
> That's incorrect, you can absolutely do introspection with vm_events
> and NOT emulate anything. You can have altp2m in place with different
> memory permissions set in different views and switch between the views
> with MTF enabled to allow the system to continue executing. This does
> not require emulation of anything. I would be behind a KCONFIG option
> that turns off parts of the emulator that are only used by a subset of
> introspection usecases. But this should not be an option that turns
> off introspection itself, the two things are NOT inter-dependent.

I fear we are getting slightly off track here, but I'll bite...

Introspection is a young technology, with vast potential. This is great
- it means there is a lot of novel R&D going into it. It doesn't mean
that all aspects of it are viable for use by customers today.

I'll have an easier time believing that altp2m is close to being
production ready when I no longer find security-relevant bugs in it
every time I go looking, and someone has made a coherent attempt to
justify it being security supported.

None of this alters the fact that introspection in general is one key
factor as to why we have a mostly-complete x86_emulate() (even if "x86
emulate" is a slightly poor choice of name. "decode and replay" would
be a far more apt description of what it does for the majority of
instructions.)

~Andrew
* Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
From: Tamas K Lengyel @ 2019-08-23 1:12 UTC (permalink / raw)
To: Andrew Cooper; +Cc: xen-devel, Rich Persaud, Johnson, Ethan

On Thu, Aug 22, 2019 at 6:03 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 23/08/2019 00:06, Tamas K Lengyel wrote:
> > That's incorrect, you can absolutely do introspection with vm_events
> > and NOT emulate anything. You can have altp2m in place with different
> > memory permissions set in different views and switch between the views
> > with MTF enabled to allow the system to continue executing. This does
> > not require emulation of anything. I would be behind a KCONFIG option
> > that turns off parts of the emulator that are only used by a subset of
> > introspection usecases. But this should not be an option that turns
> > off introspection itself, the two things are NOT inter-dependent.
>
> I fear we are getting slightly off track here, but I'll bite...
>
> Introspection is a young technology, with vast potential. This is great
> - it means there is a lot of novel R&D going into it. It doesn't mean
> that all aspects of it are viable for use by customers today.
>
> I'll have an easier time believing that altp2m is close to being
> production ready when I no longer find security-relevant bugs in it
> every time I go looking, and someone has made a coherent attempt to
> justify it being security supported.

I didn't say altp2m is security supported or that it's "production
ready", only that it's a viable alternative to using the emulator. With
the external-only mode I added I don't see any additional attack surface
as compared to regular use of EPT, but of course I would be very
interested in the security bugs you seem to be finding left and right.
In my experience it's the emulator that's buggy (or simply incomplete).

> None of this alters the fact that introspection in general is one key
> factor as to why we have a mostly-complete x86_emulate() (even if "x86
> emulate" is a slightly poor choice of name. "decode and replay" would
> be a far more apt description of what it does for the majority of
> instructions.)

Which is fine, but if people find the presence of a full x86 emulator
troubling and want to disable as much of it as possible, saying that
it's needed for introspection is incorrect. It is not needed for
introspection. So I'm not OK with using that justification for keeping
it. Nor would I like to see an option that says that if you are doing
introspection you _must_ have that full emulator in place. You simply
don't.

Tamas
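A Kconfig knob of the kind Tamas describes might look like the fragment below. This is purely illustrative: `X86_EMUL_FULL` is an invented symbol, and no such option existed in Xen's Kconfig at the time of this thread.

```kconfig
# Hypothetical fragment; X86_EMUL_FULL is an invented symbol and does
# not exist in Xen's Kconfig.
config X86_EMUL_FULL
	bool "Full x86 instruction emulator"
	default y
	help
	  Build the complete x86_emulate() decode-and-replay engine.

	  Saying N keeps only the instruction subset Xen itself must
	  emulate (e.g. around MMIO and unvirtualizable state), at the
	  cost of breaking emulation-based introspection workflows.
```

Whether such a split is maintainable is exactly the point of contention in this thread: the gate would have to separate "instructions Xen must always be able to complete" from "instructions only introspection replays", and that boundary is not cleanly drawn in the emulator today.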
end of thread, other threads:[~2019-08-23 1:13 UTC | newest]

Thread overview: 13+ messages -- links below jump to the message on this page --
2019-08-16 19:51 [Xen-devel] More questions about Xen memory layout/usage, access to guest memory Johnson, Ethan
2019-08-17 11:04 ` Andrew Cooper
2019-08-22  2:06   ` Johnson, Ethan
2019-08-22 13:51     ` Andrew Cooper
2019-08-22 15:06       ` Rian Quinn
2019-08-22 22:42         ` Andrew Cooper
2019-08-22 17:36       ` Tamas K Lengyel
2019-08-22 22:49         ` Andrew Cooper
2019-08-22 20:57       ` Rich Persaud
2019-08-22 22:39         ` Andrew Cooper
2019-08-22 23:06           ` Tamas K Lengyel
2019-08-23  0:03             ` Andrew Cooper
2019-08-23  1:12               ` Tamas K Lengyel