From: Andrew Cooper <andrew.cooper3@citrix.com>
To: "Johnson, Ethan" <ejohns48@cs.rochester.edu>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory
Date: Sat, 17 Aug 2019 12:04:04 +0100
Message-ID: <79c7b71f-0b61-2799-4a79-644536a9c891@citrix.com>
In-Reply-To: <15a4c482-1207-1d8a-fd2a-dc4f25956c27@cs.rochester.edu>

On 16/08/2019 20:51, Johnson, Ethan wrote:
> Hi all,
>
> I have some follow-up questions about Xen's usage and layout of memory, 
> building on the ones I asked here a few weeks ago (which were quite 
> helpfully answered: see 
> https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01513.html 
> for reference). For context on why I'm asking these questions, I'm using 
> Xen as a research platform for enforcing novel memory protection schemes 
> on hypervisors and guests.
>
> 1. Xen itself lives in the memory region from (on x86-64) 0xffff 8000 
> 0000 0000 - 0xffff 87ff ffff ffff, regardless of whether it's in PV mode 
> or HVM/PVH. Clearly, in PV mode a separate set of page tables (i.e. CR3 
> root pointer) must be used for each guest.

More than that.  Each vCPU.

PV guests manage their own pagetables, and have a vCR3 which the guest
kernel controls, and we must honour.

For 64bit PV guests, each time a new L4 pagetable is created, Xen sets
up its own 16 slots appropriately.  As a result, Xen itself is able to
function correctly on all pagetable hierarchies the PV guest creates.
See init_xen_l4_slots(), which does this.
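
To give a feel for it, here is a minimal sketch of the idea (this is not
the real init_xen_l4_slots(), which also wires up the self-linear and
per-domain slots; the slot constants are assumed to match Xen's reserved
range):

    #include <stdint.h>

    /* Assumed to correspond to Xen's reserved L4 range (slots 256-271). */
    #define FIRST_XEN_SLOT  256
    #define NR_XEN_SLOTS     16

    /* Copy Xen's own L4 entries from a reference (idle) L4 into a freshly
     * created guest L4, so Xen can run while this hierarchy is loaded. */
    static void sketch_init_xen_l4_slots(uint64_t *new_l4,
                                         const uint64_t *idle_l4)
    {
        for ( unsigned int i = 0; i < NR_XEN_SLOTS; i++ )
            new_l4[FIRST_XEN_SLOT + i] = idle_l4[FIRST_XEN_SLOT + i];
    }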

For 32bit PV guests, things are a tad more complicated.  Each vCR3 is
actually a PAE-quad of pagetable entries.  Because Xen is still
operating in 64bit mode with 4-level paging, we enforce that guests
allocate a full 4k page for the pagetable (rather than the 32 bytes it
would normally be).

In Xen, we allocate what is called a monitor table, which is per-vcpu
(set up with all the correct details for Xen), and we rewrite slot 0
each time the vCPU changes vCR3.


Not related to this question, but important for future answers: all
pagetables are actually, at a minimum, per-domain, because we have
per-domain mappings to simplify certain tasks.  Contained within these
are various structures, including the hypercall compatibility
translation area.  This per-domain restriction can in principle be
lifted if we alter the way Xen chooses to lay out its memory.

> Is that also true of the host 
> (non-extended, i.e. CR3 in VMX root mode) page tables when an HVM/PVH 
> guest is running?

Historical context is important to answer this question.

When the first HVM support came along, there was no EPT or NPT in
hardware.  Hypervisors were required to virtualise the guest's pagetable
structures, which is called Shadow Paging in Xen.  The shadow pagetables
themselves are organised per-domain so as to form a single coherent
guest physical address space, but a CPU operating in non-root mode still
needed the real CR3 pointing at the shadow of whichever logical vCPU's
CR3 was being virtualised.

In practice, we still allocate a monitor pagetable per vcpu for HVM
guests, even with HAP support.  I can't think of any restrictions which
would prevent us from doing this differently.

> Or is the dom0 page table left in place, assuming the 
> dom0 is PV, when an HVM/PVH guest is running, since extended paging is 
> now being used to provide the guest's view of memory? Does this change 
> if the dom0 is PVH?

Here is some (prototype) documentation prepared since your last round of
questions.

https://andrewcoop-xen.readthedocs.io/en/docs-devel/admin-guide/introduction.html

Dom0 is just a VM, like every other domU in the system.  There is
nothing special about how it is virtualised.

Dom0 defaults to having full permissions, so it can successfully issue a
whole range of more interesting hypercalls, but you could easily create
dom1, set the is_priv boolean in Xen, and give dom1 all the same
permissions that dom0 has, if you wished.

> Or, to ask this from another angle: is there ever anything *but* Xen 
> living in the host-virtual address space when an HVM/PVH guest is 
> active?

No - although that depends on how you classify Xen's directmap (which
maps all host RAM, guest memory included) in this context.

> And is the answer to this different depending on whether the 
> HVM/PVH guest is a domU vs. a PVH dom0?

Dom0 vs domU has no relevance to the question.

> 2. Do the mappings in Xen's slice of the host-virtual address space 
> differ at all between the host page tables corresponding to different 
> guests?

No (ish).

Xen has a mostly flat address space, so most of the mappings are the
same.  There is a per-domain mapping slot which is common to each vCPU
in a domain but different across domains, a self-linear map for easy
modification of the PTEs of the current pagetable hierarchy, and a
shadow-linear map for easy modification of the shadow PTEs, which map an
address space that Xen itself is not in at all.
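
The self-linear (a.k.a. recursive) map is the standard trick of pointing
one L4 slot back at the L4 itself; the L1 entry covering any virtual
address then appears at a computable virtual address.  A sketch, with
the slot number assumed rather than Xen's real value:

    #include <stdint.h>

    #define SELF_SLOT  256UL   /* assumed, for illustration only */
    #define SELF_BASE  (0xffff000000000000UL | (SELF_SLOT << 39))

    /* Virtual address of the L1 entry (PTE) currently mapping 'va', when
     * L4[SELF_SLOT] points back at the L4 itself. */
    static uint64_t *sketch_linear_pte(uint64_t va)
    {
        return (uint64_t *)(SELF_BASE +
                            ((va >> 12) & ((1UL << 36) - 1)) * 8);
    }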

> If the mappings are in fact the same, does Xen therefore share 
> lower-level page table pages between the page tables corresponding to 
> different guests?

We have many different L4s (the monitor tables, plus every L4 a PV guest
has allocated) which can run Xen.  Most parts of Xen's address space
converge at L3 (the M2P, the directmap, Xen
text/data/bss/fixmap/vmap/heaps/misc), and are common to all contexts.

The per-domain mappings converge at L3 and are shared between vCPUs of
the same guest, but not across guests.

One aspect I haven't really covered is XPTI, the Meltdown mitigation for
PV guests.  Here, we have a per-CPU private pagetable which ends up
being a merge of most of the guest's L4, but with some pre-constructed
CPU-private pagetable hierarchy to hide the majority of data in the Xen
region.

> Is any of this different for PV vs. HVM/PVH?

PV guests control their own parts of their address space, and can do
largely whatever they choose.  HVM contexts have nothing in the lower
canonical half, but do have an extended directmap (which in practice
only makes a difference on a >5TB machine).

> 3. Under what circumstances, and for what purposes, does Xen use its 
> ability to access guest memory through its direct map of host-physical 
> memory?

That is a very broad question, and currently has the unfortunate answer
of "whenever speculation goes awry in an attacker's favour."  There are
steps under way to reduce the usage of the directmap so we can run
without it, and prevent this kind of leakage.

As for when Xen would normally access memory, the most common answer is
for hypercall parameters, which mostly use a virtual-address-based ABI.
Also, any time we need to emulate an instruction, we need to read a fair
amount of guest state, including the instruction under %rip.
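
In other words, the general pattern for a hypercall is that the guest
passes a virtual address and Xen copies the data out before acting on it
(Xen's real primitive is the copy_from_guest() family; the helper below
is invented for illustration):

    #include <stdint.h>
    #include <stddef.h>

    struct sketch_op { uint64_t cmd; uint64_t arg; };

    /* Assumed helper: copy nr bytes from a guest virtual address, failing
     * cleanly if the guest supplied a bad pointer. */
    int sketch_copy_from_guest_virt(void *dst, uint64_t guest_va, size_t nr);

    long sketch_do_hypercall(uint64_t guest_va)
    {
        struct sketch_op op;

        if ( sketch_copy_from_guest_virt(&op, guest_va, sizeof(op)) )
            return -14;  /* -EFAULT: bad guest pointer */

        /* ... act on op.cmd / op.arg ... */
        return 0;
    }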

> Similarly, to what extent does the dom0 (or other such 
> privileged domain) utilize "foreign memory maps" to reach into another 
> guest's memory? I understand that this is necessary when creating a 
> guest, for live migration, and for QEMU to emulate stuff for HVM guests; 
> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" 
> access a guest's memory?

I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
if it chooses.  There is no "force" about it.

Debuggers and/or introspection are other reasons why dom0 might choose
to map guest RAM, but I think you've covered the common cases.

> (I ask because the research project I'm working on is seeking to protect 
> guests from a compromised hypervisor and dom0, so I need to limit 
> outside access to a guest's memory to explicitly shared pages that the 
> guest will treat as untrusted - not storing any secrets there, vetting 
> input as necessary, etc.)

Sorry to come along with roadblocks, but how on earth do you intend to
prevent a compromised Xen from accessing guest memory?  A compromised
Xen can do almost anything it likes, and without recourse.  This is
ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
are coming along, because only the hardware itself is in a position to
isolate an untrusted hypervisor/kernel from guest data.

For dom0, that's perhaps easier.  You could reference count the number
of foreign mappings into the domain as it is created, and refuse to
unpause the guest's vCPUs until the foreign map count has dropped to 0.
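
A minimal sketch of that idea (all names invented; this is not an
existing Xen interface):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct sketch_domain {
        atomic_int foreign_map_count;   /* ++ on foreign map, -- on unmap */
        bool construction_complete;
    };

    /* Only allow the guest's vCPUs to be unpaused once construction has
     * finished and every foreign mapping has been torn down. */
    static bool sketch_may_unpause(const struct sketch_domain *d)
    {
        return d->construction_complete &&
               atomic_load(&d->foreign_map_count) == 0;
    }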

> 4. What facilities/processes does Xen provide for PV(H) guests to 
> explicitly/voluntarily share memory pages with Xen and other domains 
> (dom0, etc.)? From what I can gather from the documentation, it sounds 
> like "grant tables" are involved in this - is that how a PV-aware guest 
> is expected to set up shared memory regions for communication with other 
> domains (ring buffers, etc.)?

Yes.  Grant Tables is Xen's mechanism for the coordinated setup of
shared memory between two consenting domains.
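
For reference, the granting guest's side of a v1 grant boils down to
filling in one grant-table entry.  The layout and flag below follow the
public grant_table.h interface, but take the details here as a sketch
rather than gospel:

    #include <stdint.h>
    #include <stdatomic.h>

    typedef uint16_t domid_t;

    struct grant_entry_v1 {
        uint16_t flags;     /* GTF_* */
        domid_t  domid;     /* remote domain allowed to map the frame */
        uint32_t frame;     /* frame number being shared */
    };

    #define GTF_permit_access  (1U << 0)

    /* Publish a grant: fill in the target domain and frame first, then set
     * the flags last so the entry never appears half-valid to the remote. */
    static void sketch_grant_frame(volatile struct grant_entry_v1 *e,
                                   domid_t remote, uint32_t frame)
    {
        e->domid = remote;
        e->frame = frame;
        atomic_thread_fence(memory_order_release);
        e->flags = GTF_permit_access;
    }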

> Does a PV(H) guest need to voluntarily 
> establish all external access to its pages, or is there ever a situation 
> where it's the other way around - where Xen itself establishes/defines a 
> region as shared and the guest is responsible for treating it accordingly?

During domain construction, two grants/events are constructed
automatically.  One is for the xenstore ring, and one is for the console
ring.  The latter exists so the guest can get debugging output from very
early code, while both are, in practice, done like this because the
guest has no a priori way to establish the grants/events itself.

For all other shared interfaces, the guests are expected to negotiate
which grants/events/rings/details to use via Xenstore.
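
As a sketch of what that negotiation typically looks like from a
frontend (the helper is invented; real frontends use xenbus
transactions, and the key names vary per protocol):

    #include <stdint.h>
    #include <stdio.h>

    int xenstore_write(const char *path, const char *value);   /* assumed */

    /* Advertise a shared ring (grant reference) and event channel under
     * the frontend's Xenstore directory, for the backend to pick up. */
    static int sketch_advertise_ring(const char *frontend_dir,
                                     uint32_t ring_gref, uint32_t evtchn)
    {
        char path[128], val[16];

        snprintf(path, sizeof(path), "%s/ring-ref", frontend_dir);
        snprintf(val, sizeof(val), "%u", (unsigned int)ring_gref);
        if ( xenstore_write(path, val) )
            return -1;

        snprintf(path, sizeof(path), "%s/event-channel", frontend_dir);
        snprintf(val, sizeof(val), "%u", (unsigned int)evtchn);
        return xenstore_write(path, val);
    }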

> Again, this mostly boils down to: under what circumstances, if ever, 
> does Xen ever "force" access to any part of a guest's memory? 
> (Particularly for PV(H). Clearly that must happen for HVM since, by 
> definition, the guest is unaware there's a hypervisor controlling its 
> world and emulating hardware behavior, and thus is in no position to 
> cooperatively/voluntarily give the hypervisor and dom0 access to its 
> memory.)

There are cases for all guest types where Xen will need to emulate
instructions.  Xen will access guest memory in order to perform
architecturally correct actions, which generally starts with reading the
instruction under %rip.
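
A sketch of that first step (the helper is invented; Xen's real path
goes via the x86_emulate framework and its guest-copy hooks):

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_INSN_LEN 15   /* x86 architectural maximum */

    /* Assumed helper: copy bytes from a guest linear address, walking the
     * guest's pagetables and then its p2m/EPT as appropriate. */
    int sketch_copy_from_guest_linear(void *buf, uint64_t linear, size_t len);

    static int sketch_fetch_insn(uint64_t rip, uint8_t insn[MAX_INSN_LEN])
    {
        /* A real fetch also applies segmentation and copes with the fetch
         * crossing a page boundary; this only shows the shape of it. */
        return sketch_copy_from_guest_linear(insn, rip, MAX_INSN_LEN);
    }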

For PV guests, this is almost entirely restricted to guest-kernel
operations which are privileged in nature: accesses to MSRs, writes to
pagetables, etc.

For HVM and PVH guests, while PVH means "HVM without Qemu", that doesn't
mean a complete absence of emulation.  The Local APIC is emulated by Xen
in most cases, as a bare minimum, but, for example, the LMSW instruction
on AMD hardware doesn't have any intercept decoding to help the
hypervisor out when a guest uses the instruction.

~Andrew
