On 22/02/2021 18:52, Kevin Negy wrote:
> Hello again,
>
> Thank you for the helpful responses. I have several follow up questions.
>
> 1)
>
>     With Shadow, Xen has to do the combination of address spaces itself -
>     the shadow pagetables map guest virtual to host physical address.
>
>
>     The shadow_blow_tables() call is "please recycle everything" which
>     is used
>     to throw away all shadow pagetables, which in turn will cause the
>     shadows to be recreated from scratch as the guest continues to run.
>
>
> With shadowing enabled, given a guest virtual address, how does the
> hypervisor recreate the mapping to the host physical address (mfn)
> from the virtual address if the shadow page tables are empty (after a
> call to shadow_blow_tables, for instance)? I had been thinking of
> shadow page tables as the definitive mapping between guest pages and
> machine pages, but should I think of them as more of a TLB, which
> implies there's another way to get/recreate the mappings if there's no
> entry in the shadow table?

Your observation about "being like a TLB" is correct.

Let's take the simplest case, of 4-on-4 shadows, i.e. Xen and the guest
are both in 64bit mode, and using 4-level paging.

Each domain also has a structure which Xen calls a P2M, for the guest
physical => host physical mappings.  (For PV guests, it's actually the
identity transform, and for HVM, it is a set of EPT or (N)PT pagetables,
but the exact structure isn't important here.)

The other primitive required is an emulated pagewalk, i.e. we start at
the guest's %cr3 value and walk through the guest's pagetables as
hardware would.  Each step involves a lookup in the P2M, as the guest
PTEs are programmed with guest physical addresses, not host physical.

In reality, we always have a "top level shadow" per vcpu.  In this
example, it is a level-4 pagetable, which starts out clear (i.e. no
guest entries present).  We need *something* to point hardware at when
we start running the guest.

Once we run the guest, we immediately take a pagefault.  We look at
%cr2 to find the linear address accessed, and perform a pagewalk.  In
the common case, we find that the linear address is valid in the guest,
so we allocate a level 3 pagetable, again clear, then point the
appropriate L4e at it, then re-enter the guest.

This takes an immediate pagefault again, and we allocate an L2
pagetable, re-enter, then allocate an L1 pagetable, and finally point
an L1e at the host physical page.  Now we can successfully fetch the
instruction (if it doesn't cross a page boundary), then repeat the
process for every subsequent memory access.

This example is simplified specifically to demonstrate the point.
Everything is driven from pagefaults.

There is of course far more complexity.  We typically populate all the
way down to an L1e in one go, because this is far more efficient than
taking 4 real pagefaults.  If we walk the guest pagetables and find a
violation, we have to hand #PF back to the guest kernel rather than
change the shadows.  To emulate dirty bits correctly, we need to leave
the shadow read-only even if the guest PTE was read/write, so we can
spot when hardware tries to set the D bit in the shadows and copy it
back into the guest's view.  Superpages are complicated to deal with
(we have to splinter them into 4k pages), and 2-on-3 (a legacy 32bit OS
with non-PAE paging) is a total nightmare because of the different
format of pagetable entries.

Also notice that a guest TLB flush is also implemented as "drop all
shadows under this virtual cr3".
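
If it helps to see that fault-driven population as code, below is a
minimal toy model in C (not Xen code - every name in it is invented for
illustration) of the loop described above: a clear per-vcpu top level
shadow, an emulated guest pagewalk, a P2M lookup, and lower-level
shadow tables allocated lazily as pagefaults arrive.

/* Toy sketch of lazy shadow population for 4-on-4.  Not Xen code; the
 * names (toy_p2m, gwalk, shadow_fault, ...) are made up. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRIES 512
typedef uint64_t pte_t;              /* bit 0 = present, rest = next table */

static pte_t *alloc_table(void)      /* one clear (all not-present) table */
{
    return calloc(ENTRIES, sizeof(pte_t));
}

/* Toy P2M: guest physical frame -> host physical frame. */
static uint64_t toy_p2m(uint64_t gfn) { return gfn + 0x100000; }

/* Emulated guest pagewalk.  Here the "guest pagetables" are faked so
 * that every linear address is valid and maps gfn == (va >> 12). */
static int gwalk(uint64_t va, uint64_t *gfn) { *gfn = va >> 12; return 1; }

/* Handle one shadow #PF: walk down the shadow tables, allocating a
 * clear lower-level table wherever an entry is not present, and finally
 * point the L1e at the host frame found via pagewalk + P2M. */
static void shadow_fault(pte_t *sl4, uint64_t va)
{
    uint64_t gfn;

    if ( !gwalk(va, &gfn) )
    {
        printf("  guest walk failed, inject #PF into guest\n");
        return;
    }

    pte_t *t = sl4;
    for ( int lvl = 4; lvl > 1; lvl-- )
    {
        unsigned int idx = (va >> (12 + 9 * (lvl - 1))) & (ENTRIES - 1);

        if ( !(t[idx] & 1) )
        {
            t[idx] = (uint64_t)(uintptr_t)alloc_table() | 1;
            printf("  allocated shadow L%d table\n", lvl - 1);
        }
        t = (pte_t *)(uintptr_t)(t[idx] & ~1ULL);
    }
    t[(va >> 12) & (ENTRIES - 1)] = (toy_p2m(gfn) << 12) | 1;
    printf("  L1e now maps va %#lx -> mfn %#lx\n",
           (unsigned long)va, (unsigned long)toy_p2m(gfn));
}

int main(void)
{
    pte_t *top = alloc_table();           /* per-vcpu "top level shadow" */

    shadow_fault(top, 0x7f0000401000ULL); /* first fault: L3/L2/L1 allocated */
    shadow_fault(top, 0x7f0000402000ULL); /* next access: only an L1e needed */
    return 0;
}

(A real implementation obviously has to map/unmap the shadow frames,
respect permissions, and so on - this only shows the allocate-on-fault
shape of the algorithm.)
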
> 2) I'm trying to grasp the general steps of enabling shadowing and
> handling page faults. Is this correct?
>     a) Normal PV - default shadowing is disabled, guest has its page
> tables in r/w mode or whatever mode is considered normal for guest
> page tables

It would be a massive security vulnerability to let PV guests write to
their own pagetables.

PV guest pagetables are read-only, and all updates are made via
hypercall, so they can be audited for safety.  (We do actually have
pagetable emulation for PV guests which do write to their own
pagetables; it feeds into the same logic as the hypercall, but is less
efficient overall.)

>     b) Shadowing is enabled - shadow memory pool allocated, all memory
> accesses must now go through shadow pages in CR3. Since no entries are
> in shadow tables, initial read and writes from the guest will result
> in page faults.

PV guests share an address space with Xen, so the top level shadow for
a PV guest is actually pre-populated with Xen's mappings, but all guest
entries are faulted in on demand.

>     c) As soon as the first guest memory access occurs, a mandatory
> page fault occurs because there is no mapping in the shadows. Xen does
> a guest page table walk for the address that caused the fault (va) and
> then marks all the guest page table pages along the walk as read only.

The first guest memory access is actually the instruction fetch at
%cs:%rip.  Once that address is shadowed, you further have to shadow
any memory operands (which can be more than one, e.g. `PUSH ptr` has a
regular memory operand, and an implicit stack operand which needs
shadowing.  With the AVX scatter/gather instructions, you can have an
almost-arbitrary number of memory operands.)

Also, be very careful with terminology.  Linear and virtual addresses
are different (by the segment selector base, which is commonly but not
always 0).  Lots of Xen code uses va/vaddr when it means linear
addresses.

>     d) Xen finds out the mfn of the guest va somehow (my first
> question) and adds the mapping of the va to the shadow page table.

Yes.  This is a combination of the pagewalk and P2M to identify the mfn
in question for the linear address, along with suitable
allocations/modifications to the shadow pagetables.

>     e) If the page fault was a write, the va is now marked as
> read/write but logged as dirty in the logdirty map.

Actually, when the VM is in global logdirty mode, we always start by
writing all shadow L1e's as read-only, even if the guest has them
read-write.  This causes all writes to trap with #PF, which lets us see
which frame is being written to, and lets us set the appropriate bit in
the logdirty bitmap.  (There is a short sketch of this below.)

>     e) Now the next page fault to any of the page tables marked
> read-only in c) must have been caused by the guest writing to its
> tables, which can be reflected in the shadow page tables.

Writeability of the guest's actual pagetables is complicated and
guest-dependent.  Under a strict TLB-like model, it's not actually
required to restrict writeability.

In real hardware, the TLB is an explicitly non-coherent cache, and
software is required to issue a TLB flush to ensure that changes to the
PTEs in memory are subsequently propagated into the TLB.
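
To make that logdirty flow concrete, here is a small self-contained toy
model (again, not Xen's code; all of the names are invented), showing
"shadow L1e's start read-only, the first write to each frame traps, we
set its bit in the logdirty bitmap, and only then do we grant
read/write in the shadow":

/* Toy sketch of logdirty via read-only shadow L1e's.  Not Xen code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_FRAMES     64
#define _PAGE_PRESENT 0x1
#define _PAGE_RW      0x2

static uint64_t guest_l1e[NR_FRAMES];    /* the guest's view */
static uint64_t shadow_l1e[NR_FRAMES];   /* what hardware actually uses */
static uint8_t  logdirty[NR_FRAMES / 8]; /* one bit per guest frame */

static void enter_logdirty(void)
{
    memset(logdirty, 0, sizeof(logdirty));

    /* Shadow every present guest L1e, but with RW stripped. */
    for ( unsigned int f = 0; f < NR_FRAMES; f++ )
        if ( guest_l1e[f] & _PAGE_PRESENT )
            shadow_l1e[f] = guest_l1e[f] & ~(uint64_t)_PAGE_RW;
}

/* A write hit a read-only shadow entry.  If the guest PTE really is
 * writeable, mark the frame dirty and propagate RW into the shadow;
 * otherwise it is the guest's own fault and gets handed back. */
static void write_fault(unsigned int frame)
{
    if ( !(guest_l1e[frame] & _PAGE_RW) )
    {
        printf("frame %u: genuine guest fault, hand #PF back\n", frame);
        return;
    }

    logdirty[frame / 8] |= 1u << (frame % 8);
    shadow_l1e[frame] |= _PAGE_RW;
    printf("frame %u: marked dirty, shadow now read/write\n", frame);
}

int main(void)
{
    guest_l1e[3] = _PAGE_PRESENT | _PAGE_RW; /* guest thinks this is RW */
    guest_l1e[5] = _PAGE_PRESENT;            /* genuinely read-only */

    enter_logdirty();
    write_fault(3);   /* first write: trap, set bit, grant RW */
    write_fault(5);   /* not writeable in the guest: reflect #PF */
    return 0;
}
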
> 3) How do Xen/shadow page tables distinguish between two equivalent
> guest virtual addresses from different guest processes? I suppose when
> a guest OS tries to change page tables from one process to another,
> this will cause a page fault that Xen will trap and be able to infer
> that the current shadow page table should be swapped to a different
> one corresponding to the new guest process?

Changing processes involves writing to %cr3, which is a TLB flush, so
in a strict TLB-like model, all shadows must be dropped.

In reality, this is where we start using restricted writeability to our
advantage.  If we know that no writes to pagetables have happened, we
know "the TLB" (== the currently established shadows) isn't actually
stale, so it may be retained and reused.

We do maintain hash lists of types of pagetable, so we can locate
pre-existing shadows of a specific type.  This is how we can switch
between already-established shadows when the guest changes %cr3.  (See
the sketch at the end of this mail.)

In reality, the kernel half of the virtual address space doesn't change
much after boot, so there is a substantial performance win from not
dropping and reshadowing these entries.  There are loads and loads of
L4 pagetables (one per process), all pointing to common L3's which form
the kernel half of the address space.

If I'm being honest, this is where my knowledge of exactly what Xen
does breaks down - I'm not the author of the shadow code; I've merely
debugged it a few times.

I hope this is still informative.

~Andrew
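
P.S. For a concrete picture of the "hash lists of shadows" reuse, here
is a toy model (not Xen's code; the names and hash function are
invented) of looking up an existing shadow keyed by (guest frame,
shadow type) when the guest writes %cr3, and only creating a new one on
a miss:

/* Toy sketch of reusing already-established shadows across %cr3 writes. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum shadow_type { SH_L1, SH_L2, SH_L3, SH_L4 };

struct shadow {
    uint64_t gmfn;              /* guest frame being shadowed */
    enum shadow_type type;
    struct shadow *next;        /* hash chain */
    /* ... the shadow pagetable itself would hang off here ... */
};

#define HASH_BUCKETS 64
static struct shadow *hash[HASH_BUCKETS];

static unsigned int bucket(uint64_t gmfn, enum shadow_type t)
{
    return (unsigned int)((gmfn * 2654435761u) ^ t) % HASH_BUCKETS;
}

static struct shadow *shadow_lookup(uint64_t gmfn, enum shadow_type t)
{
    for ( struct shadow *s = hash[bucket(gmfn, t)]; s; s = s->next )
        if ( s->gmfn == gmfn && s->type == t )
            return s;
    return NULL;
}

/* The guest wrote %cr3: reuse an existing L4 shadow of that frame if we
 * have one, otherwise create (and hash) a new one. */
static struct shadow *guest_wrote_cr3(uint64_t cr3_gmfn)
{
    struct shadow *s = shadow_lookup(cr3_gmfn, SH_L4);

    if ( s )
    {
        printf("cr3 -> gmfn %#lx: reusing existing L4 shadow\n",
               (unsigned long)cr3_gmfn);
        return s;
    }

    s = calloc(1, sizeof(*s));
    s->gmfn = cr3_gmfn;
    s->type = SH_L4;
    s->next = hash[bucket(cr3_gmfn, SH_L4)];
    hash[bucket(cr3_gmfn, SH_L4)] = s;
    printf("cr3 -> gmfn %#lx: no shadow yet, creating one\n",
           (unsigned long)cr3_gmfn);
    return s;
}

int main(void)
{
    guest_wrote_cr3(0x1000);    /* process A: miss, shadow from scratch */
    guest_wrote_cr3(0x2000);    /* process B: miss, shadow from scratch */
    guest_wrote_cr3(0x1000);    /* back to process A: hit, reuse */
    return 0;
}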