Re: How does shadow page table work during migration?

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Kevin Negy <kevinnegy@gmail.com>
Cc: <xen-devel@lists.xenproject.org>
Subject: Re: How does shadow page table work during migration?
Date: Tue, 23 Feb 2021 00:23:18 +0000
Message-ID: <6a91e41c-b7f9-f856-bc55-fd92b8188adc@citrix.com>
In-Reply-To: <CACZWC-r7fS2AztaAgGdVPv5NcJiAxZ5mvC4FQTkorPDGwOfn9g@mail.gmail.com>

On 22/02/2021 18:52, Kevin Negy wrote:
> Hello again,
>
> Thank you for the helpful responses. I have several follow up questions.
>
> 1)
>
>     With Shadow, Xen has to do the combination of address spaces itself -
>     the shadow pagetables map guest virtual to host physical address.
>
>
>     The shadow_blow_tables() call is "please recycle everything" which
>     is used
>     to throw away all shadow pagetables, which in turn will cause the
>     shadows to be recreated from scratch as the guest continues to run. 
>
>
> With shadowing enabled, given a guest virtual address, how does the
> hypervisor recreate the mapping to the host physical address (mfn)
> from the virtual address if the shadow page tables are empty (after a
> call to shadow_blow_tables, for instance)? I had been thinking of
> shadow page tables as the definitive mapping between guest pages and
> machine pages, but should I think of them as more of a TLB, which
> implies there's another way to get/recreate the mappings if there's no
> entry in the shadow table?

Your observation about "being like a TLB" is correct.

Let's take the simplest case, of 4-on-4 shadows.  I.e. Xen and the
guest are both in 64bit mode, and using 4-level paging.

Each domain also has a structure which Xen calls a P2M, for the guest
physical => host physical mappings.  (For PV guests, it's actually an
identity transform, and for HVM, it is a set of EPT or (N)PT pagetables,
but the exact structure isn't important here.)

The other primitive required is an emulated pagewalk.  I.e. we start at
the guest's %cr3 value, and walk through the guest's pagetables as
hardware would.  Each step involves a lookup in the P2M, as the guest
PTEs are programmed with guest physical addresses, not host physical.
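
As a rough illustration of that primitive, here is a toy model in
Python.  It is a sketch only, not Xen's code: the names (p2m_lookup,
pagewalk, host_mem) are made up for the example, the P2M is modelled as
a plain dict, and superpages, permission checks and A/D bit updates are
all ignored.

```python
# Toy model of an emulated 4-level pagewalk.  Every name here is
# hypothetical; none of this is Xen's actual interface.

PAGE_SHIFT = 12
PTE_PRESENT = 1 << 0
ADDR_MASK = 0x000FFFFFFFFFF000  # bits 12..51 of a PTE hold the frame address


def p2m_lookup(p2m, gfn):
    """Guest frame number -> machine frame number, or None if unmapped."""
    return p2m.get(gfn)


def pagewalk(p2m, host_mem, cr3, linear):
    """Walk the guest's pagetables as hardware would.

    host_mem maps an mfn to a list of 512 PTE values for that frame.
    Returns the mfn backing `linear`, or None if the guest mapping is
    not present (in which case the guest would be given a #PF).
    """
    table_gfn = cr3 >> PAGE_SHIFT
    for level in (4, 3, 2, 1):
        # Guest PTEs hold guest physical addresses, so every step needs
        # a P2M lookup before the table can actually be read.
        table_mfn = p2m_lookup(p2m, table_gfn)
        if table_mfn is None:
            return None
        index = (linear >> (PAGE_SHIFT + 9 * (level - 1))) & 0x1FF
        pte = host_mem[table_mfn][index]
        if not pte & PTE_PRESENT:
            return None
        table_gfn = (pte & ADDR_MASK) >> PAGE_SHIFT
    # After the L1 step, table_gfn is the gfn of the data page itself.
    return p2m_lookup(p2m, table_gfn)
```

Note the five P2M lookups for one translation: one per pagetable level,
plus one for the final data frame.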

In reality, we always have a "top level shadow" per vcpu.  In this
example, it is a level-4 pagetable, which starts out clear (i.e. no
guest entries present).  We need *something* to point hardware at when
we start running the guest.

Once we run the guest, we immediately take a pagefault.  We look at %cr2
to find the linear address accessed, and perform a pagewalk.  In the
common case, we find that the linear address is valid in the guest, so
we allocate a level 3 pagetable, again clear, then point the appropriate
L4e at it, then re-enter the guest.

This takes an immediate pagefault again, and we allocate an L2
pagetable, re-enter then allocate an L1 pagetable, and finally point an
L1e at the host physical page.  Now, we can successfully fetch the
instruction (if it doesn't cross a page boundary), then repeat the
process for every subsequent memory access.

This example is simplified specifically to demonstrate the point. 
Everything is driven from pagefaults.
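
The simplified flow above can be sketched as follows.  This is a toy
model with hypothetical names, using nested dicts in place of real
pagetables; guest_walk stands in for the emulated pagewalk plus P2M
lookup.

```python
# Toy model of fault-driven shadow population, one level per fault.
# All names are hypothetical; this is not how Xen structures its code.

def indices(linear):
    """Split a linear address into its L4, L3, L2, L1 table indices."""
    return [(linear >> (12 + 9 * (lvl - 1))) & 0x1FF for lvl in (4, 3, 2, 1)]


def shadow_translate(top, linear):
    """What hardware does: walk the shadow, return the mfn or None (#PF)."""
    node = top
    for idx in indices(linear):
        if idx not in node:
            return None
        node = node[idx]
    return node


def handle_fault(top, linear, guest_walk):
    """Fix up one missing level, then let the guest retry.

    Returns False when the guest itself has no valid mapping, i.e. the
    fault has to be handed back to the guest kernel.
    """
    mfn = guest_walk(linear)
    if mfn is None:
        return False
    l4, l3, l2, l1 = indices(linear)
    node = top
    for idx in (l4, l3, l2):
        if idx not in node:
            node[idx] = {}   # allocate a clear lower-level shadow
            return True      # re-enter the guest, which faults again
        node = node[idx]
    node[l1] = mfn           # finally point the L1e at the host page
    return True


def run_access(top, linear, guest_walk):
    """Retry one access until it succeeds; return (mfn, faults taken)."""
    faults = 0
    while True:
        mfn = shadow_translate(top, linear)
        if mfn is not None:
            return mfn, faults
        faults += 1
        if not handle_fault(top, linear, guest_walk):
            return None, faults
```

Starting from an empty top-level shadow, the first access takes four
faults, a second access to the same page takes none, and an access to a
neighbouring page under the same L1 takes one.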

There is of course far more complexity.  We typically populate all the
way down to an L1e in one go, because this is far more efficient than
taking 4 real pagefaults.  If we walk the guest pagetables and find a
violation, we have to hand #PF back to the guest kernel rather than
change the shadows.  To emulate dirty bits correctly, we need to leave
the shadow read-only even if the guest PTE was read/write so we can spot
when hardware tries to set the D bit in the shadows, and copy it back
into the guest's view.  Superpages are complicated to deal with (we have
to splinter to 4k pages), and 2-on-3 (legacy 32bit OS with non-PAE
paging) is a total nightmare because of the different format of
pagetable entries.

Also notice that a guest TLB flush is implemented as "drop all shadows
under this virtual cr3".

> 2) I'm trying to grasp the general steps of enabling shadowing and
> handling page faults. Is this correct?
>     a) Normal PV - default shadowing is disabled, guest has its page
> tables in r/w mode or whatever mode is considered normal for guest
> page tables

It would be a massive security vulnerability to let PV guests write to
their own pagetables.

PV guest pagetables are read-only, and all updates are made via
hypercall, so they can be audited for safety.  (We do actually have
pagetable emulation for those PV guests which do write to their own
pagetables.  It feeds into the same logic as the hypercall, but is less
efficient overall.)

>     b) Shadowing is enabled - shadow memory pool allocated, all memory
> accesses must now go through shadow pages in CR3. Since no entries are
> in shadow tables, initial read and writes from the guest will result
> in page faults.

PV guests share an address space with Xen.  So actually the top level
shadow for a PV guest is pre-populated with Xen's mappings, but all
guest entries are faulted in on demand.

>     c) As soon as the first guest memory access occurs, a mandatory
> page fault occurs because there is no mapping in the shadows. Xen does
> a guest page table walk for the address that caused the fault (va) and
> then marks all the guest page table pages along the walk as read only.

The first guest memory access is actually the instruction fetch at
%cs:%rip.  Once that address is shadowed, you further have to shadow any
memory operands, which can be more than one.  E.g. `PUSH ptr` has a
regular memory operand and an implicit stack operand, both of which need
shadowing.  With the AVX scatter/gather instructions, you can have an
almost-arbitrary number of memory operands.

Also, be very careful with terminology.  Linear and virtual addresses
are different (they differ by the segment base, which is commonly but
not always 0).  Lots of Xen code uses va/vaddr when it means linear
addresses.
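
To make the distinction concrete, here is a deliberately tiny sketch.
The function name is made up, and it ignores segment limit and
canonical-address checks.

```python
# Hypothetical helper: linear address = segment base + virtual offset,
# truncated to the address width.  Limit/canonical checks are ignored.

def linear_address(seg_base, offset, bits=64):
    return (seg_base + offset) & ((1 << bits) - 1)


# Flat segment: base 0, so virtual and linear coincide.
flat = linear_address(0x0, 0x401000)

# Non-zero base, e.g. a %fs- or %gs-relative access to per-thread or
# per-cpu data: the same offset names a different linear address.
based = linear_address(0x7FFF00000000, 0x28)
```

It is the linear address which appears in %cr2 and which gets walked
through the pagetables.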

>     d) Xen finds out the mfn of the guest va somehow (my first
> question) and adds the mapping of the va to the shadow page table.

Yes.  This is a combination of the pagewalk and P2M to identify the mfn
in question for the linear address, along with suitable
allocations/modifications to the shadow pagetables.

>     e) If the page fault was a write, the va is now marked as
> read/write but logged as dirty in the logdirty map.

Actually, what we do when the VM is in global logdirty mode is always
start by writing all shadow L1e's as read-only, even if the guest has
them read-write.  This causes all writes to trap with #PF, which lets us
see which frame is being written to, and lets us set the appropriate bit
in the logdirty bitmap.
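
A toy model of that write-trapping scheme is below.  The class and
method names are invented for the example.  Upgrading the shadow to
read-write after the first trapped write, and resetting everything to
read-only when the bitmap is collected, are assumptions of this sketch
rather than a statement of what Xen's code does.

```python
# Toy model of logdirty tracking via read-only shadow L1es.
# Hypothetical names throughout; a set stands in for the bitmap.

class LogDirtyShadow:
    def __init__(self):
        self.l1e = {}       # linear page number -> [mfn, shadow_writable]
        self.dirty = set()  # stands in for the logdirty bitmap
        self.faults = 0     # number of write faults taken

    def shadow(self, page, mfn):
        # Always start read-only, even if the guest PTE is read-write.
        self.l1e[page] = [mfn, False]

    def guest_write(self, page):
        entry = self.l1e[page]
        if not entry[1]:
            # The write traps with #PF: log the frame, then allow
            # further writes to proceed without trapping (assumption).
            self.faults += 1
            self.dirty.add(entry[0])
            entry[1] = True

    def collect_and_clean(self):
        # Hand back the dirty set and re-arm write trapping (assumption).
        dirty, self.dirty = self.dirty, set()
        for entry in self.l1e.values():
            entry[1] = False
        return dirty
```

The point of the model is that only the first write to each frame per
collection round costs a fault; later writes to the same frame are free
until the next round re-arms the trap.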

>     e) Now the next page fault to any of the page tables marked
> read-only in c) must have been caused by the guest writing to its
> tables, which can be reflected in the shadow page tables.

Writeability of the guest's actual pagetables is complicated and
guest-dependent.  Under a strict TLB-like model, it's not actually
required to restrict writeability.

In real hardware, the TLB is an explicitly non-coherent cache, and
software is required to issue a TLB flush to ensure that changes to the
PTEs in memory get propagated subsequently into the TLB.

> 3) How do Xen/shadow page tables distinguish between two equivalent
> guest virtual addresses from different guest processes? I suppose when
> a guest OS tries to change page tables from one process to another,
> this will cause a page fault that Xen will trap and be able to infer
> that the current shadow page table should be swapped to a different
> one corresponding to the new guest process?

Changing processes involves writing to %cr3, which is a TLB flush, so in
a strict TLB-like model, all shadows must be dropped.

In reality, this is where we start using restricted writeability to our
advantage.  If we know that no writes to pagetables happened, we know
"the TLB" (== the currently established shadows) isn't actually stale,
so it may be retained and reused.

We do maintain hash lists of types of pagetable, so we can locate
preexisting shadows of a specific type.  This is how we can switch
between already-established shadows when the guest changes %cr3.

In reality, the kernel half of virtual address space doesn't change much
after boot, so there is a substantial performance win from not
dropping and reshadowing these entries.  There are loads and loads of L4
pagetables (one per process), all pointing to common L3's which form the
kernel half of the address space.
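
The lookup and sharing described above can be modelled roughly like
this.  It is a sketch with made-up names: a dict keyed on (guest mfn,
shadow type) stands in for the hash lists, and slot 511 is an arbitrary
choice to represent "the kernel half".

```python
# Toy model of locating preexisting shadows on a %cr3 switch.
# All names are hypothetical; this is not Xen's data structure.

class ShadowCache:
    def __init__(self):
        self.table = {}   # (guest mfn, shadow type) -> shadow object
        self.allocs = 0   # how many shadows were built from scratch

    def lookup_or_create(self, gmfn, stype):
        key = (gmfn, stype)
        if key not in self.table:
            self.allocs += 1
            self.table[key] = {"gmfn": gmfn, "type": stype, "entries": {}}
        return self.table[key]

    def switch_cr3(self, cr3_gmfn, kernel_l3_gmfn):
        """Guest wrote %cr3: find or build the L4 shadow for this process.

        Every process's L4 points at the same guest L3 for the kernel
        half, so that L3's shadow is found by lookup, not rebuilt.
        """
        l4 = self.lookup_or_create(cr3_gmfn, "l4")
        l4["entries"][511] = self.lookup_or_create(kernel_l3_gmfn, "l3")
        return l4


cache = ShadowCache()
proc_a = cache.switch_cr3(0x100, kernel_l3_gmfn=0x900)
proc_b = cache.switch_cr3(0x200, kernel_l3_gmfn=0x900)
proc_a_again = cache.switch_cr3(0x100, kernel_l3_gmfn=0x900)
```

Switching back to the first process finds its existing L4 shadow, and
both L4 shadows end up pointing at the one shared L3 shadow, so only
three shadows are ever allocated here.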

If I'm being honest - this is where my knowledge of exactly what Xen
does breaks down - I'm not the author of the shadow code - I've merely
debugged it a few times.

I hope this is still informative.

~Andrew
