* How does shadow page table work during migration?
From: Kevin Negy @ 2021-02-19 16:10 UTC
  To: xen-devel


Hello,

I'm trying to understand how the shadow page table works in Xen,
specifically during live migration. My understanding is that after shadow
paging is enabled (sh_enable_log_dirty() in
xen/arch/x86/mm/shadow/common.c), a shadow page table is created, which is
a complete copy of the current guest page table. Then the CR3 register is
switched to use this shadow page table as the active table while the guest
page table is stored elsewhere. The guest page table itself (and not the
individual entries in the page table) is marked as read only so that any
guest memory access that requires the page table will result in a page
fault. These page faults happen and are trapped to the Xen hypervisor. Xen
will then update the shadow page table to match what the guest sees on its
page tables.

Is this understanding correct?

If so, here is where I get confused. During the migration pre-copy phase,
each pre-copy iteration reads the dirty bitmap (paging_log_dirty_op() in
xen/arch/x86/mm/paging.c) and cleans it. This process seems to destroy all
the shadow page tables of the domain with the call to shadow_blow_tables()
in sh_clean_dirty_bitmap().
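
For reference, my mental model of that toolstack-side pre-copy loop is
roughly the sketch below (the helper names such as
read_and_clean_dirty_bitmap() and send_page() are made up, not the real
libxc interfaces):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITERATIONS  5
#define DIRTY_THRESHOLD 50   /* stop pre-copying below this many dirty pages */

/* Made-up helpers standing in for the real toolstack/hypercall interfaces. */
extern size_t read_and_clean_dirty_bitmap(uint8_t *bitmap, size_t nr_pages);
extern void send_page(unsigned long pfn);
extern void pause_guest(void);

/* Simplified pre-copy loop: repeatedly ship the pages dirtied since the
 * previous pass, until the dirty set is small enough to send during the
 * final stop-and-copy pause. */
static void precopy(uint8_t *bitmap, size_t nr_pages)
{
    for ( unsigned int iter = 0; iter < MAX_ITERATIONS; iter++ )
    {
        /* Snapshot which pages were written since the last pass, and reset
         * the log - this is the paging_log_dirty_op() CLEAN step. */
        size_t dirty = read_and_clean_dirty_bitmap(bitmap, nr_pages);

        for ( unsigned long pfn = 0; pfn < nr_pages; pfn++ )
            if ( bitmap[pfn / 8] & (1u << (pfn % 8)) )
                send_page(pfn);

        if ( dirty < DIRTY_THRESHOLD )
            break;
    }

    pause_guest();   /* final phase: send whatever is still dirty */
}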

How is the dirty bitmap related to shadow page tables? Why destroy the
entire shadow page table if it is the only legitimate page table in CR3 for
the domain?

Thank you,
Kevin


* Re: How does shadow page table work during migration?
From: Jan Beulich @ 2021-02-19 16:36 UTC
  To: Kevin Negy; +Cc: xen-devel

On 19.02.2021 17:10, Kevin Negy wrote:
> I'm trying to understand how the shadow page table works in Xen,
> specifically during live migration. My understanding is that after shadow
> paging is enabled (sh_enable_log_dirty() in
> xen/arch/x86/mm/shadow/common.c), a shadow page table is created, which is
> a complete copy of the current guest page table. Then the CR3 register is
> switched to use this shadow page table as the active table while the guest
> page table is stored elsewhere. The guest page table itself (and not the
> individual entries in the page table) is marked as read only so that any
> guest memory access that requires the page table will result in a page
> fault. These page faults happen and are trapped to the Xen hypervisor. Xen
> will then update the shadow page table to match what the guest sees on its
> page tables.
> 
> Is this understanding correct?

Partly. For HVM, shadow mode (if so used) would be active already. For
PV, page tables would be read-only already. Log-dirty mode isn't there
for page table modifications alone; its purpose is to notice _any_ page
that gets written to.

> If so, here is where I get confused. During the migration pre-copy phase,
> each pre-copy iteration reads the dirty bitmap (paging_log_dirty_op() in
> xen/arch/x86/mm/paging.c) and cleans it. This process seems to destroy all
> the shadow page tables of the domain with the call to shadow_blow_tables()
> in sh_clean_dirty_bitmap().
> 
> How is the dirty bitmap related to shadow page tables?

Shadow page tables are the mechanism to populate the dirty bitmap.

> Why destroy the
> entire shadow page table if it is the only legitimate page table in CR3 for
> the domain?

Page tables will get re-populated again as the guest touches memory.
Blowing the tables is not the same as turning off shadow mode.

Jan



* Re: How does shadow page table work during migration?
From: Andrew Cooper @ 2021-02-19 20:17 UTC
  To: Kevin Negy, xen-devel

On 19/02/2021 16:10, Kevin Negy wrote:
> Hello,
>
> I'm trying to understand how the shadow page table works in Xen,
> specifically during live migration. My understanding is that after
> shadow paging is enabled (sh_enable_log_dirty() in
> xen/arch/x86/mm/shadow/common.c), a shadow page table is created,
> which is a complete copy of the current guest page table. Then the CR3
> register is switched to use this shadow page table as the active table
> while the guest page table is stored elsewhere. The guest page table
> itself (and not the individual entries in the page table) is marked as
> read only so that any guest memory access that requires the page table
> will result in a page fault. These page faults happen and are trapped
> to the Xen hypervisor. Xen will then update the shadow page table to
> match what the guest sees on its page tables.
>
> Is this understanding correct?
>
> If so, here is where I get confused. During the migration pre-copy
> phase, each pre-copy iteration reads the dirty bitmap
> (paging_log_dirty_op() in xen/arch/x86/mm/paging.c) and cleans it.
> This process seems to destroy all the shadow page tables of the domain
> with the call to shadow_blow_tables() in sh_clean_dirty_bitmap().
>
> How is the dirty bitmap related to shadow page tables? Why destroy the
> entire shadow page table if it is the only legitimate page table in
> CR3 for the domain?

Hello,

Different types of domains use shadow pagetables in different ways, and
the interaction with migration is also type-dependent.

HVM guests use shadow (or HAP) as a fixed property from when they are
created.  Migrating an HVM domain does not dynamically affect whether
shadow is active.  PV guests do nothing by default, but do turn shadow
on dynamically for migration purposes.

Whenever shadow is active, guests do not have write access to their
pagetables.  All updates are emulated if necessary, and "the shadow
pagetables" are managed entirely by Xen behind the scenes.


Next is the shadow memory pool.  Guests can have an unbounded quantity
of pagetables, and certain pagetable structures take more memory
allocations to shadow correctly than the quantity of RAM expended by the
guest constructing the structure in the first place.

Obviously, Xen can't be in a position where it is forced to expend more
memory for shadow pagetables than the RAM allocated to the guest in the
first place.  What we do is have a fixed-size memory pool (chosen when
you create the domain - see the shadow_memory vm parameter) and
recycle shadows on a least-recently-used basis.
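
In other words, conceptually something like this toy sketch - none of
these names or structures are the real allocator's (that lives in
xen/arch/x86/mm/shadow/common.c and is far more involved); it only
illustrates the fixed-budget-plus-LRU policy:

#include <stddef.h>
#include <stdint.h>

#define POOL_PAGES 1024            /* fixed budget, set at domain creation */

struct shadow_page {
    unsigned long shadows_gmfn;    /* guest pagetable being shadowed (0 = free) */
    uint64_t last_used;            /* bumped each time this shadow is used */
};

static struct shadow_page pool[POOL_PAGES];
static uint64_t now;

extern void unshadow(struct shadow_page *sp);   /* tear one shadow down */

/* Get a pool page to shadow guest pagetable 'gmfn'.  The pool never grows:
 * if every page is in use, recycle the least-recently-used shadow.  The
 * guest simply re-faults that mapping back in later if it still needs it. */
static struct shadow_page *shadow_alloc(unsigned long gmfn)
{
    struct shadow_page *victim = &pool[0];

    for ( size_t i = 0; i < POOL_PAGES; i++ )
    {
        if ( pool[i].shadows_gmfn == 0 )        /* free page: take it */
        {
            victim = &pool[i];
            break;
        }
        if ( pool[i].last_used < victim->last_used )
            victim = &pool[i];                  /* remember the LRU candidate */
    }

    if ( victim->shadows_gmfn != 0 )
        unshadow(victim);                       /* evict the LRU shadow */

    victim->shadows_gmfn = gmfn;
    victim->last_used = ++now;
    return victim;
}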

In practice, this means that Xen never has all of the guest pagetables
shadowed at once.  When a guest moves off the pagetables which are
currently shadowed, a pagefault occurs and Xen shadows the new address
by recycling a pagetable which hasn't been used for a while.  The
shadow_blow_tables() call is "please recycle everything" which is used
to throw away all shadow pagetables, which in turn will cause the
shadows to be recreated from scratch as the guest continues to run.


Next, to the logdirty bitmap.  The logdirty bitmap itself is fairly easy
- it is one bit per 4k page (of guest physical address space) indicating
whether that page has been written to, since the last time we checked.
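
The bookkeeping itself is just bit manipulation on guest frame numbers,
along the lines of the snippet below (illustrative only - IIRC the real
structure in xen/arch/x86/mm/paging.c is a multi-level trie rather than
one flat array, but the information content is the same):

#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* One bit per 4k guest frame; 'pfn' is the guest physical address >> 12. */
static inline void logdirty_set(unsigned long *bitmap, unsigned long pfn)
{
    bitmap[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}

static inline bool logdirty_test(const unsigned long *bitmap, unsigned long pfn)
{
    return bitmap[pfn / BITS_PER_LONG] & (1UL << (pfn % BITS_PER_LONG));
}

/* "Clean" = hand the current contents to the toolstack and reset to zero,
 * so the next pass reports only pages written since this point. */
static inline void logdirty_clean(unsigned long *bitmap, size_t nr_longs)
{
    memset(bitmap, 0, nr_longs * sizeof(unsigned long));
}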

What is complicated is tracking writes, and to understand why, it is
actually easier to consider the HVM HAP (i.e. non-shadow) case.  Here,
we have a Xen-maintained single set of EPT or NPT pagetables, which map
the guest physical address space.

When we turn on logdirty, we pause the VM temporarily, and mark all
guest RAM as read-only.  (Actually, we have a lazy-propagation mechanism
of this read-only-ness so we don't spend seconds of wallclock time with
large VMs paused while we make this change.)  Then, as the guest
continues to execute, it exits to Xen when a write hits a read-only
mapping.  Xen responds by marking this frame in the logdirty bitmap,
then remapping it as read-write, then letting the guest continue.
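
So the HAP-side handling of such a write fault is conceptually no more
than the sketch below.  The names are invented (IIRC the real p2m code
models this with a dedicated p2m type, p2m_ram_logdirty, and is rather
more subtle), but the shape of the mechanism is this:

#include <stdbool.h>

typedef unsigned long gfn_t;
enum p2m_access { P2M_ACCESS_R, P2M_ACCESS_RW };

/* Invented stand-ins for the real p2m / EPT interfaces. */
extern enum p2m_access p2m_get_access(gfn_t gfn);
extern void p2m_set_access(gfn_t gfn, enum p2m_access a);
extern void logdirty_set(unsigned long *bitmap, gfn_t gfn);
extern unsigned long *logdirty_bitmap;

/* EPT/NPT write violation while log-dirty mode is active: record the frame
 * as dirty, restore write access, resume the guest. */
static bool handle_logdirty_write_fault(gfn_t gfn)
{
    if ( p2m_get_access(gfn) != P2M_ACCESS_R )
        return false;                        /* not a log-dirty fault */

    logdirty_set(logdirty_bitmap, gfn);      /* this frame is now dirty */
    p2m_set_access(gfn, P2M_ACCESS_RW);      /* further writes don't trap... */
    return true;                             /* ...until the next CLEAN pass */
}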

Shadow pagetables are more complicated.  With HAP, hardware helps us
maintain the guest virtual and guest physical address spaces in
logically separate ways, which eventually become combined in the TLBs. 
With Shadow, Xen has to do the combination of address spaces itself -
the shadow pagetables map guest virtual to host physical address.

Suddenly, "mark all guest RAM as read-write" isn't trivial.  The logical
operation you need is: for the shadows we have, uncombine the two
logical addresses spaces, and for the subset which map guest RAM, change
from read-write to read-only, then recombine.  The uncombine part is
actually racy, and involves reversing a one-way mapping, so is
exceedingly expensive.

It is *far* easier to just throw everything away and re-shadow from
scratch, when we want to start tracking writes.
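
Which is to say, each CLEAN operation (XEN_DOMCTL_SHADOW_OP_CLEAN) on
the shadow side boils down to roughly the following - apart from
shadow_blow_tables() itself, the helper names here are invented:

struct domain;                                    /* opaque for this sketch */

extern void copy_bitmap_to_toolstack(struct domain *d);
extern void clear_dirty_bitmap(struct domain *d);
extern void shadow_blow_tables(struct domain *d);

static void clean_dirty_bitmap(struct domain *d)
{
    copy_bitmap_to_toolstack(d);   /* report pages dirtied since last pass */
    clear_dirty_bitmap(d);         /* start accumulating afresh */

    /* Dropping the shadows means every subsequent guest write has to fault
     * its mapping back in, which is exactly the hook used to set bits in
     * the (now empty) bitmap again. */
    shadow_blow_tables(d);
}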


Anyway - I hope this is informative.  It is accurate to the best of my
knowledge, but it is also written off the top of my head.  In some copious
free time, I should see about putting some Sphinx docs together for it.

~Andrew



* Re: How does shadow page table work during migration?
From: Kevin Negy @ 2021-02-22 18:52 UTC
  To: Andrew Cooper; +Cc: xen-devel


Hello again,

Thank you for the helpful responses. I have several follow up questions.

1)

> With Shadow, Xen has to do the combination of address spaces itself -
> the shadow pagetables map guest virtual to host physical address.


> The shadow_blow_tables() call is "please recycle everything" which is used
> to throw away all shadow pagetables, which in turn will cause the
> shadows to be recreated from scratch as the guest continues to run.


With shadowing enabled, given a guest virtual address, how does the
hypervisor recreate the mapping to the host physical address (mfn) from the
virtual address if the shadow page tables are empty (after a call to
shadow_blow_tables, for instance)? I had been thinking of shadow page
tables as the definitive mapping between guest pages and machine pages, but
should I think of them as more of a TLB, which implies there's another way
to get/recreate the mappings if there's no entry in the shadow table?


2) I'm trying to grasp the general steps of enabling shadowing and handling
page faults. Is this correct?
    a) Normal PV - default shadowing is disabled, guest has its page tables
in r/w mode or whatever mode is considered normal for guest page tables
    b) Shadowing is enabled - shadow memory pool allocated, all memory
accesses must now go through shadow pages in CR3. Since no entries are in
shadow tables, initial read and writes from the guest will result in page
faults.
    c) As soon as the first guest memory access occurs, a mandatory page
fault occurs because there is no mapping in the shadows. Xen does a guest
page table walk for the address that caused the fault (va) and then marks
all the guest page table pages along the walk as read only.
    d) Xen finds out the mfn of the guest va somehow (my first question)
and adds the mapping of the va to the shadow page table.
    e) If the page fault was a write, the va is now marked as read/write
but logged as dirty in the logdirty map.
    f) Now the next page fault to any of the page tables marked read-only
in c) must have been caused by the guest writing to its tables, which can
be reflected in the shadow page tables.


3) How do Xen/shadow page tables distinguish between two equivalent guest
virtual addresses from different guest processes? I suppose when a guest OS
tries to change page tables from one process to another, this will cause a
page fault that Xen will trap and be able to infer that the current shadow
page table should be swapped to a different one corresponding to the new
guest process?

Thank you so much,
Kevin


* Re: How does shadow page table work during migration?
From: Andrew Cooper @ 2021-02-23  0:23 UTC
  To: Kevin Negy; +Cc: xen-devel


On 22/02/2021 18:52, Kevin Negy wrote:
> Hello again,
>
> Thank you for the helpful responses. I have several follow up questions.
>
> 1)
>
>     With Shadow, Xen has to do the combination of address spaces itself -
>     the shadow pagetables map guest virtual to host physical address.
>
>
>     The shadow_blow_tables() call is "please recycle everything" which
>     is used
>     to throw away all shadow pagetables, which in turn will cause the
>     shadows to be recreated from scratch as the guest continues to run. 
>
>
> With shadowing enabled, given a guest virtual address, how does the
> hypervisor recreate the mapping to the host physical address (mfn)
> from the virtual address if the shadow page tables are empty (after a
> call to shadow_blow_tables, for instance)? I had been thinking of
> shadow page tables as the definitive mapping between guest pages and
> machine pages, but should I think of them as more of a TLB, which
> implies there's another way to get/recreate the mappings if there's no
> entry in the shadow table?

Your observation about "being like a TLB" is correct.

Let's take the simplest case, of 4-on-4 shadows.  I.e. Xen and the
guest are both in 64bit mode, and using 4-level paging.

Each domain also has a structure which Xen calls a P2M, for the guest
physical => host physical mappings.  (For PV guests, it's actually an
identity transform, and for HVM, it is a set of EPT or (N)PT pagetables,
but the exact structure isn't important here.)

The other primitive required is an emulated pagewalk.  I.e. we start at
the guest's %cr3 value, and walk through the guest's pagetables as
hardware would.  Each step involves a lookup in the P2M, as the guest
PTEs are programmed with guest physical addresses, not host physical.
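
As (heavily simplified) code, that walk looks something like this -
assuming 4-on-4, ignoring superpages, access rights and A/D bits, and
with invented helpers for the P2M lookup and the mappings:

#include <stdint.h>

typedef uint64_t guest_pte_t;
typedef unsigned long gfn_t, mfn_t;

#define INVALID_MFN      (~0UL)
#define PTE_PRESENT      (1ULL << 0)
#define PTE_ADDR_MASK    0x000ffffffffff000ULL
#define PAGETABLE_ORDER  9

extern mfn_t p2m_lookup(gfn_t gfn);          /* guest-phys -> host-phys */
extern guest_pte_t *map_mfn(mfn_t mfn);      /* map a host frame into Xen */
extern void unmap(guest_pte_t *table);

/* Walk the guest's own 4-level pagetables for a linear address, exactly as
 * the MMU would, except that each "next table" pointer is a guest physical
 * address and so needs a P2M lookup before Xen can follow it.
 * Returns the gfn backing 'linear', or ~0UL if the walk faults. */
static gfn_t guest_walk(gfn_t cr3_gfn, uint64_t linear)
{
    gfn_t table_gfn = cr3_gfn;

    for ( int level = 4; level >= 1; level-- )
    {
        mfn_t mfn = p2m_lookup(table_gfn);
        if ( mfn == INVALID_MFN )
            return ~0UL;

        guest_pte_t *table = map_mfn(mfn);
        unsigned int idx = (linear >> (12 + (level - 1) * PAGETABLE_ORDER)) & 0x1ff;
        guest_pte_t pte = table[idx];
        unmap(table);

        if ( !(pte & PTE_PRESENT) )
            return ~0UL;                         /* reflect #PF to the guest */

        table_gfn = (pte & PTE_ADDR_MASK) >> 12; /* next level (or final frame) */
    }

    return table_gfn;                            /* gfn of the data page */
}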


In reality, we always have a "top level shadow" per vcpu.  In this
example, it is a level-4 pagetable, which starts out clear (i.e. no
guest entries present).  We need *something* to point hardware at when
we start running the guest.

Once we run the guest, we immediately take a pagefault.  We look at %cr2
to find the linear address accessed, and perform a pagewalk.  In the
common case, we find that the linear address is valid in the guest, so
we allocate a level 3 pagetable, again clear, then point the appropriate
L4e at it, then re-enter the guest.

This takes an immediate pagefault again, and we allocate an L2
pagetable, re-enter, then allocate an L1 pagetable, and finally point an
L1e at the host physical page.  Now, we can successfully fetch the
instruction (if it doesn't cross a page boundary), then repeat the
process for every subsequent memory access.

This example is simplified specifically to demonstrate the point. 
Everything is driven from pagefaults.
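
A sketch of that fault path, folding the per-level faults into a single
populate pass (every name here is invented; guest_walk() is the emulated
pagewalk from above, here also handing back the guest PTE's flags):

#include <stdbool.h>
#include <stdint.h>

typedef unsigned long gfn_t, mfn_t;

#define PTE_PRESENT  (1ULL << 0)

extern gfn_t guest_walk(gfn_t cr3_gfn, uint64_t linear, uint64_t *gflags);
extern mfn_t p2m_lookup(gfn_t gfn);
extern mfn_t shadow_get_or_alloc(mfn_t parent, unsigned int idx, int level);
extern void  shadow_set_l1e(mfn_t l1_table, unsigned int idx, mfn_t target,
                            uint64_t flags);
extern void  inject_guest_pagefault(uint64_t linear, uint32_t error_code);

/* Shadow #PF handler, heavily simplified: re-walk the guest's tables; if
 * the guest mapping is valid, build the missing shadow levels and point
 * the final L1e at the host frame; otherwise the fault belongs to the
 * guest kernel and is reflected back to it. */
static bool shadow_fault(mfn_t top_shadow, gfn_t guest_cr3_gfn,
                         uint64_t linear, uint32_t error_code)
{
    uint64_t gflags;
    gfn_t gfn = guest_walk(guest_cr3_gfn, linear, &gflags);

    if ( gfn == ~0UL )
    {
        inject_guest_pagefault(linear, error_code);  /* guest's own #PF */
        return true;
    }

    /* Find or allocate shadow L3, L2 and L1 tables for this address. */
    mfn_t sl3 = shadow_get_or_alloc(top_shadow, (linear >> 39) & 0x1ff, 3);
    mfn_t sl2 = shadow_get_or_alloc(sl3,        (linear >> 30) & 0x1ff, 2);
    mfn_t sl1 = shadow_get_or_alloc(sl2,        (linear >> 21) & 0x1ff, 1);

    /* Point the shadow L1e at the *host* frame backing the guest's gfn. */
    shadow_set_l1e(sl1, (linear >> 12) & 0x1ff, p2m_lookup(gfn),
                   gflags | PTE_PRESENT);
    return true;
}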

There is of course far more complexity.  We typically populate all the
way down to an L1e in one go, because this is far more efficient than
taking 4 real pagefaults.  If we walk the guest pagetables and find a
violation, we have to hand #PF back to the guest kernel rather than
change the shadows.  To emulate dirty bits correctly, we need to leave
the shadow read-only even if the guest PTE was read/write so we can spot
when hardware tries to set the D bit in the shadows, and copy it back
into guest's view.  Superpages are complicated to deal with (we have to
splinter to 4k pages), and 2-on-3 (legacy 32bit OS with non-PAE paging)
is a total nightmare because of the different format of pagetable entries.
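
The dirty-bit emulation in particular amounts to a propagation rule of
roughly this shape when deriving a shadow L1e from a guest L1e (a
sketch; the flag names are the architectural ones, the helper is made
up):

#include <stdint.h>

#define _PAGE_RW     (1ULL << 1)
#define _PAGE_DIRTY  (1ULL << 6)

/* Derive the flags for a shadow L1e from the guest's L1e.  Keep the shadow
 * read-only until the guest PTE's D bit is set: the write fault that
 * results is where Xen gets the opportunity to set D in the guest's view,
 * after which the shadow can become read-write for real. */
static uint64_t shadow_l1e_flags(uint64_t guest_flags)
{
    uint64_t sflags = guest_flags;

    if ( !(guest_flags & _PAGE_DIRTY) )
        sflags &= ~_PAGE_RW;     /* first write faults -> set guest D bit */

    return sflags;
}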

Notice that a guest TLB flush is also implemented as "drop all
shadows under this virtual cr3".

> 2) I'm trying to grasp the general steps of enabling shadowing and
> handling page faults. Is this correct?
>     a) Normal PV - default shadowing is disabled, guest has its page
> tables in r/w mode or whatever mode is considered normal for guest
> page tables

It would be a massive security vulnerability to let PV guests write to
their own pagetables.

PV guest pagetables are read-only, and all updates are made via
hypercall, so they can be audited for safety.  (We do actually have
pagetable emulation for PV guests for those which do write to their own
pagetables, and feeds into the same logic as the hypercall, but is less
efficient overall.)

>     b) Shadowing is enabled - shadow memory pool allocated, all memory
> accesses must now go through shadow pages in CR3. Since no entries are
> in shadow tables, initial read and writes from the guest will result
> in page faults.

PV guests share an address space with Xen.  So actually the top level
shadow for a PV guest is pre-populated with Xen's mappings, but all
guest entries are faulted in on demand.

>     c) As soon as the first guest memory access occurs, a mandatory
> page fault occurs because there is no mapping in the shadows. Xen does
> a guest page table walk for the address that caused the fault (va) and
> then marks all the guest page table pages along the walk as read only.

The first guest memory access is actually the instruction fetch at
%cs:%rip.  Once that address is shadowed, you further have to shadow any
memory operands (which can be more than one, e.g. `PUSH ptr` has a
regular memory operand, and an implicit stack operand which needs
shadowing.  With the AVX scatter/gather instructions, you can have an
almost-arbitrary number of memory operands.)

Also, be very careful with terminology.  Linear and virtual addresses
are different (by the segment selector base, which is commonly but not
always 0).  Lots of Xen code uses va/vaddr when it means linear addresses.

>     d) Xen finds out the mfn of the guest va somehow (my first
> question) and adds the mapping of the va to the shadow page table.

Yes.  This is a combination of the pagewalk and P2M to identify the mfn
in question for the linear address, along with suitable
allocations/modifications to the shadow pagetables.

>     e) If the page fault was a write, the va is now marked as
> read/write but logged as dirty in the logdirty map.

Actually, what we do when the VM is in global logdirty mode is always
start by writing all shadow L1e's as read-only, even if the guest has
them read-write.  This causes all writes to trap with #PF, which lets us
see which frame is being written to, and lets us set the appropriate bit
in the logdirty bitmap.
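
In code shape, that logdirty handling on the shadow side is roughly the
following (sketch, invented names):

#include <stdbool.h>
#include <stdint.h>

#define _PAGE_RW  (1ULL << 1)

typedef unsigned long gfn_t;

extern bool log_dirty_active(void);                  /* invented predicate */
extern void logdirty_set(unsigned long *bitmap, gfn_t gfn);
extern unsigned long *logdirty_bitmap;

/* When propagating a guest L1e into a shadow L1e while log-dirty mode is
 * active, strip write access regardless of what the guest asked for... */
static uint64_t apply_logdirty(uint64_t sflags)
{
    if ( log_dirty_active() )
        sflags &= ~_PAGE_RW;
    return sflags;
}

/* ...so that every write faults.  The fault handler then logs the frame
 * and rewrites the shadow L1e with write access restored, so writes stop
 * trapping until the next CLEAN pass blows the shadows away again. */
static uint64_t handle_logdirty_write(gfn_t gfn, uint64_t sflags)
{
    logdirty_set(logdirty_bitmap, gfn);
    return sflags | _PAGE_RW;
}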

>     f) Now the next page fault to any of the page tables marked
> read-only in c) must have been caused by the guest writing to its
> tables, which can be reflected in the shadow page tables.

Writeability of the guest's actual pagetables is complicated and
guest-dependent.  Under a strict TLB-like model, it's not actually
required to restrict writeability.

In real hardware, the TLB is an explicitly non-coherent cache, and
software is required to issue a TLB flush to ensure that changes to the
PTEs in memory get propagated subsequently into the TLB.

> 3) How do Xen/shadow page tables distinguish between two equivalent
> guest virtual addresses from different guest processes? I suppose when
> a guest OS tries to change page tables from one process to another,
> this will cause a page fault that Xen will trap and be able to infer
> that the current shadow page table should be swapped to a different
> one corresponding to the new guest process?

Changing processes involves writing to %cr3, which is a TLB flush, so in
a strict TLB-like model, all shadows must be dropped.

In reality, this is where we start using restricted writeability to our
advantage.  If we know that no writes to pagetables happened, we know
"the TLB" (== the currently established shadows) aren't actually stale,
so may be retained and reused.

We do maintain hash lists of types of pagetable, so we can locate
preexisting shadows of a specific type.  This is how we can switch
between already-established shadows when the guest changes %cr3.
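
Something like this, conceptually - the real hash in the shadow code is
keyed on the guest frame and shadow type in a similar spirit, but treat
the details here as purely illustrative:

#include <stddef.h>

typedef unsigned long mfn_t;

#define SHADOW_HASH_BUCKETS 256

/* Which kind of shadow a page is (the real code has many more variants to
 * cover the different guest paging modes). */
enum shadow_type { SH_L1, SH_L2, SH_L3, SH_L4 };

struct shadow_hash_entry {
    struct shadow_hash_entry *next;
    mfn_t guest_table_mfn;     /* the guest pagetable being shadowed */
    enum shadow_type type;
    mfn_t shadow_mfn;          /* our shadow of it */
};

static struct shadow_hash_entry *hash_table[SHADOW_HASH_BUCKETS];

static size_t hash(mfn_t gmfn, enum shadow_type t)
{
    return (gmfn * 2654435761UL + t) % SHADOW_HASH_BUCKETS;
}

/* On a guest %cr3 write: is there already a top-level shadow for the new
 * guest root?  If so, reuse it rather than rebuilding from scratch; if
 * not, the caller allocates an empty one and lets faults populate it. */
static mfn_t shadow_hash_lookup(mfn_t gmfn, enum shadow_type t)
{
    for ( struct shadow_hash_entry *e = hash_table[hash(gmfn, t)];
          e != NULL; e = e->next )
        if ( e->guest_table_mfn == gmfn && e->type == t )
            return e->shadow_mfn;

    return 0;   /* sentinel: no existing shadow */
}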

In reality, the kernel half of virtual address space doesn't change much
after boot, so there is a substantial performance win from not
dropping and reshadowing these entries.  There are loads and loads of L4
pagetables (one per process), all pointing to common L3's which form the
kernel half of the address space.

If I'm being honest - this is where my knowledge of exactly what Xen
does breaks down - I'm not the author of the shadow code - I've merely
debugged it a few times.

I hope this is still informative.

~Andrew

