* How does shadow page table work during migration?
From: Kevin Negy @ 2021-02-19 16:10 UTC
To: xen-devel

Hello,

I'm trying to understand how the shadow page table works in Xen, specifically during live migration. My understanding is that after shadow paging is enabled (sh_enable_log_dirty() in xen/arch/x86/mm/shadow/common.c), a shadow page table is created, which is a complete copy of the current guest page table. The CR3 register is then switched to use this shadow page table as the active table, while the guest page table is stored elsewhere. The guest page table itself (and not the individual entries in the page table) is marked read-only, so that any guest memory access which requires the page table results in a page fault. These page faults are trapped by the Xen hypervisor, which then updates the shadow page table to match what the guest sees in its own page tables.

Is this understanding correct?

If so, here is where I get confused. During the migration pre-copy phase, each pre-copy iteration reads the dirty bitmap (paging_log_dirty_op() in xen/arch/x86/mm/paging.c) and cleans it. This process seems to destroy all the shadow page tables of the domain with the call to shadow_blow_tables() in sh_clean_dirty_bitmap().

How is the dirty bitmap related to shadow page tables? Why destroy the entire shadow page table if it is the only legitimate page table in CR3 for the domain?

Thank you,
Kevin
* Re: How does shadow page table work during migration?
From: Jan Beulich @ 2021-02-19 16:36 UTC
To: Kevin Negy; +Cc: xen-devel

On 19.02.2021 17:10, Kevin Negy wrote:
> I'm trying to understand how the shadow page table works in Xen, specifically during live migration. My understanding is that after shadow paging is enabled (sh_enable_log_dirty() in xen/arch/x86/mm/shadow/common.c), a shadow page table is created, which is a complete copy of the current guest page table. The CR3 register is then switched to use this shadow page table as the active table, while the guest page table is stored elsewhere. The guest page table itself (and not the individual entries in the page table) is marked read-only, so that any guest memory access which requires the page table results in a page fault. These page faults are trapped by the Xen hypervisor, which then updates the shadow page table to match what the guest sees in its own page tables.
>
> Is this understanding correct?

Partly. For HVM, shadow mode (if so used) would be active already. For PV, page tables would be read-only already. Log-dirty mode isn't after page table modifications alone; it is there to notice _any_ page that gets written to.

> If so, here is where I get confused. During the migration pre-copy phase, each pre-copy iteration reads the dirty bitmap (paging_log_dirty_op() in xen/arch/x86/mm/paging.c) and cleans it. This process seems to destroy all the shadow page tables of the domain with the call to shadow_blow_tables() in sh_clean_dirty_bitmap().
>
> How is the dirty bitmap related to shadow page tables?

Shadow page tables are the mechanism to populate the dirty bitmap.

> Why destroy the entire shadow page table if it is the only legitimate page table in CR3 for the domain?

Page tables will get re-populated as the guest touches memory. Blowing the tables is not the same as turning off shadow mode.

Jan
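To make "populate the dirty bitmap" concrete: conceptually the bitmap is just one bit per guest frame, and the shadow fault-handling path sets the relevant bit whenever it observes a write to that frame. A minimal illustrative sketch - the names are invented and this is not Xen's actual code:

    #include <stdint.h>
    #include <limits.h>

    /* Illustrative only: one bit per 4 KiB guest frame (gfn). */
    struct logdirty {
        unsigned long *bitmap;    /* bit N set => gfn N written since last clean */
        unsigned long  nr_frames;
    };

    /* Called from the shadow write-fault path once the faulting gfn is known. */
    static inline void mark_dirty(struct logdirty *ld, unsigned long gfn)
    {
        if (gfn < ld->nr_frames)
            ld->bitmap[gfn / (sizeof(unsigned long) * CHAR_BIT)] |=
                1UL << (gfn % (sizeof(unsigned long) * CHAR_BIT));
    }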
* Re: How does shadow page table work during migration?
From: Andrew Cooper @ 2021-02-19 20:17 UTC
To: Kevin Negy, xen-devel

On 19/02/2021 16:10, Kevin Negy wrote:
> Hello,
>
> I'm trying to understand how the shadow page table works in Xen, specifically during live migration. My understanding is that after shadow paging is enabled (sh_enable_log_dirty() in xen/arch/x86/mm/shadow/common.c), a shadow page table is created, which is a complete copy of the current guest page table. The CR3 register is then switched to use this shadow page table as the active table, while the guest page table is stored elsewhere. The guest page table itself (and not the individual entries in the page table) is marked read-only, so that any guest memory access which requires the page table results in a page fault. These page faults are trapped by the Xen hypervisor, which then updates the shadow page table to match what the guest sees in its own page tables.
>
> Is this understanding correct?
>
> If so, here is where I get confused. During the migration pre-copy phase, each pre-copy iteration reads the dirty bitmap (paging_log_dirty_op() in xen/arch/x86/mm/paging.c) and cleans it. This process seems to destroy all the shadow page tables of the domain with the call to shadow_blow_tables() in sh_clean_dirty_bitmap().
>
> How is the dirty bitmap related to shadow page tables? Why destroy the entire shadow page table if it is the only legitimate page table in CR3 for the domain?

Hello,

Different types of domains use shadow pagetables in different ways, and the interaction with migration is also type-dependent.

HVM guests use shadow (or HAP) as a fixed property from when they are created. Migrating an HVM domain does not dynamically affect whether shadow is active. PV guests do nothing by default, but do turn shadow on dynamically for migration purposes.

Whenever shadow is active, guests do not have write access to their pagetables. All updates are emulated if necessary, and "the shadow pagetables" are managed entirely by Xen behind the scenes.

Next is the shadow memory pool. Guests can have an unbounded quantity of pagetables, and certain pagetable structures take more memory allocations to shadow correctly than the quantity of RAM expended by the guest constructing the structure in the first place. Obviously, Xen can't be in a position where it is forced to expend more memory on shadow pagetables than the RAM allocated to the guest in the first place.

What we do is have a fixed-size memory pool (chosen when you create the domain - see the shadow_memory vm parameter) and recycle shadows on a least-recently-used basis. In practice, this means that Xen never has all of the guest pagetables shadowed at once. When a guest moves off the pagetables which are currently shadowed, a pagefault occurs and Xen shadows the new address by recycling a pagetable which hasn't been used for a while.

The shadow_blow_tables() call is "please recycle everything": it throws away all shadow pagetables, which in turn causes the shadows to be recreated from scratch as the guest continues to run.
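A rough sketch of that fixed-pool-with-recycling model, purely illustrative (invented names and a crude timestamp LRU; none of this is Xen's actual implementation):

    #include <stddef.h>

    struct shadow_page {
        int in_use;
        unsigned long last_used;   /* crude LRU timestamp */
        /* ... the shadow pagetable backed by this page would live here ... */
    };

    struct shadow_pool {
        struct shadow_page *page;  /* fixed array, sized when the domain is created */
        size_t nr_pages;
        unsigned long clock;
    };

    /* Drop one shadow.  The guest loses nothing: the mappings it held are
     * recreated by the pagefault path the next time the guest touches them. */
    static void recycle_one(struct shadow_page *sp)
    {
        /* real code would also unhook sp from parent shadows / lookup lists */
        sp->in_use = 0;
    }

    /* Allocate a shadow; the pool never grows, so recycle the LRU victim. */
    static struct shadow_page *shadow_alloc(struct shadow_pool *p)
    {
        struct shadow_page *victim = &p->page[0];

        for (size_t i = 0; i < p->nr_pages; i++) {
            if (!p->page[i].in_use) { victim = &p->page[i]; break; }
            if (p->page[i].last_used < victim->last_used) victim = &p->page[i];
        }
        if (victim->in_use)
            recycle_one(victim);
        victim->in_use = 1;
        victim->last_used = ++p->clock;
        return victim;
    }

    /* shadow_blow_tables(), conceptually: "please recycle everything". */
    static void shadow_blow_all(struct shadow_pool *p)
    {
        for (size_t i = 0; i < p->nr_pages; i++)
            if (p->page[i].in_use)
                recycle_one(&p->page[i]);
    }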
Next, to the logdirty bitmap. The logdirty bitmap itself is fairly easy - it is one bit per 4k page (of guest physical address space) indicating whether that page has been written to since the last time we checked.

What is complicated is tracking writes, and to understand why, it is actually easier to consider the HVM HAP (i.e. non-shadow) case. Here, we have a single Xen-maintained set of EPT or NPT pagetables, which map the guest physical address space. When we turn on logdirty, we pause the VM temporarily and mark all guest RAM as read-only. (Actually, we have a lazy-propagation mechanism for this read-only-ness so we don't spend seconds of wallclock time with large VMs paused while we make the change.) Then, as the guest continues to execute, it exits to Xen when a write hits a read-only mapping. Xen responds by marking this frame in the logdirty bitmap, then remapping it as read-write, then letting the guest continue.

Shadow pagetables are more complicated. With HAP, hardware helps us maintain the guest virtual and guest physical address spaces in logically separate ways, which eventually become combined in the TLBs. With Shadow, Xen has to do the combination of address spaces itself - the shadow pagetables map guest virtual to host physical addresses. Suddenly, "mark all guest RAM as read-only" isn't trivial. The logical operation you need is: for the shadows we have, uncombine the two logical address spaces, and for the subset which map guest RAM, change from read-write to read-only, then recombine. The uncombine part is actually racy, and involves reversing a one-way mapping, so it is exceedingly expensive. It is *far* easier to just throw everything away and re-shadow from scratch when we want to start tracking writes.

Anyway - I hope this is informative. It is accurate to the best of my knowledge, but it is also written off the top of my head. In some copious free time, I should see about putting some Sphinx docs together for it.

~Andrew
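A condensed sketch of the HAP logdirty flow described above, under heavy simplification (invented names, no locking, eager rather than lazy read-only propagation; the real code around xen/arch/x86/mm/paging.c and the p2m code differs in detail):

    #include <stdbool.h>
    #include <limits.h>

    #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

    struct domain_logdirty {
        bool enabled;
        unsigned long *bitmap;        /* one bit per 4 KiB guest frame */
        unsigned long nr_frames;
    };

    /* Hypothetical hook into the P2M / EPT code: change one frame's access. */
    static void p2m_set_writable(unsigned long gfn, bool writable)
    {
        (void)gfn; (void)writable;    /* real code would edit the EPT/NPT entry */
    }

    /* Enabling logdirty, conceptually: clear the bitmap, then make every guest
     * RAM frame read-only so the first write to each frame faults.
     * (Real Xen propagates the read-only-ness lazily to avoid a long pause.) */
    void logdirty_enable(struct domain_logdirty *ld)
    {
        for (unsigned long i = 0; i <= ld->nr_frames / BITS_PER_LONG; i++)
            ld->bitmap[i] = 0;
        for (unsigned long gfn = 0; gfn < ld->nr_frames; gfn++)
            p2m_set_writable(gfn, false);
        ld->enabled = true;
    }

    /* Violation handler: a write hit a read-only mapping.  Record the frame as
     * dirty, restore write access, and let the guest retry the write. */
    void logdirty_write_fault(struct domain_logdirty *ld, unsigned long gfn)
    {
        ld->bitmap[gfn / BITS_PER_LONG] |= 1UL << (gfn % BITS_PER_LONG);
        p2m_set_writable(gfn, true);
    }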
* Re: How does shadow page table work during migration?
From: Kevin Negy @ 2021-02-22 18:52 UTC
To: Andrew Cooper; +Cc: xen-devel

Hello again,

Thank you for the helpful responses. I have several follow-up questions.

1)

> With Shadow, Xen has to do the combination of address spaces itself - the shadow pagetables map guest virtual to host physical address.

> The shadow_blow_tables() call is "please recycle everything" which is used to throw away all shadow pagetables, which in turn will cause the shadows to be recreated from scratch as the guest continues to run.

With shadowing enabled, given a guest virtual address, how does the hypervisor recreate the mapping to the host physical address (mfn) from the virtual address if the shadow page tables are empty (after a call to shadow_blow_tables, for instance)? I had been thinking of shadow page tables as the definitive mapping between guest pages and machine pages, but should I think of them as more of a TLB, which implies there's another way to get/recreate the mappings if there's no entry in the shadow table?

2) I'm trying to grasp the general steps of enabling shadowing and handling page faults. Is this correct?

a) Normal PV - by default shadowing is disabled; the guest has its page tables in r/w mode or whatever mode is considered normal for guest page tables.

b) Shadowing is enabled - the shadow memory pool is allocated, and all memory accesses must now go through the shadow pages in CR3. Since there are no entries in the shadow tables, initial reads and writes from the guest will result in page faults.

c) As soon as the first guest memory access occurs, a mandatory page fault occurs because there is no mapping in the shadows. Xen does a guest page table walk for the address that caused the fault (va) and then marks all the guest page table pages along the walk as read-only.

d) Xen finds out the mfn of the guest va somehow (my first question) and adds the mapping for the va to the shadow page table.

e) If the page fault was a write, the va is now marked as read/write but logged as dirty in the logdirty map.

f) Now the next page fault to any of the page tables marked read-only in c) must have been caused by the guest writing to its tables, which can be reflected in the shadow page tables.

3) How do Xen/shadow page tables distinguish between two equivalent guest virtual addresses from different guest processes? I suppose when a guest OS tries to change page tables from one process to another, this will cause a page fault that Xen will trap and be able to infer that the current shadow page table should be swapped to a different one corresponding to the new guest process?

Thank you so much,
Kevin
* Re: How does shadow page table work during migration?
From: Andrew Cooper @ 2021-02-23 0:23 UTC
To: Kevin Negy; +Cc: xen-devel

On 22/02/2021 18:52, Kevin Negy wrote:
> Hello again,
>
> Thank you for the helpful responses. I have several follow-up questions.
>
> 1)
>
>> With Shadow, Xen has to do the combination of address spaces itself - the shadow pagetables map guest virtual to host physical address.
>
>> The shadow_blow_tables() call is "please recycle everything" which is used to throw away all shadow pagetables, which in turn will cause the shadows to be recreated from scratch as the guest continues to run.
>
> With shadowing enabled, given a guest virtual address, how does the hypervisor recreate the mapping to the host physical address (mfn) from the virtual address if the shadow page tables are empty (after a call to shadow_blow_tables, for instance)? I had been thinking of shadow page tables as the definitive mapping between guest pages and machine pages, but should I think of them as more of a TLB, which implies there's another way to get/recreate the mappings if there's no entry in the shadow table?

Your observation about "being like a TLB" is correct.

Let's take the simplest case, 4-on-4 shadows. I.e. Xen and the guest are both in 64bit mode, and using 4-level paging.

Each domain also has a structure which Xen calls a P2M, for the guest physical => host physical mappings. (For PV guests, it's actually an identity transform, and for HVM it is a set of EPT or (N)PT pagetables, but the exact structure isn't important here.)

The other primitive required is an emulated pagewalk. I.e. we start at the guest's %cr3 value, and walk through the guest's pagetables as hardware would. Each step involves a lookup in the P2M, as the guest PTEs are programmed with guest physical addresses, not host physical ones.

In reality, we always have a "top level shadow" per vcpu. In this example, it is a level-4 pagetable, which starts out clear (i.e. no guest entries present). We need *something* to point hardware at when we start running the guest.

Once we run the guest, we immediately take a pagefault. We look at %cr2 to find the linear address accessed, and perform a pagewalk. In the common case, we find that the linear address is valid in the guest, so we allocate a level 3 pagetable, again clear, point the appropriate L4e at it, then re-enter the guest. This takes an immediate pagefault again, and we allocate an L2 pagetable, re-enter, then allocate an L1 pagetable, and finally point an L1e at the host physical page. Now we can successfully fetch the instruction (if it doesn't cross a page boundary), then repeat the process for every subsequent memory access.

This example is simplified specifically to demonstrate the point. Everything is driven from pagefaults. There is of course far more complexity. We typically populate all the way down to an L1e in one go, because this is far more efficient than taking 4 real pagefaults. If we walk the guest pagetables and find a violation, we have to hand #PF back to the guest kernel rather than change the shadows. To emulate dirty bits correctly, we need to leave the shadow read-only even if the guest PTE was read/write, so we can spot when hardware tries to set the D bit in the shadows, and copy it back into the guest's view.
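The fault-driven population described above, boiled down to a sketch (4-on-4 only, permission checks, A/D handling and error paths omitted; every helper here is an assumption for illustration, not a real Xen function):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t paddr_t;  /* host (machine) physical address */
    typedef uint64_t gaddr_t;  /* guest physical address */
    typedef uint64_t laddr_t;  /* guest linear address */

    /* Assumed helpers, not real Xen APIs:
     *  - p2m_lookup(): guest physical -> host physical (EPT/NPT lookup,
     *                  or an identity transform for PV)
     *  - guest_walk(): emulate the guest's 4-level walk from its %cr3,
     *                  returning the guest physical address la maps to
     *  - shadow_get_or_alloc(): find or allocate the next-level shadow
     *                  table under parent_table[index] (allocations come
     *                  from the fixed shadow pool) */
    paddr_t   p2m_lookup(gaddr_t gpa);
    bool      guest_walk(gaddr_t guest_cr3, laddr_t la, gaddr_t *gpa_out);
    uint64_t *shadow_get_or_alloc(uint64_t *parent_table, unsigned int index);

    /* #PF handler core: make the faulting linear address usable by hardware. */
    bool shadow_fixup_fault(uint64_t *shadow_l4, gaddr_t guest_cr3, laddr_t la)
    {
        gaddr_t gpa;

        /* 1. Walk the *guest's* pagetables.  If the guest itself has no valid
         *    mapping, the #PF is the guest's problem: reflect it back. */
        if (!guest_walk(guest_cr3, la, &gpa))
            return false;

        /* 2. Build the shadow path for this linear address, level by level,
         *    allocating empty shadow tables on demand. */
        uint64_t *l3 = shadow_get_or_alloc(shadow_l4, (la >> 39) & 0x1ff);
        uint64_t *l2 = shadow_get_or_alloc(l3,        (la >> 30) & 0x1ff);
        uint64_t *l1 = shadow_get_or_alloc(l2,        (la >> 21) & 0x1ff);

        /* 3. Point the shadow L1e at the *host* physical frame backing the
         *    guest physical address, then let the guest retry the access. */
        l1[(la >> 12) & 0x1ff] =
            (p2m_lookup(gpa) & ~0xfffULL) | 3;  /* present + rw; flags illustrative */
        return true;
    }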
Superpages are complicated to deal with (we have to splinter them into 4k pages), and 2-on-3 (a legacy 32bit OS with non-PAE paging) is a total nightmare because of the different format of pagetable entries. Notice also that a guest TLB flush is implemented as "drop all shadows under this virtual cr3".

> 2) I'm trying to grasp the general steps of enabling shadowing and handling page faults. Is this correct?
>
> a) Normal PV - by default shadowing is disabled; the guest has its page tables in r/w mode or whatever mode is considered normal for guest page tables.

It would be a massive security vulnerability to let PV guests write to their own pagetables. PV guest pagetables are read-only, and all updates are made via hypercall, so they can be audited for safety. (We do actually have pagetable emulation for PV guests which do write to their own pagetables; it feeds into the same logic as the hypercall, but is less efficient overall.)

> b) Shadowing is enabled - the shadow memory pool is allocated, and all memory accesses must now go through the shadow pages in CR3. Since there are no entries in the shadow tables, initial reads and writes from the guest will result in page faults.

PV guests share an address space with Xen. So actually the top level shadow for a PV guest is pre-populated with Xen's mappings, but all guest entries are faulted in on demand.

> c) As soon as the first guest memory access occurs, a mandatory page fault occurs because there is no mapping in the shadows. Xen does a guest page table walk for the address that caused the fault (va) and then marks all the guest page table pages along the walk as read-only.

The first guest memory access is actually the instruction fetch at %cs:%rip. Once that address is shadowed, you further have to shadow any memory operands (which can be more than one, e.g. `PUSH ptr` has a regular memory operand, and an implicit stack operand which needs shadowing. With the AVX scatter/gather instructions, you can have an almost-arbitrary number of memory operands.)

Also, be very careful with terminology. Linear and virtual addresses are different (by the segment selector base, which is commonly but not always 0). Lots of Xen code uses va/vaddr when it means linear addresses.

> d) Xen finds out the mfn of the guest va somehow (my first question) and adds the mapping for the va to the shadow page table.

Yes. This is a combination of the pagewalk and the P2M to identify the mfn in question for the linear address, along with suitable allocations/modifications to the shadow pagetables.

> e) If the page fault was a write, the va is now marked as read/write but logged as dirty in the logdirty map.

Actually, what we do when the VM is in global logdirty mode is to always start by writing all shadow L1e's as read-only, even if the guest has them read-write. This causes all writes to trap with #PF, which lets us see which frame is being written to, and lets us set the appropriate bit in the logdirty bitmap.

> f) Now the next page fault to any of the page tables marked read-only in c) must have been caused by the guest writing to its tables, which can be reflected in the shadow page tables.

Writeability of the guest's actual pagetables is complicated and guest-dependent. Under a strict TLB-like model, it's not actually required to restrict writeability. In real hardware, the TLB is an explicitly non-coherent cache, and software is required to issue a TLB flush to ensure that changes to the PTEs in memory get propagated subsequently into the TLB.
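The "shadow L1e starts read-only" rules - both the logdirty rule above and the dirty-bit emulation mentioned earlier - amount to a propagation function from guest L1e to shadow L1e. An illustrative sketch (standard x86 PTE bit positions assumed, names invented):

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT  (1ULL << 0)
    #define PTE_RW       (1ULL << 1)
    #define PTE_DIRTY    (1ULL << 6)

    /* Build a shadow L1e from the guest L1e and the backing host frame.
     * The shadow only gets write permission when it is safe to let writes
     * happen without Xen seeing them:
     *   - not while the domain is in logdirty mode (every frame's first write
     *     per round must fault, so the logdirty bit can be set), and
     *   - not while the guest L1e still has a clear D bit (the first write
     *     must fault, so Xen can set D in the *guest's* view itself). */
    uint64_t propagate_l1e(uint64_t guest_l1e, uint64_t host_frame_addr,
                           bool domain_in_logdirty)
    {
        uint64_t shadow_l1e = host_frame_addr | (guest_l1e & PTE_PRESENT);

        if ((guest_l1e & PTE_RW) &&
            (guest_l1e & PTE_DIRTY) &&
            !domain_in_logdirty)
            shadow_l1e |= PTE_RW;

        return shadow_l1e;
    }

A later write fault against such a read-only shadow entry then does the bookkeeping (sets the D bit in the guest's PTE and/or the logdirty bit) and re-propagates the entry with the RW bit set.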
> 3) How do Xen/shadow page tables distinguish between two equivalent guest virtual addresses from different guest processes? I suppose when a guest OS tries to change page tables from one process to another, this will cause a page fault that Xen will trap and be able to infer that the current shadow page table should be swapped to a different one corresponding to the new guest process?

Changing processes involves writing to %cr3, which is a TLB flush, so in a strict TLB-like model, all shadows must be dropped.

In reality, this is where we start using restricted writeability to our advantage. If we know that no writes to pagetables happened, we know "the TLB" (== the currently established shadows) isn't actually stale, so it may be retained and reused.

We do maintain hash lists of types of pagetable, so we can locate pre-existing shadows of a specific type. This is how we can switch between already-established shadows when the guest changes %cr3. In reality, the kernel half of the virtual address space doesn't change much after boot, so there is a substantial performance win from not dropping and reshadowing these entries. There are loads and loads of L4 pagetables (one per process), all pointing to common L3's which form the kernel half of the address space.

If I'm being honest - this is where my knowledge of exactly what Xen does breaks down - I'm not the author of the shadow code - I've merely debugged it a few times.

I hope this is still informative.

~Andrew
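To illustrate the "hash lists of types of pagetable" idea above: conceptually there is a lookup structure keyed by (guest pagetable frame, shadow type), so a %cr3 write can find an already-established top-level shadow and reuse it rather than reshadowing from scratch. A toy sketch, with invented names and a deliberately naive hash:

    #include <stdint.h>

    enum shadow_type { SH_L1, SH_L2, SH_L3, SH_L4 };

    struct shadow_entry {
        uint64_t guest_mfn;          /* frame holding the guest pagetable */
        enum shadow_type type;       /* which level it was shadowed as */
        void *shadow;                /* the corresponding shadow pagetable */
        struct shadow_entry *next;   /* hash chain */
    };

    #define NR_BUCKETS 256
    static struct shadow_entry *buckets[NR_BUCKETS];

    static unsigned int bucket_of(uint64_t guest_mfn, enum shadow_type type)
    {
        return (unsigned int)((guest_mfn ^ (guest_mfn >> 8) ^ type) % NR_BUCKETS);
    }

    /* On a guest %cr3 write: look for an existing top-level shadow of the new
     * guest root.  Hit => just load it; miss => allocate an empty one and let
     * pagefaults populate it (or, under a strict TLB model, blow shadows). */
    void *find_shadow(uint64_t guest_mfn, enum shadow_type type)
    {
        for (struct shadow_entry *e = buckets[bucket_of(guest_mfn, type)];
             e != NULL; e = e->next)
            if (e->guest_mfn == guest_mfn && e->type == type)
                return e->shadow;
        return NULL;
    }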