* [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments
@ 2023-03-22  9:28 Jan Beulich
  2023-03-22  9:29 ` [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows() Jan Beulich
                   ` (16 more replies)
  0 siblings, 17 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

This is kind of fallout from XSA-427 investigations, partly related to
there having been a more intrusive first approach. This is also the
reason why one of the patches has an R-b already - that was a prereq for
the original approach.

Most patches aren't really dependent upon one another, so can probably
go in independently (as they get acked).

01: shadow: fix and improve sh_page_has_multiple_shadows()
02: shadow: fold/rename sh_unhook_*_mappings()
03: shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys"
04: shadow: replace memcmp() in sh_resync_l1()
05: shadow: reduce explicit log-dirty recording for HVM
06: shadow: purge {write,cmpxchg}_guest_entry() hooks
07: shadow: call sh_update_cr3() directly from sh_page_fault()
08: shadow: use lighter weight mode checks
09: shadow: OOS mode is HVM-only
10: shadow: move OOS functions to their own file
11: shadow: drop is_hvm_...() where easily possible
12: shadow: make monitor table create/destroy more consistent
13: shadow: vCPU-s never have "no mode"
14: shadow: "monitor table" is a HVM-only concept
15: shadow: adjust monitor table prealloc amount
16: PV: conditionalize arch_set_info_guest()'s call to update_cr3()

Jan



* [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows()
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
@ 2023-03-22  9:29 ` Jan Beulich
  2023-03-23 11:55   ` Andrew Cooper
  2023-03-22  9:30 ` [PATCH 02/16] x86/shadow: fold/rename sh_unhook_*_mappings() Jan Beulich
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

While no caller currently invokes the function without first making sure
there is at least one shadow [1], we'd better eliminate UB here:
find_first_set_bit() requires input to be non-zero to return a well-
defined result.

Further, using find_first_set_bit() isn't very efficient in the first
place for the intended purpose.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

[1] The function has exactly two uses, and both are from OOS code, which
    is HVM-only. For HVM (but not for PV) sh_mfn_is_a_page_table(),
    guarding the call to sh_unsync(), guarantees at least one shadow.
    Hence even if sh_page_has_multiple_shadows() returned a bogus value
    when invoked for a PV domain, the subsequent is_hvm_vcpu() and
    oos_active checks (the former being redundant with the latter) will
    compensate. (Arguably that oos_active check should come first, for
    both clarity and efficiency reasons.)
---
Considering present uses, ASSERT(shadows) might be an option as well,
instead of making the null check part of the return value expression.
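
As an illustration of the replacement (a standalone sketch, not
hypervisor code; has_multiple_bits() is a made-up name): x & (x - 1)
clears the lowest set bit, so a non-zero result means at least two bits
were set, while the explicit null check keeps the expression well
defined for x == 0 - something find_first_set_bit() doesn't guarantee
for its input.

    #include <stdbool.h>
    #include <stdint.h>

    /* true iff more than one bit is set in x */
    static bool has_multiple_bits(uint32_t x)
    {
        return x && (x & (x - 1));
    }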

--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -332,7 +332,7 @@ static inline int sh_page_has_multiple_s
         return 0;
     shadows = pg->shadow_flags & SHF_page_type_mask;
     /* More than one type bit set in shadow-flags? */
-    return ( (shadows & ~(1UL << find_first_set_bit(shadows))) != 0 );
+    return shadows && (shadows & (shadows - 1));
 }
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)




* [PATCH 02/16] x86/shadow: fold/rename sh_unhook_*_mappings()
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
  2023-03-22  9:29 ` [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows() Jan Beulich
@ 2023-03-22  9:30 ` Jan Beulich
  2023-03-22  9:31 ` [PATCH 03/16] x86/shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys" Jan Beulich
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:30 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

The "32b" and "pae" functions are identical at the source level (they
differ in what they get compiled to, due to differences in
SHADOW_FOREACH_L2E()), leaving aside a comment the PAE variant has and
the non-PAE one doesn't. Replace these infixes by the more usual l<N>
ones (and then also for the "64b" one for consistency; that'll also
allow for re-use once we support 5-level paging, if need be). The two
different instances are still distinguishable by their "level" suffix.

While fiddling with the names, convert the last parameter to boolean
as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
As long as the backing functions don't need to invoke each other, the
infixes could arguably also be dropped altogether. Every mode has
exactly one such function. (Thinking of it, unhooking at L4 is kind of
pointless for PV32, as the guest controlled root is L3. The only time
where we might want to unhook from L4 is when we clean up after the
guest. But even that's not strictly necessary as that would also be
taken care of by sh_detach_old_tables().)
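
As an aside on the naming scheme, here's an illustrative token-pasting
sketch in the spirit of SHADOW_INTERNAL_NAME() (the DEMO_* macros and
the resulting symbol name are made up for illustration, not Xen's actual
definitions), showing how one source-level name yields a distinct symbol
per guest-paging-level instantiation:

    #include <stdio.h>

    #define DEMO_INTERNAL_NAME_(name, levels) name ## _guest_ ## levels
    #define DEMO_INTERNAL_NAME(name, levels)  DEMO_INTERNAL_NAME_(name, levels)

    /* The 3-level ("PAE") instantiation of the renamed unhook function. */
    static void sh_unhook_l2_mappings_guest_3(void)
    {
        puts("unhook L2, 3-level guest");
    }

    int main(void)
    {
        /* Expands to sh_unhook_l2_mappings_guest_3() */
        DEMO_INTERNAL_NAME(sh_unhook_l2_mappings, 3)();
        return 0;
    }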

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -898,14 +898,14 @@ void shadow_unhook_mappings(struct domai
     {
 #ifdef CONFIG_HVM
     case SH_type_l2_32_shadow:
-        SHADOW_INTERNAL_NAME(sh_unhook_32b_mappings, 2)(d, smfn, user_only);
+        SHADOW_INTERNAL_NAME(sh_unhook_l2_mappings, 2)(d, smfn, user_only);
         break;
     case SH_type_l2_pae_shadow:
-        SHADOW_INTERNAL_NAME(sh_unhook_pae_mappings, 3)(d, smfn, user_only);
+        SHADOW_INTERNAL_NAME(sh_unhook_l2_mappings, 3)(d, smfn, user_only);
         break;
 #endif
     case SH_type_l4_64_shadow:
-        SHADOW_INTERNAL_NAME(sh_unhook_64b_mappings, 4)(d, smfn, user_only);
+        SHADOW_INTERNAL_NAME(sh_unhook_l4_mappings, 4)(d, smfn, user_only);
         break;
     default:
         printk(XENLOG_ERR "Bad top-level shadow type %08x\n", sp->u.sh.type);
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1416,21 +1416,9 @@ void sh_destroy_l1_shadow(struct domain
  * This implementation is pretty crude and slow, but we hope that it won't
  * be called very often. */
 
-#if GUEST_PAGING_LEVELS == 2
+#if GUEST_PAGING_LEVELS < 4
 
-void sh_unhook_32b_mappings(struct domain *d, mfn_t sl2mfn, int user_only)
-{
-    shadow_l2e_t *sl2e;
-    SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, 0, d, {
-        if ( !user_only || (sl2e->l2 & _PAGE_USER) )
-            shadow_set_l2e(d, sl2e, shadow_l2e_empty(), sl2mfn);
-    });
-}
-
-#elif GUEST_PAGING_LEVELS == 3
-
-void sh_unhook_pae_mappings(struct domain *d, mfn_t sl2mfn, int user_only)
-/* Walk a PAE l2 shadow, unhooking entries from all the subshadows */
+void sh_unhook_l2_mappings(struct domain *d, mfn_t sl2mfn, bool user_only)
 {
     shadow_l2e_t *sl2e;
     SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, 0, d, {
@@ -1441,7 +1429,7 @@ void sh_unhook_pae_mappings(struct domai
 
 #elif GUEST_PAGING_LEVELS == 4
 
-void sh_unhook_64b_mappings(struct domain *d, mfn_t sl4mfn, int user_only)
+void sh_unhook_l4_mappings(struct domain *d, mfn_t sl4mfn, bool user_only)
 {
     shadow_l4e_t *sl4e;
     SHADOW_FOREACH_L4E(sl4mfn, sl4e, 0, 0, d, {
--- a/xen/arch/x86/mm/shadow/multi.h
+++ b/xen/arch/x86/mm/shadow/multi.h
@@ -50,14 +50,11 @@ SHADOW_INTERNAL_NAME(sh_destroy_l4_shado
     struct domain *d, mfn_t smfn);
 
 extern void
-SHADOW_INTERNAL_NAME(sh_unhook_32b_mappings, GUEST_LEVELS)
-    (struct domain *d, mfn_t sl2mfn, int user_only);
+SHADOW_INTERNAL_NAME(sh_unhook_l2_mappings, GUEST_LEVELS)
+    (struct domain *d, mfn_t sl2mfn, bool user_only);
 extern void
-SHADOW_INTERNAL_NAME(sh_unhook_pae_mappings, GUEST_LEVELS)
-    (struct domain *d, mfn_t sl3mfn, int user_only);
-extern void
-SHADOW_INTERNAL_NAME(sh_unhook_64b_mappings, GUEST_LEVELS)
-    (struct domain *d, mfn_t sl4mfn, int user_only);
+SHADOW_INTERNAL_NAME(sh_unhook_l4_mappings, GUEST_LEVELS)
+    (struct domain *d, mfn_t sl4mfn, bool user_only);
 
 int cf_check
 SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, GUEST_LEVELS)
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -232,9 +232,8 @@ static inline shadow_l4e_t shadow_l4e_fr
 #define sh_destroy_l3_shadow       INTERNAL_NAME(sh_destroy_l3_shadow)
 #define sh_destroy_l2_shadow       INTERNAL_NAME(sh_destroy_l2_shadow)
 #define sh_destroy_l1_shadow       INTERNAL_NAME(sh_destroy_l1_shadow)
-#define sh_unhook_32b_mappings     INTERNAL_NAME(sh_unhook_32b_mappings)
-#define sh_unhook_pae_mappings     INTERNAL_NAME(sh_unhook_pae_mappings)
-#define sh_unhook_64b_mappings     INTERNAL_NAME(sh_unhook_64b_mappings)
+#define sh_unhook_l2_mappings      INTERNAL_NAME(sh_unhook_l2_mappings)
+#define sh_unhook_l4_mappings      INTERNAL_NAME(sh_unhook_l4_mappings)
 #define sh_paging_mode             INTERNAL_NAME(sh_paging_mode)
 #define sh_audit_l1_table          INTERNAL_NAME(sh_audit_l1_table)
 #define sh_audit_fl1_table         INTERNAL_NAME(sh_audit_fl1_table)




* [PATCH 03/16] x86/shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys"
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
  2023-03-22  9:29 ` [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows() Jan Beulich
  2023-03-22  9:30 ` [PATCH 02/16] x86/shadow: fold/rename sh_unhook_*_mappings() Jan Beulich
@ 2023-03-22  9:31 ` Jan Beulich
  2023-03-23 12:14   ` Andrew Cooper
  2023-03-22  9:31 ` [PATCH 04/16] x86/shadow: replace memcmp() in sh_resync_l1() Jan Beulich
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

SHADOW_FOREACH_L<N>E() already invokes the "body" only when the present
bit is set; no need to re-do the check.

While there also
- stop open-coding mfn_to_maddr() in code being touched (re-indented)
  anyway,
- stop open-coding mfn_eq() in code being touched or adjacent code,
- drop local variables when they're no longer used at least twice.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
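
For context, a minimal sketch of the kind of typed-MFN helpers the
mfn_to_maddr()/mfn_eq() items above refer to (the demo_ names are
hypothetical stand-ins, not Xen's definitions); such helpers replace the
open-coded shifts and mfn_x() comparisons in the hunks below:

    #include <stdbool.h>
    #include <stdint.h>

    #define DEMO_PAGE_SHIFT 12

    typedef struct { unsigned long m; } demo_mfn_t; /* type-safe frame number */
    typedef uint64_t demo_paddr_t;

    static unsigned long demo_mfn_x(demo_mfn_t mfn) { return mfn.m; }

    /* Instead of open-coding "mfn_x(a) == mfn_x(b)" at every call site ... */
    static bool demo_mfn_eq(demo_mfn_t a, demo_mfn_t b)
    {
        return demo_mfn_x(a) == demo_mfn_x(b);
    }

    /* ... and instead of "((paddr_t)mfn_x(mfn)) << PAGE_SHIFT". */
    static demo_paddr_t demo_mfn_to_maddr(demo_mfn_t mfn)
    {
        return (demo_paddr_t)demo_mfn_x(mfn) << DEMO_PAGE_SHIFT;
    }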

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1289,12 +1289,8 @@ void sh_destroy_l4_shadow(struct domain
     /* Decrement refcounts of all the old entries */
     sl4mfn = smfn;
     SHADOW_FOREACH_L4E(sl4mfn, sl4e, 0, 0, d, {
-        if ( shadow_l4e_get_flags(*sl4e) & _PAGE_PRESENT )
-        {
-            sh_put_ref(d, shadow_l4e_get_mfn(*sl4e),
-                       (((paddr_t)mfn_x(sl4mfn)) << PAGE_SHIFT)
-                       | ((unsigned long)sl4e & ~PAGE_MASK));
-        }
+        sh_put_ref(d, shadow_l4e_get_mfn(*sl4e),
+                   mfn_to_maddr(sl4mfn) | ((unsigned long)sl4e & ~PAGE_MASK));
     });
 
     /* Put the memory back in the pool */
@@ -1320,10 +1316,8 @@ void sh_destroy_l3_shadow(struct domain
     /* Decrement refcounts of all the old entries */
     sl3mfn = smfn;
     SHADOW_FOREACH_L3E(sl3mfn, sl3e, 0, 0, {
-        if ( shadow_l3e_get_flags(*sl3e) & _PAGE_PRESENT )
-            sh_put_ref(d, shadow_l3e_get_mfn(*sl3e),
-                        (((paddr_t)mfn_x(sl3mfn)) << PAGE_SHIFT)
-                        | ((unsigned long)sl3e & ~PAGE_MASK));
+        sh_put_ref(d, shadow_l3e_get_mfn(*sl3e),
+                   mfn_to_maddr(sl3mfn) | ((unsigned long)sl3e & ~PAGE_MASK));
     });
 
     /* Put the memory back in the pool */
@@ -1352,10 +1346,8 @@ void sh_destroy_l2_shadow(struct domain
     /* Decrement refcounts of all the old entries */
     sl2mfn = smfn;
     SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, 0, d, {
-        if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT )
-            sh_put_ref(d, shadow_l2e_get_mfn(*sl2e),
-                        (((paddr_t)mfn_x(sl2mfn)) << PAGE_SHIFT)
-                        | ((unsigned long)sl2e & ~PAGE_MASK));
+        sh_put_ref(d, shadow_l2e_get_mfn(*sl2e),
+                   mfn_to_maddr(sl2mfn) | ((unsigned long)sl2e & ~PAGE_MASK));
     });
 
     /* Put the memory back in the pool */
@@ -1390,11 +1382,10 @@ void sh_destroy_l1_shadow(struct domain
         /* Decrement refcounts of all the old entries */
         mfn_t sl1mfn = smfn;
         SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, 0, {
-            unsigned int sl1f = shadow_l1e_get_flags(*sl1e);
-
-            if ( (sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(*sl1e) )
+            if ( !sh_l1e_is_magic(*sl1e) )
             {
-                shadow_vram_put_mfn(shadow_l1e_get_mfn(*sl1e), sl1f,
+                shadow_vram_put_mfn(shadow_l1e_get_mfn(*sl1e),
+                                    shadow_l1e_get_flags(*sl1e),
                                     sl1mfn, sl1e, d);
                 shadow_put_page_from_l1e(*sl1e, d);
             }
@@ -3559,7 +3550,6 @@ int cf_check sh_rm_write_access_from_l1(
 {
     shadow_l1e_t *sl1e;
     int done = 0;
-    int flags;
 #if SHADOW_OPTIMIZATIONS & SHOPT_WRITABLE_HEURISTIC
     struct vcpu *curr = current;
     mfn_t base_sl1mfn = sl1mfn; /* Because sl1mfn changes in the foreach */
@@ -3567,10 +3557,8 @@ int cf_check sh_rm_write_access_from_l1(
 
     SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, done,
     {
-        flags = shadow_l1e_get_flags(*sl1e);
-        if ( (flags & _PAGE_PRESENT)
-             && (flags & _PAGE_RW)
-             && (mfn_x(shadow_l1e_get_mfn(*sl1e)) == mfn_x(readonly_mfn)) )
+        if ( (shadow_l1e_get_flags(*sl1e) & _PAGE_RW) &&
+             mfn_eq(shadow_l1e_get_mfn(*sl1e), readonly_mfn) )
         {
             shadow_l1e_t ro_sl1e = shadow_l1e_remove_flags(*sl1e, _PAGE_RW);
 
@@ -3596,13 +3584,10 @@ int cf_check sh_rm_mappings_from_l1(
 {
     shadow_l1e_t *sl1e;
     int done = 0;
-    int flags;
 
     SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, done,
     {
-        flags = shadow_l1e_get_flags(*sl1e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l1e_get_mfn(*sl1e)) == mfn_x(target_mfn)) )
+        if ( mfn_eq(shadow_l1e_get_mfn(*sl1e), target_mfn) )
         {
             shadow_set_l1e(d, sl1e, shadow_l1e_empty(), p2m_invalid, sl1mfn);
             if ( sh_check_page_has_no_refs(mfn_to_page(target_mfn)) )
@@ -3647,13 +3632,10 @@ int cf_check sh_remove_l1_shadow(struct
 {
     shadow_l2e_t *sl2e;
     int done = 0;
-    int flags;
 
     SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, done, d,
     {
-        flags = shadow_l2e_get_flags(*sl2e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l2e_get_mfn(*sl2e)) == mfn_x(sl1mfn)) )
+        if ( mfn_x(shadow_l2e_get_mfn(*sl2e)) == mfn_x(sl1mfn) )
         {
             shadow_set_l2e(d, sl2e, shadow_l2e_empty(), sl2mfn);
             if ( mfn_to_page(sl1mfn)->u.sh.type == 0 )
@@ -3670,13 +3652,10 @@ int cf_check sh_remove_l2_shadow(struct
 {
     shadow_l3e_t *sl3e;
     int done = 0;
-    int flags;
 
     SHADOW_FOREACH_L3E(sl3mfn, sl3e, 0, done,
     {
-        flags = shadow_l3e_get_flags(*sl3e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l3e_get_mfn(*sl3e)) == mfn_x(sl2mfn)) )
+        if ( mfn_x(shadow_l3e_get_mfn(*sl3e)) == mfn_x(sl2mfn) )
         {
             shadow_set_l3e(d, sl3e, shadow_l3e_empty(), sl3mfn);
             if ( mfn_to_page(sl2mfn)->u.sh.type == 0 )
@@ -3692,13 +3671,10 @@ int cf_check sh_remove_l3_shadow(struct
 {
     shadow_l4e_t *sl4e;
     int done = 0;
-    int flags;
 
     SHADOW_FOREACH_L4E(sl4mfn, sl4e, 0, done, d,
     {
-        flags = shadow_l4e_get_flags(*sl4e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l4e_get_mfn(*sl4e)) == mfn_x(sl3mfn)) )
+        if ( mfn_x(shadow_l4e_get_mfn(*sl4e)) == mfn_x(sl3mfn) )
         {
             shadow_set_l4e(d, sl4e, shadow_l4e_empty(), sl4mfn);
             if ( mfn_to_page(sl3mfn)->u.sh.type == 0 )




* [PATCH 04/16] x86/shadow: replace memcmp() in sh_resync_l1()
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (2 preceding siblings ...)
  2023-03-22  9:31 ` [PATCH 03/16] x86/shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys" Jan Beulich
@ 2023-03-22  9:31 ` Jan Beulich
  2023-03-23 12:55   ` Andrew Cooper
  2023-03-22  9:31 ` [PATCH 05/16] x86/shadow: reduce explicit log-dirty recording for HVM Jan Beulich
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

Ordinary scalar operations are used in a multitude of other places, so
do so here as well. In fact, take the opportunity to also drop a local
variable, first and foremost to get rid of a bogus cast.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
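
To illustrate what the change boils down to (standalone sketch;
demo_l1e_t is a hypothetical stand-in for the guest_l1e_t wrapper):
comparing the single scalar member directly does the same job as
memcmp() over the one entry, without the library call and without the
cast previously needed to index the snapshot:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t l1; } demo_l1e_t;

    /* Old style: byte-wise comparison of a single entry. */
    static bool entry_differs_memcmp(const demo_l1e_t *snap, const demo_l1e_t *cur)
    {
        return memcmp(snap, cur, sizeof(*cur)) != 0;
    }

    /* New style: ordinary scalar comparison of the one member. */
    static bool entry_differs_scalar(const demo_l1e_t *snap, const demo_l1e_t *cur)
    {
        return snap->l1 != cur->l1;
    }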

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1659,9 +1659,8 @@ void sh_resync_l1(struct vcpu *v, mfn_t
 
    SHADOW_FOREACH_L1E(sl1mfn, sl1p, &gl1p, 0, {
         guest_l1e_t gl1e = *gl1p;
-        guest_l1e_t *snpl1p = (guest_l1e_t *)snp + guest_index(gl1p);
 
-        if ( memcmp(snpl1p, &gl1e, sizeof(gl1e)) )
+        if ( snp[guest_index(gl1p)].l1 != gl1e.l1 )
         {
             gfn_t gfn;
             mfn_t gmfn = INVALID_MFN;
@@ -1677,7 +1676,7 @@ void sh_resync_l1(struct vcpu *v, mfn_t
 
             l1e_propagate_from_guest(v, gl1e, gmfn, &nsl1e, ft_prefetch, p2mt);
             rc |= shadow_set_l1e(d, sl1p, nsl1e, p2mt, sl1mfn);
-            *snpl1p = gl1e;
+            snp[guest_index(gl1p)] = gl1e;
         }
     });
 




* [PATCH 05/16] x86/shadow: reduce explicit log-dirty recording for HVM
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (3 preceding siblings ...)
  2023-03-22  9:31 ` [PATCH 04/16] x86/shadow: replace memcmp() in sh_resync_l1() Jan Beulich
@ 2023-03-22  9:31 ` Jan Beulich
  2023-03-22  9:32 ` [PATCH 06/16] x86/shadow: purge {write,cmpxchg}_guest_entry() hooks Jan Beulich
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

validate_guest_pt_write(), by calling sh_validate_guest_entry(), already
guarantees the needed update of log-dirty information. Move the
operation into the sole code path needing it (when SHOPT_SKIP_VERIFY is
enabled), making clear that only one such call is needed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -656,6 +656,7 @@ static void sh_emulate_unmap_dest(struct
     {
         /* Writes with this alignment constraint can't possibly cross pages. */
         ASSERT(!mfn_valid(sh_ctxt->mfn[1]));
+        paging_mark_dirty(v->domain, sh_ctxt->mfn[0]);
     }
     else
 #endif /* SHADOW_OPTIMIZATIONS & SHOPT_SKIP_VERIFY */
@@ -673,12 +674,10 @@ static void sh_emulate_unmap_dest(struct
             validate_guest_pt_write(v, sh_ctxt->mfn[1], addr + b1, b2);
     }
 
-    paging_mark_dirty(v->domain, sh_ctxt->mfn[0]);
     put_page(mfn_to_page(sh_ctxt->mfn[0]));
 
     if ( unlikely(mfn_valid(sh_ctxt->mfn[1])) )
     {
-        paging_mark_dirty(v->domain, sh_ctxt->mfn[1]);
         put_page(mfn_to_page(sh_ctxt->mfn[1]));
         vunmap((void *)((unsigned long)addr & PAGE_MASK));
     }




* [PATCH 06/16] x86/shadow: purge {write,cmpxchg}_guest_entry() hooks
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (4 preceding siblings ...)
  2023-03-22  9:31 ` [PATCH 05/16] x86/shadow: reduce explicit log-dirty recording for HVM Jan Beulich
@ 2023-03-22  9:32 ` Jan Beulich
  2023-03-23 13:13   ` Andrew Cooper
  2023-03-22  9:33 ` [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

These aren't mode dependent (see 06f04f54ba97 ["x86/shadow:
sh_{write,cmpxchg}_guest_entry() are PV-only"], where they were moved
out of multi.c) and hence there's no need to have pointers to the
functions in struct shadow_paging_mode. Due to include dependencies,
however, the "paging" wrappers need to move out of paging.h; they're
only needed from PV memory management code anyway, so moving them also
reduces their exposure.

By carefully placing the (moved and renamed) shadow function
declarations, #ifdef can also be dropped from the "paging" wrappers
(paging_mode_shadow() is constant false when !SHADOW_PAGING).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
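
The #ifdef removal relies on a common pattern: when the feature is
configured out, the predicate folds to a compile-time constant 0 and the
compiler discards the shadow branch (and with it the reference to the
shadow function). A minimal sketch of that pattern, with hypothetical
demo_ names and a stub definition so it builds on its own:

    #include <stdbool.h>

    /* Pretend shadow paging is configured out: the predicate is constant 0,
     * just like paging_mode_shadow() when !SHADOW_PAGING. */
    #define demo_mode_shadow(d) ((void)(d), false)

    /* Stub so this sketch links on its own; in the real code the definition
     * only exists when the feature is built in, and the constant-false
     * branch below is dropped by the compiler, so the caller needs no
     * #ifdef. */
    static void demo_shadow_write(unsigned long *p, unsigned long val)
    {
        *p = val;
    }

    void demo_write_entry(int d, unsigned long *p, unsigned long val)
    {
        if ( demo_mode_shadow(d) )  /* folds to "if ( false )" */
            demo_shadow_write(p, val);
        else
            *p = val;               /* plain write on the non-shadow path */
    }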

--- a/xen/arch/x86/include/asm/paging.h
+++ b/xen/arch/x86/include/asm/paging.h
@@ -98,13 +98,6 @@
 
 struct shadow_paging_mode {
 #ifdef CONFIG_SHADOW_PAGING
-#ifdef CONFIG_PV
-    void          (*write_guest_entry     )(struct vcpu *v, intpte_t *p,
-                                            intpte_t new, mfn_t gmfn);
-    intpte_t      (*cmpxchg_guest_entry   )(struct vcpu *v, intpte_t *p,
-                                            intpte_t old, intpte_t new,
-                                            mfn_t gmfn);
-#endif
 #ifdef CONFIG_HVM
     int           (*guess_wrmap           )(struct vcpu *v, 
                                             unsigned long vaddr, mfn_t gmfn);
@@ -326,44 +319,6 @@ static inline void paging_update_paging_
     v->domain->arch.paging.update_paging_modes(v);
 }
 
-#ifdef CONFIG_PV
-
-/*
- * Write a new value into the guest pagetable, and update the
- * paging-assistance state appropriately.  Returns false if we page-faulted,
- * true for success.
- */
-static inline void paging_write_guest_entry(
-    struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn)
-{
-#ifdef CONFIG_SHADOW_PAGING
-    if ( unlikely(paging_mode_shadow(v->domain)) && paging_get_hostmode(v) )
-        paging_get_hostmode(v)->shadow.write_guest_entry(v, p, new, gmfn);
-    else
-#endif
-        write_atomic(p, new);
-}
-
-
-/*
- * Cmpxchg a new value into the guest pagetable, and update the
- * paging-assistance state appropriately.  Returns false if we page-faulted,
- * true if not.  N.B. caller should check the value of "old" to see if the
- * cmpxchg itself was successful.
- */
-static inline intpte_t paging_cmpxchg_guest_entry(
-    struct vcpu *v, intpte_t *p, intpte_t old, intpte_t new, mfn_t gmfn)
-{
-#ifdef CONFIG_SHADOW_PAGING
-    if ( unlikely(paging_mode_shadow(v->domain)) && paging_get_hostmode(v) )
-        return paging_get_hostmode(v)->shadow.cmpxchg_guest_entry(v, p, old,
-                                                                  new, gmfn);
-#endif
-    return cmpxchg(p, old, new);
-}
-
-#endif /* CONFIG_PV */
-
 /* Helper function that writes a pte in such a way that a concurrent read 
  * never sees a half-written entry that has _PAGE_PRESENT set */
 static inline void safe_write_pte(l1_pgentry_t *p, l1_pgentry_t new)
--- a/xen/arch/x86/include/asm/shadow.h
+++ b/xen/arch/x86/include/asm/shadow.h
@@ -248,6 +248,12 @@ static inline void pv_l1tf_domain_destro
 #endif
 }
 
+/* Functions that atomically write PV guest PT entries */
+void shadow_write_guest_entry(
+    struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn);
+intpte_t shadow_cmpxchg_guest_entry(
+    struct vcpu *v, intpte_t *p, intpte_t old, intpte_t new, mfn_t gmfn);
+
 #endif /* CONFIG_PV */
 
 /* Remove all shadows of the guest mfn. */
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -4131,10 +4131,6 @@ const struct paging_mode sh_paging_mode
 #endif
     .update_cr3                    = sh_update_cr3,
     .guest_levels                  = GUEST_PAGING_LEVELS,
-#ifdef CONFIG_PV
-    .shadow.write_guest_entry      = sh_write_guest_entry,
-    .shadow.cmpxchg_guest_entry    = sh_cmpxchg_guest_entry,
-#endif
 #ifdef CONFIG_HVM
 #if SHADOW_OPTIMIZATIONS & SHOPT_WRITABLE_HEURISTIC
     .shadow.guess_wrmap            = sh_guess_wrmap,
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -428,12 +428,6 @@ static inline int sh_remove_write_access
 }
 #endif
 
-/* Functions that atomically write PV guest PT entries */
-void cf_check sh_write_guest_entry(
-    struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn);
-intpte_t cf_check sh_cmpxchg_guest_entry(
-    struct vcpu *v, intpte_t *p, intpte_t old, intpte_t new, mfn_t gmfn);
-
 /* Unhook the non-Xen mappings in this top-level shadow mfn.
  * With user_only == 1, unhooks only the user-mode mappings. */
 void shadow_unhook_mappings(struct domain *d, mfn_t smfn, int user_only);
--- a/xen/arch/x86/mm/shadow/pv.c
+++ b/xen/arch/x86/mm/shadow/pv.c
@@ -28,8 +28,8 @@
  * Write a new value into the guest pagetable, and update the shadows
  * appropriately.
  */
-void cf_check
-sh_write_guest_entry(struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn)
+void
+shadow_write_guest_entry(struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn)
 {
     paging_lock(v->domain);
     write_atomic(p, new);
@@ -42,9 +42,9 @@ sh_write_guest_entry(struct vcpu *v, int
  * appropriately.  Returns the previous entry found, which the caller is
  * expected to check to see if the cmpxchg was successful.
  */
-intpte_t cf_check
-sh_cmpxchg_guest_entry(struct vcpu *v, intpte_t *p, intpte_t old,
-                       intpte_t new, mfn_t gmfn)
+intpte_t
+shadow_cmpxchg_guest_entry(struct vcpu *v, intpte_t *p, intpte_t old,
+                           intpte_t new, mfn_t gmfn)
 {
     intpte_t t;
 
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -1,6 +1,8 @@
 #ifndef __PV_MM_H__
 #define __PV_MM_H__
 
+#include <asm/shadow.h>
+
 l1_pgentry_t *map_guest_l1e(unsigned long linear, mfn_t *gl1mfn);
 
 int new_guest_cr3(mfn_t mfn);
@@ -32,6 +34,35 @@ static inline l1_pgentry_t guest_get_eff
 }
 
 /*
+ * Write a new value into the guest pagetable, and update the
+ * paging-assistance state appropriately.  Returns false if we page-faulted,
+ * true for success.
+ */
+static inline void paging_write_guest_entry(
+    struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn)
+{
+    if ( unlikely(paging_mode_shadow(v->domain)) )
+        shadow_write_guest_entry(v, p, new, gmfn);
+    else
+        write_atomic(p, new);
+}
+
+
+/*
+ * Cmpxchg a new value into the guest pagetable, and update the
+ * paging-assistance state appropriately.  Returns false if we page-faulted,
+ * true if not.  N.B. caller should check the value of "old" to see if the
+ * cmpxchg itself was successful.
+ */
+static inline intpte_t paging_cmpxchg_guest_entry(
+    struct vcpu *v, intpte_t *p, intpte_t old, intpte_t new, mfn_t gmfn)
+{
+    if ( unlikely(paging_mode_shadow(v->domain)) )
+        return shadow_cmpxchg_guest_entry(v, p, old, new, gmfn);
+    return cmpxchg(p, old, new);
+}
+
+/*
  * PTE updates can be done with ordinary writes except:
  *  1. Debug builds get extra checking by using CMPXCHG[8B].
  */




* [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault()
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (5 preceding siblings ...)
  2023-03-22  9:32 ` [PATCH 06/16] x86/shadow: purge {write,cmpxchg}_guest_entry() hooks Jan Beulich
@ 2023-03-22  9:33 ` Jan Beulich
  2023-03-23 13:34   ` Andrew Cooper
  2023-03-27 15:39   ` Tim Deegan
  2023-03-22  9:34 ` [PATCH 08/16] x86/shadow: use lighter weight mode checks Jan Beulich
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:33 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

There's no need for an indirect call here, as the mode is invariant
throughout the entire paging-locked region. All it takes to avoid it is
to have a forward declaration of sh_update_cr3() in place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I find this and the respective Win7 related comment suspicious: If we
really need to "fix up" L3 entries "on demand", wouldn't we better retry
the shadow_get_and_create_l1e() rather than exit? The spurious page
fault that the guest observes can, after all, not be known to be non-
fatal inside the guest. That's purely an OS policy.

Furthermore the sh_update_cr3() will also invalidate L3 entries which
were loaded successfully before, but invalidated by the guest
afterwards. I strongly suspect that the described hardware behavior is
_only_ to load previously not-present entries from the PDPT, but not
purge ones already marked present. IOW I think sh_update_cr3() would
need calling in an "incremental" mode here. (The alternative of doing
this in shadow_get_and_create_l3e() instead would likely be more
cumbersome.)

In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
looks wrong.

Beyond the "on demand" L3 entry creation I also can't see what guest
actions could lead to the ASSERT() being inapplicable in the PAE case.
The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
PDPTEs, and all other logic is similar to that for other modes.

(See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
succeed but shadow walk fails"].)
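
Back to the mechanics of the change itself, a trivial standalone sketch
(the demo_ names are made up): a forward declaration is all it takes to
call a function defined further down in the same translation unit
directly, rather than indirectly through the mode hook:

    #include <stdbool.h>

    /* Forward declaration, mirroring the one the patch adds. */
    static void demo_update_cr3(int do_locking, bool noflush);

    void demo_page_fault(void)
    {
        /* Direct call instead of going through a function pointer. */
        demo_update_cr3(0, false);
    }

    static void demo_update_cr3(int do_locking, bool noflush)
    {
        (void)do_locking;
        (void)noflush;
        /* ... */
    }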

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -91,6 +91,8 @@ const char *const fetch_type_names[] = {
 # define for_each_shadow_table(v, i) for ( (i) = 0; (i) < 1; ++(i) )
 #endif
 
+static void cf_check sh_update_cr3(struct vcpu *v, int do_locking, bool noflush);
+
 /* Helper to perform a local TLB flush. */
 static void sh_flush_local(const struct domain *d)
 {
@@ -2487,7 +2489,7 @@ static int cf_check sh_page_fault(
          * In any case, in the PAE case, the ASSERT is not true; it can
          * happen because of actions the guest is taking. */
 #if GUEST_PAGING_LEVELS == 3
-        v->arch.paging.mode->update_cr3(v, 0, false);
+        sh_update_cr3(v, 0, false);
 #else
         ASSERT(d->is_shutting_down);
 #endif




* [PATCH 08/16] x86/shadow: use lighter weight mode checks
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (6 preceding siblings ...)
  2023-03-22  9:33 ` [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
@ 2023-03-22  9:34 ` Jan Beulich
  2023-03-22  9:34 ` [PATCH 09/16] x86/shadow: OOS mode is HVM-only Jan Beulich
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:34 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

shadow_mode_...(), with the exception of shadow_mode_enabled(), are
shorthands for shadow_mode_enabled() && paging_mode_...(). While the
combined checks are potentially useful outside of shadow-internal
functions, inside them - where we already know we're dealing with a
domain in shadow mode - the "paging" checks are sufficient and cheaper.
Whereas the "shadow" ones commonly
translate to a MOV/AND/CMP/Jcc sequence, the "paging" ones typically
resolve to just TEST+Jcc.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
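
To make the cost difference concrete, an illustrative sketch (the demo_
flags and helpers are hypothetical, not Xen's definitions): the combined
check has to verify two bits, while the paging-only check - valid once
shadow mode is already known to be enabled - tests just one:

    #include <stdbool.h>

    #define DEMO_PG_SH_ENABLE (1u << 0) /* shadow mode enabled */
    #define DEMO_PG_REFCOUNTS (1u << 1) /* refcounting in use */

    struct demo_domain { unsigned int paging_mode; };

    /* paging_mode_...() style: single-bit test, typically TEST + Jcc. */
    static bool demo_paging_mode_refcounts(const struct demo_domain *d)
    {
        return d->paging_mode & DEMO_PG_REFCOUNTS;
    }

    /* shadow_mode_...() style: both bits must be set, typically
     * MOV/AND/CMP + Jcc. */
    static bool demo_shadow_mode_refcounts(const struct demo_domain *d)
    {
        return (d->paging_mode & (DEMO_PG_SH_ENABLE | DEMO_PG_REFCOUNTS)) ==
               (DEMO_PG_SH_ENABLE | DEMO_PG_REFCOUNTS);
    }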

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1858,7 +1858,7 @@ int sh_remove_write_access(struct domain
      * In guest refcounting, we trust Xen to already be restricting
      * all the writes to the guest page tables, so we do not need to
      * do more. */
-    if ( !shadow_mode_refcounts(d) )
+    if ( !paging_mode_refcounts(d) )
         return 0;
 
     /* Early exit if it's already a pagetable, or otherwise not writeable */
@@ -2090,7 +2090,7 @@ int sh_remove_all_mappings(struct domain
          *   guest pages with an extra reference taken by
          *   prepare_ring_for_helper().
          */
-        if ( !(shadow_mode_external(d)
+        if ( !(paging_mode_external(d)
                && (page->count_info & PGC_count_mask) <= 3
                && ((page->u.inuse.type_info & PGT_count_mask)
                    == (is_special_page(page) ||
@@ -2387,8 +2387,8 @@ static void sh_update_paging_modes(struc
     {
         const struct paging_mode *old_mode = v->arch.paging.mode;
 
-        ASSERT(shadow_mode_translate(d));
-        ASSERT(shadow_mode_external(d));
+        ASSERT(paging_mode_translate(d));
+        ASSERT(paging_mode_external(d));
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
         /* Need to resync all our pages now, because if a page goes out
@@ -2775,7 +2775,7 @@ void shadow_vcpu_teardown(struct vcpu *v
 
     sh_detach_old_tables(v);
 #ifdef CONFIG_HVM
-    if ( shadow_mode_external(d) )
+    if ( paging_mode_external(d) )
     {
         mfn_t mfn = pagetable_get_mfn(v->arch.hvm.monitor_table);
 
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -526,7 +526,7 @@ _sh_propagate(struct vcpu *v,
                || (level == 1
                    && page_get_owner(mfn_to_page(target_mfn)) == dom_io);
     if ( mmio_mfn
-         && !(level == 1 && (!shadow_mode_refcounts(d)
+         && !(level == 1 && (!paging_mode_refcounts(d)
                              || p2mt == p2m_mmio_direct)) )
     {
         ASSERT((ft == ft_prefetch));
@@ -543,7 +543,7 @@ _sh_propagate(struct vcpu *v,
                        _PAGE_RW | _PAGE_PRESENT);
     if ( guest_nx_enabled(v) )
         pass_thru_flags |= _PAGE_NX_BIT;
-    if ( level == 1 && !shadow_mode_refcounts(d) && mmio_mfn )
+    if ( level == 1 && !paging_mode_refcounts(d) && mmio_mfn )
         pass_thru_flags |= PAGE_CACHE_ATTRS;
     sflags = gflags & pass_thru_flags;
 
@@ -663,7 +663,7 @@ _sh_propagate(struct vcpu *v,
      * (We handle log-dirty entirely inside the shadow code, without using the
      * p2m_ram_logdirty p2m type: only HAP uses that.)
      */
-    if ( level == 1 && unlikely(shadow_mode_log_dirty(d)) && !mmio_mfn )
+    if ( level == 1 && unlikely(paging_mode_log_dirty(d)) && !mmio_mfn )
     {
         if ( ft & FETCH_TYPE_WRITE )
             paging_mark_dirty(d, target_mfn);
@@ -819,7 +819,7 @@ do {
 #define SHADOW_FOREACH_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)     \
 do {                                                                      \
     int _i, _j;                                                           \
-    ASSERT(shadow_mode_external(_dom));                                   \
+    ASSERT(paging_mode_external(_dom));                                   \
     ASSERT(mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2_32_shadow);      \
     for ( _j = 0; _j < 4; _j++ )                                          \
     {                                                                     \
@@ -845,7 +845,7 @@ do {
 do {                                                                       \
     int _i;                                                                \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                        \
-    ASSERT(shadow_mode_external(_dom));                                    \
+    ASSERT(paging_mode_external(_dom));                                    \
     ASSERT(mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2_pae_shadow);      \
     for ( _i = 0; _i < SHADOW_L2_PAGETABLE_ENTRIES; _i++ )                 \
     {                                                                      \
@@ -866,7 +866,7 @@ do {
     unsigned int _i, _end = SHADOW_L2_PAGETABLE_ENTRIES;                    \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                         \
     ASSERT_VALID_L2(mfn_to_page(_sl2mfn)->u.sh.type);                       \
-    if ( is_pv_32bit_domain(_dom) /* implies !shadow_mode_external */ &&    \
+    if ( is_pv_32bit_domain(_dom) /* implies !paging_mode_external */ &&    \
          mfn_to_page(_sl2mfn)->u.sh.type != SH_type_l2_64_shadow )          \
         _end = COMPAT_L2_PAGETABLE_FIRST_XEN_SLOT(_dom);                    \
     for ( _i = 0; _i < _end; ++_i )                                         \
@@ -908,7 +908,7 @@ do {
 #define SHADOW_FOREACH_L4E(_sl4mfn, _sl4e, _gl4p, _done, _dom, _code)   \
 do {                                                                    \
     shadow_l4e_t *_sp = map_domain_page((_sl4mfn));                     \
-    int _xen = !shadow_mode_external(_dom);                             \
+    int _xen = !paging_mode_external(_dom);                             \
     int _i;                                                             \
     ASSERT(mfn_to_page(_sl4mfn)->u.sh.type == SH_type_l4_64_shadow);\
     for ( _i = 0; _i < SHADOW_L4_PAGETABLE_ENTRIES; _i++ )              \
@@ -977,7 +977,7 @@ sh_make_shadow(struct vcpu *v, mfn_t gmf
 #endif
 
     // Create the Xen mappings...
-    if ( !shadow_mode_external(d) )
+    if ( !paging_mode_external(d) )
     {
         switch (shadow_type)
         {
@@ -1379,7 +1379,7 @@ void sh_destroy_l1_shadow(struct domain
         shadow_demote(d, gmfn, t);
     }
 
-    if ( shadow_mode_refcounts(d) )
+    if ( paging_mode_refcounts(d) )
     {
         /* Decrement refcounts of all the old entries */
         mfn_t sl1mfn = smfn;
@@ -1476,7 +1476,7 @@ static int cf_check validate_gl4e(
     l4e_propagate_from_guest(v, new_gl4e, sl3mfn, &new_sl4e, ft_prefetch);
 
     // check for updates to xen reserved slots
-    if ( !shadow_mode_external(d) )
+    if ( !paging_mode_external(d) )
     {
         int shadow_index = (((unsigned long)sl4p & ~PAGE_MASK) /
                             sizeof(shadow_l4e_t));
@@ -2399,7 +2399,7 @@ static int cf_check sh_page_fault(
     gfn = guest_walk_to_gfn(&gw);
     gmfn = get_gfn(d, gfn, &p2mt);
 
-    if ( shadow_mode_refcounts(d) &&
+    if ( paging_mode_refcounts(d) &&
          ((!p2m_is_valid(p2mt) && !p2m_is_grant(p2mt)) ||
           (!p2m_is_mmio(p2mt) && !mfn_valid(gmfn))) )
     {
@@ -2623,7 +2623,7 @@ static int cf_check sh_page_fault(
     return EXCRET_fault_fixed;
 
  emulate:
-    if ( !shadow_mode_refcounts(d) )
+    if ( !paging_mode_refcounts(d) )
         goto not_a_shadow_fault;
 
 #ifdef CONFIG_HVM
@@ -3067,7 +3067,7 @@ sh_update_linear_entries(struct vcpu *v)
      */
 
     /* Don't try to update the monitor table if it doesn't exist */
-    if ( !shadow_mode_external(d) ||
+    if ( !paging_mode_external(d) ||
          pagetable_get_pfn(v->arch.hvm.monitor_table) == 0 )
         return;
 
@@ -3216,7 +3216,7 @@ static void cf_check sh_update_cr3(struc
     /* Double-check that the HVM code has sent us a sane guest_table */
     if ( is_hvm_domain(d) )
     {
-        ASSERT(shadow_mode_external(d));
+        ASSERT(paging_mode_external(d));
         if ( hvm_paging_enabled(v) )
             ASSERT(pagetable_get_pfn(v->arch.guest_table));
         else
@@ -3241,7 +3241,7 @@ static void cf_check sh_update_cr3(struc
      * table.  We cache the current state of that table and shadow that,
      * until the next CR3 write makes us refresh our cache.
      */
-    ASSERT(shadow_mode_external(d));
+    ASSERT(paging_mode_external(d));
 
     /*
      * Find where in the page the l3 table is, but ignore the low 2 bits of
@@ -3272,7 +3272,7 @@ static void cf_check sh_update_cr3(struc
         ASSERT(d->is_dying || d->is_shutting_down);
         return;
     }
-    if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
+    if ( !paging_mode_external(d) && !is_pv_32bit_domain(d) )
     {
         mfn_t smfn = pagetable_get_mfn(v->arch.paging.shadow.shadow_table[0]);
 
@@ -3366,7 +3366,7 @@ static void cf_check sh_update_cr3(struc
     ///
     /// v->arch.cr3
     ///
-    if ( shadow_mode_external(d) )
+    if ( paging_mode_external(d) )
     {
         make_cr3(v, pagetable_get_mfn(v->arch.hvm.monitor_table));
     }
@@ -3383,7 +3383,7 @@ static void cf_check sh_update_cr3(struc
     ///
     /// v->arch.hvm.hw_cr[3]
     ///
-    if ( shadow_mode_external(d) )
+    if ( paging_mode_external(d) )
     {
         ASSERT(is_hvm_domain(d));
 #if SHADOW_PAGING_LEVELS == 3
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -423,7 +423,7 @@ static inline int sh_remove_write_access
                                          unsigned int level,
                                          unsigned long fault_addr)
 {
-    ASSERT(!shadow_mode_refcounts(d));
+    ASSERT(!paging_mode_refcounts(d));
     return 0;
 }
 #endif
@@ -536,8 +536,8 @@ sh_mfn_is_a_page_table(mfn_t gmfn)
         return 0;
 
     owner = page_get_owner(page);
-    if ( owner && shadow_mode_refcounts(owner)
-         && (page->count_info & PGC_shadowed_pt) )
+    if ( owner && paging_mode_refcounts(owner) &&
+         (page->count_info & PGC_shadowed_pt) )
         return 1;
 
     type_info = page->u.inuse.type_info & PGT_type_mask;
--- a/xen/arch/x86/mm/shadow/set.c
+++ b/xen/arch/x86/mm/shadow/set.c
@@ -93,7 +93,7 @@ shadow_get_page_from_l1e(shadow_l1e_t sl
     struct domain *owner = NULL;
 
     ASSERT(!sh_l1e_is_magic(sl1e));
-    ASSERT(shadow_mode_refcounts(d));
+    ASSERT(paging_mode_refcounts(d));
 
     if ( mfn_valid(mfn) )
     {
@@ -354,7 +354,7 @@ int shadow_set_l1e(struct domain *d, sha
          !sh_l1e_is_magic(new_sl1e) )
     {
         /* About to install a new reference */
-        if ( shadow_mode_refcounts(d) )
+        if ( paging_mode_refcounts(d) )
         {
 #define PAGE_FLIPPABLE (_PAGE_RW | _PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
             int rc;
@@ -387,7 +387,7 @@ int shadow_set_l1e(struct domain *d, sha
 
     old_sl1f = shadow_l1e_get_flags(old_sl1e);
     if ( (old_sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(old_sl1e) &&
-         shadow_mode_refcounts(d) )
+         paging_mode_refcounts(d) )
     {
         /*
          * We lost a reference to an old mfn.
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -274,7 +274,7 @@ int shadow_set_l4e(struct domain *d, sha
 static void inline
 shadow_put_page_from_l1e(shadow_l1e_t sl1e, struct domain *d)
 {
-    if ( !shadow_mode_refcounts(d) )
+    if ( !paging_mode_refcounts(d) )
         return;
 
     put_page_from_l1e(sl1e, d);




* [PATCH 09/16] x86/shadow: OOS mode is HVM-only
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (7 preceding siblings ...)
  2023-03-22  9:34 ` [PATCH 08/16] x86/shadow: use lighter weight mode checks Jan Beulich
@ 2023-03-22  9:34 ` Jan Beulich
  2023-03-23 13:42   ` Andrew Cooper
  2023-03-22  9:35 ` [PATCH 10/16] x86/shadow: move OOS functions to their own file Jan Beulich
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:34 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

XEN_DOMCTL_CDF_oos_off is forced set for PV domains, so the logic can't
ever be engaged for them. Conditionalize respective fields and remove
the respective bit from SHADOW_OPTIMIZATIONS when !HVM. As a result the
SH_type_oos_snapshot constant can disappear altogether in that case, and
a couple of #ifdef-s can also be dropped/combined.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
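
For illustration, the mechanism that lets the OOS code vanish from !HVM
builds (the DEMO_* names are made up): clearing the optimization bit in
the options mask turns every "#if (... & SHOPT_OUT_OF_SYNC)" region into
nothing at preprocessing time:

    #define DEMO_SHOPT_OUT_OF_SYNC    0x08
    #define DEMO_SHOPT_FAST_EMULATION 0x80

    #ifdef DEMO_CONFIG_HVM
    #define DEMO_SHADOW_OPTIMIZATIONS 0xff
    #else
    #define DEMO_SHADOW_OPTIMIZATIONS (0xff & ~(DEMO_SHOPT_OUT_OF_SYNC | \
                                                DEMO_SHOPT_FAST_EMULATION))
    #endif

    #if (DEMO_SHADOW_OPTIMIZATIONS & DEMO_SHOPT_OUT_OF_SYNC)
    /* OOS-only fields/declarations would live here and disappear entirely
     * when the bit is masked out for !HVM builds. */
    #endif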

--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -113,10 +113,10 @@ struct shadow_domain {
     /* Fast MMIO path heuristic */
     bool has_fast_mmio_entries;
 
+#ifdef CONFIG_HVM
     /* OOS */
     bool_t oos_active;
 
-#ifdef CONFIG_HVM
     /* Has this domain ever used HVMOP_pagetable_dying? */
     bool_t pagetable_dying_op;
 #endif
@@ -152,7 +152,6 @@ struct shadow_vcpu {
     unsigned long last_emulated_frame;
     /* Last MFN that we emulated a write successfully */
     unsigned long last_emulated_mfn;
-#endif
 
     /* Shadow out-of-sync: pages that this vcpu has let go out of sync */
     mfn_t oos[SHADOW_OOS_PAGES];
@@ -163,7 +162,6 @@ struct shadow_vcpu {
         unsigned long off[SHADOW_OOS_FIXUPS];
     } oos_fixup[SHADOW_OOS_PAGES];
 
-#ifdef CONFIG_HVM
     bool_t pagetable_dying;
 #endif
 #endif
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -296,7 +296,6 @@ static int sh_remove_write_access_from_s
 
     switch ( mfn_to_page(smfn)->u.sh.type )
     {
-#ifdef CONFIG_HVM
     case SH_type_l1_32_shadow:
     case SH_type_fl1_32_shadow:
         return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
@@ -306,7 +305,6 @@ static int sh_remove_write_access_from_s
     case SH_type_fl1_pae_shadow:
         return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
             (d, gmfn, smfn, off);
-#endif
 
     case SH_type_l1_64_shadow:
     case SH_type_fl1_64_shadow:
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -69,7 +69,8 @@ extern int shadow_audit_enable;
 #ifdef CONFIG_HVM
 #define SHADOW_OPTIMIZATIONS     0x1ff
 #else
-#define SHADOW_OPTIMIZATIONS     (0x1ff & ~SHOPT_FAST_EMULATION)
+#define SHADOW_OPTIMIZATIONS     (0x1ff & ~(SHOPT_OUT_OF_SYNC | \
+                                            SHOPT_FAST_EMULATION))
 #endif
 
 
@@ -205,8 +206,7 @@ extern void shadow_audit_tables(struct v
 #define SH_type_max_shadow     6U
 #define SH_type_p2m_table      7U /* in use as the p2m table */
 #define SH_type_monitor_table  8U /* in use as a monitor table */
-#define SH_type_oos_snapshot   9U /* in use as OOS snapshot */
-#define SH_type_unused        10U
+#define SH_type_unused         9U
 #endif
 
 #ifndef CONFIG_PV32 /* Unused (but uglier to #ifdef above): */
@@ -323,8 +323,6 @@ static inline void sh_terminate_list(str
 #define SHF_out_of_sync (1u << (SH_type_max_shadow + 1))
 #define SHF_oos_may_write (1u << (SH_type_max_shadow + 2))
 
-#endif /* (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC) */
-
 static inline int sh_page_has_multiple_shadows(struct page_info *pg)
 {
     u32 shadows;
@@ -335,7 +333,6 @@ static inline int sh_page_has_multiple_s
     return shadows && (shadows & (shadows - 1));
 }
 
-#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
 /* The caller must verify this is reasonable to call; i.e., valid mfn,
  * domain is translated, &c */
 static inline int page_is_out_of_sync(struct page_info *p)




* [PATCH 10/16] x86/shadow: move OOS functions to their own file
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (8 preceding siblings ...)
  2023-03-22  9:34 ` [PATCH 09/16] x86/shadow: OOS mode is HVM-only Jan Beulich
@ 2023-03-22  9:35 ` Jan Beulich
  2023-03-23 14:30   ` Andrew Cooper
  2023-03-22  9:35 ` [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible Jan Beulich
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

The code has been identified as HVM-only, and its main functions are
pretty well isolated. Move them to their own file. While moving, besides
making two functions non-static, do a few style adjustments, mainly
comment formatting, but leave the code otherwise untouched.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/Makefile
+++ b/xen/arch/x86/mm/shadow/Makefile
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_SHADOW_PAGING),y)
 obj-y += common.o set.o
-obj-$(CONFIG_HVM) += hvm.o guest_2.o guest_3.o guest_4.o
+obj-$(CONFIG_HVM) += hvm.o guest_2.o guest_3.o guest_4.o oos.o
 obj-$(CONFIG_PV) += pv.o guest_4.o
 else
 obj-y += none.o
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -152,576 +152,6 @@ static int __init cf_check shadow_audit_
 __initcall(shadow_audit_key_init);
 #endif /* SHADOW_AUDIT */
 
-#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
-/**************************************************************************/
-/* Out-of-sync shadows. */
-
-/* From time to time, we let a shadowed pagetable page go out of sync
- * with its shadow: the guest is allowed to write directly to the page,
- * and those writes are not synchronously reflected in the shadow.
- * This lets us avoid many emulations if the guest is writing a lot to a
- * pagetable, but it relaxes a pretty important invariant in the shadow
- * pagetable design.  Therefore, some rules:
- *
- * 1. Only L1 pagetables may go out of sync: any page that is shadowed
- *    at at higher level must be synchronously updated.  This makes
- *    using linear shadow pagetables much less dangerous.
- *    That means that: (a) unsyncing code needs to check for higher-level
- *    shadows, and (b) promotion code needs to resync.
- *
- * 2. All shadow operations on a guest page require the page to be brought
- *    back into sync before proceeding.  This must be done under the
- *    paging lock so that the page is guaranteed to remain synced until
- *    the operation completes.
- *
- *    Exceptions to this rule: the pagefault and invlpg handlers may
- *    update only one entry on an out-of-sync page without resyncing it.
- *
- * 3. Operations on shadows that do not start from a guest page need to
- *    be aware that they may be handling an out-of-sync shadow.
- *
- * 4. Operations that do not normally take the paging lock (fast-path
- *    #PF handler, INVLPG) must fall back to a locking, syncing version
- *    if they see an out-of-sync table.
- *
- * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
- *    must explicitly resync all relevant pages or update their
- *    shadows.
- *
- * Currently out-of-sync pages are listed in a simple open-addressed
- * hash table with a second chance (must resist temptation to radically
- * over-engineer hash tables...)  The virtual address of the access
- * which caused us to unsync the page is also kept in the hash table, as
- * a hint for finding the writable mappings later.
- *
- * We keep a hash per vcpu, because we want as much as possible to do
- * the re-sync on the save vcpu we did the unsync on, so the VA hint
- * will be valid.
- */
-
-static void sh_oos_audit(struct domain *d)
-{
-    unsigned int idx, expected_idx, expected_idx_alt;
-    struct page_info *pg;
-    struct vcpu *v;
-
-    for_each_vcpu(d, v)
-    {
-        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
-        {
-            mfn_t *oos = v->arch.paging.shadow.oos;
-            if ( mfn_eq(oos[idx], INVALID_MFN) )
-                continue;
-
-            expected_idx = mfn_x(oos[idx]) % SHADOW_OOS_PAGES;
-            expected_idx_alt = ((expected_idx + 1) % SHADOW_OOS_PAGES);
-            if ( idx != expected_idx && idx != expected_idx_alt )
-            {
-                printk("%s: idx %x contains gmfn %lx, expected at %x or %x.\n",
-                       __func__, idx, mfn_x(oos[idx]),
-                       expected_idx, expected_idx_alt);
-                BUG();
-            }
-            pg = mfn_to_page(oos[idx]);
-            if ( !(pg->count_info & PGC_shadowed_pt) )
-            {
-                printk("%s: idx %x gmfn %lx not a pt (count %lx)\n",
-                       __func__, idx, mfn_x(oos[idx]), pg->count_info);
-                BUG();
-            }
-            if ( !(pg->shadow_flags & SHF_out_of_sync) )
-            {
-                printk("%s: idx %x gmfn %lx not marked oos (flags %x)\n",
-                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
-                BUG();
-            }
-            if ( (pg->shadow_flags & SHF_page_type_mask & ~SHF_L1_ANY) )
-            {
-                printk("%s: idx %x gmfn %lx shadowed as non-l1 (flags %x)\n",
-                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
-                BUG();
-            }
-        }
-    }
-}
-
-#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
-void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    struct vcpu *v;
-    mfn_t *oos;
-
-    ASSERT(mfn_is_out_of_sync(gmfn));
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-
-        if ( mfn_eq(oos[idx], gmfn) )
-            return;
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-#endif
-
-/* Update the shadow, but keep the page out of sync. */
-static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)
-{
-    struct page_info *pg = mfn_to_page(gmfn);
-
-    ASSERT(mfn_valid(gmfn));
-    ASSERT(page_is_out_of_sync(pg));
-
-    /* Call out to the appropriate per-mode resyncing function */
-    if ( pg->shadow_flags & SHF_L1_32 )
-        SHADOW_INTERNAL_NAME(sh_resync_l1, 2)(v, gmfn, snpmfn);
-    else if ( pg->shadow_flags & SHF_L1_PAE )
-        SHADOW_INTERNAL_NAME(sh_resync_l1, 3)(v, gmfn, snpmfn);
-    else if ( pg->shadow_flags & SHF_L1_64 )
-        SHADOW_INTERNAL_NAME(sh_resync_l1, 4)(v, gmfn, snpmfn);
-}
-
-static int sh_remove_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
-                                            mfn_t smfn, unsigned long off)
-{
-    ASSERT(mfn_valid(smfn));
-    ASSERT(mfn_valid(gmfn));
-
-    switch ( mfn_to_page(smfn)->u.sh.type )
-    {
-    case SH_type_l1_32_shadow:
-    case SH_type_fl1_32_shadow:
-        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
-            (d, gmfn, smfn, off);
-
-    case SH_type_l1_pae_shadow:
-    case SH_type_fl1_pae_shadow:
-        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
-            (d, gmfn, smfn, off);
-
-    case SH_type_l1_64_shadow:
-    case SH_type_fl1_64_shadow:
-        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 4)
-            (d, gmfn, smfn, off);
-
-    default:
-        return 0;
-    }
-}
-
-/*
- * Fixup arrays: We limit the maximum number of writable mappings to
- * SHADOW_OOS_FIXUPS and store enough information to remove them
- * quickly on resync.
- */
-
-static inline int oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
-                                       struct oos_fixup *fixup)
-{
-    struct domain *d = v->domain;
-    int i;
-    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
-    {
-        if ( !mfn_eq(fixup->smfn[i], INVALID_MFN) )
-        {
-            sh_remove_write_access_from_sl1p(d, gmfn,
-                                             fixup->smfn[i],
-                                             fixup->off[i]);
-            fixup->smfn[i] = INVALID_MFN;
-        }
-    }
-
-    /* Always flush the TLBs. See comment on oos_fixup_add(). */
-    return 1;
-}
-
-void oos_fixup_add(struct domain *d, mfn_t gmfn,
-                   mfn_t smfn,  unsigned long off)
-{
-    int idx, next;
-    mfn_t *oos;
-    struct oos_fixup *oos_fixup;
-    struct vcpu *v;
-
-    perfc_incr(shadow_oos_fixup_add);
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        oos_fixup = v->arch.paging.shadow.oos_fixup;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            int i;
-            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
-            {
-                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
-                     && (oos_fixup[idx].off[i] == off) )
-                    return;
-            }
-
-            next = oos_fixup[idx].next;
-
-            if ( !mfn_eq(oos_fixup[idx].smfn[next], INVALID_MFN) )
-            {
-                TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_EVICT);
-
-                /* Reuse this slot and remove current writable mapping. */
-                sh_remove_write_access_from_sl1p(d, gmfn,
-                                                 oos_fixup[idx].smfn[next],
-                                                 oos_fixup[idx].off[next]);
-                perfc_incr(shadow_oos_fixup_evict);
-                /* We should flush the TLBs now, because we removed a
-                   writable mapping, but since the shadow is already
-                   OOS we have no problem if another vcpu write to
-                   this page table. We just have to be very careful to
-                   *always* flush the tlbs on resync. */
-            }
-
-            oos_fixup[idx].smfn[next] = smfn;
-            oos_fixup[idx].off[next] = off;
-            oos_fixup[idx].next = (next + 1) % SHADOW_OOS_FIXUPS;
-
-            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_ADD);
-            return;
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-static int oos_remove_write_access(struct vcpu *v, mfn_t gmfn,
-                                   struct oos_fixup *fixup)
-{
-    struct domain *d = v->domain;
-    int ftlb = 0;
-
-    ftlb |= oos_fixup_flush_gmfn(v, gmfn, fixup);
-
-    switch ( sh_remove_write_access(d, gmfn, 0, 0) )
-    {
-    default:
-    case 0:
-        break;
-
-    case 1:
-        ftlb |= 1;
-        break;
-
-    case -1:
-        /* An unfindable writeable typecount has appeared, probably via a
-         * grant table entry: can't shoot the mapping, so try to unshadow
-         * the page.  If that doesn't work either, the guest is granting
-         * his pagetables and must be killed after all.
-         * This will flush the tlb, so we can return with no worries. */
-        shadow_remove_all_shadows(d, gmfn);
-        return 1;
-    }
-
-    if ( ftlb )
-        guest_flush_tlb_mask(d, d->dirty_cpumask);
-
-    return 0;
-}
-
-
-static inline void trace_resync(int event, mfn_t gmfn)
-{
-    if ( tb_init_done )
-    {
-        /* Convert gmfn to gfn */
-        gfn_t gfn = mfn_to_gfn(current->domain, gmfn);
-
-        __trace_var(event, 0/*!tsc*/, sizeof(gfn), &gfn);
-    }
-}
-
-/* Pull all the entries on an out-of-sync page back into sync. */
-static void _sh_resync(struct vcpu *v, mfn_t gmfn,
-                       struct oos_fixup *fixup, mfn_t snp)
-{
-    struct page_info *pg = mfn_to_page(gmfn);
-
-    ASSERT(paging_locked_by_me(v->domain));
-    ASSERT(mfn_is_out_of_sync(gmfn));
-    /* Guest page must be shadowed *only* as L1 when out of sync. */
-    ASSERT(!(mfn_to_page(gmfn)->shadow_flags & SHF_page_type_mask
-             & ~SHF_L1_ANY));
-    ASSERT(!sh_page_has_multiple_shadows(mfn_to_page(gmfn)));
-
-    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
-
-    /* Need to pull write access so the page *stays* in sync. */
-    if ( oos_remove_write_access(v, gmfn, fixup) )
-    {
-        /* Page has been unshadowed. */
-        return;
-    }
-
-    /* No more writable mappings of this page, please */
-    pg->shadow_flags &= ~SHF_oos_may_write;
-
-    /* Update the shadows with current guest entries. */
-    _sh_resync_l1(v, gmfn, snp);
-
-    /* Now we know all the entries are synced, and will stay that way */
-    pg->shadow_flags &= ~SHF_out_of_sync;
-    perfc_incr(shadow_resync);
-    trace_resync(TRC_SHADOW_RESYNC_FULL, gmfn);
-}
-
-
-/* Add an MFN to the list of out-of-sync guest pagetables */
-static void oos_hash_add(struct vcpu *v, mfn_t gmfn)
-{
-    int i, idx, oidx, swap = 0;
-    mfn_t *oos = v->arch.paging.shadow.oos;
-    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
-    struct oos_fixup fixup = { .next = 0 };
-
-    for (i = 0; i < SHADOW_OOS_FIXUPS; i++ )
-        fixup.smfn[i] = INVALID_MFN;
-
-    idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-    oidx = idx;
-
-    if ( !mfn_eq(oos[idx], INVALID_MFN)
-         && (mfn_x(oos[idx]) % SHADOW_OOS_PAGES) == idx )
-    {
-        /* Punt the current occupant into the next slot */
-        SWAP(oos[idx], gmfn);
-        SWAP(oos_fixup[idx], fixup);
-        swap = 1;
-        idx = (idx + 1) % SHADOW_OOS_PAGES;
-    }
-    if ( !mfn_eq(oos[idx], INVALID_MFN) )
-    {
-        /* Crush the current occupant. */
-        _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
-        perfc_incr(shadow_unsync_evict);
-    }
-    oos[idx] = gmfn;
-    oos_fixup[idx] = fixup;
-
-    if ( swap )
-        SWAP(oos_snapshot[idx], oos_snapshot[oidx]);
-
-    copy_domain_page(oos_snapshot[oidx], oos[oidx]);
-}
-
-/* Remove an MFN from the list of out-of-sync guest pagetables */
-static void oos_hash_remove(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    mfn_t *oos;
-    struct vcpu *v;
-
-    SHADOW_PRINTK("d%d gmfn %lx\n", d->domain_id, mfn_x(gmfn));
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            oos[idx] = INVALID_MFN;
-            return;
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    mfn_t *oos;
-    mfn_t *oos_snapshot;
-    struct vcpu *v;
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            return oos_snapshot[idx];
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-/* Pull a single guest page back into sync */
-void sh_resync(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    mfn_t *oos;
-    mfn_t *oos_snapshot;
-    struct oos_fixup *oos_fixup;
-    struct vcpu *v;
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        oos_fixup = v->arch.paging.shadow.oos_fixup;
-        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            _sh_resync(v, gmfn, &oos_fixup[idx], oos_snapshot[idx]);
-            oos[idx] = INVALID_MFN;
-            return;
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-/* Figure out whether it's definitely safe not to sync this l1 table,
- * by making a call out to the mode in which that shadow was made. */
-static int sh_skip_sync(struct vcpu *v, mfn_t gl1mfn)
-{
-    struct page_info *pg = mfn_to_page(gl1mfn);
-    if ( pg->shadow_flags & SHF_L1_32 )
-        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 2)(v, gl1mfn);
-    else if ( pg->shadow_flags & SHF_L1_PAE )
-        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 3)(v, gl1mfn);
-    else if ( pg->shadow_flags & SHF_L1_64 )
-        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 4)(v, gl1mfn);
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not shadowed as an l1\n",
-           mfn_x(gl1mfn));
-    BUG();
-}
-
-
-/* Pull all out-of-sync pages back into sync.  Pages brought out of sync
- * on other vcpus are allowed to remain out of sync, but their contents
- * will be made safe (TLB flush semantics); pages unsynced by this vcpu
- * are brought back into sync and write-protected.  If skip != 0, we try
- * to avoid resyncing at all if we think we can get away with it. */
-void sh_resync_all(struct vcpu *v, int skip, int this, int others)
-{
-    int idx;
-    struct vcpu *other;
-    mfn_t *oos = v->arch.paging.shadow.oos;
-    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
-
-    SHADOW_PRINTK("%pv\n", v);
-
-    ASSERT(paging_locked_by_me(v->domain));
-
-    if ( !this )
-        goto resync_others;
-
-    /* First: resync all of this vcpu's oos pages */
-    for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
-        if ( !mfn_eq(oos[idx], INVALID_MFN) )
-        {
-            /* Write-protect and sync contents */
-            _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
-            oos[idx] = INVALID_MFN;
-        }
-
- resync_others:
-    if ( !others )
-        return;
-
-    /* Second: make all *other* vcpus' oos pages safe. */
-    for_each_vcpu(v->domain, other)
-    {
-        if ( v == other )
-            continue;
-
-        oos = other->arch.paging.shadow.oos;
-        oos_fixup = other->arch.paging.shadow.oos_fixup;
-        oos_snapshot = other->arch.paging.shadow.oos_snapshot;
-
-        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
-        {
-            if ( mfn_eq(oos[idx], INVALID_MFN) )
-                continue;
-
-            if ( skip )
-            {
-                /* Update the shadows and leave the page OOS. */
-                if ( sh_skip_sync(v, oos[idx]) )
-                    continue;
-                trace_resync(TRC_SHADOW_RESYNC_ONLY, oos[idx]);
-                _sh_resync_l1(other, oos[idx], oos_snapshot[idx]);
-            }
-            else
-            {
-                /* Write-protect and sync contents */
-                _sh_resync(other, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
-                oos[idx] = INVALID_MFN;
-            }
-        }
-    }
-}
-
-/* Allow a shadowed page to go out of sync. Unsyncs are traced in
- * multi.c:sh_page_fault() */
-int sh_unsync(struct vcpu *v, mfn_t gmfn)
-{
-    struct page_info *pg;
-
-    ASSERT(paging_locked_by_me(v->domain));
-
-    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
-
-    pg = mfn_to_page(gmfn);
-
-    /* Guest page must be shadowed *only* as L1 and *only* once when out
-     * of sync.  Also, get out now if it's already out of sync.
-     * Also, can't safely unsync if some vcpus have paging disabled.*/
-    if ( pg->shadow_flags &
-         ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
-         || sh_page_has_multiple_shadows(pg)
-         || !is_hvm_vcpu(v)
-         || !v->domain->arch.paging.shadow.oos_active )
-        return 0;
-
-    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_out_of_sync);
-    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_oos_may_write);
-
-    pg->shadow_flags |= SHF_out_of_sync|SHF_oos_may_write;
-    oos_hash_add(v, gmfn);
-    perfc_incr(shadow_unsync);
-    TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_UNSYNC);
-    return 1;
-}
-
-#endif /* (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC) */
-
-
 /**************************************************************************/
 /* Code for "promoting" a guest page to the point where the shadow code is
  * willing to let it be treated as a guest page table.  This generally
--- /dev/null
+++ b/xen/arch/x86/mm/shadow/oos.c
@@ -0,0 +1,603 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/******************************************************************************
+ * arch/x86/mm/shadow/oos.c
+ *
+ * Shadow code dealing with out-of-sync shadows.
+ * Parts of this code are Copyright (c) 2006 by XenSource Inc.
+ * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
+ * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
+ */
+
+#include "private.h"
+
+#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
+
+#include <xen/trace.h>
+
+#include <asm/shadow.h>
+
+/*
+ * From time to time, we let a shadowed pagetable page go out of sync
+ * with its shadow: the guest is allowed to write directly to the page,
+ * and those writes are not synchronously reflected in the shadow.
+ * This lets us avoid many emulations if the guest is writing a lot to a
+ * pagetable, but it relaxes a pretty important invariant in the shadow
+ * pagetable design.  Therefore, some rules:
+ *
+ * 1. Only L1 pagetables may go out of sync: any page that is shadowed
+ *    at a higher level must be synchronously updated.  This makes
+ *    using linear shadow pagetables much less dangerous.
+ *    That means that: (a) unsyncing code needs to check for higher-level
+ *    shadows, and (b) promotion code needs to resync.
+ *
+ * 2. All shadow operations on a guest page require the page to be brought
+ *    back into sync before proceeding.  This must be done under the
+ *    paging lock so that the page is guaranteed to remain synced until
+ *    the operation completes.
+ *
+ *    Exceptions to this rule: the pagefault and invlpg handlers may
+ *    update only one entry on an out-of-sync page without resyncing it.
+ *
+ * 3. Operations on shadows that do not start from a guest page need to
+ *    be aware that they may be handling an out-of-sync shadow.
+ *
+ * 4. Operations that do not normally take the paging lock (fast-path
+ *    #PF handler, INVLPG) must fall back to a locking, syncing version
+ *    if they see an out-of-sync table.
+ *
+ * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
+ *    must explicitly resync all relevant pages or update their
+ *    shadows.
+ *
+ * Currently out-of-sync pages are listed in a simple open-addressed
+ * hash table with a second chance (must resist temptation to radically
+ * over-engineer hash tables...)  The virtual address of the access
+ * which caused us to unsync the page is also kept in the hash table, as
+ * a hint for finding the writable mappings later.
+ *
+ * We keep a hash per vcpu, because we want as much as possible to do
+ * the re-sync on the same vcpu we did the unsync on, so the VA hint
+ * will be valid.
+ */
+
+#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES_FULL
+void sh_oos_audit(struct domain *d)
+{
+    unsigned int idx, expected_idx, expected_idx_alt;
+    struct page_info *pg;
+    struct vcpu *v;
+
+    for_each_vcpu(d, v)
+    {
+        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
+        {
+            mfn_t *oos = v->arch.paging.shadow.oos;
+            if ( mfn_eq(oos[idx], INVALID_MFN) )
+                continue;
+
+            expected_idx = mfn_x(oos[idx]) % SHADOW_OOS_PAGES;
+            expected_idx_alt = ((expected_idx + 1) % SHADOW_OOS_PAGES);
+            if ( idx != expected_idx && idx != expected_idx_alt )
+            {
+                printk("%s: idx %x contains gmfn %lx, expected at %x or %x.\n",
+                       __func__, idx, mfn_x(oos[idx]),
+                       expected_idx, expected_idx_alt);
+                BUG();
+            }
+            pg = mfn_to_page(oos[idx]);
+            if ( !(pg->count_info & PGC_shadowed_pt) )
+            {
+                printk("%s: idx %x gmfn %lx not a pt (count %lx)\n",
+                       __func__, idx, mfn_x(oos[idx]), pg->count_info);
+                BUG();
+            }
+            if ( !(pg->shadow_flags & SHF_out_of_sync) )
+            {
+                printk("%s: idx %x gmfn %lx not marked oos (flags %x)\n",
+                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
+                BUG();
+            }
+            if ( (pg->shadow_flags & SHF_page_type_mask & ~SHF_L1_ANY) )
+            {
+                printk("%s: idx %x gmfn %lx shadowed as non-l1 (flags %x)\n",
+                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
+                BUG();
+            }
+        }
+    }
+}
+#endif
+
+#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
+void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    struct vcpu *v;
+    mfn_t *oos;
+
+    ASSERT(mfn_is_out_of_sync(gmfn));
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+
+        if ( mfn_eq(oos[idx], gmfn) )
+            return;
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+#endif
+
+/* Update the shadow, but keep the page out of sync. */
+static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)
+{
+    struct page_info *pg = mfn_to_page(gmfn);
+
+    ASSERT(mfn_valid(gmfn));
+    ASSERT(page_is_out_of_sync(pg));
+
+    /* Call out to the appropriate per-mode resyncing function */
+    if ( pg->shadow_flags & SHF_L1_32 )
+        SHADOW_INTERNAL_NAME(sh_resync_l1, 2)(v, gmfn, snpmfn);
+    else if ( pg->shadow_flags & SHF_L1_PAE )
+        SHADOW_INTERNAL_NAME(sh_resync_l1, 3)(v, gmfn, snpmfn);
+    else if ( pg->shadow_flags & SHF_L1_64 )
+        SHADOW_INTERNAL_NAME(sh_resync_l1, 4)(v, gmfn, snpmfn);
+}
+
+static int sh_remove_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
+                                            mfn_t smfn, unsigned long off)
+{
+    ASSERT(mfn_valid(smfn));
+    ASSERT(mfn_valid(gmfn));
+
+    switch ( mfn_to_page(smfn)->u.sh.type )
+    {
+    case SH_type_l1_32_shadow:
+    case SH_type_fl1_32_shadow:
+        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
+            (d, gmfn, smfn, off);
+
+    case SH_type_l1_pae_shadow:
+    case SH_type_fl1_pae_shadow:
+        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
+            (d, gmfn, smfn, off);
+
+    case SH_type_l1_64_shadow:
+    case SH_type_fl1_64_shadow:
+        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 4)
+            (d, gmfn, smfn, off);
+
+    default:
+        return 0;
+    }
+}
+
+/*
+ * Fixup arrays: We limit the maximum number of writable mappings to
+ * SHADOW_OOS_FIXUPS and store enough information to remove them
+ * quickly on resync.
+ */
+
+static inline int oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
+                                       struct oos_fixup *fixup)
+{
+    struct domain *d = v->domain;
+    int i;
+    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
+    {
+        if ( !mfn_eq(fixup->smfn[i], INVALID_MFN) )
+        {
+            sh_remove_write_access_from_sl1p(d, gmfn,
+                                             fixup->smfn[i],
+                                             fixup->off[i]);
+            fixup->smfn[i] = INVALID_MFN;
+        }
+    }
+
+    /* Always flush the TLBs. See comment on oos_fixup_add(). */
+    return 1;
+}
+
+void oos_fixup_add(struct domain *d, mfn_t gmfn,
+                   mfn_t smfn,  unsigned long off)
+{
+    int idx, next;
+    mfn_t *oos;
+    struct oos_fixup *oos_fixup;
+    struct vcpu *v;
+
+    perfc_incr(shadow_oos_fixup_add);
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        oos_fixup = v->arch.paging.shadow.oos_fixup;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            int i;
+            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
+            {
+                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
+                     && (oos_fixup[idx].off[i] == off) )
+                    return;
+            }
+
+            next = oos_fixup[idx].next;
+
+            if ( !mfn_eq(oos_fixup[idx].smfn[next], INVALID_MFN) )
+            {
+                TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_EVICT);
+
+                /* Reuse this slot and remove current writable mapping. */
+                sh_remove_write_access_from_sl1p(d, gmfn,
+                                                 oos_fixup[idx].smfn[next],
+                                                 oos_fixup[idx].off[next]);
+                perfc_incr(shadow_oos_fixup_evict);
+                /*
+                 * We should flush the TLBs now, because we removed a
+                 * writable mapping, but since the shadow is already
+                 * OOS we have no problem if another vcpu write to
+                 * this page table. We just have to be very careful to
+                 * *always* flush the tlbs on resync.
+                 */
+            }
+
+            oos_fixup[idx].smfn[next] = smfn;
+            oos_fixup[idx].off[next] = off;
+            oos_fixup[idx].next = (next + 1) % SHADOW_OOS_FIXUPS;
+
+            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_ADD);
+            return;
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+static int oos_remove_write_access(struct vcpu *v, mfn_t gmfn,
+                                   struct oos_fixup *fixup)
+{
+    struct domain *d = v->domain;
+    int ftlb = 0;
+
+    ftlb |= oos_fixup_flush_gmfn(v, gmfn, fixup);
+
+    switch ( sh_remove_write_access(d, gmfn, 0, 0) )
+    {
+    default:
+    case 0:
+        break;
+
+    case 1:
+        ftlb |= 1;
+        break;
+
+    case -1:
+        /*
+         * An unfindable writeable typecount has appeared, probably via a
+         * grant table entry: can't shoot the mapping, so try to unshadow
+         * the page.  If that doesn't work either, the guest is granting
+         * his pagetables and must be killed after all.
+         * This will flush the tlb, so we can return with no worries.
+         */
+        shadow_remove_all_shadows(d, gmfn);
+        return 1;
+    }
+
+    if ( ftlb )
+        guest_flush_tlb_mask(d, d->dirty_cpumask);
+
+    return 0;
+}
+
+static inline void trace_resync(int event, mfn_t gmfn)
+{
+    if ( tb_init_done )
+    {
+        /* Convert gmfn to gfn */
+        gfn_t gfn = mfn_to_gfn(current->domain, gmfn);
+
+        __trace_var(event, 0/*!tsc*/, sizeof(gfn), &gfn);
+    }
+}
+
+/* Pull all the entries on an out-of-sync page back into sync. */
+static void _sh_resync(struct vcpu *v, mfn_t gmfn,
+                       struct oos_fixup *fixup, mfn_t snp)
+{
+    struct page_info *pg = mfn_to_page(gmfn);
+
+    ASSERT(paging_locked_by_me(v->domain));
+    ASSERT(mfn_is_out_of_sync(gmfn));
+    /* Guest page must be shadowed *only* as L1 when out of sync. */
+    ASSERT(!(mfn_to_page(gmfn)->shadow_flags & SHF_page_type_mask
+             & ~SHF_L1_ANY));
+    ASSERT(!sh_page_has_multiple_shadows(mfn_to_page(gmfn)));
+
+    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
+
+    /* Need to pull write access so the page *stays* in sync. */
+    if ( oos_remove_write_access(v, gmfn, fixup) )
+    {
+        /* Page has been unshadowed. */
+        return;
+    }
+
+    /* No more writable mappings of this page, please */
+    pg->shadow_flags &= ~SHF_oos_may_write;
+
+    /* Update the shadows with current guest entries. */
+    _sh_resync_l1(v, gmfn, snp);
+
+    /* Now we know all the entries are synced, and will stay that way */
+    pg->shadow_flags &= ~SHF_out_of_sync;
+    perfc_incr(shadow_resync);
+    trace_resync(TRC_SHADOW_RESYNC_FULL, gmfn);
+}
+
+/* Add an MFN to the list of out-of-sync guest pagetables */
+static void oos_hash_add(struct vcpu *v, mfn_t gmfn)
+{
+    int i, idx, oidx, swap = 0;
+    mfn_t *oos = v->arch.paging.shadow.oos;
+    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
+    struct oos_fixup fixup = { .next = 0 };
+
+    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
+        fixup.smfn[i] = INVALID_MFN;
+
+    idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+    oidx = idx;
+
+    if ( !mfn_eq(oos[idx], INVALID_MFN)
+         && (mfn_x(oos[idx]) % SHADOW_OOS_PAGES) == idx )
+    {
+        /* Punt the current occupant into the next slot */
+        SWAP(oos[idx], gmfn);
+        SWAP(oos_fixup[idx], fixup);
+        swap = 1;
+        idx = (idx + 1) % SHADOW_OOS_PAGES;
+    }
+    if ( !mfn_eq(oos[idx], INVALID_MFN) )
+    {
+        /* Crush the current occupant. */
+        _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
+        perfc_incr(shadow_unsync_evict);
+    }
+    oos[idx] = gmfn;
+    oos_fixup[idx] = fixup;
+
+    if ( swap )
+        SWAP(oos_snapshot[idx], oos_snapshot[oidx]);
+
+    copy_domain_page(oos_snapshot[oidx], oos[oidx]);
+}
+
+/* Remove an MFN from the list of out-of-sync guest pagetables */
+void oos_hash_remove(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    mfn_t *oos;
+    struct vcpu *v;
+
+    SHADOW_PRINTK("d%d gmfn %lx\n", d->domain_id, mfn_x(gmfn));
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            oos[idx] = INVALID_MFN;
+            return;
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    mfn_t *oos;
+    mfn_t *oos_snapshot;
+    struct vcpu *v;
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            return oos_snapshot[idx];
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+/* Pull a single guest page back into sync */
+void sh_resync(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    mfn_t *oos;
+    mfn_t *oos_snapshot;
+    struct oos_fixup *oos_fixup;
+    struct vcpu *v;
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        oos_fixup = v->arch.paging.shadow.oos_fixup;
+        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            _sh_resync(v, gmfn, &oos_fixup[idx], oos_snapshot[idx]);
+            oos[idx] = INVALID_MFN;
+            return;
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+/*
+ * Figure out whether it's definitely safe not to sync this l1 table,
+ * by making a call out to the mode in which that shadow was made.
+ */
+static int sh_skip_sync(struct vcpu *v, mfn_t gl1mfn)
+{
+    struct page_info *pg = mfn_to_page(gl1mfn);
+    if ( pg->shadow_flags & SHF_L1_32 )
+        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 2)(v, gl1mfn);
+    else if ( pg->shadow_flags & SHF_L1_PAE )
+        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 3)(v, gl1mfn);
+    else if ( pg->shadow_flags & SHF_L1_64 )
+        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 4)(v, gl1mfn);
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not shadowed as an l1\n",
+           mfn_x(gl1mfn));
+    BUG();
+}
+
+/*
+ * Pull all out-of-sync pages back into sync.  Pages brought out of sync
+ * on other vcpus are allowed to remain out of sync, but their contents
+ * will be made safe (TLB flush semantics); pages unsynced by this vcpu
+ * are brought back into sync and write-protected.  If skip != 0, we try
+ * to avoid resyncing at all if we think we can get away with it.
+ */
+void sh_resync_all(struct vcpu *v, int skip, int this, int others)
+{
+    int idx;
+    struct vcpu *other;
+    mfn_t *oos = v->arch.paging.shadow.oos;
+    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
+
+    SHADOW_PRINTK("%pv\n", v);
+
+    ASSERT(paging_locked_by_me(v->domain));
+
+    if ( !this )
+        goto resync_others;
+
+    /* First: resync all of this vcpu's oos pages */
+    for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
+        if ( !mfn_eq(oos[idx], INVALID_MFN) )
+        {
+            /* Write-protect and sync contents */
+            _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
+            oos[idx] = INVALID_MFN;
+        }
+
+ resync_others:
+    if ( !others )
+        return;
+
+    /* Second: make all *other* vcpus' oos pages safe. */
+    for_each_vcpu(v->domain, other)
+    {
+        if ( v == other )
+            continue;
+
+        oos = other->arch.paging.shadow.oos;
+        oos_fixup = other->arch.paging.shadow.oos_fixup;
+        oos_snapshot = other->arch.paging.shadow.oos_snapshot;
+
+        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
+        {
+            if ( mfn_eq(oos[idx], INVALID_MFN) )
+                continue;
+
+            if ( skip )
+            {
+                /* Update the shadows and leave the page OOS. */
+                if ( sh_skip_sync(v, oos[idx]) )
+                    continue;
+                trace_resync(TRC_SHADOW_RESYNC_ONLY, oos[idx]);
+                _sh_resync_l1(other, oos[idx], oos_snapshot[idx]);
+            }
+            else
+            {
+                /* Write-protect and sync contents */
+                _sh_resync(other, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
+                oos[idx] = INVALID_MFN;
+            }
+        }
+    }
+}
+
+/*
+ * Allow a shadowed page to go out of sync. Unsyncs are traced in
+ * multi.c:sh_page_fault()
+ */
+int sh_unsync(struct vcpu *v, mfn_t gmfn)
+{
+    struct page_info *pg;
+
+    ASSERT(paging_locked_by_me(v->domain));
+
+    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
+
+    pg = mfn_to_page(gmfn);
+
+    /*
+     * Guest page must be shadowed *only* as L1 and *only* once when out
+     * of sync.  Also, get out now if it's already out of sync.
+     * Also, can't safely unsync if some vcpus have paging disabled.
+     */
+    if ( pg->shadow_flags &
+         ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
+         || sh_page_has_multiple_shadows(pg)
+         || !is_hvm_vcpu(v)
+         || !v->domain->arch.paging.shadow.oos_active )
+        return 0;
+
+    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_out_of_sync);
+    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_oos_may_write);
+
+    pg->shadow_flags |= SHF_out_of_sync|SHF_oos_may_write;
+    oos_hash_add(v, gmfn);
+    perfc_incr(shadow_unsync);
+    TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_UNSYNC);
+    return 1;
+}
+
+#endif /* (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC) */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -452,6 +452,7 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
 /* Pull an out-of-sync page back into sync. */
 void sh_resync(struct domain *d, mfn_t gmfn);
 
+void oos_hash_remove(struct domain *d, mfn_t gmfn);
 void oos_fixup_add(struct domain *d, mfn_t gmfn, mfn_t smfn, unsigned long off);
 
 /* Pull all out-of-sync shadows back into sync.  If skip != 0, we try
@@ -477,6 +478,7 @@ shadow_sync_other_vcpus(struct vcpu *v)
     sh_resync_all(v, 1 /* skip */, 0 /* this */, 1 /* others */);
 }
 
+void sh_oos_audit(struct domain *d);
 void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn);
 mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn);
 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (9 preceding siblings ...)
  2023-03-22  9:35 ` [PATCH 10/16] x86/shadow: move OOS functions to their own file Jan Beulich
@ 2023-03-22  9:35 ` Jan Beulich
  2023-03-23 18:18   ` Andrew Cooper
  2023-03-22  9:35 ` [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent Jan Beulich
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

Emulation related functions are involved in HVM handling only, and in
some cases they even invoke such checks after having already done things
which are valid for HVM domains only. OOS active also implies HVM. In
sh_remove_all_mappings() one of the two checks is redundant with an
earlier paging_mode_external() one (the other, however, needs to stay).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1522,7 +1522,7 @@ int sh_remove_all_mappings(struct domain
                && (page->count_info & PGC_count_mask) <= 3
                && ((page->u.inuse.type_info & PGT_count_mask)
                    == (is_special_page(page) ||
-                       (is_hvm_domain(d) && is_ioreq_server_page(d, page))))) )
+                       is_ioreq_server_page(d, page)))) )
             printk(XENLOG_G_ERR "can't find all mappings of mfn %"PRI_mfn
                    " (gfn %"PRI_gfn"): c=%lx t=%lx s=%d i=%d\n",
                    mfn_x(gmfn), gfn_x(gfn),
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -204,10 +204,6 @@ hvm_emulate_write(enum x86_segment seg,
     if ( rc || !bytes )
         return rc;
 
-    /* Unaligned writes are only acceptable on HVM */
-    if ( (addr & (bytes - 1)) && !is_hvm_vcpu(v)  )
-        return X86EMUL_UNHANDLEABLE;
-
     ptr = sh_emulate_map_dest(v, addr, bytes, sh_ctxt);
     if ( IS_ERR(ptr) )
         return ~PTR_ERR(ptr);
@@ -258,10 +254,6 @@ hvm_emulate_cmpxchg(enum x86_segment seg
     if ( rc )
         return rc;
 
-    /* Unaligned writes are only acceptable on HVM */
-    if ( (addr & (bytes - 1)) && !is_hvm_vcpu(v)  )
-        return X86EMUL_UNHANDLEABLE;
-
     ptr = sh_emulate_map_dest(v, addr, bytes, sh_ctxt);
     if ( IS_ERR(ptr) )
         return ~PTR_ERR(ptr);
@@ -457,8 +449,7 @@ static void *sh_emulate_map_dest(struct
 
 #ifndef NDEBUG
     /* We don't emulate user-mode writes to page tables. */
-    if ( is_hvm_domain(d) ? hvm_get_cpl(v) == 3
-                          : !guest_kernel_mode(v, guest_cpu_user_regs()) )
+    if ( hvm_get_cpl(v) == 3 )
     {
         gdprintk(XENLOG_DEBUG, "User-mode write to pagetable reached "
                  "emulate_map_dest(). This should never happen!\n");
@@ -487,15 +478,6 @@ static void *sh_emulate_map_dest(struct
         sh_ctxt->mfn[1] = INVALID_MFN;
         map = map_domain_page(sh_ctxt->mfn[0]) + (vaddr & ~PAGE_MASK);
     }
-    else if ( !is_hvm_domain(d) )
-    {
-        /*
-         * Cross-page emulated writes are only supported for HVM guests;
-         * PV guests ought to know better.
-         */
-        put_page(mfn_to_page(sh_ctxt->mfn[0]));
-        return MAPPING_UNHANDLEABLE;
-    }
     else
     {
         /* This write crosses a page boundary. Translate the second page. */
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3441,7 +3441,7 @@ int sh_rm_write_access_from_sl1p(struct
 
 #ifdef CONFIG_HVM
     /* Remember if we've been told that this process is being torn down */
-    if ( curr->domain == d && is_hvm_domain(d) )
+    if ( curr->domain == d )
         curr->arch.paging.shadow.pagetable_dying
             = mfn_to_page(gmfn)->pagetable_dying;
 #endif
--- a/xen/arch/x86/mm/shadow/oos.c
+++ b/xen/arch/x86/mm/shadow/oos.c
@@ -577,7 +577,6 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
     if ( pg->shadow_flags &
          ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
          || sh_page_has_multiple_shadows(pg)
-         || !is_hvm_vcpu(v)
          || !v->domain->arch.paging.shadow.oos_active )
         return 0;
 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (10 preceding siblings ...)
  2023-03-22  9:35 ` [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible Jan Beulich
@ 2023-03-22  9:35 ` Jan Beulich
  2023-03-23 18:28   ` Andrew Cooper
  2023-03-22  9:36 ` [PATCH 13/16] x86/shadow: vCPU-s never have "no mode" Jan Beulich
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

While benign at present, it is still a little fragile to operate on a
wrong "old_mode" value in sh_update_paging_modes(). This can happen when
no monitor table was present initially - we'd create one for the new
mode without updating old_mode. Correct this two ways, each of which
would be sufficient on its own: Once by adding "else" to the second of
the involved if()s in the function, and then by setting the correct
initial mode for HVM domains in shadow_vcpu_init().

Further use the same predicate (paging_mode_external()) consistently
when dealing with shadow mode init/update/cleanup, rather than a mix of
is_hvm_vcpu() (init), is_hvm_domain() (update), and
paging_mode_external() (cleanup).

Finally drop a redundant is_hvm_domain() from inside the bigger if()
(which is being converted to paging_mode_external()) in
sh_update_paging_modes().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -129,8 +129,8 @@ void shadow_vcpu_init(struct vcpu *v)
     }
 #endif
 
-    v->arch.paging.mode = is_hvm_vcpu(v) ?
-                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 3) :
+    v->arch.paging.mode = paging_mode_external(v->domain) ?
+                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 2) :
                           &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);
 }
 
@@ -1811,7 +1811,7 @@ static void sh_update_paging_modes(struc
         sh_detach_old_tables(v);
 
 #ifdef CONFIG_HVM
-    if ( is_hvm_domain(d) )
+    if ( paging_mode_external(d) )
     {
         const struct paging_mode *old_mode = v->arch.paging.mode;
 
@@ -1864,13 +1864,12 @@ static void sh_update_paging_modes(struc
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
         }
-
-        if ( v->arch.paging.mode != old_mode )
+        else if ( v->arch.paging.mode != old_mode )
         {
             SHADOW_PRINTK("new paging mode: %pv pe=%d gl=%u "
                           "sl=%u (was g=%u s=%u)\n",
                           v,
-                          is_hvm_domain(d) ? hvm_paging_enabled(v) : 1,
+                          hvm_paging_enabled(v),
                           v->arch.paging.mode->guest_levels,
                           v->arch.paging.mode->shadow.shadow_levels,
                           old_mode ? old_mode->guest_levels : 0,



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 13/16] x86/shadow: vCPU-s never have "no mode"
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (11 preceding siblings ...)
  2023-03-22  9:35 ` [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent Jan Beulich
@ 2023-03-22  9:36 ` Jan Beulich
  2023-03-23 18:29   ` Andrew Cooper
  2023-03-22  9:37 ` [PATCH 14/16] x86/shadow: "monitor table" is a HVM-only concept Jan Beulich
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:36 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

With an initial mode installed by shadow_vcpu_init(), there's no need
for sh_update_paging_modes() to deal with the "mode is still unset"
case. Leave an assertion, though.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1864,6 +1864,8 @@ static void sh_update_paging_modes(struc
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
         }
+        else if ( !old_mode )
+            ASSERT_UNREACHABLE();
         else if ( v->arch.paging.mode != old_mode )
         {
             SHADOW_PRINTK("new paging mode: %pv pe=%d gl=%u "
@@ -1872,11 +1874,10 @@ static void sh_update_paging_modes(struc
                           hvm_paging_enabled(v),
                           v->arch.paging.mode->guest_levels,
                           v->arch.paging.mode->shadow.shadow_levels,
-                          old_mode ? old_mode->guest_levels : 0,
-                          old_mode ? old_mode->shadow.shadow_levels : 0);
-            if ( old_mode &&
-                 (v->arch.paging.mode->shadow.shadow_levels !=
-                  old_mode->shadow.shadow_levels) )
+                          old_mode->guest_levels,
+                          old_mode->shadow.shadow_levels);
+            if ( v->arch.paging.mode->shadow.shadow_levels !=
+                 old_mode->shadow.shadow_levels )
             {
                 /* Need to make a new monitor table for the new mode */
                 mfn_t new_mfn, old_mfn;



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 14/16] x86/shadow: "monitor table" is a HVM-only concept
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (12 preceding siblings ...)
  2023-03-22  9:36 ` [PATCH 13/16] x86/shadow: vCPU-s never have "no mode" Jan Beulich
@ 2023-03-22  9:37 ` Jan Beulich
  2023-03-23 18:30   ` Andrew Cooper
  2023-03-22  9:37 ` [PATCH 15/16] x86/shadow: adjust monitor table prealloc amount Jan Beulich
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

It looks like in the combination of aff8bf94ce65 ('x86/shadow: only
4-level guest code needs building when !HVM') and 0b841314dace
('x86/shadow: sh_{make,destroy}_monitor_table() are "even more" HVM-
only') I didn't go quite far enough: SH_type_monitor_table is also
effectively unused when !HVM.

The assertion early in sh_destroy_shadow() can have the type dropped
altogether: it shouldn't make it here in the first place. Pages of
this type are freed directly from sh_destroy_monitor_table() only.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1190,7 +1190,6 @@ void sh_destroy_shadow(struct domain *d,
     ASSERT(t == SH_type_fl1_32_shadow  ||
            t == SH_type_fl1_pae_shadow ||
            t == SH_type_fl1_64_shadow  ||
-           t == SH_type_monitor_table  ||
            (is_pv_32bit_domain(d) && t == SH_type_l4_64_shadow) ||
            (page_get_owner(mfn_to_page(backpointer(sp))) == d));
 
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -205,8 +205,7 @@ extern void shadow_audit_tables(struct v
 #define SH_type_l4_64_shadow   6U /* shadowing a 64-bit L4 page */
 #define SH_type_max_shadow     6U
 #define SH_type_p2m_table      7U /* in use as the p2m table */
-#define SH_type_monitor_table  8U /* in use as a monitor table */
-#define SH_type_unused         9U
+#define SH_type_unused         8U
 #endif
 
 #ifndef CONFIG_PV32 /* Unused (but uglier to #ifdef above): */



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 15/16] x86/shadow: adjust monitor table prealloc amount
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (13 preceding siblings ...)
  2023-03-22  9:37 ` [PATCH 14/16] x86/shadow: "monitor table" is a HVM-only concept Jan Beulich
@ 2023-03-22  9:37 ` Jan Beulich
  2023-03-22  9:38 ` [PATCH 16/16] x86/PV: conditionalize arch_set_info_guest()'s call to update_cr3() Jan Beulich
  2023-03-23 11:40 ` [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Andrew Cooper
  16 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

While 670d6b908ff2 ('x86 shadow: Move the shadow linear mapping for
n-on-3-on-4 shadows so') bumped the amount by one too little for the
32-on-64 case (which luckily was dead code, and hence a bump wasn't
necessary in the first place), 0b841314dace ('x86/shadow:
sh_{make,destroy}_monitor_table() are "even more" HVM-only'), dropping
the dead code, then didn't adjust the amount back. Yet even the original
amount was too high in certain cases. Switch to pre-allocating just as
much as is going to be needed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -738,7 +738,7 @@ mfn_t sh_make_monitor_table(const struct
     ASSERT(!pagetable_get_pfn(v->arch.hvm.monitor_table));
 
     /* Guarantee we can get the memory we need */
-    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
+    if ( !shadow_prealloc(d, SH_type_monitor_table, shadow_levels < 4 ? 3 : 1) )
         return INVALID_MFN;
 
     m4mfn = shadow_alloc(d, SH_type_monitor_table, 0);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 16/16] x86/PV: conditionalize arch_set_info_guest()'s call to update_cr3()
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (14 preceding siblings ...)
  2023-03-22  9:37 ` [PATCH 15/16] x86/shadow: adjust monitor table prealloc amount Jan Beulich
@ 2023-03-22  9:38 ` Jan Beulich
  2023-03-23 18:34   ` Andrew Cooper
  2023-03-23 11:40 ` [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Andrew Cooper
  16 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-22  9:38 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

sh_update_paging_modes() as its last action already invokes
sh_update_cr3(). Therefore there is no reason to invoke update_cr3()
another time immediately after calling paging_update_paging_modes(),
especially since sh_update_cr3() does not short-circuit the "nothing
changed" case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1445,8 +1445,8 @@ int arch_set_info_guest(
 
     if ( paging_mode_enabled(d) )
         paging_update_paging_modes(v);
-
-    update_cr3(v);
+    else
+        update_cr3(v);
 #endif /* CONFIG_PV */
 
  out:



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments
  2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
                   ` (15 preceding siblings ...)
  2023-03-22  9:38 ` [PATCH 16/16] x86/PV: conditionalize arch_set_info_guest()'s call to update_cr3() Jan Beulich
@ 2023-03-23 11:40 ` Andrew Cooper
  16 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 11:40 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:28 am, Jan Beulich wrote:
> This is kind of fallout from XSA-427 investigations, partly related to
> there having been a more intrusive first approach. This is also the
> reason why one of the patch has R-b already - that was a prereq for
> the original approach.
>
> Most patches aren't really dependent upon one another, so can probably
> go in independently (as they get acked).
>
> 01: shadow: fix and improve sh_page_has_multiple_shadows()
> 02: shadow: fold/rename sh_unhook_*_mappings()
> 03: shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys"
> 04: shadow: replace memcmp() in sh_resync_l1()
> 05: shadow: reduce explicit log-dirty recording for HVM
> 06: shadow: purge {write,cmpxchg}_guest_entry() hooks
> 07: shadow: call sh_update_cr3() directly sh_page_fault()
> 08: shadow: use lighter weight mode checks
> 09: shadow: OOS mode is HVM-only
> 10: shadow: move OOS functions to their own file
> 11: shadow: drop is_hvm_...() where easily possible
> 12: shadow: make monitor table create/destroy more consistent
> 13: shadow: vCPU-s never have "no mode"
> 14: shadow: "monitor table" is a HVM-only concept
> 15: shadow: adjust monitor table prealloc amount
> 16: PV: conditionalize arch_set_info_guest()'s call to update_cr3()

Out of interest, I looked at the net delta from this, and it's quite
interesting.

For data:
add/remove: 0/0 grow/shrink: 1/7 up/down: 8/-112 (-104)
__func__                                    2986    2994      +8
sh_paging_mode__guest_4                       96      80     -16
...

which is nice all around.  (Shame that __func__ is being merged
everywhere but oh well.)

For code, two notable exerts:
add/remove: 6/5 grow/shrink: 5/39 up/down: 1549/-3499 (-1950)
Function                                     old     new   delta
mod_l1_entry                                2120    2008    -112
do_mmu_update                               6548    6209    -339

I can't see any patch which obviously makes that change in one go, so I
can only assume it's a combination of various small things.

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows()
  2023-03-22  9:29 ` [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows() Jan Beulich
@ 2023-03-23 11:55   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 11:55 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:29 am, Jan Beulich wrote:
> While no caller currently invokes the function without first making sure
> there is at least one shadow [1], we'd better eliminate UB here:
> find_first_set_bit() requires input to be non-zero to return a well-
> defined result.
>
> Further, using find_first_set_bit() isn't very efficient in the first
> place for the intended purpose.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/16] x86/shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys"
  2023-03-22  9:31 ` [PATCH 03/16] x86/shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys" Jan Beulich
@ 2023-03-23 12:14   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 12:14 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:31 am, Jan Beulich wrote:
> SHADOW_FOREACH_L<N>E() already invokes the "body" only when the present
> bit is set; no need to re-do the check.
>
> While there also
> - stop open-coding mfn_to_maddr() in code being touched (re-indented)
>   anyway,
> - stop open-coding mfn_eq() in code being touched or adjacent code,
> - drop local variables when they're no longer used at least twice.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

While I agree this is the current behaviour, it's another spiketrap
waiting to happen.

There needs to be a rename of these macros to something like
FOREACH_PRESENT_SL?E(...).  The SHADOW prefix is a bit verbose and can
be intelligibly encoded within the L?E acronym.

After which this patch will be trivially correct, and the resulting code will
read safely in context.

>
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -3647,13 +3632,10 @@ int cf_check sh_remove_l1_shadow(struct
>  {
>      shadow_l2e_t *sl2e;
>      int done = 0;
> -    int flags;
>  
>      SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, done, d,
>      {
> -        flags = shadow_l2e_get_flags(*sl2e);
> -        if ( (flags & _PAGE_PRESENT)
> -             && (mfn_x(shadow_l2e_get_mfn(*sl2e)) == mfn_x(sl1mfn)) )
> +        if ( mfn_x(shadow_l2e_get_mfn(*sl2e)) == mfn_x(sl1mfn) )

You've done mfn_eq() conversions thus far, but the final 3 hunks of the
patch could do with a conversion too.

With that, and subject to a suitable rename of the macros ahead of this
change, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04/16] x86/shadow: replace memcmp() in sh_resync_l1()
  2023-03-22  9:31 ` [PATCH 04/16] x86/shadow: replace memcmp() in sh_resync_l1() Jan Beulich
@ 2023-03-23 12:55   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 12:55 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:31 am, Jan Beulich wrote:
> Ordinary scalar operations are used in a multitude of other places, so
> do so here as well. In fact take the opportunity and drop a local
> variable then as well, first and foremost to get rid of a bogus cast.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 06/16] x86/shadow: purge {write,cmpxchg}_guest_entry() hooks
  2023-03-22  9:32 ` [PATCH 06/16] x86/shadow: purge {write,cmpxchg}_guest_entry() hooks Jan Beulich
@ 2023-03-23 13:13   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 13:13 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:32 am, Jan Beulich wrote:
> --- a/xen/arch/x86/pv/mm.h
> +++ b/xen/arch/x86/pv/mm.h
> @@ -32,6 +34,35 @@ static inline l1_pgentry_t guest_get_eff
>  }
>  
>  /*
> + * Write a new value into the guest pagetable, and update the
> + * paging-assistance state appropriately.  Returns false if we page-faulted,
> + * true for success.
> + */

I know you're just moving the comments as-are, but more than half of
this is definitely wrong now, and another part is wholly redundant.

"Write a new value into the guest pagetable" is about the best I can
think of, but it is borderline whether it even needs a comment.

> +static inline void paging_write_guest_entry(
> +    struct vcpu *v, intpte_t *p, intpte_t new, mfn_t gmfn)
> +{
> +    if ( unlikely(paging_mode_shadow(v->domain)) )
> +        shadow_write_guest_entry(v, p, new, gmfn);
> +    else
> +        write_atomic(p, new);
> +}
> +
> +
> +/*
> + * Cmpxchg a new value into the guest pagetable, and update the
> + * paging-assistance state appropriately.  Returns false if we page-faulted,
> + * true if not.  N.B. caller should check the value of "old" to see if the
> + * cmpxchg itself was successful.
> + */

"Compare and exchange a guest pagetable entry.  Returns the old value." 
We don't need to teach people how to use cmpxchg as a primitive here...

The comment next to shadow_cmpxchg_guest_entry() ideally wants the
grammar fix in the first clause too, and this is probably the right
patch to do it in.

For the content of the change, definitely an improvement.  With the
comments suitably adjusted, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault()
  2023-03-22  9:33 ` [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
@ 2023-03-23 13:34   ` Andrew Cooper
  2023-03-27 15:39   ` Tim Deegan
  1 sibling, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 13:34 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:33 am, Jan Beulich wrote:
> There's no need for an indirect call here, as the mode is invariant
> throughout the entire paging-locked region. All it takes to avoid it is
> to have a forward declaration of sh_update_cr3() in place.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I find this and the respective Win7 related comment suspicious: If we
> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> the shadow_get_and_create_l1e() rather than exit? The spurious page
> fault that the guest observes can, after all, not be known to be non-
> fatal inside the guest. That's purely an OS policy.
>
> Furthermore the sh_update_cr3() will also invalidate L3 entries which
> were loaded successfully before, but invalidated by the guest
> afterwards. I strongly suspect that the described hardware behavior is
> _only_ to load previously not-present entries from the PDPT, but not
> purge ones already marked present. IOW I think sh_update_cr3() would
> need calling in an "incremental" mode here. (The alternative of doing
> this in shadow_get_and_create_l3e() instead would likely be more
> cumbersome.)
>
> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
> looks wrong.
>
> Beyond the "on demand" L3 entry creation I also can't see what guest
> actions could lead to the ASSERT() being inapplicable in the PAE case.
> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
> PDPTEs, and all other logic is similar to that for other modes.
>
> (See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
> succeed but shadow walk fails"].)

I recall that there was a complicated bug, ultimately discovered because
Win7 changed behaviour vs older versions, and the shadow logic had been
written to AMD's PAE behaviour, not Intel's.

Remember that Intel and AMD differ in how PAE paging works between
root and non-root mode, and it is to do with whether all PDPTRs get
cached at once, or on demand.

Off the top of my head:
* 32bit PV guests get on-demand semantics (as they're really 4-level)
* VT-x strictly uses architectural "PDPTRs get cached on mov to CR3"
semantics
* SVM with NPT has on-demand semantics
* SVM with shadow is model-specific as to which semantics it uses, IIRC
Fam15h and later are on-demand

These differences still manifest as bugs in the common HVM code, the PTE
caching code, and the pagewalk code.

Looking at the comment specifically, I'm pretty sure it's wrong.  I
think that suggests we've got even more PDPTR bugs than I'd previously
identified.  In some copious free time, I do need to extend the
pagetable-emul XTF test to include PDPTR updates because it's the one
area of pagewalking that doesn't have any suitable testing right now.

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 09/16] x86/shadow: OOS mode is HVM-only
  2023-03-22  9:34 ` [PATCH 09/16] x86/shadow: OOS mode is HVM-only Jan Beulich
@ 2023-03-23 13:42   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 13:42 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:34 am, Jan Beulich wrote:
> XEN_DOMCTL_CDF_oos_off is forced set for PV domains, so the logic can't
> ever be engaged for them. Conditionalize respective fields and remove
> the respective bit from SHADOW_OPTIMIZATIONS when !HVM. As a result the
> SH_type_oos_snapshot constant can disappear altogether in that case, and
> a couple of #ifdef-s can also be dropped/combined.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/16] x86/shadow: move OOS functions to their own file
  2023-03-22  9:35 ` [PATCH 10/16] x86/shadow: move OOS functions to their own file Jan Beulich
@ 2023-03-23 14:30   ` Andrew Cooper
  2023-03-23 15:40     ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 14:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:35 am, Jan Beulich wrote:
> --- /dev/null
> +++ b/xen/arch/x86/mm/shadow/oos.c
> @@ -0,0 +1,603 @@
> +/* SPDX-License-Identifier: GPL-2.0 */

GPL-2.0-only

The unqualified form is deprecated.

> +/******************************************************************************
> + * arch/x86/mm/shadow/oos.c
> + *
> + * Shadow code dealing with out-of-sync shadows.
> + * Parts of this code are Copyright (c) 2006 by XenSource Inc.
> + * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
> + * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
> + */
> +
> +#include "private.h"
> +
> +#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
> +
> +#include <xen/trace.h>
> +
> +#include <asm/shadow.h>
> +
> +/*
> + * From time to time, we let a shadowed pagetable page go out of sync
> + * with its shadow: the guest is allowed to write directly to the page,
> + * and those writes are not synchronously reflected in the shadow.
> + * This lets us avoid many emulations if the guest is writing a lot to a
> + * pagetable, but it relaxes a pretty important invariant in the shadow
> + * pagetable design.  Therefore, some rules:
> + *
> + * 1. Only L1 pagetables may go out of sync: any page that is shadowed
> + *    at at higher level must be synchronously updated.  This makes
> + *    using linear shadow pagetables much less dangerous.
> + *    That means that: (a) unsyncing code needs to check for higher-level
> + *    shadows, and (b) promotion code needs to resync.
> + *
> + * 2. All shadow operations on a guest page require the page to be brought
> + *    back into sync before proceeding.  This must be done under the
> + *    paging lock so that the page is guaranteed to remain synced until
> + *    the operation completes.
> + *
> + *    Exceptions to this rule: the pagefault and invlpg handlers may
> + *    update only one entry on an out-of-sync page without resyncing it.
> + *
> + * 3. Operations on shadows that do not start from a guest page need to
> + *    be aware that they may be handling an out-of-sync shadow.
> + *
> + * 4. Operations that do not normally take the paging lock (fast-path
> + *    #PF handler, INVLPG) must fall back to a locking, syncing version
> + *    if they see an out-of-sync table.
> + *
> + * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
> + *    must explicitly resync all relevant pages or update their
> + *    shadows.
> + *
> + * Currently out-of-sync pages are listed in a simple open-addressed
> + * hash table with a second chance (must resist temptation to radically
> + * over-engineer hash tables...)  The virtual address of the access
> + * which caused us to unsync the page is also kept in the hash table, as
> + * a hint for finding the writable mappings later.
> + *
> + * We keep a hash per vcpu, because we want as much as possible to do
> + * the re-sync on the save vcpu we did the unsync on, so the VA hint
> + * will be valid.
> + */
> +
> +#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES_FULL
> +void sh_oos_audit(struct domain *d)
> +{
> +    unsigned int idx, expected_idx, expected_idx_alt;
> +    struct page_info *pg;
> +    struct vcpu *v;
> +
> +    for_each_vcpu(d, v)
> +    {
> +        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
> +        {
> +            mfn_t *oos = v->arch.paging.shadow.oos;

Newline.

But the variable placement is weird.  oos ought to be one scope further
out to prevent recalculation in the for() loop, while pg and the two
expected could be at the inner-most scope.
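
I.e. something along these lines (a sketch only, with the checks
themselves left as in the hunk above):

    for_each_vcpu(d, v)
    {
        mfn_t *oos = v->arch.paging.shadow.oos;

        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
        {
            unsigned int expected_idx, expected_idx_alt;
            struct page_info *pg;

            if ( mfn_eq(oos[idx], INVALID_MFN) )
                continue;
            ...
        }
    }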

> +            if ( mfn_eq(oos[idx], INVALID_MFN) )
> +                continue;
> +
> +            expected_idx = mfn_x(oos[idx]) % SHADOW_OOS_PAGES;
> +            expected_idx_alt = ((expected_idx + 1) % SHADOW_OOS_PAGES);
> +            if ( idx != expected_idx && idx != expected_idx_alt )
> +            {
> +                printk("%s: idx %x contains gmfn %lx, expected at %x or %x.\n",
> +                       __func__, idx, mfn_x(oos[idx]),
> +                       expected_idx, expected_idx_alt);
> +                BUG();
> +            }
> +            pg = mfn_to_page(oos[idx]);
> +            if ( !(pg->count_info & PGC_shadowed_pt) )
> +            {
> +                printk("%s: idx %x gmfn %lx not a pt (count %lx)\n",
> +                       __func__, idx, mfn_x(oos[idx]), pg->count_info);
> +                BUG();
> +            }
> +            if ( !(pg->shadow_flags & SHF_out_of_sync) )
> +            {
> +                printk("%s: idx %x gmfn %lx not marked oos (flags %x)\n",
> +                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
> +                BUG();
> +            }
> +            if ( (pg->shadow_flags & SHF_page_type_mask & ~SHF_L1_ANY) )
> +            {
> +                printk("%s: idx %x gmfn %lx shadowed as non-l1 (flags %x)\n",
> +                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
> +                BUG();
> +            }
> +        }
> +    }
> +}
> +#endif
> +
> +#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
> +void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    struct vcpu *v;
> +    mfn_t *oos;
> +
> +    ASSERT(mfn_is_out_of_sync(gmfn));
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;

Same for oos and idx here, which would shrink this function overall.

As a tangent, do we really want all these modulo 3's all over the
place?  It's a lot of divide instructions in paths that are fast-ish for
shadow guests.

> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +
> +        if ( mfn_eq(oos[idx], gmfn) )
> +            return;
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +#endif
> +
> +/* Update the shadow, but keep the page out of sync. */
> +static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)

inline can go.

> +{
> +    struct page_info *pg = mfn_to_page(gmfn);
> +
> +    ASSERT(mfn_valid(gmfn));
> +    ASSERT(page_is_out_of_sync(pg));
> +
> +    /* Call out to the appropriate per-mode resyncing function */
> +    if ( pg->shadow_flags & SHF_L1_32 )
> +        SHADOW_INTERNAL_NAME(sh_resync_l1, 2)(v, gmfn, snpmfn);
> +    else if ( pg->shadow_flags & SHF_L1_PAE )
> +        SHADOW_INTERNAL_NAME(sh_resync_l1, 3)(v, gmfn, snpmfn);
> +    else if ( pg->shadow_flags & SHF_L1_64 )
> +        SHADOW_INTERNAL_NAME(sh_resync_l1, 4)(v, gmfn, snpmfn);
> +}
> +
> +static int sh_remove_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
> +                                            mfn_t smfn, unsigned long off)
> +{
> +    ASSERT(mfn_valid(smfn));
> +    ASSERT(mfn_valid(gmfn));
> +
> +    switch ( mfn_to_page(smfn)->u.sh.type )
> +    {
> +    case SH_type_l1_32_shadow:
> +    case SH_type_fl1_32_shadow:
> +        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
> +            (d, gmfn, smfn, off);
> +
> +    case SH_type_l1_pae_shadow:
> +    case SH_type_fl1_pae_shadow:
> +        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
> +            (d, gmfn, smfn, off);
> +
> +    case SH_type_l1_64_shadow:
> +    case SH_type_fl1_64_shadow:
> +        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 4)
> +            (d, gmfn, smfn, off);
> +
> +    default:
> +        return 0;
> +    }
> +}
> +
> +/*
> + * Fixup arrays: We limit the maximum number of writable mappings to
> + * SHADOW_OOS_FIXUPS and store enough information to remove them
> + * quickly on resync.
> + */
> +
> +static inline int oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
> +                                       struct oos_fixup *fixup)

inline

> +{
> +    struct domain *d = v->domain;
> +    int i;
> +    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
> +    {
> +        if ( !mfn_eq(fixup->smfn[i], INVALID_MFN) )
> +        {
> +            sh_remove_write_access_from_sl1p(d, gmfn,
> +                                             fixup->smfn[i],
> +                                             fixup->off[i]);
> +            fixup->smfn[i] = INVALID_MFN;
> +        }
> +    }
> +
> +    /* Always flush the TLBs. See comment on oos_fixup_add(). */
> +    return 1;
> +}

This looks suspiciously like it ought to be a void function.  [edit, yes
- see later]

> +
> +void oos_fixup_add(struct domain *d, mfn_t gmfn,
> +                   mfn_t smfn,  unsigned long off)
> +{
> +    int idx, next;
> +    mfn_t *oos;
> +    struct oos_fixup *oos_fixup;
> +    struct vcpu *v;
> +
> +    perfc_incr(shadow_oos_fixup_add);
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        oos_fixup = v->arch.paging.shadow.oos_fixup;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            int i;
> +            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )

This is a case where "for ( int i = ..." would definitely read better. 
Luckily, this example is simple enough that the compiler has already
optimised properly.

> +            {
> +                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
> +                     && (oos_fixup[idx].off[i] == off) )

Given that you mention style in the commit message, it would be nice to
move the && onto the previous line.
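
Taken together with the "for ( int i = ..." remark further up, that part
might then read (sketch):

    for ( int i = 0; i < SHADOW_OOS_FIXUPS; i++ )
    {
        if ( mfn_eq(oos_fixup[idx].smfn[i], smfn) &&
             (oos_fixup[idx].off[i] == off) )
            return;
    }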

> +                    return;
> +            }
> +
> +            next = oos_fixup[idx].next;
> +
> +            if ( !mfn_eq(oos_fixup[idx].smfn[next], INVALID_MFN) )
> +            {
> +                TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_EVICT);
> +
> +                /* Reuse this slot and remove current writable mapping. */
> +                sh_remove_write_access_from_sl1p(d, gmfn,
> +                                                 oos_fixup[idx].smfn[next],
> +                                                 oos_fixup[idx].off[next]);
> +                perfc_incr(shadow_oos_fixup_evict);
> +                /*
> +                 * We should flush the TLBs now, because we removed a
> +                 * writable mapping, but since the shadow is already
> +                 * OOS we have no problem if another vcpu write to
> +                 * this page table. We just have to be very careful to
> +                 * *always* flush the tlbs on resync.
> +                 */
> +            }
> +
> +            oos_fixup[idx].smfn[next] = smfn;
> +            oos_fixup[idx].off[next] = off;
> +            oos_fixup[idx].next = (next + 1) % SHADOW_OOS_FIXUPS;
> +
> +            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_ADD);
> +            return;
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +static int oos_remove_write_access(struct vcpu *v, mfn_t gmfn,
> +                                   struct oos_fixup *fixup)
> +{
> +    struct domain *d = v->domain;
> +    int ftlb = 0;
> +
> +    ftlb |= oos_fixup_flush_gmfn(v, gmfn, fixup);

Oof yes.

With oos_fixup_flush_gmfn() changed to being void, it is now obvious
that ftlb is unconditionally 1 throughout this function, which can be
simplified to just:

    oos_fixup_flush_gmfn(v, gmfn, fixup);
    if ( sh_remove_write_access(d, gmfn, 0, 0) == -1 )
    {
        shadow_remove_all_shadows(d, gmfn);
        return 1;
    }

    guest_flush_tlb_mask(d, d->dirty_cpumask);
    return 0;

Maybe we don't want to go that far, but it is overly complex in its
current form.

> +
> +    switch ( sh_remove_write_access(d, gmfn, 0, 0) )
> +    {
> +    default:
> +    case 0:
> +        break;
> +
> +    case 1:
> +        ftlb |= 1;
> +        break;
> +
> +    case -1:
> +        /*
> +         * An unfindable writeable typecount has appeared, probably via a
> +         * grant table entry: can't shoot the mapping, so try to unshadow
> +         * the page.  If that doesn't work either, the guest is granting
> +         * his pagetables and must be killed after all.
> +         * This will flush the tlb, so we can return with no worries.
> +         */
> +        shadow_remove_all_shadows(d, gmfn);
> +        return 1;
> +    }
> +
> +    if ( ftlb )
> +        guest_flush_tlb_mask(d, d->dirty_cpumask);
> +
> +    return 0;
> +}
> +
> +static inline void trace_resync(int event, mfn_t gmfn)

inline, and this reminds me that I *still* need to sort my series to
avoid stack rubble leakage in the trace subsystem.

"int" event really ought to become unsigned int, but it doesn't matter
in this case because the timestamp (bit 31) doesn't need setting.

> +{
> +    if ( tb_init_done )
> +    {
> +        /* Convert gmfn to gfn */
> +        gfn_t gfn = mfn_to_gfn(current->domain, gmfn);
> +
> +        __trace_var(event, 0/*!tsc*/, sizeof(gfn), &gfn);
> +    }
> +}
> +
> +/* Pull all the entries on an out-of-sync page back into sync. */
> +static void _sh_resync(struct vcpu *v, mfn_t gmfn,
> +                       struct oos_fixup *fixup, mfn_t snp)
> +{
> +    struct page_info *pg = mfn_to_page(gmfn);
> +
> +    ASSERT(paging_locked_by_me(v->domain));
> +    ASSERT(mfn_is_out_of_sync(gmfn));
> +    /* Guest page must be shadowed *only* as L1 when out of sync. */
> +    ASSERT(!(mfn_to_page(gmfn)->shadow_flags & SHF_page_type_mask
> +             & ~SHF_L1_ANY));
> +    ASSERT(!sh_page_has_multiple_shadows(mfn_to_page(gmfn)));
> +
> +    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
> +
> +    /* Need to pull write access so the page *stays* in sync. */
> +    if ( oos_remove_write_access(v, gmfn, fixup) )
> +    {
> +        /* Page has been unshadowed. */
> +        return;
> +    }
> +
> +    /* No more writable mappings of this page, please */
> +    pg->shadow_flags &= ~SHF_oos_may_write;
> +
> +    /* Update the shadows with current guest entries. */
> +    _sh_resync_l1(v, gmfn, snp);
> +
> +    /* Now we know all the entries are synced, and will stay that way */
> +    pg->shadow_flags &= ~SHF_out_of_sync;
> +    perfc_incr(shadow_resync);
> +    trace_resync(TRC_SHADOW_RESYNC_FULL, gmfn);
> +}
> +
> +/* Add an MFN to the list of out-of-sync guest pagetables */
> +static void oos_hash_add(struct vcpu *v, mfn_t gmfn)
> +{
> +    int i, idx, oidx, swap = 0;
> +    mfn_t *oos = v->arch.paging.shadow.oos;
> +    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
> +    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
> +    struct oos_fixup fixup = { .next = 0 };
> +
> +    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
> +        fixup.smfn[i] = INVALID_MFN;
> +
> +    idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +    oidx = idx;
> +
> +    if ( !mfn_eq(oos[idx], INVALID_MFN)
> +         && (mfn_x(oos[idx]) % SHADOW_OOS_PAGES) == idx )
> +    {
> +        /* Punt the current occupant into the next slot */
> +        SWAP(oos[idx], gmfn);
> +        SWAP(oos_fixup[idx], fixup);
> +        swap = 1;
> +        idx = (idx + 1) % SHADOW_OOS_PAGES;
> +    }
> +    if ( !mfn_eq(oos[idx], INVALID_MFN) )
> +    {
> +        /* Crush the current occupant. */
> +        _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
> +        perfc_incr(shadow_unsync_evict);
> +    }
> +    oos[idx] = gmfn;
> +    oos_fixup[idx] = fixup;
> +
> +    if ( swap )
> +        SWAP(oos_snapshot[idx], oos_snapshot[oidx]);
> +
> +    copy_domain_page(oos_snapshot[oidx], oos[oidx]);
> +}
> +
> +/* Remove an MFN from the list of out-of-sync guest pagetables */
> +void oos_hash_remove(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    mfn_t *oos;
> +    struct vcpu *v;
> +
> +    SHADOW_PRINTK("d%d gmfn %lx\n", d->domain_id, mfn_x(gmfn));
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            oos[idx] = INVALID_MFN;
> +            return;
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    mfn_t *oos;
> +    mfn_t *oos_snapshot;
> +    struct vcpu *v;
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            return oos_snapshot[idx];
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +/* Pull a single guest page back into sync */
> +void sh_resync(struct domain *d, mfn_t gmfn)
> +{
> +    int idx;
> +    mfn_t *oos;
> +    mfn_t *oos_snapshot;
> +    struct oos_fixup *oos_fixup;
> +    struct vcpu *v;
> +
> +    for_each_vcpu(d, v)
> +    {
> +        oos = v->arch.paging.shadow.oos;
> +        oos_fixup = v->arch.paging.shadow.oos_fixup;
> +        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> +        if ( !mfn_eq(oos[idx], gmfn) )
> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
> +
> +        if ( mfn_eq(oos[idx], gmfn) )
> +        {
> +            _sh_resync(v, gmfn, &oos_fixup[idx], oos_snapshot[idx]);
> +            oos[idx] = INVALID_MFN;
> +            return;
> +        }
> +    }
> +
> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
> +           mfn_x(gmfn));
> +    BUG();
> +}
> +
> +/*
> + * Figure out whether it's definitely safe not to sync this l1 table,
> + * by making a call out to the mode in which that shadow was made.
> + */
> +static int sh_skip_sync(struct vcpu *v, mfn_t gl1mfn)
> +{
> +    struct page_info *pg = mfn_to_page(gl1mfn);

Newline here, and ...

> +    if ( pg->shadow_flags & SHF_L1_32 )
> +        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 2)(v, gl1mfn);
> +    else if ( pg->shadow_flags & SHF_L1_PAE )
> +        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 3)(v, gl1mfn);
> +    else if ( pg->shadow_flags & SHF_L1_64 )
> +        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 4)(v, gl1mfn);

here IMO.

> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not shadowed as an l1\n",
> +           mfn_x(gl1mfn));
> +    BUG();
> +}
> +
> +/*
> + * Pull all out-of-sync pages back into sync.  Pages brought out of sync
> + * on other vcpus are allowed to remain out of sync, but their contents
> + * will be made safe (TLB flush semantics); pages unsynced by this vcpu
> + * are brought back into sync and write-protected.  If skip != 0, we try
> + * to avoid resyncing at all if we think we can get away with it.
> + */
> +void sh_resync_all(struct vcpu *v, int skip, int this, int others)

This is begging to become

void sh_resync_all(struct vcpu *v, unsigned int flags)

because, if nothing else, it changes the callers to be:

sh_resync_all(v, RESYNC_THIS | RESYNC_OTHERS);
sh_resync_all(v, RESYNC_THIS);
sh_resync_all(v, RESYNC_SKIP | RESYNC_THIS);

which is far more readable than the raw numbers currently there.
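
For illustration, the flag side of such a conversion might look like this
(a sketch; bit positions picked arbitrarily here):

    #define RESYNC_SKIP    (1u << 0)
    #define RESYNC_THIS    (1u << 1)
    #define RESYNC_OTHERS  (1u << 2)

    void sh_resync_all(struct vcpu *v, unsigned int flags);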

I don't mind doing the patch, but I'll probably wait until you've got
this committed to avoid churn.

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/16] x86/shadow: move OOS functions to their own file
  2023-03-23 14:30   ` Andrew Cooper
@ 2023-03-23 15:40     ` Jan Beulich
  2023-03-23 17:44       ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-23 15:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan, xen-devel

On 23.03.2023 15:30, Andrew Cooper wrote:
> On 22/03/2023 9:35 am, Jan Beulich wrote:
>> --- /dev/null
>> +++ b/xen/arch/x86/mm/shadow/oos.c
>> @@ -0,0 +1,603 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
> 
> GPL-2.0-only
> 
> The unqualified form in deprecated.
> 
>> +/******************************************************************************
>> + * arch/x86/mm/shadow/oos.c
>> + *
>> + * Shadow code dealing with out-of-sync shadows.
>> + * Parts of this code are Copyright (c) 2006 by XenSource Inc.
>> + * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
>> + * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
>> + */
>> +
>> +#include "private.h"
>> +
>> +#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
>> +
>> +#include <xen/trace.h>
>> +
>> +#include <asm/shadow.h>
>> +
>> +/*
>> + * From time to time, we let a shadowed pagetable page go out of sync
>> + * with its shadow: the guest is allowed to write directly to the page,
>> + * and those writes are not synchronously reflected in the shadow.
>> + * This lets us avoid many emulations if the guest is writing a lot to a
>> + * pagetable, but it relaxes a pretty important invariant in the shadow
>> + * pagetable design.  Therefore, some rules:
>> + *
>> + * 1. Only L1 pagetables may go out of sync: any page that is shadowed
>> + *    at at higher level must be synchronously updated.  This makes
>> + *    using linear shadow pagetables much less dangerous.
>> + *    That means that: (a) unsyncing code needs to check for higher-level
>> + *    shadows, and (b) promotion code needs to resync.
>> + *
>> + * 2. All shadow operations on a guest page require the page to be brought
>> + *    back into sync before proceeding.  This must be done under the
>> + *    paging lock so that the page is guaranteed to remain synced until
>> + *    the operation completes.
>> + *
>> + *    Exceptions to this rule: the pagefault and invlpg handlers may
>> + *    update only one entry on an out-of-sync page without resyncing it.
>> + *
>> + * 3. Operations on shadows that do not start from a guest page need to
>> + *    be aware that they may be handling an out-of-sync shadow.
>> + *
>> + * 4. Operations that do not normally take the paging lock (fast-path
>> + *    #PF handler, INVLPG) must fall back to a locking, syncing version
>> + *    if they see an out-of-sync table.
>> + *
>> + * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
>> + *    must explicitly resync all relevant pages or update their
>> + *    shadows.
>> + *
>> + * Currently out-of-sync pages are listed in a simple open-addressed
>> + * hash table with a second chance (must resist temptation to radically
>> + * over-engineer hash tables...)  The virtual address of the access
>> + * which caused us to unsync the page is also kept in the hash table, as
>> + * a hint for finding the writable mappings later.
>> + *
>> + * We keep a hash per vcpu, because we want as much as possible to do
>> + * the re-sync on the save vcpu we did the unsync on, so the VA hint
>> + * will be valid.
>> + */
>> +
>> +#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES_FULL
>> +void sh_oos_audit(struct domain *d)
>> +{
>> +    unsigned int idx, expected_idx, expected_idx_alt;
>> +    struct page_info *pg;
>> +    struct vcpu *v;
>> +
>> +    for_each_vcpu(d, v)
>> +    {
>> +        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
>> +        {
>> +            mfn_t *oos = v->arch.paging.shadow.oos;
> 
> Newline.

I'm happy to add the newlines you're asking for (also below). But ...

> But the variable placement is weird.  oos ought to be one scope further
> out to prevent recalculation in the for() loop, while pg and the two
> expected could be at the inner-most scope.

... I don't want to go farther than that - see "but leave the code
otherwise untouched" in the description. This also goes for several
of your requests further down.

>> +#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
>> +void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
>> +{
>> +    int idx;
>> +    struct vcpu *v;
>> +    mfn_t *oos;
>> +
>> +    ASSERT(mfn_is_out_of_sync(gmfn));
>> +
>> +    for_each_vcpu(d, v)
>> +    {
>> +        oos = v->arch.paging.shadow.oos;
>> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
> 
> Same for oos and idx here, which would shrink this function overall.
> 
> As a tangent, do we really want all these modulo 3's all over the
> place?  It's a lot of divide instructions in paths that are fast-ish for
> shadow guests.

I don't think the compiler translates division by constant to DIV / IDIV.
It's multiplication by suitable "inverse" and then using the top bits of
the result iirc.
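
As a standalone illustration (not Xen code, and assuming a GCC-style
unsigned __int128) of what that looks like for the unsigned 64-bit case
with SHADOW_OOS_PAGES == 3:

    /*
     * x % 3 without a DIV: q = x / 3 is obtained as the high half of a
     * 128-bit multiply by 0xAAAAAAAAAAAAAAAB (i.e. ceil(2^65 / 3)),
     * shifted right by one more bit; the remainder is then x - q * 3.
     */
    static unsigned int mod3(unsigned long x)
    {
        unsigned __int128 prod = (unsigned __int128)x * 0xAAAAAAAAAAAAAAABULL;
        unsigned long q = (unsigned long)(prod >> 64) >> 1;

        return x - q * 3;
    }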

>> +        if ( !mfn_eq(oos[idx], gmfn) )
>> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
>> +
>> +        if ( mfn_eq(oos[idx], gmfn) )
>> +            return;
>> +    }
>> +
>> +    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
>> +           mfn_x(gmfn));
>> +    BUG();
>> +}
>> +#endif
>> +
>> +/* Update the shadow, but keep the page out of sync. */
>> +static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)
> 
> inline can go.

I'm feeling on the edge with "inline". I'd prefer to leave them to keep
"untouched" reasonable true, but if you insist I'd be willing to include
their dropping.

>> +void oos_fixup_add(struct domain *d, mfn_t gmfn,
>> +                   mfn_t smfn,  unsigned long off)
>> +{
>> +    int idx, next;
>> +    mfn_t *oos;
>> +    struct oos_fixup *oos_fixup;
>> +    struct vcpu *v;
>> +
>> +    perfc_incr(shadow_oos_fixup_add);
>> +
>> +    for_each_vcpu(d, v)
>> +    {
>> +        oos = v->arch.paging.shadow.oos;
>> +        oos_fixup = v->arch.paging.shadow.oos_fixup;
>> +        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
>> +        if ( !mfn_eq(oos[idx], gmfn) )
>> +            idx = (idx + 1) % SHADOW_OOS_PAGES;
>> +        if ( mfn_eq(oos[idx], gmfn) )
>> +        {
>> +            int i;
>> +            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
> 
> This is a case where "for ( int i = ..." would definitely read better. 
> Luckily, this example is simple enough that the compiler has already
> optimised properly.
> 
>> +            {
>> +                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
>> +                     && (oos_fixup[idx].off[i] == off) )
> 
> Given that you mention style in the commit message, it would be nice to
> move the && onto the previous line.

Sure, done (and there was a 2nd instance).

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/16] x86/shadow: move OOS functions to their own file
  2023-03-23 15:40     ` Jan Beulich
@ 2023-03-23 17:44       ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 17:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan, xen-devel

On 23/03/2023 3:40 pm, Jan Beulich wrote:
> On 23.03.2023 15:30, Andrew Cooper wrote:
>> On 22/03/2023 9:35 am, Jan Beulich wrote:
>>> +            {
>>> +                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
>>> +                     && (oos_fixup[idx].off[i] == off) )
>> Given that you mention style in the commit message, it would be nice to
>> move the && onto the previous line.
> Sure, done (and there was a 2nd instance).

There are plenty more.  I just stopped reporting them.

I guess if its "just" whitespace fixes (newlines, rearranging of
operands/indents) then that's fine.  Everything else can come later, and
there is plenty.

Subject to the SPDX tweak, Acked-by: Andrew Cooper
<andrew.cooper3@citrix.com>

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible
  2023-03-22  9:35 ` [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible Jan Beulich
@ 2023-03-23 18:18   ` Andrew Cooper
  2023-03-24  7:38     ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 18:18 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:35 am, Jan Beulich wrote:
> Emulation related functions are involved in HVM handling only, and in
> some cases they even invoke such checks after having already done things
> which are valid for HVM domains only. OOS active also implies HVM. In
> sh_remove_all_mappings() one of the two checks is redundant with an
> earlier paging_mode_external() one (the other, however, needs to stay).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1522,7 +1522,7 @@ int sh_remove_all_mappings(struct domain
>                 && (page->count_info & PGC_count_mask) <= 3
>                 && ((page->u.inuse.type_info & PGT_count_mask)
>                     == (is_special_page(page) ||
> -                       (is_hvm_domain(d) && is_ioreq_server_page(d, page))))) )
> +                       is_ioreq_server_page(d, page)))) )
>              printk(XENLOG_G_ERR "can't find all mappings of mfn %"PRI_mfn
>                     " (gfn %"PRI_gfn"): c=%lx t=%lx s=%d i=%d\n",
>                     mfn_x(gmfn), gfn_x(gfn),

Out of context here needs an equivalent adjustment.

But in this case, I'm not sure the commit message covers the relevant
details.  ioreq servers have been made fully common since this code was
written, and *that* is a better reason for dropping the predicates IMO
than the redundancy with paging_mode_external().

That said...  I'm not sure the logic here is correct any more.  It used
to be the case that ioreq pages were in the p2m, but they're outside of
the p2m these days, so I don't see how there can be any interaction with
unexpected refcounts any more.

I suspect that one way or another, this change wants to be in a separate
patch.

> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -3441,7 +3441,7 @@ int sh_rm_write_access_from_sl1p(struct
>  
>  #ifdef CONFIG_HVM
>      /* Remember if we've been told that this process is being torn down */
> -    if ( curr->domain == d && is_hvm_domain(d) )
> +    if ( curr->domain == d )
>          curr->arch.paging.shadow.pagetable_dying
>              = mfn_to_page(gmfn)->pagetable_dying;
>  #endif

This one is dangerous.

After tracing, I can see that sh_rm_write_access_from_sl1p() is only
called from OOS functions, but this function itself does its very best
to look like it has mixed PV + HVM usage, and dropping this conditional
means that pagetable_dying can, in principle at least, become non-NULL
for a PV guest.

I think this function needs to be made far more obviously HVM-only first.

> --- a/xen/arch/x86/mm/shadow/oos.c
> +++ b/xen/arch/x86/mm/shadow/oos.c
> @@ -577,7 +577,6 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
>      if ( pg->shadow_flags &
>           ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
>           || sh_page_has_multiple_shadows(pg)
> -         || !is_hvm_vcpu(v)
>           || !v->domain->arch.paging.shadow.oos_active )

This is reachable for PV guests as far as I can see.  What am I missing ?

The changes in hvm.c are all fine, and for those alone, consider it R-by
if you end up splitting the patch.

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent
  2023-03-22  9:35 ` [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent Jan Beulich
@ 2023-03-23 18:28   ` Andrew Cooper
  2023-03-24  7:52     ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 18:28 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:35 am, Jan Beulich wrote:
> While benign at present, it is still a little fragile to operate on a
> wrong "old_mode" value in sh_update_paging_modes(). This can happen when
> no monitor table was present initially - we'd create one for the new
> mode without updating old_mode. Correct this two ways, each of which

I think you mean "Correct this in two ways" ?

> would be sufficient on its own: Once by adding "else" to the second of
> the involved if()s in the function, and then by setting the correct
> initial mode for HVM domains in shadow_vcpu_init().
>
> Further use the same predicate (paging_mode_external()) consistently
> when dealing with shadow mode init/update/cleanup, rather than a mix of
> is_hvm_vcpu() (init), is_hvm_domain() (update), and
> paging_mode_external() (cleanup).
>
> Finally drop a redundant is_hvm_domain() from inside the bigger if()
> (which is being converted to paging_mode_external()) in
> sh_update_paging_modes().
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -129,8 +129,8 @@ void shadow_vcpu_init(struct vcpu *v)
>      }
>  #endif
>  
> -    v->arch.paging.mode = is_hvm_vcpu(v) ?
> -                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 3) :
> +    v->arch.paging.mode = paging_mode_external(v->domain) ?
> +                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 2) :
>                            &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);

As you're changing this, reposition the ? and : to the start of the
following lines?
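
I.e. (sketch):

    v->arch.paging.mode = paging_mode_external(v->domain)
                          ? &SHADOW_INTERNAL_NAME(sh_paging_mode, 2)
                          : &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);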

But, is 2-level mode actually right?  It's better than 3 certainly, and
is what sh_update_paging_modes() selects, but isn't that only right for
the IDENT_PT case?

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 13/16] x86/shadow: vCPU-s never have "no mode"
  2023-03-22  9:36 ` [PATCH 13/16] x86/shadow: vCPU-s never have "no mode" Jan Beulich
@ 2023-03-23 18:29   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 18:29 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:36 am, Jan Beulich wrote:
> With an initial mode installed by shadow_vcpu_init(), there's no need
> for sh_update_paging_modes() to deal with the "mode is still unset"
> case. Leave an assertion, though.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 14/16] x86/shadow: "monitor table" is a HVM-only concept
  2023-03-22  9:37 ` [PATCH 14/16] x86/shadow: "monitor table" is a HVM-only concept Jan Beulich
@ 2023-03-23 18:30   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 18:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:37 am, Jan Beulich wrote:
> It looks like in the combination of aff8bf94ce65 ('x86/shadow: only
> 4-level guest code needs building when !HVM') and 0b841314dace
> ('x86/shadow: sh_{make,destroy}_monitor_table() are "even more" HVM-
> only') I didn't go quite far enough: SH_type_monitor_table is also
> effectively unused when !HVM.
>
> The assertion early in sh_destroy_shadow() can have the type dropped
> altogether: it shouldn't make it here in the first place. Pages of
> this type are freed directly from sh_destroy_monitor_table() only.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 16/16] x86/PV: conditionalize arch_set_info_guest()'s call to update_cr3()
  2023-03-22  9:38 ` [PATCH 16/16] x86/PV: conditionalize arch_set_info_guest()'s call to update_cr3() Jan Beulich
@ 2023-03-23 18:34   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2023-03-23 18:34 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 22/03/2023 9:38 am, Jan Beulich wrote:
> sh_update_paging_modes() as its last action already invokes
> sh_update_cr3(). Therefore there is no reason to invoke update_cr3()
> another time immediately after calling paging_update_paging_modes(),
> the more that sh_update_cr3() does not short-circuit the "nothing
> changed" case.

I've finally figured out why "more than" feels like weird grammar.  In
this case, "especially as sh_update_cr3() does not..." would be the more
normal expression.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible
  2023-03-23 18:18   ` Andrew Cooper
@ 2023-03-24  7:38     ` Jan Beulich
  2023-03-28 13:57       ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-24  7:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan, xen-devel

On 23.03.2023 19:18, Andrew Cooper wrote:
> On 22/03/2023 9:35 am, Jan Beulich wrote:
>> Emulation related functions are involved in HVM handling only, and in
>> some cases they even invoke such checks after having already done things
>> which are valid for HVM domains only. OOS active also implies HVM. In
>> sh_remove_all_mappings() one of the two checks is redundant with an
>> earlier paging_mode_external() one (the other, however, needs to stay).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/mm/shadow/common.c
>> +++ b/xen/arch/x86/mm/shadow/common.c
>> @@ -1522,7 +1522,7 @@ int sh_remove_all_mappings(struct domain
>>                 && (page->count_info & PGC_count_mask) <= 3
>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>                     == (is_special_page(page) ||
>> -                       (is_hvm_domain(d) && is_ioreq_server_page(d, page))))) )
>> +                       is_ioreq_server_page(d, page)))) )
>>              printk(XENLOG_G_ERR "can't find all mappings of mfn %"PRI_mfn
>>                     " (gfn %"PRI_gfn"): c=%lx t=%lx s=%d i=%d\n",
>>                     mfn_x(gmfn), gfn_x(gfn),
> 
> Out of context here needs an equivalent adjustment.

I'm afraid I don't see any further is_hvm_*() in this function.

> But in this case, I'm not sure the commit message covers the relevant
> details.  ioreq servers have been made fully common since this code was
> written, and *that* is a better reason for dropping the predicates IMO
> than the redundancy with paging_mode_external().

How does "fully common" matter? It's still a HVM-only thing, hence the
paging_mode_external() check just out of context. Also note that the
ioreq-server-page check is only one side of || (and I realize that by
correcting indentation here at the same time this might be more clearly
visible).

> That said...  I'm not sure the logic here is correct any more.  It used
> to be the case that ioreq pages were in the p2m, but they're outside of
> the p2m these days, so don't see how there can be any interaction with
> unexpected refcounts any more.
> 
> I suspect that one way or another, this change wants to be in a separate
> patch.

I think that if there are further adjustments to make (like dropping
is_ioreq_server_page() altogether, as you appear to suggest), that would
want to be in a separate patch, but the change as done fully fits the
given justification. (Of course in such a patch both _could_ also be
dropped at the same time.)

>> --- a/xen/arch/x86/mm/shadow/multi.c
>> +++ b/xen/arch/x86/mm/shadow/multi.c
>> @@ -3441,7 +3441,7 @@ int sh_rm_write_access_from_sl1p(struct
>>  
>>  #ifdef CONFIG_HVM
>>      /* Remember if we've been told that this process is being torn down */
>> -    if ( curr->domain == d && is_hvm_domain(d) )
>> +    if ( curr->domain == d )
>>          curr->arch.paging.shadow.pagetable_dying
>>              = mfn_to_page(gmfn)->pagetable_dying;
>>  #endif
> 
> This one is dangerous.
> 
> After tracing, I can see that sh_rm_write_access_from_sl1p() is only
> called from OOS functions, but this function itself does its very best
> to look like it has mixed PV + HVM usage, and dropping this conditional
> means that pagetable_dying can, in principle at least, become non-NULL
> for a PV guest.
> 
> I think this function needs to be made far more obviously HVM-only first.

Oh, sure - the #ifdef inside the functions can be replaced collectively
by one around it, now that OOS code is built separately and for HVM only.
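
I.e. roughly (a sketch, with the parameter list reconstructed from the
callers quoted earlier and the rest of the body elided):

    #ifdef CONFIG_HVM
    int sh_rm_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
                                     mfn_t smfn, unsigned long off)
    {
        ...
        /* Remember if we've been told that this process is being torn down */
        if ( curr->domain == d )
            curr->arch.paging.shadow.pagetable_dying
                = mfn_to_page(gmfn)->pagetable_dying;
        ...
    }
    #endif /* CONFIG_HVM */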

>> --- a/xen/arch/x86/mm/shadow/oos.c
>> +++ b/xen/arch/x86/mm/shadow/oos.c
>> @@ -577,7 +577,6 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
>>      if ( pg->shadow_flags &
>>           ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
>>           || sh_page_has_multiple_shadows(pg)
>> -         || !is_hvm_vcpu(v)
>>           || !v->domain->arch.paging.shadow.oos_active )
> 
> This is reachable for PV guests as far as I can see.  What am I missing ?

Well, the footnote in patch 1 ("x86/shadow: fix and improve
sh_page_has_multiple_shadows()") kind of explains this wrt the safety
of the sh_page_has_multiple_shadows() use here: Since PV guests can't
have OOS pages, there's no way SHF_out_of_sync could be set.

> The changes in hvm.c are all fine, and for those alone, consider it R-by
> if you end up splitting the patch.

Thanks, but for now I'm not meaning to split the patch, as per above.
There will be a new prereq patch as per your request.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent
  2023-03-23 18:28   ` Andrew Cooper
@ 2023-03-24  7:52     ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-24  7:52 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan, xen-devel

On 23.03.2023 19:28, Andrew Cooper wrote:
> On 22/03/2023 9:35 am, Jan Beulich wrote:
>> While benign at present, it is still a little fragile to operate on a
>> wrong "old_mode" value in sh_update_paging_modes(). This can happen when
>> no monitor table was present initially - we'd create one for the new
>> mode without updating old_mode. Correct this two ways, each of which
> 
> I think you mean "Correct this in two ways" ?
> 
>> would be sufficient on its own: Once by adding "else" to the second of
>> the involved if()s in the function, and then by setting the correct
>> initial mode for HVM domains in shadow_vcpu_init().
>>
>> Further use the same predicate (paging_mode_external()) consistently
>> when dealing with shadow mode init/update/cleanup, rather than a mix of
>> is_hvm_vcpu() (init), is_hvm_domain() (update), and
>> paging_mode_external() (cleanup).
>>
>> Finally drop a redundant is_hvm_domain() from inside the bigger if()
>> (which is being converted to paging_mode_external()) in
>> sh_update_paging_modes().
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/mm/shadow/common.c
>> +++ b/xen/arch/x86/mm/shadow/common.c
>> @@ -129,8 +129,8 @@ void shadow_vcpu_init(struct vcpu *v)
>>      }
>>  #endif
>>  
>> -    v->arch.paging.mode = is_hvm_vcpu(v) ?
>> -                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 3) :
>> +    v->arch.paging.mode = paging_mode_external(v->domain) ?
>> +                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 2) :
>>                            &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);
> 
> As you're changing this, reposition the ? and : to the start of the
> following lines?

Sure.

> But, is 2-level mode actually right?  It's better than 3 certainly, and
> is what sh_update_paging_modes() selects, but isn't that only right for
> the IDENT_PT case?

Which is how HVM vCPU-s start, isn't it? For PVH there clearly needs to
be a separate (later) paging mode update anyway (or else what - as you
say - sh_update_paging_modes() selects would also be wrong), which is
going to be whenever we see CR0.PG becoming set by the toolstack.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault()
  2023-03-22  9:33 ` [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
  2023-03-23 13:34   ` Andrew Cooper
@ 2023-03-27 15:39   ` Tim Deegan
  2023-03-28 10:37     ` Jan Beulich
  1 sibling, 1 reply; 39+ messages in thread
From: Tim Deegan @ 2023-03-27 15:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

Hi,

At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
> There's no need for an indirect call here, as the mode is invariant
> throughout the entire paging-locked region. All it takes to avoid it is
> to have a forward declaration of sh_update_cr3() in place.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I find this and the respective Win7 related comment suspicious: If we
> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> the shadow_get_and_create_l1e() rather than exit? The spurious page
> fault that the guest observes can, after all, not be known to be non-
> fatal inside the guest. That's purely an OS policy.

I think it has to be non-fatal because it can happen on real hardware,
even if the hardware *does* fill the TLB here (which it is not
required to).

Filling just one sl3e sounds plausible, though we don't want to go
back to the idea of having L3 shadows on PAE!

> Furthermore the sh_update_cr3() will also invalidate L3 entries which
> were loaded successfully before, but invalidated by the guest
> afterwards. I strongly suspect that the described hardware behavior is
> _only_ to load previously not-present entries from the PDPT, but not
> purge ones already marked present.

Very likely, but we *are* allowed to forget old entries whenever we
want to, so this is at worst a performance problem.

> IOW I think sh_update_cr3() would
> need calling in an "incremental" mode here.

This would be the best way of updating just the one entry - but as far
as I can tell the existing code is correct so I wouldn't add any more
complexity unless we know that this path is causing perf problems.

> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
> looks wrong.

Yep.

> Beyond the "on demand" L3 entry creation I also can't see what guest
> actions could lead to the ASSERT() being inapplicable in the PAE case.
> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
> PDPTEs, and all other logic is similar to that for other modes.

The assert's not true here because the guest can push us down this
path by doing exactly what Win 7 does here - loading CR3 with a
missing L3E, then filling the L3E and causing a page fault that uses
the now-filled L3E.  (Or maybe that's not true any more; my mental
model of the pagetable walker code might be out of date)

Cheers,

Tim.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault()
  2023-03-27 15:39   ` Tim Deegan
@ 2023-03-28 10:37     ` Jan Beulich
  2023-03-29  6:49       ` Tim Deegan
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2023-03-28 10:37 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

On 27.03.2023 17:39, Tim Deegan wrote:
> At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
>> There's no need for an indirect call here, as the mode is invariant
>> throughout the entire paging-locked region. All it takes to avoid it is
>> to have a forward declaration of sh_update_cr3() in place.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I find this and the respective Win7 related comment suspicious: If we
>> really need to "fix up" L3 entries "on demand", wouldn't we better retry
>> the shadow_get_and_create_l1e() rather than exit? The spurious page
>> fault that the guest observes can, after all, not be known to be non-
>> fatal inside the guest. That's purely an OS policy.
> 
> I think it has to be non-fatal because it can happen on real hardware,
> even if the hardware *does* fill the TLB here (which it is not
> required to).

But if hardware filled the TLB, it won't raise #PF. If it didn't fill
the TLB (e.g. because of not even doing a walk when a PDPTE is non-
present), a #PF would be legitimate, and the handler would then need
to reload CR3. But such a #PF is what, according to the comment, Win7
chokes on. So it can only be the former case, yet what we do here is
fill the (virtual) TLB _and_ raise #PF. Win7 apparently ignores this
as spurious, but that's not required behavior for an OS afaik.

> Filling just one sl3e sounds plausible, though we don't want to go
> back to the idea of having L3 shadows on PAE!

Of course.

>> Furthermore the sh_update_cr3() will also invalidate L3 entries which
>> were loaded successfully before, but invalidated by the guest
>> afterwards. I strongly suspect that the described hardware behavior is
>> _only_ to load previously not-present entries from the PDPT, but not
>> purge ones already marked present.
> 
> Very likely, but we *are* allowed to forget old entries whenever we
> want to, so this is at worst a performance problem.

That depends on the model, I think: In the original Intel model PDPTEs
cannot be "forgotten". In some AMD variants, where L3 is walked normally,
they of course can be.

>> IOW I think sh_update_cr3() would
>> need calling in an "incremental" mode here.
> 
> This would be the best way of updating just the one entry - but as far
> as I can tell the existing code is correct so I wouldn't add any more
> complexity unless we know that this path is causing perf problems.

If it was/is just performance - sure.

>> In any event emitting a TRC_SHADOW_DOMF_DYING trace record in this case
>> looks wrong.
> 
> Yep.

Will add another patch to the series then.

>> Beyond the "on demand" L3 entry creation I also can't see what guest
>> actions could lead to the ASSERT() being inapplicable in the PAE case.
>> The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
>> PDPTEs, and all other logic is similar to that for other modes.
> 
> The assert's not true here because the guest can push us down this
> path by doing exactly what Win 7 does here - loading CR3 with a
> missing L3E, then filling the L3E and causing a page fault that uses
> the now-filled L3E.  (Or maybe that's not true any more; my mental
> model of the pagetable walker code might be out of date)

Well, I specifically started the paragraph with 'Beyond the "on demand"
L3 entry creation'. IOW the assertion would look applicable to me if
we, as suggested higher up, retried shadow_get_and_create_l1e() and it
failed again. As the comment there says, "we know the guest entries are
OK", so the missing L3 entry must have appeared.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible
  2023-03-24  7:38     ` Jan Beulich
@ 2023-03-28 13:57       ` Andrew Cooper
  2023-03-28 14:41         ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2023-03-28 13:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan, xen-devel

On 24/03/2023 7:38 am, Jan Beulich wrote:
> On 23.03.2023 19:18, Andrew Cooper wrote:
>> On 22/03/2023 9:35 am, Jan Beulich wrote:
>>> Emulation related functions are involved in HVM handling only, and in
>>> some cases they even invoke such checks after having already done things
>>> which are valid for HVM domains only. OOS active also implies HVM. In
>>> sh_remove_all_mappings() one of the two checks is redundant with an
>>> earlier paging_mode_external() one (the other, however, needs to stay).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> --- a/xen/arch/x86/mm/shadow/common.c
>>> +++ b/xen/arch/x86/mm/shadow/common.c
>>> @@ -1522,7 +1522,7 @@ int sh_remove_all_mappings(struct domain
>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>                     == (is_special_page(page) ||
>>> -                       (is_hvm_domain(d) && is_ioreq_server_page(d, page))))) )
>>> +                       is_ioreq_server_page(d, page)))) )
>>>              printk(XENLOG_G_ERR "can't find all mappings of mfn %"PRI_mfn
>>>                     " (gfn %"PRI_gfn"): c=%lx t=%lx s=%d i=%d\n",
>>>                     mfn_x(gmfn), gfn_x(gfn),
>> Out of context here needs an equivalent adjustment.
> I'm afraid I don't seen any further is_hvm_*() in this function.

Final parameter to the printk(), calculating the i=%d diagnostic.

>> But in this case, I'm not sure the commit message covers the relevant
>> details.  ioreq servers have been made fully common since this code was
>> written, and *that* is a better reason for dropping the predicates IMO
>> than the redundancy with paging_mode_external().
> How does "fully common" matter? It's still a HVM-only thing,

ioreq server info used to be in d->arch.hvm.ioreq.

The is_hvm_domain() was guarding against speculative type confusion in
the HVM union, and should have been removed by the ARM work that made it
common.

This isn't really related to the outer paging_mode_external().
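
For anyone not familiar with the pattern, the hazard being guarded against
looks roughly like this (a toy tagged union, not the real struct domain
layout; all names here are illustrative):

#include <stdbool.h>
#include <stdio.h>

/* Toy model of "type confusion in a union": reading the HVM member of a
 * per-type union without checking the type predicate first interprets PV
 * data as HVM state. */
struct toy_domain {
    bool hvm;
    union {
        struct { int ioreq_lock; } hvm_state;   /* only valid when hvm  */
        struct { int pv_thing;   } pv_state;    /* only valid when !hvm */
    } u;
};

static int peek_ioreq_lock(const struct toy_domain *d)
{
    /* The guard ensures PV state is never misread as HVM state,
     * architecturally or (with suitable hardening) speculatively. */
    return d->hvm ? d->u.hvm_state.ioreq_lock : -1;
}

int main(void)
{
    struct toy_domain pv = { .hvm = false, .u.pv_state.pv_thing = 42 };

    printf("%d\n", peek_ioreq_lock(&pv));   /* -1, not 42 misread as a lock */
    return 0;
}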

>> That said...  I'm not sure the logic here is correct any more.  It used
>> to be the case that ioreq pages were in the p2m, but they're outside of
>> the p2m these days, so I don't see how there can be any interaction with
>> unexpected refcounts any more.
>>
>> I suspect that one way or another, this change wants to be in a separate
>> patch.
> I think that if there are further adjustments to make (like dropping
> is_ioreq_server_page() altogether, as you appear to suggest), that would
> want to be in a separate patch, but the change as done fully fits the
> given justification. (Of course in such a patch both _could_ also be
> dropped at the same time.)

I'd still suggest doing it all separately.  It's sufficiently unrelated
to the justification for the other hunks of the patch.

>>> --- a/xen/arch/x86/mm/shadow/oos.c
>>> +++ b/xen/arch/x86/mm/shadow/oos.c
>>> @@ -577,7 +577,6 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
>>>      if ( pg->shadow_flags &
>>>           ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
>>>           || sh_page_has_multiple_shadows(pg)
>>> -         || !is_hvm_vcpu(v)
>>>           || !v->domain->arch.paging.shadow.oos_active )
>> This is reachable for PV guests as far as I can see.  What am I missing?
> Well, the footnote in patch 1 ("x86/shadow: fix and improve
> sh_page_has_multiple_shadows()") kind of explains this wrt the safety
> of the sh_page_has_multiple_shadows() use here: Since PV guests can't
> have OOS pages, there's no way SHF_out_of_sync could be set.

Hmm, I suppose.  We enter sh_unsync() directly from a demand write, but
it is only meaningful when OOS is active to begin with.
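
(As a toy sanity check of that reasoning -- names and types simplified well
beyond the real code -- the early-exit condition reduces to:)

#include <stdbool.h>
#include <stdio.h>

/* Toy reduction of the sh_unsync() early-exit check (illustrative only):
 * for a PV vCPU, the out-of-sync flag can never be set and oos_active is
 * never enabled, so dropping the explicit HVM term can't change the result. */
static bool bails(bool out_of_sync_or_wrong_type, bool multiple_shadows,
                  bool oos_active)
{
    return out_of_sync_or_wrong_type || multiple_shadows || !oos_active;
}

int main(void)
{
    /* PV invariants: no OOS flag, OOS never active; multiple shadows or not,
     * the check always bails. */
    printf("%d %d\n", bails(false, false, false), bails(false, true, false));
    return 0;
}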

Although having looked through this, there ought to be an early exit for
oos_active even ahead of the SHADOW_PRINTK(), and the single caller of
this function doesn't check the return value.  (This appears to be a
common theme...)

~Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible
  2023-03-28 13:57       ` Andrew Cooper
@ 2023-03-28 14:41         ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2023-03-28 14:41 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan, xen-devel

On 28.03.2023 15:57, Andrew Cooper wrote:
> On 24/03/2023 7:38 am, Jan Beulich wrote:
>> On 23.03.2023 19:18, Andrew Cooper wrote:
>>> On 22/03/2023 9:35 am, Jan Beulich wrote:
>>>> Emulation related functions are involved in HVM handling only, and in
>>>> some cases they even invoke such checks after having already done things
>>>> which are valid for HVM domains only. OOS active also implies HVM. In
>>>> sh_remove_all_mappings() one of the two checks is redundant with an
>>>> earlier paging_mode_external() one (the other, however, needs to stay).
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> --- a/xen/arch/x86/mm/shadow/common.c
>>>> +++ b/xen/arch/x86/mm/shadow/common.c
>>>> @@ -1522,7 +1522,7 @@ int sh_remove_all_mappings(struct domain
>>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>>                     == (is_special_page(page) ||
>>>> -                       (is_hvm_domain(d) && is_ioreq_server_page(d, page))))) )
>>>> +                       is_ioreq_server_page(d, page)))) )
>>>>              printk(XENLOG_G_ERR "can't find all mappings of mfn %"PRI_mfn
>>>>                     " (gfn %"PRI_gfn"): c=%lx t=%lx s=%d i=%d\n",
>>>>                     mfn_x(gmfn), gfn_x(gfn),
>>> Out of context here needs an equivalent adjustment.
>> I'm afraid I don't see any further is_hvm_*() in this function.
> 
> Final parameter to the printk(), calculating the i=%d diagnostic.

That one. This needs to stay, as calling is_ioreq_server_page() isn't
safe for PV domains. ioreq_domain_init(), which initializes the involved
spinlock, is called only for HVM domains. This is why I'm insisting ...

>>> But in this case, I'm not sure the commit message covers the relevant
>>> details.  ioreq servers have been made fully common since this code was
>>> written, and *that* is a better reason for dropping the predicates IMO
>>> than the redundancy with paging_mode_external().
>> How does "fully common" matter? It's still a HVM-only thing,
> 
> ioreq server info used to be in d->arch.hvm.ioreq.
> 
> The is_hvm_domain() was guarding against speculative type confusion in
> the HVM union, and should have been removed by the ARM work that made it
> common.
> 
> This isn't really related to the outer paging_mode_external().

... that the dropping I can safely do here is solely because of the
earlier paging_mode_external(). This is irrespective of what the
check may (also) have been used for before.
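
To make the point concrete, a toy model of why the short-circuit has to stay
(stand-in types and functions only, not the real ioreq code):

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: in the real code the lock in question is set up by
 * ioreq_domain_init(), which runs for HVM domains only, so reaching the
 * page check for a PV domain would use an uninitialised lock. */
struct toy_domain {
    bool hvm;
    bool ioreq_lock_initialised;
};

static bool toy_is_hvm_domain(const struct toy_domain *d)
{
    return d->hvm;
}

static bool toy_is_ioreq_server_page(const struct toy_domain *d)
{
    if ( !d->ioreq_lock_initialised )
        fprintf(stderr, "BUG: uninitialised lock taken\n");
    return false;                            /* page lookup elided */
}

int main(void)
{
    struct toy_domain pv = { .hvm = false, .ioreq_lock_initialised = false };

    /* Keeping the is_hvm_domain() guard short-circuits the unsafe call: */
    int i = toy_is_hvm_domain(&pv) && toy_is_ioreq_server_page(&pv);

    printf("i=%d\n", i);
    return 0;
}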

>>>> --- a/xen/arch/x86/mm/shadow/oos.c
>>>> +++ b/xen/arch/x86/mm/shadow/oos.c
>>>> @@ -577,7 +577,6 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
>>>>      if ( pg->shadow_flags &
>>>>           ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
>>>>           || sh_page_has_multiple_shadows(pg)
>>>> -         || !is_hvm_vcpu(v)
>>>>           || !v->domain->arch.paging.shadow.oos_active )
>>> This is reachable for PV guests as far as I can see.  What am I missing?
>> Well, the footnote in patch 1 ("x86/shadow: fix and improve
>> sh_page_has_multiple_shadows()") kind of explains this wrt the safety
>> of the sh_page_has_multiple_shadows() use here: Since PV guests can't
>> have OOS pages, there's no way SHF_out_of_sync could be set.
> 
> Hmm, I suppose.  We enter sh_unsync() directly from a demand write, but
> it is only meaningful when OOS is active to begin with.
> 
> Although having looked through this, there ought to be an early exit for
> oos_active even ahead of the SHADOW_PRINTK(), and the single caller of
> this function doesn't check the return value.  (This appears to be a
> common theme...)

Well, of course there's much more cleanup left than done by this series.

Jan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault()
  2023-03-28 10:37     ` Jan Beulich
@ 2023-03-29  6:49       ` Tim Deegan
  0 siblings, 0 replies; 39+ messages in thread
From: Tim Deegan @ 2023-03-29  6:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

At 12:37 +0200 on 28 Mar (1680007032), Jan Beulich wrote:
> On 27.03.2023 17:39, Tim Deegan wrote:
> > At 10:33 +0100 on 22 Mar (1679481226), Jan Beulich wrote:
> >> There's no need for an indirect call here, as the mode is invariant
> >> throughout the entire paging-locked region. All it takes to avoid it is
> >> to have a forward declaration of sh_update_cr3() in place.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I find this and the respective Win7 related comment suspicious: If we
> >> really need to "fix up" L3 entries "on demand", wouldn't we better retry
> >> the shadow_get_and_create_l1e() rather than exit? The spurious page
> >> fault that the guest observes can, after all, not be known to be non-
> >> fatal inside the guest. That's purely an OS policy.
> > 
> > I think it has to be non-fatal because it can happen on real hardware,
> > even if the hardware *does* fill the TLB here (which it is not
> > required to).
> 
> But if hardware filled the TLB, it won't raise #PF. If it didn't fill
> the TLB (e.g. because of not even doing a walk when a PDPTE is non-
> present), a #PF would be legitimate, and the handler would then need
> to reload CR3. But such a #PF is what, according to the comment, Win7
> chokes on.

IIRC the Win7 behaviour is to accept and discard the #PF as spurious
(because it sees the PDPTE is present) *without* reloading CR3, so it
gets stuck in a loop taking pagefaults.  Here, we reload CR3 for it,
to unstick it.
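
For illustration only, that loop can be modelled like so (toy code, no real
guest or MMU state involved; the fault count is arbitrary):

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the described behaviour: the guest's #PF handler sees its own
 * PDPTE present, declares the fault spurious and returns without reloading
 * CR3, so nothing changes on the guest side until the hypervisor rebuilds
 * the shadow top level on its behalf. */
int main(void)
{
    bool guest_pdpte_present = true;    /* filled in after CR3 was loaded */
    bool shadow_l3e_present = false;    /* not yet picked up by the shadow */
    unsigned int faults = 0;

    while ( !shadow_l3e_present )
    {
        faults++;                       /* the access faults again ...     */
        if ( guest_pdpte_present && faults == 3 )
            shadow_l3e_present = true;  /* ... until the hypervisor reloads
                                         * the shadow CR3 to unstick it    */
    }

    printf("stuck for %u faults before the shadow CR3 reload\n", faults);
    return 0;
}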

I can't think of a mental model of the MMU that would have a problem
here -- either the L3Es are special, in which case the pagefault is
correct (and we shouldn't even read the new contents), or they're just
like other PTEs, in which case the spurious fault is fine.

> > The assert's not true here because the guest can push us down this
> > path by doing exactly what Win 7 does here - loading CR3 with a
> > missing L3E, then filling the L3E and causing a page fault that uses
> > the now-filled L3E.  (Or maybe that's not true any more; my mental
> > model of the pagetable walker code might be out of date)
> 
> Well, I specifically started the paragraph with 'Beyond the "on demand"
> L3 entry creation'. IOW the assertion would look applicable to me if
> we, as suggested higher up, retried shadow_get_and_create_l1e() and it
> failed again. As the comment there says, "we know the guest entries are
> OK", so the missing L3 entry must have appeared.

Ah, I didn't quite understand you.  Yes, if we changed the handler to
rebuild the SL3E then I think the assertion would be valid again.

Cheers,

Tim.


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-03-29  6:49 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-22  9:28 [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Jan Beulich
2023-03-22  9:29 ` [PATCH 01/16] x86/shadow: fix and improve sh_page_has_multiple_shadows() Jan Beulich
2023-03-23 11:55   ` Andrew Cooper
2023-03-22  9:30 ` [PATCH 02/16] x86/shadow: fold/rename sh_unhook_*_mappings() Jan Beulich
2023-03-22  9:31 ` [PATCH 03/16] x86/shadow: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodys" Jan Beulich
2023-03-23 12:14   ` Andrew Cooper
2023-03-22  9:31 ` [PATCH 04/16] x86/shadow: replace memcmp() in sh_resync_l1() Jan Beulich
2023-03-23 12:55   ` Andrew Cooper
2023-03-22  9:31 ` [PATCH 05/16] x86/shadow: reduce explicit log-dirty recording for HVM Jan Beulich
2023-03-22  9:32 ` [PATCH 06/16] x86/shadow: purge {write,cmpxchg}_guest_entry() hooks Jan Beulich
2023-03-23 13:13   ` Andrew Cooper
2023-03-22  9:33 ` [PATCH 07/16] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
2023-03-23 13:34   ` Andrew Cooper
2023-03-27 15:39   ` Tim Deegan
2023-03-28 10:37     ` Jan Beulich
2023-03-29  6:49       ` Tim Deegan
2023-03-22  9:34 ` [PATCH 08/16] x86/shadow: use lighter weight mode checks Jan Beulich
2023-03-22  9:34 ` [PATCH 09/16] x86/shadow: OOS mode is HVM-only Jan Beulich
2023-03-23 13:42   ` Andrew Cooper
2023-03-22  9:35 ` [PATCH 10/16] x86/shadow: move OOS functions to their own file Jan Beulich
2023-03-23 14:30   ` Andrew Cooper
2023-03-23 15:40     ` Jan Beulich
2023-03-23 17:44       ` Andrew Cooper
2023-03-22  9:35 ` [PATCH 11/16] x86/shadow: drop is_hvm_...() where easily possible Jan Beulich
2023-03-23 18:18   ` Andrew Cooper
2023-03-24  7:38     ` Jan Beulich
2023-03-28 13:57       ` Andrew Cooper
2023-03-28 14:41         ` Jan Beulich
2023-03-22  9:35 ` [PATCH 12/16] x86/shadow: make monitor table create/destroy more consistent Jan Beulich
2023-03-23 18:28   ` Andrew Cooper
2023-03-24  7:52     ` Jan Beulich
2023-03-22  9:36 ` [PATCH 13/16] x86/shadow: vCPU-s never have "no mode" Jan Beulich
2023-03-23 18:29   ` Andrew Cooper
2023-03-22  9:37 ` [PATCH 14/16] x86/shadow: "monitor table" is a HVM-only concept Jan Beulich
2023-03-23 18:30   ` Andrew Cooper
2023-03-22  9:37 ` [PATCH 15/16] x86/shadow: adjust monitor table prealloc amount Jan Beulich
2023-03-22  9:38 ` [PATCH 16/16] x86/PV: conditionalize arch_set_info_guest()'s call to update_cr3() Jan Beulich
2023-03-23 18:34   ` Andrew Cooper
2023-03-23 11:40 ` [PATCH 00/16] x86: assorted (mostly) shadow mode adjustments Andrew Cooper
