* [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus ...
@ 2021-01-14 15:01 Jan Beulich
  2021-01-14 15:03 ` [PATCH 01/17] x86/shadow: use __put_user() instead of __copy_to_user() Jan Beulich
                   ` (17 more replies)
  0 siblings, 18 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

... shadow adjustments towards not building 2- and 3-level code
when !HVM. While the latter isn't functionally related to the
former, it depends on some of the rearrangements done there.

01: shadow: use __put_user() instead of __copy_to_user()
02: split __{get,put}_user() into "guest" and "unsafe" variants
03: split __copy_{from,to}_user() into "guest" and "unsafe" variants
04: PV: harden guest memory accesses against speculative abuse
05: rename {get,put}_user() to {get,put}_guest()
06: gdbsx: convert "user" to "guest" accesses
07: rename copy_{from,to}_user() to copy_{from,to}_guest_pv()
08: move stac()/clac() from {get,put}_unsafe_asm() ...
09: PV: use get_unsafe() instead of copy_from_unsafe()
10: shadow: use get_unsafe() instead of copy_from_unsafe()
11: shadow: polish shadow_write_entries()
12: shadow: move shadow_set_l<N>e() to their own source file
13: shadow: don't open-code SHF_* shorthands
14: shadow: SH_type_l2h_shadow is PV-only
15: shadow: drop SH_type_l2h_pae_shadow
16: shadow: only 4-level guest code needs building when !HVM
17: shadow: adjust is_pv_*() checks

Jan



* [PATCH 01/17] x86/shadow: use __put_user() instead of __copy_to_user()
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
@ 2021-01-14 15:03 ` Jan Beulich
  2021-01-14 15:04 ` [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants Jan Beulich
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:03 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

In a subsequent patch I would almost have broken the logic here, had I
not happened to read through the comment at the top of
safe_write_entry(): __copy_to_user() does not provide the guarantee
shadow_write_entries() requires - its use of __put_user_size() for
certain sizes is merely an optimization. Use __put_user() directly,
which does expand to a single (memory accessing) insn.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
In a future patch I guess I'll make this write store the intended data
instead of doing this "no-op" write, making the subsequent loop start
from 1 in the success case. In fact I also think safe_write_entry()
would better go away, in favor of direct use of write_atomic().

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -776,9 +776,9 @@ shadow_write_entries(void *d, void *s, i
     /* Because we mirror access rights at all levels in the shadow, an
      * l2 (or higher) entry with the RW bit cleared will leave us with
      * no write access through the linear map.
-     * We detect that by writing to the shadow with copy_to_user() and
+     * We detect that by writing to the shadow with __put_user() and
      * using map_domain_page() to get a writeable mapping if we need to. */
-    if ( __copy_to_user(d, d, sizeof (unsigned long)) != 0 )
+    if ( __put_user(*dst, dst) )
     {
         perfc_incr(shadow_linear_map_failed);
         map = map_domain_page(mfn);




* [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
  2021-01-14 15:03 ` [PATCH 01/17] x86/shadow: use __put_user() instead of __copy_to_user() Jan Beulich
@ 2021-01-14 15:04 ` Jan Beulich
  2021-02-05 15:43   ` Roger Pau Monné
  2021-02-09 14:55   ` Roger Pau Monné
  2021-01-14 15:04 ` [PATCH 03/17] x86: split __copy_{from,to}_user() " Jan Beulich
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are not. (For
descriptor table accesses the low bits of the addresses may still be
guest controlled, but this still won't allow speculation to "escape"
into unwanted areas.) Subsequently we will want them to have different
behavior, so as a first step identify which one is which. For now, both
groups of constructs alias one another.
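
To illustrate the intended call-site policy, here is a rough sketch with
toy stand-in accessors (hypothetical code, not the real macros; at this
point in the series both families still behave identically):

    #include <stdint.h>

    /*
     * "guest": the pointer value is fully guest controlled, e.g. a slot
     * on the guest kernel stack.  "unsafe": the address was computed by
     * Xen itself, e.g. a range-checked GDT/LDT slot in Xen's own mapping,
     * with at most its low bits guest influenced.
     */
    static inline int toy_get_guest(uint64_t *x, const uint64_t *guest_ptr)
    {
        *x = *guest_ptr;   /* will gain speculative hardening later on */
        return 0;
    }

    static inline int toy_get_unsafe(uint64_t *x, const uint64_t *xen_ptr)
    {
        *x = *xen_ptr;     /* no masking needed for Xen-derived addresses */
        return 0;
    }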

Double underscore prefixes are retained only on __{get,put}_guest(), to
allow still distinguishing them from their "checking" counterparts once
they also get renamed (to {get,put}_guest()).

Since for them it's almost a full re-write, move what becomes
{get,put}_unsafe_size() into the "common" uaccess.h (x86_64/*.h should
disappear at some point anyway).

In __copy_to_user() one of the two casts in each put_guest_size()
invocation gets dropped. They're not needed and did break symmetry with
__copy_from_user().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -776,9 +776,9 @@ shadow_write_entries(void *d, void *s, i
     /* Because we mirror access rights at all levels in the shadow, an
      * l2 (or higher) entry with the RW bit cleared will leave us with
      * no write access through the linear map.
-     * We detect that by writing to the shadow with __put_user() and
+     * We detect that by writing to the shadow with put_unsafe() and
      * using map_domain_page() to get a writeable mapping if we need to. */
-    if ( __put_user(*dst, dst) )
+    if ( put_unsafe(*dst, dst) )
     {
         perfc_incr(shadow_linear_map_failed);
         map = map_domain_page(mfn);
--- a/xen/arch/x86/pv/emul-gate-op.c
+++ b/xen/arch/x86/pv/emul-gate-op.c
@@ -40,7 +40,7 @@ static int read_gate_descriptor(unsigned
          ((gate_sel >> 3) + !is_pv_32bit_vcpu(v) >=
           (gate_sel & 4 ? v->arch.pv.ldt_ents
                         : v->arch.pv.gdt_ents)) ||
-         __get_user(desc, pdesc) )
+         get_unsafe(desc, pdesc) )
         return 0;
 
     *sel = (desc.a >> 16) & 0x0000fffc;
@@ -59,7 +59,7 @@ static int read_gate_descriptor(unsigned
     {
         if ( (*ar & 0x1f00) != 0x0c00 ||
              /* Limit check done above already. */
-             __get_user(desc, pdesc + 1) ||
+             get_unsafe(desc, pdesc + 1) ||
              (desc.b & 0x1f00) )
             return 0;
 
@@ -294,7 +294,7 @@ void pv_emulate_gate_op(struct cpu_user_
         { \
             --stkp; \
             esp -= 4; \
-            rc = __put_user(item, stkp); \
+            rc = __put_guest(item, stkp); \
             if ( rc ) \
             { \
                 pv_inject_page_fault(PFEC_write_access, \
@@ -362,7 +362,7 @@ void pv_emulate_gate_op(struct cpu_user_
                     unsigned int parm;
 
                     --ustkp;
-                    rc = __get_user(parm, ustkp);
+                    rc = __get_guest(parm, ustkp);
                     if ( rc )
                     {
                         pv_inject_page_fault(0, (unsigned long)(ustkp + 1) - rc);
--- a/xen/arch/x86/pv/emulate.c
+++ b/xen/arch/x86/pv/emulate.c
@@ -34,13 +34,13 @@ int pv_emul_read_descriptor(unsigned int
     if ( sel < 4 ||
          /*
           * Don't apply the GDT limit here, as the selector may be a Xen
-          * provided one. __get_user() will fail (without taking further
+          * provided one. get_unsafe() will fail (without taking further
           * action) for ones falling in the gap between guest populated
           * and Xen ones.
           */
          ((sel & 4) && (sel >> 3) >= v->arch.pv.ldt_ents) )
         desc.b = desc.a = 0;
-    else if ( __get_user(desc, gdt_ldt_desc_ptr(sel)) )
+    else if ( get_unsafe(desc, gdt_ldt_desc_ptr(sel)) )
         return 0;
     if ( !insn_fetch )
         desc.b &= ~_SEGMENT_L;
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -114,15 +114,15 @@ unsigned int compat_iret(void)
     regs->rsp = (u32)regs->rsp;
 
     /* Restore EAX (clobbered by hypercall). */
-    if ( unlikely(__get_user(regs->eax, (u32 *)regs->rsp)) )
+    if ( unlikely(__get_guest(regs->eax, (u32 *)regs->rsp)) )
     {
         domain_crash(v->domain);
         return 0;
     }
 
     /* Restore CS and EIP. */
-    if ( unlikely(__get_user(regs->eip, (u32 *)regs->rsp + 1)) ||
-        unlikely(__get_user(regs->cs, (u32 *)regs->rsp + 2)) )
+    if ( unlikely(__get_guest(regs->eip, (u32 *)regs->rsp + 1)) ||
+        unlikely(__get_guest(regs->cs, (u32 *)regs->rsp + 2)) )
     {
         domain_crash(v->domain);
         return 0;
@@ -132,7 +132,7 @@ unsigned int compat_iret(void)
      * Fix up and restore EFLAGS. We fix up in a local staging area
      * to avoid firing the BUG_ON(IOPL) check in arch_get_info_guest.
      */
-    if ( unlikely(__get_user(eflags, (u32 *)regs->rsp + 3)) )
+    if ( unlikely(__get_guest(eflags, (u32 *)regs->rsp + 3)) )
     {
         domain_crash(v->domain);
         return 0;
@@ -164,16 +164,16 @@ unsigned int compat_iret(void)
         {
             for (i = 1; i < 10; ++i)
             {
-                rc |= __get_user(x, (u32 *)regs->rsp + i);
-                rc |= __put_user(x, (u32 *)(unsigned long)ksp + i);
+                rc |= __get_guest(x, (u32 *)regs->rsp + i);
+                rc |= __put_guest(x, (u32 *)(unsigned long)ksp + i);
             }
         }
         else if ( ksp > regs->esp )
         {
             for ( i = 9; i > 0; --i )
             {
-                rc |= __get_user(x, (u32 *)regs->rsp + i);
-                rc |= __put_user(x, (u32 *)(unsigned long)ksp + i);
+                rc |= __get_guest(x, (u32 *)regs->rsp + i);
+                rc |= __put_guest(x, (u32 *)(unsigned long)ksp + i);
             }
         }
         if ( rc )
@@ -189,7 +189,7 @@ unsigned int compat_iret(void)
             eflags &= ~X86_EFLAGS_IF;
         regs->eflags &= ~(X86_EFLAGS_VM|X86_EFLAGS_RF|
                           X86_EFLAGS_NT|X86_EFLAGS_TF);
-        if ( unlikely(__put_user(0, (u32 *)regs->rsp)) )
+        if ( unlikely(__put_guest(0, (u32 *)regs->rsp)) )
         {
             domain_crash(v->domain);
             return 0;
@@ -205,8 +205,8 @@ unsigned int compat_iret(void)
     else if ( ring_1(regs) )
         regs->esp += 16;
     /* Return to ring 2/3: restore ESP and SS. */
-    else if ( __get_user(regs->ss, (u32 *)regs->rsp + 5) ||
-              __get_user(regs->esp, (u32 *)regs->rsp + 4) )
+    else if ( __get_guest(regs->ss, (u32 *)regs->rsp + 5) ||
+              __get_guest(regs->esp, (u32 *)regs->rsp + 4) )
     {
         domain_crash(v->domain);
         return 0;
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -274,7 +274,7 @@ static void compat_show_guest_stack(stru
     {
         if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
             break;
-        if ( __get_user(addr, stack) )
+        if ( get_unsafe(addr, stack) )
         {
             if ( i != 0 )
                 printk("\n    ");
@@ -343,7 +343,7 @@ static void show_guest_stack(struct vcpu
     {
         if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
             break;
-        if ( __get_user(addr, stack) )
+        if ( get_unsafe(addr, stack) )
         {
             if ( i != 0 )
                 printk("\n    ");
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -60,13 +60,11 @@ extern void __put_user_bad(void);
   __put_user_check((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr)))
 
 /**
- * __get_user: - Get a simple variable from user space, with less checking.
+ * __get_guest: - Get a simple variable from guest space, with less checking.
  * @x:   Variable to store result.
- * @ptr: Source address, in user space.
+ * @ptr: Source address, in guest space.
  *
- * Context: User context only.  This function may sleep.
- *
- * This macro copies a single simple variable from user space to kernel
+ * This macro copies a single simple variable from guest space to hypervisor
  * space.  It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
@@ -79,17 +77,15 @@ extern void __put_user_bad(void);
  * Returns zero on success, or -EFAULT on error.
  * On error, the variable @x is set to zero.
  */
-#define __get_user(x,ptr) \
-  __get_user_nocheck((x),(ptr),sizeof(*(ptr)))
+#define __get_guest(x, ptr) get_guest_nocheck(x, ptr, sizeof(*(ptr)))
+#define get_unsafe __get_guest
 
 /**
- * __put_user: - Write a simple value into user space, with less checking.
- * @x:   Value to copy to user space.
- * @ptr: Destination address, in user space.
+ * __put_guest: - Write a simple value into guest space, with less checking.
+ * @x:   Value to store in guest space.
+ * @ptr: Destination address, in guest space.
  *
- * Context: User context only.  This function may sleep.
- *
- * This macro copies a single simple value from kernel space to user
+ * This macro copies a single simple value from hypervisor space to guest
  * space.  It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
@@ -101,13 +97,14 @@ extern void __put_user_bad(void);
  *
  * Returns zero on success, or -EFAULT on error.
  */
-#define __put_user(x,ptr) \
-  __put_user_nocheck((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr)))
+#define __put_guest(x, ptr) \
+    put_guest_nocheck((__typeof__(*(ptr)))(x), ptr, sizeof(*(ptr)))
+#define put_unsafe __put_guest
 
-#define __put_user_nocheck(x, ptr, size)				\
+#define put_guest_nocheck(x, ptr, size)					\
 ({									\
 	int err_; 							\
-	__put_user_size(x, ptr, size, err_, -EFAULT);			\
+	put_guest_size(x, ptr, size, err_, -EFAULT);			\
 	err_;								\
 })
 
@@ -115,14 +112,14 @@ extern void __put_user_bad(void);
 ({									\
 	__typeof__(*(ptr)) __user *ptr_ = (ptr);			\
 	__typeof__(size) size_ = (size);				\
-	access_ok(ptr_, size_) ? __put_user_nocheck(x, ptr_, size_)	\
+	access_ok(ptr_, size_) ? put_guest_nocheck(x, ptr_, size_)	\
 			       : -EFAULT;				\
 })
 
-#define __get_user_nocheck(x, ptr, size)				\
+#define get_guest_nocheck(x, ptr, size)					\
 ({									\
 	int err_; 							\
-	__get_user_size(x, ptr, size, err_, -EFAULT);			\
+	get_guest_size(x, ptr, size, err_, -EFAULT);			\
 	err_;								\
 })
 
@@ -130,7 +127,7 @@ extern void __put_user_bad(void);
 ({									\
 	__typeof__(*(ptr)) __user *ptr_ = (ptr);			\
 	__typeof__(size) size_ = (size);				\
-	access_ok(ptr_, size_) ? __get_user_nocheck(x, ptr_, size_)	\
+	access_ok(ptr_, size_) ? get_guest_nocheck(x, ptr_, size_)	\
 			       : -EFAULT;				\
 })
 
@@ -142,7 +139,7 @@ struct __large_struct { unsigned long bu
  * we do not write to any memory gcc knows about, so there are no
  * aliasing issues.
  */
-#define __put_user_asm(x, addr, err, itype, rtype, ltype, errret)	\
+#define put_unsafe_asm(x, addr, err, itype, rtype, ltype, errret)	\
 	stac();								\
 	__asm__ __volatile__(						\
 		"1:	mov"itype" %"rtype"1,%2\n"			\
@@ -156,7 +153,7 @@ struct __large_struct { unsigned long bu
 		: ltype (x), "m"(__m(addr)), "i"(errret), "0"(err));	\
 	clac()
 
-#define __get_user_asm(x, addr, err, itype, rtype, ltype, errret)	\
+#define get_unsafe_asm(x, addr, err, itype, rtype, ltype, errret)	\
 	stac();								\
 	__asm__ __volatile__(						\
 		"1:	mov"itype" %2,%"rtype"1\n"			\
@@ -171,6 +168,34 @@ struct __large_struct { unsigned long bu
 		: "m"(__m(addr)), "i"(errret), "0"(err));		\
 	clac()
 
+#define put_unsafe_size(x, ptr, size, retval, errret)                      \
+do {                                                                       \
+    retval = 0;                                                            \
+    switch ( size )                                                        \
+    {                                                                      \
+    case 1: put_unsafe_asm(x, ptr, retval, "b", "b", "iq", errret); break; \
+    case 2: put_unsafe_asm(x, ptr, retval, "w", "w", "ir", errret); break; \
+    case 4: put_unsafe_asm(x, ptr, retval, "l", "k", "ir", errret); break; \
+    case 8: put_unsafe_asm(x, ptr, retval, "q",  "", "ir", errret); break; \
+    default: __put_user_bad();                                             \
+    }                                                                      \
+} while ( false )
+#define put_guest_size put_unsafe_size
+
+#define get_unsafe_size(x, ptr, size, retval, errret)                      \
+do {                                                                       \
+    retval = 0;                                                            \
+    switch ( size )                                                        \
+    {                                                                      \
+    case 1: get_unsafe_asm(x, ptr, retval, "b", "b", "=q", errret); break; \
+    case 2: get_unsafe_asm(x, ptr, retval, "w", "w", "=r", errret); break; \
+    case 4: get_unsafe_asm(x, ptr, retval, "l", "k", "=r", errret); break; \
+    case 8: get_unsafe_asm(x, ptr, retval, "q",  "", "=r", errret); break; \
+    default: __get_user_bad();                                             \
+    }                                                                      \
+} while ( false )
+#define get_guest_size get_unsafe_size
+
 /**
  * __copy_to_user: - Copy a block of data into user space, with less checking
  * @to:   Destination address, in user space.
@@ -193,16 +218,16 @@ __copy_to_user(void __user *to, const vo
 
         switch (n) {
         case 1:
-            __put_user_size(*(const u8 *)from, (u8 __user *)to, 1, ret, 1);
+            put_guest_size(*(const uint8_t *)from, to, 1, ret, 1);
             return ret;
         case 2:
-            __put_user_size(*(const u16 *)from, (u16 __user *)to, 2, ret, 2);
+            put_guest_size(*(const uint16_t *)from, to, 2, ret, 2);
             return ret;
         case 4:
-            __put_user_size(*(const u32 *)from, (u32 __user *)to, 4, ret, 4);
+            put_guest_size(*(const uint32_t *)from, to, 4, ret, 4);
             return ret;
         case 8:
-            __put_user_size(*(const u64 *)from, (u64 __user *)to, 8, ret, 8);
+            put_guest_size(*(const uint64_t *)from, to, 8, ret, 8);
             return ret;
         }
     }
@@ -234,16 +259,16 @@ __copy_from_user(void *to, const void __
 
         switch (n) {
         case 1:
-            __get_user_size(*(u8 *)to, from, 1, ret, 1);
+            get_guest_size(*(uint8_t *)to, from, 1, ret, 1);
             return ret;
         case 2:
-            __get_user_size(*(u16 *)to, from, 2, ret, 2);
+            get_guest_size(*(uint16_t *)to, from, 2, ret, 2);
             return ret;
         case 4:
-            __get_user_size(*(u32 *)to, from, 4, ret, 4);
+            get_guest_size(*(uint32_t *)to, from, 4, ret, 4);
             return ret;
         case 8:
-            __get_user_size(*(u64*)to, from, 8, ret, 8);
+            get_guest_size(*(uint64_t *)to, from, 8, ret, 8);
             return ret;
         }
     }
--- a/xen/include/asm-x86/x86_64/uaccess.h
+++ b/xen/include/asm-x86/x86_64/uaccess.h
@@ -57,28 +57,4 @@ extern void *xlat_malloc(unsigned long *
     (likely((count) < (~0U / (size))) && \
      compat_access_ok(addr, 0 + (count) * (size)))
 
-#define __put_user_size(x,ptr,size,retval,errret)			\
-do {									\
-	retval = 0;							\
-	switch (size) {							\
-	case 1: __put_user_asm(x,ptr,retval,"b","b","iq",errret);break;	\
-	case 2: __put_user_asm(x,ptr,retval,"w","w","ir",errret);break; \
-	case 4: __put_user_asm(x,ptr,retval,"l","k","ir",errret);break;	\
-	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
-	default: __put_user_bad();					\
-	}								\
-} while (0)
-
-#define __get_user_size(x,ptr,size,retval,errret)			\
-do {									\
-	retval = 0;							\
-	switch (size) {							\
-	case 1: __get_user_asm(x,ptr,retval,"b","b","=q",errret);break;	\
-	case 2: __get_user_asm(x,ptr,retval,"w","w","=r",errret);break;	\
-	case 4: __get_user_asm(x,ptr,retval,"l","k","=r",errret);break;	\
-	case 8: __get_user_asm(x,ptr,retval,"q","","=r",errret); break;	\
-	default: __get_user_bad();					\
-	}								\
-} while (0)
-
 #endif /* __X86_64_UACCESS_H */
--- a/xen/test/livepatch/xen_hello_world_func.c
+++ b/xen/test/livepatch/xen_hello_world_func.c
@@ -26,7 +26,7 @@ const char *xen_hello_world(void)
      * Any BUG, or WARN_ON will contain symbol and payload name. Furthermore
      * exceptions will be caught and processed properly.
      */
-    rc = __get_user(tmp, non_canonical_addr);
+    rc = get_unsafe(tmp, non_canonical_addr);
     BUG_ON(rc != -EFAULT);
 #endif
 #if defined(CONFIG_ARM)




* [PATCH 03/17] x86: split __copy_{from,to}_user() into "guest" and "unsafe" variants
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
  2021-01-14 15:03 ` [PATCH 01/17] x86/shadow: use __put_user() instead of __copy_to_user() Jan Beulich
  2021-01-14 15:04 ` [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants Jan Beulich
@ 2021-01-14 15:04 ` Jan Beulich
  2021-02-09 16:06   ` Roger Pau Monné
  2021-01-14 15:04 ` [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse Jan Beulich
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are not. Subsequently
we will want them to have different behavior, so as a first step identify
which one is which. For now, both groups of constructs alias one another.

Double underscore prefixes are retained only on
__copy_{from,to}_guest_pv(), to allow still distinguishing them from
their "checking" counterparts once they also get renamed (to
copy_{from,to}_guest_pv()).

Add previously missing __user at some call sites.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Instead of __copy_{from,to}_guest_pv(), perhaps name them just
__copy_{from,to}_pv()?

--- a/xen/arch/x86/gdbstub.c
+++ b/xen/arch/x86/gdbstub.c
@@ -33,13 +33,13 @@ gdb_arch_signal_num(struct cpu_user_regs
 unsigned int
 gdb_arch_copy_from_user(void *dest, const void *src, unsigned len)
 {
-    return __copy_from_user(dest, src, len);
+    return copy_from_unsafe(dest, src, len);
 }
 
 unsigned int 
 gdb_arch_copy_to_user(void *dest, const void *src, unsigned len)
 {
-    return __copy_to_user(dest, src, len);
+    return copy_to_unsafe(dest, src, len);
 }
 
 void
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2614,7 +2614,7 @@ static int sh_page_fault(struct vcpu *v,
         {
             shadow_l2e_t sl2e;
             mfn_t gl1mfn;
-            if ( (__copy_from_user(&sl2e,
+            if ( (copy_from_unsafe(&sl2e,
                                    (sh_linear_l2_table(v)
                                     + shadow_l2_linear_offset(va)),
                                    sizeof(sl2e)) != 0)
@@ -2633,7 +2633,7 @@ static int sh_page_fault(struct vcpu *v,
 #endif /* SHOPT_OUT_OF_SYNC */
         /* The only reasons for reserved bits to be set in shadow entries
          * are the two "magic" shadow_l1e entries. */
-        if ( likely((__copy_from_user(&sl1e,
+        if ( likely((copy_from_unsafe(&sl1e,
                                       (sh_linear_l1_table(v)
                                        + shadow_l1_linear_offset(va)),
                                       sizeof(sl1e)) == 0)
@@ -3308,10 +3308,10 @@ static bool sh_invlpg(struct vcpu *v, un
                    sh_linear_l4_table(v)[shadow_l4_linear_offset(linear)])
                & _PAGE_PRESENT) )
             return false;
-        /* This must still be a copy-from-user because we don't have the
+        /* This must still be a copy-from-unsafe because we don't have the
          * paging lock, and the higher-level shadows might disappear
          * under our feet. */
-        if ( __copy_from_user(&sl3e, (sh_linear_l3_table(v)
+        if ( copy_from_unsafe(&sl3e, (sh_linear_l3_table(v)
                                       + shadow_l3_linear_offset(linear)),
                               sizeof (sl3e)) != 0 )
         {
@@ -3330,9 +3330,9 @@ static bool sh_invlpg(struct vcpu *v, un
         return false;
 #endif
 
-    /* This must still be a copy-from-user because we don't have the shadow
+    /* This must still be a copy-from-unsafe because we don't have the shadow
      * lock, and the higher-level shadows might disappear under our feet. */
-    if ( __copy_from_user(&sl2e,
+    if ( copy_from_unsafe(&sl2e,
                           sh_linear_l2_table(v) + shadow_l2_linear_offset(linear),
                           sizeof (sl2e)) != 0 )
     {
@@ -3371,11 +3371,11 @@ static bool sh_invlpg(struct vcpu *v, un
              * hold the paging lock yet.  Check again with the lock held. */
             paging_lock(d);
 
-            /* This must still be a copy-from-user because we didn't
+            /* This must still be a copy-from-unsafe because we didn't
              * have the paging lock last time we checked, and the
              * higher-level shadows might have disappeared under our
              * feet. */
-            if ( __copy_from_user(&sl2e,
+            if ( copy_from_unsafe(&sl2e,
                                   sh_linear_l2_table(v)
                                   + shadow_l2_linear_offset(linear),
                                   sizeof (sl2e)) != 0 )
--- a/xen/arch/x86/pv/emul-gate-op.c
+++ b/xen/arch/x86/pv/emul-gate-op.c
@@ -149,12 +149,12 @@ static int read_mem(enum x86_segment seg
 
     addr = (uint32_t)addr;
 
-    if ( (rc = __copy_from_user(p_data, (void *)addr, bytes)) )
+    if ( (rc = __copy_from_guest_pv(p_data, (void __user *)addr, bytes)) )
     {
         /*
          * TODO: This should report PFEC_insn_fetch when goc->insn_fetch &&
          * cpu_has_nx, but we'd then need a "fetch" variant of
-         * __copy_from_user() respecting NX, SMEP, and protection keys.
+         * __copy_from_guest_pv() respecting NX, SMEP, and protection keys.
          */
         x86_emul_pagefault(0, addr + bytes - rc, ctxt);
         return X86EMUL_EXCEPTION;
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -649,7 +649,8 @@ static int rep_ins(uint16_t port,
         if ( rc != X86EMUL_OKAY )
             return rc;
 
-        if ( (rc = __copy_to_user((void *)addr, &data, bytes_per_rep)) != 0 )
+        if ( (rc = __copy_to_guest_pv((void __user *)addr, &data,
+                                      bytes_per_rep)) != 0 )
         {
             x86_emul_pagefault(PFEC_write_access,
                                addr + bytes_per_rep - rc, ctxt);
@@ -716,7 +717,8 @@ static int rep_outs(enum x86_segment seg
         if ( rc != X86EMUL_OKAY )
             return rc;
 
-        if ( (rc = __copy_from_user(&data, (void *)addr, bytes_per_rep)) != 0 )
+        if ( (rc = __copy_from_guest_pv(&data, (void __user *)addr,
+                                        bytes_per_rep)) != 0 )
         {
             x86_emul_pagefault(0, addr + bytes_per_rep - rc, ctxt);
             return X86EMUL_EXCEPTION;
@@ -1253,12 +1255,12 @@ static int insn_fetch(enum x86_segment s
     if ( rc != X86EMUL_OKAY )
         return rc;
 
-    if ( (rc = __copy_from_user(p_data, (void *)addr, bytes)) != 0 )
+    if ( (rc = __copy_from_guest_pv(p_data, (void __user *)addr, bytes)) != 0 )
     {
         /*
          * TODO: This should report PFEC_insn_fetch when goc->insn_fetch &&
          * cpu_has_nx, but we'd then need a "fetch" variant of
-         * __copy_from_user() respecting NX, SMEP, and protection keys.
+         * __copy_from_guest_pv() respecting NX, SMEP, and protection keys.
          */
         x86_emul_pagefault(0, addr + bytes - rc, ctxt);
         return X86EMUL_EXCEPTION;
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -41,7 +41,7 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
         return NULL;
 
     /* Find this l1e and its enclosing l1mfn in the linear map. */
-    if ( __copy_from_user(&l2e,
+    if ( copy_from_unsafe(&l2e,
                           &__linear_l2_table[l2_linear_offset(linear)],
                           sizeof(l2_pgentry_t)) )
         return NULL;
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -22,7 +22,7 @@ static inline l1_pgentry_t guest_get_eff
         toggle_guest_pt(curr);
 
     if ( unlikely(!__addr_ok(linear)) ||
-         __copy_from_user(&l1e,
+         copy_from_unsafe(&l1e,
                           &__linear_l1_table[l1_linear_offset(linear)],
                           sizeof(l1_pgentry_t)) )
         l1e = l1e_empty();
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -43,7 +43,7 @@ static int ptwr_emulated_read(enum x86_s
     unsigned long addr = offset;
 
     if ( !__addr_ok(addr) ||
-         (rc = __copy_from_user(p_data, (void *)addr, bytes)) )
+         (rc = __copy_from_guest_pv(p_data, (void *)addr, bytes)) )
     {
         x86_emul_pagefault(0, addr + bytes - rc, ctxt);  /* Read fault. */
         return X86EMUL_EXCEPTION;
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1097,7 +1097,7 @@ void do_invalid_op(struct cpu_user_regs
     }
 
     if ( !is_active_kernel_text(regs->rip) ||
-         __copy_from_user(bug_insn, eip, sizeof(bug_insn)) ||
+         copy_from_unsafe(bug_insn, eip, sizeof(bug_insn)) ||
          memcmp(bug_insn, "\xf\xb", sizeof(bug_insn)) )
         goto die;
 
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -110,7 +110,7 @@ unsigned __copy_from_user_ll(void *to, c
 unsigned copy_to_user(void __user *to, const void *from, unsigned n)
 {
     if ( access_ok(to, n) )
-        n = __copy_to_user(to, from, n);
+        n = __copy_to_guest_pv(to, from, n);
     return n;
 }
 
@@ -168,7 +168,7 @@ unsigned clear_user(void __user *to, uns
 unsigned copy_from_user(void *to, const void __user *from, unsigned n)
 {
     if ( access_ok(from, n) )
-        n = __copy_from_user(to, from, n);
+        n = __copy_from_guest_pv(to, from, n);
     else
         memset(to, 0, n);
     return n;
--- a/xen/include/asm-x86/guest_access.h
+++ b/xen/include/asm-x86/guest_access.h
@@ -28,11 +28,11 @@
 #define __raw_copy_to_guest(dst, src, len)      \
     (is_hvm_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
-     __copy_to_user((dst), (src), (len)))
+     __copy_to_guest_pv(dst, src, len))
 #define __raw_copy_from_guest(dst, src, len)    \
     (is_hvm_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
-     __copy_from_user((dst), (src), (len)))
+     __copy_from_guest_pv(dst, src, len))
 #define __raw_clear_guest(dst,  len)            \
     (is_hvm_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -197,21 +197,20 @@ do {
 #define get_guest_size get_unsafe_size
 
 /**
- * __copy_to_user: - Copy a block of data into user space, with less checking
- * @to:   Destination address, in user space.
- * @from: Source address, in kernel space.
+ * __copy_to_guest_pv: - Copy a block of data into guest space, with less
+ *                       checking
+ * @to:   Destination address, in guest space.
+ * @from: Source address, in hypervisor space.
  * @n:    Number of bytes to copy.
  *
- * Context: User context only.  This function may sleep.
- *
- * Copy data from kernel space to user space.  Caller must check
+ * Copy data from hypervisor space to guest space.  Caller must check
  * the specified block with access_ok() before calling this function.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
  */
 static always_inline unsigned long
-__copy_to_user(void __user *to, const void *from, unsigned long n)
+__copy_to_guest_pv(void __user *to, const void *from, unsigned long n)
 {
     if (__builtin_constant_p(n)) {
         unsigned long ret;
@@ -233,16 +232,16 @@ __copy_to_user(void __user *to, const vo
     }
     return __copy_to_user_ll(to, from, n);
 }
+#define copy_to_unsafe __copy_to_guest_pv
 
 /**
- * __copy_from_user: - Copy a block of data from user space, with less checking
- * @to:   Destination address, in kernel space.
- * @from: Source address, in user space.
+ * __copy_from_guest_pv: - Copy a block of data from guest space, with less
+ *                         checking
+ * @to:   Destination address, in hypervisor space.
+ * @from: Source address, in guest space.
  * @n:    Number of bytes to copy.
  *
- * Context: User context only.  This function may sleep.
- *
- * Copy data from user space to kernel space.  Caller must check
+ * Copy data from guest space to hypervisor space.  Caller must check
  * the specified block with access_ok() before calling this function.
  *
  * Returns number of bytes that could not be copied.
@@ -252,7 +251,7 @@ __copy_to_user(void __user *to, const vo
  * data to the requested size using zero bytes.
  */
 static always_inline unsigned long
-__copy_from_user(void *to, const void __user *from, unsigned long n)
+__copy_from_guest_pv(void *to, const void __user *from, unsigned long n)
 {
     if (__builtin_constant_p(n)) {
         unsigned long ret;
@@ -274,6 +273,7 @@ __copy_from_user(void *to, const void __
     }
     return __copy_from_user_ll(to, from, n);
 }
+#define copy_from_unsafe __copy_from_guest_pv
 
 /*
  * The exception table consists of pairs of addresses: the first is the




* [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (2 preceding siblings ...)
  2021-01-14 15:04 ` [PATCH 03/17] x86: split __copy_{from,to}_user() " Jan Beulich
@ 2021-01-14 15:04 ` Jan Beulich
  2021-02-09 16:26   ` Roger Pau Monné
  2021-02-12 10:41   ` Roger Pau Monné
  2021-01-14 15:05 ` [PATCH 05/17] x86: rename {get,put}_user() to {get,put}_guest() Jan Beulich
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

Inspired by
https://lore.kernel.org/lkml/f12e7d3cecf41b2c29734ea45a393be21d4a8058.1597848273.git.jpoimboe@redhat.com/
and prior work in that area of x86 Linux, suppress speculation through
guest-specified pointer values by suitably masking the addresses to
non-canonical space in case they fall into Xen's virtual address range.
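
In C terms the masking is roughly equivalent to the sketch below
(illustrative only: the constant stands in for HYPERVISOR_VIRT_END, and
the real sequence avoids a conditional branch so that speculation cannot
bypass the mask):

    #include <stdint.h>

    #define ILLUS_XEN_VIRT_END 0xffff880000000000UL /* assumed value */

    static inline void *mask_guest_ptr(void *p)
    {
        uintptr_t addr = (uintptr_t)p;
        /*
         * Addresses at or above the end of Xen's range (used by 64-bit PV
         * guest kernels) pass through unchanged; anything below has bit 63
         * cleared, turning values inside Xen's range into non-canonical
         * ones, while ordinary guest user addresses (bit 63 already clear)
         * are unaffected.
         */
        uintptr_t mask = addr >= ILLUS_XEN_VIRT_END ? ~0UL : ~0UL >> 1;

        return (void *)(addr & mask);
    }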

Introduce a new Kconfig control.

Note that it is necessary in such code to avoid using "m" kind operands:
If we didn't, there would be no guarantee that the register passed to
guest_access_mask_ptr is also the (base) one used for the memory access.
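
A minimal stand-alone illustration of the resulting constraint pattern
(hypothetical helper, not taken from the patch): the template dereferences
the register operand itself, so whatever masking was applied to that
register necessarily governs the access, while the "m" input merely tells
the compiler that the memory is read:

    static inline unsigned long load_via_reg(const unsigned long *p)
    {
        unsigned long val;

        asm volatile ( "mov (%[ptr]), %[val]"
                       : [val] "=r" (val)
                       : [ptr] "r" (p), "m" (*p) );

        return val;
    }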

As a minor unrelated change in get_unsafe_asm() the unnecessary "itype"
parameter gets dropped and the XOR on the fixup path gets changed to be
a 32-bit one in all cases: This way we avoid pointless REX.W or operand
size overrides, or writes to partial registers.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
The insn sequence chosen is certainly up for discussion; I've picked
this one despite the RCR because alternatives I could come up with,
like

	mov	$(HYPERVISOR_VIRT_END), %rax
	mov	$~0, %rdx
	mov	$0x7fffffffffffffff, %rcx
	cmp	%rax, %rdi
	cmovb	%rcx, %rdx
	and	%rdx, %rdi

weren't necessarily better: Either, as above, they are longer and
require a 3rd scratch register, or they also utilize the carry flag in
some similar way.
---
Judging from the comment ahead of put_unsafe_asm() we might as well no
longer tell gcc about the memory access there at all, now that the
operand is no longer used in the assembly code.

--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -10,12 +10,19 @@
 #include <xen/sched.h>
 #include <asm/uaccess.h>
 
-unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
+#ifndef GUARD
+# define GUARD UA_KEEP
+#endif
+
+unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
 {
     unsigned dummy;
 
     stac();
     asm volatile (
+        GUARD(
+        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
+        )
         "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
         "    jbe  1f\n"
         "    mov  %k[to], %[cnt]\n"
@@ -42,6 +49,7 @@ unsigned __copy_to_user_ll(void __user *
         _ASM_EXTABLE(1b, 2b)
         : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
           [aux] "=&r" (dummy)
+          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
         : "[aux]" (n)
         : "memory" );
     clac();
@@ -49,12 +57,15 @@ unsigned __copy_to_user_ll(void __user *
     return n;
 }
 
-unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n)
+unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n)
 {
     unsigned dummy;
 
     stac();
     asm volatile (
+        GUARD(
+        "    guest_access_mask_ptr %[from], %q[scratch1], %q[scratch2]\n"
+        )
         "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
         "    jbe  1f\n"
         "    mov  %k[to], %[cnt]\n"
@@ -87,6 +98,7 @@ unsigned __copy_from_user_ll(void *to, c
         _ASM_EXTABLE(1b, 6b)
         : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
           [aux] "=&r" (dummy)
+          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
         : "[aux]" (n)
         : "memory" );
     clac();
@@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
     return n;
 }
 
+#if GUARD(1) + 0
+
 /**
  * copy_to_user: - Copy a block of data into user space.
  * @to:   Destination address, in user space.
@@ -128,8 +142,11 @@ unsigned clear_user(void __user *to, uns
 {
     if ( access_ok(to, n) )
     {
+        long dummy;
+
         stac();
         asm volatile (
+            "    guest_access_mask_ptr %[to], %[scratch1], %[scratch2]\n"
             "0:  rep stos"__OS"\n"
             "    mov  %[bytes], %[cnt]\n"
             "1:  rep stosb\n"
@@ -140,7 +157,8 @@ unsigned clear_user(void __user *to, uns
             ".previous\n"
             _ASM_EXTABLE(0b,3b)
             _ASM_EXTABLE(1b,2b)
-            : [cnt] "=&c" (n), [to] "+D" (to)
+            : [cnt] "=&c" (n), [to] "+D" (to), [scratch1] "=&r" (dummy),
+              [scratch2] "=&r" (dummy)
             : [bytes] "r" (n & (BYTES_PER_LONG - 1)),
               [longs] "0" (n / BYTES_PER_LONG), "a" (0) );
         clac();
@@ -174,6 +192,16 @@ unsigned copy_from_user(void *to, const
     return n;
 }
 
+# undef GUARD
+# define GUARD UA_DROP
+# define copy_to_guest_ll copy_to_unsafe_ll
+# define copy_from_guest_ll copy_from_unsafe_ll
+# undef __user
+# define __user
+# include __FILE__
+
+#endif /* GUARD(1) */
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -446,6 +446,8 @@ UNLIKELY_START(g, create_bounce_frame_ba
         jmp   asm_domain_crash_synchronous  /* Does not return */
 __UNLIKELY_END(create_bounce_frame_bad_sp)
 
+        guest_access_mask_ptr %rsi, %rax, %rcx
+
 #define STORE_GUEST_STACK(reg, n) \
 0:      movq  %reg,(n)*8(%rsi); \
         _ASM_EXTABLE(0b, domain_crash_page_fault_ ## n ## x8)
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -114,6 +114,24 @@ config SPECULATIVE_HARDEN_BRANCH
 
 	  If unsure, say Y.
 
+config SPECULATIVE_HARDEN_GUEST_ACCESS
+	bool "Speculative PV Guest Memory Access Hardening"
+	default y
+	depends on PV
+	help
+	  Contemporary processors may use speculative execution as a
+	  performance optimisation, but this can potentially be abused by an
+	  attacker to leak data via speculative sidechannels.
+
+	  One source of data leakage is via speculative accesses to hypervisor
+	  memory through guest controlled values used to access guest memory.
+
+	  When enabled, code paths accessing PV guest memory will have guest
+	  controlled addresses massaged such that memory accesses through them
+	  won't touch hypervisor address space.
+
+	  If unsure, say Y.
+
 endmenu
 
 config HYPFS
--- a/xen/include/asm-x86/asm-defns.h
+++ b/xen/include/asm-x86/asm-defns.h
@@ -44,3 +44,16 @@
 .macro INDIRECT_JMP arg:req
     INDIRECT_BRANCH jmp \arg
 .endm
+
+.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
+#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
+    mov $(HYPERVISOR_VIRT_END - 1), \scratch1
+    mov $~0, \scratch2
+    cmp \ptr, \scratch1
+    rcr $1, \scratch2
+    and \scratch2, \ptr
+#elif defined(CONFIG_DEBUG) && defined(CONFIG_PV)
+    xor $~\@, \scratch1
+    xor $~\@, \scratch2
+#endif
+.endm
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -13,13 +13,19 @@
 unsigned copy_to_user(void *to, const void *from, unsigned len);
 unsigned clear_user(void *to, unsigned len);
 unsigned copy_from_user(void *to, const void *from, unsigned len);
+
 /* Handles exceptions in both to and from, but doesn't do access_ok */
-unsigned __copy_to_user_ll(void __user*to, const void *from, unsigned n);
-unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n);
+unsigned int copy_to_guest_ll(void __user*to, const void *from, unsigned int n);
+unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n);
+unsigned int copy_to_unsafe_ll(void *to, const void *from, unsigned int n);
+unsigned int copy_from_unsafe_ll(void *to, const void *from, unsigned int n);
 
 extern long __get_user_bad(void);
 extern void __put_user_bad(void);
 
+#define UA_KEEP(args...) args
+#define UA_DROP(args...)
+
 /**
  * get_user: - Get a simple variable from user space.
  * @x:   Variable to store result.
@@ -78,7 +84,6 @@ extern void __put_user_bad(void);
  * On error, the variable @x is set to zero.
  */
 #define __get_guest(x, ptr) get_guest_nocheck(x, ptr, sizeof(*(ptr)))
-#define get_unsafe __get_guest
 
 /**
  * __put_guest: - Write a simple value into guest space, with less checking.
@@ -99,7 +104,13 @@ extern void __put_user_bad(void);
  */
 #define __put_guest(x, ptr) \
     put_guest_nocheck((__typeof__(*(ptr)))(x), ptr, sizeof(*(ptr)))
-#define put_unsafe __put_guest
+
+#define put_unsafe(x, ptr)						\
+({									\
+	int err_; 							\
+	put_unsafe_size(x, ptr, sizeof(*(ptr)), UA_DROP, err_, -EFAULT);\
+	err_;								\
+})
 
 #define put_guest_nocheck(x, ptr, size)					\
 ({									\
@@ -116,6 +127,13 @@ extern void __put_user_bad(void);
 			       : -EFAULT;				\
 })
 
+#define get_unsafe(x, ptr)						\
+({									\
+	int err_; 							\
+	get_unsafe_size(x, ptr, sizeof(*(ptr)), UA_DROP, err_, -EFAULT);\
+	err_;								\
+})
+
 #define get_guest_nocheck(x, ptr, size)					\
 ({									\
 	int err_; 							\
@@ -139,62 +157,87 @@ struct __large_struct { unsigned long bu
  * we do not write to any memory gcc knows about, so there are no
  * aliasing issues.
  */
-#define put_unsafe_asm(x, addr, err, itype, rtype, ltype, errret)	\
+#define put_unsafe_asm(x, addr, GUARD, err, itype, rtype, ltype, errret) \
 	stac();								\
 	__asm__ __volatile__(						\
-		"1:	mov"itype" %"rtype"1,%2\n"			\
+		GUARD(							\
+		"	guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
+		)							\
+		"1:	mov"itype" %"rtype"[val], (%[ptr])\n"		\
 		"2:\n"							\
 		".section .fixup,\"ax\"\n"				\
-		"3:	mov %3,%0\n"					\
+		"3:	mov %[errno], %[ret]\n"				\
 		"	jmp 2b\n"					\
 		".previous\n"						\
 		_ASM_EXTABLE(1b, 3b)					\
-		: "=r"(err)						\
-		: ltype (x), "m"(__m(addr)), "i"(errret), "0"(err));	\
+		: [ret] "+r" (err), [ptr] "=&r" (dummy_)		\
+		  GUARD(, [scr1] "=&r" (dummy_), [scr2] "=&r" (dummy_))	\
+		: [val] ltype (x), "m" (__m(addr)),			\
+		  "[ptr]" (addr), [errno] "i" (errret));		\
 	clac()
 
-#define get_unsafe_asm(x, addr, err, itype, rtype, ltype, errret)	\
+#define get_unsafe_asm(x, addr, GUARD, err, rtype, ltype, errret)	\
 	stac();								\
 	__asm__ __volatile__(						\
-		"1:	mov"itype" %2,%"rtype"1\n"			\
+		GUARD(							\
+		"	guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
+		)							\
+		"1:	mov (%[ptr]), %"rtype"[val]\n"			\
 		"2:\n"							\
 		".section .fixup,\"ax\"\n"				\
-		"3:	mov %3,%0\n"					\
-		"	xor"itype" %"rtype"1,%"rtype"1\n"		\
+		"3:	mov %[errno], %[ret]\n"				\
+		"	xor %k[val], %k[val]\n"				\
 		"	jmp 2b\n"					\
 		".previous\n"						\
 		_ASM_EXTABLE(1b, 3b)					\
-		: "=r"(err), ltype (x)					\
-		: "m"(__m(addr)), "i"(errret), "0"(err));		\
+		: [ret] "+r" (err), [val] ltype (x),			\
+		  [ptr] "=&r" (dummy_)					\
+		  GUARD(, [scr1] "=&r" (dummy_), [scr2] "=&r" (dummy_))	\
+		: "m" (__m(addr)), "[ptr]" (addr),			\
+		  [errno] "i" (errret));				\
 	clac()
 
-#define put_unsafe_size(x, ptr, size, retval, errret)                      \
+#define put_unsafe_size(x, ptr, size, grd, retval, errret)                 \
 do {                                                                       \
     retval = 0;                                                            \
     switch ( size )                                                        \
     {                                                                      \
-    case 1: put_unsafe_asm(x, ptr, retval, "b", "b", "iq", errret); break; \
-    case 2: put_unsafe_asm(x, ptr, retval, "w", "w", "ir", errret); break; \
-    case 4: put_unsafe_asm(x, ptr, retval, "l", "k", "ir", errret); break; \
-    case 8: put_unsafe_asm(x, ptr, retval, "q",  "", "ir", errret); break; \
+    long dummy_;                                                           \
+    case 1:                                                                \
+        put_unsafe_asm(x, ptr, grd, retval, "b", "b", "iq", errret);       \
+        break;                                                             \
+    case 2:                                                                \
+        put_unsafe_asm(x, ptr, grd, retval, "w", "w", "ir", errret);       \
+        break;                                                             \
+    case 4:                                                                \
+        put_unsafe_asm(x, ptr, grd, retval, "l", "k", "ir", errret);       \
+        break;                                                             \
+    case 8:                                                                \
+        put_unsafe_asm(x, ptr, grd, retval, "q",  "", "ir", errret);       \
+        break;                                                             \
     default: __put_user_bad();                                             \
     }                                                                      \
 } while ( false )
-#define put_guest_size put_unsafe_size
 
-#define get_unsafe_size(x, ptr, size, retval, errret)                      \
+#define put_guest_size(x, ptr, size, retval, errret) \
+    put_unsafe_size(x, ptr, size, UA_KEEP, retval, errret)
+
+#define get_unsafe_size(x, ptr, size, grd, retval, errret)                 \
 do {                                                                       \
     retval = 0;                                                            \
     switch ( size )                                                        \
     {                                                                      \
-    case 1: get_unsafe_asm(x, ptr, retval, "b", "b", "=q", errret); break; \
-    case 2: get_unsafe_asm(x, ptr, retval, "w", "w", "=r", errret); break; \
-    case 4: get_unsafe_asm(x, ptr, retval, "l", "k", "=r", errret); break; \
-    case 8: get_unsafe_asm(x, ptr, retval, "q",  "", "=r", errret); break; \
+    long dummy_;                                                           \
+    case 1: get_unsafe_asm(x, ptr, grd, retval, "b", "=q", errret); break; \
+    case 2: get_unsafe_asm(x, ptr, grd, retval, "w", "=r", errret); break; \
+    case 4: get_unsafe_asm(x, ptr, grd, retval, "k", "=r", errret); break; \
+    case 8: get_unsafe_asm(x, ptr, grd, retval,  "", "=r", errret); break; \
     default: __get_user_bad();                                             \
     }                                                                      \
 } while ( false )
-#define get_guest_size get_unsafe_size
+
+#define get_guest_size(x, ptr, size, retval, errret)                       \
+    get_unsafe_size(x, ptr, size, UA_KEEP, retval, errret)
 
 /**
  * __copy_to_guest_pv: - Copy a block of data into guest space, with less
@@ -230,9 +273,8 @@ __copy_to_guest_pv(void __user *to, cons
             return ret;
         }
     }
-    return __copy_to_user_ll(to, from, n);
+    return copy_to_guest_ll(to, from, n);
 }
-#define copy_to_unsafe __copy_to_guest_pv
 
 /**
  * __copy_from_guest_pv: - Copy a block of data from guest space, with less
@@ -271,9 +313,87 @@ __copy_from_guest_pv(void *to, const voi
             return ret;
         }
     }
-    return __copy_from_user_ll(to, from, n);
+    return copy_from_guest_ll(to, from, n);
+}
+
+/**
+ * copy_to_unsafe: - Copy a block of data to unsafe space, with exception
+ *                   checking
+ * @to:   Unsafe destination address.
+ * @from: Safe source address, in hypervisor space.
+ * @n:    Number of bytes to copy.
+ *
+ * Copy data from hypervisor space to a potentially unmapped area.
+ *
+ * Returns number of bytes that could not be copied.
+ * On success, this will be zero.
+ */
+static always_inline unsigned int
+copy_to_unsafe(void __user *to, const void *from, unsigned int n)
+{
+    if (__builtin_constant_p(n)) {
+        unsigned long ret;
+
+        switch (n) {
+        case 1:
+            put_unsafe_size(*(const uint8_t *)from, to, 1, UA_DROP, ret, 1);
+            return ret;
+        case 2:
+            put_unsafe_size(*(const uint16_t *)from, to, 2, UA_DROP, ret, 2);
+            return ret;
+        case 4:
+            put_unsafe_size(*(const uint32_t *)from, to, 4, UA_DROP, ret, 4);
+            return ret;
+        case 8:
+            put_unsafe_size(*(const uint64_t *)from, to, 8, UA_DROP, ret, 8);
+            return ret;
+        }
+    }
+
+    return copy_to_unsafe_ll(to, from, n);
+}
+
+/**
+ * copy_from_unsafe: - Copy a block of data from unsafe space, with exception
+ *                     checking
+ * @to:   Safe destination address, in hypervisor space.
+ * @from: Unsafe source address.
+ * @n:    Number of bytes to copy.
+ *
+ * Copy data from a potentially unmapped area space to hypervisor space.
+ *
+ * Returns number of bytes that could not be copied.
+ * On success, this will be zero.
+ *
+ * If some data could not be copied, this function will pad the copied
+ * data to the requested size using zero bytes.
+ */
+static always_inline unsigned int
+copy_from_unsafe(void *to, const void __user *from, unsigned int n)
+{
+    if ( __builtin_constant_p(n) )
+    {
+        unsigned long ret;
+
+        switch ( n )
+        {
+        case 1:
+            get_unsafe_size(*(uint8_t *)to, from, 1, UA_DROP, ret, 1);
+            return ret;
+        case 2:
+            get_unsafe_size(*(uint16_t *)to, from, 2, UA_DROP, ret, 2);
+            return ret;
+        case 4:
+            get_unsafe_size(*(uint32_t *)to, from, 4, UA_DROP, ret, 4);
+            return ret;
+        case 8:
+            get_unsafe_size(*(uint64_t *)to, from, 8, UA_DROP, ret, 8);
+            return ret;
+        }
+    }
+
+    return copy_from_unsafe_ll(to, from, n);
 }
-#define copy_from_unsafe __copy_from_guest_pv
 
 /*
  * The exception table consists of pairs of addresses: the first is the




* [PATCH 05/17] x86: rename {get,put}_user() to {get,put}_guest()
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (3 preceding siblings ...)
  2021-01-14 15:04 ` [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse Jan Beulich
@ 2021-01-14 15:05 ` Jan Beulich
  2021-01-14 15:05 ` [PATCH 06/17] x86/gdbsx: convert "user" to "guest" accesses Jan Beulich
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

Bring them (back) in line with __{get,put}_guest().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1626,19 +1626,19 @@ static void load_segments(struct vcpu *n
 
             if ( !ring_1(regs) )
             {
-                ret  = put_user(regs->ss,       esp-1);
-                ret |= put_user(regs->esp,      esp-2);
+                ret  = put_guest(regs->ss,  esp - 1);
+                ret |= put_guest(regs->esp, esp - 2);
                 esp -= 2;
             }
 
             if ( ret |
-                 put_user(rflags,              esp-1) |
-                 put_user(cs_and_mask,         esp-2) |
-                 put_user(regs->eip,           esp-3) |
-                 put_user(uregs->gs,           esp-4) |
-                 put_user(uregs->fs,           esp-5) |
-                 put_user(uregs->es,           esp-6) |
-                 put_user(uregs->ds,           esp-7) )
+                 put_guest(rflags,      esp - 1) |
+                 put_guest(cs_and_mask, esp - 2) |
+                 put_guest(regs->eip,   esp - 3) |
+                 put_guest(uregs->gs,   esp - 4) |
+                 put_guest(uregs->fs,   esp - 5) |
+                 put_guest(uregs->es,   esp - 6) |
+                 put_guest(uregs->ds,   esp - 7) )
             {
                 gprintk(XENLOG_ERR,
                         "error while creating compat failsafe callback frame\n");
@@ -1667,17 +1667,17 @@ static void load_segments(struct vcpu *n
         cs_and_mask = (unsigned long)regs->cs |
             ((unsigned long)vcpu_info(n, evtchn_upcall_mask) << 32);
 
-        if ( put_user(regs->ss,            rsp- 1) |
-             put_user(regs->rsp,           rsp- 2) |
-             put_user(rflags,              rsp- 3) |
-             put_user(cs_and_mask,         rsp- 4) |
-             put_user(regs->rip,           rsp- 5) |
-             put_user(uregs->gs,           rsp- 6) |
-             put_user(uregs->fs,           rsp- 7) |
-             put_user(uregs->es,           rsp- 8) |
-             put_user(uregs->ds,           rsp- 9) |
-             put_user(regs->r11,           rsp-10) |
-             put_user(regs->rcx,           rsp-11) )
+        if ( put_guest(regs->ss,    rsp -  1) |
+             put_guest(regs->rsp,   rsp -  2) |
+             put_guest(rflags,      rsp -  3) |
+             put_guest(cs_and_mask, rsp -  4) |
+             put_guest(regs->rip,   rsp -  5) |
+             put_guest(uregs->gs,   rsp -  6) |
+             put_guest(uregs->fs,   rsp -  7) |
+             put_guest(uregs->es,   rsp -  8) |
+             put_guest(uregs->ds,   rsp -  9) |
+             put_guest(regs->r11,   rsp - 10) |
+             put_guest(regs->rcx,   rsp - 11) )
         {
             gprintk(XENLOG_ERR,
                     "error while creating failsafe callback frame\n");
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -27,14 +27,12 @@ extern void __put_user_bad(void);
 #define UA_DROP(args...)
 
 /**
- * get_user: - Get a simple variable from user space.
+ * get_guest: - Get a simple variable from guest space.
  * @x:   Variable to store result.
- * @ptr: Source address, in user space.
- *
- * Context: User context only.  This function may sleep.
+ * @ptr: Source address, in guest space.
  *
- * This macro copies a single simple variable from user space to kernel
- * space.  It supports simple types like char and int, but not larger
+ * This macro loads a single simple variable from guest space.
+ * It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and the result of
@@ -43,18 +41,15 @@ extern void __put_user_bad(void);
  * Returns zero on success, or -EFAULT on error.
  * On error, the variable @x is set to zero.
  */
-#define get_user(x,ptr)	\
-  __get_user_check((x),(ptr),sizeof(*(ptr)))
+#define get_guest(x, ptr) get_guest_check(x, ptr, sizeof(*(ptr)))
 
 /**
- * put_user: - Write a simple value into user space.
- * @x:   Value to copy to user space.
- * @ptr: Destination address, in user space.
- *
- * Context: User context only.  This function may sleep.
+ * put_guest: - Write a simple value into guest space.
+ * @x:   Value to store in guest space.
+ * @ptr: Destination address, in guest space.
  *
- * This macro copies a single simple value from kernel space to user
- * space.  It supports simple types like char and int, but not larger
+ * This macro stores a single simple value into guest space.
+ * It supports simple types like char and int, but not larger
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and @x must be assignable
@@ -62,8 +57,8 @@ extern void __put_user_bad(void);
  *
  * Returns zero on success, or -EFAULT on error.
  */
-#define put_user(x,ptr)							\
-  __put_user_check((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr)))
+#define put_guest(x, ptr) \
+    put_guest_check((__typeof__(*(ptr)))(x), ptr, sizeof(*(ptr)))
 
 /**
  * __get_guest: - Get a simple variable from guest space, with less checking.
@@ -119,7 +114,7 @@ extern void __put_user_bad(void);
 	err_;								\
 })
 
-#define __put_user_check(x, ptr, size)					\
+#define put_guest_check(x, ptr, size)					\
 ({									\
 	__typeof__(*(ptr)) __user *ptr_ = (ptr);			\
 	__typeof__(size) size_ = (size);				\
@@ -141,7 +136,7 @@ extern void __put_user_bad(void);
 	err_;								\
 })
 
-#define __get_user_check(x, ptr, size)					\
+#define get_guest_check(x, ptr, size)					\
 ({									\
 	__typeof__(*(ptr)) __user *ptr_ = (ptr);			\
 	__typeof__(size) size_ = (size);				\
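
For illustration, a minimal sketch of how the renamed accessors are used
(helper name invented, and the pointer assumed to be a PV guest virtual
address that has passed the usual checks):

    /* Illustrative only: read a guest word and write it back incremented.
     * Both macros return 0 on success or -EFAULT if the access faults. */
    static int bump_guest_word(unsigned long __user *gaddr)
    {
        unsigned long val;

        if ( get_guest(val, gaddr) )
            return -EFAULT;

        return put_guest(val + 1, gaddr);
    }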



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 06/17] x86/gdbsx: convert "user" to "guest" accesses
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (4 preceding siblings ...)
  2021-01-14 15:05 ` [PATCH 05/17] x86: rename {get,put}_user() to {get,put}_guest() Jan Beulich
@ 2021-01-14 15:05 ` Jan Beulich
  2021-01-14 15:06 ` [PATCH 07/17] x86: rename copy_{from,to}_user() to copy_{from,to}_guest_pv() Jan Beulich
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné,
	Tim Deegan, George Dunlap, Elena Ufimtseva

Using copy_{from,to}_user(), this code assumed it would be called only by
PV guests. Use copy_{from,to}_guest() instead, transforming the incoming
structure field into a guest handle (the field should really have been
one in the first place). Also do not transform the debuggee address into
a pointer.

As a side effect that wasn't originally intended, this also fixes a bug
in dbg_rw_guest_mem(): At the end of the loop "addr" was incremented, but
in the next loop iteration (the variable having gone out of scope in
between) it was re-initialized from the function parameter.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -108,14 +108,14 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
 }
 
 /* Returns: number of bytes remaining to be copied */
-static unsigned int dbg_rw_guest_mem(struct domain *dp, void * __user gaddr,
-                                     void * __user buf, unsigned int len,
-                                     bool toaddr, uint64_t pgd3)
+static unsigned int dbg_rw_guest_mem(struct domain *dp, unsigned long addr,
+                                     XEN_GUEST_HANDLE_PARAM(void) buf,
+                                     unsigned int len, bool toaddr,
+                                     uint64_t pgd3)
 {
     while ( len > 0 )
     {
         char *va;
-        unsigned long addr = (unsigned long)gaddr;
         mfn_t mfn;
         gfn_t gfn = INVALID_GFN;
         unsigned long pagecnt;
@@ -133,20 +133,18 @@ static unsigned int dbg_rw_guest_mem(str
 
         if ( toaddr )
         {
-            copy_from_user(va, buf, pagecnt);    /* va = buf */
+            copy_from_guest(va, buf, pagecnt);
             paging_mark_dirty(dp, mfn);
         }
         else
-        {
-            copy_to_user(buf, va, pagecnt);    /* buf = va */
-        }
+            copy_to_guest(buf, va, pagecnt);
 
         unmap_domain_page(va);
         if ( !gfn_eq(gfn, INVALID_GFN) )
             put_gfn(dp, gfn_x(gfn));
 
         addr += pagecnt;
-        buf += pagecnt;
+        guest_handle_add_offset(buf, pagecnt);
         len -= pagecnt;
     }
 
@@ -160,7 +158,7 @@ static unsigned int dbg_rw_guest_mem(str
  * pgd3: value of init_mm.pgd[3] in guest. see above.
  * Returns: number of bytes remaining to be copied.
  */
-unsigned int dbg_rw_mem(void * __user addr, void * __user buf,
+unsigned int dbg_rw_mem(unsigned long gva, XEN_GUEST_HANDLE_PARAM(void) buf,
                         unsigned int len, domid_t domid, bool toaddr,
                         uint64_t pgd3)
 {
@@ -169,7 +167,7 @@ unsigned int dbg_rw_mem(void * __user ad
     if ( d )
     {
         if ( !d->is_dying )
-            len = dbg_rw_guest_mem(d, addr, buf, len, toaddr, pgd3);
+            len = dbg_rw_guest_mem(d, gva, buf, len, toaddr, pgd3);
         rcu_unlock_domain(d);
     }
 
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -40,10 +40,8 @@
 #ifdef CONFIG_GDBSX
 static int gdbsx_guest_mem_io(domid_t domid, struct xen_domctl_gdbsx_memio *iop)
 {
-    void * __user gva = (void *)iop->gva, * __user uva = (void *)iop->uva;
-
-    iop->remain = dbg_rw_mem(gva, uva, iop->len, domid,
-                             !!iop->gwr, iop->pgd3val);
+    iop->remain = dbg_rw_mem(iop->gva, guest_handle_from_ptr(iop->uva, void),
+                             iop->len, domid, iop->gwr, iop->pgd3val);
 
     return iop->remain ? -EFAULT : 0;
 }
--- a/xen/include/asm-x86/debugger.h
+++ b/xen/include/asm-x86/debugger.h
@@ -93,9 +93,9 @@ static inline bool debugger_trap_entry(
 #endif
 
 #ifdef CONFIG_GDBSX
-unsigned int dbg_rw_mem(void * __user addr, void * __user buf,
+unsigned int dbg_rw_mem(unsigned long gva, XEN_GUEST_HANDLE_PARAM(void) buf,
                         unsigned int len, domid_t domid, bool toaddr,
-                        uint64_t pgd3);
+                        unsigned long pgd3);
 #endif
 
 #endif /* __X86_DEBUGGER_H__ */
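
For illustration, the guest-handle pattern used above in isolation
(helper name invented); copy_to_guest() dispatches to the PV or HVM copy
routine for the current vCPU, and guest_handle_add_offset() advances the
handle:

    /* Illustrative only: copy 'len' bytes out to a guest handle in fixed
     * size chunks, much like dbg_rw_guest_mem() does per page above. */
    static int demo_copy_out(XEN_GUEST_HANDLE_PARAM(void) buf,
                             char *src, unsigned int len)
    {
        while ( len )
        {
            unsigned int chunk = len < 128 ? len : 128;

            if ( copy_to_guest(buf, src, chunk) )
                return -EFAULT;

            guest_handle_add_offset(buf, chunk);
            src += chunk;
            len -= chunk;
        }

        return 0;
    }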



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 07/17] x86: rename copy_{from,to}_user() to copy_{from,to}_guest_pv()
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (5 preceding siblings ...)
  2021-01-14 15:05 ` [PATCH 06/17] x86/gdbsx: convert "user" to "guest" accesses Jan Beulich
@ 2021-01-14 15:06 ` Jan Beulich
  2021-01-14 15:07 ` [PATCH 08/17] x86: move stac()/clac() from {get,put}_unsafe_asm() Jan Beulich
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

Bring them (back) in line with __copy_{from,to}_guest_pv(). Since it
falls in the same group, also convert clear_user(). Instead of adjusting
__raw_clear_guest(), drop it - it's unused and would require a non-
checking __clear_guest_pv() which we don't have.

Add previously missing __user at some call sites and in the function
declarations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/pv/emul-inv-op.c
+++ b/xen/arch/x86/pv/emul-inv-op.c
@@ -33,7 +33,7 @@ static int emulate_forced_invalid_op(str
     eip = regs->rip;
 
     /* Check for forced emulation signature: ud2 ; .ascii "xen". */
-    if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
+    if ( (rc = copy_from_guest_pv(sig, (char __user *)eip, sizeof(sig))) != 0 )
     {
         pv_inject_page_fault(0, eip + sizeof(sig) - rc);
         return EXCRET_fault_fixed;
@@ -43,7 +43,8 @@ static int emulate_forced_invalid_op(str
     eip += sizeof(sig);
 
     /* We only emulate CPUID. */
-    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
+    if ( (rc = copy_from_guest_pv(instr, (char __user *)eip,
+                                  sizeof(instr))) != 0 )
     {
         pv_inject_page_fault(0, eip + sizeof(instr) - rc);
         return EXCRET_fault_fixed;
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -54,8 +54,8 @@ unsigned long do_iret(void)
     struct iret_context iret_saved;
     struct vcpu *v = current;
 
-    if ( unlikely(copy_from_user(&iret_saved, (void *)regs->rsp,
-                                 sizeof(iret_saved))) )
+    if ( unlikely(copy_from_guest_pv(&iret_saved, (void __user *)regs->rsp,
+                                     sizeof(iret_saved))) )
     {
         gprintk(XENLOG_ERR,
                 "Fault while reading IRET context from guest stack\n");
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -90,7 +90,8 @@ static int ptwr_emulated_update(unsigned
 
         /* Align address; read full word. */
         addr &= ~(sizeof(full) - 1);
-        if ( (rc = copy_from_user(&full, (void *)addr, sizeof(full))) != 0 )
+        if ( (rc = copy_from_guest_pv(&full, (void __user *)addr,
+                                      sizeof(full))) != 0 )
         {
             x86_emul_pagefault(0, /* Read fault. */
                                addr + sizeof(full) - rc,
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -109,19 +109,17 @@ unsigned int copy_from_guest_ll(void *to
 #if GUARD(1) + 0
 
 /**
- * copy_to_user: - Copy a block of data into user space.
- * @to:   Destination address, in user space.
- * @from: Source address, in kernel space.
+ * copy_to_guest_pv: - Copy a block of data into guest space.
+ * @to:   Destination address, in guest space.
+ * @from: Source address, in hypervisor space.
  * @n:    Number of bytes to copy.
  *
- * Context: User context only.  This function may sleep.
- *
- * Copy data from kernel space to user space.
+ * Copy data from hypervisor space to guest space.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
  */
-unsigned copy_to_user(void __user *to, const void *from, unsigned n)
+unsigned int copy_to_guest_pv(void __user *to, const void *from, unsigned int n)
 {
     if ( access_ok(to, n) )
         n = __copy_to_guest_pv(to, from, n);
@@ -129,16 +127,16 @@ unsigned copy_to_user(void __user *to, c
 }
 
 /**
- * clear_user: - Zero a block of memory in user space.
- * @to:   Destination address, in user space.
+ * clear_guest_pv: - Zero a block of memory in guest space.
+ * @to:   Destination address, in guest space.
  * @n:    Number of bytes to zero.
  *
- * Zero a block of memory in user space.
+ * Zero a block of memory in guest space.
  *
  * Returns number of bytes that could not be cleared.
  * On success, this will be zero.
  */
-unsigned clear_user(void __user *to, unsigned n)
+unsigned int clear_guest_pv(void __user *to, unsigned int n)
 {
     if ( access_ok(to, n) )
     {
@@ -168,14 +166,12 @@ unsigned clear_user(void __user *to, uns
 }
 
 /**
- * copy_from_user: - Copy a block of data from user space.
- * @to:   Destination address, in kernel space.
- * @from: Source address, in user space.
+ * copy_from_guest_pv: - Copy a block of data from guest space.
+ * @to:   Destination address, in hypervisor space.
+ * @from: Source address, in guest space.
  * @n:    Number of bytes to copy.
  *
- * Context: User context only.  This function may sleep.
- *
- * Copy data from user space to kernel space.
+ * Copy data from guest space to hypervisor space.
  *
  * Returns number of bytes that could not be copied.
  * On success, this will be zero.
@@ -183,7 +179,8 @@ unsigned clear_user(void __user *to, uns
  * If some data could not be copied, this function will pad the copied
  * data to the requested size using zero bytes.
  */
-unsigned copy_from_user(void *to, const void __user *from, unsigned n)
+unsigned int copy_from_guest_pv(void *to, const void __user *from,
+                                unsigned int n)
 {
     if ( access_ok(from, n) )
         n = __copy_from_guest_pv(to, from, n);
--- a/xen/include/asm-x86/guest_access.h
+++ b/xen/include/asm-x86/guest_access.h
@@ -16,15 +16,15 @@
 #define raw_copy_to_guest(dst, src, len)        \
     (is_hvm_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
-     copy_to_user((dst), (src), (len)))
+     copy_to_guest_pv(dst, src, len))
 #define raw_copy_from_guest(dst, src, len)      \
     (is_hvm_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
-     copy_from_user((dst), (src), (len)))
+     copy_from_guest_pv(dst, src, len))
 #define raw_clear_guest(dst,  len)              \
     (is_hvm_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
-     clear_user((dst), (len)))
+     clear_guest_pv(dst, len))
 #define __raw_copy_to_guest(dst, src, len)      \
     (is_hvm_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
@@ -33,10 +33,6 @@
     (is_hvm_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      __copy_from_guest_pv(dst, src, len))
-#define __raw_clear_guest(dst,  len)            \
-    (is_hvm_vcpu(current) ?                     \
-     clear_user_hvm((dst), (len)) :             \
-     clear_user((dst), (len)))
 
 /*
  * Pre-validate a guest handle.
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -10,9 +10,11 @@
 
 #include <asm/x86_64/uaccess.h>
 
-unsigned copy_to_user(void *to, const void *from, unsigned len);
-unsigned clear_user(void *to, unsigned len);
-unsigned copy_from_user(void *to, const void *from, unsigned len);
+unsigned int copy_to_guest_pv(void __user *to, const void *from,
+                              unsigned int len);
+unsigned int clear_guest_pv(void __user *to, unsigned int len);
+unsigned int copy_from_guest_pv(void *to, const void __user *from,
+                                unsigned int len);
 
 /* Handles exceptions in both to and from, but doesn't do access_ok */
 unsigned int copy_to_guest_ll(void __user*to, const void *from, unsigned int n);
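
For illustration, the caller-side contract of the renamed PV-only helper,
as a sketch with an invented structure and helper name (the fault-address
arithmetic mirrors the emul-inv-op.c hunk above):

    /* Illustrative only: copy_from_guest_pv() returns the number of bytes
     * that could not be copied; the destination is zero-padded to the
     * requested size, and the first faulting guest address works out to
     * gva + sizeof(*out) - left. */
    struct demo_frame { unsigned long ip, flags; };

    static int demo_fetch(struct demo_frame *out, unsigned long gva)
    {
        unsigned int left = copy_from_guest_pv(out, (void __user *)gva,
                                               sizeof(*out));

        return left ? -EFAULT : 0;
    }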



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 08/17] x86: move stac()/clac() from {get,put}_unsafe_asm() ...
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (6 preceding siblings ...)
  2021-01-14 15:06 ` [PATCH 07/17] x86: rename copy_{from,to}_user() to copy_{from,to}_guest_pv() Jan Beulich
@ 2021-01-14 15:07 ` Jan Beulich
  2021-01-14 15:07 ` [PATCH 09/17] x86/PV: use get_unsafe() instead of copy_from_unsafe() Jan Beulich
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

... to {get,put}_unsafe_size(). There's no need to have the macros
expanded once per case label in the latter. This also makes the former
well-formed single statements again. No change in generated code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -155,7 +155,6 @@ struct __large_struct { unsigned long bu
  * aliasing issues.
  */
 #define put_unsafe_asm(x, addr, GUARD, err, itype, rtype, ltype, errret) \
-	stac();								\
 	__asm__ __volatile__(						\
 		GUARD(							\
 		"	guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
@@ -170,11 +169,9 @@ struct __large_struct { unsigned long bu
 		: [ret] "+r" (err), [ptr] "=&r" (dummy_)		\
 		  GUARD(, [scr1] "=&r" (dummy_), [scr2] "=&r" (dummy_))	\
 		: [val] ltype (x), "m" (__m(addr)),			\
-		  "[ptr]" (addr), [errno] "i" (errret));		\
-	clac()
+		  "[ptr]" (addr), [errno] "i" (errret))
 
 #define get_unsafe_asm(x, addr, GUARD, err, rtype, ltype, errret)	\
-	stac();								\
 	__asm__ __volatile__(						\
 		GUARD(							\
 		"	guest_access_mask_ptr %[ptr], %[scr1], %[scr2]\n" \
@@ -191,12 +188,12 @@ struct __large_struct { unsigned long bu
 		  [ptr] "=&r" (dummy_)					\
 		  GUARD(, [scr1] "=&r" (dummy_), [scr2] "=&r" (dummy_))	\
 		: "m" (__m(addr)), "[ptr]" (addr),			\
-		  [errno] "i" (errret));				\
-	clac()
+		  [errno] "i" (errret))
 
 #define put_unsafe_size(x, ptr, size, grd, retval, errret)                 \
 do {                                                                       \
     retval = 0;                                                            \
+    stac();                                                                \
     switch ( size )                                                        \
     {                                                                      \
     long dummy_;                                                           \
@@ -214,6 +211,7 @@ do {
         break;                                                             \
     default: __put_user_bad();                                             \
     }                                                                      \
+    clac();                                                                \
 } while ( false )
 
 #define put_guest_size(x, ptr, size, retval, errret) \
@@ -222,6 +220,7 @@ do {
 #define get_unsafe_size(x, ptr, size, grd, retval, errret)                 \
 do {                                                                       \
     retval = 0;                                                            \
+    stac();                                                                \
     switch ( size )                                                        \
     {                                                                      \
     long dummy_;                                                           \
@@ -231,6 +230,7 @@ do {
     case 8: get_unsafe_asm(x, ptr, grd, retval,  "", "=r", errret); break; \
     default: __get_user_bad();                                             \
     }                                                                      \
+    clac();                                                                \
 } while ( false )
 
 #define get_guest_size(x, ptr, size, retval, errret)                       \
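
The "well-formed single statements" remark is about macro hygiene; a
generic illustration, independent of the Xen macros (step_a(), step_b()
and cond are placeholders):

    /* Illustrative only: a macro expanding to two statements misbehaves
     * as the body of an un-braced conditional. */
    #define TWO_STMTS()  step_a(); step_b()
    #define ONE_STMT()   do { step_a(); step_b(); } while ( false )

    static void demo(bool cond)
    {
        if ( cond )
            TWO_STMTS();   /* only step_a() is conditional here */

        if ( cond )
            ONE_STMT();    /* both calls are conditional, as intended */
    }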



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 09/17] x86/PV: use get_unsafe() instead of copy_from_unsafe()
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (7 preceding siblings ...)
  2021-01-14 15:07 ` [PATCH 08/17] x86: move stac()/clac() from {get,put}_unsafe_asm() Jan Beulich
@ 2021-01-14 15:07 ` Jan Beulich
  2021-01-14 15:08 ` [PATCH 10/17] x86/shadow: " Jan Beulich
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

The former expands to a single (memory accessing) insn, which the latter
does not guarantee. Yet we'd prefer to read consistent PTEs rather than
risking a split read racing with an update done elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -41,9 +41,7 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
         return NULL;
 
     /* Find this l1e and its enclosing l1mfn in the linear map. */
-    if ( copy_from_unsafe(&l2e,
-                          &__linear_l2_table[l2_linear_offset(linear)],
-                          sizeof(l2_pgentry_t)) )
+    if ( get_unsafe(l2e, &__linear_l2_table[l2_linear_offset(linear)]) )
         return NULL;
 
     /* Check flags that it will be safe to read the l1e. */
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -22,9 +22,7 @@ static inline l1_pgentry_t guest_get_eff
         toggle_guest_pt(curr);
 
     if ( unlikely(!__addr_ok(linear)) ||
-         copy_from_unsafe(&l1e,
-                          &__linear_l1_table[l1_linear_offset(linear)],
-                          sizeof(l1_pgentry_t)) )
+         get_unsafe(l1e, &__linear_l1_table[l1_linear_offset(linear)]) )
         l1e = l1e_empty();
 
     if ( user_mode )
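
For illustration, the tearing concern in isolation (helper name invented,
accessors as in the hunks above):

    /* Illustrative only: a naturally aligned 8-byte load can't be torn on
     * x86, so get_unsafe() observes either the old or the new PTE value.
     * copy_from_unsafe() carries no single-instruction guarantee, so a
     * racing update elsewhere could in principle yield a mix of halves. */
    static l2_pgentry_t demo_read_l2e(l2_pgentry_t *slot)
    {
        l2_pgentry_t l2e;

        if ( get_unsafe(l2e, slot) )    /* one 8-byte mov, fault-handled */
            l2e = l2e_empty();

        return l2e;
    }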



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 10/17] x86/shadow: use get_unsafe() instead of copy_from_unsafe()
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (8 preceding siblings ...)
  2021-01-14 15:07 ` [PATCH 09/17] x86/PV: use get_unsafe() instead of copy_from_unsafe() Jan Beulich
@ 2021-01-14 15:08 ` Jan Beulich
  2021-01-14 15:08 ` [PATCH 11/17] x86/shadow: polish shadow_write_entries() Jan Beulich
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:08 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

This is the slightly more direct way of getting at what we want, and
better in line with shadow_write_entries()'s use of put_unsafe().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2614,10 +2614,9 @@ static int sh_page_fault(struct vcpu *v,
         {
             shadow_l2e_t sl2e;
             mfn_t gl1mfn;
-            if ( (copy_from_unsafe(&sl2e,
-                                   (sh_linear_l2_table(v)
-                                    + shadow_l2_linear_offset(va)),
-                                   sizeof(sl2e)) != 0)
+            if ( (get_unsafe(sl2e,
+                             (sh_linear_l2_table(v) +
+                              shadow_l2_linear_offset(va))) != 0)
                  || !(shadow_l2e_get_flags(sl2e) & _PAGE_PRESENT)
                  || !mfn_valid(gl1mfn = backpointer(mfn_to_page(
                                   shadow_l2e_get_mfn(sl2e))))
@@ -2633,10 +2632,9 @@ static int sh_page_fault(struct vcpu *v,
 #endif /* SHOPT_OUT_OF_SYNC */
         /* The only reasons for reserved bits to be set in shadow entries
          * are the two "magic" shadow_l1e entries. */
-        if ( likely((copy_from_unsafe(&sl1e,
-                                      (sh_linear_l1_table(v)
-                                       + shadow_l1_linear_offset(va)),
-                                      sizeof(sl1e)) == 0)
+        if ( likely((get_unsafe(sl1e,
+                                (sh_linear_l1_table(v) +
+                                 shadow_l1_linear_offset(va))) == 0)
                     && sh_l1e_is_magic(sl1e)) )
         {
 
@@ -3311,9 +3309,9 @@ static bool sh_invlpg(struct vcpu *v, un
         /* This must still be a copy-from-unsafe because we don't have the
          * paging lock, and the higher-level shadows might disappear
          * under our feet. */
-        if ( copy_from_unsafe(&sl3e, (sh_linear_l3_table(v)
-                                      + shadow_l3_linear_offset(linear)),
-                              sizeof (sl3e)) != 0 )
+        if ( get_unsafe(sl3e,
+                        (sh_linear_l3_table(v) +
+                         shadow_l3_linear_offset(linear))) != 0 )
         {
             perfc_incr(shadow_invlpg_fault);
             return false;
@@ -3332,9 +3330,9 @@ static bool sh_invlpg(struct vcpu *v, un
 
     /* This must still be a copy-from-unsafe because we don't have the shadow
      * lock, and the higher-level shadows might disappear under our feet. */
-    if ( copy_from_unsafe(&sl2e,
-                          sh_linear_l2_table(v) + shadow_l2_linear_offset(linear),
-                          sizeof (sl2e)) != 0 )
+    if ( get_unsafe(sl2e,
+                    (sh_linear_l2_table(v) +
+                     shadow_l2_linear_offset(linear))) != 0 )
     {
         perfc_incr(shadow_invlpg_fault);
         return false;
@@ -3375,10 +3373,9 @@ static bool sh_invlpg(struct vcpu *v, un
              * have the paging lock last time we checked, and the
              * higher-level shadows might have disappeared under our
              * feet. */
-            if ( copy_from_unsafe(&sl2e,
-                                  sh_linear_l2_table(v)
-                                  + shadow_l2_linear_offset(linear),
-                                  sizeof (sl2e)) != 0 )
+            if ( get_unsafe(sl2e,
+                            (sh_linear_l2_table(v) +
+                             shadow_l2_linear_offset(linear))) != 0 )
             {
                 perfc_incr(shadow_invlpg_fault);
                 paging_unlock(d);



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 11/17] x86/shadow: polish shadow_write_entries()
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (9 preceding siblings ...)
  2021-01-14 15:08 ` [PATCH 10/17] x86/shadow: " Jan Beulich
@ 2021-01-14 15:08 ` Jan Beulich
  2021-01-14 15:09 ` [PATCH 12/17] x86/shadow: move shadow_set_l<N>e() to their own source file Jan Beulich
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:08 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

First of all, avoid the initial dummy write: Try to write the actual
new value instead, and start the loop from 1 if this was successful.
Further, drop safe_write_entry() and use write_atomic() instead. This
eliminates the need for the BUILD_BUG_ON() there at the same time.

Then
- use const and unsigned,
- drop a redundant NULL check,
- don't open-code PAGE_OFFSET() and IS_ALIGNED(),
- adjust comment style.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -746,50 +746,50 @@ l1e_propagate_from_guest(struct vcpu *v,
  * functions which ever write (non-zero) data onto a shadow page.
  */
 
-static inline void safe_write_entry(void *dst, void *src)
-/* Copy one PTE safely when processors might be running on the
- * destination pagetable.   This does *not* give safety against
- * concurrent writes (that's what the paging lock is for), just
- * stops the hardware picking up partially written entries. */
-{
-    volatile unsigned long *d = dst;
-    unsigned long *s = src;
-    ASSERT(!((unsigned long) d & (sizeof (shadow_l1e_t) - 1)));
-    /* In 64-bit, sizeof(pte) == sizeof(ulong) == 1 word,
-     * which will be an atomic write, since the entry is aligned. */
-    BUILD_BUG_ON(sizeof (shadow_l1e_t) != sizeof (unsigned long));
-    *d = *s;
-}
-
-
 static inline void
-shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
-/* This function does the actual writes to shadow pages.
+shadow_write_entries(void *d, const void *s, unsigned int entries, mfn_t mfn)
+/*
+ * This function does the actual writes to shadow pages.
  * It must not be called directly, since it doesn't do the bookkeeping
- * that shadow_set_l*e() functions do. */
+ * that shadow_set_l*e() functions do.
+ *
+ * Copy PTEs safely when processors might be running on the
+ * destination pagetable.  This does *not* give safety against
+ * concurrent writes (that's what the paging lock is for), just
+ * stops the hardware picking up partially written entries.
+ */
 {
     shadow_l1e_t *dst = d;
-    shadow_l1e_t *src = s;
+    const shadow_l1e_t *src = s;
     void *map = NULL;
-    int i;
+    unsigned int i = 0;
 
-    /* Because we mirror access rights at all levels in the shadow, an
+    /*
+     * Because we mirror access rights at all levels in the shadow, an
      * l2 (or higher) entry with the RW bit cleared will leave us with
      * no write access through the linear map.
      * We detect that by writing to the shadow with put_unsafe() and
-     * using map_domain_page() to get a writeable mapping if we need to. */
-    if ( put_unsafe(*dst, dst) )
+     * using map_domain_page() to get a writeable mapping if we need to.
+     */
+    if ( put_unsafe(*src, dst) )
     {
         perfc_incr(shadow_linear_map_failed);
         map = map_domain_page(mfn);
-        dst = map + ((unsigned long)dst & (PAGE_SIZE - 1));
+        dst = map + PAGE_OFFSET(dst);
+    }
+    else
+    {
+        ++src;
+        ++dst;
+        i = 1;
     }
 
+    ASSERT(IS_ALIGNED((unsigned long)dst, sizeof(*dst)));
 
-    for ( i = 0; i < entries; i++ )
-        safe_write_entry(dst++, src++);
+    for ( ; i < entries; i++ )
+        write_atomic(&dst++->l1, src++->l1);
 
-    if ( map != NULL ) unmap_domain_page(map);
+    unmap_domain_page(map);
 }
 
 /* type is only used to distinguish grant map pages from ordinary RAM



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 12/17] x86/shadow: move shadow_set_l<N>e() to their own source file
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (10 preceding siblings ...)
  2021-01-14 15:08 ` [PATCH 11/17] x86/shadow: polish shadow_write_entries() Jan Beulich
@ 2021-01-14 15:09 ` Jan Beulich
  2021-01-14 15:09 ` [PATCH 13/17] x86/shadow: don't open-code SHF_* shorthands Jan Beulich
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

The few GUEST_PAGING_LEVELS dependencies (of shadow_set_l2e() only) can
be easily expressed by function parameters; I suppose the extra indirect
call is acceptable for the increasingly little used 32-bit non-PAE case.
This way shadow_set_l[12]e(), each of which compiles to almost 1k of
code, need building just once.

The implication is the need for some "relaxation" in types.h: The
underlying PTE types don't vary anymore (and aren't expected to down the
road), so they as well as some basic helpers can be exposed even in the
new, artificial GUEST_PAGING_LEVELS == 0 case.

Almost pure code movement - exceptions are the conversion of
"#if GUEST_PAGING_LEVELS == 2" to runtime conditionals and style
corrections (including to avoid open-coding mfn_to_maddr() and
PAGE_OFFSET()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
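
The parameterisation described above is the usual way of turning a
compile-time #if into a runtime decision; a Xen-independent sketch with
invented names (the real re-binding is the same-named shadow_set_l2e()
wrapper macro in the multi.c hunk below):

    /* Illustrative only: one shared implementation, built once ... */
    int set_entry(unsigned long *slot, unsigned long val,
                  unsigned long (*second_half)(unsigned long))
    {
        slot[0] = val;
        if ( second_half )              /* only one flavour needs this */
            slot[1] = second_half(val);
        return second_half ? 2 : 1;
    }

    /* ... and each per-configuration source file re-binds the extra
     * argument through a same-named function-like macro: */
    #define set_entry(slot, val) set_entry(slot, val, this_files_second_half)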

--- a/xen/arch/x86/mm/shadow/Makefile
+++ b/xen/arch/x86/mm/shadow/Makefile
@@ -1,5 +1,5 @@
 ifeq ($(CONFIG_SHADOW_PAGING),y)
-obj-y += common.o guest_2.o guest_3.o guest_4.o
+obj-y += common.o guest_2.o guest_3.o guest_4.o set.o
 obj-$(CONFIG_HVM) += hvm.o
 obj-$(CONFIG_PV) += pv.o
 else
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -389,8 +389,13 @@ static inline mfn_t sh_next_page(mfn_t s
     ASSERT(!next->u.sh.head);
     return page_to_mfn(next);
 }
+#else
+# define sh_next_page NULL
 #endif
 
+#define shadow_set_l2e(d, sl2e, new_sl2e, sl2mfn) \
+    shadow_set_l2e(d, sl2e, new_sl2e, sl2mfn, SH_type_fl1_shadow, sh_next_page)
+
 static inline u32
 guest_index(void *ptr)
 {
@@ -740,391 +745,6 @@ l1e_propagate_from_guest(struct vcpu *v,
 }
 
 
-/**************************************************************************/
-/* These functions update shadow entries (and do bookkeeping on the shadow
- * tables they are in).  It is intended that they are the only
- * functions which ever write (non-zero) data onto a shadow page.
- */
-
-static inline void
-shadow_write_entries(void *d, const void *s, unsigned int entries, mfn_t mfn)
-/*
- * This function does the actual writes to shadow pages.
- * It must not be called directly, since it doesn't do the bookkeeping
- * that shadow_set_l*e() functions do.
- *
- * Copy PTEs safely when processors might be running on the
- * destination pagetable.  This does *not* give safety against
- * concurrent writes (that's what the paging lock is for), just
- * stops the hardware picking up partially written entries.
- */
-{
-    shadow_l1e_t *dst = d;
-    const shadow_l1e_t *src = s;
-    void *map = NULL;
-    unsigned int i = 0;
-
-    /*
-     * Because we mirror access rights at all levels in the shadow, an
-     * l2 (or higher) entry with the RW bit cleared will leave us with
-     * no write access through the linear map.
-     * We detect that by writing to the shadow with put_unsafe() and
-     * using map_domain_page() to get a writeable mapping if we need to.
-     */
-    if ( put_unsafe(*src, dst) )
-    {
-        perfc_incr(shadow_linear_map_failed);
-        map = map_domain_page(mfn);
-        dst = map + PAGE_OFFSET(dst);
-    }
-    else
-    {
-        ++src;
-        ++dst;
-        i = 1;
-    }
-
-    ASSERT(IS_ALIGNED((unsigned long)dst, sizeof(*dst)));
-
-    for ( ; i < entries; i++ )
-        write_atomic(&dst++->l1, src++->l1);
-
-    unmap_domain_page(map);
-}
-
-/* type is only used to distinguish grant map pages from ordinary RAM
- * i.e. non-p2m_is_grant() pages are treated as p2m_ram_rw.  */
-static int inline
-shadow_get_page_from_l1e(shadow_l1e_t sl1e, struct domain *d, p2m_type_t type)
-{
-    int res;
-    mfn_t mfn;
-    struct domain *owner;
-
-    ASSERT(!sh_l1e_is_magic(sl1e));
-
-    if ( !shadow_mode_refcounts(d) )
-        return 1;
-
-    res = get_page_from_l1e(sl1e, d, d);
-
-    // If a privileged domain is attempting to install a map of a page it does
-    // not own, we let it succeed anyway.
-    //
-    if ( unlikely(res < 0) &&
-         !shadow_mode_translate(d) &&
-         mfn_valid(mfn = shadow_l1e_get_mfn(sl1e)) &&
-         (owner = page_get_owner(mfn_to_page(mfn))) &&
-         (d != owner) )
-    {
-        res = xsm_priv_mapping(XSM_TARGET, d, owner);
-        if ( !res ) {
-            res = get_page_from_l1e(sl1e, d, owner);
-            SHADOW_PRINTK("privileged domain %d installs map of mfn %"PRI_mfn" "
-                           "which is owned by d%d: %s\n",
-                           d->domain_id, mfn_x(mfn), owner->domain_id,
-                           res >= 0 ? "success" : "failed");
-        }
-    }
-
-    /* Okay, it might still be a grant mapping PTE.  Try it. */
-    if ( unlikely(res < 0) &&
-         (type == p2m_grant_map_rw ||
-          (type == p2m_grant_map_ro &&
-           !(shadow_l1e_get_flags(sl1e) & _PAGE_RW))) )
-    {
-        /* It's a grant mapping.  The grant table implementation will
-           already have checked that we're supposed to have access, so
-           we can just grab a reference directly. */
-        mfn = shadow_l1e_get_mfn(sl1e);
-        if ( mfn_valid(mfn) )
-            res = get_page_from_l1e(sl1e, d, page_get_owner(mfn_to_page(mfn)));
-    }
-
-    if ( unlikely(res < 0) )
-    {
-        perfc_incr(shadow_get_page_fail);
-        SHADOW_PRINTK("failed: l1e=" SH_PRI_pte "\n");
-    }
-
-    return res;
-}
-
-static void inline
-shadow_put_page_from_l1e(shadow_l1e_t sl1e, struct domain *d)
-{
-    if ( !shadow_mode_refcounts(d) )
-        return;
-
-    put_page_from_l1e(sl1e, d);
-}
-
-#if GUEST_PAGING_LEVELS >= 4
-static int shadow_set_l4e(struct domain *d,
-                          shadow_l4e_t *sl4e,
-                          shadow_l4e_t new_sl4e,
-                          mfn_t sl4mfn)
-{
-    int flags = 0;
-    shadow_l4e_t old_sl4e;
-    paddr_t paddr;
-    ASSERT(sl4e != NULL);
-    old_sl4e = *sl4e;
-
-    if ( old_sl4e.l4 == new_sl4e.l4 ) return 0; /* Nothing to do */
-
-    paddr = ((((paddr_t)mfn_x(sl4mfn)) << PAGE_SHIFT)
-             | (((unsigned long)sl4e) & ~PAGE_MASK));
-
-    if ( shadow_l4e_get_flags(new_sl4e) & _PAGE_PRESENT )
-    {
-        /* About to install a new reference */
-        mfn_t sl3mfn = shadow_l4e_get_mfn(new_sl4e);
-
-        if ( !sh_get_ref(d, sl3mfn, paddr) )
-        {
-            domain_crash(d);
-            return SHADOW_SET_ERROR;
-        }
-
-        /* Are we pinning l3 shadows to handle weird Linux behaviour? */
-        if ( sh_type_is_pinnable(d, SH_type_l3_64_shadow) )
-            sh_pin(d, sl3mfn);
-    }
-
-    /* Write the new entry */
-    shadow_write_entries(sl4e, &new_sl4e, 1, sl4mfn);
-    flush_root_pgtbl_domain(d);
-
-    flags |= SHADOW_SET_CHANGED;
-
-    if ( shadow_l4e_get_flags(old_sl4e) & _PAGE_PRESENT )
-    {
-        /* We lost a reference to an old mfn. */
-        mfn_t osl3mfn = shadow_l4e_get_mfn(old_sl4e);
-
-        if ( !mfn_eq(osl3mfn, shadow_l4e_get_mfn(new_sl4e))
-             || !perms_strictly_increased(shadow_l4e_get_flags(old_sl4e),
-                                          shadow_l4e_get_flags(new_sl4e)) )
-        {
-            flags |= SHADOW_SET_FLUSH;
-        }
-        sh_put_ref(d, osl3mfn, paddr);
-    }
-
-    return flags;
-}
-
-static int shadow_set_l3e(struct domain *d,
-                          shadow_l3e_t *sl3e,
-                          shadow_l3e_t new_sl3e,
-                          mfn_t sl3mfn)
-{
-    int flags = 0;
-    shadow_l3e_t old_sl3e;
-    paddr_t paddr;
-    ASSERT(sl3e != NULL);
-    old_sl3e = *sl3e;
-
-    if ( old_sl3e.l3 == new_sl3e.l3 ) return 0; /* Nothing to do */
-
-    paddr = ((((paddr_t)mfn_x(sl3mfn)) << PAGE_SHIFT)
-             | (((unsigned long)sl3e) & ~PAGE_MASK));
-
-    if ( shadow_l3e_get_flags(new_sl3e) & _PAGE_PRESENT )
-    {
-        /* About to install a new reference */
-        if ( !sh_get_ref(d, shadow_l3e_get_mfn(new_sl3e), paddr) )
-        {
-            domain_crash(d);
-            return SHADOW_SET_ERROR;
-        }
-    }
-
-    /* Write the new entry */
-    shadow_write_entries(sl3e, &new_sl3e, 1, sl3mfn);
-    flags |= SHADOW_SET_CHANGED;
-
-    if ( shadow_l3e_get_flags(old_sl3e) & _PAGE_PRESENT )
-    {
-        /* We lost a reference to an old mfn. */
-        mfn_t osl2mfn = shadow_l3e_get_mfn(old_sl3e);
-
-        if ( !mfn_eq(osl2mfn, shadow_l3e_get_mfn(new_sl3e)) ||
-             !perms_strictly_increased(shadow_l3e_get_flags(old_sl3e),
-                                       shadow_l3e_get_flags(new_sl3e)) )
-        {
-            flags |= SHADOW_SET_FLUSH;
-        }
-        sh_put_ref(d, osl2mfn, paddr);
-    }
-    return flags;
-}
-#endif /* GUEST_PAGING_LEVELS >= 4 */
-
-static int shadow_set_l2e(struct domain *d,
-                          shadow_l2e_t *sl2e,
-                          shadow_l2e_t new_sl2e,
-                          mfn_t sl2mfn)
-{
-    int flags = 0;
-    shadow_l2e_t old_sl2e;
-    paddr_t paddr;
-
-#if GUEST_PAGING_LEVELS == 2
-    /* In 2-on-3 we work with pairs of l2es pointing at two-page
-     * shadows.  Reference counting and up-pointers track from the first
-     * page of the shadow to the first l2e, so make sure that we're
-     * working with those:
-     * Start with a pair of identical entries */
-    shadow_l2e_t pair[2] = { new_sl2e, new_sl2e };
-    /* Align the pointer down so it's pointing at the first of the pair */
-    sl2e = (shadow_l2e_t *)((unsigned long)sl2e & ~(sizeof(shadow_l2e_t)));
-#endif
-
-    ASSERT(sl2e != NULL);
-    old_sl2e = *sl2e;
-
-    if ( old_sl2e.l2 == new_sl2e.l2 ) return 0; /* Nothing to do */
-
-    paddr = ((((paddr_t)mfn_x(sl2mfn)) << PAGE_SHIFT)
-             | (((unsigned long)sl2e) & ~PAGE_MASK));
-
-    if ( shadow_l2e_get_flags(new_sl2e) & _PAGE_PRESENT )
-    {
-        mfn_t sl1mfn = shadow_l2e_get_mfn(new_sl2e);
-        ASSERT(mfn_to_page(sl1mfn)->u.sh.head);
-
-        /* About to install a new reference */
-        if ( !sh_get_ref(d, sl1mfn, paddr) )
-        {
-            domain_crash(d);
-            return SHADOW_SET_ERROR;
-        }
-#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
-        {
-            struct page_info *sp = mfn_to_page(sl1mfn);
-            mfn_t gl1mfn;
-
-            ASSERT(sp->u.sh.head);
-            gl1mfn = backpointer(sp);
-            /* If the shadow is a fl1 then the backpointer contains
-               the GFN instead of the GMFN, and it's definitely not
-               OOS. */
-            if ( (sp->u.sh.type != SH_type_fl1_shadow) && mfn_valid(gl1mfn)
-                 && mfn_is_out_of_sync(gl1mfn) )
-                sh_resync(d, gl1mfn);
-        }
-#endif
-#if GUEST_PAGING_LEVELS == 2
-        /* Update the second entry to point tio the second half of the l1 */
-        sl1mfn = sh_next_page(sl1mfn);
-        pair[1] = shadow_l2e_from_mfn(sl1mfn, shadow_l2e_get_flags(new_sl2e));
-#endif
-    }
-
-    /* Write the new entry */
-#if GUEST_PAGING_LEVELS == 2
-    shadow_write_entries(sl2e, &pair, 2, sl2mfn);
-#else /* normal case */
-    shadow_write_entries(sl2e, &new_sl2e, 1, sl2mfn);
-#endif
-    flags |= SHADOW_SET_CHANGED;
-
-    if ( shadow_l2e_get_flags(old_sl2e) & _PAGE_PRESENT )
-    {
-        /* We lost a reference to an old mfn. */
-        mfn_t osl1mfn = shadow_l2e_get_mfn(old_sl2e);
-
-        if ( !mfn_eq(osl1mfn, shadow_l2e_get_mfn(new_sl2e)) ||
-             !perms_strictly_increased(shadow_l2e_get_flags(old_sl2e),
-                                       shadow_l2e_get_flags(new_sl2e)) )
-        {
-            flags |= SHADOW_SET_FLUSH;
-        }
-        sh_put_ref(d, osl1mfn, paddr);
-    }
-    return flags;
-}
-
-static int shadow_set_l1e(struct domain *d,
-                          shadow_l1e_t *sl1e,
-                          shadow_l1e_t new_sl1e,
-                          p2m_type_t new_type,
-                          mfn_t sl1mfn)
-{
-    int flags = 0;
-    shadow_l1e_t old_sl1e;
-    unsigned int old_sl1f;
-#if SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC
-    mfn_t new_gmfn = shadow_l1e_get_mfn(new_sl1e);
-#endif
-    ASSERT(sl1e != NULL);
-
-#if SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC
-    if ( mfn_valid(new_gmfn) && mfn_oos_may_write(new_gmfn)
-         && ((shadow_l1e_get_flags(new_sl1e) & (_PAGE_RW|_PAGE_PRESENT))
-             == (_PAGE_RW|_PAGE_PRESENT)) )
-        oos_fixup_add(d, new_gmfn, sl1mfn, pgentry_ptr_to_slot(sl1e));
-#endif
-
-    old_sl1e = *sl1e;
-
-    if ( old_sl1e.l1 == new_sl1e.l1 ) return 0; /* Nothing to do */
-
-    if ( (shadow_l1e_get_flags(new_sl1e) & _PAGE_PRESENT)
-         && !sh_l1e_is_magic(new_sl1e) )
-    {
-        /* About to install a new reference */
-        if ( shadow_mode_refcounts(d) )
-        {
-#define PAGE_FLIPPABLE (_PAGE_RW | _PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
-            int rc;
-
-            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_SHADOW_L1_GET_REF);
-            switch ( rc = shadow_get_page_from_l1e(new_sl1e, d, new_type) )
-            {
-            default:
-                /* Doesn't look like a pagetable. */
-                flags |= SHADOW_SET_ERROR;
-                new_sl1e = shadow_l1e_empty();
-                break;
-            case PAGE_FLIPPABLE & -PAGE_FLIPPABLE ... PAGE_FLIPPABLE:
-                ASSERT(!(rc & ~PAGE_FLIPPABLE));
-                new_sl1e = shadow_l1e_flip_flags(new_sl1e, rc);
-                /* fall through */
-            case 0:
-                shadow_vram_get_mfn(shadow_l1e_get_mfn(new_sl1e),
-                                    shadow_l1e_get_flags(new_sl1e),
-                                    sl1mfn, sl1e, d);
-                break;
-            }
-#undef PAGE_FLIPPABLE
-        }
-    }
-
-    /* Write the new entry */
-    shadow_write_entries(sl1e, &new_sl1e, 1, sl1mfn);
-    flags |= SHADOW_SET_CHANGED;
-
-    old_sl1f = shadow_l1e_get_flags(old_sl1e);
-    if ( (old_sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(old_sl1e) &&
-         shadow_mode_refcounts(d) )
-    {
-        /* We lost a reference to an old mfn. */
-        /* N.B. Unlike higher-level sets, never need an extra flush
-         * when writing an l1e.  Because it points to the same guest frame
-         * as the guest l1e did, it's the guest's responsibility to
-         * trigger a flush later. */
-        shadow_vram_put_mfn(shadow_l1e_get_mfn(old_sl1e), old_sl1f,
-                            sl1mfn, sl1e, d);
-        shadow_put_page_from_l1e(old_sl1e, d);
-        TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_SHADOW_L1_PUT_REF);
-    }
-    return flags;
-}
-
-
 /**************************************************************************/
 /* Macros to walk pagetables.  These take the shadow of a pagetable and
  * walk every "interesting" entry.  That is, they don't touch Xen mappings,
--- /dev/null
+++ b/xen/arch/x86/mm/shadow/set.c
@@ -0,0 +1,418 @@
+/******************************************************************************
+ * arch/x86/mm/shadow/set.c
+ *
+ * Simple, mostly-synchronous shadow page tables.
+ * Parts of this code are Copyright (c) 2006 by XenSource Inc.
+ * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
+ * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define GUEST_PAGING_LEVELS 0
+
+#include <xen/sched.h>
+#include <xsm/xsm.h>
+#include <asm/shadow.h>
+#include "private.h"
+#include "types.h"
+
+/*
+ * These functions update shadow entries (and do bookkeeping on the shadow
+ * tables they are in).  It is intended that they are the only
+ * functions which ever write (non-zero) data onto a shadow page.
+ */
+
+static inline void
+shadow_write_entries(void *d, const void *s, unsigned int entries, mfn_t mfn)
+/*
+ * This function does the actual writes to shadow pages.
+ * It must not be called directly, since it doesn't do the bookkeeping
+ * that shadow_set_l*e() functions do.
+ *
+ * Copy PTEs safely when processors might be running on the
+ * destination pagetable.  This does *not* give safety against
+ * concurrent writes (that's what the paging lock is for), just
+ * stops the hardware picking up partially written entries.
+ */
+{
+    shadow_l1e_t *dst = d;
+    const shadow_l1e_t *src = s;
+    void *map = NULL;
+    unsigned int i = 0;
+
+    /*
+     * Because we mirror access rights at all levels in the shadow, an
+     * l2 (or higher) entry with the RW bit cleared will leave us with
+     * no write access through the linear map.
+     * We detect that by writing to the shadow with put_unsafe() and
+     * using map_domain_page() to get a writeable mapping if we need to.
+     */
+    if ( put_unsafe(*src, dst) )
+    {
+        perfc_incr(shadow_linear_map_failed);
+        map = map_domain_page(mfn);
+        dst = map + PAGE_OFFSET(dst);
+    }
+    else
+    {
+        ++src;
+        ++dst;
+        i = 1;
+    }
+
+    ASSERT(IS_ALIGNED((unsigned long)dst, sizeof(*dst)));
+
+    for ( ; i < entries; i++ )
+        write_atomic(&dst++->l1, src++->l1);
+
+    unmap_domain_page(map);
+}
+
+/*
+ * "type" is only used to distinguish grant map pages from ordinary RAM
+ * i.e. non-p2m_is_grant() pages are treated as p2m_ram_rw.
+ */
+static int inline
+shadow_get_page_from_l1e(shadow_l1e_t sl1e, struct domain *d, p2m_type_t type)
+{
+    int res;
+    mfn_t mfn;
+    struct domain *owner;
+
+    ASSERT(!sh_l1e_is_magic(sl1e));
+
+    if ( !shadow_mode_refcounts(d) )
+        return 1;
+
+    res = get_page_from_l1e(sl1e, d, d);
+
+    /*
+     * If a privileged domain is attempting to install a map of a page it does
+     * not own, we let it succeed anyway.
+     */
+    if ( unlikely(res < 0) &&
+         !shadow_mode_translate(d) &&
+         mfn_valid(mfn = shadow_l1e_get_mfn(sl1e)) &&
+         (owner = page_get_owner(mfn_to_page(mfn))) &&
+         (d != owner) )
+    {
+        res = xsm_priv_mapping(XSM_TARGET, d, owner);
+        if ( !res )
+        {
+            res = get_page_from_l1e(sl1e, d, owner);
+            SHADOW_PRINTK("privileged %pd installs map of mfn %"PRI_mfn" owned by %pd: %s\n",
+                           d, mfn_x(mfn), owner,
+                           res >= 0 ? "success" : "failed");
+        }
+    }
+
+    /* Okay, it might still be a grant mapping PTE.  Try it. */
+    if ( unlikely(res < 0) &&
+         (type == p2m_grant_map_rw ||
+          (type == p2m_grant_map_ro &&
+           !(shadow_l1e_get_flags(sl1e) & _PAGE_RW))) )
+    {
+        /*
+         * It's a grant mapping.  The grant table implementation will
+         * already have checked that we're supposed to have access, so
+         * we can just grab a reference directly.
+         */
+        mfn = shadow_l1e_get_mfn(sl1e);
+        if ( mfn_valid(mfn) )
+            res = get_page_from_l1e(sl1e, d, page_get_owner(mfn_to_page(mfn)));
+    }
+
+    if ( unlikely(res < 0) )
+    {
+        perfc_incr(shadow_get_page_fail);
+        SHADOW_PRINTK("failed: l1e=" SH_PRI_pte "\n");
+    }
+
+    return res;
+}
+
+int shadow_set_l4e(struct domain *d, shadow_l4e_t *sl4e,
+                   shadow_l4e_t new_sl4e, mfn_t sl4mfn)
+{
+    int flags = 0;
+    shadow_l4e_t old_sl4e;
+    paddr_t paddr;
+
+    ASSERT(sl4e != NULL);
+    old_sl4e = *sl4e;
+
+    if ( old_sl4e.l4 == new_sl4e.l4 ) return 0; /* Nothing to do */
+
+    paddr = mfn_to_maddr(sl4mfn) | PAGE_OFFSET(sl4e);
+
+    if ( shadow_l4e_get_flags(new_sl4e) & _PAGE_PRESENT )
+    {
+        /* About to install a new reference */
+        mfn_t sl3mfn = shadow_l4e_get_mfn(new_sl4e);
+
+        if ( !sh_get_ref(d, sl3mfn, paddr) )
+        {
+            domain_crash(d);
+            return SHADOW_SET_ERROR;
+        }
+
+        /* Are we pinning l3 shadows to handle weird Linux behaviour? */
+        if ( sh_type_is_pinnable(d, SH_type_l3_64_shadow) )
+            sh_pin(d, sl3mfn);
+    }
+
+    /* Write the new entry */
+    shadow_write_entries(sl4e, &new_sl4e, 1, sl4mfn);
+    flush_root_pgtbl_domain(d);
+
+    flags |= SHADOW_SET_CHANGED;
+
+    if ( shadow_l4e_get_flags(old_sl4e) & _PAGE_PRESENT )
+    {
+        /* We lost a reference to an old mfn. */
+        mfn_t osl3mfn = shadow_l4e_get_mfn(old_sl4e);
+
+        if ( !mfn_eq(osl3mfn, shadow_l4e_get_mfn(new_sl4e)) ||
+             !perms_strictly_increased(shadow_l4e_get_flags(old_sl4e),
+                                       shadow_l4e_get_flags(new_sl4e)) )
+            flags |= SHADOW_SET_FLUSH;
+
+        sh_put_ref(d, osl3mfn, paddr);
+    }
+
+    return flags;
+}
+
+int shadow_set_l3e(struct domain *d, shadow_l3e_t *sl3e,
+                   shadow_l3e_t new_sl3e, mfn_t sl3mfn)
+{
+    int flags = 0;
+    shadow_l3e_t old_sl3e;
+    paddr_t paddr;
+
+    ASSERT(sl3e != NULL);
+    old_sl3e = *sl3e;
+
+    if ( old_sl3e.l3 == new_sl3e.l3 ) return 0; /* Nothing to do */
+
+    paddr = mfn_to_maddr(sl3mfn) | PAGE_OFFSET(sl3e);
+
+    if ( shadow_l3e_get_flags(new_sl3e) & _PAGE_PRESENT )
+    {
+        /* About to install a new reference */
+        if ( !sh_get_ref(d, shadow_l3e_get_mfn(new_sl3e), paddr) )
+        {
+            domain_crash(d);
+            return SHADOW_SET_ERROR;
+        }
+    }
+
+    /* Write the new entry */
+    shadow_write_entries(sl3e, &new_sl3e, 1, sl3mfn);
+    flags |= SHADOW_SET_CHANGED;
+
+    if ( shadow_l3e_get_flags(old_sl3e) & _PAGE_PRESENT )
+    {
+        /* We lost a reference to an old mfn. */
+        mfn_t osl2mfn = shadow_l3e_get_mfn(old_sl3e);
+
+        if ( !mfn_eq(osl2mfn, shadow_l3e_get_mfn(new_sl3e)) ||
+             !perms_strictly_increased(shadow_l3e_get_flags(old_sl3e),
+                                       shadow_l3e_get_flags(new_sl3e)) )
+            flags |= SHADOW_SET_FLUSH;
+
+        sh_put_ref(d, osl2mfn, paddr);
+    }
+
+    return flags;
+}
+
+int shadow_set_l2e(struct domain *d, shadow_l2e_t *sl2e,
+                   shadow_l2e_t new_sl2e, mfn_t sl2mfn,
+                   unsigned int type_fl1_shadow,
+                   mfn_t (*next_page)(mfn_t smfn))
+{
+    int flags = 0;
+    shadow_l2e_t old_sl2e;
+    paddr_t paddr;
+    /*
+     * In 2-on-3 we work with pairs of l2es pointing at two-page
+     * shadows.  Reference counting and up-pointers track from the first
+     * page of the shadow to the first l2e, so make sure that we're
+     * working with those:
+     * Start with a pair of identical entries.
+     */
+    shadow_l2e_t pair[2] = { new_sl2e, new_sl2e };
+
+    if ( next_page )
+    {
+        /* Align the pointer down so it's pointing at the first of the pair */
+        sl2e = (shadow_l2e_t *)((unsigned long)sl2e & ~sizeof(shadow_l2e_t));
+    }
+
+    ASSERT(sl2e != NULL);
+    old_sl2e = *sl2e;
+
+    if ( old_sl2e.l2 == new_sl2e.l2 ) return 0; /* Nothing to do */
+
+    paddr = mfn_to_maddr(sl2mfn) | PAGE_OFFSET(sl2e);
+
+    if ( shadow_l2e_get_flags(new_sl2e) & _PAGE_PRESENT )
+    {
+        mfn_t sl1mfn = shadow_l2e_get_mfn(new_sl2e);
+        ASSERT(mfn_to_page(sl1mfn)->u.sh.head);
+
+        /* About to install a new reference */
+        if ( !sh_get_ref(d, sl1mfn, paddr) )
+        {
+            domain_crash(d);
+            return SHADOW_SET_ERROR;
+        }
+
+#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
+        {
+            struct page_info *sp = mfn_to_page(sl1mfn);
+            mfn_t gl1mfn;
+
+            ASSERT(sp->u.sh.head);
+            gl1mfn = backpointer(sp);
+            /*
+             * If the shadow is a fl1 then the backpointer contains the
+             * GFN instead of the GMFN, and it's definitely not OOS.
+             */
+            if ( (sp->u.sh.type != type_fl1_shadow) && mfn_valid(gl1mfn)
+                 && mfn_is_out_of_sync(gl1mfn) )
+                sh_resync(d, gl1mfn);
+        }
+#endif
+
+        if ( next_page )
+        {
+            /* Update the second entry to point to the second half of the l1 */
+            sl1mfn = next_page(sl1mfn);
+            pair[1] = shadow_l2e_from_mfn(sl1mfn,
+                                          shadow_l2e_get_flags(new_sl2e));
+        }
+    }
+
+    /* Write the new entry / entries */
+    shadow_write_entries(sl2e, &pair, !next_page ? 1 : 2, sl2mfn);
+
+    flags |= SHADOW_SET_CHANGED;
+
+    if ( shadow_l2e_get_flags(old_sl2e) & _PAGE_PRESENT )
+    {
+        /* We lost a reference to an old mfn. */
+        mfn_t osl1mfn = shadow_l2e_get_mfn(old_sl2e);
+
+        if ( !mfn_eq(osl1mfn, shadow_l2e_get_mfn(new_sl2e)) ||
+             !perms_strictly_increased(shadow_l2e_get_flags(old_sl2e),
+                                       shadow_l2e_get_flags(new_sl2e)) )
+            flags |= SHADOW_SET_FLUSH;
+
+        sh_put_ref(d, osl1mfn, paddr);
+    }
+
+    return flags;
+}
+
+int shadow_set_l1e(struct domain *d, shadow_l1e_t *sl1e,
+                   shadow_l1e_t new_sl1e, p2m_type_t new_type,
+                   mfn_t sl1mfn)
+{
+    int flags = 0;
+    shadow_l1e_t old_sl1e;
+    unsigned int old_sl1f;
+#if SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC
+    mfn_t new_gmfn = shadow_l1e_get_mfn(new_sl1e);
+#endif
+
+    ASSERT(sl1e != NULL);
+
+#if SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC
+    if ( mfn_valid(new_gmfn) && mfn_oos_may_write(new_gmfn) &&
+         ((shadow_l1e_get_flags(new_sl1e) & (_PAGE_RW | _PAGE_PRESENT)) ==
+          (_PAGE_RW | _PAGE_PRESENT)) )
+        oos_fixup_add(d, new_gmfn, sl1mfn, pgentry_ptr_to_slot(sl1e));
+#endif
+
+    old_sl1e = *sl1e;
+
+    if ( old_sl1e.l1 == new_sl1e.l1 ) return 0; /* Nothing to do */
+
+    if ( (shadow_l1e_get_flags(new_sl1e) & _PAGE_PRESENT) &&
+         !sh_l1e_is_magic(new_sl1e) )
+    {
+        /* About to install a new reference */
+        if ( shadow_mode_refcounts(d) )
+        {
+#define PAGE_FLIPPABLE (_PAGE_RW | _PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
+            int rc;
+
+            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_SHADOW_L1_GET_REF);
+            switch ( rc = shadow_get_page_from_l1e(new_sl1e, d, new_type) )
+            {
+            default:
+                /* Doesn't look like a pagetable. */
+                flags |= SHADOW_SET_ERROR;
+                new_sl1e = shadow_l1e_empty();
+                break;
+            case PAGE_FLIPPABLE & -PAGE_FLIPPABLE ... PAGE_FLIPPABLE:
+                ASSERT(!(rc & ~PAGE_FLIPPABLE));
+                new_sl1e = shadow_l1e_flip_flags(new_sl1e, rc);
+                /* fall through */
+            case 0:
+                shadow_vram_get_mfn(shadow_l1e_get_mfn(new_sl1e),
+                                    shadow_l1e_get_flags(new_sl1e),
+                                    sl1mfn, sl1e, d);
+                break;
+            }
+#undef PAGE_FLIPPABLE
+        }
+    }
+
+    /* Write the new entry */
+    shadow_write_entries(sl1e, &new_sl1e, 1, sl1mfn);
+    flags |= SHADOW_SET_CHANGED;
+
+    old_sl1f = shadow_l1e_get_flags(old_sl1e);
+    if ( (old_sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(old_sl1e) &&
+         shadow_mode_refcounts(d) )
+    {
+        /*
+         * We lost a reference to an old mfn.
+         *
+         * N.B. Unlike higher-level sets, never need an extra flush when
+         * writing an l1e.  Because it points to the same guest frame as the
+         * guest l1e did, it's the guest's responsibility to trigger a flush
+         * later.
+         */
+        shadow_vram_put_mfn(shadow_l1e_get_mfn(old_sl1e), old_sl1f,
+                            sl1mfn, sl1e, d);
+        shadow_put_page_from_l1e(old_sl1e, d);
+        TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_SHADOW_L1_PUT_REF);
+    }
+
+    return flags;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -24,8 +24,8 @@
 
 /* The number of levels in the shadow pagetable is entirely determined
  * by the number of levels in the guest pagetable */
-#if GUEST_PAGING_LEVELS == 4
-#define SHADOW_PAGING_LEVELS 4
+#if GUEST_PAGING_LEVELS != 2
+#define SHADOW_PAGING_LEVELS GUEST_PAGING_LEVELS
 #else
 #define SHADOW_PAGING_LEVELS 3
 #endif
@@ -43,7 +43,7 @@
 #define SHADOW_L1_PAGETABLE_SHIFT        12
 #define SHADOW_L2_PAGETABLE_SHIFT        21
 #define SHADOW_L3_PAGETABLE_SHIFT        30
-#else /* SHADOW_PAGING_LEVELS == 4 */
+#elif SHADOW_PAGING_LEVELS == 4
 #define SHADOW_L1_PAGETABLE_ENTRIES     512
 #define SHADOW_L2_PAGETABLE_ENTRIES     512
 #define SHADOW_L3_PAGETABLE_ENTRIES     512
@@ -58,9 +58,7 @@
 typedef l1_pgentry_t shadow_l1e_t;
 typedef l2_pgentry_t shadow_l2e_t;
 typedef l3_pgentry_t shadow_l3e_t;
-#if SHADOW_PAGING_LEVELS >= 4
 typedef l4_pgentry_t shadow_l4e_t;
-#endif
 
 /* Access functions for them */
 static inline paddr_t shadow_l1e_get_paddr(shadow_l1e_t sl1e)
@@ -69,10 +67,8 @@ static inline paddr_t shadow_l2e_get_pad
 { return l2e_get_paddr(sl2e); }
 static inline paddr_t shadow_l3e_get_paddr(shadow_l3e_t sl3e)
 { return l3e_get_paddr(sl3e); }
-#if SHADOW_PAGING_LEVELS >= 4
 static inline paddr_t shadow_l4e_get_paddr(shadow_l4e_t sl4e)
 { return l4e_get_paddr(sl4e); }
-#endif
 
 static inline mfn_t shadow_l1e_get_mfn(shadow_l1e_t sl1e)
 { return l1e_get_mfn(sl1e); }
@@ -80,10 +76,8 @@ static inline mfn_t shadow_l2e_get_mfn(s
 { return l2e_get_mfn(sl2e); }
 static inline mfn_t shadow_l3e_get_mfn(shadow_l3e_t sl3e)
 { return l3e_get_mfn(sl3e); }
-#if SHADOW_PAGING_LEVELS >= 4
 static inline mfn_t shadow_l4e_get_mfn(shadow_l4e_t sl4e)
 { return l4e_get_mfn(sl4e); }
-#endif
 
 static inline u32 shadow_l1e_get_flags(shadow_l1e_t sl1e)
 { return l1e_get_flags(sl1e); }
@@ -91,10 +85,8 @@ static inline u32 shadow_l2e_get_flags(s
 { return l2e_get_flags(sl2e); }
 static inline u32 shadow_l3e_get_flags(shadow_l3e_t sl3e)
 { return l3e_get_flags(sl3e); }
-#if SHADOW_PAGING_LEVELS >= 4
 static inline u32 shadow_l4e_get_flags(shadow_l4e_t sl4e)
 { return l4e_get_flags(sl4e); }
-#endif
 
 static inline shadow_l1e_t
 shadow_l1e_remove_flags(shadow_l1e_t sl1e, u32 flags)
@@ -109,10 +101,8 @@ static inline shadow_l2e_t shadow_l2e_em
 { return l2e_empty(); }
 static inline shadow_l3e_t shadow_l3e_empty(void)
 { return l3e_empty(); }
-#if SHADOW_PAGING_LEVELS >= 4
 static inline shadow_l4e_t shadow_l4e_empty(void)
 { return l4e_empty(); }
-#endif
 
 static inline shadow_l1e_t shadow_l1e_from_mfn(mfn_t mfn, u32 flags)
 { return l1e_from_mfn(mfn, flags); }
@@ -120,10 +110,8 @@ static inline shadow_l2e_t shadow_l2e_fr
 { return l2e_from_mfn(mfn, flags); }
 static inline shadow_l3e_t shadow_l3e_from_mfn(mfn_t mfn, u32 flags)
 { return l3e_from_mfn(mfn, flags); }
-#if SHADOW_PAGING_LEVELS >= 4
 static inline shadow_l4e_t shadow_l4e_from_mfn(mfn_t mfn, u32 flags)
 { return l4e_from_mfn(mfn, flags); }
-#endif
 
 #define shadow_l1_table_offset(a) l1_table_offset(a)
 #define shadow_l2_table_offset(a) l2_table_offset(a)
@@ -208,7 +196,7 @@ static inline shadow_l4e_t shadow_l4e_fr
 #define SH_type_fl1_shadow SH_type_fl1_pae_shadow
 #define SH_type_l2_shadow  SH_type_l2_pae_shadow
 #define SH_type_l2h_shadow SH_type_l2h_pae_shadow
-#else
+#elif GUEST_PAGING_LEVELS == 4
 #define SH_type_l1_shadow  SH_type_l1_64_shadow
 #define SH_type_fl1_shadow SH_type_fl1_64_shadow
 #define SH_type_l2_shadow  SH_type_l2_64_shadow
@@ -217,6 +205,8 @@ static inline shadow_l4e_t shadow_l4e_fr
 #define SH_type_l4_shadow  SH_type_l4_64_shadow
 #endif
 
+#if GUEST_PAGING_LEVELS
+
 /* macros for dealing with the naming of the internal function names of the
  * shadow code's external entry points.
  */
@@ -262,6 +252,8 @@ static inline shadow_l4e_t shadow_l4e_fr
 #define sh_rm_write_access_from_sl1p INTERNAL_NAME(sh_rm_write_access_from_sl1p)
 #endif
 
+#endif /* GUEST_PAGING_LEVELS */
+
 #if SHADOW_PAGING_LEVELS == 3
 #define MFN_FITS_IN_HVM_CR3(_MFN) !(mfn_x(_MFN) >> 20)
 #endif
@@ -270,6 +262,26 @@ static inline shadow_l4e_t shadow_l4e_fr
 #define SH_PRI_gpte PRI_gpte
 #define SH_PRI_gfn  PRI_gfn
 
+int shadow_set_l1e(struct domain *d, shadow_l1e_t *sl1e,
+                   shadow_l1e_t new_sl1e, p2m_type_t new_type,
+                   mfn_t sl1mfn);
+int shadow_set_l2e(struct domain *d, shadow_l2e_t *sl2e,
+                   shadow_l2e_t new_sl2e, mfn_t sl2mfn,
+                   unsigned int type_fl1_shadow,
+                   mfn_t (*next_page)(mfn_t smfn));
+int shadow_set_l3e(struct domain *d, shadow_l3e_t *sl3e,
+                   shadow_l3e_t new_sl3e, mfn_t sl3mfn);
+int shadow_set_l4e(struct domain *d, shadow_l4e_t *sl4e,
+                   shadow_l4e_t new_sl4e, mfn_t sl4mfn);
+
+static void inline
+shadow_put_page_from_l1e(shadow_l1e_t sl1e, struct domain *d)
+{
+    if ( !shadow_mode_refcounts(d) )
+        return;
+
+    put_page_from_l1e(sl1e, d);
+}
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_FAST_FAULT_PATH)
 /******************************************************************************



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 13/17] x86/shadow: don't open-code SHF_* shorthands
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (11 preceding siblings ...)
  2021-01-14 15:09 ` [PATCH 12/17] x86/shadow: move shadow_set_l<N>e() to their own source file Jan Beulich
@ 2021-01-14 15:09 ` Jan Beulich
  2021-01-14 15:10 ` [PATCH 14/17] x86/shadow: SH_type_l2h_shadow is PV-only Jan Beulich
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

Use SHF_L1_ANY, SHF_32, SHF_PAE, as well as SHF_64, and introduce
SHF_FL1_ANY.

Note that in shadow_audit_tables() this has the effect of no longer
excluding SHF_L2H_64 (I assume the exclusion was a mistake, or else I
don't see why the respective callback table entry isn't NULL).
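
Concretely, the open-coded 4-level mask in shadow_audit_tables() was

    SHF_L1_64 | SHF_FL1_64 | SHF_L2_64 | SHF_L3_64 | SHF_L4_64

while SHF_64 (as defined in private.h) is

    SHF_L1_64 | SHF_FL1_64 | SHF_L2_64 | SHF_L2H_64 | SHF_L3_64 | SHF_L4_64

i.e. the only difference is the inclusion of SHF_L2H_64.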

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1827,14 +1827,7 @@ int sh_remove_write_access(struct domain
         NULL  /* unused  */
     };
 
-    static const unsigned int callback_mask =
-          SHF_L1_32
-        | SHF_FL1_32
-        | SHF_L1_PAE
-        | SHF_FL1_PAE
-        | SHF_L1_64
-        | SHF_FL1_64
-        ;
+    static const unsigned int callback_mask = SHF_L1_ANY | SHF_FL1_ANY;
     struct page_info *pg = mfn_to_page(gmfn);
 #if SHADOW_OPTIMIZATIONS & SHOPT_WRITABLE_HEURISTIC
     struct vcpu *curr = current;
@@ -2057,14 +2050,7 @@ int sh_remove_all_mappings(struct domain
         NULL  /* unused  */
     };
 
-    static const unsigned int callback_mask =
-          SHF_L1_32
-        | SHF_FL1_32
-        | SHF_L1_PAE
-        | SHF_FL1_PAE
-        | SHF_L1_64
-        | SHF_FL1_64
-        ;
+    static const unsigned int callback_mask = SHF_L1_ANY | SHF_FL1_ANY;
 
     perfc_incr(shadow_mappings);
     if ( sh_check_page_has_no_refs(page) )
@@ -3411,11 +3397,9 @@ void shadow_audit_tables(struct vcpu *v)
         /* Audit only the current mode's tables */
         switch ( v->arch.paging.mode->guest_levels )
         {
-        case 2: mask = (SHF_L1_32|SHF_FL1_32|SHF_L2_32); break;
-        case 3: mask = (SHF_L1_PAE|SHF_FL1_PAE|SHF_L2_PAE
-                        |SHF_L2H_PAE); break;
-        case 4: mask = (SHF_L1_64|SHF_FL1_64|SHF_L2_64
-                        |SHF_L3_64|SHF_L4_64); break;
+        case 2: mask = SHF_32; break;
+        case 3: mask = SHF_PAE; break;
+        case 4: mask = SHF_64; break;
         default: BUG();
         }
     }
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -269,6 +269,7 @@ static inline void sh_terminate_list(str
 #define SHF_64  (SHF_L1_64|SHF_FL1_64|SHF_L2_64|SHF_L2H_64|SHF_L3_64|SHF_L4_64)
 
 #define SHF_L1_ANY  (SHF_L1_32|SHF_L1_PAE|SHF_L1_64)
+#define SHF_FL1_ANY (SHF_FL1_32|SHF_FL1_PAE|SHF_FL1_64)
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
 /* Marks a guest L1 page table which is shadowed but not write-protected.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 14/17] x86/shadow: SH_type_l2h_shadow is PV-only
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (12 preceding siblings ...)
  2021-01-14 15:09 ` [PATCH 13/17] x86/shadow: don't open-code SHF_* shorthands Jan Beulich
@ 2021-01-14 15:10 ` Jan Beulich
  2021-01-14 15:10 ` [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow Jan Beulich
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:10 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

..., i.e. being used only with 4 guest paging levels. Drop its L2/PAE
alias and adjust / drop conditionals. Use >= 4 where touching them
anyway, in preparation for 5-level paging.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -334,7 +334,7 @@ static void sh_audit_gw(struct vcpu *v,
         if ( mfn_valid((smfn = get_shadow_status(d, gw->l2mfn,
                                                  SH_type_l2_shadow))) )
             (void) sh_audit_l2_table(v, smfn, INVALID_MFN);
-#if GUEST_PAGING_LEVELS == 3
+#if GUEST_PAGING_LEVELS >= 4 /* 32-bit PV only */
         if ( mfn_valid((smfn = get_shadow_status(d, gw->l2mfn,
                                                  SH_type_l2h_shadow))) )
             (void) sh_audit_l2_table(v, smfn, INVALID_MFN);
@@ -937,7 +937,8 @@ sh_make_shadow(struct vcpu *v, mfn_t gmf
         /* Lower-level shadow, not yet linked form a higher level */
         mfn_to_page(smfn)->up = 0;
 
-#if GUEST_PAGING_LEVELS == 4
+#if GUEST_PAGING_LEVELS >= 4
+
 #if (SHADOW_OPTIMIZATIONS & SHOPT_LINUX_L3_TOPLEVEL)
     if ( shadow_type == SH_type_l4_64_shadow &&
          unlikely(d->arch.paging.shadow.opt_flags & SHOPT_LINUX_L3_TOPLEVEL) )
@@ -969,14 +970,12 @@ sh_make_shadow(struct vcpu *v, mfn_t gmf
         }
     }
 #endif
-#endif
 
     // Create the Xen mappings...
     if ( !shadow_mode_external(d) )
     {
         switch (shadow_type)
         {
-#if GUEST_PAGING_LEVELS == 4
         case SH_type_l4_shadow:
         {
             shadow_l4e_t *l4t = map_domain_page(smfn);
@@ -988,8 +987,7 @@ sh_make_shadow(struct vcpu *v, mfn_t gmf
             unmap_domain_page(l4t);
         }
         break;
-#endif
-#if GUEST_PAGING_LEVELS >= 3
+
         case SH_type_l2h_shadow:
             BUILD_BUG_ON(sizeof(l2_pgentry_t) != sizeof(shadow_l2e_t));
             if ( is_pv_32bit_domain(d) )
@@ -1000,11 +998,12 @@ sh_make_shadow(struct vcpu *v, mfn_t gmf
                 unmap_domain_page(l2t);
             }
             break;
-#endif
         default: /* Do nothing */ break;
         }
     }
 
+#endif /* GUEST_PAGING_LEVELS >= 4 */
+
     shadow_promote(d, gmfn, shadow_type);
     set_shadow_status(d, gmfn, shadow_type, smfn);
 
@@ -1334,7 +1333,7 @@ void sh_destroy_l2_shadow(struct domain
 
     SHADOW_DEBUG(DESTROY_SHADOW, "%"PRI_mfn"\n", mfn_x(smfn));
 
-#if GUEST_PAGING_LEVELS >= 3
+#if GUEST_PAGING_LEVELS >= 4
     ASSERT(t == SH_type_l2_shadow || t == SH_type_l2h_shadow);
 #else
     ASSERT(t == SH_type_l2_shadow);
@@ -1858,7 +1857,7 @@ int
 sh_map_and_validate_gl2he(struct vcpu *v, mfn_t gl2mfn,
                            void *new_gl2p, u32 size)
 {
-#if GUEST_PAGING_LEVELS >= 3
+#if GUEST_PAGING_LEVELS >= 4
     return sh_map_and_validate(v, gl2mfn, new_gl2p, size,
                                 SH_type_l2h_shadow,
                                 shadow_l2_index,
@@ -3359,9 +3358,7 @@ sh_update_cr3(struct vcpu *v, int do_loc
                 gl2gfn = guest_l3e_get_gfn(gl3e[i]);
                 gl2mfn = get_gfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt);
                 if ( p2m_is_ram(p2mt) )
-                    sh_set_toplevel_shadow(v, i, gl2mfn, (i == 3)
-                                           ? SH_type_l2h_shadow
-                                           : SH_type_l2_shadow,
+                    sh_set_toplevel_shadow(v, i, gl2mfn, SH_type_l2_shadow,
                                            sh_make_shadow);
                 else
                     sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
@@ -3663,7 +3660,7 @@ void sh_clear_shadow_entry(struct domain
         (void) shadow_set_l1e(d, ep, shadow_l1e_empty(), p2m_invalid, smfn);
         break;
     case SH_type_l2_shadow:
-#if GUEST_PAGING_LEVELS >= 3
+#if GUEST_PAGING_LEVELS >= 4
     case SH_type_l2h_shadow:
 #endif
         (void) shadow_set_l2e(d, ep, shadow_l2e_empty(), smfn);
@@ -4115,10 +4112,8 @@ int sh_audit_l3_table(struct vcpu *v, mf
             mfn = shadow_l3e_get_mfn(*sl3e);
             gmfn = get_shadow_status(d, get_gfn_query_unlocked(
                                         d, gfn_x(gfn), &p2mt),
-                                     ((GUEST_PAGING_LEVELS == 3 ||
-                                       is_pv_32bit_domain(d))
-                                      && !shadow_mode_external(d)
-                                      && (guest_index(gl3e) % 4) == 3)
+                                     (is_pv_32bit_domain(d) &&
+                                      guest_index(gl3e) == 3)
                                      ? SH_type_l2h_shadow
                                      : SH_type_l2_shadow);
             if ( !mfn_eq(gmfn, mfn) )
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -195,7 +195,6 @@ static inline shadow_l4e_t shadow_l4e_fr
 #define SH_type_l1_shadow  SH_type_l1_pae_shadow
 #define SH_type_fl1_shadow SH_type_fl1_pae_shadow
 #define SH_type_l2_shadow  SH_type_l2_pae_shadow
-#define SH_type_l2h_shadow SH_type_l2h_pae_shadow
 #elif GUEST_PAGING_LEVELS == 4
 #define SH_type_l1_shadow  SH_type_l1_64_shadow
 #define SH_type_fl1_shadow SH_type_fl1_64_shadow



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (13 preceding siblings ...)
  2021-01-14 15:10 ` [PATCH 14/17] x86/shadow: SH_type_l2h_shadow is PV-only Jan Beulich
@ 2021-01-14 15:10 ` Jan Beulich
  2021-01-22 13:11   ` Tim Deegan
  2021-01-14 15:10 ` [PATCH 16/17] x86/shadow: only 4-level guest code needs building when !HVM Jan Beulich
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:10 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

This is a remnant from 32-bit days, having no place anymore where a
shadow of this type would be created.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
hash_{domain,vcpu}_foreach() have a use each of literal 15. It's not
clear to me what the proper replacement constant would be, as it
doesn't look as if SH_type_monitor_table was meant.
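
For reference, the use in question is an assertion of the shape (quoted
from memory, with variable names approximated rather than taken verbatim
from the tree)

    if ( callback_mask & (1u << sp->u.sh.type) )
    {
        ASSERT(sp->u.sh.type <= 15);
        ASSERT(callbacks[sp->u.sh.type] != NULL);
        ...
    }

i.e. purely a guard against overrunning the callback tables handed to
these helpers.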

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -792,9 +792,6 @@ sh_validate_guest_entry(struct vcpu *v,
     if ( page->shadow_flags & SHF_L2_PAE )
         result |= SHADOW_INTERNAL_NAME(sh_map_and_validate_gl2e, 3)
             (v, gmfn, entry, size);
-    if ( page->shadow_flags & SHF_L2H_PAE )
-        result |= SHADOW_INTERNAL_NAME(sh_map_and_validate_gl2he, 3)
-            (v, gmfn, entry, size);
 
     if ( page->shadow_flags & SHF_L1_64 )
         result |= SHADOW_INTERNAL_NAME(sh_map_and_validate_gl1e, 4)
@@ -859,7 +856,6 @@ const u8 sh_type_to_size[] = {
     1, /* SH_type_l1_pae_shadow  */
     1, /* SH_type_fl1_pae_shadow */
     1, /* SH_type_l2_pae_shadow  */
-    1, /* SH_type_l2h_pae_shadow */
     1, /* SH_type_l1_64_shadow   */
     1, /* SH_type_fl1_64_shadow  */
     1, /* SH_type_l2_64_shadow   */
@@ -900,7 +896,6 @@ void shadow_unhook_mappings(struct domai
         SHADOW_INTERNAL_NAME(sh_unhook_32b_mappings, 2)(d, smfn, user_only);
         break;
     case SH_type_l2_pae_shadow:
-    case SH_type_l2h_pae_shadow:
         SHADOW_INTERNAL_NAME(sh_unhook_pae_mappings, 3)(d, smfn, user_only);
         break;
     case SH_type_l4_64_shadow:
@@ -1757,7 +1752,6 @@ void sh_destroy_shadow(struct domain *d,
         SHADOW_INTERNAL_NAME(sh_destroy_l1_shadow, 3)(d, smfn);
         break;
     case SH_type_l2_pae_shadow:
-    case SH_type_l2h_pae_shadow:
         SHADOW_INTERNAL_NAME(sh_destroy_l2_shadow, 3)(d, smfn);
         break;
 
@@ -1816,7 +1810,6 @@ int sh_remove_write_access(struct domain
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 3), /* l1_pae  */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 3), /* fl1_pae */
         NULL, /* l2_pae  */
-        NULL, /* l2h_pae */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 4), /* l1_64   */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 4), /* fl1_64  */
         NULL, /* l2_64   */
@@ -2039,7 +2032,6 @@ int sh_remove_all_mappings(struct domain
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 3), /* l1_pae  */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 3), /* fl1_pae */
         NULL, /* l2_pae  */
-        NULL, /* l2h_pae */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 4), /* l1_64   */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 4), /* fl1_64  */
         NULL, /* l2_64   */
@@ -2136,7 +2128,6 @@ static int sh_remove_shadow_via_pointer(
         break;
     case SH_type_l1_pae_shadow:
     case SH_type_l2_pae_shadow:
-    case SH_type_l2h_pae_shadow:
         SHADOW_INTERNAL_NAME(sh_clear_shadow_entry, 3)(d, vaddr, pmfn);
         break;
     case SH_type_l1_64_shadow:
@@ -2181,7 +2172,6 @@ void sh_remove_shadows(struct domain *d,
         NULL, /* l1_pae  */
         NULL, /* fl1_pae */
         SHADOW_INTERNAL_NAME(sh_remove_l1_shadow, 3), /* l2_pae  */
-        SHADOW_INTERNAL_NAME(sh_remove_l1_shadow, 3), /* l2h_pae */
         NULL, /* l1_64   */
         NULL, /* fl1_64  */
         SHADOW_INTERNAL_NAME(sh_remove_l1_shadow, 4), /* l2_64   */
@@ -2198,10 +2188,9 @@ void sh_remove_shadows(struct domain *d,
         SHF_L2_32, /* l1_32   */
         0, /* fl1_32  */
         0, /* l2_32   */
-        SHF_L2H_PAE | SHF_L2_PAE, /* l1_pae  */
+        SHF_L2_PAE, /* l1_pae  */
         0, /* fl1_pae */
         0, /* l2_pae  */
-        0, /* l2h_pae  */
         SHF_L2H_64 | SHF_L2_64, /* l1_64   */
         0, /* fl1_64  */
         SHF_L3_64, /* l2_64   */
@@ -2261,7 +2250,6 @@ void sh_remove_shadows(struct domain *d,
 
     DO_UNSHADOW(SH_type_l2_32_shadow);
     DO_UNSHADOW(SH_type_l1_32_shadow);
-    DO_UNSHADOW(SH_type_l2h_pae_shadow);
     DO_UNSHADOW(SH_type_l2_pae_shadow);
     DO_UNSHADOW(SH_type_l1_pae_shadow);
     DO_UNSHADOW(SH_type_l4_64_shadow);
@@ -2344,7 +2332,6 @@ void sh_reset_l3_up_pointers(struct vcpu
         NULL, /* l1_pae  */
         NULL, /* fl1_pae */
         NULL, /* l2_pae  */
-        NULL, /* l2h_pae */
         NULL, /* l1_64   */
         NULL, /* fl1_64  */
         NULL, /* l2_64   */
@@ -3368,7 +3355,6 @@ void shadow_audit_tables(struct vcpu *v)
         SHADOW_INTERNAL_NAME(sh_audit_l1_table, 3),  /* l1_pae  */
         SHADOW_INTERNAL_NAME(sh_audit_fl1_table, 3), /* fl1_pae */
         SHADOW_INTERNAL_NAME(sh_audit_l2_table, 3),  /* l2_pae  */
-        SHADOW_INTERNAL_NAME(sh_audit_l2_table, 3),  /* l2h_pae */
         SHADOW_INTERNAL_NAME(sh_audit_l1_table, 4),  /* l1_64   */
         SHADOW_INTERNAL_NAME(sh_audit_fl1_table, 4), /* fl1_64  */
         SHADOW_INTERNAL_NAME(sh_audit_l2_table, 4),  /* l2_64   */
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -568,7 +568,7 @@ static inline void check_for_early_unsha
          && sh_mfn_is_a_page_table(gmfn)
          && (!d->arch.paging.shadow.pagetable_dying_op ||
              !(mfn_to_page(gmfn)->shadow_flags
-               & (SHF_L2_32|SHF_L2_PAE|SHF_L2H_PAE|SHF_L4_64))) )
+               & (SHF_L2_32|SHF_L2_PAE|SHF_L4_64))) )
     {
         perfc_incr(shadow_early_unshadow);
         sh_remove_shadows(d, gmfn, 1, 0 /* Fast, can fail to unshadow */ );
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -837,8 +837,7 @@ do {
     int _i;                                                                \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                        \
     ASSERT(shadow_mode_external(_dom));                                    \
-    ASSERT(mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2_pae_shadow        \
-           || mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2h_pae_shadow);  \
+    ASSERT(mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2_pae_shadow);      \
     for ( _i = 0; _i < SHADOW_L2_PAGETABLE_ENTRIES; _i++ )                 \
     {                                                                      \
         (_sl2e) = _sp + _i;                                                \
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -177,18 +177,17 @@ extern void shadow_audit_tables(struct v
 #define SH_type_l1_pae_shadow  (4U) /* shadowing a pae L1 page */
 #define SH_type_fl1_pae_shadow (5U) /* L1 shadow for pae 2M superpg */
 #define SH_type_l2_pae_shadow  (6U) /* shadowing a pae L2-low page */
-#define SH_type_l2h_pae_shadow (7U) /* shadowing a pae L2-high page */
-#define SH_type_l1_64_shadow   (8U) /* shadowing a 64-bit L1 page */
-#define SH_type_fl1_64_shadow  (9U) /* L1 shadow for 64-bit 2M superpg */
-#define SH_type_l2_64_shadow  (10U) /* shadowing a 64-bit L2 page */
-#define SH_type_l2h_64_shadow (11U) /* shadowing a compat PAE L2 high page */
-#define SH_type_l3_64_shadow  (12U) /* shadowing a 64-bit L3 page */
-#define SH_type_l4_64_shadow  (13U) /* shadowing a 64-bit L4 page */
-#define SH_type_max_shadow    (13U)
-#define SH_type_p2m_table     (14U) /* in use as the p2m table */
-#define SH_type_monitor_table (15U) /* in use as a monitor table */
-#define SH_type_oos_snapshot  (16U) /* in use as OOS snapshot */
-#define SH_type_unused        (17U)
+#define SH_type_l1_64_shadow   (7U) /* shadowing a 64-bit L1 page */
+#define SH_type_fl1_64_shadow  (8U) /* L1 shadow for 64-bit 2M superpg */
+#define SH_type_l2_64_shadow   (9U) /* shadowing a 64-bit L2 page */
+#define SH_type_l2h_64_shadow (10U) /* shadowing a compat PAE L2 high page */
+#define SH_type_l3_64_shadow  (11U) /* shadowing a 64-bit L3 page */
+#define SH_type_l4_64_shadow  (12U) /* shadowing a 64-bit L4 page */
+#define SH_type_max_shadow    (12U)
+#define SH_type_p2m_table     (13U) /* in use as the p2m table */
+#define SH_type_monitor_table (14U) /* in use as a monitor table */
+#define SH_type_oos_snapshot  (15U) /* in use as OOS snapshot */
+#define SH_type_unused        (16U)
 
 /*
  * What counts as a pinnable shadow?
@@ -200,7 +199,6 @@ static inline int sh_type_is_pinnable(st
      * persist even when not currently in use in a guest CR3 */
     if ( t == SH_type_l2_32_shadow
          || t == SH_type_l2_pae_shadow
-         || t == SH_type_l2h_pae_shadow
          || t == SH_type_l4_64_shadow )
         return 1;
 
@@ -256,7 +254,6 @@ static inline void sh_terminate_list(str
 #define SHF_L1_PAE  (1u << SH_type_l1_pae_shadow)
 #define SHF_FL1_PAE (1u << SH_type_fl1_pae_shadow)
 #define SHF_L2_PAE  (1u << SH_type_l2_pae_shadow)
-#define SHF_L2H_PAE (1u << SH_type_l2h_pae_shadow)
 #define SHF_L1_64   (1u << SH_type_l1_64_shadow)
 #define SHF_FL1_64  (1u << SH_type_fl1_64_shadow)
 #define SHF_L2_64   (1u << SH_type_l2_64_shadow)
@@ -265,7 +262,7 @@ static inline void sh_terminate_list(str
 #define SHF_L4_64   (1u << SH_type_l4_64_shadow)
 
 #define SHF_32  (SHF_L1_32|SHF_FL1_32|SHF_L2_32)
-#define SHF_PAE (SHF_L1_PAE|SHF_FL1_PAE|SHF_L2_PAE|SHF_L2H_PAE)
+#define SHF_PAE (SHF_L1_PAE|SHF_FL1_PAE|SHF_L2_PAE)
 #define SHF_64  (SHF_L1_64|SHF_FL1_64|SHF_L2_64|SHF_L2H_64|SHF_L3_64|SHF_L4_64)
 
 #define SHF_L1_ANY  (SHF_L1_32|SHF_L1_PAE|SHF_L1_64)



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 16/17] x86/shadow: only 4-level guest code needs building when !HVM
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (14 preceding siblings ...)
  2021-01-14 15:10 ` [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow Jan Beulich
@ 2021-01-14 15:10 ` Jan Beulich
  2021-01-14 15:11 ` [PATCH 17/17] x86/shadow: adjust is_pv_*() checks Jan Beulich
  2021-01-22 13:18 ` [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Tim Deegan
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:10 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

In order to limit #ifdef-ary, provide "stub" #define-s for
SH_type_{l1,fl1,l2}_{32,pae}_shadow and SHF_{L1,FL1,L2}_{32,PAE}.
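
To illustrate the effect (simplified; this isn't a hunk from the patch):
with SHF_L2_PAE #define-d to 0 in !HVM builds, a test like

    if ( page->shadow_flags & SHF_L2_PAE )
        /* 3-level-only handling */;

is statically dead and gets discarded by the compiler, without needing a
CONFIG_HVM conditional at the use site. The #ifdef-s added below are
therefore largely confined to the dispatch tables, switch() cases, and
type definitions naming the 2-/3-level pieces directly.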

The change in shadow_vcpu_init() is necessary to cover for "x86: correct
is_pv_domain() when !CONFIG_PV" (or any other change along those lines)
- we should only rely on is_hvm_*() to become a build time constant.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/Makefile
+++ b/xen/arch/x86/mm/Makefile
@@ -3,7 +3,7 @@ obj-$(CONFIG_HVM) += hap/
 
 obj-$(CONFIG_HVM) += altp2m.o
 obj-$(CONFIG_HVM) += guest_walk_2.o guest_walk_3.o guest_walk_4.o
-obj-$(CONFIG_SHADOW_PAGING) += guest_walk_2.o guest_walk_3.o guest_walk_4.o
+obj-$(CONFIG_SHADOW_PAGING) += guest_walk_4.o
 obj-$(CONFIG_MEM_ACCESS) += mem_access.o
 obj-y += mem_paging.o
 obj-$(CONFIG_MEM_SHARING) += mem_sharing.o
--- a/xen/arch/x86/mm/shadow/Makefile
+++ b/xen/arch/x86/mm/shadow/Makefile
@@ -1,7 +1,7 @@
 ifeq ($(CONFIG_SHADOW_PAGING),y)
-obj-y += common.o guest_2.o guest_3.o guest_4.o set.o
-obj-$(CONFIG_HVM) += hvm.o
-obj-$(CONFIG_PV) += pv.o
+obj-y += common.o set.o
+obj-$(CONFIG_HVM) += hvm.o guest_2.o guest_3.o guest_4.o
+obj-$(CONFIG_PV) += pv.o guest_4.o
 else
 obj-y += none.o
 endif
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -90,9 +90,9 @@ void shadow_vcpu_init(struct vcpu *v)
     }
 #endif
 
-    v->arch.paging.mode = is_pv_vcpu(v) ?
-                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 4) :
-                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 3);
+    v->arch.paging.mode = is_hvm_vcpu(v) ?
+                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 3) :
+                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);
 }
 
 #if SHADOW_AUDIT
@@ -257,6 +257,7 @@ static int sh_remove_write_access_from_s
 
     switch ( mfn_to_page(smfn)->u.sh.type )
     {
+#ifdef CONFIG_HVM
     case SH_type_l1_32_shadow:
     case SH_type_fl1_32_shadow:
         return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
@@ -266,6 +267,7 @@ static int sh_remove_write_access_from_s
     case SH_type_fl1_pae_shadow:
         return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
             (d, gmfn, smfn, off);
+#endif
 
     case SH_type_l1_64_shadow:
     case SH_type_fl1_64_shadow:
@@ -848,6 +850,7 @@ sh_validate_guest_entry(struct vcpu *v,
  * the free pool.
  */
 
+#ifdef CONFIG_HVM
 const u8 sh_type_to_size[] = {
     1, /* SH_type_none           */
     2, /* SH_type_l1_32_shadow   */
@@ -866,6 +869,7 @@ const u8 sh_type_to_size[] = {
     1, /* SH_type_monitor_table  */
     1  /* SH_type_oos_snapshot   */
 };
+#endif
 
 /*
  * Figure out the least acceptable quantity of shadow memory.
@@ -892,12 +896,14 @@ void shadow_unhook_mappings(struct domai
     struct page_info *sp = mfn_to_page(smfn);
     switch ( sp->u.sh.type )
     {
+#ifdef CONFIG_HVM
     case SH_type_l2_32_shadow:
         SHADOW_INTERNAL_NAME(sh_unhook_32b_mappings, 2)(d, smfn, user_only);
         break;
     case SH_type_l2_pae_shadow:
         SHADOW_INTERNAL_NAME(sh_unhook_pae_mappings, 3)(d, smfn, user_only);
         break;
+#endif
     case SH_type_l4_64_shadow:
         SHADOW_INTERNAL_NAME(sh_unhook_64b_mappings, 4)(d, smfn, user_only);
         break;
@@ -1104,8 +1110,10 @@ mfn_t shadow_alloc(struct domain *d,
     /* Backpointers that are MFNs need to be packed into PDXs (PFNs don't) */
     switch (shadow_type)
     {
+#ifdef CONFIG_HVM
     case SH_type_fl1_32_shadow:
     case SH_type_fl1_pae_shadow:
+#endif
     case SH_type_fl1_64_shadow:
         break;
     default:
@@ -1739,6 +1747,7 @@ void sh_destroy_shadow(struct domain *d,
      * small numbers that the compiler will enjoy */
     switch ( t )
     {
+#ifdef CONFIG_HVM
     case SH_type_l1_32_shadow:
     case SH_type_fl1_32_shadow:
         SHADOW_INTERNAL_NAME(sh_destroy_l1_shadow, 2)(d, smfn);
@@ -1754,6 +1763,7 @@ void sh_destroy_shadow(struct domain *d,
     case SH_type_l2_pae_shadow:
         SHADOW_INTERNAL_NAME(sh_destroy_l2_shadow, 3)(d, smfn);
         break;
+#endif
 
     case SH_type_l1_64_shadow:
     case SH_type_fl1_64_shadow:
@@ -1804,12 +1814,14 @@ int sh_remove_write_access(struct domain
     /* Dispatch table for getting per-type functions */
     static const hash_domain_callback_t callbacks[SH_type_unused] = {
         NULL, /* none    */
+#ifdef CONFIG_HVM
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 2), /* l1_32   */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 2), /* fl1_32  */
         NULL, /* l2_32   */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 3), /* l1_pae  */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 3), /* fl1_pae */
         NULL, /* l2_pae  */
+#endif
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 4), /* l1_64   */
         SHADOW_INTERNAL_NAME(sh_rm_write_access_from_l1, 4), /* fl1_64  */
         NULL, /* l2_64   */
@@ -2026,12 +2038,14 @@ int sh_remove_all_mappings(struct domain
     /* Dispatch table for getting per-type functions */
     static const hash_domain_callback_t callbacks[SH_type_unused] = {
         NULL, /* none    */
+#ifdef CONFIG_HVM
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 2), /* l1_32   */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 2), /* fl1_32  */
         NULL, /* l2_32   */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 3), /* l1_pae  */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 3), /* fl1_pae */
         NULL, /* l2_pae  */
+#endif
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 4), /* l1_64   */
         SHADOW_INTERNAL_NAME(sh_rm_mappings_from_l1, 4), /* fl1_64  */
         NULL, /* l2_64   */
@@ -2122,6 +2136,7 @@ static int sh_remove_shadow_via_pointer(
     /* Blank the offending entry */
     switch (sp->u.sh.type)
     {
+#ifdef CONFIG_HVM
     case SH_type_l1_32_shadow:
     case SH_type_l2_32_shadow:
         SHADOW_INTERNAL_NAME(sh_clear_shadow_entry, 2)(d, vaddr, pmfn);
@@ -2130,6 +2145,7 @@ static int sh_remove_shadow_via_pointer(
     case SH_type_l2_pae_shadow:
         SHADOW_INTERNAL_NAME(sh_clear_shadow_entry, 3)(d, vaddr, pmfn);
         break;
+#endif
     case SH_type_l1_64_shadow:
     case SH_type_l2_64_shadow:
     case SH_type_l2h_64_shadow:
@@ -2166,12 +2182,14 @@ void sh_remove_shadows(struct domain *d,
      * be called with the function to remove a lower-level shadow. */
     static const hash_domain_callback_t callbacks[SH_type_unused] = {
         NULL, /* none    */
+#ifdef CONFIG_HVM
         NULL, /* l1_32   */
         NULL, /* fl1_32  */
         SHADOW_INTERNAL_NAME(sh_remove_l1_shadow, 2), /* l2_32   */
         NULL, /* l1_pae  */
         NULL, /* fl1_pae */
         SHADOW_INTERNAL_NAME(sh_remove_l1_shadow, 3), /* l2_pae  */
+#endif
         NULL, /* l1_64   */
         NULL, /* fl1_64  */
         SHADOW_INTERNAL_NAME(sh_remove_l1_shadow, 4), /* l2_64   */
@@ -2185,12 +2203,14 @@ void sh_remove_shadows(struct domain *d,
     /* Another lookup table, for choosing which mask to use */
     static const unsigned int masks[SH_type_unused] = {
         0, /* none    */
+#ifdef CONFIG_HVM
         SHF_L2_32, /* l1_32   */
         0, /* fl1_32  */
         0, /* l2_32   */
         SHF_L2_PAE, /* l1_pae  */
         0, /* fl1_pae */
         0, /* l2_pae  */
+#endif
         SHF_L2H_64 | SHF_L2_64, /* l1_64   */
         0, /* fl1_64  */
         SHF_L3_64, /* l2_64   */
@@ -2326,12 +2346,14 @@ void sh_reset_l3_up_pointers(struct vcpu
 {
     static const hash_vcpu_callback_t callbacks[SH_type_unused] = {
         NULL, /* none    */
+#ifdef CONFIG_HVM
         NULL, /* l1_32   */
         NULL, /* fl1_32  */
         NULL, /* l2_32   */
         NULL, /* l1_pae  */
         NULL, /* fl1_pae */
         NULL, /* l2_pae  */
+#endif
         NULL, /* l1_64   */
         NULL, /* fl1_64  */
         NULL, /* l2_64   */
@@ -3349,12 +3371,14 @@ void shadow_audit_tables(struct vcpu *v)
     static const hash_vcpu_callback_t callbacks[SH_type_unused] = {
         NULL, /* none    */
 #if SHADOW_AUDIT & (SHADOW_AUDIT_ENTRIES | SHADOW_AUDIT_ENTRIES_FULL)
+# ifdef CONFIG_HVM
         SHADOW_INTERNAL_NAME(sh_audit_l1_table, 2),  /* l1_32   */
         SHADOW_INTERNAL_NAME(sh_audit_fl1_table, 2), /* fl1_32  */
         SHADOW_INTERNAL_NAME(sh_audit_l2_table, 2),  /* l2_32   */
         SHADOW_INTERNAL_NAME(sh_audit_l1_table, 3),  /* l1_pae  */
         SHADOW_INTERNAL_NAME(sh_audit_fl1_table, 3), /* fl1_pae */
         SHADOW_INTERNAL_NAME(sh_audit_l2_table, 3),  /* l2_pae  */
+# endif
         SHADOW_INTERNAL_NAME(sh_audit_l1_table, 4),  /* l1_64   */
         SHADOW_INTERNAL_NAME(sh_audit_fl1_table, 4), /* fl1_64  */
         SHADOW_INTERNAL_NAME(sh_audit_l2_table, 4),  /* l2_64   */
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -171,6 +171,7 @@ extern void shadow_audit_tables(struct v
 /* Shadow type codes */
 #define SH_type_none           (0U) /* on the shadow free list */
 #define SH_type_min_shadow     (1U)
+#ifdef CONFIG_HVM
 #define SH_type_l1_32_shadow   (1U) /* shadowing a 32-bit L1 guest page */
 #define SH_type_fl1_32_shadow  (2U) /* L1 shadow for a 32b 4M superpage */
 #define SH_type_l2_32_shadow   (3U) /* shadowing a 32-bit L2 guest page */
@@ -188,6 +189,25 @@ extern void shadow_audit_tables(struct v
 #define SH_type_monitor_table (14U) /* in use as a monitor table */
 #define SH_type_oos_snapshot  (15U) /* in use as OOS snapshot */
 #define SH_type_unused        (16U)
+#else
+#define SH_type_l1_32_shadow   SH_type_unused
+#define SH_type_fl1_32_shadow  SH_type_unused
+#define SH_type_l2_32_shadow   SH_type_unused
+#define SH_type_l1_pae_shadow  SH_type_unused
+#define SH_type_fl1_pae_shadow SH_type_unused
+#define SH_type_l2_pae_shadow  SH_type_unused
+#define SH_type_l1_64_shadow   1U /* shadowing a 64-bit L1 page */
+#define SH_type_fl1_64_shadow  2U /* L1 shadow for 64-bit 2M superpg */
+#define SH_type_l2_64_shadow   3U /* shadowing a 64-bit L2 page */
+#define SH_type_l2h_64_shadow  4U /* shadowing a compat PAE L2 high page */
+#define SH_type_l3_64_shadow   5U /* shadowing a 64-bit L3 page */
+#define SH_type_l4_64_shadow   6U /* shadowing a 64-bit L4 page */
+#define SH_type_max_shadow     6U
+#define SH_type_p2m_table      7U /* in use as the p2m table */
+#define SH_type_monitor_table  8U /* in use as a monitor table */
+#define SH_type_oos_snapshot   9U /* in use as OOS snapshot */
+#define SH_type_unused        10U
+#endif
 
 /*
  * What counts as a pinnable shadow?
@@ -248,12 +268,21 @@ static inline void sh_terminate_list(str
     (((1u << (SH_type_max_shadow + 1u)) - 1u) - \
      ((1u << SH_type_min_shadow) - 1u))
 
+#ifdef CONFIG_HVM
 #define SHF_L1_32   (1u << SH_type_l1_32_shadow)
 #define SHF_FL1_32  (1u << SH_type_fl1_32_shadow)
 #define SHF_L2_32   (1u << SH_type_l2_32_shadow)
 #define SHF_L1_PAE  (1u << SH_type_l1_pae_shadow)
 #define SHF_FL1_PAE (1u << SH_type_fl1_pae_shadow)
 #define SHF_L2_PAE  (1u << SH_type_l2_pae_shadow)
+#else
+#define SHF_L1_32   0
+#define SHF_FL1_32  0
+#define SHF_L2_32   0
+#define SHF_L1_PAE  0
+#define SHF_FL1_PAE 0
+#define SHF_L2_PAE  0
+#endif
 #define SHF_L1_64   (1u << SH_type_l1_64_shadow)
 #define SHF_FL1_64  (1u << SH_type_fl1_64_shadow)
 #define SHF_L2_64   (1u << SH_type_l2_64_shadow)
@@ -329,8 +358,13 @@ extern const u8 sh_type_to_size[SH_type_
 static inline unsigned int
 shadow_size(unsigned int shadow_type)
 {
+#ifdef CONFIG_HVM
     ASSERT(shadow_type < ARRAY_SIZE(sh_type_to_size));
     return sh_type_to_size[shadow_type];
+#else
+    ASSERT(shadow_type < SH_type_unused);
+    return 1;
+#endif
 }
 
 /******************************************************************************
@@ -488,8 +522,10 @@ static inline unsigned long __backpointe
 {
     switch (sp->u.sh.type)
     {
+#ifdef CONFIG_HVM
     case SH_type_fl1_32_shadow:
     case SH_type_fl1_pae_shadow:
+#endif
     case SH_type_fl1_64_shadow:
         return sp->v.sh.back;
     }



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 17/17] x86/shadow: adjust is_pv_*() checks
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (15 preceding siblings ...)
  2021-01-14 15:10 ` [PATCH 16/17] x86/shadow: only 4-level guest code needs building when !HVM Jan Beulich
@ 2021-01-14 15:11 ` Jan Beulich
  2021-01-22 13:18 ` [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Tim Deegan
  17 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-14 15:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan, George Dunlap

To cover for "x86: correct is_pv_domain() when !CONFIG_PV" (or any other
change along those lines) we should prefer is_hvm_*(), as it may become
a build time constant while is_pv_*() generally won't.
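
For illustration (these aren't the real definitions - only the shape
matters, and the helper name is made up):

    #ifdef CONFIG_HVM
    # define is_hvm_domain(d) runtime_check_hvm(d)  /* runtime check */
    #else
    # define is_hvm_domain(d) false                 /* build-time constant */
    #endif

With that, e.g. the

    cpl = is_hvm_domain(d) ? hvm_get_cpl(v) : (regs->ss & 3);

adjustment below lets the compiler drop the hvm_get_cpl() arm entirely in
!HVM builds, whereas the is_pv_vcpu() form being replaced can't be relied
upon to fold the same way once is_pv_*() gets corrected for !CONFIG_PV.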

Also when a domain pointer is in scope, prefer is_*_domain() over
is_*_vcpu().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -674,7 +674,7 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
     if ( pg->shadow_flags &
          ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
          || sh_page_has_multiple_shadows(pg)
-         || is_pv_vcpu(v)
+         || !is_hvm_vcpu(v)
          || !v->domain->arch.paging.shadow.oos_active )
         return 0;
 
@@ -2419,7 +2419,7 @@ static void sh_update_paging_modes(struc
         v->arch.paging.mode->shadow.detach_old_tables(v);
 
 #ifdef CONFIG_HVM
-    if ( !is_pv_domain(d) )
+    if ( is_hvm_domain(d) )
     {
         const struct paging_mode *old_mode = v->arch.paging.mode;
 
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -666,8 +666,8 @@ _sh_propagate(struct vcpu *v,
     // PV guests in 64-bit mode use two different page tables for user vs
     // supervisor permissions, making the guest's _PAGE_USER bit irrelevant.
     // It is always shadowed as present...
-    if ( (GUEST_PAGING_LEVELS == 4) && !is_pv_32bit_domain(d)
-         && is_pv_domain(d) )
+    if ( (GUEST_PAGING_LEVELS == 4) && !is_hvm_domain(d) &&
+         !is_pv_32bit_domain(d) )
     {
         sflags |= _PAGE_USER;
     }
@@ -1119,7 +1119,7 @@ static shadow_l2e_t * shadow_get_and_cre
         unsigned int t = SH_type_l2_shadow;
 
         /* Tag compat L2 containing hypervisor (m2p) mappings */
-        if ( is_pv_32bit_vcpu(v) &&
+        if ( is_pv_32bit_domain(d) &&
              guest_l4_table_offset(gw->va) == 0 &&
              guest_l3_table_offset(gw->va) == 3 )
             t = SH_type_l2h_shadow;
@@ -2313,7 +2313,7 @@ static int sh_page_fault(struct vcpu *v,
         return 0;
     }
 
-    cpl = is_pv_vcpu(v) ? (regs->ss & 3) : hvm_get_cpl(v);
+    cpl = is_hvm_domain(d) ? hvm_get_cpl(v) : (regs->ss & 3);
 
  rewalk:
 
@@ -3236,7 +3236,7 @@ sh_update_cr3(struct vcpu *v, int do_loc
 #endif
 
     /* Don't do anything on an uninitialised vcpu */
-    if ( is_pv_domain(d) && !v->is_initialised )
+    if ( !is_hvm_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
         return;
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -27,7 +27,7 @@ int shadow_domain_init(struct domain *d)
     };
 
     paging_log_dirty_init(d, &sh_none_ops);
-    return is_pv_domain(d) ? 0 : -EOPNOTSUPP;
+    return is_hvm_domain(d) ? -EOPNOTSUPP : 0;
 }
 
 static int _page_fault(struct vcpu *v, unsigned long va,



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow
  2021-01-14 15:10 ` [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow Jan Beulich
@ 2021-01-22 13:11   ` Tim Deegan
  2021-01-22 16:31     ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Tim Deegan @ 2021-01-22 13:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

Hi,

At 16:10 +0100 on 14 Jan (1610640627), Jan Beulich wrote:
> This is a remnant from 32-bit days, having no place anymore where a
> shadow of this type would be created.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> hash_{domain,vcpu}_foreach() have a use each of literal 15. It's not
> clear to me what the proper replacement constant would be, as it
> doesn't look as if SH_type_monitor_table was meant.

Good spot.  I think the '<= 15' should be replaced with '< SH_type_unused'.
It originally matched the callback arrays all being coded as
"static hash_callback_t callbacks[16]".

Cheers,

Tim


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus ...
  2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
                   ` (16 preceding siblings ...)
  2021-01-14 15:11 ` [PATCH 17/17] x86/shadow: adjust is_pv_*() checks Jan Beulich
@ 2021-01-22 13:18 ` Tim Deegan
  17 siblings, 0 replies; 46+ messages in thread
From: Tim Deegan @ 2021-01-22 13:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

At 16:01 +0100 on 14 Jan (1610640090), Jan Beulich wrote:
> ... shadow adjustments towards not building 2- and 3-level code
> when !HVM. While the latter isn't functionally related to the
> former, it depends on some of the rearrangements done there.

The shadow changes look good, thank you!
Reviewed-by: Tim Deegan <tim@xen.org>

I have read the uaccess stuff in passing and it looks OK to me too,
but I didn't review it in detail.

Cheers,

Tim.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow
  2021-01-22 13:11   ` Tim Deegan
@ 2021-01-22 16:31     ` Jan Beulich
  2021-01-22 20:02       ` Tim Deegan
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-01-22 16:31 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

On 22.01.2021 14:11, Tim Deegan wrote:
> At 16:10 +0100 on 14 Jan (1610640627), Jan Beulich wrote:
>> This is a remnant from 32-bit days, having no place anymore where a
>> shadow of this type would be created.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> hash_{domain,vcpu}_foreach() have a use each of literal 15. It's not
>> clear to me what the proper replacement constant would be, as it
>> doesn't look as if SH_type_monitor_table was meant.
> 
> Good spot.  I think the '<= 15' should be replaced with '< SH_type_unused'.
> It originally matched the callback arrays all being coded as
> "static hash_callback_t callbacks[16]".

So are you saying this was off by one then prior to this patch
(when SH_type_unused was still 17), albeit in an apparently
benign way? And the comments in sh_remove_write_access(),
sh_remove_all_mappings(), sh_remove_shadows(), and
sh_reset_l3_up_pointers() are then wrong as well, and would
instead better be like in shadow_audit_tables()?

Because of this having been benign (due to none of the callback
tables specifying non-NULL entries there), wouldn't it make
sense to dimension the tables by SH_type_max_shadow + 1 only?
Or would you consider this too risky?

In any event I guess I'll make the patch addressing this (one
way or the other) a prereq to everything here.

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow
  2021-01-22 16:31     ` Jan Beulich
@ 2021-01-22 20:02       ` Tim Deegan
  2021-01-25 11:09         ` Jan Beulich
  2021-01-25 11:33         ` Jan Beulich
  0 siblings, 2 replies; 46+ messages in thread
From: Tim Deegan @ 2021-01-22 20:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

Hi,

At 17:31 +0100 on 22 Jan (1611336662), Jan Beulich wrote:
> On 22.01.2021 14:11, Tim Deegan wrote:
> > At 16:10 +0100 on 14 Jan (1610640627), Jan Beulich wrote:
> >> hash_{domain,vcpu}_foreach() have a use each of literal 15. It's not
> >> clear to me what the proper replacement constant would be, as it
> >> doesn't look as if SH_type_monitor_table was meant.
> > 
> > Good spot.  I think the '<= 15' should be replaced with '< SH_type_unused'.
> > It originally matched the callback arrays all being coded as
> > "static hash_callback_t callbacks[16]".
> 
> So are you saying this was off by one then prior to this patch
> (when SH_type_unused was still 17), albeit in an apparently
> benign way?

Yes - this assertion is just to catch overruns of the callback table,
and so it was OK for its limit to be too low.  The new types that were
added since then are never put in the hash table, so don't trigger
this assertion.

> And the comments in sh_remove_write_access(),
> sh_remove_all_mappings(), sh_remove_shadows(), and
> sh_reset_l3_up_pointers() are then wrong as well, and would
> instead better be like in shadow_audit_tables()?

Yes, it looks like those comments are also out of date where they
mention 'unused'.

> Because of this having been benign (due to none of the callback
> tables specifying non-NULL entries there), wouldn't it make
> sense to dimension the tables by SH_type_max_shadow + 1 only?
> Or would you consider this too risky?

Yes, I think that would be fine, also changing '<= 15' to
'<= SH_type_max_shadow'.  Maybe add a matching
ASSERT(t <= SH_type_max_shadow) in shadow_hash_insert as well?

Cheers,

Tim.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow
  2021-01-22 20:02       ` Tim Deegan
@ 2021-01-25 11:09         ` Jan Beulich
  2021-01-25 11:33         ` Jan Beulich
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-25 11:09 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

On 22.01.2021 21:02, Tim Deegan wrote:
> At 17:31 +0100 on 22 Jan (1611336662), Jan Beulich wrote:
>> On 22.01.2021 14:11, Tim Deegan wrote:
>>> At 16:10 +0100 on 14 Jan (1610640627), Jan Beulich wrote:
>>>> hash_{domain,vcpu}_foreach() have a use each of literal 15. It's not
>>>> clear to me what the proper replacement constant would be, as it
>>>> doesn't look as if SH_type_monitor_table was meant.
>>>
>>> Good spot.  I think the '<= 15' should be replaced with '< SH_type_unused'.
>>> It originally matched the callback arrays all being coded as
>>> "static hash_callback_t callbacks[16]".
>>
>> So are you saying this was off by one then prior to this patch
>> (when SH_type_unused was still 17), albeit in apparently a
>> benign way?
> 
> Yes - this assertion is just to catch overruns of the callback table,
> and so it was OK for its limit to be too low.  The new types that were
> added since then are never put in the hash table, so don't trigger
> this assertion.
> 
>> And the comments in sh_remove_write_access(),
>> sh_remove_all_mappings(), sh_remove_shadows(), and
>> sh_reset_l3_up_pointers() are then wrong as well, and would
>> instead better be like in shadow_audit_tables()?
> 
> Yes, it looks like those comments are also out of date where they
> mention 'unused'.

For this, which likely will end up being part of ...

>> Because of this having been benign (due to none of the callback
>> tables specifying non-NULL entries there), wouldn't it make
>> sense to dimension the tables by SH_type_max_shadow + 1 only?
>> Or would you consider this too risky?
> 
> Yes, I think that would be fine, also changing '<= 15' to
> '<= SH_type_max_shadow'.  Maybe add a matching
> ASSERT(t <= SH_type_max_shadow) in shadow_hash_insert as well?

... this, I'll send a patch for 4.16 going beyond the more
immediate one sent, which I'll ask Ian to consider for 4.15
(assuming of course you consider it okay in the first place).

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow
  2021-01-22 20:02       ` Tim Deegan
  2021-01-25 11:09         ` Jan Beulich
@ 2021-01-25 11:33         ` Jan Beulich
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-01-25 11:33 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

On 22.01.2021 21:02, Tim Deegan wrote:
> At 17:31 +0100 on 22 Jan (1611336662), Jan Beulich wrote:
>> Because of this having been benign (due to none of the callback
>> tables specifying non-NULL entries there), wouldn't it make
>> sense to dimension the tables by SH_type_max_shadow + 1 only?
>> Or would you consider this too risky?
> 
> Yes, I think that would be fine, also changing '<= 15' to
> '<= SH_type_max_shadow'.  Maybe add a matching
> ASSERT(t <= SH_type_max_shadow) in shadow_hash_insert as well?

The latter (also for shadow_hash_delete()) would seem kind
of orthogonal to me, but for now I've put it in there.

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-01-14 15:04 ` [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants Jan Beulich
@ 2021-02-05 15:43   ` Roger Pau Monné
  2021-02-05 16:13     ` Jan Beulich
  2021-02-09 14:55   ` Roger Pau Monné
  1 sibling, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-05 15:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
> The "guest" variants are intended to work with (potentially) fully guest
> controlled addresses, while the "unsafe" variants are not.

Just to clarify, both work against user addresses, but guest variants
need to be more careful because the guest provided address can also be
modified?

I'm trying to understand the difference between "fully guest
controlled" and "guest controlled".

Maybe an example would help clarify.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-05 15:43   ` Roger Pau Monné
@ 2021-02-05 16:13     ` Jan Beulich
  2021-02-05 16:18       ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-05 16:13 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 05.02.2021 16:43, Roger Pau Monné wrote:
> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
>> The "guest" variants are intended to work with (potentially) fully guest
>> controlled addresses, while the "unsafe" variants are not.
> 
> Just to clarify, both work against user addresses, but guest variants
> need to be more careful because the guest provided address can also be
> modified?
> 
> I'm trying to understand the difference between "fully guest
> controlled" and "guest controlled".

Not exactly, no. "unsafe" means access to anything which may
fault, guest controlled or not. do_invalid_op()'s reading of
the insn stream is a good example - the faulting insn there
isn't guest controlled at all, but we still want to be careful
when trying to read these bytes, as we don't want to fully
trust %rip there.
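
In code terms (signatures approximated, merely to illustrate the intended
split):

    uint8_t opcode;
    unsigned long val;

    /* (potentially) fully guest controlled address: */
    if ( copy_from_guest_pv(&val, guest_ptr, sizeof(val)) )
        return -EFAULT;

    /* Xen-derived address which may still fault, e.g. insn bytes: */
    if ( get_unsafe(opcode, (const uint8_t *)regs->rip) )
        /* handle the fault */;

Only the former additionally gets the speculative hardening introduced
later in the series; the latter deliberately doesn't.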

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-05 16:13     ` Jan Beulich
@ 2021-02-05 16:18       ` Roger Pau Monné
  2021-02-05 16:26         ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-05 16:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Fri, Feb 05, 2021 at 05:13:22PM +0100, Jan Beulich wrote:
> On 05.02.2021 16:43, Roger Pau Monné wrote:
> > On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
> >> The "guest" variants are intended to work with (potentially) fully guest
> >> controlled addresses, while the "unsafe" variants are not.
> > 
> > Just to clarify, both work against user addresses, but guest variants
> > need to be more careful because the guest provided address can also be
> > modified?
> > 
> > I'm trying to understand the difference between "fully guest
> > controlled" and "guest controlled".
> 
> Not exactly, no. "unsafe" means access to anything which may
> fault, guest controlled or not. do_invalid_op()'s reading of
> the insn stream is a good example - the faulting insn there
> isn't guest controlled at all, but we still want to be careful
> when trying to read these bytes, as we don't want to fully
> trust %rip there.

Would it make sense to treat everything as 'guest' accesses for the
sake of not having this difference?

I think having two accessors is likely to cause confusion and could
possibly lead to the wrong one being used in unexpected contexts. Does
it add too big a performance penalty to always use the most
restrictive one?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-05 16:18       ` Roger Pau Monné
@ 2021-02-05 16:26         ` Jan Beulich
  2021-02-09 13:07           ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-05 16:26 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 05.02.2021 17:18, Roger Pau Monné wrote:
> On Fri, Feb 05, 2021 at 05:13:22PM +0100, Jan Beulich wrote:
>> On 05.02.2021 16:43, Roger Pau Monné wrote:
>>> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
>>>> The "guest" variants are intended to work with (potentially) fully guest
>>>> controlled addresses, while the "unsafe" variants are not.
>>>
>>> Just to clarify, both work against user addresses, but guest variants
>>> need to be more careful because the guest provided address can also be
>>> modified?
>>>
>>> I'm trying to understand the difference between "fully guest
>>> controlled" and "guest controlled".
>>
>> Not exactly, not. "unsafe" means access to anything which may
>> fault, guest controlled or not. do_invalid_op()'s reading of
>> the insn stream is a good example - the faulting insn there
>> isn't guest controlled at all, but we still want to be careful
>> when trying to read these bytes, as we don't want to fully
>> trust %rip there.
> 
> Would it make sense to treat everything as 'guest' accesses for the
> sake of not having this difference?

That's what we've been doing until now. It is the purpose of
this change to allow the two to behave differently.

> I think having two accessors is likely to cause confusion and could
> possibly lead to the wrong one being used in unexpected contexts. Does
> it add a too big performance penalty to always use the most
> restrictive one?

The problem is the most restrictive one is going to be too
restrictive - we wouldn't be able to access Xen space anymore
e.g. from the place pointed at above as example. This is
because for guest accesses (but not for "unsafe" ones) we're
going to divert them into non-canonical space (and hence make
speculation impossible, as such an access would fault) if it
would touch Xen space.
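
In (branchy) C terms the diversion amounts to something like the sketch
below (helper name made up); the actual implementation, patch 4's
guest_access_mask_ptr, does the same thing branchlessly in assembly:

    static inline void *guest_access_mask(void *p)
    {
        unsigned long addr = (unsigned long)p;

        /*
         * Clear bit 63 for anything below HYPERVISOR_VIRT_END: lower-half
         * (guest) addresses are unaffected, while pointers into Xen's
         * range become non-canonical, so even a speculative access
         * through them faults rather than touching hypervisor data.
         * Guest-kernel addresses above HYPERVISOR_VIRT_END are left
         * alone.
         */
        if ( addr < HYPERVISOR_VIRT_END )
            addr &= ~(1UL << 63);

        return (void *)addr;
    }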

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-05 16:26         ` Jan Beulich
@ 2021-02-09 13:07           ` Roger Pau Monné
  2021-02-09 13:15             ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 13:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Fri, Feb 05, 2021 at 05:26:33PM +0100, Jan Beulich wrote:
> On 05.02.2021 17:18, Roger Pau Monné wrote:
> > On Fri, Feb 05, 2021 at 05:13:22PM +0100, Jan Beulich wrote:
> >> On 05.02.2021 16:43, Roger Pau Monné wrote:
> >>> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
> >>>> The "guest" variants are intended to work with (potentially) fully guest
> >>>> controlled addresses, while the "unsafe" variants are not.
> >>>
> >>> Just to clarify, both work against user addresses, but guest variants
> >>> need to be more careful because the guest provided address can also be
> >>> modified?
> >>>
> >>> I'm trying to understand the difference between "fully guest
> >>> controlled" and "guest controlled".
> >>
> >> Not exactly, not. "unsafe" means access to anything which may
> >> fault, guest controlled or not. do_invalid_op()'s reading of
> >> the insn stream is a good example - the faulting insn there
> >> isn't guest controlled at all, but we still want to be careful
> >> when trying to read these bytes, as we don't want to fully
> >> trust %rip there.

Oh, I see. It's possible that %rip points to an unmapped address
there, and we need to be careful when reading, even if the value of
%rip cannot be controlled by the guest and can legitimately point to
Xen's address space.

> > Would it make sense to treat everything as 'guest' accesses for the
> > sake of not having this difference?
> 
> That's what we've been doing until now. It is the purpose of
> this change to allow the two to behave differently.
> 
> > I think having two accessors is likely to cause confusion and could
> > possibly lead to the wrong one being used in unexpected contexts. Does
> > it add a too big performance penalty to always use the most
> > restrictive one?
> 
> The problem is the most restrictive one is going to be too
> restrictive - we wouldn't be able to access Xen space anymore
> e.g. from the place pointed at above as example. This is
> because for guest accesses (but not for "unsafe" ones) we're
> going to divert them into non-canonical space (and hence make
> speculation impossible, as such an access would fault) if it
> would touch Xen space.

Yes, I understand now. I think it would have been helpful (for me) to
have the first sentence as:

The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are expected to be
used in order to access addresses not under the guest control, but
that could trigger faults anyway (like accessing the instruction
stream in do_invalid_op).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-09 13:07           ` Roger Pau Monné
@ 2021-02-09 13:15             ` Jan Beulich
  2021-02-09 14:46               ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-09 13:15 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 09.02.2021 14:07, Roger Pau Monné wrote:
> On Fri, Feb 05, 2021 at 05:26:33PM +0100, Jan Beulich wrote:
>> On 05.02.2021 17:18, Roger Pau Monné wrote:
>>> On Fri, Feb 05, 2021 at 05:13:22PM +0100, Jan Beulich wrote:
>>>> On 05.02.2021 16:43, Roger Pau Monné wrote:
>>>>> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
>>>>>> The "guest" variants are intended to work with (potentially) fully guest
>>>>>> controlled addresses, while the "unsafe" variants are not.
>>>>>
>>>>> Just to clarify, both work against user addresses, but guest variants
>>>>> need to be more careful because the guest provided address can also be
>>>>> modified?
>>>>>
>>>>> I'm trying to understand the difference between "fully guest
>>>>> controlled" and "guest controlled".
>>>>
>>>> Not exactly, not. "unsafe" means access to anything which may
>>>> fault, guest controlled or not. do_invalid_op()'s reading of
>>>> the insn stream is a good example - the faulting insn there
>>>> isn't guest controlled at all, but we still want to be careful
>>>> when trying to read these bytes, as we don't want to fully
>>>> trust %rip there.
> 
> Oh, I see. It's possible that %rip points to an unmapped address
> there, and we need to be careful when reading, even if the value of
> %rip cannot be controlled by the guest and can legitimately point to
> Xen's address space.
> 
>>> Would it make sense to treat everything as 'guest' accesses for the
>>> sake of not having this difference?
>>
>> That's what we've been doing until now. It is the purpose of
>> this change to allow the two to behave differently.
>>
>>> I think having two accessors is likely to cause confusion and could
>>> possibly lead to the wrong one being used in unexpected contexts. Does
>>> it add a too big performance penalty to always use the most
>>> restrictive one?
>>
>> The problem is the most restrictive one is going to be too
>> restrictive - we wouldn't be able to access Xen space anymore
>> e.g. from the place pointed at above as example. This is
>> because for guest accesses (but not for "unsafe" ones) we're
>> going to divert them into non-canonical space (and hence make
>> speculation impossible, as such an access would fault) if it
>> would touch Xen space.
> 
> Yes, I understand now. I think it would have been helpful (for me) to
> have the first sentence as:
> 
> The "guest" variants are intended to work with (potentially) fully guest
> controlled addresses, while the "unsafe" variants are expected to be
> used in order to access addresses not under the guest control, but
> that could trigger faults anyway (like accessing the instruction
> stream in do_invalid_op).

I can use some of this, but in particular "access addresses not
under the guest control" isn't entirely correct. The question isn't
whether there's a guest control aspect, but which part of the
address space the addresses are in. See specifically "x86/PV: use
get_unsafe() instead of copy_from_unsafe()" for two pretty good
examples. The addresses within the linear page tables are - in a
way at least - still somewhat guest controlled, but we're
deliberately accessing Xen virtual addresses there.
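
Roughly (a sketch only, with "linear" being the guest linear address in
question and all checks omitted):

    l1_pgentry_t l1e;

    /*
     * The slot index derives from a guest linear address, but the
     * pointer itself is into Xen's linear page table area - a Xen
     * virtual address, hence "unsafe" rather than "guest".
     */
    if ( get_unsafe(l1e, &__linear_l1_table[l1_linear_offset(linear)]) )
        l1e = l1e_empty();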

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-09 13:15             ` Jan Beulich
@ 2021-02-09 14:46               ` Roger Pau Monné
  2021-02-09 14:57                 ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 14:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Tue, Feb 09, 2021 at 02:15:18PM +0100, Jan Beulich wrote:
> On 09.02.2021 14:07, Roger Pau Monné wrote:
> > On Fri, Feb 05, 2021 at 05:26:33PM +0100, Jan Beulich wrote:
> >> On 05.02.2021 17:18, Roger Pau Monné wrote:
> >>> On Fri, Feb 05, 2021 at 05:13:22PM +0100, Jan Beulich wrote:
> >>>> On 05.02.2021 16:43, Roger Pau Monné wrote:
> >>>>> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
> >>>>>> The "guest" variants are intended to work with (potentially) fully guest
> >>>>>> controlled addresses, while the "unsafe" variants are not.
> >>>>>
> >>>>> Just to clarify, both work against user addresses, but guest variants
> >>>>> need to be more careful because the guest provided address can also be
> >>>>> modified?
> >>>>>
> >>>>> I'm trying to understand the difference between "fully guest
> >>>>> controlled" and "guest controlled".
> >>>>
> >>>> Not exactly, not. "unsafe" means access to anything which may
> >>>> fault, guest controlled or not. do_invalid_op()'s reading of
> >>>> the insn stream is a good example - the faulting insn there
> >>>> isn't guest controlled at all, but we still want to be careful
> >>>> when trying to read these bytes, as we don't want to fully
> >>>> trust %rip there.
> > 
> > Oh, I see. It's possible that %rip points to an unmapped address
> > there, and we need to be careful when reading, even if the value of
> > %rip cannot be controlled by the guest and can legitimately point to
> > Xen's address space.
> > 
> >>> Would it make sense to treat everything as 'guest' accesses for the
> >>> sake of not having this difference?
> >>
> >> That's what we've been doing until now. It is the purpose of
> >> this change to allow the two to behave differently.
> >>
> >>> I think having two accessors is likely to cause confusion and could
> >>> possibly lead to the wrong one being used in unexpected contexts. Does
> >>> it add a too big performance penalty to always use the most
> >>> restrictive one?
> >>
> >> The problem is the most restrictive one is going to be too
> >> restrictive - we wouldn't be able to access Xen space anymore
> >> e.g. from the place pointed at above as example. This is
> >> because for guest accesses (but not for "unsafe" ones) we're
> >> going to divert them into non-canonical space (and hence make
> >> speculation impossible, as such an access would fault) if it
> >> would touch Xen space.
> > 
> > Yes, I understand now. I think it would have been helpful (for me) to
> > have the first sentence as:
> > 
> > The "guest" variants are intended to work with (potentially) fully guest
> > controlled addresses, while the "unsafe" variants are expected to be
> > used in order to access addresses not under the guest control, but
> > that could trigger faults anyway (like accessing the instruction
> > stream in do_invalid_op).
> 
> I can use some of this, but in particular "access addresses not
> under the guest control" isn't entirely correct. The question isn't
> whether there's a guest control aspect, but which part of the
> address space the addresses are in. See specifically "x86/PV: use
> get_unsafe() instead of copy_from_unsafe()" for two pretty good
> examples. The addresses within the linear page tables are - in a
> way at least - still somewhat guest controlled, but we're
> deliberately accessing Xen virtual addresses there.

OK, could this be somehow added to the commit message then?

Maybe it would be better to have something like:

The "guest" variants are intended to work with addresses belonging to
the guest address space, while the "unsafe" variants should be used
for addresses that fall into the Xen address space.

I think it's important to list exactly how the distinction between the
guest/unsafe accessors is made, or else it's impossible to review that
the changes done here are correct.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-01-14 15:04 ` [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants Jan Beulich
  2021-02-05 15:43   ` Roger Pau Monné
@ 2021-02-09 14:55   ` Roger Pau Monné
  2021-02-09 15:14     ` Jan Beulich
  1 sibling, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 14:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
> The "guest" variants are intended to work with (potentially) fully guest
> controlled addresses, while the "unsafe" variants are not. (For
> descriptor table accesses the low bits of the addresses may still be
> guest controlled, but this still won't allow speculation to "escape"
> into unwanted areas.)

Descriptor table is also in guest address space, and hence would be
fine using the _guest accessors? (even if not in guest control and
thus unsuitable as a speculation vector)

> Subsequently we will want them to have different
> behavior, so as first step identify which one is which. For now, both
> groups of constructs alias one another.
> 
> Double underscore prefixes are retained only on __{get,put}_guest(), to
> allow still distinguishing them from their "checking" counterparts once
> they also get renamed (to {get,put}_guest()).
> 
> Since for them it's almost a full re-write, move what becomes
> {get,put}_unsafe_size() into the "common" uaccess.h (x86_64/*.h should
> disappear at some point anyway).
> 
> In __copy_to_user() one of the two casts in each put_guest_size()
> invocation gets dropped. They're not needed and did break symmetry with
> __copy_from_user().
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -776,9 +776,9 @@ shadow_write_entries(void *d, void *s, i
>      /* Because we mirror access rights at all levels in the shadow, an
>       * l2 (or higher) entry with the RW bit cleared will leave us with
>       * no write access through the linear map.
> -     * We detect that by writing to the shadow with __put_user() and
> +     * We detect that by writing to the shadow with put_unsafe() and
>       * using map_domain_page() to get a writeable mapping if we need to. */
> -    if ( __put_user(*dst, dst) )
> +    if ( put_unsafe(*dst, dst) )
>      {
>          perfc_incr(shadow_linear_map_failed);
>          map = map_domain_page(mfn);
> --- a/xen/arch/x86/pv/emul-gate-op.c
> +++ b/xen/arch/x86/pv/emul-gate-op.c
> @@ -40,7 +40,7 @@ static int read_gate_descriptor(unsigned
>           ((gate_sel >> 3) + !is_pv_32bit_vcpu(v) >=
>            (gate_sel & 4 ? v->arch.pv.ldt_ents
>                          : v->arch.pv.gdt_ents)) ||
> -         __get_user(desc, pdesc) )
> +         get_unsafe(desc, pdesc) )
>          return 0;
>  
>      *sel = (desc.a >> 16) & 0x0000fffc;
> @@ -59,7 +59,7 @@ static int read_gate_descriptor(unsigned
>      {
>          if ( (*ar & 0x1f00) != 0x0c00 ||
>               /* Limit check done above already. */
> -             __get_user(desc, pdesc + 1) ||
> +             get_unsafe(desc, pdesc + 1) ||
>               (desc.b & 0x1f00) )
>              return 0;
>  
> @@ -294,7 +294,7 @@ void pv_emulate_gate_op(struct cpu_user_
>          { \
>              --stkp; \
>              esp -= 4; \
> -            rc = __put_user(item, stkp); \
> +            rc = __put_guest(item, stkp); \
>              if ( rc ) \
>              { \
>                  pv_inject_page_fault(PFEC_write_access, \
> @@ -362,7 +362,7 @@ void pv_emulate_gate_op(struct cpu_user_
>                      unsigned int parm;
>  
>                      --ustkp;
> -                    rc = __get_user(parm, ustkp);
> +                    rc = __get_guest(parm, ustkp);
>                      if ( rc )
>                      {
>                          pv_inject_page_fault(0, (unsigned long)(ustkp + 1) - rc);
> --- a/xen/arch/x86/pv/emulate.c
> +++ b/xen/arch/x86/pv/emulate.c
> @@ -34,13 +34,13 @@ int pv_emul_read_descriptor(unsigned int
>      if ( sel < 4 ||
>           /*
>            * Don't apply the GDT limit here, as the selector may be a Xen
> -          * provided one. __get_user() will fail (without taking further
> +          * provided one. get_unsafe() will fail (without taking further
>            * action) for ones falling in the gap between guest populated
>            * and Xen ones.
>            */
>           ((sel & 4) && (sel >> 3) >= v->arch.pv.ldt_ents) )
>          desc.b = desc.a = 0;
> -    else if ( __get_user(desc, gdt_ldt_desc_ptr(sel)) )
> +    else if ( get_unsafe(desc, gdt_ldt_desc_ptr(sel)) )
>          return 0;
>      if ( !insn_fetch )
>          desc.b &= ~_SEGMENT_L;
> --- a/xen/arch/x86/pv/iret.c
> +++ b/xen/arch/x86/pv/iret.c
> @@ -114,15 +114,15 @@ unsigned int compat_iret(void)
>      regs->rsp = (u32)regs->rsp;
>  
>      /* Restore EAX (clobbered by hypercall). */
> -    if ( unlikely(__get_user(regs->eax, (u32 *)regs->rsp)) )
> +    if ( unlikely(__get_guest(regs->eax, (u32 *)regs->rsp)) )
>      {
>          domain_crash(v->domain);
>          return 0;
>      }
>  
>      /* Restore CS and EIP. */
> -    if ( unlikely(__get_user(regs->eip, (u32 *)regs->rsp + 1)) ||
> -        unlikely(__get_user(regs->cs, (u32 *)regs->rsp + 2)) )
> +    if ( unlikely(__get_guest(regs->eip, (u32 *)regs->rsp + 1)) ||
> +        unlikely(__get_guest(regs->cs, (u32 *)regs->rsp + 2)) )
>      {
>          domain_crash(v->domain);
>          return 0;
> @@ -132,7 +132,7 @@ unsigned int compat_iret(void)
>       * Fix up and restore EFLAGS. We fix up in a local staging area
>       * to avoid firing the BUG_ON(IOPL) check in arch_get_info_guest.
>       */
> -    if ( unlikely(__get_user(eflags, (u32 *)regs->rsp + 3)) )
> +    if ( unlikely(__get_guest(eflags, (u32 *)regs->rsp + 3)) )
>      {
>          domain_crash(v->domain);
>          return 0;
> @@ -164,16 +164,16 @@ unsigned int compat_iret(void)
>          {
>              for (i = 1; i < 10; ++i)
>              {
> -                rc |= __get_user(x, (u32 *)regs->rsp + i);
> -                rc |= __put_user(x, (u32 *)(unsigned long)ksp + i);
> +                rc |= __get_guest(x, (u32 *)regs->rsp + i);
> +                rc |= __put_guest(x, (u32 *)(unsigned long)ksp + i);
>              }
>          }
>          else if ( ksp > regs->esp )
>          {
>              for ( i = 9; i > 0; --i )
>              {
> -                rc |= __get_user(x, (u32 *)regs->rsp + i);
> -                rc |= __put_user(x, (u32 *)(unsigned long)ksp + i);
> +                rc |= __get_guest(x, (u32 *)regs->rsp + i);
> +                rc |= __put_guest(x, (u32 *)(unsigned long)ksp + i);
>              }
>          }
>          if ( rc )
> @@ -189,7 +189,7 @@ unsigned int compat_iret(void)
>              eflags &= ~X86_EFLAGS_IF;
>          regs->eflags &= ~(X86_EFLAGS_VM|X86_EFLAGS_RF|
>                            X86_EFLAGS_NT|X86_EFLAGS_TF);
> -        if ( unlikely(__put_user(0, (u32 *)regs->rsp)) )
> +        if ( unlikely(__put_guest(0, (u32 *)regs->rsp)) )
>          {
>              domain_crash(v->domain);
>              return 0;
> @@ -205,8 +205,8 @@ unsigned int compat_iret(void)
>      else if ( ring_1(regs) )
>          regs->esp += 16;
>      /* Return to ring 2/3: restore ESP and SS. */
> -    else if ( __get_user(regs->ss, (u32 *)regs->rsp + 5) ||
> -              __get_user(regs->esp, (u32 *)regs->rsp + 4) )
> +    else if ( __get_guest(regs->ss, (u32 *)regs->rsp + 5) ||
> +              __get_guest(regs->esp, (u32 *)regs->rsp + 4) )
>      {
>          domain_crash(v->domain);
>          return 0;
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -274,7 +274,7 @@ static void compat_show_guest_stack(stru
>      {
>          if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
>              break;
> -        if ( __get_user(addr, stack) )
> +        if ( get_unsafe(addr, stack) )
>          {
>              if ( i != 0 )
>                  printk("\n    ");
> @@ -343,7 +343,7 @@ static void show_guest_stack(struct vcpu
>      {
>          if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
>              break;
> -        if ( __get_user(addr, stack) )
> +        if ( get_unsafe(addr, stack) )

Shouldn't accessing the guest stack use the _guest accessors?

Or has this address been verified by Xen and not in indirect control of
the guest, and thus can't be used for speculation purposes?

I feel like this should be using the _guest accessors anyway, as the
guest stack is an address in guest space?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-09 14:46               ` Roger Pau Monné
@ 2021-02-09 14:57                 ` Jan Beulich
  2021-02-09 15:23                   ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-09 14:57 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 09.02.2021 15:46, Roger Pau Monné wrote:
> On Tue, Feb 09, 2021 at 02:15:18PM +0100, Jan Beulich wrote:
>> On 09.02.2021 14:07, Roger Pau Monné wrote:
>>> On Fri, Feb 05, 2021 at 05:26:33PM +0100, Jan Beulich wrote:
>>>> On 05.02.2021 17:18, Roger Pau Monné wrote:
>>>>> On Fri, Feb 05, 2021 at 05:13:22PM +0100, Jan Beulich wrote:
>>>>>> On 05.02.2021 16:43, Roger Pau Monné wrote:
>>>>>>> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
>>>>>>>> The "guest" variants are intended to work with (potentially) fully guest
>>>>>>>> controlled addresses, while the "unsafe" variants are not.
>>>>>>>
>>>>>>> Just to clarify, both work against user addresses, but guest variants
>>>>>>> need to be more careful because the guest provided address can also be
>>>>>>> modified?
>>>>>>>
>>>>>>> I'm trying to understand the difference between "fully guest
>>>>>>> controlled" and "guest controlled".
>>>>>>
>>>>>> Not exactly, not. "unsafe" means access to anything which may
>>>>>> fault, guest controlled or not. do_invalid_op()'s reading of
>>>>>> the insn stream is a good example - the faulting insn there
>>>>>> isn't guest controlled at all, but we still want to be careful
>>>>>> when trying to read these bytes, as we don't want to fully
>>>>>> trust %rip there.
>>>
>>> Oh, I see. It's possible that %rip points to an unmapped address
>>> there, and we need to be careful when reading, even if the value of
>>> %rip cannot be controlled by the guest and can legitimately point to
>>> Xen's address space.
>>>
>>>>> Would it make sense to treat everything as 'guest' accesses for the
>>>>> sake of not having this difference?
>>>>
>>>> That's what we've been doing until now. It is the purpose of
>>>> this change to allow the two to behave differently.
>>>>
>>>>> I think having two accessors is likely to cause confusion and could
>>>>> possibly lead to the wrong one being used in unexpected contexts. Does
>>>>> it add a too big performance penalty to always use the most
>>>>> restrictive one?
>>>>
>>>> The problem is the most restrictive one is going to be too
>>>> restrictive - we wouldn't be able to access Xen space anymore
>>>> e.g. from the place pointed at above as example. This is
>>>> because for guest accesses (but not for "unsafe" ones) we're
>>>> going to divert them into non-canonical space (and hence make
>>>> speculation impossible, as such an access would fault) if it
>>>> would touch Xen space.
>>>
>>> Yes, I understand now. I think it would have been helpful (for me) to
>>> have the first sentence as:
>>>
>>> The "guest" variants are intended to work with (potentially) fully guest
>>> controlled addresses, while the "unsafe" variants are expected to be
>>> used in order to access addresses not under the guest control, but
>>> that could trigger faults anyway (like accessing the instruction
>>> stream in do_invalid_op).
>>
>> I can use some of this, but in particular "access addresses not
>> under the guest control" isn't entirely correct. The question isn't
>> whether there's a guest control aspect, but which part of the
>> address space the addresses are in. See specifically "x86/PV: use
>> get_unsafe() instead of copy_from_unsafe()" for two pretty good
>> examples. The addresses within the linear page tables are - in a
>> way at least - still somewhat guest controlled, but we're
>> deliberately accessing Xen virtual addresses there.
> 
> OK, could this be somehow added to the commit message then?
> 
> Maybe it would be better to have something like:
> 
> The "guest" variants are intended to work with addresses belonging to
> the guest address space, while the "unsafe" variants should be used
> for addresses that fall into the Xen address space.
> 
> I think it's important to list exactly how the distinction between the
> guest/unsafe accessors is made, or else it's impossible to review that
> the changes done here are correct.

I did change the paragraph to read

The "guest" variants are intended to work with (potentially) fully guest
controlled addresses, while the "unsafe" variants are intended to be
used in order to access addresses not (directly) under guest control,
within Xen's part of virtual address space. (For linear page table and
descriptor table accesses the low bits of the addresses may still be
guest controlled, but this still won't allow speculation to "escape"
into unwanted areas.) Subsequently we will want them to have distinct
behavior, so as first step identify which one is which. For now, both
groups of constructs alias one another.

earlier today. Is this clear enough? Relevant parts I've also mirrored
into patch 3's description?

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-09 14:55   ` Roger Pau Monné
@ 2021-02-09 15:14     ` Jan Beulich
  2021-02-09 15:27       ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-09 15:14 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 09.02.2021 15:55, Roger Pau Monné wrote:
> On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
>> The "guest" variants are intended to work with (potentially) fully guest
>> controlled addresses, while the "unsafe" variants are not. (For
>> descriptor table accesses the low bits of the addresses may still be
>> guest controlled, but this still won't allow speculation to "escape"
>> into unwanted areas.)
> 
> Descriptor table is also in guest address space, and hence would be
> fine using the _guest accessors? (even if not in guest control and
> thus unsuitable as a speculation vector)

No, we don't access descriptor tables in guest space, I don't
think. See read_gate_descriptor() for an example. After all PV
guests don't even have the full concept of self-managed (in
their VA space) descriptor tables (GDT gets specified in terms
of frames, while LDT gets specified in terms of (VA,size)
tuples, but just for Xen to read the underlying page table
entries upon 1st access).

>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -274,7 +274,7 @@ static void compat_show_guest_stack(stru
>>      {
>>          if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
>>              break;
>> -        if ( __get_user(addr, stack) )
>> +        if ( get_unsafe(addr, stack) )
>>          {
>>              if ( i != 0 )
>>                  printk("\n    ");
>> @@ -343,7 +343,7 @@ static void show_guest_stack(struct vcpu
>>      {
>>          if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
>>              break;
>> -        if ( __get_user(addr, stack) )
>> +        if ( get_unsafe(addr, stack) )
> 
> Shouldn't accessing the guest stack use the _guest accessors?

Hmm, yes indeed.

> Or has this address been verified by Xen and not in indirect control of
> the guest, and thus can't be used for speculation purposes?
> 
> I feel like this should be using the _guest accessors anyway, as the
> guest stack is an address in guest space?

I think this being a debugging function only, not directly
accessible by guests, is what made me think speculation is
not an issue here and hence the "unsafe" variants are fine
to use (they're slightly cheaper after all, once the
subsequent changes are in place). But I guess I will better
switch these two around.

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-09 14:57                 ` Jan Beulich
@ 2021-02-09 15:23                   ` Roger Pau Monné
  0 siblings, 0 replies; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 15:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Tue, Feb 09, 2021 at 03:57:22PM +0100, Jan Beulich wrote:
> I did change the paragraph to read
> 
> The "guest" variants are intended to work with (potentially) fully guest
> controlled addresses, while the "unsafe" variants are intended to be
> used in order to access addresses not (directly) under guest control,
> within Xen's part of virtual address space. (For linear page table and
> descriptor table accesses the low bits of the addresses may still be
> guest controlled, but this still won't allow speculation to "escape"
> into unwanted areas.) Subsequently we will want them to have distinct
> behavior, so as first step identify which one is which. For now, both
> groups of constructs alias one another.
> 
> earlier today. Is this clear enough? Relevant parts I've also mirrored
> into patch 3's description?

Yes, that does indeed seem clearer, thanks.

Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants
  2021-02-09 15:14     ` Jan Beulich
@ 2021-02-09 15:27       ` Roger Pau Monné
  0 siblings, 0 replies; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 15:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Tue, Feb 09, 2021 at 04:14:41PM +0100, Jan Beulich wrote:
> On 09.02.2021 15:55, Roger Pau Monné wrote:
> > On Thu, Jan 14, 2021 at 04:04:11PM +0100, Jan Beulich wrote:
> >> The "guest" variants are intended to work with (potentially) fully guest
> >> controlled addresses, while the "unsafe" variants are not. (For
> >> descriptor table accesses the low bits of the addresses may still be
> >> guest controlled, but this still won't allow speculation to "escape"
> >> into unwanted areas.)
> > 
> > Descriptor table is also in guest address space, and hence would be
> > fine using the _guest accessors? (even if not in guest control and
> > thus unsuitable as a speculation vector)
> 
> No, we don't access descriptor tables in guest space, I don't
> think. See read_gate_descriptor() for an example. After all PV
> guests don't even have the full concept of self-managed (in
> their VA space) descriptor tables (GDT gets specified in terms
> of frames, while LDT gets specified in terms of (VA,size)
> tuples, but just for Xen to read the underlying page table
> entries upon 1st access).

I see, read_gate_descriptor uses gdt_ldt_desc_ptr which points into
the per-domain Xen virt space, so it's indeed an address in Xen
address space. I realize it doesn't make sense for the descriptor
table to be in guest address space, as it's not fully under guest
control.

> >> --- a/xen/arch/x86/traps.c
> >> +++ b/xen/arch/x86/traps.c
> >> @@ -274,7 +274,7 @@ static void compat_show_guest_stack(stru
> >>      {
> >>          if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
> >>              break;
> >> -        if ( __get_user(addr, stack) )
> >> +        if ( get_unsafe(addr, stack) )
> >>          {
> >>              if ( i != 0 )
> >>                  printk("\n    ");
> >> @@ -343,7 +343,7 @@ static void show_guest_stack(struct vcpu
> >>      {
> >>          if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
> >>              break;
> >> -        if ( __get_user(addr, stack) )
> >> +        if ( get_unsafe(addr, stack) )
> > 
> > Shouldn't accessing the guest stack use the _guest accessors?
> 
> Hmm, yes indeed.
> 
> > Or has this address been verified by Xen and not in indirect control of
> > the guest, and thus can't be used for speculation purposes?
> > 
> > I feel like this should be using the _guest accessors anyway, as the
> > guest stack is an address in guest space?
> 
> I think this being a debugging function only, not directly
> accessible by guests, is what made me think speculation is
> not an issue here and hence the "unsafe" variants are fine
> to use (they're slightly cheaper after all, once the
> subsequent changes are in place). But I guess I will better
> switch these two around.

With that:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/17] x86: split __copy_{from,to}_user() into "guest" and "unsafe" variants
  2021-01-14 15:04 ` [PATCH 03/17] x86: split __copy_{from,to}_user() " Jan Beulich
@ 2021-02-09 16:06   ` Roger Pau Monné
  2021-02-09 17:03     ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 16:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Thu, Jan 14, 2021 at 04:04:32PM +0100, Jan Beulich wrote:
> The "guest" variants are intended to work with (potentially) fully guest
> controlled addresses, while the "unsafe" variants are not. Subsequently
> we will want them to have different behavior, so as first step identify
> which one is which. For now, both groups of constructs alias one another.
> 
> Double underscore prefixes are retained only on
> __copy_{from,to}_guest_pv(), to allow still distinguishing them from
> their "checking" counterparts once they also get renamed (to
> copy_{from,to}_guest_pv()).
> 
> Add previously missing __user at some call sites.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Instead of __copy_{from,to}_guest_pv(), perhaps name them just
> __copy_{from,to}_pv()?
> 
> --- a/xen/arch/x86/gdbstub.c
> +++ b/xen/arch/x86/gdbstub.c
> @@ -33,13 +33,13 @@ gdb_arch_signal_num(struct cpu_user_regs
>  unsigned int
>  gdb_arch_copy_from_user(void *dest, const void *src, unsigned len)
>  {
> -    return __copy_from_user(dest, src, len);
> +    return copy_from_unsafe(dest, src, len);
>  }
>  
>  unsigned int 
>  gdb_arch_copy_to_user(void *dest, const void *src, unsigned len)
>  {
> -    return __copy_to_user(dest, src, len);
> +    return copy_to_unsafe(dest, src, len);

I assume we need to use the unsafe variants here, because the input
addresses are fully controlled by gdb, and hence not suitable as
speculation vectors?

Also could point to addresses belonging to both Xen or the guest
address space AFAICT.

> --- a/xen/include/asm-x86/uaccess.h
> +++ b/xen/include/asm-x86/uaccess.h

At some point we should also rename this to pvaccess.h maybe?

> @@ -197,21 +197,20 @@ do {
>  #define get_guest_size get_unsafe_size
>  
>  /**
> - * __copy_to_user: - Copy a block of data into user space, with less checking
> - * @to:   Destination address, in user space.
> - * @from: Source address, in kernel space.
> + * __copy_to_guest_pv: - Copy a block of data into guest space, with less
> + *                       checking

I would have preferred pv to be a prefix rather than a suffix, but we
already have the hvm accessors using that nomenclature.

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-01-14 15:04 ` [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse Jan Beulich
@ 2021-02-09 16:26   ` Roger Pau Monné
  2021-02-10 16:55     ` Jan Beulich
  2021-02-12 10:41   ` Roger Pau Monné
  1 sibling, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-09 16:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
> Inspired by
> https://lore.kernel.org/lkml/f12e7d3cecf41b2c29734ea45a393be21d4a8058.1597848273.git.jpoimboe@redhat.com/
> and prior work in that area of x86 Linux, suppress speculation with
> guest specified pointer values by suitably masking the addresses to
> non-canonical space in case they fall into Xen's virtual address range.
> 
> Introduce a new Kconfig control.
> 
> Note that it is necessary in such code to avoid using "m" kind operands:
> If we didn't, there would be no guarantee that the register passed to
> guest_access_mask_ptr is also the (base) one used for the memory access.
> 
> As a minor unrelated change in get_unsafe_asm() the unnecessary "itype"
> parameter gets dropped and the XOR on the fixup path gets changed to be
> a 32-bit one in all cases: This way we avoid pointless REX.W or operand
> size overrides, or writes to partial registers.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> The insn sequence chosen is certainly up for discussion; I've picked
> this one despite the RCR because alternatives I could come up with,
> like
> 
> 	mov	$(HYPERVISOR_VIRT_END), %rax
> 	mov	$~0, %rdx
> 	mov	$0x7fffffffffffffff, %rcx
> 	cmp	%rax, %rdi
> 	cmovb	%rcx, %rdx
> 	and	%rdx, %rdi
> 
> weren't necessarily better: Either, as above, they are longer and
> require a 3rd scratch register, or they also utilize the carry flag in
> some similar way.
> ---
> Judging from the comment ahead of put_unsafe_asm() we might as well not
> tell gcc at all anymore about the memory access there, now that there's
> no use of the operand anymore in the assembly code.
> 
> --- a/xen/arch/x86/usercopy.c
> +++ b/xen/arch/x86/usercopy.c
> @@ -10,12 +10,19 @@
>  #include <xen/sched.h>
>  #include <asm/uaccess.h>
>  
> -unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
> +#ifndef GUARD
> +# define GUARD UA_KEEP
> +#endif
> +
> +unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
>  {
>      unsigned dummy;
>  
>      stac();
>      asm volatile (
> +        GUARD(
> +        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"

Don't you need to also take 'n' into account here to assert that the
address doesn't end in hypervisor address space? Or that's fine as
speculation wouldn't go that far?

I also wonder why this needs to be done in assembly, could you check
the address(es) using C?

> +        )
>          "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
>          "    jbe  1f\n"
>          "    mov  %k[to], %[cnt]\n"
> @@ -42,6 +49,7 @@ unsigned __copy_to_user_ll(void __user *
>          _ASM_EXTABLE(1b, 2b)
>          : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
>            [aux] "=&r" (dummy)
> +          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
>          : "[aux]" (n)
>          : "memory" );
>      clac();
> @@ -49,12 +57,15 @@ unsigned __copy_to_user_ll(void __user *
>      return n;
>  }
>  
> -unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n)
> +unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n)
>  {
>      unsigned dummy;
>  
>      stac();
>      asm volatile (
> +        GUARD(
> +        "    guest_access_mask_ptr %[from], %q[scratch1], %q[scratch2]\n"
> +        )
>          "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
>          "    jbe  1f\n"
>          "    mov  %k[to], %[cnt]\n"
> @@ -87,6 +98,7 @@ unsigned __copy_from_user_ll(void *to, c
>          _ASM_EXTABLE(1b, 6b)
>          : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
>            [aux] "=&r" (dummy)
> +          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
>          : "[aux]" (n)
>          : "memory" );
>      clac();
> @@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
>      return n;
>  }
>  
> +#if GUARD(1) + 0
> +
>  /**
>   * copy_to_user: - Copy a block of data into user space.
>   * @to:   Destination address, in user space.
> @@ -128,8 +142,11 @@ unsigned clear_user(void __user *to, uns
>  {
>      if ( access_ok(to, n) )
>      {
> +        long dummy;
> +
>          stac();
>          asm volatile (
> +            "    guest_access_mask_ptr %[to], %[scratch1], %[scratch2]\n"
>              "0:  rep stos"__OS"\n"
>              "    mov  %[bytes], %[cnt]\n"
>              "1:  rep stosb\n"
> @@ -140,7 +157,8 @@ unsigned clear_user(void __user *to, uns
>              ".previous\n"
>              _ASM_EXTABLE(0b,3b)
>              _ASM_EXTABLE(1b,2b)
> -            : [cnt] "=&c" (n), [to] "+D" (to)
> +            : [cnt] "=&c" (n), [to] "+D" (to), [scratch1] "=&r" (dummy),
> +              [scratch2] "=&r" (dummy)
>              : [bytes] "r" (n & (BYTES_PER_LONG - 1)),
>                [longs] "0" (n / BYTES_PER_LONG), "a" (0) );
>          clac();
> @@ -174,6 +192,16 @@ unsigned copy_from_user(void *to, const
>      return n;
>  }
>  
> +# undef GUARD
> +# define GUARD UA_DROP
> +# define copy_to_guest_ll copy_to_unsafe_ll
> +# define copy_from_guest_ll copy_from_unsafe_ll
> +# undef __user
> +# define __user
> +# include __FILE__
> +
> +#endif /* GUARD(1) */
> +
>  /*
>   * Local variables:
>   * mode: C
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -446,6 +446,8 @@ UNLIKELY_START(g, create_bounce_frame_ba
>          jmp   asm_domain_crash_synchronous  /* Does not return */
>  __UNLIKELY_END(create_bounce_frame_bad_sp)
>  
> +        guest_access_mask_ptr %rsi, %rax, %rcx
> +
>  #define STORE_GUEST_STACK(reg, n) \
>  0:      movq  %reg,(n)*8(%rsi); \
>          _ASM_EXTABLE(0b, domain_crash_page_fault_ ## n ## x8)
> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -114,6 +114,24 @@ config SPECULATIVE_HARDEN_BRANCH
>  
>  	  If unsure, say Y.
>  
> +config SPECULATIVE_HARDEN_GUEST_ACCESS
> +	bool "Speculative PV Guest Memory Access Hardening"
> +	default y
> +	depends on PV
> +	help
> +	  Contemporary processors may use speculative execution as a
> +	  performance optimisation, but this can potentially be abused by an
> +	  attacker to leak data via speculative sidechannels.
> +
> +	  One source of data leakage is via speculative accesses to hypervisor
> +	  memory through guest controlled values used to access guest memory.
> +
> +	  When enabled, code paths accessing PV guest memory will have guest
> +	  controlled addresses massaged such that memory accesses through them
> +	  won't touch hypervisor address space.
> +
> +	  If unsure, say Y.
> +
>  endmenu
>  
>  config HYPFS
> --- a/xen/include/asm-x86/asm-defns.h
> +++ b/xen/include/asm-x86/asm-defns.h
> @@ -44,3 +44,16 @@
>  .macro INDIRECT_JMP arg:req
>      INDIRECT_BRANCH jmp \arg
>  .endm
> +
> +.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
> +#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
> +    mov $(HYPERVISOR_VIRT_END - 1), \scratch1
> +    mov $~0, \scratch2
> +    cmp \ptr, \scratch1
> +    rcr $1, \scratch2
> +    and \scratch2, \ptr
> +#elif defined(CONFIG_DEBUG) && defined(CONFIG_PV)
> +    xor $~\@, \scratch1
> +    xor $~\@, \scratch2
> +#endif
> +.endm
> --- a/xen/include/asm-x86/uaccess.h
> +++ b/xen/include/asm-x86/uaccess.h
> @@ -13,13 +13,19 @@
>  unsigned copy_to_user(void *to, const void *from, unsigned len);
>  unsigned clear_user(void *to, unsigned len);
>  unsigned copy_from_user(void *to, const void *from, unsigned len);
> +
>  /* Handles exceptions in both to and from, but doesn't do access_ok */
> -unsigned __copy_to_user_ll(void __user*to, const void *from, unsigned n);
> -unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n);
> +unsigned int copy_to_guest_ll(void __user*to, const void *from, unsigned int n);
> +unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n);
> +unsigned int copy_to_unsafe_ll(void *to, const void *from, unsigned int n);
> +unsigned int copy_from_unsafe_ll(void *to, const void *from, unsigned int n);
>  
>  extern long __get_user_bad(void);
>  extern void __put_user_bad(void);
>  
> +#define UA_KEEP(args...) args
> +#define UA_DROP(args...)

I assume UA means user access, and since you have dropped other uses
of user and changed to guest instead I wonder if we should name this
just A_{KEEP/DROP}.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/17] x86: split __copy_{from,to}_user() into "guest" and "unsafe" variants
  2021-02-09 16:06   ` Roger Pau Monné
@ 2021-02-09 17:03     ` Jan Beulich
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-02-09 17:03 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 09.02.2021 17:06, Roger Pau Monné wrote:
> On Thu, Jan 14, 2021 at 04:04:32PM +0100, Jan Beulich wrote:
>> The "guest" variants are intended to work with (potentially) fully guest
>> controlled addresses, while the "unsafe" variants are not. Subsequently
>> we will want them to have different behavior, so as first step identify
>> which one is which. For now, both groups of constructs alias one another.
>>
>> Double underscore prefixes are retained only on
>> __copy_{from,to}_guest_pv(), to allow still distinguishing them from
>> their "checking" counterparts once they also get renamed (to
>> copy_{from,to}_guest_pv()).
>>
>> Add previously missing __user at some call sites.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Instead of __copy_{from,to}_guest_pv(), perhaps name them just
>> __copy_{from,to}_pv()?
>>
>> --- a/xen/arch/x86/gdbstub.c
>> +++ b/xen/arch/x86/gdbstub.c
>> @@ -33,13 +33,13 @@ gdb_arch_signal_num(struct cpu_user_regs
>>  unsigned int
>>  gdb_arch_copy_from_user(void *dest, const void *src, unsigned len)
>>  {
>> -    return __copy_from_user(dest, src, len);
>> +    return copy_from_unsafe(dest, src, len);
>>  }
>>  
>>  unsigned int 
>>  gdb_arch_copy_to_user(void *dest, const void *src, unsigned len)
>>  {
>> -    return __copy_to_user(dest, src, len);
>> +    return copy_to_unsafe(dest, src, len);
> 
> I assume we need to use the unsafe variants here, because the input
> addresses are fully controlled by gdb, and hence not suitable as
> speculation vectors?

Speculation doesn't matter when it comes to debugging, I
think. We were using the variants without access_ok()
checks already anyway to allow access to Xen addresses.
In fact it is my understanding ...

> Also could point to addresses belonging to both Xen or the guest
> address space AFAICT.

... that the primary goal here is to access Xen
addresses, and guest space only falls into the "may also
happen to be accessed" category.

>> --- a/xen/include/asm-x86/uaccess.h
>> +++ b/xen/include/asm-x86/uaccess.h
> 
> At some point we should also rename this to pvaccess.h maybe?

We could, but I'd rather not - this isn't about PV only.
Instead I would simply re-interpret 'u' from standing for
"user" (which didn't make much sense in Xen anyway, and
was only attributed to the Linux origin) to standing for
"unsafe" (both meanings - guest and in-Xen-but-unsafe).

>> @@ -197,21 +197,20 @@ do {
>>  #define get_guest_size get_unsafe_size
>>  
>>  /**
>> - * __copy_to_user: - Copy a block of data into user space, with less checking
>> - * @to:   Destination address, in user space.
>> - * @from: Source address, in kernel space.
>> + * __copy_to_guest_pv: - Copy a block of data into guest space, with less
>> + *                       checking
> 
> I would have preferred pv to be a prefix rather than a suffix, but we
> already have the hvm accessors using that nomenclature.

Right, I wanted to match that naming model. Later we can
think about renaming to copy_{to,from}_{hvm,pv}() or
whatever else naming scheme we like. I have to admit though
I'm not convinced the longer {hvm,pv}_copy_{from,to}_guest()
would really be better.

> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks!

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-02-09 16:26   ` Roger Pau Monné
@ 2021-02-10 16:55     ` Jan Beulich
  2021-02-11  8:11       ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-10 16:55 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 09.02.2021 17:26, Roger Pau Monné wrote:
> On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
>> --- a/xen/arch/x86/usercopy.c
>> +++ b/xen/arch/x86/usercopy.c
>> @@ -10,12 +10,19 @@
>>  #include <xen/sched.h>
>>  #include <asm/uaccess.h>
>>  
>> -unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
>> +#ifndef GUARD
>> +# define GUARD UA_KEEP
>> +#endif
>> +
>> +unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
>>  {
>>      unsigned dummy;
>>  
>>      stac();
>>      asm volatile (
>> +        GUARD(
>> +        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
> 
> Don't you need to also take 'n' into account here to assert that the
> address doesn't end in hypervisor address space? Or that's fine as
> speculation wouldn't go that far?

Like elsewhere this leverages that the hypervisor VA range starts
immediately after the non-canonical hole. I'm unaware of
speculation being able to cross over that hole.
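
In picture form (48-bit canonical addressing; a rough sketch, with the
upper boundary of Xen's range left symbolic):

    0x0000000000000000 - 0x00007fffffffffff   lower half (guest / user)
    0x0000800000000000 - 0xffff7fffffffffff   non-canonical hole
    0xffff800000000000 - HYPERVISOR_VIRT_END  Xen
    HYPERVISOR_VIRT_END - ...                 remaining guest (kernel) space

A copy whose (masked) start is in the lower half would have to run
across the entire hole before reaching Xen's range, while one starting
above HYPERVISOR_VIRT_END only moves away from it.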

> I also wonder why this needs to be done in assembly, could you check
> the address(es) using C?

For this to be efficient (in avoiding speculation) the insn
sequence would better not have any conditional jumps. I don't
think the compiler can be told so.

>> --- a/xen/include/asm-x86/uaccess.h
>> +++ b/xen/include/asm-x86/uaccess.h
>> @@ -13,13 +13,19 @@
>>  unsigned copy_to_user(void *to, const void *from, unsigned len);
>>  unsigned clear_user(void *to, unsigned len);
>>  unsigned copy_from_user(void *to, const void *from, unsigned len);
>> +
>>  /* Handles exceptions in both to and from, but doesn't do access_ok */
>> -unsigned __copy_to_user_ll(void __user*to, const void *from, unsigned n);
>> -unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n);
>> +unsigned int copy_to_guest_ll(void __user*to, const void *from, unsigned int n);
>> +unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n);
>> +unsigned int copy_to_unsafe_ll(void *to, const void *from, unsigned int n);
>> +unsigned int copy_from_unsafe_ll(void *to, const void *from, unsigned int n);
>>  
>>  extern long __get_user_bad(void);
>>  extern void __put_user_bad(void);
>>  
>> +#define UA_KEEP(args...) args
>> +#define UA_DROP(args...)
> 
> I assume UA means user access, and since you have dropped other uses
> of user and changed to guest instead I wonder if we should name this
> just A_{KEEP/DROP}.

Like in the name of the file I mean to see 'u' stand for "unsafe"
going forward. (A single letter name prefix would also seem more
prone to future collisions to me.)

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-02-10 16:55     ` Jan Beulich
@ 2021-02-11  8:11       ` Roger Pau Monné
  2021-02-11 11:28         ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-11  8:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Wed, Feb 10, 2021 at 05:55:52PM +0100, Jan Beulich wrote:
> On 09.02.2021 17:26, Roger Pau Monné wrote:
> > On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
> >> --- a/xen/arch/x86/usercopy.c
> >> +++ b/xen/arch/x86/usercopy.c
> >> @@ -10,12 +10,19 @@
> >>  #include <xen/sched.h>
> >>  #include <asm/uaccess.h>
> >>  
> >> -unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
> >> +#ifndef GUARD
> >> +# define GUARD UA_KEEP
> >> +#endif
> >> +
> >> +unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
> >>  {
> >>      unsigned dummy;
> >>  
> >>      stac();
> >>      asm volatile (
> >> +        GUARD(
> >> +        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
> > 
> > Don't you need to also take 'n' into account here to assert that the
> > address doesn't end in hypervisor address space? Or that's fine as
> > speculation wouldn't go that far?
> 
> Like elsewhere this leverages that the hypervisor VA range starts
> immediately after the non-canonical hole. I'm unaware of
> speculation being able to cross over that hole.
> 
> > I also wonder why this needs to be done in assembly, could you check
> > the address(es) using C?
> 
> For this to be efficient (in avoiding speculation) the insn
> sequence would better not have any conditional jumps. I don't
> think the compiler can be told so.

Why not use evaluate_nospec to check the address like we do in other
places?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-02-11  8:11       ` Roger Pau Monné
@ 2021-02-11 11:28         ` Jan Beulich
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-02-11 11:28 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 11.02.2021 09:11, Roger Pau Monné wrote:
> On Wed, Feb 10, 2021 at 05:55:52PM +0100, Jan Beulich wrote:
>> On 09.02.2021 17:26, Roger Pau Monné wrote:
>>> On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/usercopy.c
>>>> +++ b/xen/arch/x86/usercopy.c
>>>> @@ -10,12 +10,19 @@
>>>>  #include <xen/sched.h>
>>>>  #include <asm/uaccess.h>
>>>>  
>>>> -unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
>>>> +#ifndef GUARD
>>>> +# define GUARD UA_KEEP
>>>> +#endif
>>>> +
>>>> +unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
>>>>  {
>>>>      unsigned dummy;
>>>>  
>>>>      stac();
>>>>      asm volatile (
>>>> +        GUARD(
>>>> +        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
>>>
>>> Don't you need to also take 'n' into account here to assert that the
>>> address doesn't end in hypervisor address space? Or that's fine as
>>> speculation wouldn't go that far?
>>
>> Like elsewhere this leverages that the hypervisor VA range starts
>> immediately after the non-canonical hole. I'm unaware of
>> speculation being able to cross over that hole.
>>
>>> I also wonder why this needs to be done in assembly, could you check
>>> the address(es) using C?
>>
>> For this to be efficient (in avoiding speculation) the insn
>> sequence would better not have any conditional jumps. I don't
>> think the compiler can be told so.
> 
> Why not use evaluate_nospec to check the address like we do in other
> places?

Because LFENCE is far heavier than what we do here, which is
effectively a (distant) relative of array_index_nospec().
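
For reference, the sequence from the patch again, with the mask
derivation spelled out (comments only - the code itself is unchanged):

	mov	$(HYPERVISOR_VIRT_END - 1), \scratch1
	mov	$~0, \scratch2
	cmp	\ptr, \scratch1    /* CF set iff \ptr >= HYPERVISOR_VIRT_END */
	rcr	$1, \scratch2      /* rotate CF into bit 63:
	                              CF=1 -> mask = ~0 (leave \ptr alone),
	                              CF=0 -> mask = ~0 >> 1 (clear bit 63) */
	and	\scratch2, \ptr    /* Xen-range pointers turn non-canonical */

Much like array_index_nospec()'s mask selects between the index and 0,
the carry-derived mask here selects between the unmodified pointer and a
non-canonical one, without any branch for speculation to bypass.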

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-01-14 15:04 ` [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse Jan Beulich
  2021-02-09 16:26   ` Roger Pau Monné
@ 2021-02-12 10:41   ` Roger Pau Monné
  2021-02-12 12:48     ` Jan Beulich
  1 sibling, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-12 10:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
> Inspired by
> https://lore.kernel.org/lkml/f12e7d3cecf41b2c29734ea45a393be21d4a8058.1597848273.git.jpoimboe@redhat.com/
> and prior work in that area of x86 Linux, suppress speculation with
> guest specified pointer values by suitably masking the addresses to
> non-canonical space in case they fall into Xen's virtual address range.
> 
> Introduce a new Kconfig control.
> 
> Note that it is necessary in such code to avoid using "m" kind operands:
> If we didn't, there would be no guarantee that the register passed to
> guest_access_mask_ptr is also the (base) one used for the memory access.
> 
> As a minor unrelated change in get_unsafe_asm() the unnecessary "itype"
> parameter gets dropped and the XOR on the fixup path gets changed to be
> a 32-bit one in all cases: This way we avoid pointless REX.W or operand
> size overrides, or writes to partial registers.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> The insn sequence chosen is certainly up for discussion; I've picked
> this one despite the RCR because alternatives I could come up with,
> like
> 
> 	mov	$(HYPERVISOR_VIRT_END), %rax
> 	mov	$~0, %rdx
> 	mov	$0x7fffffffffffffff, %rcx
> 	cmp	%rax, %rdi
> 	cmovb	%rcx, %rdx
> 	and	%rdx, %rdi
> 
> weren't necessarily better: Either, as above, they are longer and
> require a 3rd scratch register, or they also utilize the carry flag in
> some similar way.
> ---
> Judging from the comment ahead of put_unsafe_asm() we might as well not
> tell gcc at all anymore about the memory access there, now that there's
> no use of the operand anymore in the assembly code.
> 
> --- a/xen/arch/x86/usercopy.c
> +++ b/xen/arch/x86/usercopy.c
> @@ -10,12 +10,19 @@
>  #include <xen/sched.h>
>  #include <asm/uaccess.h>
>  
> -unsigned __copy_to_user_ll(void __user *to, const void *from, unsigned n)
> +#ifndef GUARD
> +# define GUARD UA_KEEP
> +#endif
> +
> +unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n)
>  {
>      unsigned dummy;
>  
>      stac();
>      asm volatile (
> +        GUARD(
> +        "    guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n"
> +        )
>          "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
>          "    jbe  1f\n"
>          "    mov  %k[to], %[cnt]\n"
> @@ -42,6 +49,7 @@ unsigned __copy_to_user_ll(void __user *
>          _ASM_EXTABLE(1b, 2b)
>          : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
>            [aux] "=&r" (dummy)
> +          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
>          : "[aux]" (n)
>          : "memory" );
>      clac();
> @@ -49,12 +57,15 @@ unsigned __copy_to_user_ll(void __user *
>      return n;
>  }
>  
> -unsigned __copy_from_user_ll(void *to, const void __user *from, unsigned n)
> +unsigned int copy_from_guest_ll(void *to, const void __user *from, unsigned int n)
>  {
>      unsigned dummy;
>  
>      stac();
>      asm volatile (
> +        GUARD(
> +        "    guest_access_mask_ptr %[from], %q[scratch1], %q[scratch2]\n"
> +        )
>          "    cmp  $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n"
>          "    jbe  1f\n"
>          "    mov  %k[to], %[cnt]\n"
> @@ -87,6 +98,7 @@ unsigned __copy_from_user_ll(void *to, c
>          _ASM_EXTABLE(1b, 6b)
>          : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from),
>            [aux] "=&r" (dummy)
> +          GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy))
>          : "[aux]" (n)
>          : "memory" );
>      clac();
> @@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
>      return n;
>  }
>  
> +#if GUARD(1) + 0

Why do you need the '+ 0' here? I guess it's to prevent the
preprocessor from complaining when GUARD(1) gets replaced by nothing?

> +
>  /**
>   * copy_to_user: - Copy a block of data into user space.
>   * @to:   Destination address, in user space.
> @@ -128,8 +142,11 @@ unsigned clear_user(void __user *to, uns
>  {
>      if ( access_ok(to, n) )
>      {
> +        long dummy;
> +
>          stac();
>          asm volatile (
> +            "    guest_access_mask_ptr %[to], %[scratch1], %[scratch2]\n"
>              "0:  rep stos"__OS"\n"
>              "    mov  %[bytes], %[cnt]\n"
>              "1:  rep stosb\n"
> @@ -140,7 +157,8 @@ unsigned clear_user(void __user *to, uns
>              ".previous\n"
>              _ASM_EXTABLE(0b,3b)
>              _ASM_EXTABLE(1b,2b)
> -            : [cnt] "=&c" (n), [to] "+D" (to)
> +            : [cnt] "=&c" (n), [to] "+D" (to), [scratch1] "=&r" (dummy),
> +              [scratch2] "=&r" (dummy)
>              : [bytes] "r" (n & (BYTES_PER_LONG - 1)),
>                [longs] "0" (n / BYTES_PER_LONG), "a" (0) );
>          clac();
> @@ -174,6 +192,16 @@ unsigned copy_from_user(void *to, const
>      return n;
>  }
>  
> +# undef GUARD
> +# define GUARD UA_DROP
> +# define copy_to_guest_ll copy_to_unsafe_ll
> +# define copy_from_guest_ll copy_from_unsafe_ll
> +# undef __user
> +# define __user
> +# include __FILE__
> +
> +#endif /* GUARD(1) */
> +
>  /*
>   * Local variables:
>   * mode: C
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -446,6 +446,8 @@ UNLIKELY_START(g, create_bounce_frame_ba
>          jmp   asm_domain_crash_synchronous  /* Does not return */
>  __UNLIKELY_END(create_bounce_frame_bad_sp)
>  
> +        guest_access_mask_ptr %rsi, %rax, %rcx
> +
>  #define STORE_GUEST_STACK(reg, n) \
>  0:      movq  %reg,(n)*8(%rsi); \
>          _ASM_EXTABLE(0b, domain_crash_page_fault_ ## n ## x8)
> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -114,6 +114,24 @@ config SPECULATIVE_HARDEN_BRANCH
>  
>  	  If unsure, say Y.
>  
> +config SPECULATIVE_HARDEN_GUEST_ACCESS
> +	bool "Speculative PV Guest Memory Access Hardening"
> +	default y
> +	depends on PV
> +	help
> +	  Contemporary processors may use speculative execution as a
> +	  performance optimisation, but this can potentially be abused by an
> +	  attacker to leak data via speculative sidechannels.
> +
> +	  One source of data leakage is via speculative accesses to hypervisor
> +	  memory through guest controlled values used to access guest memory.
> +
> +	  When enabled, code paths accessing PV guest memory will have guest
> +	  controlled addresses massaged such that memory accesses through them
> +	  won't touch hypervisor address space.
> +
> +	  If unsure, say Y.
> +
>  endmenu
>  
>  config HYPFS
> --- a/xen/include/asm-x86/asm-defns.h
> +++ b/xen/include/asm-x86/asm-defns.h
> @@ -44,3 +44,16 @@
>  .macro INDIRECT_JMP arg:req
>      INDIRECT_BRANCH jmp \arg
>  .endm
> +
> +.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
> +#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
> +    mov $(HYPERVISOR_VIRT_END - 1), \scratch1
> +    mov $~0, \scratch2
> +    cmp \ptr, \scratch1
> +    rcr $1, \scratch2
> +    and \scratch2, \ptr

If my understanding is correct, that's equivalent to:

ptr &= ~0ull >> (ptr < HYPERVISOR_VIRT_END);

It might be helpful to add this as a comment, to clarify the intended
functionality of the assembly bit.

I wonder if the C code above can generate any jumps? As you pointed
out, we already use something similar in array_index_mask_nospec and
that's fine to do in C.
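
To spell out the flag handling behind that equivalence in C (purely
illustrative: the function name is made up, a 64-bit unsigned long is
assumed, and it deliberately uses the very comparison the asm is written to
avoid):

    unsigned long guest_access_mask_sketch(unsigned long ptr)
    {
        /*
         * "cmp \ptr, \scratch1" computes (HYPERVISOR_VIRT_END - 1) - ptr,
         * so CF ends up set exactly when ptr >= HYPERVISOR_VIRT_END.
         */
        unsigned long cf = ptr >= HYPERVISOR_VIRT_END;
        /* "rcr $1" then rotates CF into the top bit of the all-ones scratch2. */
        unsigned long mask = (cf << 63) | (~0UL >> 1);
        /*
         * Result: addresses above the Xen range keep all their bits, while
         * anything below HYPERVISOR_VIRT_END has bit 63 cleared, pushing
         * Xen-range addresses into the non-canonical hole.
         */
        return ptr & mask;
    }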

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-02-12 10:41   ` Roger Pau Monné
@ 2021-02-12 12:48     ` Jan Beulich
  2021-02-12 13:02       ` Roger Pau Monné
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Beulich @ 2021-02-12 12:48 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 12.02.2021 11:41, Roger Pau Monné wrote:
> On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
>> @@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
>>      return n;
>>  }
>>  
>> +#if GUARD(1) + 0
> 
> Why do you need the '+ 0' here? I guess it's to prevent the
> preprocessor from complaining when GUARD(1) gets replaced by nothing?

Yes. "#if" with nothing after it is an error from all I know.

>> --- a/xen/include/asm-x86/asm-defns.h
>> +++ b/xen/include/asm-x86/asm-defns.h
>> @@ -44,3 +44,16 @@
>>  .macro INDIRECT_JMP arg:req
>>      INDIRECT_BRANCH jmp \arg
>>  .endm
>> +
>> +.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
>> +#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
>> +    mov $(HYPERVISOR_VIRT_END - 1), \scratch1
>> +    mov $~0, \scratch2
>> +    cmp \ptr, \scratch1
>> +    rcr $1, \scratch2
>> +    and \scratch2, \ptr
> 
> If my understanding is correct, that's equivalent to:
> 
> ptr &= ~0ull >> (ptr < HYPERVISOR_VIRT_END);
> 
> > It might be helpful to add this as a comment, to clarify the intended
> > functionality of the assembly bit.
> 
> I wonder if the C code above can generate any jumps? As you pointed
> out, we already use something similar in array_index_mask_nospec and
> that's fine to do in C.

Note how array_index_mask_nospec() gets away without any use of
relational operators. They're what poses the risk of getting
translated to branches. (Quite likely the compiler wouldn't use
any in the case here, as the code can easily get away without,
but we don't want to chance it. Afaict it would instead use a
3rd scratch register, so register pressure might still lead to
using a branch instead in some exceptional case.)
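
For reference, the generic Linux-derived form I have in mind is roughly the
below (treat the exact definition as an assumption rather than a quote of
Xen's header):

    static inline unsigned long array_index_mask_nospec(unsigned long index,
                                                        unsigned long size)
    {
        /*
         * For the value ranges of interest the sign bit of
         * (index | (size - index - 1)) is clear exactly when index < size;
         * complementing and arithmetically shifting it across the word
         * yields all-ones (index in range) or zero (out of range), without
         * any relational operator.
         */
        return ~(long)(index | (size - index - 1)) >> (BITS_PER_LONG - 1);
    }

A relational spelling such as "index < size ? ~0UL : 0" means the same
thing, but leaves the compiler free to turn it into a conditional branch.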

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-02-12 12:48     ` Jan Beulich
@ 2021-02-12 13:02       ` Roger Pau Monné
  2021-02-12 13:15         ` Jan Beulich
  0 siblings, 1 reply; 46+ messages in thread
From: Roger Pau Monné @ 2021-02-12 13:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On Fri, Feb 12, 2021 at 01:48:43PM +0100, Jan Beulich wrote:
> On 12.02.2021 11:41, Roger Pau Monné wrote:
> > On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
> >> @@ -94,6 +106,8 @@ unsigned __copy_from_user_ll(void *to, c
> >>      return n;
> >>  }
> >>  
> >> +#if GUARD(1) + 0
> > 
> > Why do you need the '+ 0' here? I guess it's to prevent the
> > preprocessor from complaining when GUARD(1) gets replaced by nothing?
> 
> Yes. "#if" with nothing after it is an error from all I know.
> 
> >> --- a/xen/include/asm-x86/asm-defns.h
> >> +++ b/xen/include/asm-x86/asm-defns.h
> >> @@ -44,3 +44,16 @@
> >>  .macro INDIRECT_JMP arg:req
> >>      INDIRECT_BRANCH jmp \arg
> >>  .endm
> >> +
> >> +.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
> >> +#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
> >> +    mov $(HYPERVISOR_VIRT_END - 1), \scratch1
> >> +    mov $~0, \scratch2
> >> +    cmp \ptr, \scratch1
> >> +    rcr $1, \scratch2
> >> +    and \scratch2, \ptr
> > 
> > If my understanding is correct, that's equivalent to:
> > 
> > ptr &= ~0ull >> (ptr < HYPERVISOR_VIRT_END);
> > 
> > It might be helpful to add this as a comment, to clarify the intended
> > functionality of the assembly bit.
> > 
> > I wonder if the C code above can generate any jumps? As you pointed
> > out, we already use something similar in array_index_mask_nospec and
> > that's fine to do in C.
> 
> Note how array_index_mask_nospec() gets away without any use of
> relational operators. They're what poses the risk of getting
> translated to branches. (Quite likely the compiler wouldn't use
> any in the case here, as the code can easily get away without,
> but we don't want to chance it. Afaict it would instead use a
> 3rd scratch register, so register pressure might still lead to
> using a branch instead in some exceptional case.)

I see, it's not easy to build such a construct without using any
relational operator. Would you mind adding the C equivalent next to
the assembly chunk?

I don't think I have further comments:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse
  2021-02-12 13:02       ` Roger Pau Monné
@ 2021-02-12 13:15         ` Jan Beulich
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Beulich @ 2021-02-12 13:15 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan, George Dunlap

On 12.02.2021 14:02, Roger Pau Monné wrote:
> On Fri, Feb 12, 2021 at 01:48:43PM +0100, Jan Beulich wrote:
>> On 12.02.2021 11:41, Roger Pau Monné wrote:
>>> On Thu, Jan 14, 2021 at 04:04:57PM +0100, Jan Beulich wrote:
>>>> --- a/xen/include/asm-x86/asm-defns.h
>>>> +++ b/xen/include/asm-x86/asm-defns.h
>>>> @@ -44,3 +44,16 @@
>>>>  .macro INDIRECT_JMP arg:req
>>>>      INDIRECT_BRANCH jmp \arg
>>>>  .endm
>>>> +
>>>> +.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
>>>> +#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
>>>> +    mov $(HYPERVISOR_VIRT_END - 1), \scratch1
>>>> +    mov $~0, \scratch2
>>>> +    cmp \ptr, \scratch1
>>>> +    rcr $1, \scratch2
>>>> +    and \scratch2, \ptr
>>>
>>> If my understanding is correct, that's equivalent to:
>>>
>>> ptr &= ~0ull >> (ptr < HYPERVISOR_VIRT_END);
>>>
>>> It might be helpful to add this as a comment, to clarify the intended
>>> functionality of the assembly bit.
>>>
>>> I wonder if the C code above can generate any jumps? As you pointed
>>> out, we already use something similar in array_index_mask_nospec and
>>> that's fine to do in C.
>>
>> Note how array_index_mask_nospec() gets away without any use of
>> relational operators. They're what poses the risk of getting
>> translated to branches. (Quite likely the compiler wouldn't use
>> any in the case here, as the code can easily get away without,
>> but we don't want to chance it. Afaict it would instead use a
>> 3rd scratch register, so register pressure might still lead to
>> using a branch instead in some exceptional case.)
> 
> I see, it's not easy to build such a construct without using any
> relational operator. Would you mind adding the C equivalent next to
> the assembly chunk?

Sure:

    /*
     * Here we want
     *
     * ptr &= ~0ull >> (ptr < HYPERVISOR_VIRT_END);
     *
     * but guaranteed without any conditional branches (hence in assembly).
     */

> I don't think I have further comments:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks much!

Jan


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2021-02-12 13:15 UTC | newest]

Thread overview: 46+ messages
2021-01-14 15:01 [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Jan Beulich
2021-01-14 15:03 ` [PATCH 01/17] x86/shadow: use __put_user() instead of __copy_to_user() Jan Beulich
2021-01-14 15:04 ` [PATCH 02/17] x86: split __{get,put}_user() into "guest" and "unsafe" variants Jan Beulich
2021-02-05 15:43   ` Roger Pau Monné
2021-02-05 16:13     ` Jan Beulich
2021-02-05 16:18       ` Roger Pau Monné
2021-02-05 16:26         ` Jan Beulich
2021-02-09 13:07           ` Roger Pau Monné
2021-02-09 13:15             ` Jan Beulich
2021-02-09 14:46               ` Roger Pau Monné
2021-02-09 14:57                 ` Jan Beulich
2021-02-09 15:23                   ` Roger Pau Monné
2021-02-09 14:55   ` Roger Pau Monné
2021-02-09 15:14     ` Jan Beulich
2021-02-09 15:27       ` Roger Pau Monné
2021-01-14 15:04 ` [PATCH 03/17] x86: split __copy_{from,to}_user() " Jan Beulich
2021-02-09 16:06   ` Roger Pau Monné
2021-02-09 17:03     ` Jan Beulich
2021-01-14 15:04 ` [PATCH 04/17] x86/PV: harden guest memory accesses against speculative abuse Jan Beulich
2021-02-09 16:26   ` Roger Pau Monné
2021-02-10 16:55     ` Jan Beulich
2021-02-11  8:11       ` Roger Pau Monné
2021-02-11 11:28         ` Jan Beulich
2021-02-12 10:41   ` Roger Pau Monné
2021-02-12 12:48     ` Jan Beulich
2021-02-12 13:02       ` Roger Pau Monné
2021-02-12 13:15         ` Jan Beulich
2021-01-14 15:05 ` [PATCH 05/17] x86: rename {get,put}_user() to {get,put}_guest() Jan Beulich
2021-01-14 15:05 ` [PATCH 06/17] x86/gdbsx: convert "user" to "guest" accesses Jan Beulich
2021-01-14 15:06 ` [PATCH 07/17] x86: rename copy_{from,to}_user() to copy_{from,to}_guest_pv() Jan Beulich
2021-01-14 15:07 ` [PATCH 08/17] x86: move stac()/clac() from {get,put}_unsafe_asm() Jan Beulich
2021-01-14 15:07 ` [PATCH 09/17] x86/PV: use get_unsafe() instead of copy_from_unsafe() Jan Beulich
2021-01-14 15:08 ` [PATCH 10/17] x86/shadow: " Jan Beulich
2021-01-14 15:08 ` [PATCH 11/17] x86/shadow: polish shadow_write_entries() Jan Beulich
2021-01-14 15:09 ` [PATCH 12/17] x86/shadow: move shadow_set_l<N>e() to their own source file Jan Beulich
2021-01-14 15:09 ` [PATCH 13/17] x86/shadow: don't open-code SHF_* shorthands Jan Beulich
2021-01-14 15:10 ` [PATCH 14/17] x86/shadow: SH_type_l2h_shadow is PV-only Jan Beulich
2021-01-14 15:10 ` [PATCH 15/17] x86/shadow: drop SH_type_l2h_pae_shadow Jan Beulich
2021-01-22 13:11   ` Tim Deegan
2021-01-22 16:31     ` Jan Beulich
2021-01-22 20:02       ` Tim Deegan
2021-01-25 11:09         ` Jan Beulich
2021-01-25 11:33         ` Jan Beulich
2021-01-14 15:10 ` [PATCH 16/17] x86/shadow: only 4-level guest code needs building when !HVM Jan Beulich
2021-01-14 15:11 ` [PATCH 17/17] x86/shadow: adjust is_pv_*() checks Jan Beulich
2021-01-22 13:18 ` [PATCH 00/17] x86/PV: avoid speculation abuse through guest accessors plus Tim Deegan
