* [PATCH 0/2][4.15?] x86/shadow: further refinements to "fast fault path" suppression
@ 2021-03-05 15:36 Jan Beulich
  2021-03-05 15:37 ` [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized Jan Beulich
  2021-03-05 15:37 ` [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries Jan Beulich
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-05 15:36 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

Andrew points out that 'x86/shadow: suppress "fast fault path"
optimization without reserved bits' assumes firm knowledge of the
physical machine's address width. When we run virtualized
ourselves, we can't reasonably assume we have that knowledge, all the
more so since the property may change if we get migrated.

Since we want to further refine this logic anyway, I thought I'd
also include the other change that I've previously mentioned, at
least for consideration of whether to take it for 4.15. This reduces
the performance impact (albeit perhaps only for very large or
exotic guests) that the earlier patch has.

1: suppress "fast fault path" optimization when running virtualized
2: encode full GFN in magic MMIO entries

Jan



* [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 15:36 [PATCH 0/2][4.15?] x86/shadow: further refinements to "fast fault path" suppression Jan Beulich
@ 2021-03-05 15:37 ` Jan Beulich
  2021-03-05 15:47   ` Andrew Cooper
  2021-03-08  9:25   ` Tim Deegan
  2021-03-05 15:37 ` [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries Jan Beulich
  1 sibling, 2 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-05 15:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

We can't make correctness of our own behavior dependent upon a
hypervisor underneath us correctly telling us the true physical address
width the hardware uses. Without knowing this, we can't be certain reserved
bit faults can actually be observed. Therefore, besides evaluating the
number of address bits when deciding whether to use the optimization,
also check whether we're running virtualized ourselves.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -282,10 +282,16 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
  *
  * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
  * have reserved bits that we can use for this.  And even there it can only
- * be used if the processor doesn't use all 52 address bits.
+ * be used if we can be certain the processor doesn't use all 52 address bits.
  */
 
 #define SH_L1E_MAGIC 0xffffffff00000001ULL
+
+static inline bool sh_have_pte_rsvd_bits(void)
+{
+    return paddr_bits < PADDR_BITS && !cpu_has_hypervisor;
+}
+
 static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
 {
     return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
@@ -303,7 +309,7 @@ static inline shadow_l1e_t sh_l1e_gnp(vo
      * On systems with no reserved physical address bits we can't engage the
      * fast fault path.
      */
-    return paddr_bits < PADDR_BITS ? sh_l1e_gnp_raw()
+    return sh_have_pte_rsvd_bits() ? sh_l1e_gnp_raw()
                                    : shadow_l1e_empty();
 }
 
@@ -326,7 +332,7 @@ static inline shadow_l1e_t sh_l1e_mmio(g
 {
     unsigned long gfn_val = MASK_INSR(gfn_x(gfn), SH_L1E_MMIO_GFN_MASK);
 
-    if ( paddr_bits >= PADDR_BITS ||
+    if ( !sh_have_pte_rsvd_bits() ||
          gfn_x(gfn) != MASK_EXTR(gfn_val, SH_L1E_MMIO_GFN_MASK) )
         return shadow_l1e_empty();
 

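For reference, the two inputs the new sh_have_pte_rsvd_bits() predicate
consumes both come from CPUID: paddr_bits reflects leaf 0x80000008
EAX[7:0], and cpu_has_hypervisor reflects leaf 1 ECX bit 31. A minimal
standalone sketch of those probes (the helper names here are
illustrative, not Xen's):

    #include <stdbool.h>
    #include <cpuid.h>

    #define PADDR_BITS 52 /* architectural limit for 4-level paging */

    /* Physical address width: CPUID leaf 0x80000008, EAX[7:0]. */
    static unsigned int probe_paddr_bits(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if ( !__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx) )
            return 36; /* fallback if the extended leaf is absent */
        return eax & 0xff;
    }

    /* Running under a hypervisor: CPUID leaf 1, ECX bit 31. */
    static bool probe_hypervisor(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __get_cpuid(1, &eax, &ebx, &ecx, &edx);
        return ecx & (1u << 31);
    }

    static bool have_pte_rsvd_bits(void)
    {
        return probe_paddr_bits() < PADDR_BITS && !probe_hypervisor();
    }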



* [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries
  2021-03-05 15:36 [PATCH 0/2][4.15?] x86/shadow: further refinements to "fast fault path" suppression Jan Beulich
  2021-03-05 15:37 ` [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized Jan Beulich
@ 2021-03-05 15:37 ` Jan Beulich
  2021-03-05 16:32   ` Andrew Cooper
  2021-03-08  9:39   ` Tim Deegan
  1 sibling, 2 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-05 15:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

Since we don't need to encode all of the PTE flags, we have enough bits
in the shadow entry to store the full GFN. Don't use literal numbers -
instead derive the involved values. Or, where derivation would become
too ugly, sanity-check the result (invoking #error to identify failure).

This then allows dropping from sh_l1e_mmio() again the guarding against
too large GFNs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder if the respective check in sh_audit_l1_table() is actually
useful to retain with these changes.

--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
  * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
  * have reserved bits that we can use for this.  And even there it can only
  * be used if we can be certain the processor doesn't use all 52 address bits.
+ *
+ * For the MMIO encoding (see below) we need the bottom 4 bits for
+ * identifying the kind of entry and a full GFN's worth of bits to encode
+ * the originating frame number.  Set all remaining bits to trigger
+ * reserved bit faults, if (see above) the hardware permits triggering such.
  */
 
-#define SH_L1E_MAGIC 0xffffffff00000001ULL
+#define SH_L1E_MAGIC_NR_META_BITS 4
+#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
+                                      SH_L1E_MAGIC_NR_META_BITS)) | \
+                           _PAGE_PRESENT)
 
 static inline bool sh_have_pte_rsvd_bits(void)
 {
@@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
 
 static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
 {
-    return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
+    BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
+    return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
 }
 
 /* Guest not present: a single magic value */
@@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
 
 /*
  * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
- * We store 28 bits of GFN in bits 4:32 of the entry.
+ * We store the GFN in bits 4:43 of the entry.
  * The present bit is set, and the U/S and R/W bits are taken from the guest.
  * Bit 3 is always 0, to differentiate from gnp above.
  */
-#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
-#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
-#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
+#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
+#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
+#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
+# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
+#endif
+#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
+# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
+#endif
+#define SH_L1E_MMIO_MAGIC_MASK  (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
+#define SH_L1E_MMIO_GFN_MASK    ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
 
 static inline shadow_l1e_t sh_l1e_mmio(gfn_t gfn, u32 gflags)
 {
     unsigned long gfn_val = MASK_INSR(gfn_x(gfn), SH_L1E_MMIO_GFN_MASK);
 
-    if ( !sh_have_pte_rsvd_bits() ||
-         gfn_x(gfn) != MASK_EXTR(gfn_val, SH_L1E_MMIO_GFN_MASK) )
+    if ( !sh_have_pte_rsvd_bits() )
         return shadow_l1e_empty();
 
     return (shadow_l1e_t) { (SH_L1E_MMIO_MAGIC | gfn_val |


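To illustrate how a full GFN now round-trips through the derived mask,
here is a self-contained sketch using Xen's MASK_INSR()/MASK_EXTR()
idiom (the two macro definitions are copied from xen/include/xen/lib.h;
the sample GFN value is arbitrary):

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
    #define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

    #define PADDR_BITS    52
    #define PAGE_SHIFT    12
    #define _PAGE_PRESENT 0x001ULL
    #define _PAGE_RW      0x002ULL
    #define _PAGE_USER    0x004ULL

    #define SH_L1E_MAGIC_NR_META_BITS 4
    #define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
                                          SH_L1E_MAGIC_NR_META_BITS)) | \
                               _PAGE_PRESENT)
    #define SH_L1E_MMIO_MAGIC_BIT  0x8ULL
    #define SH_L1E_MMIO_MAGIC_MASK (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
    #define SH_L1E_MMIO_GFN_MASK   ~(SH_L1E_MMIO_MAGIC_MASK | \
                                     _PAGE_RW | _PAGE_USER)

    int main(void)
    {
        uint64_t gfn = 0x123456789aULL; /* 37-bit GFN: fits in bits 4:43 */
        uint64_t sl1e = SH_L1E_MAGIC_MASK |
                        MASK_INSR(gfn, SH_L1E_MMIO_GFN_MASK);

        /* Decoding recovers the full GFN; the old 28-bit limit is gone. */
        assert(MASK_EXTR(sl1e, SH_L1E_MMIO_GFN_MASK) == gfn);
        printf("entry %#018" PRIx64 " -> gfn %#" PRIx64 "\n",
               sl1e, MASK_EXTR(sl1e, SH_L1E_MMIO_GFN_MASK));
        return 0;
    }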


* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 15:37 ` [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized Jan Beulich
@ 2021-03-05 15:47   ` Andrew Cooper
  2021-03-05 16:32     ` Jan Beulich
  2021-03-05 16:40     ` Ian Jackson
  2021-03-08  9:25   ` Tim Deegan
  1 sibling, 2 replies; 19+ messages in thread
From: Andrew Cooper @ 2021-03-05 15:47 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Tim Deegan, George Dunlap, Wei Liu, Roger Pau Monné, Ian Jackson

On 05/03/2021 15:37, Jan Beulich wrote:
> We can't make correctness of our own behavior dependent upon a
> hypervisor underneath us correctly telling us the true physical address
> width the hardware uses. Without knowing this, we can't be certain reserved
> bit faults can actually be observed. Therefore, besides evaluating the
> number of address bits when deciding whether to use the optimization,
> also check whether we're running virtualized ourselves.

I think it would be helpful to point out why we can't even probe at boot
- the behaviour may genuinely change as we migrate, and if we ever end
up on an IceLake system levelled down for compatibility with older CPUs,
the "paddr_bits < PADDR_BITS" check will malfunction in an unsafe way.

>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This wants backporting to stable releases, so I would recommend for 4.15
even at this point.

~Andrew




* Re: [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries
  2021-03-05 15:37 ` [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries Jan Beulich
@ 2021-03-05 16:32   ` Andrew Cooper
  2021-03-08 12:42     ` Jan Beulich
  2021-03-08  9:39   ` Tim Deegan
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2021-03-05 16:32 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Tim Deegan, George Dunlap, Wei Liu, Roger Pau Monné, Ian Jackson

On 05/03/2021 15:37, Jan Beulich wrote:
> Since we don't need to encode all of the PTE flags, we have enough bits
> in the shadow entry to store the full GFN. Don't use literal numbers -
> instead derive the involved values. Or, where derivation would become
> too ugly, sanity-check the result (invoking #error to identify failure).
>
> This then allows dropping from sh_l1e_mmio() again the guarding against
> too large GFNs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I wonder if the respective check in sh_audit_l1_table() is actually
> useful to retain with these changes.
>
> --- a/xen/arch/x86/mm/shadow/types.h
> +++ b/xen/arch/x86/mm/shadow/types.h
> @@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
>   * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
>   * have reserved bits that we can use for this.  And even there it can only
>   * be used if we can be certain the processor doesn't use all 52 address bits.
> + *
> + * For the MMIO encoding (see below) we need the bottom 4 bits for
> + * identifying the kind of entry and a full GFN's worth of bits to encode
> + * the originating frame number.  Set all remaining bits to trigger
> + * reserved bit faults, if (see above) the hardware permits triggering such.
>   */
>  
> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
> +#define SH_L1E_MAGIC_NR_META_BITS 4
> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
> +                           _PAGE_PRESENT)
>  
>  static inline bool sh_have_pte_rsvd_bits(void)
>  {
> @@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
>  
>  static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
>  {
> -    return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
> +    BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
> +    return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
>  }
>  
>  /* Guest not present: a single magic value */
> @@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
>  
>  /*
>   * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
> - * We store 28 bits of GFN in bits 4:32 of the entry.
> + * We store the GFN in bits 4:43 of the entry.
>   * The present bit is set, and the U/S and R/W bits are taken from the guest.
>   * Bit 3 is always 0, to differentiate from gnp above.
>   */
> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
> +#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
> +# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
> +#endif
> +#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
> +# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
> +#endif
> +#define SH_L1E_MMIO_MAGIC_MASK  (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
> +#define SH_L1E_MMIO_GFN_MASK    ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)

In practice, it is 4:36, because we have to limit shadow guests to 32
bits of gfn for XSA-173 (width of the superpage backpointer IIRC).

Also, this property is important for L1TF.  The more guest-controllable
bits we permit in here, the greater the chance of being vulnerable to
L1TF on massive machines.

(I'm a little concerned that I can't spot an L1TF comment which has been
made stale by these changes...  Probably the fault of whichever numpty
prepared the L1TF patches, because I'm certain we discussed this at the
time)

Going from 32 to 36 bits moves the upper safety barrier from TOP-4G to
TOP-64G, but I recall us agreeing that that was OK, especially as the
main safety guesstimate is "no RAM in the top quarter of the address
space".

However, I don't think we want to accidentally creep beyond bit 36, so
I'd suggest that the easy fix here is just adjusting a nibble in the
MMIO masks.

~Andrew




* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 15:47   ` Andrew Cooper
@ 2021-03-05 16:32     ` Jan Beulich
  2021-03-05 16:40     ` Ian Jackson
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-05 16:32 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tim Deegan, George Dunlap, Wei Liu, Roger Pau Monné,
	Ian Jackson, xen-devel

On 05.03.2021 16:47, Andrew Cooper wrote:
> On 05/03/2021 15:37, Jan Beulich wrote:
>> We can't make correctness of our own behavior dependent upon a
>> hypervisor underneath us correctly telling us the true physical address
>> width the hardware uses. Without knowing this, we can't be certain reserved
>> bit faults can actually be observed. Therefore, besides evaluating the
>> number of address bits when deciding whether to use the optimization,
>> also check whether we're running virtualized ourselves.
> 
> I think it would be helpful to point out why we can't even probe at boot
> - the behaviour may genuinely change as we migrate, and if we ever end
> up on an IceLake system levelled down for compatibility with older CPUs,
> the "paddr_bits < PADDR_BITS" check will malfunction in an unsafe way.

I've added a sentence to this effect.

>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Jan



* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 15:47   ` Andrew Cooper
  2021-03-05 16:32     ` Jan Beulich
@ 2021-03-05 16:40     ` Ian Jackson
  2021-03-05 16:47       ` Andrew Cooper
  1 sibling, 1 reply; 19+ messages in thread
From: Ian Jackson @ 2021-03-05 16:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, xen-devel, Tim Deegan, George Dunlap, Wei Liu,
	Roger Pau Monné

Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
> This wants backporting to stable releases, so I would recommend for 4.15
> even at this point.

Can someone explain to me the implications of not taking these patches,
and the risks of taking them ?

AFAICT the implications of not taking 1/ are that we would misbehave
in a security relevant way, sometimes, when we are running under
another hypervisor ?

And the implications of not taking 2/ is a performance problem ?

As to the risks, 1/ looks obviously correct even to me.

2/ seems complex.  What would go wrong if there were a misplaced ) or
confused bit-twiddling or something ?

Ian.



* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 16:40     ` Ian Jackson
@ 2021-03-05 16:47       ` Andrew Cooper
  2021-03-05 16:58         ` Ian Jackson
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2021-03-05 16:47 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Jan Beulich, xen-devel, Tim Deegan, George Dunlap, Wei Liu,
	Roger Pau Monné

On 05/03/2021 16:40, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
>> This wants backporting to stable releases, so I would recommend for 4.15
>> even at this point.
> Can someone explain to me the implications of not taking these patches,
> and the risks of taking them ?
>
> AFAICT the implications of not taking 1/ are that we would misbehave
> in a security relevant way, sometimes, when we are running under
> another hypervisor ?

Correct.  Specifically if you've got a migration pool containing an
IceLake server and something older.

> And the implications of not taking 2/ is a performance problem ?

Correct (I believe).

> As to the risks, 1/ looks obviously correct even to me.

I agree, although Tim has the deciding maintainer vote.

> 2/ seems complex.  What would go wrong if there were a misplaced ) or
> confused bit-twiddling or something ?

The bit twiddling can be independently checked by disassembling the binary.

However, I have some concerns with the patch as-is, in relation to L1TF
/ XSA-273.

~Andrew




* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 16:47       ` Andrew Cooper
@ 2021-03-05 16:58         ` Ian Jackson
  0 siblings, 0 replies; 19+ messages in thread
From: Ian Jackson @ 2021-03-05 16:58 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, xen-devel, Tim Deegan, George Dunlap, Wei Liu,
	Roger Pau Monné

Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
> On 05/03/2021 16:40, Ian Jackson wrote:
> > Andrew Cooper writes ("Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized"):
> >> This wants backporting to stable releases, so I would recommend for 4.15
> >> even at this point.
> > Can someone explain to me the implications of not taking these patches,
> > and the risks of taking them ?
> >
> > AFAICT the implications of not taking 1/ are that we would misbehave
> > in a security relevant way, sometimes, when we are running under
> > another hypervisor ?
> 
> Correct.  Specifically if you've got a migration pool containing an
> IceLake server and something older.
> 
> > As to the risks, 1/ looks obviously correct even to me.
> 
> I agree, although Tim has the deciding maintainer vote.

Right, well, for patch 1 then

Release-Acked-by: Ian Jackson <iwj@xenproject.org>

> > And the implications of not taking 2/ is a performance problem ?
> 
> Correct (I believe).
> 
> > 2/ seems complex.  What would go wrong if there were a misplaced ) or
> > confused bit-twiddling or something ?
> 
> The bit twiddling can be independently checked by disassembling the binary.
> 
> However, I have some concerns with the patch as-is, in relation to L1TF
> / XSA-273.

I'm going to hold off on this for now.  I think to give it a
release-ack I would want someone to argue the case.  Concerns would
include Andy's comments (which I saw earlier but do not fully
understand) and me wanting to know (i) how bad is the perf impact
without it (ii) how has this bit-twiddling been checked.

I hope that makes sense.

Thanks,
Ian.



* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-05 15:37 ` [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized Jan Beulich
  2021-03-05 15:47   ` Andrew Cooper
@ 2021-03-08  9:25   ` Tim Deegan
  2021-03-08  9:40     ` Jan Beulich
  2021-03-08 13:47     ` Andrew Cooper
  1 sibling, 2 replies; 19+ messages in thread
From: Tim Deegan @ 2021-03-08  9:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
> We can't make correctness of our own behavior dependent upon a
> hypervisor underneath us correctly telling us the true physical address
> width the hardware uses. Without knowing this, we can't be certain reserved
> bit faults can actually be observed. Therefore, besides evaluating the
> number of address bits when deciding whether to use the optimization,
> also check whether we're running virtualized ourselves.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Tim Deegan <tim@xen.org>

I would consider this to be a bug in the underlying hypervisor, but I
agree that in practice it won't be safe to rely on it being correct.

These checks are getting fiddly now.  I think that if we end up adding
any more to them it might be good to set a read-mostly variable at boot
time rather than do them on every MMIO/NP fault.
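A minimal sketch of that idea (the variable and function names are
assumed, not existing Xen symbols):

    /* Hypothetical: evaluate the conditions once at boot rather than
     * in each sh_l1e_gnp()/sh_l1e_mmio() invocation. */
    static bool __read_mostly sh_fast_fault_path_ok;

    static int __init sh_fast_fault_path_init(void)
    {
        sh_fast_fault_path_ok = paddr_bits < PADDR_BITS &&
                                !cpu_has_hypervisor;
        return 0;
    }
    __initcall(sh_fast_fault_path_init);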

Cheers,

Tim.



* Re: [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries
  2021-03-05 15:37 ` [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries Jan Beulich
  2021-03-05 16:32   ` Andrew Cooper
@ 2021-03-08  9:39   ` Tim Deegan
  2021-03-08 12:05     ` Jan Beulich
  1 sibling, 1 reply; 19+ messages in thread
From: Tim Deegan @ 2021-03-08  9:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

At 16:37 +0100 on 05 Mar (1614962265), Jan Beulich wrote:
> Since we don't need to encode all of the PTE flags, we have enough bits
> in the shadow entry to store the full GFN. Don't use literal numbers -
> instead derive the involved values. Or, where derivation would become
> too ugly, sanity-check the result (invoking #error to identify failure).
> 
> This then allows dropping from sh_l1e_mmio() again the guarding against
> too large GFNs.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Tim Deegan <tim@xen.org>

> I wonder if the respective check in sh_audit_l1_table() is actually
> useful to retain with these changes.

Yes, I think so.  We care about these PTEs being bogus for any reason,
not just the ones that this could address.

> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
> +#define SH_L1E_MAGIC_NR_META_BITS 4
> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
> +                           _PAGE_PRESENT)

I don't think this makes the code any more readable, TBH, but if you
prefer it that's OK.  I'd be happier with it if you added a
BUILD_BUG_ON that checks that 1ULL << (PADDR_BITS - 1) is set in the
mask, since that's the main thing we care about.
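Spelled out, the suggested check would look something like this
(placement inside sh_l1e_is_magic() is an assumption):

    BUILD_BUG_ON(!(SH_L1E_MAGIC_MASK & (1ULL << (PADDR_BITS - 1))));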

> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)

IMO this would be more readable as a straight 0x8 (or even _PAGE_PWT).
The ack stands either way.

Cheers,

Tim.




* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-08  9:25   ` Tim Deegan
@ 2021-03-08  9:40     ` Jan Beulich
  2021-03-08 13:47     ` Andrew Cooper
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-08  9:40 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

On 08.03.2021 10:25, Tim Deegan wrote:
> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>> We can't make correctness of our own behavior dependent upon a
>> hypervisor underneath us correctly telling us the true physical address
>> width the hardware uses. Without knowing this, we can't be certain reserved
>> bit faults can actually be observed. Therefore, besides evaluating the
>> number of address bits when deciding whether to use the optimization,
>> also check whether we're running virtualized ourselves.
>>
>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Tim Deegan <tim@xen.org>

Thanks.

> I would consider this to be a bug in the underlying hypervisor, but I
> agree that in practice it won't be safe to rely on it being correct.

Suffice it to say that I don't think we present a correct value to
our guests. Plus, as said elsewhere, what would you suggest to hand
to the guest in case it may need migrating (to a host with a
different number of PA bits)?

> These checks are getting fiddly now.  I think that if we end up adding
> any more to them it might be good to set a read-mostly variable at boot
> time rather than do them on every MMIO/NP fault.

Maybe, but I'd like to point out that the fault path uses only the
sh_l1e_is_*() functions (plus sh_l1e_mmio_get_gfn()), and hence
isn't affected by the added fiddly-ness.

Jan



* Re: [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries
  2021-03-08  9:39   ` Tim Deegan
@ 2021-03-08 12:05     ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-08 12:05 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, George Dunlap, Andrew Cooper, Wei Liu,
	Roger Pau Monné,
	Ian Jackson

On 08.03.2021 10:39, Tim Deegan wrote:
> At 16:37 +0100 on 05 Mar (1614962265), Jan Beulich wrote:
>> Since we don't need to encode all of the PTE flags, we have enough bits
>> in the shadow entry to store the full GFN. Don't use literal numbers -
>> instead derive the involved values. Or, where derivation would become
>> too ugly, sanity-check the result (invoking #error to identify failure).
>>
>> This then allows dropping from sh_l1e_mmio() again the guarding against
>> too large GFNs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Tim Deegan <tim@xen.org>

Thanks.

>> I wonder if the respective check in sh_audit_l1_table() is actually
>> useful to retain with these changes.
> 
> Yes, I think so.  We care about these PTEs being bogus for any reason,
> not just the ones that this could address.
> 
>> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
>> +#define SH_L1E_MAGIC_NR_META_BITS 4
>> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
>> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
>> +                           _PAGE_PRESENT)
> 
> I don't think this makes the code any more readable, TBH, but if you
> prefer it that's OK.  I'd be happier with it if you added a
> BUILD_BUG_ON that checks that 1ULL << (PADDR_BITS - 1) is set in the
> mask, since that's the main thing we care about.
> 
>> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
>> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
>> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
>> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
>> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
> 
> IMO this would be more readable as a straight 0x8 (or even _PAGE_PWT).
> The ack stands either way.

For both of your remarks, I agree that readability suffers by doing
it this way, so I'll undo some of it since you'd prefer it to be
more readable. My goal wasn't so much readability though but ease
of changes down the road (no need to change multiple related
definitions) and documentation of why these values actually are the
way they are.

But let's first see anyway what further changes are needed in
response to Andrew's remarks.

Jan



* Re: [PATCH 2/2][4.15?] x86/shadow: encode full GFN in magic MMIO entries
  2021-03-05 16:32   ` Andrew Cooper
@ 2021-03-08 12:42     ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2021-03-08 12:42 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tim Deegan, George Dunlap, Wei Liu, Roger Pau Monné,
	Ian Jackson, xen-devel

On 05.03.2021 17:32, Andrew Cooper wrote:
> On 05/03/2021 15:37, Jan Beulich wrote:
>> Since we don't need to encode all of the PTE flags, we have enough bits
>> in the shadow entry to store the full GFN. Don't use literal numbers -
>> instead derive the involved values. Or, where derivation would become
>> too ugly, sanity-check the result (invoking #error to identify failure).
>>
>> This then allows dropping from sh_l1e_mmio() again the guarding against
>> too large GFNs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I wonder if the respective check in sh_audit_l1_table() is actually
>> useful to retain with these changes.
>>
>> --- a/xen/arch/x86/mm/shadow/types.h
>> +++ b/xen/arch/x86/mm/shadow/types.h
>> @@ -283,9 +283,17 @@ shadow_put_page_from_l1e(shadow_l1e_t sl
>>   * This is only feasible for PAE and 64bit Xen: 32-bit non-PAE PTEs don't
>>   * have reserved bits that we can use for this.  And even there it can only
>>   * be used if we can be certain the processor doesn't use all 52 address bits.
>> + *
>> + * For the MMIO encoding (see below) we need the bottom 4 bits for
>> + * identifying the kind of entry and a full GFN's worth of bits to encode
>> + * the originating frame number.  Set all remaining bits to trigger
>> + * reserved bit faults, if (see above) the hardware permits triggering such.
>>   */
>>  
>> -#define SH_L1E_MAGIC 0xffffffff00000001ULL
>> +#define SH_L1E_MAGIC_NR_META_BITS 4
>> +#define SH_L1E_MAGIC_MASK ((~0ULL << (PADDR_BITS - PAGE_SHIFT + \
>> +                                      SH_L1E_MAGIC_NR_META_BITS)) | \
>> +                           _PAGE_PRESENT)
>>  
>>  static inline bool sh_have_pte_rsvd_bits(void)
>>  {
>> @@ -294,7 +302,8 @@ static inline bool sh_have_pte_rsvd_bits
>>  
>>  static inline bool sh_l1e_is_magic(shadow_l1e_t sl1e)
>>  {
>> -    return (sl1e.l1 & SH_L1E_MAGIC) == SH_L1E_MAGIC;
>> +    BUILD_BUG_ON(!(PADDR_MASK & SH_L1E_MAGIC_MASK));
>> +    return (sl1e.l1 & SH_L1E_MAGIC_MASK) == SH_L1E_MAGIC_MASK;
>>  }
>>  
>>  /* Guest not present: a single magic value */
>> @@ -320,20 +329,26 @@ static inline bool sh_l1e_is_gnp(shadow_
>>  
>>  /*
>>   * MMIO: an invalid PTE that contains the GFN of the equivalent guest l1e.
>> - * We store 28 bits of GFN in bits 4:32 of the entry.
>> + * We store the GFN in bits 4:43 of the entry.
>>   * The present bit is set, and the U/S and R/W bits are taken from the guest.
>>   * Bit 3 is always 0, to differentiate from gnp above.
>>   */
>> -#define SH_L1E_MMIO_MAGIC       0xffffffff00000001ULL
>> -#define SH_L1E_MMIO_MAGIC_MASK  0xffffffff00000009ULL
>> -#define SH_L1E_MMIO_GFN_MASK    0x00000000fffffff0ULL
>> +#define SH_L1E_MMIO_MAGIC       SH_L1E_MAGIC_MASK
>> +#define SH_L1E_MMIO_MAGIC_BIT   ((_PAGE_PRESENT | _PAGE_RW | _PAGE_USER) + 1)
>> +#if SH_L1E_MMIO_MAGIC_BIT & (SH_L1E_MMIO_MAGIC_BIT - 1)
>> +# error SH_L1E_MMIO_MAGIC_BIT needs to be a power of 2
>> +#endif
>> +#if SH_L1E_MMIO_MAGIC_BIT >> SH_L1E_MAGIC_NR_META_BITS
>> +# error SH_L1E_MMIO_MAGIC_BIT and SH_L1E_MAGIC_NR_META_BITS are out of sync
>> +#endif
>> +#define SH_L1E_MMIO_MAGIC_MASK  (SH_L1E_MAGIC_MASK | SH_L1E_MMIO_MAGIC_BIT)
>> +#define SH_L1E_MMIO_GFN_MASK    ~(SH_L1E_MMIO_MAGIC_MASK | _PAGE_RW | _PAGE_USER)
> 
> In practice, it is 4:36, because we have to limit shadow guests to 32
> bits of gfn for XSA-173 (width of the superpage backpointer IIRC).

When !BIGMEM - yes.

> Also, this property is important for L1TF.  The more guest-controllable
> bits we permit in here, the greater the chance of being vulnerable to
> L1TF on massive machines.
> 
> (I'm a little concerned that I can't spot an L1TF comment which has been
> made stale by these changes...  Probably the fault of whichever numpty
> prepared the L1TF patches, because I'm certain we discussed this at the
> time)
> 
> Going from 32 to 36 bits moves the upper safety barrier from TOP-4G to
> TOP-64G but I recall us agreed that that was ok, especially as the main
> safety guestimate is "no RAM in the top quarter of the address space".
> 
> However, I don't think we want to accidentally creep beyond bit 36, so
> I'd suggest that the easy fix here is just adjusting a nibble in the
> MMIO masks.

With BIGMEM I'm not sure we want to be this strict. Nor do we need
to as long as we only need 4 bits at the bottom - we only go up to
bit 43 with what we allow guests control over. IOW we will need to
be careful on old hardware when l1d_maxphysaddr == 44, but on
anything newer we're still far enough away I would think. So I
guess instead of outright dropping the GFN check from sh_l1e_mmio()
I want to replace it by use of is_l1tf_safe_maddr() (on the
produced shadow_l1e_t).
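A sketch of what that replacement could look like - is_l1tf_safe_maddr()
is the existing XSA-273 helper, but its use here only reflects the
stated intention, not a posted patch:

    static inline shadow_l1e_t sh_l1e_mmio(gfn_t gfn, u32 gflags)
    {
        unsigned long gfn_val = MASK_INSR(gfn_x(gfn),
                                          SH_L1E_MMIO_GFN_MASK);

        /* Bail if reserved-bit faults can't be relied upon, or if the
         * resulting entry's address bits would be L1TF-unsafe. */
        if ( !sh_have_pte_rsvd_bits() ||
             !is_l1tf_safe_maddr(SH_L1E_MMIO_MAGIC | gfn_val) )
            return shadow_l1e_empty();

        return (shadow_l1e_t) { (SH_L1E_MMIO_MAGIC | gfn_val |
                                 (gflags & (_PAGE_USER | _PAGE_RW))) };
    }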

Jan



* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-08  9:25   ` Tim Deegan
  2021-03-08  9:40     ` Jan Beulich
@ 2021-03-08 13:47     ` Andrew Cooper
  2021-03-08 13:51       ` Jan Beulich
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2021-03-08 13:47 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich
  Cc: xen-devel, George Dunlap, Wei Liu, Roger Pau Monné, Ian Jackson

On 08/03/2021 09:25, Tim Deegan wrote:
> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>> We can't make correctness of our own behavior dependent upon a
>> hypervisor underneath us correctly telling us the true physical address
>> with hardware uses. Without knowing this, we can't be certain reserved
>> bit faults can actually be observed. Therefore, besides evaluating the
>> number of address bits when deciding whether to use the optimization,
>> also check whether we're running virtualized ourselves.
>>
>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>
>
> I would consider this to be a bug in the underlying hypervisor, but I
> agree that in practice it won't be safe to rely on it being correct.

I'd argue against this being a hypervisor bug.  If anything, it is a
weakness in how x86 virtualisation works.

For booting on a single host, then yes - vMAXPHYSADDR really ought to be
the same as MAXPHYSADDR, and is what happens in the common case.

For booting in a heterogeneous pool, the only safe value is the min of
MAXPHYSADDR across the resource pool.  Anything higher, and the VM will
malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
pool member(s).

Address widths vary greatly between generations and SKUs, so blocking
migrate on a MAXPHYSADDR mismatch isn't a viable option.  VM migration
works in practice because native kernels don't tend to use reserved bit
optimisations in the first place.

The fault lies with Xen.  We're using a property of reserved bit
behaviour which was always going to change eventually, and can't be
levelled in common heterogeneous scenarios.

~Andrew




* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-08 13:47     ` Andrew Cooper
@ 2021-03-08 13:51       ` Jan Beulich
  2021-03-08 13:59         ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2021-03-08 13:51 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan
  Cc: xen-devel, George Dunlap, Wei Liu, Roger Pau Monné, Ian Jackson

On 08.03.2021 14:47, Andrew Cooper wrote:
> On 08/03/2021 09:25, Tim Deegan wrote:
>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>> We can't make correctness of our own behavior dependent upon a
>>> hypervisor underneath us correctly telling us the true physical address
>>> width the hardware uses. Without knowing this, we can't be certain reserved
>>> bit faults can actually be observed. Therefore, besides evaluating the
>>> number of address bits when deciding whether to use the optimization,
>>> also check whether we're running virtualized ourselves.
>>>
>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Tim Deegan <tim@xen.org>
>>
>> I would consider this to be a bug in the underlying hypervisor, but I
>> agree that in practice it won't be safe to rely on it being correct.
> 
> I'd argue against this being a hypervisor bug.  If anything, it is a
> weakness in how x86 virtualisation works.
> 
> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
> the same as MAXPHYSADDR, and is what happens in the common case.
> 
> For booting in a heterogeneous pool, the only safe value is the min of
> MAXPHYSADDR across the resource pool.  Anything higher, and the VM will
> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
> pool member(s).

Except that min isn't safe either - the guest may then expect reserved
bit faults where none surface.

Jan



* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-08 13:51       ` Jan Beulich
@ 2021-03-08 13:59         ` Andrew Cooper
  2021-03-08 14:29           ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2021-03-08 13:59 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan
  Cc: xen-devel, George Dunlap, Wei Liu, Roger Pau Monné, Ian Jackson

On 08/03/2021 13:51, Jan Beulich wrote:
> On 08.03.2021 14:47, Andrew Cooper wrote:
>> On 08/03/2021 09:25, Tim Deegan wrote:
>>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>>> We can't make correctness of our own behavior dependent upon a
>>>> hypervisor underneath us correctly telling us the true physical address
>>>> width the hardware uses. Without knowing this, we can't be certain reserved
>>>> bit faults can actually be observed. Therefore, besides evaluating the
>>>> number of address bits when deciding whether to use the optimization,
>>>> also check whether we're running virtualized ourselves.
>>>>
>>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Acked-by: Tim Deegan <tim@xen.org>
>>>
>>> I would consider this to be a bug in the underlying hypervisor, but I
>>> agree that in practice it won't be safe to rely on it being correct.
>> I'd argue against this being a hypervisor bug.  If anything, it is a
>> weakness in how x86 virtualisation works.
>>
>> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
>> the same as MAXPHYSADDR, and is what happens in the common case.
>>
>> For booting in a heterogeneous pool, the only safe value is the min of
>> MAXPHYSADDR across the resource pool.  Anything higher, and the VM will
>> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
>> pool member(s).
> Except that min isn't safe either - the guest may then expect reserved
> bit faults where none surface.

Such a guest is buggy, and in clear violation of the rules set out in
the architecture.  All reserved behaviour is subject to change in the
future.

Any software (Xen included) deliberately choosing to depend on the
specifics of reserved behaviour, get to keep all resulting pieces.

~Andrew




* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-08 13:59         ` Andrew Cooper
@ 2021-03-08 14:29           ` Jan Beulich
  2021-03-08 14:47             ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2021-03-08 14:29 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, George Dunlap, Wei Liu, Roger Pau Monné,
	Ian Jackson, Tim Deegan

On 08.03.2021 14:59, Andrew Cooper wrote:
> On 08/03/2021 13:51, Jan Beulich wrote:
>> On 08.03.2021 14:47, Andrew Cooper wrote:
>>> On 08/03/2021 09:25, Tim Deegan wrote:
>>>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>>>> We can't make correctness of our own behavior dependent upon a
>>>>> hypervisor underneath us correctly telling us the true physical address
>>>>> width the hardware uses. Without knowing this, we can't be certain reserved
>>>>> bit faults can actually be observed. Therefore, besides evaluating the
>>>>> number of address bits when deciding whether to use the optimization,
>>>>> also check whether we're running virtualized ourselves.
>>>>>
>>>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> Acked-by: Tim Deegan <tim@xen.org>
>>>>
>>>> I would consider this to be a bug in the underlying hypervisor, but I
>>>> agree that in practice it won't be safe to rely on it being correct.
>>> I'd argue against this being a hypervisor bug.  If anything, it is a
>>> weakness in how x86 virtualisation works.
>>>
>>> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
>>> the same as MAXPHYSADDR, and is what happens in the common case.
>>>
>>> For booting in a heterogeneous pool, the only safe value is the min of
>>> MAXPHYSADDR across the resource pool.  Anything higher, and the VM will
>>> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
>>> pool member(s).
>> Except that min isn't safe either - the guest may then expect reserved
>> bit faults where none surface.
> 
> Such a guest is buggy, and in clear violation of the rules set out in
> the architecture.  All reserved behaviour is subject to change in the
> future.
> 
> Any software (Xen included) deliberately choosing to depend on the
> specifics of reserved behaviour, get to keep all resulting pieces.

While I could understand what you're saying when considering our
prior unconditional reliance on getting reserved bit faults, are you
suggesting the recently adjusted behavior is still "in clear
violation of the rules set out in the architecture"? And hence are
you suggesting we should have outright dropped that optimization?

Jan



* Re: [PATCH 1/2][4.15?] x86/shadow: suppress "fast fault path" optimization when running virtualized
  2021-03-08 14:29           ` Jan Beulich
@ 2021-03-08 14:47             ` Andrew Cooper
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Cooper @ 2021-03-08 14:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, George Dunlap, Wei Liu, Roger Pau Monné,
	Ian Jackson, Tim Deegan

On 08/03/2021 14:29, Jan Beulich wrote:
> On 08.03.2021 14:59, Andrew Cooper wrote:
>> On 08/03/2021 13:51, Jan Beulich wrote:
>>> On 08.03.2021 14:47, Andrew Cooper wrote:
>>>> On 08/03/2021 09:25, Tim Deegan wrote:
>>>>> At 16:37 +0100 on 05 Mar (1614962224), Jan Beulich wrote:
>>>>>> We can't make correctness of our own behavior dependent upon a
>>>>>> hypervisor underneath us correctly telling us the true physical address
>>>>>> width the hardware uses. Without knowing this, we can't be certain reserved
>>>>>> bit faults can actually be observed. Therefore, besides evaluating the
>>>>>> number of address bits when deciding whether to use the optimization,
>>>>>> also check whether we're running virtualized ourselves.
>>>>>>
>>>>>> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> Acked-by: Tim Deegan <tim@xen.org>
>>>>>
>>>>> I would consider this to be a bug in the underlying hypervisor, but I
>>>>> agree that in practice it won't be safe to rely on it being correct.
>>>> I'd argue against this being a hypervisor bug.  If anything, it is a
>>>> weakness in how x86 virtualisation works.
>>>>
>>>> For booting on a single host, then yes - vMAXPHYSADDR really ought to be
>>>> the same as MAXPHYSADDR, and is what happens in the common case.
>>>>
>>>> For booting in a heterogeneous pool, the only safe value is the min of
>>>> MAXPHYSADDR across the resource pool.  Anything higher, and the VM will
>>>> malfunction (get #PF[rsvd] for apparently-legal PTEs) on the smallest
>>>> pool member(s).
>>> Except that min isn't safe either - the guest may then expect reserved
>>> bit faults where none surface.
>> Such a guest is buggy, and in clear violation of the rules set out in
>> the architecture.  All reserved behaviour is subject to change in the
>> future.
>>
>> Any software (Xen included) deliberately choosing to depend on the
>> specifics of reserved behaviour, get to keep all resulting pieces.
> While I could understand what you're saying when considering our
> prior unconditional reliance on getting reserved bit faults, are you
> suggesting the recently adjusted behavior is still "in clear
> violation of the rules set out in the architecture"? And hence are
> you suggesting we should have outright dropped that optimization?

Strictly speaking, we shouldn't use reserved bits at all.  That is the
most compatible option available.

Given that we have chosen to use reserved bits, it is our responsibility
for ensuring that they are safe to use.  That is why we elect not to use
them when virtualised (for this bug), or on native when we think they
won't work (the previous change for IceLake).

From that point of view, I think we're fine.  We're (knowingly)
operating outside of the rules, and now taking appropriate precautions
to cover the corner cases we are aware of.

The behaviour of already-shipped CPUs, while not architecturally
guaranteed, is known and can reasonably be depended upon.

~Andrew



