xen-devel.lists.xenproject.org archive mirror
* [PATCH 0/5] x86/xstate: Fixes to size calculations
@ 2021-05-03 15:39 Andrew Cooper
  2021-05-03 15:39 ` [PATCH 1/5] x86/xstate: Elide redundant writes in set_xcr0() Andrew Cooper
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 15:39 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

Various fixes and improvements to xsave image size calculations.

 * Skip redundant xcr0 writes
 * Don't reconfigure xcr0 to query hardware when we can trivially calculate
   the answer ourselves.
 * Fix latent bug with CPUID.0xD[0].ebx.
 * Rework CPUID.0xD[1].ebx to behave correctly when supervisor states are in
   use.

Results from AMD Milan with some prototype CET handling, as well as the debug
cross checks, in place:
  (d1) xstates 0x0001, uncomp 0x240, comp 0x240
  (d1) xstates 0x0003, uncomp 0x240, comp 0x240
  (d1) xstates 0x0007, uncomp 0x340, comp 0x340
  (d1) xstates 0x0207, uncomp 0x988, comp 0x348
  (d1) xstates 0x0a07, uncomp 0x988, comp 0x358
  (d1) xstates 0x1a07, uncomp 0x988, comp 0x370
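
As a worked decode of one of those lines (the offsets are whatever this
particular machine reports in CPUID leaf 0xD, so treat them as illustrative):

  xstates 0x0207 = FP | SSE | YMM | PKRU
    uncompressed = xstate_offsets[PKRU] + xstate_sizes[PKRU]
                 = 0x980 + 0x8                                = 0x988
    compressed   = 0x240 (legacy area + XSAVE header)
                 + 0x100 (YMM) + 0x8 (PKRU)                   = 0x348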

Andrew Cooper (5):
  x86/xstate: Elide redundant writes in set_xcr0()
  x86/xstate: Rename _xstate_ctxt_size() to hw_uncompressed_size()
  x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  x86/cpuid: Simplify recalculate_xstate()
  x86/cpuid: Fix handling of xsave dynamic leaves

 xen/arch/x86/cpuid.c         |  75 +++++++++------------------
 xen/arch/x86/domain.c        |   4 +-
 xen/arch/x86/domctl.c        |   2 +-
 xen/arch/x86/hvm/hvm.c       |   2 +-
 xen/arch/x86/xstate.c        | 117 +++++++++++++++++++++++++++++++++++--------
 xen/include/asm-x86/xstate.h |   3 +-
 6 files changed, 126 insertions(+), 77 deletions(-)

-- 
2.11.0




* [PATCH 1/5] x86/xstate: Elide redundant writes in set_xcr0()
  2021-05-03 15:39 [PATCH 0/5] x86/xstate: Fixes to size calculations Andrew Cooper
@ 2021-05-03 15:39 ` Andrew Cooper
  2021-05-04 11:51   ` Jan Beulich
  2021-05-03 15:39 ` [PATCH 2/5] x86/xstate: Rename _xstate_ctxt_size() to hw_uncompressed_size() Andrew Cooper
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 15:39 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

XSETBV is an expensive instruction as, amongst other things, it involves
reconfiguring the instruction decode at the frontend of the pipeline.

We have several paths which reconfigure %xcr0 in quick succession (the context
switch path has 5, including the fpu save/restore helpers), and only a single
caller takes any care to try to skip redundant writes.

Update set_xcr0() to perform amortisation automatically, and simplify the
__context_switch() path as a consequence.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
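
As an illustration of the amortisation (hypothetical back-to-back calls, not
code from this patch):

  set_xcr0(XSTATE_FP_SSE | X86_XCR0_YMM); /* differs from this_cpu(xcr0): XSETBV issued */
  set_xcr0(XSTATE_FP_SSE | X86_XCR0_YMM); /* matches the cached value: no XSETBV */
  set_xcr0(XSTATE_FP_SSE);                /* differs again: XSETBV issued */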
---
 xen/arch/x86/domain.c |  4 +---
 xen/arch/x86/xstate.c | 15 +++++++++++----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 4dc27f798e..50a27197b5 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1977,9 +1977,7 @@ static void __context_switch(void)
         memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
         if ( cpu_has_xsave )
         {
-            u64 xcr0 = n->arch.xcr0 ?: XSTATE_FP_SSE;
-
-            if ( xcr0 != get_xcr0() && !set_xcr0(xcr0) )
+            if ( !set_xcr0(n->arch.xcr0 ?: XSTATE_FP_SSE) )
                 BUG();
 
             if ( cpu_has_xsaves && is_hvm_vcpu(n) )
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 3794d9a5a5..f82dae8053 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -55,11 +55,18 @@ static inline bool xsetbv(u32 index, u64 xfeatures)
     return lo != 0;
 }
 
-bool set_xcr0(u64 xfeatures)
+bool set_xcr0(u64 val)
 {
-    if ( !xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures) )
-        return false;
-    this_cpu(xcr0) = xfeatures;
+    uint64_t *this_xcr0 = &this_cpu(xcr0);
+
+    if ( *this_xcr0 != val )
+    {
+        if ( !xsetbv(XCR_XFEATURE_ENABLED_MASK, val) )
+            return false;
+
+        *this_xcr0 = val;
+    }
+
     return true;
 }
 
-- 
2.11.0




* [PATCH 2/5] x86/xstate: Rename _xstate_ctxt_size() to hw_uncompressed_size()
  2021-05-03 15:39 [PATCH 0/5] x86/xstate: Fixes to size calculations Andrew Cooper
  2021-05-03 15:39 ` [PATCH 1/5] x86/xstate: Elide redundant writes in set_xcr0() Andrew Cooper
@ 2021-05-03 15:39 ` Andrew Cooper
  2021-05-04 11:53   ` Jan Beulich
  2021-05-03 15:39 ` [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size() Andrew Cooper
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 15:39 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

The latter is a more descriptive name, as it explicitly highlights the query
from hardware.

Simplify the internal logic using cpuid_count_ebx(), and drop the curr/max
assertion as this property is guaranteed by the x86 ISA.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
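
For reference, cpuid_count_ebx() is assumed to be the obvious wrapper around
cpuid_count(), something like (sketch, not part of this patch):

  static inline unsigned int cpuid_count_ebx(unsigned int leaf,
                                             unsigned int subleaf)
  {
      unsigned int ebx, ign;

      cpuid_count(leaf, subleaf, &ign, &ebx, &ign, &ign);

      return ebx;
  }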
---
 xen/arch/x86/xstate.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index f82dae8053..e6c225a16b 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -554,19 +554,18 @@ void xstate_free_save_area(struct vcpu *v)
     v->arch.xsave_area = NULL;
 }
 
-static unsigned int _xstate_ctxt_size(u64 xcr0)
+static unsigned int hw_uncompressed_size(uint64_t xcr0)
 {
     u64 act_xcr0 = get_xcr0();
-    u32 eax, ebx = 0, ecx, edx;
+    unsigned int size;
     bool ok = set_xcr0(xcr0);
 
     ASSERT(ok);
-    cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-    ASSERT(ebx <= ecx);
+    size = cpuid_count_ebx(XSTATE_CPUID, 0);
     ok = set_xcr0(act_xcr0);
     ASSERT(ok);
 
-    return ebx;
+    return size;
 }
 
 /* Fastpath for common xstate size requests, avoiding reloads of xcr0. */
@@ -578,7 +577,7 @@ unsigned int xstate_ctxt_size(u64 xcr0)
     if ( xcr0 == 0 )
         return 0;
 
-    return _xstate_ctxt_size(xcr0);
+    return hw_uncompressed_size(xcr0);
 }
 
 /* Collect the information of processor's extended state */
@@ -635,14 +634,14 @@ void xstate_init(struct cpuinfo_x86 *c)
          * xsave_cntxt_size is the max size required by enabled features.
          * We know FP/SSE and YMM about eax, and nothing about edx at present.
          */
-        xsave_cntxt_size = _xstate_ctxt_size(feature_mask);
+        xsave_cntxt_size = hw_uncompressed_size(feature_mask);
         printk("xstate: size: %#x and states: %#"PRIx64"\n",
                xsave_cntxt_size, xfeature_mask);
     }
     else
     {
         BUG_ON(xfeature_mask != feature_mask);
-        BUG_ON(xsave_cntxt_size != _xstate_ctxt_size(feature_mask));
+        BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
     }
 
     if ( setup_xstate_features(bsp) && bsp )
-- 
2.11.0




* [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-03 15:39 [PATCH 0/5] x86/xstate: Fixes to size calculations Andrew Cooper
  2021-05-03 15:39 ` [PATCH 1/5] x86/xstate: Elide redundant writes in set_xcr0() Andrew Cooper
  2021-05-03 15:39 ` [PATCH 2/5] x86/xstate: Rename _xstate_ctxt_size() to hw_uncompressed_size() Andrew Cooper
@ 2021-05-03 15:39 ` Andrew Cooper
  2021-05-03 18:17   ` Andrew Cooper
  2021-05-04 12:20   ` Jan Beulich
  2021-05-03 15:39 ` [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate() Andrew Cooper
  2021-05-03 15:39 ` [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves Andrew Cooper
  4 siblings, 2 replies; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 15:39 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

We're soon going to need a compressed helper of the same form.

The size of the uncompressed image is strictly a property of the highest
user state.  This can be calculated trivially with xstate_offsets/sizes, and
is much faster than a CPUID instruction in the first place, let alone the two
XCR0 writes surrounding it.

Retain the cross-check with hardware in debug builds, but forgo it in normal
builds.  In particular, this means that the migration paths don't need to mess
with XCR0 just to sanity check the buffer size.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
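
A sketch of the new calculation, using the Milan numbers from the cover
letter (the offsets/sizes are whatever that machine reports in leaf 0xD):

  xcr0 = 0x0007 (FP|SSE|YMM): idx = flsl(0x0007) - 1 = 2
      size = xstate_offsets[2] + xstate_sizes[2] = 0x240 + 0x100 = 0x340
  xcr0 = 0x0003 (FP|SSE):     idx = 1, so size = XSTATE_AREA_MIN_SIZE = 0x240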
---
 xen/arch/x86/domctl.c        |  2 +-
 xen/arch/x86/hvm/hvm.c       |  2 +-
 xen/arch/x86/xstate.c        | 40 +++++++++++++++++++++++++++++++---------
 xen/include/asm-x86/xstate.h |  2 +-
 4 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index e440bd021e..8c3552410d 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -899,7 +899,7 @@ long arch_do_domctl(
         uint32_t offset = 0;
 
 #define PV_XSAVE_HDR_SIZE (2 * sizeof(uint64_t))
-#define PV_XSAVE_SIZE(xcr0) (PV_XSAVE_HDR_SIZE + xstate_ctxt_size(xcr0))
+#define PV_XSAVE_SIZE(xcr0) (PV_XSAVE_HDR_SIZE + xstate_uncompressed_size(xcr0))
 
         ret = -ESRCH;
         if ( (evc->vcpu >= d->max_vcpus) ||
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 28beacc45b..e5fda6b387 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1203,7 +1203,7 @@ HVM_REGISTER_SAVE_RESTORE(CPU, hvm_save_cpu_ctxt, hvm_load_cpu_ctxt, 1,
 
 #define HVM_CPU_XSAVE_SIZE(xcr0) (offsetof(struct hvm_hw_cpu_xsave, \
                                            save_area) + \
-                                  xstate_ctxt_size(xcr0))
+                                  xstate_uncompressed_size(xcr0))
 
 static int hvm_save_cpu_xsave_states(struct vcpu *v, hvm_domain_context_t *h)
 {
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index e6c225a16b..d4c01da574 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -184,7 +184,7 @@ void expand_xsave_states(struct vcpu *v, void *dest, unsigned int size)
     /* Check there is state to serialise (i.e. at least an XSAVE_HDR) */
     BUG_ON(!v->arch.xcr0_accum);
     /* Check there is the correct room to decompress into. */
-    BUG_ON(size != xstate_ctxt_size(v->arch.xcr0_accum));
+    BUG_ON(size != xstate_uncompressed_size(v->arch.xcr0_accum));
 
     if ( !(xsave->xsave_hdr.xcomp_bv & XSTATE_COMPACTION_ENABLED) )
     {
@@ -246,7 +246,7 @@ void compress_xsave_states(struct vcpu *v, const void *src, unsigned int size)
     u64 xstate_bv, valid;
 
     BUG_ON(!v->arch.xcr0_accum);
-    BUG_ON(size != xstate_ctxt_size(v->arch.xcr0_accum));
+    BUG_ON(size != xstate_uncompressed_size(v->arch.xcr0_accum));
     ASSERT(!xsave_area_compressed(src));
 
     xstate_bv = ((const struct xsave_struct *)src)->xsave_hdr.xstate_bv;
@@ -568,16 +568,38 @@ static unsigned int hw_uncompressed_size(uint64_t xcr0)
     return size;
 }
 
-/* Fastpath for common xstate size requests, avoiding reloads of xcr0. */
-unsigned int xstate_ctxt_size(u64 xcr0)
+unsigned int xstate_uncompressed_size(uint64_t xcr0)
 {
-    if ( xcr0 == xfeature_mask )
-        return xsave_cntxt_size;
+    unsigned int size;
+    int idx = flsl(xcr0) - 1;
 
-    if ( xcr0 == 0 )
-        return 0;
+    /*
+     * The maximum size of an uncompressed XSAVE area is determined by the
+     * highest user state, as the size and offset of each component is fixed.
+     */
+    if ( idx >= 2 )
+    {
+        ASSERT(xstate_offsets[idx] && xstate_sizes[idx]);
+        size = xstate_offsets[idx] + xstate_sizes[idx];
+    }
+    else
+        size = XSTATE_AREA_MIN_SIZE;
 
-    return hw_uncompressed_size(xcr0);
+    /* In debug builds, cross-check our calculation with hardware. */
+    if ( IS_ENABLED(CONFIG_DEBUG) )
+    {
+        unsigned int hwsize;
+
+        xcr0 |= XSTATE_FP_SSE;
+        hwsize = hw_uncompressed_size(xcr0);
+
+        if ( size != hwsize )
+            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
+                        __func__, xcr0, size, hwsize);
+        size = hwsize;
+    }
+
+    return size;
 }
 
 /* Collect the information of processor's extended state */
diff --git a/xen/include/asm-x86/xstate.h b/xen/include/asm-x86/xstate.h
index 7ab0bdde89..02d6f171b8 100644
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -107,7 +107,7 @@ void compress_xsave_states(struct vcpu *v, const void *src, unsigned int size);
 void xstate_free_save_area(struct vcpu *v);
 int xstate_alloc_save_area(struct vcpu *v);
 void xstate_init(struct cpuinfo_x86 *c);
-unsigned int xstate_ctxt_size(u64 xcr0);
+unsigned int xstate_uncompressed_size(uint64_t xcr0);
 
 static inline uint64_t xgetbv(unsigned int index)
 {
-- 
2.11.0




* [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
  2021-05-03 15:39 [PATCH 0/5] x86/xstate: Fixes to size calculations Andrew Cooper
                   ` (2 preceding siblings ...)
  2021-05-03 15:39 ` [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size() Andrew Cooper
@ 2021-05-03 15:39 ` Andrew Cooper
  2021-05-04 12:43   ` Jan Beulich
  2021-05-03 15:39 ` [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves Andrew Cooper
  4 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 15:39 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

Make use of the new xstate_uncompressed_size() helper rather than maintaining
the running calculation while accumulating feature components.

The rest of the CPUID data can come direct from the raw cpuid policy.  All
per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
instructions, and is constant.

Use for_each_set_bit() rather than opencoding a slightly awkward version of
it.  Mask the attributes in ecx down based on the visible features.  This
isn't actually necessary for any components or attributes defined at the time
of writing (up to AMX), but is added out of an abundance of caution.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

Using min() in for_each_set_bit() leads to awful code generation, as it
prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
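
As a concrete example of what the reworked loop emits (raw leaf 0xD values
taken from the Milan box in the cover letter, so the offsets are
machine-specific): for a policy with avx and pku but xsaves hidden,
xstates = FP|SSE|YMM|PKRU and ecx_bits = 0, giving

  subleaf 2 (YMM):  eax (size) = 0x100, ebx (offset) = 0x240, ecx = 0
  subleaf 9 (PKRU): eax (size) = 0x008, ebx (offset) = 0x980, ecx = 0

i.e. size/offset pass straight through from the raw policy, and the
XSS/Align64 attributes stay clear while xsaves is hidden.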
---
 xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
 1 file changed, 17 insertions(+), 35 deletions(-)

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 752bf244ea..c7f8388e5d 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
 static void recalculate_xstate(struct cpuid_policy *p)
 {
     uint64_t xstates = XSTATE_FP_SSE;
-    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
-    unsigned int i, Da1 = p->xstate.Da1;
+    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
 
     /*
      * The Da1 leaf is the only piece of information preserved in the common
@@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
         return;
 
     if ( p->basic.avx )
-    {
         xstates |= X86_XCR0_YMM;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_YMM_POS] +
-                          xstate_sizes[X86_XCR0_YMM_POS]);
-    }
 
     if ( p->feat.mpx )
-    {
         xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
-                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
-    }
 
     if ( p->feat.avx512f )
-    {
         xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
-                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
-    }
 
     if ( p->feat.pku )
-    {
         xstates |= X86_XCR0_PKRU;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_PKRU_POS] +
-                          xstate_sizes[X86_XCR0_PKRU_POS]);
-    }
 
-    p->xstate.max_size  =  xstate_size;
+    /* Subleaf 0 */
+    p->xstate.max_size =
+        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
     p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
     p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
 
+    /* Subleaf 1 */
     p->xstate.Da1 = Da1;
     if ( p->xstate.xsaves )
     {
+        ecx_bits |= 3; /* Align64, XSS */
         p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
         p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
     }
-    else
-        xstates &= ~XSTATE_XSAVES_ONLY;
 
-    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
+    /* Subleafs 2+ */
+    xstates &= ~XSTATE_FP_SSE;
+    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
+    for_each_set_bit ( i, &xstates, 63 )
     {
-        uint64_t curr_xstate = 1ul << i;
-
-        if ( !(xstates & curr_xstate) )
-            continue;
-
-        p->xstate.comp[i].size   = xstate_sizes[i];
-        p->xstate.comp[i].offset = xstate_offsets[i];
-        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
-        p->xstate.comp[i].align  = curr_xstate & xstate_align;
+        /*
+         * Pass through size (eax) and offset (ebx) directly.  Visibility of
+         * attributes in ecx limited by visible features in Da1.
+         */
+        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
+        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
+        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
     }
 }
 
-- 
2.11.0




* [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
  2021-05-03 15:39 [PATCH 0/5] x86/xstate: Fixes to size calculations Andrew Cooper
                   ` (3 preceding siblings ...)
  2021-05-03 15:39 ` [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate() Andrew Cooper
@ 2021-05-03 15:39 ` Andrew Cooper
  2021-05-04 12:56   ` Jan Beulich
  4 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 15:39 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

If max leaf is greater than 0xd but xsave not available to the guest, then the
current XSAVE size should not be filled in.  This is a latent bug for now as
the guest max leaf is 0xd, but will become problematic in the future.

The comment concerning XSS state is wrong.  VT-x doesn't manage host/guest
state automatically, but there is provision for "host only" bits to be set, so
the implications are still accurate.

Introduce {xstate,hw}_compressed_size() helpers to mirror the uncompressed
ones.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
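
A worked example of the xstate_compressed_size() calculation, matching the
Milan numbers in the cover letter (component sizes as reported by that
machine; none of these components has the Align64 attribute there, so no
ROUNDUP() occurs):

  xstates 0x1a07 = FP | SSE | YMM | PKRU | CET_U | CET_S
    size = 0x240 (XSTATE_AREA_MIN_SIZE)
         + 0x100 (YMM) + 0x8 (PKRU)
         + 0x10 (CET_U) + 0x18 (CET_S)    = 0x370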
---
 xen/arch/x86/cpuid.c         | 23 +++++++--------------
 xen/arch/x86/xstate.c        | 49 ++++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/xstate.h |  1 +
 3 files changed, 57 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index c7f8388e5d..92745aa63f 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -1041,24 +1041,15 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
     case XSTATE_CPUID:
         switch ( subleaf )
         {
+        case 0:
+            if ( p->basic.xsave )
+                res->b = xstate_uncompressed_size(v->arch.xcr0);
+            break;
+
         case 1:
             if ( p->xstate.xsaves )
-            {
-                /*
-                 * TODO: Figure out what to do for XSS state.  VT-x manages
-                 * host vs guest MSR_XSS automatically, so as soon as we start
-                 * supporting any XSS states, the wrong XSS will be in
-                 * context.
-                 */
-                BUILD_BUG_ON(XSTATE_XSAVES_ONLY != 0);
-
-                /*
-                 * Read CPUID[0xD,0/1].EBX from hardware.  They vary with
-                 * enabled XSTATE, and appropraite XCR0|XSS are in context.
-                 */
-        case 0:
-                res->b = cpuid_count_ebx(leaf, subleaf);
-            }
+                res->b = xstate_compressed_size(v->arch.xcr0 |
+                                                v->arch.msrs->xss.raw);
             break;
         }
         break;
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index d4c01da574..03489f0cf4 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -602,6 +602,55 @@ unsigned int xstate_uncompressed_size(uint64_t xcr0)
     return size;
 }
 
+static unsigned int hw_compressed_size(uint64_t xstates)
+{
+    uint64_t curr_xcr0 = get_xcr0(), curr_xss = get_msr_xss();
+    unsigned int size;
+    bool ok;
+
+    ok = set_xcr0(xstates & ~XSTATE_XSAVES_ONLY);
+    ASSERT(ok);
+    set_msr_xss(xstates & XSTATE_XSAVES_ONLY);
+
+    size = cpuid_count_ebx(XSTATE_CPUID, 1);
+
+    ok = set_xcr0(curr_xcr0);
+    ASSERT(ok);
+    set_msr_xss(curr_xss);
+
+    return size;
+}
+
+unsigned int xstate_compressed_size(uint64_t xstates)
+{
+    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
+
+    xstates &= ~XSTATE_FP_SSE;
+    for_each_set_bit ( i, &xstates, 63 )
+    {
+        if ( test_bit(i, &xstate_align) )
+            size = ROUNDUP(size, 64);
+
+        size += xstate_sizes[i];
+    }
+
+    /* In debug builds, cross-check our calculation with hardware. */
+    if ( IS_ENABLED(CONFIG_DEBUG) )
+    {
+        unsigned int hwsize;
+
+        xstates |= XSTATE_FP_SSE;
+        hwsize = hw_compressed_size(xstates);
+
+        if ( size != hwsize )
+            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
+                        __func__, xstates, size, hwsize);
+        size = hwsize;
+    }
+
+    return size;
+}
+
 /* Collect the information of processor's extended state */
 void xstate_init(struct cpuinfo_x86 *c)
 {
diff --git a/xen/include/asm-x86/xstate.h b/xen/include/asm-x86/xstate.h
index 02d6f171b8..ecf7bbc5cd 100644
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -108,6 +108,7 @@ void xstate_free_save_area(struct vcpu *v);
 int xstate_alloc_save_area(struct vcpu *v);
 void xstate_init(struct cpuinfo_x86 *c);
 unsigned int xstate_uncompressed_size(uint64_t xcr0);
+unsigned int xstate_compressed_size(uint64_t states);
 
 static inline uint64_t xgetbv(unsigned int index)
 {
-- 
2.11.0




* Re: [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-03 15:39 ` [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size() Andrew Cooper
@ 2021-05-03 18:17   ` Andrew Cooper
  2021-05-04 12:08     ` Jan Beulich
  2021-05-04 12:20   ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-03 18:17 UTC (permalink / raw)
  To: Xen-devel; +Cc: Jan Beulich, Roger Pau Monné, Wei Liu

On 03/05/2021 16:39, Andrew Cooper wrote:
> We're soon going to need a compressed helper of the same form.
>
> The size of the uncompressed image is strictly a property of the highest
> user state.  This can be calculated trivially with xstate_offsets/sizes, and
> is much faster than a CPUID instruction in the first place, let alone the two
> XCR0 writes surrounding it.
>
> Retain the cross-check with hardware in debug builds, but forgo it in normal
> builds.  In particular, this means that the migration paths don't need to mess
> with XCR0 just to sanity check the buffer size.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

The Qemu smoketests have actually found a bug here.

https://gitlab.com/xen-project/patchew/xen/-/jobs/1232118510/artifacts/file/smoke.serial

We call into xstate_uncompressed_size() from
hvm_register_CPU_save_and_restore() so the previous "xcr0 == 0" path was
critical to Xen not exploding on non-xsave platforms.

This is straight up buggy - we shouldn't be registering xsave handlers
on non-xsave platforms, but the calculation is also wrong (in the safe
directly at least) when we use compressed formats.  Yet another
unexpected surprise for the todo list.

~Andrew




* Re: [PATCH 1/5] x86/xstate: Elide redundant writes in set_xcr0()
  2021-05-03 15:39 ` [PATCH 1/5] x86/xstate: Elide redundant writes in set_xcr0() Andrew Cooper
@ 2021-05-04 11:51   ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 11:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 03.05.2021 17:39, Andrew Cooper wrote:
> XSETBV is an expensive instruction as, amongst other things, it involves
> reconfiguring the instruction decode at the frontend of the pipeline.
> 
> We have several paths which reconfigure %xcr0 in quick succession (the context
> switch path has 5, including the fpu save/restore helpers), and only a single
> caller takes any care to try to skip redundant writes.
> 
> Update set_xcr0() to perform amortisation automatically, and simplify the
> __context_switch() path as a consequence.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



* Re: [PATCH 2/5] x86/xstate: Rename _xstate_ctxt_size() to hw_uncompressed_size()
  2021-05-03 15:39 ` [PATCH 2/5] x86/xstate: Rename _xstate_ctxt_size() to hw_uncompressed_size() Andrew Cooper
@ 2021-05-04 11:53   ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 11:53 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 03.05.2021 17:39, Andrew Cooper wrote:
> The latter is a more descriptive name, as it explicitly highlights the query
> from hardware.
> 
> Simplify the internal logic using cpuid_count_ebx(), and drop the curr/max
> assertion as this property is guaranteed by the x86 ISA.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



* Re: [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-03 18:17   ` Andrew Cooper
@ 2021-05-04 12:08     ` Jan Beulich
  2021-05-04 12:15       ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 12:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 03.05.2021 20:17, Andrew Cooper wrote:
> On 03/05/2021 16:39, Andrew Cooper wrote:
>> We're soon going to need a compressed helper of the same form.
>>
>> The size of the uncompressed image is strictly a property of the highest
>> user state.  This can be calculated trivially with xstate_offsets/sizes, and
>> is much faster than a CPUID instruction in the first place, let alone the two
>> XCR0 writes surrounding it.
>>
>> Retain the cross-check with hardware in debug builds, but forgo it in normal
>> builds.  In particular, this means that the migration paths don't need to mess
>> with XCR0 just to sanity check the buffer size.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> The Qemu smoketests have actually found a bug here.
> 
> https://gitlab.com/xen-project/patchew/xen/-/jobs/1232118510/artifacts/file/smoke.serial
> 
> We call into xstate_uncompressed_size() from
> hvm_register_CPU_save_and_restore() so the previous "xcr0 == 0" path was
> critical to Xen not exploding on non-xsave platforms.
> 
> This is straight up buggy - we shouldn't be registering xsave handlers
> on non-xsave platforms, but the calculation is also wrong (in the safe
> directly at least) when we use compressed formats.  Yet another
> unexpected surprise for the todo list.

I don't view this as buggy at all - it was an implementation choice.
Perhaps not the best one, but still correct afaict. Then again I'm
afraid I don't understand "in the safe directly at least", so I may
well be overlooking something. Will wait for your updated patch ...

Jan



* Re: [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-04 12:08     ` Jan Beulich
@ 2021-05-04 12:15       ` Andrew Cooper
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Cooper @ 2021-05-04 12:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04/05/2021 13:08, Jan Beulich wrote:
> On 03.05.2021 20:17, Andrew Cooper wrote:
>> On 03/05/2021 16:39, Andrew Cooper wrote:
>>> We're soon going to need a compressed helper of the same form.
>>>
>>> The size of the uncompressed image is strictly a property of the highest
>>> user state.  This can be calculated trivially with xstate_offsets/sizes, and
>>> is much faster than a CPUID instruction in the first place, let alone the two
>>> XCR0 writes surrounding it.
>>>
>>> Retain the cross-check with hardware in debug builds, but forgo it in normal
>>> builds.  In particular, this means that the migration paths don't need to mess
>>> with XCR0 just to sanity check the buffer size.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> The Qemu smoketests have actually found a bug here.
>>
>> https://gitlab.com/xen-project/patchew/xen/-/jobs/1232118510/artifacts/file/smoke.serial
>>
>> We call into xstate_uncompressed_size() from
>> hvm_register_CPU_save_and_restore() so the previous "xcr0 == 0" path was
>> critical to Xen not exploding on non-xsave platforms.
>>
>> This is straight up buggy - we shouldn't be registering xsave handlers
>> on non-xsave platforms, but the calculation is also wrong (in the safe
>> directly at least) when we use compressed formats.  Yet another
>> unexpected surprise for the todo list.
> I don't view this as buggy at all - it was an implementation choice.
> Perhaps not the best one, but still correct afaict. Then again I'm
> afraid I don't understand "in the safe directly at least", so I may
> well be overlooking something. Will wait for your updated patch ...

For now, it is a patch 2.5/5 which just puts a cpu_has_xsave guard
around the registration.  Everything to do with xsave record processing
is unnecessary overhead on a non-xsave platform.
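
Roughly this shape (illustrative only - the real patch may differ, and the
remaining hvm_register_savevm() arguments are elided):

    if ( cpu_has_xsave )
        hvm_register_savevm(CPU_XSAVE_CODE, "CPU_XSAVE", /* ... */);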

I don't intend to alter patch 3 as a consequence.

~Andrew



* Re: [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-03 15:39 ` [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size() Andrew Cooper
  2021-05-03 18:17   ` Andrew Cooper
@ 2021-05-04 12:20   ` Jan Beulich
  2021-05-04 12:22     ` Andrew Cooper
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 12:20 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 03.05.2021 17:39, Andrew Cooper wrote:
> @@ -568,16 +568,38 @@ static unsigned int hw_uncompressed_size(uint64_t xcr0)
>      return size;
>  }
>  
> -/* Fastpath for common xstate size requests, avoiding reloads of xcr0. */
> -unsigned int xstate_ctxt_size(u64 xcr0)
> +unsigned int xstate_uncompressed_size(uint64_t xcr0)

Since you rewrite the function anyway, and since taking into account
the XSS-controlled features here is going to be necessary as well
(even if just down the road, but that's what your ultimate goal is
from all I can tell), how about renaming the parameter to "xstates"
or "states" at the same time?

Jan



* Re: [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-04 12:20   ` Jan Beulich
@ 2021-05-04 12:22     ` Andrew Cooper
  2021-05-04 12:45       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-04 12:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04/05/2021 13:20, Jan Beulich wrote:
> On 03.05.2021 17:39, Andrew Cooper wrote:
>> @@ -568,16 +568,38 @@ static unsigned int hw_uncompressed_size(uint64_t xcr0)
>>      return size;
>>  }
>>  
>> -/* Fastpath for common xstate size requests, avoiding reloads of xcr0. */
>> -unsigned int xstate_ctxt_size(u64 xcr0)
>> +unsigned int xstate_uncompressed_size(uint64_t xcr0)
> Since you rewrite the function anyway, and since taking into account
> the XSS-controlled features here is going to be necessary as well
> (even if just down the road, but that's what your ultimate goal is
> from all I can tell), how about renaming the parameter to "xstates"
> or "states" at the same time?

I'm working on some cleanup of terminology, which I haven't posted yet.

For this one, I'm not sure.  For uncompressed size, we genuinely mean
user states only.  When there's a suitable constant to use, this will
gain an assertion.

~Andrew



* Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
  2021-05-03 15:39 ` [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate() Andrew Cooper
@ 2021-05-04 12:43   ` Jan Beulich
  2021-05-04 13:58     ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 12:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 03.05.2021 17:39, Andrew Cooper wrote:
> Make use of the new xstate_uncompressed_size() helper rather than maintaining
> the running calculation while accumulating feature components.
> 
> The rest of the CPUID data can come direct from the raw cpuid policy.  All
> per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
> instructions, and is constant.
> 
> Use for_each_set_bit() rather than opencoding a slightly awkward version of
> it.  Mask the attributes in ecx down based on the visible features.  This
> isn't actually necessary for any components or attributes defined at the time
> of writing (up to AMX), but is added out of an abundance of caution.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> 
> Using min() in for_each_set_bit() leads to awful code generation, as it
> prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
> As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
> ---
>  xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
>  1 file changed, 17 insertions(+), 35 deletions(-)
> 
> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
> index 752bf244ea..c7f8388e5d 100644
> --- a/xen/arch/x86/cpuid.c
> +++ b/xen/arch/x86/cpuid.c
> @@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
>  static void recalculate_xstate(struct cpuid_policy *p)
>  {
>      uint64_t xstates = XSTATE_FP_SSE;
> -    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
> -    unsigned int i, Da1 = p->xstate.Da1;
> +    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
>  
>      /*
>       * The Da1 leaf is the only piece of information preserved in the common
> @@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
>          return;
>  
>      if ( p->basic.avx )
> -    {
>          xstates |= X86_XCR0_YMM;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_YMM_POS] +
> -                          xstate_sizes[X86_XCR0_YMM_POS]);
> -    }
>  
>      if ( p->feat.mpx )
> -    {
>          xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
> -                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
> -    }
>  
>      if ( p->feat.avx512f )
> -    {
>          xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
> -                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
> -    }
>  
>      if ( p->feat.pku )
> -    {
>          xstates |= X86_XCR0_PKRU;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_PKRU_POS] +
> -                          xstate_sizes[X86_XCR0_PKRU_POS]);
> -    }
>  
> -    p->xstate.max_size  =  xstate_size;
> +    /* Subleaf 0 */
> +    p->xstate.max_size =
> +        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
>      p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
>      p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
>  
> +    /* Subleaf 1 */
>      p->xstate.Da1 = Da1;
>      if ( p->xstate.xsaves )
>      {
> +        ecx_bits |= 3; /* Align64, XSS */

Align64 is also needed for p->xstate.xsavec afaict. I'm not really
convinced to tie one to the other either. I would rather think this
is a per-state-component attribute independent of other features.
Those state components could in turn have a dependency (like XSS
ones on XSAVES).

I'm also not happy at all to see you use a literal 3 here. We have
a struct for this, after all.

>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>      }
> -    else
> -        xstates &= ~XSTATE_XSAVES_ONLY;
>  
> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
> +    /* Subleafs 2+ */
> +    xstates &= ~XSTATE_FP_SSE;
> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
> +    for_each_set_bit ( i, &xstates, 63 )
>      {
> -        uint64_t curr_xstate = 1ul << i;
> -
> -        if ( !(xstates & curr_xstate) )
> -            continue;
> -
> -        p->xstate.comp[i].size   = xstate_sizes[i];
> -        p->xstate.comp[i].offset = xstate_offsets[i];
> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
> +        /*
> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
> +         * attributes in ecx limited by visible features in Da1.
> +         */
> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;

To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
honest. Both this and the literal 3 above make it harder to locate
all the places that need changing if a new bit (like xfd) is to be
added. It would be better if grep-ing for an existing field name
(say "xss") would easily turn up all involved places.

Jan



* Re: [PATCH 3/5] x86/xstate: Rework xstate_ctxt_size() as xstate_uncompressed_size()
  2021-05-04 12:22     ` Andrew Cooper
@ 2021-05-04 12:45       ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 12:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04.05.2021 14:22, Andrew Cooper wrote:
> On 04/05/2021 13:20, Jan Beulich wrote:
>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>> @@ -568,16 +568,38 @@ static unsigned int hw_uncompressed_size(uint64_t xcr0)
>>>      return size;
>>>  }
>>>  
>>> -/* Fastpath for common xstate size requests, avoiding reloads of xcr0. */
>>> -unsigned int xstate_ctxt_size(u64 xcr0)
>>> +unsigned int xstate_uncompressed_size(uint64_t xcr0)
>> Since you rewrite the function anyway, and since taking into account
>> the XSS-controlled features here is going to be necessary as well
>> (even if just down the road, but that's what your ultimate goal is
>> from all I can tell), how about renaming the parameter to "xstates"
>> or "states" at the same time?
> 
> I'm working on some cleanup of terminology, which I haven't posted yet.
> 
> For this one, I'm not sure.  For uncompressed size, we genuinely mean
> user states only.

Ah, yes - fair point.

Jan



* Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
  2021-05-03 15:39 ` [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves Andrew Cooper
@ 2021-05-04 12:56   ` Jan Beulich
  2021-05-04 14:17     ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2021-05-04 12:56 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 03.05.2021 17:39, Andrew Cooper wrote:
> If max leaf is greater than 0xd but xsave not available to the guest, then the
> current XSAVE size should not be filled in.  This is a latent bug for now as
> the guest max leaf is 0xd, but will become problematic in the future.
> 
> The comment concerning XSS state is wrong.  VT-x doesn't manage host/guest
> state automatically, but there is provision for "host only" bits to be set, so
> the implications are still accurate.
> 
> Introduce {xstate,hw}_compressed_size() helpers to mirror the uncompressed
> ones.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit with a remark:

> +unsigned int xstate_compressed_size(uint64_t xstates)
> +{
> +    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
> +
> +    xstates &= ~XSTATE_FP_SSE;
> +    for_each_set_bit ( i, &xstates, 63 )
> +    {
> +        if ( test_bit(i, &xstate_align) )
> +            size = ROUNDUP(size, 64);
> +
> +        size += xstate_sizes[i];
> +    }
> +
> +    /* In debug builds, cross-check our calculation with hardware. */
> +    if ( IS_ENABLED(CONFIG_DEBUG) )
> +    {
> +        unsigned int hwsize;
> +
> +        xstates |= XSTATE_FP_SSE;
> +        hwsize = hw_compressed_size(xstates);
> +
> +        if ( size != hwsize )
> +            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
> +                        __func__, xstates, size, hwsize);
> +        size = hwsize;

To be honest, already on the earlier patch I was wondering whether
it does any good to override size here: That'll lead to different
behavior on debug vs release builds. If the log message is not
paid attention to, we'd then end up with longer term breakage.

Jan



* Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
  2021-05-04 12:43   ` Jan Beulich
@ 2021-05-04 13:58     ` Andrew Cooper
  2021-05-05  8:19       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-04 13:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04/05/2021 13:43, Jan Beulich wrote:
> On 03.05.2021 17:39, Andrew Cooper wrote:
>> Make use of the new xstate_uncompressed_size() helper rather than maintaining
>> the running calculation while accumulating feature components.
>>
>> The rest of the CPUID data can come direct from the raw cpuid policy.  All
>> per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
>> instructions, and is constant.
>>
>> Use for_each_set_bit() rather than opencoding a slightly awkward version of
>> it.  Mask the attributes in ecx down based on the visible features.  This
>> isn't actually necessary for any components or attributes defined at the time
>> of writing (up to AMX), but is added out of an abundance of caution.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Wei Liu <wl@xen.org>
>>
>> Using min() in for_each_set_bit() leads to awful code generation, as it
>> prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
>> As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
>> ---
>>  xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
>>  1 file changed, 17 insertions(+), 35 deletions(-)
>>
>> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
>> index 752bf244ea..c7f8388e5d 100644
>> --- a/xen/arch/x86/cpuid.c
>> +++ b/xen/arch/x86/cpuid.c
>> @@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
>>  static void recalculate_xstate(struct cpuid_policy *p)
>>  {
>>      uint64_t xstates = XSTATE_FP_SSE;
>> -    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
>> -    unsigned int i, Da1 = p->xstate.Da1;
>> +    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
>>  
>>      /*
>>       * The Da1 leaf is the only piece of information preserved in the common
>> @@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
>>          return;
>>  
>>      if ( p->basic.avx )
>> -    {
>>          xstates |= X86_XCR0_YMM;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_YMM_POS] +
>> -                          xstate_sizes[X86_XCR0_YMM_POS]);
>> -    }
>>  
>>      if ( p->feat.mpx )
>> -    {
>>          xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
>> -                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
>> -    }
>>  
>>      if ( p->feat.avx512f )
>> -    {
>>          xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
>> -                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
>> -    }
>>  
>>      if ( p->feat.pku )
>> -    {
>>          xstates |= X86_XCR0_PKRU;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_PKRU_POS] +
>> -                          xstate_sizes[X86_XCR0_PKRU_POS]);
>> -    }
>>  
>> -    p->xstate.max_size  =  xstate_size;
>> +    /* Subleaf 0 */
>> +    p->xstate.max_size =
>> +        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
>>      p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
>>      p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
>>  
>> +    /* Subleaf 1 */
>>      p->xstate.Da1 = Da1;
>>      if ( p->xstate.xsaves )
>>      {
>> +        ecx_bits |= 3; /* Align64, XSS */
> Align64 is also needed for p->xstate.xsavec afaict. I'm not really
> convinced to tie one to the other either. I would rather think this
> is a per-state-component attribute independent of other features.
> Those state components could in turn have a dependency (like XSS
> ones on XSAVES).

There is no such thing as a system with xsavec != xsaves (although there
does appear to be one line of AMD CPU with xsaves and not xgetbv1).

Through some (likely unintentional) coupling of data in CPUID, the
compressed dynamic size (CPUID.0xd[1].ebx) is required for xsavec, and
is strictly defined as XCR0|XSS, which forces xsaves into the mix.

In fact, an error with the spec is that userspace can calculate the
kernel's choice of MSR_XSS using CPUID data alone - there is not
currently an ambiguous combination of sizes of supervisor state
components.  This fact also makes XSAVEC suboptimal even for userspace
to use, because it is forced to allocate larger-than-necessary buffers.
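
(For illustration, with the Milan numbers from the cover letter, where
CET_U/CET_S are the only supervisor states in play: XCR0 = 0x207 gives a
compressed size of 0x348, so CPUID.0xd[1].ebx = 0x370 exposes an extra
0x28 = 0x10 + 0x18 bytes, i.e. XSS = CET_U | CET_S.)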

In principle, we could ignore the coupling and support xsavec without
xsaves, but given that XSAVES is strictly more useful than XSAVEC, I'm
not sure it is worth trying to support.

>
> I'm also not happy at all to see you use a literal 3 here. We have
> a struct for this, after all.
>
>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>      }
>> -    else
>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>  
>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>> +    /* Subleafs 2+ */
>> +    xstates &= ~XSTATE_FP_SSE;
>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>> +    for_each_set_bit ( i, &xstates, 63 )
>>      {
>> -        uint64_t curr_xstate = 1ul << i;
>> -
>> -        if ( !(xstates & curr_xstate) )
>> -            continue;
>> -
>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>> +        /*
>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>> +         * attributes in ecx limited by visible features in Da1.
>> +         */
>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
> honest. Both this and the literal 3 above make it harder to locate
> all the places that need changing if a new bit (like xfd) is to be
> added. It would be better if grep-ing for an existing field name
> (say "xss") would easily turn up all involved places.

It's specifically to reduce the number of areas needing editing when a
new state, and therefore the number of opportunities to screw things up.

As said in the commit message, I'm not even convinced that the ecx_bits
mask is necessary, as new attributes only come in with new behaviours of
new state components.

If we choose to skip the ecx masking, then this loop body becomes even
more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].

Even if Intel do break with tradition, and retrofit new attributes into
existing subleafs, leaking them to guests won't cause anything to
explode (the bits are still reserved after all), and we can fix anything
necessary at that point.

~Andrew




* Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
  2021-05-04 12:56   ` Jan Beulich
@ 2021-05-04 14:17     ` Andrew Cooper
  2021-05-05  8:33       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-04 14:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04/05/2021 13:56, Jan Beulich wrote:
> On 03.05.2021 17:39, Andrew Cooper wrote:
>> If max leaf is greater than 0xd but xsave not available to the guest, then the
>> current XSAVE size should not be filled in.  This is a latent bug for now as
>> the guest max leaf is 0xd, but will become problematic in the future.
>>
>> The comment concerning XSS state is wrong.  VT-x doesn't manage host/guest
>> state automatically, but there is provision for "host only" bits to be set, so
>> the implications are still accurate.
>>
>> Introduce {xstate,hw}_compressed_size() helpers to mirror the uncompressed
>> ones.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> albeit with a remark:
>
>> +unsigned int xstate_compressed_size(uint64_t xstates)
>> +{
>> +    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
>> +
>> +    xstates &= ~XSTATE_FP_SSE;
>> +    for_each_set_bit ( i, &xstates, 63 )
>> +    {
>> +        if ( test_bit(i, &xstate_align) )
>> +            size = ROUNDUP(size, 64);
>> +
>> +        size += xstate_sizes[i];
>> +    }
>> +
>> +    /* In debug builds, cross-check our calculation with hardware. */
>> +    if ( IS_ENABLED(CONFIG_DEBUG) )
>> +    {
>> +        unsigned int hwsize;
>> +
>> +        xstates |= XSTATE_FP_SSE;
>> +        hwsize = hw_compressed_size(xstates);
>> +
>> +        if ( size != hwsize )
>> +            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
>> +                        __func__, xstates, size, hwsize);
>> +        size = hwsize;
> To be honest, already on the earlier patch I was wondering whether
> it does any good to override size here: That'll lead to different
> behavior on debug vs release builds. If the log message is not
> paid attention to, we'd then end up with longer term breakage.

Well - our options are pass hardware size, or BUG(), because getting
this wrong will cause memory corruption.

The BUG() option is a total pain when developing new support - the first
version of this patch did use BUG(), but after hitting it 4 times in a
row (caused by issues with constants elsewhere), I decided against it.

If we had something which was a mix between WARN_ONCE() and a proper
printk() explaining what was going on, then I would have used that. 
Maybe it's time to introduce one...
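
Something like this, perhaps (name and details invented here purely as a
sketch, not an existing Xen macro):

    #define printk_warn_once(fmt, args...)              \
        do {                                            \
            static bool once_;                          \
                                                        \
            if ( !once_ )                               \
            {                                           \
                once_ = true;                           \
                printk(XENLOG_WARNING fmt, ## args);    \
                WARN();                                 \
            }                                           \
        } while ( 0 )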

~Andrew




* Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
  2021-05-04 13:58     ` Andrew Cooper
@ 2021-05-05  8:19       ` Jan Beulich
  2021-05-05 14:53         ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2021-05-05  8:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04.05.2021 15:58, Andrew Cooper wrote:
> On 04/05/2021 13:43, Jan Beulich wrote:
>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>> Make use of the new xstate_uncompressed_size() helper rather than maintaining
>>> the running calculation while accumulating feature components.
>>>
>>> The rest of the CPUID data can come direct from the raw cpuid policy.  All
>>> per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
>>> instructions, and is constant.
>>>
>>> Use for_each_set_bit() rather than opencoding a slightly awkward version of
>>> it.  Mask the attributes in ecx down based on the visible features.  This
>>> isn't actually necessary for any components or attributes defined at the time
>>> of writing (up to AMX), but is added out of an abundance of caution.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>> CC: Wei Liu <wl@xen.org>
>>>
>>> Using min() in for_each_set_bit() leads to awful code generation, as it
>>> prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
>>> As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
>>> ---
>>>  xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
>>>  1 file changed, 17 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
>>> index 752bf244ea..c7f8388e5d 100644
>>> --- a/xen/arch/x86/cpuid.c
>>> +++ b/xen/arch/x86/cpuid.c
>>> @@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
>>>  static void recalculate_xstate(struct cpuid_policy *p)
>>>  {
>>>      uint64_t xstates = XSTATE_FP_SSE;
>>> -    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
>>> -    unsigned int i, Da1 = p->xstate.Da1;
>>> +    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
>>>  
>>>      /*
>>>       * The Da1 leaf is the only piece of information preserved in the common
>>> @@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
>>>          return;
>>>  
>>>      if ( p->basic.avx )
>>> -    {
>>>          xstates |= X86_XCR0_YMM;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_YMM_POS] +
>>> -                          xstate_sizes[X86_XCR0_YMM_POS]);
>>> -    }
>>>  
>>>      if ( p->feat.mpx )
>>> -    {
>>>          xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
>>> -                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
>>> -    }
>>>  
>>>      if ( p->feat.avx512f )
>>> -    {
>>>          xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
>>> -                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
>>> -    }
>>>  
>>>      if ( p->feat.pku )
>>> -    {
>>>          xstates |= X86_XCR0_PKRU;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_PKRU_POS] +
>>> -                          xstate_sizes[X86_XCR0_PKRU_POS]);
>>> -    }
>>>  
>>> -    p->xstate.max_size  =  xstate_size;
>>> +    /* Subleaf 0 */
>>> +    p->xstate.max_size =
>>> +        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
>>>      p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
>>>      p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
>>>  
>>> +    /* Subleaf 1 */
>>>      p->xstate.Da1 = Da1;
>>>      if ( p->xstate.xsaves )
>>>      {
>>> +        ecx_bits |= 3; /* Align64, XSS */
>> Align64 is also needed for p->xstate.xsavec afaict. I'm not really
>> convinced to tie one to the other either. I would rather think this
>> is a per-state-component attribute independent of other features.
>> Those state components could in turn have a dependency (like XSS
>> ones on XSAVES).
> 
> There is no such thing as a system with xsavec != xsaves (although there
> does appear to be one line of AMD CPU with xsaves and not xgetbv1).

If we believed there was such a dependency, gen-cpuid.py should imo
already express it. At the latest, once we make ourselves depend on such
a relationship (which I remain not fully convinced of), the dependency would
need adding, such that it becomes impossible to turn off xsaves
without also turning off xsavec. (Of course, a way to express this
symbolically doesn't currently exist, and is only being added as a
"side effect" of "x86: XFD enabling".)

> Through some (likely unintentional) coupling of data in CPUID, the
> compressed dynamic size (CPUID.0xd[1].ebx) is required for xsavec, and
> is strictly defined as XCR0|XSS, which forces xsaves into the mix.
> 
> In fact, an error with the spec is that userspace can calculate the
> kernel's choice of MSR_XSS using CPUID data alone - there is not
> currently an ambiguous combination of sizes of supervisor state
> components.  This fact also makes XSAVEC suboptimal even for userspace
> to use, because it is forced to allocate larger-than-necessary buffers.

But space-wise it's still better that way than using the uncompressed
format.
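
For illustration, the derivation being described is mechanical enough to
sketch from userspace in a few lines (assuming the usual CPUID.0xD subleaf
layout - eax = component size, ecx bit 1 = align64; this is not code from
the series):

  #include <cpuid.h>
  #include <stdint.h>

  /* Compressed XSAVE image size for a given component mask, from CPUID alone. */
  static uint32_t compressed_size(uint64_t xstates)
  {
      uint32_t size = 512 + 64;           /* legacy region + xsave header */
      unsigned int i, a, b, c, d;

      for ( i = 2; i < 63; ++i )
      {
          if ( !(xstates & (1ull << i)) )
              continue;

          __get_cpuid_count(0xd, i, &a, &b, &c, &d);

          if ( c & 2 )                    /* align64 attribute */
              size = (size + 63) & ~63u;

          size += a;                      /* component size in bytes */
      }

      return size;
  }

Matching that against CPUID.0xD[1].ebx is what pins down the MSR_XSS
choice mentioned above.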

> In principle, we could ignore the coupling and support xsavec without
> xsaves, but given that XSAVES is strictly more useful than XSAVEC, I'm
> not sure it is worth trying to support.

I think we should, but I'm not going to object to the alternative as
long as dependencies are properly (put) in place.

>> I'm also not happy at all to see you use a literal 3 here. We have
>> a struct for this, after all.
>>
>>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>>      }
>>> -    else
>>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>>  
>>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>>> +    /* Subleafs 2+ */
>>> +    xstates &= ~XSTATE_FP_SSE;
>>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>      {
>>> -        uint64_t curr_xstate = 1ul << i;
>>> -
>>> -        if ( !(xstates & curr_xstate) )
>>> -            continue;
>>> -
>>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>>> +        /*
>>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>>> +         * attributes in ecx limited by visible features in Da1.
>>> +         */
>>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
>> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
>> honest. Both this and the literal 3 above make it harder to locate
>> all the places that need changing if a new bit (like xfd) is to be
>> added. It would be better if grep-ing for an existing field name
>> (say "xss") would easily turn up all involved places.
> 
> It's specifically to reduce the number of areas needing editing when a
> new state is added, and therefore the number of opportunities to screw things up.
> 
> As said in the commit message, I'm not even convinced that the ecx_bits
> mask is necessary, as new attributes only come in with new behaviours of
> new state components.
> 
> If we choose to skip the ecx masking, then this loop body becomes even
> more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].
> 
> Even if Intel do break with tradition, and retrofit new attributes into
> existing subleafs, leaking them to guests won't cause anything to
> explode (the bits are still reserved after all), and we can fix anything
> necessary at that point.

I don't think this would necessarily go without breakage. What if,
assuming XFD support is in, an existing component got XFD sensitivity
added to it? If, like you were suggesting elsewhere, and like I had
it initially, we used a build-time constant for XFD-affected
components, we'd break consuming guests. The per-component XFD bit
(just to again take as example) also isn't strictly speaking tied to
the general XFD feature flag (but to me it makes sense for us to
enforce respective consistency). Plus, in general, the moment a flag
is no longer reserved in the spec, it is not reserved anywhere
anymore: An aware (newer) guest running on unaware (older) Xen ought
to still function correctly.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
  2021-05-04 14:17     ` Andrew Cooper
@ 2021-05-05  8:33       ` Jan Beulich
  2021-05-05 16:59         ` Andrew Cooper
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2021-05-05  8:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 04.05.2021 16:17, Andrew Cooper wrote:
> On 04/05/2021 13:56, Jan Beulich wrote:
>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>> +unsigned int xstate_compressed_size(uint64_t xstates)
>>> +{
>>> +    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
>>> +
>>> +    xstates &= ~XSTATE_FP_SSE;
>>> +    for_each_set_bit ( i, &xstates, 63 )
>>> +    {
>>> +        if ( test_bit(i, &xstate_align) )
>>> +            size = ROUNDUP(size, 64);
>>> +
>>> +        size += xstate_sizes[i];
>>> +    }
>>> +
>>> +    /* In debug builds, cross-check our calculation with hardware. */
>>> +    if ( IS_ENABLED(CONFIG_DEBUG) )
>>> +    {
>>> +        unsigned int hwsize;
>>> +
>>> +        xstates |= XSTATE_FP_SSE;
>>> +        hwsize = hw_compressed_size(xstates);
>>> +
>>> +        if ( size != hwsize )
>>> +            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
>>> +                        __func__, xstates, size, hwsize);
>>> +        size = hwsize;
>> To be honest, already on the earlier patch I was wondering whether
>> it does any good to override size here: That'll lead to different
>> behavior on debug vs release builds. If the log message is not
>> paid attention to, we'd then end up with longer term breakage.
> 
> Well - our options are pass hardware size, or BUG(), because getting
> this wrong will cause memory corruption.

I'm afraid I'm lost: Neither passing hardware size nor BUG() would
happen in a release build, so getting this wrong does mean memory
corruption there. And I'm of the clear opinion that debug builds
shouldn't differ in behavior in such regards.

If there wasn't an increasing number of possible combinations I
would be inclined to suggest that in all builds we check during
boot that calculation and hardware provided values match for all
possible (valid) combinations.

> The BUG() option is a total pain when developing new support - the first
> version of this patch did use BUG(), but after hitting it 4 times in a
> row (caused by issues with constants elsewhere), I decided against it.

I can fully understand this aspect. I'm not opposed to printk_once()
at all. My comment was purely towards the override.

> If we had something which was a mix between WARN_ONCE() and a proper
> printk() explaining what was going on, then I would have used that. 
> Maybe it's time to introduce one...

I don't think there's a need here - what you have in terms of info
put into the log is imo sufficient.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
  2021-05-05  8:19       ` Jan Beulich
@ 2021-05-05 14:53         ` Andrew Cooper
  2021-05-05 15:14           ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-05 14:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 05/05/2021 09:19, Jan Beulich wrote:
> On 04.05.2021 15:58, Andrew Cooper wrote:
>> On 04/05/2021 13:43, Jan Beulich wrote:
>>> I'm also not happy at all to see you use a literal 3 here. We have
>>> a struct for this, after all.
>>>
>>>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>>>      }
>>>> -    else
>>>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>>>  
>>>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>>>> +    /* Subleafs 2+ */
>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>>      {
>>>> -        uint64_t curr_xstate = 1ul << i;
>>>> -
>>>> -        if ( !(xstates & curr_xstate) )
>>>> -            continue;
>>>> -
>>>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>>>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>>>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>>>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>>>> +        /*
>>>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>>>> +         * attributes in ecx limited by visible features in Da1.
>>>> +         */
>>>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>>>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>>>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
>>> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
>>> honest. Both this and the literal 3 above make it harder to locate
>>> all the places that need changing if a new bit (like xfd) is to be
>>> added. It would be better if grep-ing for an existing field name
>>> (say "xss") would easily turn up all involved places.
>> It's specifically to reduce the number of areas needing editing when a
>> new state is added, and therefore the number of opportunities to screw things up.
>>
>> As said in the commit message, I'm not even convinced that the ecx_bits
>> mask is necessary, as new attributes only come in with new behaviours of
>> new state components.
>>
>> If we choose to skip the ecx masking, then this loop body becomes even
>> more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].
>>
>> Even if Intel do break with tradition, and retrofit new attributes into
>> existing subleafs, leaking them to guests won't cause anything to
>> explode (the bits are still reserved after all), and we can fix anything
>> necessary at that point.
> I don't think this would necessarily go without breakage. What if,
> assuming XFD support is in, an existing component got XFD sensitivity
> added to it?

I think that is exceedingly unlikely to happen.

>  If, like you were suggesting elsewhere, and like I had
> it initially, we used a build-time constant for XFD-affected
> components, we'd break consuming guests. The per-component XFD bit
> (just to again take as example) also isn't strictly speaking tied to
> the general XFD feature flag (but to me it makes sense for us to
> enforce respective consistency). Plus, in general, the moment a flag
> is no longer reserved in the spec, it is not reserved anywhere
> anymore: An aware (newer) guest running on unaware (older) Xen ought
> to still function correctly.

They're still technically reserved, because of the masking of the XFD
bit in the feature leaf.

However, having pondered this for some time, we do need to retain the
attribute masking, because e.g. AMX && !XSAVEC is a legal (if very
unwise) combination to expose, and the align64 bits want to disappear
from the TILE state attributes.

Also, in terms of implementation, the easiest way to do something
plausible here is a dependency chain of XSAVE => XSAVEC (implies align64
masking) => XSAVES (implies xss masking) => CET_*.

XFD would depend on XSAVE, and would imply masking the xfd attribute.

This still leaves the broken corner case of the dynamic compressed size,
but I think I can live with that to avoid the added complexity of trying
to force XSAVEC == XSAVES.
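
To make that concrete, the masking could plausibly be composed along these
lines inside recalculate_xstate() (a sketch only - bit positions per the
current SDM layout, 0 = XSS, 1 = align64, 2 = XFD, and the xfd policy
field is hypothetical at this point):

  uint32_t ecx_mask = 0;

  if ( p->xstate.xsavec )
      ecx_mask |= 1u << 1;                /* align64, visible with XSAVEC */
  if ( p->xstate.xsaves )
      ecx_mask |= 1u << 0;                /* XSS, visible with XSAVES */
  if ( p->xstate.xfd )                    /* hypothetical feature field */
      ecx_mask |= 1u << 2;                /* XFD, visible with XFD */

  for_each_set_bit ( i, &xstates, 63 )
  {
      p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i];
      p->xstate.raw[i].c &= ecx_mask;
  }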

~Andrew



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
  2021-05-05 14:53         ` Andrew Cooper
@ 2021-05-05 15:14           ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2021-05-05 15:14 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 05.05.2021 16:53, Andrew Cooper wrote:
> On 05/05/2021 09:19, Jan Beulich wrote:
>> On 04.05.2021 15:58, Andrew Cooper wrote:
>>> On 04/05/2021 13:43, Jan Beulich wrote:
>>>> I'm also not happy at all to see you use a literal 3 here. We have
>>>> a struct for this, after all.
>>>>
>>>>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>>>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>>>>      }
>>>>> -    else
>>>>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>>>>  
>>>>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>>>>> +    /* Subleafs 2+ */
>>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>>>      {
>>>>> -        uint64_t curr_xstate = 1ul << i;
>>>>> -
>>>>> -        if ( !(xstates & curr_xstate) )
>>>>> -            continue;
>>>>> -
>>>>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>>>>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>>>>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>>>>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>>>>> +        /*
>>>>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>>>>> +         * attributes in ecx limited by visible features in Da1.
>>>>> +         */
>>>>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>>>>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>>>>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
>>>> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
>>>> honest. Both this and the literal 3 above make it harder to locate
>>>> all the places that need changing if a new bit (like xfd) is to be
>>>> added. It would be better if grep-ing for an existing field name
>>>> (say "xss") would easily turn up all involved places.
>>> It's specifically to reduce the number of areas needing editing when a
>>> new state is added, and therefore the number of opportunities to screw things up.
>>>
>>> As said in the commit message, I'm not even convinced that the ecx_bits
>>> mask is necessary, as new attributes only come in with new behaviours of
>>> new state components.
>>>
>>> If we choose to skip the ecx masking, then this loop body becomes even
>>> more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].
>>>
>>> Even if Intel do break with tradition, and retrofit new attributes into
>>> existing subleafs, leaking them to guests won't cause anything to
>>> explode (the bits are still reserved after all), and we can fix anything
>>> necessary at that point.
>> I don't think this would necessarily go without breakage. What if,
>> assuming XFD support is in, an existing component got XFD sensitivity
>> added to it?
> 
> I think that is exceedingly unlikely to happen.
> 
>>  If, like you were suggesting elsewhere, and like I had
>> it initially, we used a build-time constant for XFD-affected
>> components, we'd break consuming guests. The per-component XFD bit
>> (just to again take as example) also isn't strictly speaking tied to
>> the general XFD feature flag (but to me it makes sense for us to
>> enforce respective consistency). Plus, in general, the moment a flag
>> is no longer reserved in the spec, it is not reserved anywhere
>> anymore: An aware (newer) guest running on unaware (older) Xen ought
>> to still function correctly.
> 
> They're still technically reserved, because of the masking of the XFD
> bit in the feature leaf.
> 
> However, having pondered this for some time, we do need to retain the
> attribute masking, because e.g. AMX && !XSAVEC is a legal (if very
> unwise) combination to expose, and the align64 bits want to disappear
> from the TILE state attributes.
> 
> Also, in terms of implementation, the easiest way to do something
> plausible here is a dependency chain of XSAVE => XSAVEC (implies align64
> masking) => XSAVES (implies xss masking) => CET_*.
> 
> XFD would depend on XSAVE, and would imply masking the xfd attribute.

Okay, let's go with this then.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
  2021-05-05  8:33       ` Jan Beulich
@ 2021-05-05 16:59         ` Andrew Cooper
  2021-05-06  6:17           ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2021-05-05 16:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 05/05/2021 09:33, Jan Beulich wrote:
> On 04.05.2021 16:17, Andrew Cooper wrote:
>> On 04/05/2021 13:56, Jan Beulich wrote:
>>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>>> +unsigned int xstate_compressed_size(uint64_t xstates)
>>>> +{
>>>> +    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
>>>> +
>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>> +    {
>>>> +        if ( test_bit(i, &xstate_align) )
>>>> +            size = ROUNDUP(size, 64);
>>>> +
>>>> +        size += xstate_sizes[i];
>>>> +    }
>>>> +
>>>> +    /* In debug builds, cross-check our calculation with hardware. */
>>>> +    if ( IS_ENABLED(CONFIG_DEBUG) )
>>>> +    {
>>>> +        unsigned int hwsize;
>>>> +
>>>> +        xstates |= XSTATE_FP_SSE;
>>>> +        hwsize = hw_compressed_size(xstates);
>>>> +
>>>> +        if ( size != hwsize )
>>>> +            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
>>>> +                        __func__, xstates, size, hwsize);
>>>> +        size = hwsize;
>>> To be honest, already on the earlier patch I was wondering whether
>>> it does any good to override size here: That'll lead to different
>>> behavior on debug vs release builds. If the log message is not
>>> paid attention to, we'd then end up with longer term breakage.
>> Well - our options are pass hardware size, or BUG(), because getting
>> this wrong will cause memory corruption.
> I'm afraid I'm lost: Neither passing hardware size nor BUG() would
> happen in a release build, so getting this wrong does mean memory
> corruption there. And I'm of the clear opinion that debug builds
> shouldn't differ in behavior in such regards.

The point of not cross-checking with hardware in release builds is to
remove a bunch of very expensive operations from a path which is hit
several times on every fork(), so it isn't exactly rare.

But yes - the consequence of being wrong, for whatever reason, is memory
corruption (especially as the obvious way it goes wrong is having an
xsave_size[] of 0, so we end up under-reporting).

So one way or another, we need to ensure that every newly exposed option
is tested in a debug build first.  The integration is either complete,
or it isn't, so I don't think this is terribly onerous to do.


As for actually having a behaviour difference between debug and release,
I'm not concerned.

Hitting this failure means "there is definitely a serious error in Xen,
needing fixing", but it will only be encountered the during development
of a new feature, so is for all practical concerns, limited to
development of the new feature in question.

BUG() gets the point across very obviously, but is unhelpful when in
practice the test system won't explode because the dom0 kernel won't be
using this new state yet.

> If there wasn't an increasing number of possible combinations I
> would be inclined to suggest that in all builds we check during
> boot that calculation and hardware provided values match for all
> possible (valid) combinations.

I was actually considering an XTF test on the matter, which would be a
cleaned-up version of the one generating the example in the cover letter.

As above, logic is only liable to be wrong during development of support
for a new state component, which is why it is reasonable to take away
the overhead in release builds.
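
For reference, the boot-time variant Jan suggests could be shaped roughly
like this (a debug-only sketch using the helpers from this series; it
doesn't filter out component combinations which are invalid to load into
XCR0, which a real version would need to do):

  static void __init check_xstate_sizes(void)
  {
      uint64_t ok = xfeature_mask & ~XSTATE_FP_SSE;
      uint64_t sub = ok;

      for ( ;; )
      {
          uint64_t xstates = sub | XSTATE_FP_SSE;

          if ( xstate_compressed_size(xstates) != hw_compressed_size(xstates) )
              printk(XENLOG_ERR "compressed size mismatch for %#"PRIx64"\n",
                     xstates);

          if ( !sub )
              break;
          sub = (sub - 1) & ok;           /* next subset of enabled components */
      }
  }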

>> The BUG() option is a total pain when developing new support - the first
>> version of this patch did use BUG(), but after hitting it 4 times in a
>> row (caused by issues with constants elsewhere), I decided against it.
> I can fully understand this aspect. I'm not opposed to printk_once()
> at all. My comment was purely towards the override.
>
>> If we had something which was a mix between WARN_ONCE() and a proper
>> printk() explaining what was going on, then I would have used that. 
>> Maybe it's time to introduce one...
> I don't think there's a need here - what you have in terms of info
> put into the log is imo sufficient.

Well - it needs to be sufficiently obvious to people who aren't you and
me.  There are also other areas in Xen which would benefit from changing
their diagnostics to be as described.
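
As a straw man, the sort of helper meant here might look something like
the below (not existing infrastructure; name and shape are purely
illustrative):

  /* One-shot warning which also prints an explanatory message. */
  #define WARN_ONCE_PRINTK(fmt, args...)                  \
      do {                                                \
          static bool warned_;                            \
          if ( !warned_ )                                 \
          {                                               \
              warned_ = true;                             \
              printk(XENLOG_WARNING fmt, ## args);        \
              WARN();                                     \
          }                                               \
      } while ( 0 )

The cross-check above could then become e.g.
WARN_ONCE_PRINTK("xstate: size %#x != hwsize %#x\n", size, hwsize).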

~Andrew



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
  2021-05-05 16:59         ` Andrew Cooper
@ 2021-05-06  6:17           ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2021-05-06  6:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 05.05.2021 18:59, Andrew Cooper wrote:
> On 05/05/2021 09:33, Jan Beulich wrote:
>> On 04.05.2021 16:17, Andrew Cooper wrote:
>>> On 04/05/2021 13:56, Jan Beulich wrote:
>>>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>>>> +unsigned int xstate_compressed_size(uint64_t xstates)
>>>>> +{
>>>>> +    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
>>>>> +
>>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>>> +    {
>>>>> +        if ( test_bit(i, &xstate_align) )
>>>>> +            size = ROUNDUP(size, 64);
>>>>> +
>>>>> +        size += xstate_sizes[i];
>>>>> +    }
>>>>> +
>>>>> +    /* In debug builds, cross-check our calculation with hardware. */
>>>>> +    if ( IS_ENABLED(CONFIG_DEBUG) )
>>>>> +    {
>>>>> +        unsigned int hwsize;
>>>>> +
>>>>> +        xstates |= XSTATE_FP_SSE;
>>>>> +        hwsize = hw_compressed_size(xstates);
>>>>> +
>>>>> +        if ( size != hwsize )
>>>>> +            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
>>>>> +                        __func__, xstates, size, hwsize);
>>>>> +        size = hwsize;
>>>> To be honest, already on the earlier patch I was wondering whether
>>>> it does any good to override size here: That'll lead to different
>>>> behavior on debug vs release builds. If the log message is not
>>>> paid attention to, we'd then end up with longer term breakage.
>>> Well - our options are pass hardware size, or BUG(), because getting
>>> this wrong will cause memory corruption.
>> I'm afraid I'm lost: Neither passing hardware size nor BUG() would
>> happen in a release build, so getting this wrong does mean memory
>> corruption there. And I'm of the clear opinion that debug builds
>> shouldn't differ in behavior in such regards.
> 
> The point of not cross-checking with hardware in release builds is to
> remove a bunch of very expensive operations from a path which is hit
> several times on every fork(), so it isn't exactly rare.
> 
> But yes - the consequence of being wrong, for whatever reason, is memory
> corruption (especially as the obvious way it goes wrong is having an
> xsave_size[] of 0, so we end up under-reporting).
> 
> So one way or another, we need to ensure that every newly exposed option
> is tested in a debug build first.  The integration is either complete,
> or it isn't, so I don't think this is terribly onerous to do.
> 
> 
> As for actually having a behaviour difference between debug and release,
> I'm not concerned.
> 
> Hitting this failure means "there is definitely a serious error in Xen,
> needing fixing", but it will only be encountered the during development
> of a new feature, so is for all practical concerns, limited to
> development of the new feature in question.
> 
> BUG() gets the point across very obviously, but is unhelpful when in
> practice the test system won't explode because the dom0 kernel won't be
> using this new state yet.
> 
>> If there wasn't an increasing number of possible combinations I
>> would be inclined to suggest that in all builds we check during
>> boot that calculation and hardware provided values match for all
>> possible (valid) combinations.
> 
> I was actually considering an XTF test on the matter, which would be a
> cleaned-up version of the one generating the example in the cover letter.
> 
> As above, logic is only liable to be wrong during development of support
> for a new state component, which is why it is reasonable to take away
> the overhead in release builds.

Well, okay then - let's hope all bugs here are obviously exposed
during initial development, and no corner cases get missed.

>>> The BUG() option is a total pain when developing new support - the first
>>> version of this patch did use BUG(), but after hitting it 4 times in a
>>> row (caused by issues with constants elsewhere), I decided against it.
>> I can fully understand this aspect. I'm not opposed to printk_once()
>> at all. My comment was purely towards the override.
>>
>>> If we had something which was a mix between WARN_ONCE() and a proper
>>> printk() explaining what was going on, then I would have used that. 
>>> Maybe it's time to introduce one...
>> I don't think there's a need here - what you have in terms of info
>> put into the log is imo sufficient.
> 
> Well - it needs to be sufficiently obvious to people who aren't you and
> me.  There are also other areas in Xen which would benefit from changing
> their diagnostics to be as described.

I generally disagree: A log message of this kind needs to be detailed
enough to easily find where it originates in source. Then the source
there should have enough information. Things are different when a log
message implies an admin may be able to take some corrective action
without actually changing source code in any way. That's not the case
here afaict.

Jan



^ permalink raw reply	[flat|nested] 24+ messages in thread
